Home Credit Default Risk Project
Table of Contents¶
- 1. Definition
  - 1.1 Project Overview
  - 1.2 Problem Statement
  - 1.3 Metrics
- 2. Analysis
  - 2.1 Data Exploration
  - 2.2 Exploratory Visualization
- 3. Algorithms and Techniques
  - 3.1 Benchmark: Logistic Regression
  - 3.2 Random Forest
  - 3.3 Boosting
- 4. Data Preprocessing
  - 4.1 Find Anomaly
  - 4.2 Missing Values
  - 4.3 Replace XNA & XAP
  - 4.4 Replace Outliers
  - 4.5 Scaling and Encoding
- 5. Models
  - 5.1 Data Preparation
  - 5.2 Logistic Regression
  - 5.3 Random Forest
  - 5.4 XGBoost
  - 5.5 LightGBM
1. Definition¶
1.1 Project Overview ¶
We will take an initial look at the Home Credit Default Risk machine learning competition hosted on Kaggle. The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification task:
Supervised: The labels are included in the training data and the goal is to train a model to learn to predict the labels from the features.
Classification: The label is a binary variable: 0 (will repay the loan on time), 1 (will have difficulty repaying the loan).
1.2 Problem Statement ¶
In this study, we attempt to answer the following problem statement: can we predict how capable each applicant is of repaying a loan?
The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification problem where the label is a binary variable: 0 (will repay the loan on time), 1 (will have difficulty repaying the loan). Although the label itself is binary, the quantity we submit for each applicant is the predicted probability of repayment difficulty, so our models must output a continuous score between 0 and 1 rather than a hard class prediction.
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output, in other words Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.
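In code terms, this mapping is exactly what fit and predict implement. A minimal scikit-learn sketch with placeholder data (not the competition data):
from sklearn.linear_model import LogisticRegression
# Placeholder inputs: X is a matrix of input variables (x), y the known output variable (Y)
X = [[0.2, 1.5], [1.1, 0.3], [0.9, 2.2], [1.8, 0.1]]
y = [0, 1, 0, 1]
model = LogisticRegression()
model.fit(X, y)                      # learn the mapping f: X -> Y
print(model.predict([[1.0, 1.0]]))   # predict Y for new input data (x)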
1.3 Metrics ¶
The metric chosen to measure the performance of the models in this project is the same as the one used in the Kaggle competition: results are evaluated on the area under the ROC curve between the predicted probability and the observed target.
The ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. An ROC curve plots True Positive Rate vs. False Positive Rate at different classification thresholds.
The Area Under the ROC Curve, also known as AUC, measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
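As a small sketch (toy arrays, purely for illustration), both the ROC curve and the AUC can be computed with scikit-learn:
from sklearn.metrics import roc_curve, roc_auc_score
# Toy labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # points of the ROC curve
print(roc_auc_score(y_true, y_prob))               # area under that curve, between 0 and 1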
2. Analysis ¶
We are using a typical data science stack: numpy, pandas, sklearn, matplotlib
Let's get into exploring the data.
2.1 Data Exploration¶
The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. Predicting whether or not a client will repay a loan or have difficulty is a critical business need, and Home Credit hosted this competition on Kaggle to see what sort of models the machine learning community can develop to help them in this task.
There are 7 different sources of data:
- application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid.
- bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
- bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
- previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
- POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
- credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
- installments_payments: payment history for previous loans at Home Credit. There is one row for every payment that was made and one row for every missed payment.
The diagram provided with the competition data shows how all of these tables are related.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import gc
# suppress warnings
import warnings
warnings.filterwarnings('ignore')
Import Data
# Training and Test data
app_train = pd.read_csv("./home-credit-default-risk/application_train.csv")
app_test = pd.read_csv('./home-credit-default-risk/application_test.csv')
app_train.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
print("Training data shape: ", app_train.shape)
Training data shape: (307511, 122)
The training data has 307511 observations (each one a separate loan) and 122 features (variables) including the TARGET (the label we want to predict).
app_test.head()
SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
print('Testing data shape: ', app_test.shape)
Testing data shape: (48744, 121)
The test data has 48744 observations (each one a separate loan) and 121 features (variables) excluding the TARGET (the label we want to predict).
2.2 Exploratory Visualization ¶
The target is what we are asked to predict:
- 0: the loan was repaid on time
- 1: the client had payment difficulties
We can first examine the number of loans falling into each category.
df = pd.read_csv('./home-credit-default-risk/application_train.csv')
def plot_stats(df, feature, title, label_rotation=False, horizontal_layout=True):
    temp = df[feature].value_counts()
    df1 = pd.DataFrame({feature: temp.index, 'Number of contracts': temp.values})
    # Calculate the percentage of target=1 per category value
    cat_perc = df[[feature, 'TARGET']].groupby([feature], as_index=False).mean()
    cat_perc.sort_values(by='TARGET', ascending=False, inplace=True)
    if horizontal_layout:
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6))
    else:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(12, 14))
    sns.color_palette("flare")
    s = sns.barplot(ax=ax1, x=feature, y="Number of contracts", data=df1)
    if label_rotation:
        s.set_xticklabels(s.get_xticklabels(), rotation=90)
    s = sns.barplot(ax=ax2, x=feature, y='TARGET', order=cat_perc[feature], data=cat_perc)
    if label_rotation:
        s.set_xticklabels(s.get_xticklabels(), rotation=90)
    plt.ylabel('Percent of target with value 1 [%]', fontsize=10)
    plt.tick_params(axis='both', which='major', labelsize=10)
    plt.suptitle(title)
    plt.show()
count = app_train['TARGET'].value_counts()
count_df = pd.DataFrame({'labels': count.index, 'values':count.values})
plt.figure(figsize=(6,6))
plt.title("Application loans dataset")
sns.barplot(x = 'labels', y="values", data=count_df)
plt.show()
From the above we can see that there is an imbalance in the target values: far more loans were repaid on time than not repaid.
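A quick way to quantify this imbalance on the already-loaded training data:
# Share of each TARGET class (roughly 92% repaid vs. 8% with difficulties)
print(app_train['TARGET'].value_counts(normalize=True))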
Contract Type
plot_stats(df, 'NAME_CONTRACT_TYPE', "Loan Type")
Gender
plot_stats(df, "CODE_GENDER", "Gender", True, True)
Client Accompanied By:
plot_stats(df, 'NAME_TYPE_SUITE', "Accompanied By", True, True, )
Most clients are unaccompanied while applying for the loan. In terms of the percentage of non-repayment, clients accompanied by Other_B and Other_A are less likely to repay.
Family Status of Client
plot_stats(df, 'NAME_FAMILY_STATUS', "Family Status", True, True, )
plot_stats(df, 'CNT_CHILDREN', "Number of Children")
- The majority of loans are taken by clients with no children.
- The percentage of unpaid loans for clients with 9 or more children is 100%.
- It is over 25% for clients with 6 children and decreases with fewer children.
Income Type
plot_stats(df, 'NAME_INCOME_TYPE', "Income Type", True, True )
Most loan applicants have the income type Working, followed by Commercial associate, Pensioner and State servant.
- Applicants with the income type Maternity leave have a non-repayment rate of almost 40%.
- Non-repayment among the Unemployed is 37%.
- The rest of the income types are below the average of 10% non-repayment.
Occupation
plot_stats(df, "OCCUPATION_TYPE", "Occupation Type", True, True)
- Most of the loans are taken by Laborers, followed by Sales staff.
- IT staff take the lowest number of loans.
- The highest percentage of not-repaid loans is for Low-skill Laborers (above 15%), followed by Drivers, Waiters/barmen staff, Security staff, Laborers and Cooking staff.
Client Housing
plot_stats(df, "NAME_HOUSING_TYPE", "Client Housing", True, True)
Over 250,000 applicants registered their housing as House/apartment; the following categories have a much smaller number of clients (With parents, Municipal apartment).
Among these categories, Rented apartment and With parents have more than 10% non-repayment.
Client's Education
plot_stats(df, 'NAME_EDUCATION_TYPE', "Clients Education", True, True)
The majority of the clients have Secondary / secondary special education, followed by clients with Higher education. Only a very small number have an academic degree.
The Lower secondary category, although rare, has the highest rate of not returning the loan (11%). People with an Academic degree have a non-repayment rate of less than 2%.
Organization Type
plot_stats(df, 'ORGANIZATION_TYPE', 'Organization Type', True, False)
Organizations with highest percent of loans not repaid are Transport: type 3 (16%), Industry: type 13 (13.5%), Industry: type 8 (12.5%) and Restaurant (less than 12%).
Days from birth distribution
def plot_distribution(feature, color):
    plt.figure(figsize=(10, 6))
    plt.title("Distribution of %s" % feature)
    sns.distplot(df[feature].dropna(), color=color, kde=True, bins=100)
    plt.show()
plot_distribution('DAYS_BIRTH', 'red')
The values are negative because the number of days since birth is recorded relative to the current application date (days in the past are negative).
app_train['DAYS_BIRTH'] = abs(app_train["DAYS_BIRTH"])
# plot distribution
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor='k', bins=30, color='red')
plt.title('Age of Client')
plt.xlabel('Age(years)')
plt.ylabel('Count')
plt.show()
There are no outliers in the age distribution. Let's see how age affects the target variable.
plt.figure(figsize=(10, 8))
sns.kdeplot(app_train.loc[df['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')
sns.kdeplot(app_train.loc[df['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')
plt.legend()
The target == 1 distribution is skewed towards the younger ages (roughly 20-35 years), so this variable is likely to be useful in the model.
Registered city not living city and not working city
plot_stats(app_train,'REG_CITY_NOT_LIVE_CITY', "Not Live in City", False, True)
plot_stats(app_train, 'REG_CITY_NOT_WORK_CITY', "Not Work in City")
Those who register in a city other than their workplace or residence exhibit a higher likelihood of not repaying loans compared to those who register in the same city (with repayment rates of 11% for work-related registrations and 12% for residence-related registrations).
numeric_cols = app_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = app_train.select_dtypes('object').columns.tolist()
3. Algorithms and Techniques ¶
3.1 Benchmark: Logistic Regression ¶
In this section, we define a benchmark result that serves as a threshold for comparing the performance obtained by our solutions.
Logistic Regression is a machine learning classification algorithm that is used to predict the probability of a categorical dependent variable. It is not as sophisticated as the ensemble and boosted-tree methods discussed in the following sections, and hence it provides us with a good benchmark.
Binary logistic regression requires the dependent variable to be binary, and factor level 1 of the dependent variable should represent the desired outcome. The features should also be independent of each other; the model should have little or no multicollinearity.
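As a sketch of how this benchmark could be validated locally before submitting to Kaggle (reusing the train_X and train_Y prepared in Section 5.1), the cross-validated ROC AUC can be estimated with scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Illustrative only: 3-fold cross-validated ROC AUC for a plain logistic regression
logreg = LogisticRegression(max_iter=1000)
cv_auc = cross_val_score(logreg, train_X, train_Y, cv=3, scoring='roc_auc')
print(cv_auc.mean())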
3.2 Random Forest ¶
Random Forest is the first model we use to try to improve on the logistic regression benchmark defined above.
Random Forest is a popular and versatile machine learning method capable of solving both regression and classification problems. It is a brand of ensemble learning, as it relies on an ensemble of decision trees: it aggregates classification (or regression) trees, where a decision tree is a series of decisions that can be used to classify an observation in a dataset.
Random Forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. It can handle a large number of features, and is helpful for estimating which of the variables are important in the underlying data being modeled.
Random Forest can be implemented using Python's sklearn library (i.e. sklearn.ensemble.RandomForestClassifier). A number of parameters are provided with defaults by sklearn, such as the following (see the sketch after this list):
- n_estimators: the number of trees in the forest
- max_features: the number of features to consider when looking for the best split
- min_samples_leaf: the minimum number of samples required to be at a leaf node
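A sketch of how these parameters map onto the sklearn API (the values below are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
# Illustrative parameter choices only
rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features='sqrt',  # number of features considered when looking for the best split
    min_samples_leaf=5,   # minimum number of samples required to be at a leaf node
    n_jobs=-1,
    random_state=50)
# rf.fit(train_X, train_Y) would then train the forest on the prepared data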
3.3 Boosting ¶
XGBoost
XGBoost (eXtreme Gradient Boosting) is an optimized gradient boosting library that builds an ensemble of decision trees sequentially, with each new tree correcting the errors of the previous ones; we use its XGBClassifier for this binary classification task.
Light GBM
Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Light GBM splits the tree leaf-wise, choosing the leaf with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. When growing the same number of leaves, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which often results in better accuracy.
Leaf-wise splitting enables Light GBM to converge much faster, but it can also lead to overfitting. Key parameters in Light GBM are:
- num_iterations: number of boosting iterations to be performed ; default=100; type=int
- num_leaves : number of leaves in one tree ; default = 31 ; type =int
- min_data_in_leaf : Min number of data in one leaf.
- max_depth: Specify the max depth to which tree will grow. This parameter is used to deal with overfitting.
LightGBM can be implemented using the latest release on Microsoft's GitHub portal: https://github.com/Microsoft/LightGBM
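A minimal sketch of how these parameters appear in the scikit-learn-style LGBMClassifier interface (the values shown are the library defaults plus the binary objective, for illustration):
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(
    n_estimators=100,      # num_iterations: number of boosting iterations
    num_leaves=31,         # number of leaves in one tree
    min_child_samples=20,  # min_data_in_leaf: minimum number of data points in one leaf
    max_depth=-1,          # -1 means no depth limit; lower values help control overfitting
    objective='binary')
# lgbm.fit(train_X, train_Y) would then train the model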
4. Data Preprocessing ¶
4.1 Find Anomaly ¶
(app_train['DAYS_BIRTH'] / 365).describe()
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64
(app_train['DAYS_EMPLOYED']).describe()
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
The maximum value (365243) is positive, which is not correct: days of employment are recorded relative to the current application date and should therefore be negative, like the majority of the other values in this column.
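A quick sketch to confirm how many rows carry this sentinel value (run on the already-loaded app_train):
# Count rows where DAYS_EMPLOYED holds the anomalous positive sentinel value
anom = app_train['DAYS_EMPLOYED'] == 365243
print(anom.sum(), 'of', len(app_train), 'rows have DAYS_EMPLOYED == 365243')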
app_train['DAYS_EMPLOYED'].plot.box();
We can see that there is an issue in the data that needs to be corrected.
# app_train['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
# app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram', color="aquamarine");
# plt.xlabel('Days Employment');
4.2 Missing Values ¶
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
4.3 Replace XNA & XAP ¶
def replace_XNA_XAP(table):
    cols = table.columns.to_list()
    for col in cols:
        # 'XNA'/'XAP' entries are confined to these two columns
        if col in ['CODE_GENDER', 'ORGANIZATION_TYPE']:
            # Check if the column contains string values before applying the .str accessor
            if table[col].dtype == 'O':
                # Replace all values of 'XNA', 'XAP' with np.nan
                table[col] = table[col].str.strip().replace({'XNA': np.nan, 'XAP': np.nan})
    # table.replace(to_replace={'XNA': np.nan, 'XAP': np.nan}, value=np.nan, regex=True, inplace=True)
    return table
# https://www.kaggle.com/code/jamesdellinger/home-credit-putting-all-the-steps-together?scriptVersionId=5486249&cellId=11
def preprocess_main(df, flag):
    # For the training data (flag == 1), separate the target from the features;
    # the test data has no TARGET column.
    if flag == 1:
        y_train = df['TARGET']
        X = df.drop('TARGET', axis=1)
    else:
        X = df
    # Replace all entries of 365243 in 'DAYS_EMPLOYED' with nan
    X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
    # Replace all entries of 0 in 'DAYS_LAST_PHONE_CHANGE' with nan
    X['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)
    # Replace all entries of 'XNA' or 'XAP' in the main data table with np.nan
    # (such entries should be confined to 'CODE_GENDER' and 'ORGANIZATION_TYPE')
    X = replace_XNA_XAP(X)
    # Two rows in the training table have a value of 'Unknown' for
    # 'NAME_FAMILY_STATUS', but no rows in the test table do.
    X['NAME_FAMILY_STATUS'].replace('Unknown', np.nan, inplace=True)
    # Five rows in the training table have a value of 'Maternity leave' for
    # 'NAME_INCOME_TYPE', but no rows in the test table do.
    X['NAME_INCOME_TYPE'].replace('Maternity leave', np.nan, inplace=True)
    # No rows in the training table have -1 for 'REGION_RATING_CLIENT_W_CITY',
    # but at least one row in the test table does.
    X['REGION_RATING_CLIENT_W_CITY'].replace(-1, np.nan, inplace=True)
    if flag == 1:
        return X, y_train
    return X
X_train, y_train = preprocess_main(app_train, 1)
X_test = preprocess_main(app_test, 0)
X_train.shape, y_train.shape, X_test.shape
((307511, 121), (307511,), (48744, 121))
4.4 Replace Outliers ¶
def replace_day_outliers(df):
    """Replace 365243 with np.nan in any columns with DAYS"""
    for col in df.columns:
        if "DAYS" in col:
            df[col] = df[col].replace({365243: np.nan})
    return df
app_train = replace_day_outliers(X_train)
app_test = replace_day_outliers(X_test)
table = missing_values_table(app_train)
table = missing_values_table(app_test)
# table[table['% of Total Values'] > 60]
Your selected dataframe has 121 columns. There are 72 columns that have missing values.
Your selected dataframe has 121 columns. There are 68 columns that have missing values.
Remove columns with more than 60% missing values
def remove_missing_col(df):
    miss_data = pd.DataFrame((df.isnull().sum()) * 100 / df.shape[0])
    miss_data_col = miss_data[miss_data[0] > 60].index
    data_new = df[[i for i in df.columns if i not in miss_data_col]]
    return data_new
app_train = remove_missing_col(app_train)
app_test = remove_missing_col(app_test)
table = missing_values_table(app_train)
table = missing_values_table(app_test)
Your selected dataframe has 104 columns. There are 55 columns that have missing values.
Your selected dataframe has 104 columns. There are 51 columns that have missing values.
# Create imputer function
# (https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn)
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in column.
    Columns of other types are imputed with mean of column.
    """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
                               if X[c].dtype == np.dtype('O') else X[c].mean()
                               for c in X],
                              index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)
df_train = pd.get_dummies(app_train)
df_test = pd.get_dummies(app_test)
df_train.shape, df_test.shape
((307511, 221), (48744, 221))
4.5 Scaling and Encoding ¶
# Drop the SK_ID_CURR from training data
temp = df_train['SK_ID_CURR']
train = df_train.drop(columns=['SK_ID_CURR'])
# Features
features = df_train.columns.to_list()
features = features[1:]
# Scale each features to 0-1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
# Impute missing values (numeric columns with the mean, object columns with the most frequent value)
train = DataFrameImputer().fit_transform(train)
## Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
base_train = pd.DataFrame(data=train, columns=features)
base_train['SK_ID_CURR'] = temp
print('Data shape: ', base_train.shape)
Data shape: (307511, 221)
# Drop the SK_ID_CURR from test data
temp = df_test['SK_ID_CURR']
test = df_test.drop(columns=['SK_ID_CURR'])
# Features
features = df_test.columns.to_list()
features = features[1:]
# Scale each features to 0-1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
# Impute missing values (numeric columns with the mean, object columns with the most frequent value)
test = DataFrameImputer().fit_transform(test)
## Repeat with the scaler
scaler.fit(test)
test = scaler.transform(test)
test = pd.DataFrame(data=test, columns=features)
Since we encoded the categorical data, we should check that all the columns in the train and test data match.
list_train = base_train.columns.tolist()
list_test = test.columns.tolist()
# Find values in train that are not in test
not_in_test = set(list_train) - set(list_test)
print("Values in train but not in test:", not_in_test)
Values in train but not in test: {'SK_ID_CURR'}
Align Data
# Align the Training and Testing data, keep only columns present in both dataframes
train, test = base_train.align(test, join = 'inner', axis = 1)
print('Training Features size: ', train.shape)
print('Testing Features size: ', test.shape)
Training Features size: (307511, 220)
Testing Features size: (48744, 220)
Check for missing values
missing_values_table(train)
Your selected dataframe has 220 columns. There are 0 columns that have missing values.
Missing Values | % of Total Values |
---|
missing_values_table(test)
Your selected dataframe has 220 columns. There are 0 columns that have missing values.
Missing Values | % of Total Values |
---|
5. Models ¶
#Import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,roc_auc_score,classification_report,roc_curve,auc, f1_score
# Model function: fit the given algorithm on the training data and return the
# predicted probability of the positive class (TARGET == 1) for the test data
def model_base(algorithm, dtrain_X, dtrain_Y, dtest_X, cols=None):
    algorithm.fit(dtrain_X[cols], dtrain_Y)
    prediction_probabilities = algorithm.predict_proba(dtest_X[cols])[:, 1]
    return prediction_probabilities
5.1 Data Preparation ¶
#separating dependent and independent variables
train_X = train[[i for i in train.columns if i not in ['SK_ID_CURR', 'TARGET']]]
train_Y = y_train
test_X = test[[i for i in test.columns if i not in ['SK_ID_CURR']]]
train_X.shape, train_Y.shape, test_X.shape
((307511, 220), (307511,), (48744, 220))
5.2 Logistic Regression¶
logit = LogisticRegression()
prediction_probabilities = model_base(logit,train_X,train_Y,test_X,train_X.columns)
The area under the Receiver Operating Characteristic curve (ROC AUC) is a metric that is well suited to imbalanced datasets, since it is computed from the predicted probabilities between 0 and 1 rather than from hard 0/1 predictions.
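Since the Kaggle test set carries no labels, a local estimate of ROC AUC needs a hold-out split of the training data. A hedged sketch (reusing train_X and train_Y from above):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# Hold out 20% of the training data purely to estimate ROC AUC offline
X_tr, X_val, y_tr, y_val = train_test_split(
    train_X, train_Y, test_size=0.2, random_state=42, stratify=train_Y)
logit_val = LogisticRegression()
logit_val.fit(X_tr, y_tr)
val_prob = logit_val.predict_proba(X_val)[:, 1]
print('Validation ROC AUC:', roc_auc_score(y_val, val_prob))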
prediction_probabilities
array([0.05003637, 0.20782622, 0.04599527, ..., 0.05561639, 0.04693359, 0.12761884])
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': prediction_probabilities})
df.to_csv('./logreg_baseline.csv', index = False)
Submitting the output to Kaggle resulted in an AUC score of 0.70 for the base logistic regression.
5.3 Improved Model: Random Forest¶
from sklearn.ensemble import RandomForestClassifier
# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
prediction_probabilities = model_base(random_forest,train_X,train_Y,test_X,train_X.columns)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 20.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 1.0min finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done 26 tasks | elapsed: 0.1s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed: 0.4s finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done 26 tasks | elapsed: 0.0s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed: 0.4s finished
prediction_probabilities
array([0.1 , 0.17, 0.09, ..., 0.14, 0.1 , 0.24])
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': prediction_probabilities})
df.to_csv('./rf.csv', index = False)
# pred = pd.read_csv('./rf.csv')
5.4 XGBoost¶
from xgboost import XGBClassifier
clf = XGBClassifier(learning_rate =0.01,
n_estimators=1000,
max_depth=4,
min_child_weight=4,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=2,
seed=27)
clf
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.8, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.01, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=4, max_leaves=None, min_child_weight=4, missing=nan, monotone_constraints=None, n_estimators=1000, n_jobs=None, nthread=4, num_parallel_tree=None, predictor=None, ...)
clf.fit(train_X, train_Y)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.8, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.01, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=4, max_leaves=None, min_child_weight=4, missing=nan, monotone_constraints=None, n_estimators=1000, n_jobs=None, nthread=4, num_parallel_tree=None, predictor=None, ...)
XGB_clf_pred = clf.predict_proba(test_X)[:, 1]
XGB_clf_pred
array([0.07996661, 0.20139933, 0.0383848 , ..., 0.08075171, 0.08277728, 0.2158363 ], dtype=float32)
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': XGB_clf_pred})
df.to_csv('./xgb.csv', index = False)
5.5 Light GBM¶
from lightgbm import LGBMClassifier
LGB_clf = LGBMClassifier(n_estimators=100,
boosting_type='gbdt',
objective='binary',
metric='binary_logloss',
force_col_wise=True)
LGB_clf.fit(train_X, train_Y)
---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
Cell In[93], line 9
----> 9 LGB_clf.fit(train_X, train_Y)
...
LightGBMError: Do not support special JSON characters in feature name.
Running LGB_clf.fit(train_X, train_Y) gave LightGBMError: Do not support special JSON characters in feature name. I assume this is because of special characters in the column names, so below I remove the unwanted characters from the column names and then rerun the fit on the training data.
# Remove unwanted characters from column names
train_X.columns = [col.replace(',', '').replace(']', '').replace('[', '').replace('{', '').replace('}', '')
.replace('"', '').replace(':', '').replace('/', '').replace(':', '').replace(' ', '').replace('_', '') for col in train_X.columns]
test_X.columns = [col.replace(',', '').replace(']', '').replace('[', '').replace('{', '').replace('}', '')
.replace('"', '').replace(':', '').replace('/', '').replace(':', '').replace(' ', '').replace('_', '') for col in test_X.columns]
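An equivalent, more compact sketch using a regular expression (removing every character that is not a letter or digit, which approximates the chained replaces above):
import re
# Keep only alphanumeric characters in the column names
train_X.columns = [re.sub(r'[^A-Za-z0-9]', '', col) for col in train_X.columns]
test_X.columns = [re.sub(r'[^A-Za-z0-9]', '', col) for col in test_X.columns]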
LGB_clf.fit(train_X, train_Y)
[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Total Bins 8872
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 215
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486
LGBMClassifier(force_col_wise=True, metric='binary_logloss', objective='binary')
LGB_clf_pred = LGB_clf.predict_proba(test_X)[:, 1]
LGB_clf_pred
array([0.034586 , 0.10051357, 0.01817225, ..., 0.03762378, 0.03859904, 0.10736544])
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': LGB_clf_pred})
df.to_csv('./lgbm.csv', index = False)
## In progress