Home Credit Default Risk Project

Table of Contents:¶

  1. Definition
    1.1 Project Overview
    1.2 Problem Statement
    1.3 Metrics
  2. Analysis
    2.1 Data Exploration
    2.2 Exploratory Visualization
  3. Algorithms and Techniques
    3.1 Benchmark: Logistic Regression
    3.2 Random Forest
    3.3 Boosting
  4. Data Preprocessing
    4.1 Find Anomaly
    4.2 Missing Values
    4.3 Replace XNA & XAP
    4.4 Replace Outliers
    4.5 Scaling and Encoding
  5. Models
    5.1 Data Preparation
    5.2 Logistic Regression
    5.3 Random Forest
    5.4 XGBoost
    5.5 Light GBM

1. Definition¶

1.1 Project Overview ¶

We will take an initial look at the Home Credit Default Risk machine learning competition hosted on Kaggle. The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification task:

Supervised: The labels are included in the training data, and the goal is to train a model to learn to predict the labels from the features.
Classification: The label is a binary variable, 0 (will repay the loan on time), 1 (will have difficulty repaying the loan).

1.2 Problem Statement ¶

In this study, we attempt to answer the following question: Can we predict how capable each applicant is of repaying a loan?

The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification problem where the label is a binary variable: 0 (will repay the loan on time) and 1 (will have difficulty repaying the loan). In this study, the model outputs the probability that an applicant will have difficulty repaying, so although the predictions are continuous probabilities, the underlying task is still binary classification.

Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output, in other words Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (Y) for that data.

1.3 Metrics ¶

The metric chosen to measure the performance of the models in this project is the same as the one used in the Kaggle competition: results are evaluated on the area under the ROC curve between the predicted probability and the observed target.

The ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. An ROC curve plots True Positive Rate vs. False Positive Rate at different classification thresholds.

The Area Under the ROC Curve, also known as AUC, measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
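
To make the metric concrete, here is a minimal sketch of computing ROC AUC with scikit-learn on made-up toy labels and scores (illustrative values, not competition data):

In [ ]:
from sklearn.metrics import roc_auc_score

# Toy example: true binary labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.35, 0.80, 0.65, 0.20, 0.90]

# AUC is computed from predicted probabilities, not from hard 0/1 predictions
print(roc_auc_score(y_true, y_score))  # 1.0 here, since every positive is ranked above every negative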

2. Analysis ¶

We are using a typical data science stack: numpy, pandas, sklearn, matplotlib and seaborn.

Let's start exploring the data.

2.1 Data Exploration¶

The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. Predicting whether or not a client will repay a loan or have difficulty is a critical business need, and Home Credit is hosting this competition on Kaggle to see what sort of models the machine learning community can develop to help them in this task.

There are 7 different sources of data:

  • application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid.
  • bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payments: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

The diagram shows how all the data is related:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import gc

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

Import Data

In [3]:
# Training and Test data
app_train = pd.read_csv("./home-credit-default-risk/application_train.csv")
app_test = pd.read_csv('./home-credit-default-risk/application_test.csv')
In [4]:
app_train.head()
Out[4]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

In [5]:
print("Training data shape: ", app_train.shape)
Training data shape:  (307511, 122)

The training data has 307511 observations (each one a separate loan) and 122 features (variables) including the TARGET (the label we want to predict).

In [6]:
app_test.head()
Out[6]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

In [7]:
print('Testing data shape: ', app_test.shape)
Testing data shape:  (48744, 121)

The test data has 48744 observations (each one a separate loan) and 121 features (variables) excluding the TARGET (the label we want to predict).

2.2 Exploratory Visualization ¶

The target is what we are asked to predict:

0: the loan was repaid on time
1: the client had payment difficulties

We can first examine the number of loans falling into each category.

In [8]:
df = pd.read_csv('./home-credit-default-risk/application_train.csv')
In [9]:
def plot_stats(df,feature, title, label_rotation=False,horizontal_layout=True):
    temp = df[feature].value_counts()
    df1 = pd.DataFrame({feature: temp.index,'Number of contracts': temp.values})

    # Calculate the percentage of target=1 per category value
    cat_perc = df[[feature, 'TARGET']].groupby([feature],as_index=False).mean()
    cat_perc.sort_values(by='TARGET', ascending=False, inplace=True)
    
    if(horizontal_layout):
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,6))
    else:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(12,14))
        
    sns.color_palette("flare")
    
    s = sns.barplot(ax=ax1, x = feature, y="Number of contracts",data=df1)
    
    if(label_rotation):
        s.set_xticklabels(s.get_xticklabels(),rotation=90)
    
    s = sns.barplot(ax=ax2, x = feature, y='TARGET', order=cat_perc[feature], data=cat_perc)
    
    if(label_rotation):
        s.set_xticklabels(s.get_xticklabels(),rotation=90)
    plt.ylabel('Percent of target with value 1 [%]', fontsize=10)
    plt.tick_params(axis='both', which='major', labelsize=10)
    plt.suptitle(title)
    plt.show();
In [10]:
count = app_train['TARGET'].value_counts()
count_df = pd.DataFrame({'labels': count.index, 'values':count.values})

plt.figure(figsize=(6,6))
plt.title("Application loans dataset")
sns.barplot(x = 'labels', y="values", data=count_df)
plt.show()

From the above we can see that there is an imbalance in the target variable: there are far more loans repaid on time (TARGET = 0) than loans not repaid (TARGET = 1).
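
To quantify the imbalance (a small additional check, not part of the original notebook output), we can look at the class proportions directly; based on the class counts reported later by LightGBM (24,825 positives out of 307,511 rows), roughly 8% of loans fall into the payment-difficulty class.

In [ ]:
# Share of each TARGET class in the training data
app_train['TARGET'].value_counts(normalize=True)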

Contract Type

In [11]:
plot_stats(df, 'NAME_CONTRACT_TYPE', "Loan Type")

Gender

In [12]:
plot_stats(df, "CODE_GENDER", "Gender", True, True)

Client Accompanied By:

In [13]:
plot_stats(df, 'NAME_TYPE_SUITE', "Accompanied By", True, True, )

Most clients are unaccompanied when applying for the loan. In terms of the percentage of loans not repaid, clients accompanied by Other_B and Other_A are less likely to repay.

Family Status of Client

In [14]:
plot_stats(df, 'NAME_FAMILY_STATUS', "Family Status", True, True, )
In [15]:
plot_stats(df, 'CNT_CHILDREN', "Number of Children")
  • The majority of loans are taken by clients with no children.
  • The percentage of loans not repaid by clients with 9 or more children is 100%.
  • It is over 25% for clients with 6 children and decreases with fewer children.

Income Type

In [16]:
plot_stats(df, 'NAME_INCOME_TYPE', "Income Type", True, True )

Most loan applicants have the income type Working, followed by Commercial associate, Pensioner and State servant.

  • Applicants with the income type Maternity leave have a default rate of almost 40%.
  • The rate of non-repayment by the Unemployed is 37%.
  • The rest of the income types are below the average of 10% for not returning loans.

Occupation

In [17]:
plot_stats(df, "OCCUPATION_TYPE", True, True)
  • Most of the loans are taken by Laborers, followed by Sales staff.
  • IT staff take the lowest number of loans.
  • The highest percentage of not repaid loans is among Low-skill Laborers (above 15%), followed by Drivers, Waiters/barmen staff, Security staff, Laborers and Cooking staff.

Client Housing

In [18]:
plot_stats(df, "NAME_HOUSING_TYPE", "Client Housing", True, True)
  • Over 250,000 applicants for credits registered their housing as House/apartment. The remaining categories (e.g. With parents, Municipal apartment) have a very small number of clients.

  • From these categories, Rented apartment and With parents have more than 10% of loans not repaid.

Client's Education

In [19]:
plot_stats(df, 'NAME_EDUCATION_TYPE', "Clients Education", True, True)

The majority of clients have Secondary / secondary special education, followed by clients with Higher education. Only a very small number have an academic degree.

The Lower secondary category, although rare, has the largest rate of not returning the loan (11%). People with an Academic degree have a non-repayment rate of less than 2%.

Organization Type

In [20]:
plot_stats(df, 'ORGANIZATION_TYPE', 'Organization Type', True, False)

Organizations with highest percent of loans not repaid are Transport: type 3 (16%), Industry: type 13 (13.5%), Industry: type 8 (12.5%) and Restaurant (less than 12%).

Days from birth distribution

In [21]:
def plot_distribution(feature, color):
    plt.figure(figsize=(10,6))
    plt.title("Distribution of %s" % feature)
    sns.distplot(df[feature].dropna(),color=color, kde=True,bins=100)
    plt.show()  
In [22]:
plot_distribution('DAYS_BIRTH', 'red')

The values are negative because DAYS_BIRTH records the number of days before the current application date that the client was born.

In [23]:
app_train['DAYS_BIRTH'] = abs(app_train["DAYS_BIRTH"])

# plot distribution
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor='k', bins=30, color='red')
plt.title('Age of Client')
plt.xlabel('Age(years)')
plt.ylabel('Count')
plt.show()

There are no outliers in the age distribution. Let's see how age affects the target variable.

In [24]:
plt.figure(figsize=(10, 8))
sns.kdeplot(app_train.loc[df['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')
sns.kdeplot(app_train.loc[df['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')

plt.legend()
Out[24]:
<matplotlib.legend.Legend at 0x1e40443f290>

The target == 1 curve is skewed toward the younger ages (roughly 20-35 years), so this variable is likely to be useful in the model.

Registered City Not Live City / Not Work City

In [25]:
plot_stats(app_train,'REG_CITY_NOT_LIVE_CITY', "Not Live in City", False, True)
plot_stats(app_train, 'REG_CITY_NOT_WORK_CITY', "Not Work in City")

Clients who register in a city other than their workplace or residence show a higher likelihood of not repaying loans than those who register in the same city (non-repayment rates of about 11% for work-city mismatches and 12% for living-city mismatches).

In [26]:
numeric_cols = app_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = app_train.select_dtypes('object').columns.tolist()

3. Algorithms and Techniques ¶

3.1 Benchmark: Logistic Regression ¶

In this section, we establish a benchmark result against which the performance of the later models can be compared.

Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. It is not as sophisticated as the ensemble and boosted decision tree methods discussed in the following sections, and hence it provides us with a good benchmark.

Binary logistic regression requires the dependent variable to be binary. For a binary regression, factor level 1 of the dependent variable should represent the desired outcome. The features should also be independent of each other; the model should have little or no multicollinearity.
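
A minimal sketch of how such a benchmark could be evaluated locally before submitting to Kaggle (assuming a preprocessed feature matrix X and label vector y like the ones built later in this notebook; the variable names and the C value are illustrative):

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hold out part of the training data to estimate AUC locally
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# A larger max_iter helps convergence on the wide, scaled feature matrix
logit = LogisticRegression(C=0.1, max_iter=1000)
logit.fit(X_tr, y_tr)

# AUC is computed on predicted probabilities, not hard class labels
val_pred = logit.predict_proba(X_val)[:, 1]
print('Validation ROC AUC:', roc_auc_score(y_val, val_pred))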

3.2 Random Forest ¶

We now consider a more powerful ensemble model that we expect to improve on the logistic regression benchmark.

Random Forest is a popular and versatile machine learning method that is capable of solving both regression and classification problems. Random Forest is a form of ensemble learning, as it relies on an ensemble of decision trees: it aggregates classification (or regression) trees. A decision tree is composed of a series of decisions that can be used to classify an observation in a dataset.

Random Forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Random Forest can handle a large number of features and is helpful for estimating which variables are important in the underlying data being modeled.

Random Forest can be implemented using Python's sklearn library (i.e. sklearn.ensemble.RandomForestClassifier), as sketched after the parameter list below. Sensible defaults are provided by sklearn for parameters such as:

  • n_estimators: number of trees in the forest
  • max_features: number of features to consider when looking for the best split
  • min_samples_leaf: minimum number of samples required to be at a leaf node
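
A minimal sketch of instantiating the classifier with these parameters (the specific values here are illustrative; the model actually trained later in this notebook keeps most of the defaults):

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings for the parameters described above
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features='sqrt',   # features considered when looking for the best split
    min_samples_leaf=5,    # minimum samples required at a leaf node
    n_jobs=-1,             # use all CPU cores
    random_state=50,
)
# rf.fit(train_X, train_Y) would then train it on the preprocessed features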

3.3 Boosting ¶

XGBoost

XGBoost (Extreme Gradient Boosting) is an optimized gradient boosting library that builds an ensemble of decision trees sequentially, with each new tree fitted to the errors of the current ensemble. It adds regularization to the training objective, which helps control overfitting.

Light GBM

Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light GBM splits the tree leaf-wise with the best fit, whereas most other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. When growing on the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which often results in better accuracy.

Leaf-wise splitting enables Light GBM to converge much faster, but it can also lead to overfitting if tree growth is not constrained. Key parameters in Light GBM are listed below, followed by a short configuration sketch:

  • num_iterations: number of boosting iterations to be performed; default = 100
  • num_leaves: number of leaves in one tree; default = 31
  • min_data_in_leaf: minimum number of data points in one leaf
  • max_depth: the maximum depth to which a tree will grow; used to deal with overfitting

LightGBM can be implemented using the latest release on Microsoft's GitHub portal: https://github.com/Microsoft/LightGBM
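
A minimal sketch of how these parameters map onto the scikit-learn wrapper used later in this notebook (the values are the library defaults, shown for illustration rather than as tuned settings; num_iterations and min_data_in_leaf correspond to n_estimators and min_child_samples in the wrapper):

In [ ]:
from lightgbm import LGBMClassifier

# Illustrative configuration of the key parameters described above
lgbm = LGBMClassifier(
    n_estimators=100,       # num_iterations: number of boosting rounds
    num_leaves=31,          # maximum leaves per tree
    min_child_samples=20,   # min_data_in_leaf
    max_depth=-1,           # -1 means no depth limit; set a positive value to curb overfitting
    objective='binary',
)
# lgbm.fit(train_X, train_Y) would then train it on the preprocessed features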

4. Data Preprocessing ¶

4.1 Find Anomaly ¶

In [27]:
(app_train['DAYS_BIRTH'] / 365).describe()
Out[27]:
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64
In [28]:
(app_train['DAYS_EMPLOYED']).describe()
Out[28]:
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64

The maximum value (365243) is positive, which is not correct: days of employment are recorded relative to the current loan application date and should therefore be negative, like the majority of the other values in that column.

In [29]:
app_train['DAYS_EMPLOYED'].plot.box();

We can see that there is an issue in the data that needs to be corrected.
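
As a quick check (a small addition, not part of the original notebook output), we can count how many rows carry this sentinel value:

In [ ]:
# Number and share of applications with the anomalous DAYS_EMPLOYED value
anom = (app_train['DAYS_EMPLOYED'] == 365243)
print(anom.sum(), 'rows,', round(100 * anom.mean(), 2), '% of the training data')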

In [30]:
# app_train['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
# app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram', color="aquamarine");
# plt.xlabel('Days Employment');

4.2 Missing Values ¶

In [31]:
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

4.1 Replace XNA & XAP ¶

In [32]:
def replace_XNA_XAP(table):
    
    cols = table.columns.to_list()
    
    for col in cols:
        # Check if the column exists in the DataFrame
        if col in ['CODE_GENDER', 'ORGANIZATION_TYPE']:
            # Check if the column contains string values before applying .str accessor
            if table[col].dtype == 'O':
                table[col] = table[col].str.strip().replace({'XNA': np.nan, 'XAP': np.nan})
    
    # Replace all values of 'XNA', 'XAP' with np.nan
    # table.replace(to_replace = {'XNA': np.nan, 'XAP': np.nan}, value = np.nan, regex=True, inplace = True)
    
    return table
In [33]:
# https://www.kaggle.com/code/jamesdellinger/home-credit-putting-all-the-steps-together?scriptVersionId=5486249&cellId=11
def preprocess_main(df, flag):
    
    if flag == 1:
        # Separate target data from training dataset.
        y_train = df['TARGET']
        X = df.drop('TARGET', axis = 1)

        # Replace all entries of 365243 in 'DAYS_EMPLOYED' with nan
        X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)

        # Replace all entries of 0 in 'DAYS_LAST_PHONE_CHANGE' with nan
        X['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)

        # Replace all entries of 'XNA' or 'XAP' in main data table with np.nan
        # (Such entries should be confined to the features 'CODE_GENDER' and 'ORGANIZATION_TYPE'.)
        X = replace_XNA_XAP(X)

        # Two rows in training table have a value of 'Unknown' for 
        # 'NAME_FAMILY_STATUS', but no rows in test table do.
        X['NAME_FAMILY_STATUS'].replace('Unknown', np.nan, inplace=True)

        # Five rows in training table have a value of 'Maternity leave' for 
        # 'NAME_INCOME_TYPE', but no rows in test table do.
        X['NAME_INCOME_TYPE'].replace('Maternity leave', np.nan, inplace=True)

        # No rows in training table have -1 for 'REGION_RATING_CLIENT_W_CITY' 
        # but at least one row in test table does.
        X['REGION_RATING_CLIENT_W_CITY'].replace(-1, np.nan, inplace=True)

        return X, y_train
        
    else:
        X = df
        
        # Replace all entries of 365243 in 'DAYS_EMPLOYED' with nan
        X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)

        # Replace all entries of 0 in 'DAYS_LAST_PHONE_CHANGE' with nan
        X['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)

        # Replace all entries of 'XNA' or 'XAP' in main data table with np.nan
        # (Such entries should be confined to the features 'CODE_GENDER' and 'ORGANIZATION_TYPE'.)
        X = replace_XNA_XAP(X)

        # Two rows in training table have a value of 'Unknown' for 
        # 'NAME_FAMILY_STATUS', but no rows in test table do.
        X['NAME_FAMILY_STATUS'].replace('Unknown', np.nan, inplace=True)

        # Five rows in training table have a value of 'Maternity leave' for 
        # 'NAME_INCOME_TYPE', but no rows in test table do.
        X['NAME_INCOME_TYPE'].replace('Maternity leave', np.nan, inplace=True)

        # No rows in training table have -1 for 'REGION_RATING_CLIENT_W_CITY' 
        # but at least one row in test table does.
        X['REGION_RATING_CLIENT_W_CITY'].replace(-1, np.nan, inplace=True)

        return X
In [34]:
X_train, y_train = preprocess_main(app_train, 1)
X_test = preprocess_main(app_test, 0)
In [35]:
X_train.shape, y_train.shape, X_test.shape
Out[35]:
((307511, 121), (307511,), (48744, 121))

4.4 Replace Outliers ¶

In [36]:
def replace_day_outliers(df):
    """Replace 365243 with np.nan in any columns with DAYS"""
    for col in df.columns:
        if "DAYS" in col:
            df[col] = df[col].replace({365243: np.nan})
    return df
In [37]:
app_train = replace_day_outliers(X_train)
app_test = replace_day_outliers(X_test)
table = missing_values_table(app_train)
table = missing_values_table(app_test)
# table[table['% of Total Values'] > 60]
Your selected dataframe has 121 columns.
There are 72 columns that have missing values.
Your selected dataframe has 121 columns.
There are 68 columns that have missing values.

Remove columns with more than 60% missing values

In [38]:
def remove_missing_col(df):
    miss_data = pd.DataFrame((df.isnull().sum())*100/df.shape[0])
    miss_data_col=miss_data[miss_data[0]>60].index
    data_new  = df[[i for i in df.columns if i not in miss_data_col]]
    return data_new
In [39]:
app_train = remove_missing_col(app_train)
app_test = remove_missing_col(app_test)
table = missing_values_table(app_train)
table = missing_values_table(app_test)
Your selected dataframe has 104 columns.
There are 55 columns that have missing values.
Your selected dataframe has 104 columns.
There are 51 columns that have missing values.
In [40]:
# Create imputer function

# (https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn)

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
In [41]:
df_train = pd.get_dummies(app_train)
df_test = pd.get_dummies(app_test)
In [42]:
df_train.shape, df_test.shape
Out[42]:
((307511, 221), (48744, 221))

4.5 Scaling and Encoding ¶

In [43]:
# Drop the SK_ID_CURR from training data
temp = df_train['SK_ID_CURR']
train = df_train.drop(columns=['SK_ID_CURR'])

# Features 
features = df_train.columns.to_list()
features = features[1:]

# Scale each features to 0-1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))

# Impute missing values (most frequent value for object columns, mean for numeric columns)
train = DataFrameImputer().fit_transform(train)

# Fit the scaler and transform the training data
scaler.fit(train)
train = scaler.transform(train)

base_train = pd.DataFrame(data=train, columns=features)

base_train['SK_ID_CURR'] = temp
print('Data shape: ', base_train.shape)
Data shape:  (307511, 221)
In [44]:
# Drop the SK_ID_CURR from test data
temp = df_test['SK_ID_CURR']
test = df_test.drop(columns=['SK_ID_CURR'])

# Features 
features = df_test.columns.to_list()
features = features[1:]

# Scale each features to 0-1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))

# Impute missing values (most frequent value for object columns, mean for numeric columns)
test = DataFrameImputer().fit_transform(test)

# Fit a separate scaler on the test data and transform it
scaler.fit(test)
test = scaler.transform(test)

test = pd.DataFrame(data=test, columns=features)

Since we one-hot encoded the categorical data, we should check whether all the columns in the train and test data match.

In [45]:
list_train = base_train.columns.tolist()
list_test = test.columns.tolist()

# Find values in train that are not in test
not_in_test = set(list_train) - set(list_test)

print("Values in train but not in test:", not_in_test)
Values in train but not in test: {'SK_ID_CURR'}

Align Data

In [46]:
# Align the Training and Testing data, keep only columns present in both dataframes
train, test = base_train.align(test, join = 'inner', axis = 1)

print('Training Features size: ', train.shape)
print('Testing Features size: ', test.shape)
Training Features size:  (307511, 220)
Testing Features size:  (48744, 220)

Check for missing values

In [47]:
missing_values_table(train)
Your selected dataframe has 220 columns.
There are 0 columns that have missing values.
Out[47]:
Missing Values % of Total Values
In [48]:
missing_values_table(test)
Your selected dataframe has 220 columns.
There are 0 columns that have missing values.
Out[48]:
Missing Values % of Total Values

5. Models ¶

In [92]:
#Import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,roc_auc_score,classification_report,roc_curve,auc, f1_score
In [49]:
# Model function

def model_base(algorithm, dtrain_X, dtrain_Y, dtest_X, cols=None):
    
    # Fit the given algorithm on the training features and labels,
    # then return the predicted probability of class 1 for each test row
    algorithm.fit(dtrain_X[cols], dtrain_Y)
    prediction_probabilities = algorithm.predict_proba(dtest_X[cols])[:, 1]
    
    return prediction_probabilities

5.1 Data Preparation ¶

In [50]:
#separating dependent and independent variables

train_X = train[[i for i in train.columns if i not in ['SK_ID_CURR', 'TARGET']]]
train_Y = y_train

test_X  = test[[i for i in test.columns if i not in ['SK_ID_CURR']]]
In [51]:
train_X.shape, train_Y.shape, test_X.shape
Out[51]:
((307511, 220), (307511,), (48744, 220))

5.2 Logistic Regression¶

In [59]:
logit = LogisticRegression()
prediction_probabilities = model_base(logit,train_X,train_Y,test_X,train_X.columns)

The Receiver Operating Characteristic Area Under the Curve (ROC AUC) is a metric well suited to imbalanced datasets, since the model is evaluated on predicted probabilities between 0 and 1 rather than on hard 0 or 1 predictions.

In [60]:
prediction_probabilities
Out[60]:
array([0.05003637, 0.20782622, 0.04599527, ..., 0.05561639, 0.04693359,
       0.12761884])
In [ ]:
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': prediction_probabilities})
df.to_csv('./logreg_baseline.csv', index = False)

Submitting the output to Kaggle resulted in an AUC score of 0.70 for the baseline logistic regression.

5.3 Improved Model: Random Forest¶

In [61]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
In [62]:
prediction_probabilities = model_base(random_forest,train_X,train_Y,test_X,train_X.columns)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   20.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.0min finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.1s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    0.4s finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    0.4s finished
In [63]:
prediction_probabilities
Out[63]:
array([0.1 , 0.17, 0.09, ..., 0.14, 0.1 , 0.24])
In [ ]:
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': prediction_probabilities})
df.to_csv('./rf.csv', index = False)
In [ ]:
# pred = pd.read_csv('./rf.csv')

5.4 XGBoost¶

In [65]:
from xgboost import XGBClassifier

clf = XGBClassifier(learning_rate =0.01,
                    n_estimators=1000,
                    max_depth=4,
                    min_child_weight=4,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    objective= 'binary:logistic',
                    nthread=4,
                    scale_pos_weight=2,
                    seed=27)
clf
Out[65]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=4, missing=nan, monotone_constraints=None,
              n_estimators=1000, n_jobs=None, nthread=4, num_parallel_tree=None,
              predictor=None, ...)
In [66]:
clf.fit(train_X, train_Y)
Out[66]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=4, missing=nan, monotone_constraints=None,
              n_estimators=1000, n_jobs=None, nthread=4, num_parallel_tree=None,
              predictor=None, ...)
In [68]:
XGB_clf_pred = clf.predict_proba(test_X)[:, 1]
In [69]:
XGB_clf_pred
Out[69]:
array([0.07996661, 0.20139933, 0.0383848 , ..., 0.08075171, 0.08277728,
       0.2158363 ], dtype=float32)
In [ ]:
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': XGB_clf_pred})
df.to_csv('./xgb.csv', index = False)

5.5 Light GBM¶

In [93]:
from lightgbm import LGBMClassifier

LGB_clf = LGBMClassifier(n_estimators=100, 
                         boosting_type='gbdt', 
                         objective='binary', 
                         metric='binary_logloss',
                        force_col_wise=True)

LGB_clf.fit(train_X, train_Y)
---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
Cell In[93], line 9
      1 from lightgbm import LGBMClassifier
      3 LGB_clf = LGBMClassifier(n_estimators=100, 
      4                          boosting_type='gbdt', 
      5                          objective='binary', 
      6                          metric='binary_logloss',
      7                         force_col_wise=True)
----> 9 LGB_clf.fit(train_X, train_Y)

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\sklearn.py:1142, in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, feature_name, categorical_feature, callbacks, init_model)
   1139         else:
   1140             valid_sets.append((valid_x, self._le.transform(valid_y)))
-> 1142 super().fit(
   1143     X,
   1144     _y,
   1145     sample_weight=sample_weight,
   1146     init_score=init_score,
   1147     eval_set=valid_sets,
   1148     eval_names=eval_names,
   1149     eval_sample_weight=eval_sample_weight,
   1150     eval_class_weight=eval_class_weight,
   1151     eval_init_score=eval_init_score,
   1152     eval_metric=eval_metric,
   1153     feature_name=feature_name,
   1154     categorical_feature=categorical_feature,
   1155     callbacks=callbacks,
   1156     init_model=init_model
   1157 )
   1158 return self

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\sklearn.py:842, in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, feature_name, categorical_feature, callbacks, init_model)
    839 evals_result: _EvalResultDict = {}
    840 callbacks.append(record_evaluation(evals_result))
--> 842 self._Booster = train(
    843     params=params,
    844     train_set=train_set,
    845     num_boost_round=self.n_estimators,
    846     valid_sets=valid_sets,
    847     valid_names=eval_names,
    848     feval=eval_metrics_callable,  # type: ignore[arg-type]
    849     init_model=init_model,
    850     feature_name=feature_name,
    851     callbacks=callbacks
    852 )
    854 self._evals_result = evals_result
    855 self._best_iteration = self._Booster.best_iteration

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\engine.py:255, in train(params, train_set, num_boost_round, valid_sets, valid_names, feval, init_model, feature_name, categorical_feature, keep_training_booster, callbacks)
    253 # construct booster
    254 try:
--> 255     booster = Booster(params=params, train_set=train_set)
    256     if is_valid_contain_train:
    257         booster.set_train_data_name(train_data_name)

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\basic.py:3200, in Booster.__init__(self, params, train_set, model_file, model_str)
   3193     self.set_network(
   3194         machines=machines,
   3195         local_listen_port=params["local_listen_port"],
   3196         listen_time_out=params.get("time_out", 120),
   3197         num_machines=params["num_machines"]
   3198     )
   3199 # construct booster object
-> 3200 train_set.construct()
   3201 # copy the parameters from train_set
   3202 params.update(train_set.get_params())

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\basic.py:2276, in Dataset.construct(self)
   2269             self._set_init_score_by_predictor(
   2270                 predictor=self._predictor,
   2271                 data=self.data,
   2272                 used_indices=used_indices
   2273             )
   2274 else:
   2275     # create train
-> 2276     self._lazy_init(data=self.data, label=self.label, reference=None,
   2277                     weight=self.weight, group=self.group,
   2278                     init_score=self.init_score, predictor=self._predictor,
   2279                     feature_name=self.feature_name, categorical_feature=self.categorical_feature,
   2280                     params=self.params, position=self.position)
   2281 if self.free_raw_data:
   2282     self.data = None

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\basic.py:1959, in Dataset._lazy_init(self, data, label, reference, weight, group, init_score, predictor, feature_name, categorical_feature, params, position)
   1957     raise TypeError(f'Wrong predictor type {type(predictor).__name__}')
   1958 # set feature names
-> 1959 return self.set_feature_name(feature_name)

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\basic.py:2639, in Dataset.set_feature_name(self, feature_name)
   2637         raise ValueError(f"Length of feature_name({len(feature_name)}) and num_feature({self.num_feature()}) don't match")
   2638     c_feature_name = [_c_str(name) for name in feature_name]
-> 2639     _safe_call(_LIB.LGBM_DatasetSetFeatureNames(
   2640         self._handle,
   2641         _c_array(ctypes.c_char_p, c_feature_name),
   2642         ctypes.c_int(len(feature_name))))
   2643 return self

File ~\anaconda3\envs\ml\Lib\site-packages\lightgbm\basic.py:242, in _safe_call(ret)
    234 """Check the return value from C API call.
    235 
    236 Parameters
   (...)
    239     The return value from C API calls.
    240 """
    241 if ret != 0:
--> 242     raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))

LightGBMError: Do not support special JSON characters in feature name.

Running LGB_clf.fit(train_X, train_Y) gave LightGBMError: Do not support special JSON characters in feature name. I assume this is because of special characters in the column names (the one-hot encoded feature names contain characters such as spaces, commas, colons and slashes). Below, I remove the unwanted characters from the column names and then rerun the fit on the training data.

In [94]:
# Remove unwanted characters from column names
train_X.columns = [col.replace(',', '').replace(']', '').replace('[', '').replace('{', '').replace('}', '')
              .replace('"', '').replace(':', '').replace('/', '').replace(':', '').replace(' ', '').replace('_', '') for col in train_X.columns]

test_X.columns = [col.replace(',', '').replace(']', '').replace('[', '').replace('{', '').replace('}', '')
              .replace('"', '').replace(':', '').replace('/', '').replace(':', '').replace(' ', '').replace('_', '') for col in test_X.columns]
In [95]:
LGB_clf.fit(train_X, train_Y)
[LightGBM] [Info] Number of positive: 24825, number of negative: 282686
[LightGBM] [Info] Total Bins 8872
[LightGBM] [Info] Number of data points in the train set: 307511, number of used features: 215
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080729 -> initscore=-2.432486
[LightGBM] [Info] Start training from score -2.432486
Out[95]:
LGBMClassifier(force_col_wise=True, metric='binary_logloss', objective='binary')
In [96]:
LGB_clf_pred = LGB_clf.predict_proba(test_X)[:, 1]
In [97]:
LGB_clf_pred
Out[97]:
array([0.034586  , 0.10051357, 0.01817225, ..., 0.03762378, 0.03859904,
       0.10736544])
In [ ]:
# Creating Submission Dataframe
submit = pd.read_csv('./home-credit-default-risk/application_test.csv')
df = pd.DataFrame({'SK_ID_CURR': submit['SK_ID_CURR'], 'TARGET': LGB_clf_pred})
df.to_csv('./lgbm.csv', index = False)
In [1]:
## In progress