How we reduced the time to develop scoring models five times by switching to Python

    image

    Now everyone is talking a lot about artificial intelligence and its application in all areas of the company. However, there are some areas where one type of model has dominated since ancient times, the so-called “white box” - logistic regression. One such area is bank credit scoring.

    There are several reasons for this:

    • Regression coefficients can be easily explained in contrast to “black boxes” like boosting, which can include more than 500 variables.
    • Machine learning is still not credible to management because of the difficulty in interpreting models.
    • There are unwritten requirements of the regulator for the interpretability of models: at any time, for example, the Central Bank may ask for an explanation - why the loan was denied to the borrower
    • Companies use external data mining programs (for example, rapid miner, SAS Enterprise Miner, STATISTICA or any other package), which allow you to quickly learn how to build models without even programming skills

    These reasons make it almost impossible to use complex machine learning models in some areas, so it is important to be able to “squeeze the most” out of a simple logistic regression, which is easy to explain and interpret.

    In this post, we will describe how, when building scoring, we abandoned external data mining packages in favor of an open source solution in the form of Python, increased development speed several times, and also improved the quality of all models.

    Scoring process


    The classic process of building scoring models on a regression looks like this:

    image

    It can vary from company to company, but the main stages remain constant. We always need to produce binning variables (as opposed to the machine learning paradigm, where in most cases only categorical coding is needed), eliminating them by the Information Value (IV), and manually unloading all coefficients and bins for further integration into DSL.
    Such an approach to building scoring cards worked fine in the 90s, but the technology of classical data mining packages is very outdated and does not allow the use of new techniques, such as, for example, L2 regularization in regression, which can significantly improve the quality of models.

    At one point, as a study, we decided to reproduce all the steps that analysts do when building scoring, supplement them with knowledge of Data Scientists, and also automate the entire process as much as possible.

    Python improvements


    As a development tool, we chose Python for its simplicity and good libraries, and began to reproduce all the steps in order.

    The first step is to collect data and generate variables - this stage is a significant part of the work of analysts.

    In Python, you can load the collected data from the database using pymysql.

    Code to download from the database
    defcon():
        conn = pymysql.connect(
            host='10.100.10.100',
            port=3306,
            user='******* ',
            password='*****',
            db='mysql')
        return conn;
    df = pd.read_sql('''
    SELECT  * 
    FROM idf_ru.data_for_scoring
    ''', con=con())


    Next, we replace the rare and missing values ​​with a separate category to prevent ovefitting, select the target, delete the extra columns, and also divide by the train and test.

    Data preparation
    deffilling(df):
        cat_vars = df.select_dtypes(include=[object]).columns
        num_vars = df.select_dtypes(include=[np.number]).columns
        df[cat_vars] = df[cat_vars].fillna('_MISSING_')
        df[num_vars] = df[num_vars].fillna(np.nan)
        return df
    defreplace_not_frequent(df, cols, perc_min=5, value_to_replace = "_ELSE_"):
            else_df = pd.DataFrame(columns=['var', 'list'])
            for i in cols:
                if i != 'date_requested'and i != 'credit_id':
                    t = df[i].value_counts(normalize=True)
                    q = list(t[t.values < perc_min/100].index)
                    if q:
                        else_df = else_df.append(pd.DataFrame([[i, q]], columns=['var', 'list']))
                    df.loc[df[i].value_counts(normalize=True)[df[i]].values < perc_min/100, i] =value_to_replace
            else_df = else_df.set_index('var')
            return df, else_df
    cat_vars = df.select_dtypes(include=[object]).columns
    df = filling(df)
    df, else_df = replace_not_frequent_2(df, cat_vars)
    df.drop(['credit_id', 'target_value', 'bor_credit_id', 'bchg_credit_id', 'last_credit_id', 'bcacr_credit_id',  'bor_bonuses_got' ], axis=1, inplace=True)
    df_train, df_test, y_train, y_test = train_test_split(df, y, test_size=0.33, stratify=df.y, random_state=42)


    Now begins the most important stage in scoring for regression - you need to write WOE-binning for numeric and categorical variables. In the open access, we did not find good and suitable options for us and decided to write it ourselves. They took this article of 2017 as the basis for numerical binning , as well as this , categorical, they wrote from scratch themselves. The results were impressive (Gini rose by 3–5 in comparison with the binning algorithms of external data mining programs).

    After that, you can look at the graphs or tables (which we later write to excel), how the variables were broken into groups and check the monotony:

    image

    image

    Drawing Bean Charts
    defplot_bin(ev, for_excel=False):
        ind = np.arange(len(ev.index)) 
        width = 0.35
        fig, ax1 = plt.subplots(figsize=(10, 7))
        ax2 = ax1.twinx()
        p1 = ax1.bar(ind, ev['NONEVENT'], width, color=(24/254, 192/254, 196/254))
        p2 = ax1.bar(ind, ev['EVENT'], width, bottom=ev['NONEVENT'], color=(246/254, 115/254, 109/254))
        ax1.set_ylabel('Event Distribution', fontsize=15)
        ax2.set_ylabel('WOE', fontsize=15)
        plt.title(list(ev.VAR_NAME)[0], fontsize=20) 
        ax2.plot(ind, ev['WOE'], marker='o', color='blue')
        # Legend
        plt.legend((p2[0], p1[0]), ('bad', 'good'), loc='best', fontsize=10)
        #Set xticklabels
        q = list()
        for i in range(len(ev)):
            try:
                mn = str(round(ev.MIN_VALUE[i], 2))
                mx = str(round(ev.MAX_VALUE[i], 2))
            except:
                mn = str((ev.MIN_VALUE[i]))
                mx = str((ev.MAX_VALUE[i]))
            q.append(mn + '-' + mx)
        plt.xticks(ind, q, rotation='vertical')
        for tick in ax1.get_xticklabels():
            tick.set_rotation(60)
        plt.savefig('{}.png'.format(ev.VAR_NAME[0]), dpi=500, bbox_inches = 'tight')
        plt.show() 
    defplot_all_bins(iv_df):for i in [x.replace('WOE_','') for x in X_train.columns]:
            ev = iv_df[iv_df.VAR_NAME==i]
            ev.reset_index(inplace=True)
            plot_bin(ev)     


    A separate function was written for manual binning, which is useful, for example, in the case of the variable "OS version", where all phones on Android and iOS were grouped manually.

    Manual binning function
    defadjust_binning(df, bins_dict):for i in range(len(bins_dict)):
            key = list(bins_dict.keys())[i]
            if type(list(bins_dict.values())[i])==dict:
                df[key] = df[key].map(list(bins_dict.values())[i])
            else:
                #Categories labels
                categories = list()
                for j in range(len(list(bins_dict.values())[i])):
                    if j == 0:
                        categories.append('<'+ str(list(bins_dict.values())[i][j]))
                        try:                        
                            categories.append('(' + str(list(bins_dict.values())[i][j]) +'; '+ str(list(bins_dict.values())[i][j+1]) + ']')
                        except:                       
                            categories.append('(' + str(list(bins_dict.values())[i][j]))
                    elif j==len(list(bins_dict.values())[i])-1:
                        categories.append(str(list(bins_dict.values())[i][j]) +'>')
                    else:
                        categories.append('(' + str(list(bins_dict.values())[i][j]) +'; '+ str(list(bins_dict.values())[i][j+1]) + ']')
                values = [df[key].min()] + list(bins_dict.values())[i]  + [df[key].max()]        
                df[key + '_bins'] = pd.cut(df[key], values, include_lowest=True, labels=categories).astype(object).fillna('_MISSING_').astype(str)
                df[key] = df[key + '_bins']#.map(df.groupby(key + '_bins')[key].agg('median'))
                df.drop([key + '_bins'], axis=1, inplace=True)
        return df
    bins_dict = {    
       'equi_delinquencyDays': [ 200,400,600]
        'loan_purpose': {'medicine':'1_group',
                        'repair':'1_group',
                        'helpFriend':'2_group'}
    }
    df = adjust_binning(df, bins_dict)


    The next step is the selection of variables on the Information Value. The standard value is a cut-off of 0.1 (all variables below do not have good predictive power).

    After the correlation test was carried out. Of the two correlating variables, you need to remove the one that has IV less. Cut-off removal was taken 0.75.

    image

    Remove correlations
    defdelete_correlated_features(df, cut_off=0.75, exclude = []):# Create correlation matrix
        corr_matrix = df.corr().abs()
        # Select upper triangle of correlation matrix
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
        # Plotting All correlations
        f, ax = plt.subplots(figsize=(15, 10))
        plt.title('All correlations', fontsize=20)
        sns.heatmap(X_train.corr(), annot=True)
        # Plotting highly correlatedtry:
            f, ax = plt.subplots(figsize=(15, 10))
            plt.title('High correlated', fontsize=20)
            sns.heatmap(corr_matrix[(corr_matrix>cut_off) & (corr_matrix!=1)].dropna(axis=0, how='all').dropna(axis=1, how='all'), annot=True, linewidths=.5)
        except:
            print ('No highly correlated features found')
        # Find index of feature columns with correlation greater than cut_off
        to_drop = [column for column in upper.columns if any(upper[column] > cut_off)]
        to_drop = [column for column in to_drop if column notin exclude]
        print ('Dropped columns:', to_drop, '\n')
        df2 = df.drop(to_drop, axis=1)
        print ('Features left after correlation check: {}'.format(len(df.columns)-len(to_drop)), '\n')    
        print ('Not dropped columns:', list(df2.columns), '\n')
        # Plotting final correlations
        f, ax = plt.subplots(figsize=(15, 10))
        plt.title('Final correlations', fontsize=20)
        sns.heatmap(df2.corr(), annot=True)
        plt.show()
        return df2 


    In addition to selecting by IV, we added a recursive search for the optimal number of variables using the RFE method from sklearn.
    As we see in the graph, after 13 variables, the quality does not change, which means that the extra ones can be removed. For regression, more than 15 variables in scoring are considered bad tone, which in most cases is corrected with RFE.

    image
    RFE
    defRFE_feature_selection(clf_lr, X, y):
        rfecv = RFECV(estimator=clf_lr, step=1, cv=StratifiedKFold(5), verbose=0, scoring='roc_auc')
        rfecv.fit(X, y)
        print("Optimal number of features : %d" % rfecv.n_features_)
        # Plot number of features VS. cross-validation scores
        f, ax = plt.subplots(figsize=(14, 9))
        plt.xlabel("Number of features selected")
        plt.ylabel("Cross validation score (nb of correct classifications)")
        plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
        plt.show()
        mask = rfecv.get_support()
        X = X.ix[:, mask]
        return X


    Next, a regression was constructed and its metrics were evaluated for cross-validation and test sampling. Usually everyone looks at the Gini coefficient (a good article about him is here ).

    image

    Simulation results
    defplot_score(clf, X_test, y_test, feat_to_show=30, is_normalize=False, cut_off=0.5):#cm = confusion_matrix(pd.Series(clf.predict_proba(X_test)[:,1]).apply(lambda x: 1 if x>cut_off else 0), y_test)print ('ROC_AUC:  ', round(roc_auc_score(y_test, clf.predict_proba(X_test)[:,1]), 3))
        print ('Gini:     ', round(2*roc_auc_score(y_test, clf.predict_proba(X_test)[:,1]) - 1, 3))
        print ('F1_score: ', round(f1_score(y_test, clf.predict(X_test)), 3))
        print ('Log_loss: ', round(log_loss(y_test, clf.predict(X_test)), 3))
        print ('\n')
        print ('Classification_report: \n', classification_report(pd.Series(clf.predict_proba(X_test)[:,1]).apply(lambda x: 1if x>cut_off else0), y_test))
        skplt.metrics.plot_confusion_matrix(y_test, pd.Series(clf.predict_proba(X_test)[:,1]).apply(lambda x: 1if x>cut_off else0), title="Confusion Matrix",
                        normalize=is_normalize,figsize=(8,8),text_fontsize='large')
        display(eli5.show_weights(clf, top=20, feature_names = list(X_test.columns)))
    clf_lr = LogisticRegressionCV(random_state=1, cv=7)
    clf_lr.fit(X_train, y_train)
    plot_score(clf_lr, X_test, y_test, cut_off=0.5)


    When we made sure that the quality of the model suits us, we need to record all the results (regression coefficients, groups of beans, Gini stability graphs and variables, etc.) in excel. For this, it is convenient to use xlsxwriter, which can work with both data and pictures.

    Exel sheet examples:

    image

    image

    Write to excel
    #WRITING
    writer = pd.ExcelWriter('PDL_Score_20180815-3.xlsx', engine='xlsxwriter')
    workbook  = writer.book
    worksheet = workbook.add_worksheet('Sample information')
    bold = workbook.add_format({'bold': True})
    percent_fmt = workbook.add_format({'num_format': '0.00%'})
    worksheet.set_column('A:A', 20)
    worksheet.set_column('B:B', 15)
    worksheet.set_column('C:C', 10)
    # Sample
    worksheet.write('A2', 'Sample conditions', bold)
    worksheet.write('A3', 1)
    worksheet.write('A4', 2)
    worksheet.write('A5', 3)
    worksheet.write('A6', 4)
    # Model
    worksheet.write('A8', 'Model development', bold)
    worksheet.write('A9', 1)
    #labels
    worksheet.write('C8', 'Bads')
    worksheet.write('D8', 'Goods')
    worksheet.write('B9', 'Train')
    worksheet.write('B10', 'Valid')
    worksheet.write('B11', 'Total')
    # goods and bads
    worksheet.write('C9', y_train.value_counts()[1])
    worksheet.write('C10', y_test.value_counts()[1])
    worksheet.write('D9', y_train.value_counts()[0])
    worksheet.write('D10', y_test.value_counts()[0])
    worksheet.write('C11', y.value_counts()[1])
    worksheet.write('D11', y.value_counts()[0])
    # NPL
    worksheet.write('A13', 2)
    worksheet.write('B13', 'NPL')
    worksheet.write('C13', (y.value_counts()[1]/(y.value_counts()[1]+y.value_counts()[0])), percent_fmt)
    worksheet.write('A16', 3)
    worksheet.write('C15', 'Gini')
    worksheet.write('B16', 'Train')
    worksheet.write('B17', 'Valid')
    worksheet.write('B18', 'CV Scores')
    worksheet.write('C18', str([round(sc, 2) for sc in scores]))
    worksheet.write('C16', round(2*roc_auc_score(y_train, clf_lr.predict_proba(X_train)[:,1]) - 1, 3))
    worksheet.write('C17', round(2*roc_auc_score(y_test, clf_lr.predict_proba(X_test)[:,1]) - 1, 3))
    # Regreesion coefs
    feat.to_excel(writer, sheet_name='Regression coefficients', index=False)
    worksheet2 = writer.sheets['Regression coefficients']
    worksheet2.set_column('A:A', 15)
    worksheet2.set_column('B:B', 50)
    #WOE
    ivs[['VAR_NAME', 'Variable range', 'WOE', 'COUNT', 'WOE_group']].to_excel(writer, sheet_name='WOE', index=False)
    worksheet3 = writer.sheets['WOE']
    worksheet3.set_column('A:A', 50)
    worksheet3.set_column('B:B', 60)
    worksheet3.set_column('C:C', 30)
    worksheet3.set_column('D:D', 20)
    worksheet3.set_column('E:E', 12)
    for num, i in enumerate([x.replace('WOE_','') for x in X_train.columns]):
            ev = iv_df[iv_df.VAR_NAME==i]
            ev.reset_index(inplace=True)
            worksheet3.insert_image('G{}'.format(num*34+1), '{}.png'.format(i))
    df3.to_excel(writer, sheet_name='Data', index=False)
    table.to_excel(writer, sheet_name='Scores by buckets', header = True, index = True)
    worksheet4 = writer.sheets['Scores by buckets']
    worksheet4.set_column('A:A', 20)
    worksheet4.insert_image('J1', 'score_distribution.png')
    Ginis.to_excel(writer, sheet_name='Gini distribution', header = True, index = True)
    worksheet5 = writer.sheets['Gini distribution']
    worksheet5.insert_image('E1', 'gini_stability.png')
    writer.save()


    The final excel at the end once again looks like management, and then given to IT to embed the model in production.

    Total


    As we have seen, almost all stages of scoring can be automated so that analysts do not need programming skills to build models. In our case, after creating this framework, the analyst is only required to collect data and specify several parameters (specify the target variable, which columns to delete, the minimum number of bins, the cut-off coefficient for correlation of variables, etc.), after which you can run the python script, who builds a model and gives excel with the desired results.
    Of course, sometimes you have to correct the code for the needs of a specific project, and you will not manage to run a single script button while modeling, but even now we see the quality better than the packages used in the data mining market thanks to such techniques as optimal and monotonous binning, correlation testing , RFE, regularized version of regression, etc.

    Thus, thanks to the use of Python, we have significantly reduced the development time for scoring cards, as well as reduced labor costs for analysts.

    Also popular now: