[Financial Risk Control Series] [3] Loan Default Detection

Reposted 2025-07-23
Source: https://www.php.cn/faq/1421779.html

This article works through the Kaggle Home Credit Default Risk competition, building a model to predict clients' repayment ability from seven tables, including the client application table. After data cleaning and feature engineering that fuses information across the tables into derived features, a LightGBM model is trained, reaching a final leaderboard score of 0.78277 and offering a reference for loan assessment of people with thin credit histories.


Home Credit Default Risk

This competition comes from Kaggle and is used here for learning and exchange only.

Because their credit histories are thin or nonexistent, many people are classified as low-credit borrowers and struggle to obtain loans. To serve this population, Home Credit uses alternative data (including telecom and transaction records) to predict clients' repayment ability.

Home Credit provides seven tables with 218 fields in total; the training set has about 310,000 samples (8% in default) and the test set about 50,000.
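A recurring device in the notebook below is to stack the train and test sets into one frame so feature engineering is applied once, then split them apart again by whether TARGET is null. A minimal sketch of that round trip with invented toy values (pd.concat is used here in place of the deprecated DataFrame.append that the notebook calls):

```python
import pandas as pd

# Toy stand-ins: test rows have no TARGET column.
train = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0.0, 1.0]})
test = pd.DataFrame({'SK_ID_CURR': [3]})

# Stacking aligns columns by name; the missing TARGET becomes NaN on test rows.
df = pd.concat([train, test], ignore_index=True)

# ... shared cleaning / feature engineering would happen here ...

# Split back apart: a NaN TARGET identifies the test rows.
train_back = df[df['TARGET'].notnull()]
test_back = df[df['TARGET'].isnull()]
print(len(train_back), len(test_back))  # 2 1
```

This avoids writing every feature transformation twice, at the cost of having to remember that TARGET is NaN, not 0, for test rows.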


Data tables

application_train/test — client application table

Contains:

  • the target variable (whether the client defaulted, a 0/1 variable)
  • loan application details (loan type, credit amount, annuity)
  • client demographics (gender, age, family, education, occupation, industry, housing situation)
  • client financials (annual income, house/car ownership)
  • documents provided with the application

bureau/bureau_balance — the client's credit history reported to the credit bureau by other financial institutions (monthly data)

Contains the client's credit-bureau records — credit history, overdue amounts, overdue dates, etc. — recorded as a time series (one row per month).

POS_CASH_balance — the client's POS (point of sale) and cash-loan history in the Home Credit database (monthly data)

Contains the client's paid and unpaid installment status.

credit_card_balance — monthly snapshots of the client's credit cards in the Home Credit database

Contains the client's purchase counts, purchase amounts, and related fields.

previous_application — the client's previous applications

Contains all of the client's historical applications (application details, outcomes, etc.).

installments_payments — repayment records on the client's previous credit

Contains the client's repayment history: payment dates, whether payments were late, payment amounts, and outstanding balances.
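Each of these auxiliary tables is keyed one-to-many against the application table, so the modelling below reduces to aggregating each table to one row per SK_ID_CURR and merging the results in. A minimal sketch of that pattern with toy data (column names mirror the real schema, values are invented):

```python
import pandas as pd

# Toy stand-in for a bureau-style table: several rows per client.
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2, 2],
    'AMT_CREDIT_SUM': [1000.0, 2000.0, 1500.0, 500.0, 700.0],
})

# Aggregate to one row per client, flattening the MultiIndex columns
# the same way the notebook's group() helper does.
agg = bureau.groupby('SK_ID_CURR').agg({'AMT_CREDIT_SUM': ['mean', 'max']})
agg.columns = ['BUREAU_' + c[0] + '_' + c[1].upper() for c in agg.columns]
agg = agg.reset_index()

# Merge back onto the (toy) application table.
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
app = app.merge(agg, on='SK_ID_CURR', how='left')
print(app)
```

Client 3 has no bureau rows, so its aggregate features come out as NaN after the left merge — which LightGBM can consume directly without imputation.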

References:

[1] https://zhuanlan.zhihu.com/p/43541825

[2] https://www.kaggle.com/xucheng/cv-7993-private-score-7996/

[3] https://zhuanlan.zhihu.com/p/40790434

[4] https://www.kaggle.com/tahmidnafi/cse499

[5] https://blog.csdn.net/zhangchen2449/article/details/83338978

Main field tables

In [20]
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data

In [22]
# Install dependencies
!pip install xgboost
!pip install lightgbm

In [23]
import os
import gc
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
import xgboost as xgb
from xgboost import XGBClassifier
warnings.simplefilter(action='ignore', category=FutureWarning)
from lightgbm import LGBMClassifier

In [24]
DATA_DIRECTORY = "./data"
df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv'))
df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv'))
df = df_train.append(df_test)
del df_train, df_test; gc.collect()

In [25]
df = df[df['AMT_INCOME_TOTAL'] < 20000000]
df = df[df['CODE_GENDER'] != 'XNA']
df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)

In [26]
def get_age_group(days_birth):
    age_years = -days_birth / 365
    if age_years < 27: return 1
    elif age_years < 40: return 2
    elif age_years < 50: return 3
    elif age_years < 65: return 4
    elif age_years < 99: return 5
    else: return 0

In [27]
docs = [f for f in df.columns if 'FLAG_DOC' in f]
df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)
df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_group(x))

In [28]
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    df[feature_name] = eval('np.{}'.format(function_name))(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)

In [29]
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']
df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']
df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']

In [30]
def do_mean(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

In [31]
def do_median(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

In [32]
def do_std(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

In [33]
def do_sum(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

In [34]
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER']
df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN')
df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD')
df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN')
df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD')
df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN')
df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD')
df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN')
df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN')
df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')

In [35]
def label_encoder(df, categorical_columns=None):
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    for col in categorical_columns:
        df[col], uniques = pd.factorize(df[col])
    return df, categorical_columns

In [36]
def drop_application_columns(df):
    drop_list = [
        'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START',
        'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE',
        'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
        'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
        'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR',
        'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_AVG',
        'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE',
        'FLOORSMIN_AVG', 'LANDAREA_AVG', 'FLOORSMIN_MODE', 'LANDAREA_MEDI',
        'COMMONAREA_MEDI', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'BASEMENTAREA_AVG',
        'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
        'LIVINGAPARTMENTS_AVG', 'ELEVATORS_AVG', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE',
        'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI',
        'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI',
        'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_AVG', 'HOUSETYPE_MODE',
        'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
    ]
    for doc_num in [2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21]:
        drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
    df.drop(drop_list, axis=1, inplace=True)
    return df

In [37]
df, le_encoded_cols = label_encoder(df, None)
df = drop_application_columns(df)

In [38]
df = pd.get_dummies(df)

In [39]
bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))

In [40]
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT']
bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT']
bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']

In [41]
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
    original_columns = list(df.columns)
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    categorical_columns = [c for c in df.columns if c not in original_columns]
    return df, categorical_columns

In [42]
def group(df_to_agg, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()

In [43]
def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by=aggregate_by)
    return df_to_merge.merge(agg_df, how='left', on=aggregate_by)

In [44]
def get_bureau_balance(path, num_rows=None):
    bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
    bb, categorical_cols = one_hot_encoder(bb, nan_as_category=False)
    # Calculate rate for each category with decay
    bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index()
    # Min, max, count and mean duration of payments (months)
    agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
    bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU')
    del bb; gc.collect()
    return bb_processed

In [45]
bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category=False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
bureau['STATUS_12345'] = 0
for i in range(1, 6):
    bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]

In [46]
features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM',
            'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF', 'STATUS_0', 'STATUS_12345']
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')
del agg_length; gc.collect()

In [47]
BUREAU_AGG = {
    'SK_ID_BUREAU': ['nunique'],
    'DAYS_CREDIT': ['min', 'max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'],
    'AMT_ANNUITY': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean', 'sum'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
    'STATUS_0': ['mean'],
    'STATUS_1': ['mean'],
    'STATUS_12345': ['mean'],
    'STATUS_C': ['mean'],
    'STATUS_X': ['mean'],
    'CREDIT_ACTIVE_Active': ['mean'],
    'CREDIT_ACTIVE_Closed': ['mean'],
    'CREDIT_ACTIVE_Sold': ['mean'],
    'CREDIT_TYPE_Consumer credit': ['mean'],
    'CREDIT_TYPE_Credit card': ['mean'],
    'CREDIT_TYPE_Car loan': ['mean'],
    'CREDIT_TYPE_Mortgage': ['mean'],
    'CREDIT_TYPE_Microloan': ['mean'],
    'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'LL_DEBT_CREDIT_DIFF': ['mean'],
    'LL_STATUS_12345': ['mean'],
}
BUREAU_ACTIVE_AGG = {
    'DAYS_CREDIT': ['max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['min', 'mean'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'CREDIT_TO_ANNUITY_RATIO': ['mean'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}
BUREAU_CLOSED_AGG = {
    'DAYS_CREDIT': ['max', 'var'],
    'DAYS_CREDIT_ENDDATE': ['max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'sum'],
    'DAYS_CREDIT_UPDATE': ['max'],
    'ENDDATE_DIF': ['mean'],
    'STATUS_12345': ['mean'],
}
BUREAU_LOAN_TYPE_AGG = {
    'DAYS_CREDIT': ['mean', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],
    'AMT_CREDIT_SUM': ['mean', 'max'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'max'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'DAYS_CREDIT_ENDDATE': ['max'],
}
BUREAU_TIME_AGG = {
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'STATUS_0': ['mean'],
    'STATUS_12345': ['mean'],
}

In [48]
agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active, agg_bureau, 'BUREAU_ACTIVE_', BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed, agg_bureau, 'BUREAU_CLOSED_', BUREAU_CLOSED_AGG)
del active, closed; gc.collect()
for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
    type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
    prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
    agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG)
    del type_df; gc.collect()
for time_frame in [6, 12]:
    prefix = "BUREAU_LAST{}M_".format(time_frame)
    time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30 * time_frame]
    agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG)
    del time_frame_df; gc.collect()

In [49]
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
# rename() needs columns= here; without it the mapping is applied to the index
gr.rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM'] / agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM'] / agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']

In [50]
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')
del agg_bureau, bureau
gc.collect()

In [51]
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv'))
pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))

In [52]
PREVIOUS_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],
    'RATE_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['max'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'],
    'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}
PREVIOUS_ACTIVE_AGG = {
    'SK_ID_PREV': ['nunique'],
    'SIMPLE_INTERESTS': ['mean'],
    'AMT_ANNUITY': ['max', 'sum'],
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['sum'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['mean', 'sum'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'AMT_PAYMENT': ['sum'],
    'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'],
    'REMAINING_DEBT': ['max', 'mean', 'sum'],
    'REPAYMENT_RATIO': ['mean'],
}
PREVIOUS_LATE_PAYMENTS_AGG = {
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
PREVIOUS_LOAN_TYPE_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'],
    'APPLICATION_CREDIT_DIFF': ['min', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'DAYS_DECISION': ['max'],
    'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'],
    'CNT_PAYMENT': ['mean'],
}
PREVIOUS_TIME_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['mean', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
PREVIOUS_APPROVED_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_CREDIT': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max'],
    'AMT_GOODS_PRICE': ['max'],
    'HOUR_APPR_PROCESS_START': ['min', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['mean'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['max'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    # The following features are only for approved applications
    'DAYS_FIRST_DRAWING': ['max', 'mean'],
    'DAYS_FIRST_DUE': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE': ['max', 'mean'],
    'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'],
    'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}
PREVIOUS_REFUSED_AGG = {
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['min', 'max'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

In [53]
ohe_columns = [
    'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE',
    'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
    'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE']
prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category=False)

In [54]
prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT'] / prev['AMT_ANNUITY']
prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT']
total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']
prev['SIMPLE_INTERESTS'] = (total_payment / prev['AMT_CREDIT'] - 1) / prev['CNT_PAYMENT']

In [55]
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1]
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace=True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on='SK_ID_PREV', how='left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM'] / \
                                         active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']
del active_pay, active_pay_agg, active_df; gc.collect()

In [56]
prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace=True)
prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace=True)
prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace=True)
prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace=True)
prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace=True)

In [57]
prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']
approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']

In [58]
categorical_agg = {key: ['mean'] for key in categorical_cols}

In [59]
agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')
del active_agg_df; gc.collect()

In [60]
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG)
refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)
del approved, refused; gc.collect()

In [61]
for loan_type in ['Consumer loans', 'Cash loans']:
    type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
    prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
    agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG)
    del type_df; gc.collect()

In [62]
pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT']
pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0)
dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()

In [63]
# Assign the merged result back to agg_prev; the original discarded it
agg_prev = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev,
                           'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)
del dpd_id; gc.collect()

In [64]
for time_frame in [12, 24]:
    time_frame_df = prev[prev['DAYS_DECISION'] >= -30 * time_frame]
    prefix = 'PREV_LAST{}M_'.format(time_frame)
    agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG)
    del time_frame_df; gc.collect()
del prev; gc.collect()

In [65]
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')

In [66]
train = df[df['TARGET'].notnull()]
test = df[df['TARGET'].isnull()]
del df
gc.collect()

In [67]
labels = train['TARGET']
test_lebels = test['TARGET']
train = train.drop(columns=['TARGET'])
test = test.drop(columns=['TARGET'])

In [68]
feature = list(train.columns)
train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df = test.copy()
train_df = train.copy()
train_df['TARGET'] = labels
test_df['TARGET'] = test_lebels

In [69]
imputer = SimpleImputer(strategy='median')
# Fit on the training data only, then apply the same medians to both sets
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

In [70]
scaler = MinMaxScaler(feature_range=(0, 1))
# Likewise, fit the scaler on the training data only
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

In [71]
from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(train, labels)

In [72]
lgbm_pred = lgbmc.predict_proba(test)[:, 1]

In [74]
submit = test_df[['SK_ID_CURR']].copy()  # .copy() avoids SettingWithCopyWarning
submit['TARGET'] = lgbm_pred

In [75]
submit.to_csv('lgbm.csv', index=False)

Summary

The submission result is shown below. (Submitting requires a network connection that can reach Kaggle.)
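The leaderboard score reported for this competition is ROC AUC over the predicted default probabilities: the probability that a randomly chosen defaulter is ranked above a randomly chosen non-defaulter. A small sketch of computing the same metric offline (toy labels and scores, not the competition data):

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth (1 = default) and model scores; values are invented.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly -> AUC 0.75.
print(roc_auc_score(y_true, y_score))  # 0.75
```

Evaluating roc_auc_score on held-out folds of the training set is the usual way to estimate the leaderboard score before submitting.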
