[Financial Risk Control Series] [3] Loan Default Detection

This article works through the Kaggle Home Credit Default Risk competition, using the client application table plus six auxiliary tables to build a model that predicts a client's ability to repay. After data cleaning and feature engineering, information from the multiple tables is merged into derived features and a LightGBM model is trained, reaching a public leaderboard score of 0.78277. The approach offers a reference point for evaluating loans to people with thin credit histories.
Home Credit Default Risk
This competition is hosted on Kaggle; the material here is for learning and discussion only.
Many people find it hard to obtain loans because they have little or no credit history and are therefore classified as low-credit borrowers. To extend loans to this group, Home Credit uses alternative data, including telecom and transaction records, to predict clients' ability to repay.
Home Credit provides 7 tables with 218 fields in total. The training set contains about 310,000 samples (roughly 8% of them defaults) and the test set about 50,000.
Data tables
application_train / application_test — the client application table. It contains the target variable (whether the client defaulted, a 0/1 flag), information about the current application (loan type, credit amount, annuity), basic client attributes (gender, age, family, education, occupation, industry, housing situation), financial information (annual income, house and car ownership), and the documents supplied with the application.

bureau / bureau_balance — the client's credit history at the credit bureau, as reported by other financial institutions (monthly data). It contains the client's bureau credit records, overdue amounts, overdue dates and so on, recorded as time series (one row per month).

POS_CASH_balance — the monthly history of the client's POS (point of sales) and cash loans in the Home Credit database. It records which instalments have been paid and which remain unpaid.

credit_card_balance — monthly snapshots of the client's credit cards in the Home Credit database. It contains the number of transactions, amounts spent, and similar information.

previous_application — the client's previous applications at Home Credit. It contains all historical applications (application details, outcome, etc.).

installments_payments — the repayment history for the client's previous credits at Home Credit. It records payment dates, whether a payment was late, payment amounts, and whether there are arrears.

References:
[1] https://zhuanlan.zhihu.com/p/43541825
[2] https://www.kaggle.com/xucheng/cv-7993-private-score-7996/
[3] https://zhuanlan.zhihu.com/p/40790434
[4] https://www.kaggle.com/tahmidnafi/cse499
[5] https://blog.csdn.net/zhangchen2449/article/details/83338978
Main fields
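The original post shows a field-overview figure here, which is not reproduced. As a rough substitute, the sketch below prints each table's shape and first few columns; it assumes the CSVs use the standard Kaggle file names and have already been extracted to ./data (the extraction command appears in the next cell).

import os
import pandas as pd

DATA_DIRECTORY = "./data"  # assumed extraction path, matching the rest of the notebook
TABLES = ['application_train', 'application_test', 'bureau', 'bureau_balance',
          'POS_CASH_balance', 'credit_card_balance',
          'previous_application', 'installments_payments']

for name in TABLES:
    table = pd.read_csv(os.path.join(DATA_DIRECTORY, name + '.csv'))
    # Row/column counts and a peek at the leading field names of each table
    print(name, table.shape, list(table.columns[:3]))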
#!unzip -q -o data/data105246/home_credit_default_risk.zip -d /home/aistudio/data
# Install dependencies
!pip install xgboost
!pip install lightgbm
import os
import gc
import numpy as np
import pandas as pd
from scipy.stats import kurtosis
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
import xgboost as xgb
from xgboost import XGBClassifier
warnings.simplefilter(action='ignore', category=FutureWarning)
from lightgbm import LGBMClassifier

DATA_DIRECTORY = "./data"
df_train = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_train.csv'))
df_test = pd.read_csv(os.path.join(DATA_DIRECTORY, 'application_test.csv'))
df = df_train.append(df_test)
del df_train, df_test; gc.collect()
# Basic cleaning: drop an extreme income outlier and the rows with unknown gender,
# and treat placeholder values in the day-count columns as missing
df = df[df['AMT_INCOME_TOTAL'] < 20000000]
df = df[df['CODE_GENDER'] != 'XNA']
df['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
df['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan, inplace=True)
def get_age_group(days_birth):
    age_years = -days_birth / 365
    if age_years < 27:
        return 1
    elif age_years < 40:
        return 2
    elif age_years < 50:
        return 3
    elif age_years < 65:
        return 4
    elif age_years < 99:
        return 5
    else:
        return 0

docs = [f for f in df.columns if 'FLAG_DOC' in f]
df['DOCUMENT_COUNT'] = df[docs].sum(axis=1)
df['NEW_DOC_KURT'] = df[docs].kurtosis(axis=1)
df['AGE_RANGE'] = df['DAYS_BIRTH'].apply(lambda x: get_age_group(x))
df['EXT_SOURCES_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
df['EXT_SOURCES_WEIGHTED'] = df.EXT_SOURCE_1 * 2 + df.EXT_SOURCE_2 * 1 + df.EXT_SOURCE_3 * 3
np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    df[feature_name] = eval('np.{}'.format(function_name))(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)
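The eval call above works, but the same lookup can be done with getattr, which avoids evaluating strings. An equivalent sketch that reuses the df and np objects from the cells above:

# Fetch np.min / np.max / np.mean / np.nanmedian / np.var by name without eval
for function_name in ['min', 'max', 'mean', 'nanmedian', 'var']:
    feature_name = 'EXT_SOURCES_{}'.format(function_name.upper())
    df[feature_name] = getattr(np, function_name)(
        df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']], axis=1)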
df['CREDIT_TO_ANNUITY_RATIO'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']
df['CREDIT_TO_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
df['ANNUITY_TO_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
df['CREDIT_TO_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
df['INCOME_TO_EMPLOYED_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_EMPLOYED']
df['INCOME_TO_BIRTH_RATIO'] = df['AMT_INCOME_TOTAL'] / df['DAYS_BIRTH']
df['EMPLOYED_TO_BIRTH_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
df['ID_TO_BIRTH_RATIO'] = df['DAYS_ID_PUBLISH'] / df['DAYS_BIRTH']
df['CAR_TO_BIRTH_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_BIRTH']
df['CAR_TO_EMPLOYED_RATIO'] = df['OWN_CAR_AGE'] / df['DAYS_EMPLOYED']
df['PHONE_TO_BIRTH_RATIO'] = df['DAYS_LAST_PHONE_CHANGE'] / df['DAYS_BIRTH']
def do_mean(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].mean().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

def do_median(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].median().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

def do_std(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].std().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

def do_sum(df, group_cols, counted, agg_name):
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].sum().reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df
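The four helpers above differ only in the aggregation they apply, so a single parameterised helper could replace them. A minimal sketch; the name do_agg is hypothetical and not part of the original notebook:

def do_agg(df, group_cols, counted, agg_name, func='mean'):
    # func is any aggregation name pandas understands: 'mean', 'median', 'std', 'sum', ...
    gp = df[group_cols + [counted]].groupby(group_cols)[counted].agg(func).reset_index().rename(
        columns={counted: agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    del gp
    gc.collect()
    return df

For example, do_agg(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN', 'mean') would reproduce the do_mean call used in the next cell.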
group = ['ORGANIZATION_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'AGE_RANGE', 'CODE_GENDER']
df = do_median(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_MEDIAN')
df = do_std(df, group, 'EXT_SOURCES_MEAN', 'GROUP_EXT_SOURCES_STD')
df = do_mean(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_MEAN')
df = do_std(df, group, 'AMT_INCOME_TOTAL', 'GROUP_INCOME_STD')
df = do_mean(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_MEAN')
df = do_std(df, group, 'CREDIT_TO_ANNUITY_RATIO', 'GROUP_CREDIT_TO_ANNUITY_STD')
df = do_mean(df, group, 'AMT_CREDIT', 'GROUP_CREDIT_MEAN')
df = do_mean(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_MEAN')
df = do_std(df, group, 'AMT_ANNUITY', 'GROUP_ANNUITY_STD')
def label_encoder(df, categorical_columns=None):
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    for col in categorical_columns:
        df[col], uniques = pd.factorize(df[col])
    return df, categorical_columns
def drop_application_columns(df):
    drop_list = [
        'CNT_CHILDREN', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START',
        'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'FLAG_PHONE',
        'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
        'REG_CITY_NOT_WORK_CITY', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
        'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR',
        'COMMONAREA_MODE', 'NONLIVINGAREA_MODE', 'ELEVATORS_MODE', 'NONLIVINGAREA_AVG',
        'FLOORSMIN_MEDI', 'LANDAREA_MODE', 'NONLIVINGAREA_MEDI', 'LIVINGAPARTMENTS_MODE',
        'FLOORSMIN_AVG', 'LANDAREA_AVG', 'FLOORSMIN_MODE', 'LANDAREA_MEDI',
        'COMMONAREA_MEDI', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'BASEMENTAREA_AVG',
        'BASEMENTAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
        'LIVINGAPARTMENTS_AVG', 'ELEVATORS_AVG', 'YEARS_BUILD_MEDI', 'ENTRANCES_MODE',
        'NONLIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'LIVINGAPARTMENTS_MEDI',
        'YEARS_BUILD_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI',
        'YEARS_BEGINEXPLUATATION_MODE', 'NONLIVINGAPARTMENTS_AVG', 'HOUSETYPE_MODE',
        'FONDKAPREMONT_MODE', 'EMERGENCYSTATE_MODE'
    ]
    # Also drop most of the FLAG_DOCUMENT_* indicator columns
    for doc_num in [2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21]:
        drop_list.append('FLAG_DOCUMENT_{}'.format(doc_num))
    df.drop(drop_list, axis=1, inplace=True)
    return df
df, le_encoded_cols = label_encoder(df, None)
df = drop_application_columns(df)

df = pd.get_dummies(df)

bureau = pd.read_csv(os.path.join(DATA_DIRECTORY, 'bureau.csv'))

bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_CREDIT_SUM_DEBT']
bureau['DEBT_CREDIT_DIFF'] = bureau['AMT_CREDIT_SUM'] - bureau['AMT_CREDIT_SUM_DEBT']
bureau['CREDIT_TO_ANNUITY_RATIO'] = bureau['AMT_CREDIT_SUM'] / bureau['AMT_ANNUITY']
def one_hot_encoder(df, categorical_columns=None, nan_as_category=True):
    original_columns = list(df.columns)
    if not categorical_columns:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    categorical_columns = [c for c in df.columns if c not in original_columns]
    return df, categorical_columns

def group(df_to_agg, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = df_to_agg.groupby(aggregate_by).agg(aggregations)
    agg_df.columns = pd.Index(['{}{}_{}'.format(prefix, e[0], e[1].upper())
                               for e in agg_df.columns.tolist()])
    return agg_df.reset_index()

def group_and_merge(df_to_agg, df_to_merge, prefix, aggregations, aggregate_by='SK_ID_CURR'):
    agg_df = group(df_to_agg, prefix, aggregations, aggregate_by=aggregate_by)
    return df_to_merge.merge(agg_df, how='left', on=aggregate_by)
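To make the naming convention concrete, here is a toy call of group on made-up data; the values are illustrative only:

toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                    'DAYS_CREDIT': [-100, -200, -50]})
# Produces the columns SK_ID_CURR, BUREAU_DAYS_CREDIT_MIN and BUREAU_DAYS_CREDIT_MEAN,
# with rows (1, -200, -150.0) and (2, -50, -50.0)
print(group(toy, 'BUREAU_', {'DAYS_CREDIT': ['min', 'mean']}))

The prefixed names produced this way (for instance BUREAU_AMT_CREDIT_SUM_DEBT_SUM) are what the later cells refer to.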
def get_bureau_balance(path, num_rows=None):
    bb = pd.read_csv(os.path.join(path, 'bureau_balance.csv'))
    bb, categorical_cols = one_hot_encoder(bb, nan_as_category=False)
    # Calculate rate for each category with decay
    bb_processed = bb.groupby('SK_ID_BUREAU')[categorical_cols].mean().reset_index()
    # Min, Max, Count and mean duration of payments (months)
    agg = {'MONTHS_BALANCE': ['min', 'max', 'mean', 'size']}
    bb_processed = group_and_merge(bb, bb_processed, '', agg, 'SK_ID_BUREAU')
    del bb; gc.collect()
    return bb_processed

bureau, categorical_cols = one_hot_encoder(bureau, nan_as_category=False)
bureau = bureau.merge(get_bureau_balance(DATA_DIRECTORY), how='left', on='SK_ID_BUREAU')
bureau['STATUS_12345'] = 0
for i in range(1, 6):
    bureau['STATUS_12345'] += bureau['STATUS_{}'.format(i)]

features = ['AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM',
            'AMT_CREDIT_SUM_DEBT', 'DEBT_PERCENTAGE', 'DEBT_CREDIT_DIFF',
            'STATUS_0', 'STATUS_12345']
agg_length = bureau.groupby('MONTHS_BALANCE_SIZE')[features].mean().reset_index()
agg_length.rename({feat: 'LL_' + feat for feat in features}, axis=1, inplace=True)
bureau = bureau.merge(agg_length, how='left', on='MONTHS_BALANCE_SIZE')
del agg_length; gc.collect()
BUREAU_AGG = {
    'SK_ID_BUREAU': ['nunique'],
    'DAYS_CREDIT': ['min', 'max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean', 'sum'],
    'AMT_ANNUITY': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean', 'sum'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
    'STATUS_0': ['mean'],
    'STATUS_1': ['mean'],
    'STATUS_12345': ['mean'],
    'STATUS_C': ['mean'],
    'STATUS_X': ['mean'],
    'CREDIT_ACTIVE_Active': ['mean'],
    'CREDIT_ACTIVE_Closed': ['mean'],
    'CREDIT_ACTIVE_Sold': ['mean'],
    'CREDIT_TYPE_Consumer credit': ['mean'],
    'CREDIT_TYPE_Credit card': ['mean'],
    'CREDIT_TYPE_Car loan': ['mean'],
    'CREDIT_TYPE_Mortgage': ['mean'],
    'CREDIT_TYPE_Microloan': ['mean'],
    'LL_AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'LL_DEBT_CREDIT_DIFF': ['mean'],
    'LL_STATUS_12345': ['mean'],
}
BUREAU_ACTIVE_AGG = {
    'DAYS_CREDIT': ['max', 'mean'],
    'DAYS_CREDIT_ENDDATE': ['min', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'AMT_CREDIT_SUM_OVERDUE': ['max', 'mean'],
    'DAYS_CREDIT_UPDATE': ['min', 'mean'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'CREDIT_TO_ANNUITY_RATIO': ['mean'],
    'MONTHS_BALANCE_MEAN': ['mean', 'var'],
    'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
}
BUREAU_CLOSED_AGG = {
    'DAYS_CREDIT': ['max', 'var'],
    'DAYS_CREDIT_ENDDATE': ['max'],
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['max', 'sum'],
    'DAYS_CREDIT_UPDATE': ['max'],
    'ENDDATE_DIF': ['mean'],
    'STATUS_12345': ['mean'],
}
BUREAU_LOAN_TYPE_AGG = {
    'DAYS_CREDIT': ['mean', 'max'],
    'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],
    'AMT_CREDIT_SUM': ['mean', 'max'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'max'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'DAYS_CREDIT_ENDDATE': ['max'],
}
BUREAU_TIME_AGG = {
    'AMT_CREDIT_MAX_OVERDUE': ['max', 'mean'],
    'AMT_CREDIT_SUM_OVERDUE': ['mean'],
    'AMT_CREDIT_SUM': ['max', 'sum'],
    'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
    'DEBT_PERCENTAGE': ['mean'],
    'DEBT_CREDIT_DIFF': ['mean'],
    'STATUS_0': ['mean'],
    'STATUS_12345': ['mean'],
}

agg_bureau = group(bureau, 'BUREAU_', BUREAU_AGG)
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
agg_bureau = group_and_merge(active, agg_bureau, 'BUREAU_ACTIVE_', BUREAU_ACTIVE_AGG)
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
agg_bureau = group_and_merge(closed, agg_bureau, 'BUREAU_CLOSED_', BUREAU_CLOSED_AGG)
del active, closed; gc.collect()
for credit_type in ['Consumer credit', 'Credit card', 'Mortgage', 'Car loan', 'Microloan']:
    type_df = bureau[bureau['CREDIT_TYPE_' + credit_type] == 1]
    prefix = 'BUREAU_' + credit_type.split(' ')[0].upper() + '_'
    agg_bureau = group_and_merge(type_df, agg_bureau, prefix, BUREAU_LOAN_TYPE_AGG)
    del type_df; gc.collect()
for time_frame in [6, 12]:
    prefix = "BUREAU_LAST{}M_".format(time_frame)
    time_frame_df = bureau[bureau['DAYS_CREDIT'] >= -30 * time_frame]
    agg_bureau = group_and_merge(time_frame_df, agg_bureau, prefix, BUREAU_TIME_AGG)
    del time_frame_df; gc.collect()
sort_bureau = bureau.sort_values(by=['DAYS_CREDIT'])
gr = sort_bureau.groupby('SK_ID_CURR')['AMT_CREDIT_MAX_OVERDUE'].last().reset_index()
gr.rename(columns={'AMT_CREDIT_MAX_OVERDUE': 'BUREAU_LAST_LOAN_MAX_OVERDUE'}, inplace=True)
agg_bureau = agg_bureau.merge(gr, on='SK_ID_CURR', how='left')
agg_bureau['BUREAU_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_AMT_CREDIT_SUM_DEBT_SUM'] / agg_bureau['BUREAU_AMT_CREDIT_SUM_SUM']
agg_bureau['BUREAU_ACTIVE_DEBT_OVER_CREDIT'] = \
    agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_DEBT_SUM'] / agg_bureau['BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM']
df = pd.merge(df, agg_bureau, on='SK_ID_CURR', how='left')
del agg_bureau, bureau
gc.collect()
prev = pd.read_csv(os.path.join(DATA_DIRECTORY, 'previous_application.csv'))
pay = pd.read_csv(os.path.join(DATA_DIRECTORY, 'installments_payments.csv'))

PREVIOUS_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'HOUR_APPR_PROCESS_START': ['min', 'max', 'mean'],
    'RATE_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['max'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean', 'var'],
    'DOWN_PAYMENT_TO_CREDIT': ['mean'],
}
PREVIOUS_ACTIVE_AGG = {
    'SK_ID_PREV': ['nunique'],
    'SIMPLE_INTERESTS': ['mean'],
    'AMT_ANNUITY': ['max', 'sum'],
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['sum'],
    'AMT_DOWN_PAYMENT': ['max', 'mean'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['mean', 'sum'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'AMT_PAYMENT': ['sum'],
    'INSTALMENT_PAYMENT_DIFF': ['mean', 'max'],
    'REMAINING_DEBT': ['max', 'mean', 'sum'],
    'REPAYMENT_RATIO': ['mean'],
}
PREVIOUS_LATE_PAYMENTS_AGG = {
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
PREVIOUS_LOAN_TYPE_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['min', 'mean', 'max', 'var'],
    'APPLICATION_CREDIT_DIFF': ['min', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'DAYS_DECISION': ['max'],
    'DAYS_LAST_DUE_1ST_VERSION': ['max', 'mean'],
    'CNT_PAYMENT': ['mean'],
}
PREVIOUS_TIME_AGG = {
    'AMT_CREDIT': ['sum'],
    'AMT_ANNUITY': ['mean', 'max'],
    'SIMPLE_INTERESTS': ['mean', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}
PREVIOUS_APPROVED_AGG = {
    'SK_ID_PREV': ['nunique'],
    'AMT_ANNUITY': ['min', 'max', 'mean'],
    'AMT_CREDIT': ['min', 'max', 'mean'],
    'AMT_DOWN_PAYMENT': ['max'],
    'AMT_GOODS_PRICE': ['max'],
    'HOUR_APPR_PROCESS_START': ['min', 'max'],
    'DAYS_DECISION': ['min', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    'DAYS_TERMINATION': ['mean'],
    # Engineered features
    'CREDIT_TO_ANNUITY_RATIO': ['mean', 'max'],
    'APPLICATION_CREDIT_DIFF': ['max'],
    'APPLICATION_CREDIT_RATIO': ['min', 'max', 'mean'],
    # The following features are only for approved applications
    'DAYS_FIRST_DRAWING': ['max', 'mean'],
    'DAYS_FIRST_DUE': ['min', 'mean'],
    'DAYS_LAST_DUE_1ST_VERSION': ['min', 'max', 'mean'],
    'DAYS_LAST_DUE': ['max', 'mean'],
    'DAYS_LAST_DUE_DIFF': ['min', 'max', 'mean'],
    'SIMPLE_INTERESTS': ['min', 'max', 'mean'],
}
PREVIOUS_REFUSED_AGG = {
    'AMT_APPLICATION': ['max', 'mean'],
    'AMT_CREDIT': ['min', 'max'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['max', 'mean'],
    # Engineered features
    'APPLICATION_CREDIT_DIFF': ['min', 'max', 'mean', 'var'],
    'APPLICATION_CREDIT_RATIO': ['min', 'mean'],
    'NAME_CONTRACT_TYPE_Consumer loans': ['mean'],
    'NAME_CONTRACT_TYPE_Cash loans': ['mean'],
    'NAME_CONTRACT_TYPE_Revolving loans': ['mean'],
}

ohe_columns = [
    'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'CHANNEL_TYPE',
    'NAME_TYPE_SUITE', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
    'NAME_PRODUCT_TYPE', 'NAME_CLIENT_TYPE']
prev, categorical_cols = one_hot_encoder(prev, ohe_columns, nan_as_category=False)

prev['APPLICATION_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['APPLICATION_CREDIT_RATIO'] = prev['AMT_APPLICATION'] / prev['AMT_CREDIT']
prev['CREDIT_TO_ANNUITY_RATIO'] = prev['AMT_CREDIT'] / prev['AMT_ANNUITY']
prev['DOWN_PAYMENT_TO_CREDIT'] = prev['AMT_DOWN_PAYMENT'] / prev['AMT_CREDIT']
total_payment = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']
prev['SIMPLE_INTERESTS'] = (total_payment / prev['AMT_CREDIT'] - 1) / prev['CNT_PAYMENT']
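SIMPLE_INTERESTS approximates a per-instalment simple interest rate: the total amount repaid divided by the credit amount, minus one, spread evenly over the number of instalments. A tiny worked example with made-up numbers (not taken from the dataset):

# Hypothetical previous loan: 24 instalments of 5,000 on a credit of 100,000
amt_annuity, cnt_payment, amt_credit = 5000.0, 24, 100000.0
total_payment = amt_annuity * cnt_payment          # 120,000 repaid in total
simple_interest = (total_payment / amt_credit - 1) / cnt_payment
print(simple_interest)                             # 0.2 / 24, roughly 0.0083 per instalment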
approved = prev[prev['NAME_CONTRACT_STATUS_Approved'] == 1]
# DAYS_LAST_DUE == 365243 is the dataset's placeholder value, used here to select still-active previous loans
active_df = approved[approved['DAYS_LAST_DUE'] == 365243]
active_pay = pay[pay['SK_ID_PREV'].isin(active_df['SK_ID_PREV'])]
active_pay_agg = active_pay.groupby('SK_ID_PREV')[['AMT_INSTALMENT', 'AMT_PAYMENT']].sum()
active_pay_agg.reset_index(inplace=True)
active_pay_agg['INSTALMENT_PAYMENT_DIFF'] = active_pay_agg['AMT_INSTALMENT'] - active_pay_agg['AMT_PAYMENT']
active_df = active_df.merge(active_pay_agg, on='SK_ID_PREV', how='left')
active_df['REMAINING_DEBT'] = active_df['AMT_CREDIT'] - active_df['AMT_PAYMENT']
active_df['REPAYMENT_RATIO'] = active_df['AMT_PAYMENT'] / active_df['AMT_CREDIT']
active_agg_df = group(active_df, 'PREV_ACTIVE_', PREVIOUS_ACTIVE_AGG)
active_agg_df['TOTAL_REPAYMENT_RATIO'] = active_agg_df['PREV_ACTIVE_AMT_PAYMENT_SUM'] / \
    active_agg_df['PREV_ACTIVE_AMT_CREDIT_SUM']
del active_pay, active_pay_agg, active_df; gc.collect()
prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace=True)
prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace=True)
prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace=True)
prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace=True)
prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace=True)

prev['DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']
approved['DAYS_LAST_DUE_DIFF'] = approved['DAYS_LAST_DUE_1ST_VERSION'] - approved['DAYS_LAST_DUE']

categorical_agg = {key: ['mean'] for key in categorical_cols}

agg_prev = group(prev, 'PREV_', {**PREVIOUS_AGG, **categorical_agg})
agg_prev = agg_prev.merge(active_agg_df, how='left', on='SK_ID_CURR')
del active_agg_df; gc.collect()
agg_prev = group_and_merge(approved, agg_prev, 'APPROVED_', PREVIOUS_APPROVED_AGG)
refused = prev[prev['NAME_CONTRACT_STATUS_Refused'] == 1]
agg_prev = group_and_merge(refused, agg_prev, 'REFUSED_', PREVIOUS_REFUSED_AGG)
del approved, refused; gc.collect()
for loan_type in ['Consumer loans', 'Cash loans']:
    type_df = prev[prev['NAME_CONTRACT_TYPE_{}'.format(loan_type)] == 1]
    prefix = 'PREV_' + loan_type.split(" ")[0] + '_'
    agg_prev = group_and_merge(type_df, agg_prev, prefix, PREVIOUS_LOAN_TYPE_AGG)
    del type_df; gc.collect()

pay['LATE_PAYMENT'] = pay['DAYS_ENTRY_PAYMENT'] - pay['DAYS_INSTALMENT']
pay['LATE_PAYMENT'] = pay['LATE_PAYMENT'].apply(lambda x: 1 if x > 0 else 0)
dpd_id = pay[pay['LATE_PAYMENT'] > 0]['SK_ID_PREV'].unique()
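The row-wise apply above can also be written as a vectorised comparison, which is faster on the large installments table and gives the same 0/1 result (rows with a missing payment date still become 0):

# Equivalent vectorised flag: 1 when the payment arrived after the due date
pay['LATE_PAYMENT'] = (pay['DAYS_ENTRY_PAYMENT'] > pay['DAYS_INSTALMENT']).astype(int)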
# Aggregations over previous loans that had at least one late payment; note the result
# is stored in agg_dpd and deleted immediately, so these PREV_LATE_ features are not
# merged into agg_prev
agg_dpd = group_and_merge(prev[prev['SK_ID_PREV'].isin(dpd_id)], agg_prev,
                          'PREV_LATE_', PREVIOUS_LATE_PAYMENTS_AGG)
del agg_dpd, dpd_id; gc.collect()
for time_frame in [12, 24]:
    time_frame_df = prev[prev['DAYS_DECISION'] >= -30 * time_frame]
    prefix = 'PREV_LAST{}M_'.format(time_frame)
    agg_prev = group_and_merge(time_frame_df, agg_prev, prefix, PREVIOUS_TIME_AGG)
    del time_frame_df; gc.collect()
del prev; gc.collect()
df = pd.merge(df, agg_prev, on='SK_ID_CURR', how='left')

train = df[df['TARGET'].notnull()]
test = df[df['TARGET'].isnull()]
del df
gc.collect()
labels = train['TARGET']
test_labels = test['TARGET']
train = train.drop(columns=['TARGET'])
test = test.drop(columns=['TARGET'])

feature = list(train.columns)
train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df = test.copy()
train_df = train.copy()
train_df['TARGET'] = labels
test_df['TARGET'] = test_labels
# Fit the median imputer and the min-max scaler on the training set only, then apply them to both sets
imputer = SimpleImputer(strategy='median')
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(train, labels)
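Before writing a submission file, it can be useful to estimate the model's AUC offline. The sketch below reuses the StratifiedKFold and roc_auc_score imports from earlier; the fold count, shuffling and seed are arbitrary choices added here, not settings from the original notebook.

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # hypothetical CV setup
oof = np.zeros(len(train))
for train_idx, valid_idx in folds.split(train, labels):
    clf = LGBMClassifier()  # same default model as above
    clf.fit(train[train_idx], labels.iloc[train_idx])
    oof[valid_idx] = clf.predict_proba(train[valid_idx])[:, 1]
print('Out-of-fold AUC: {:.5f}'.format(roc_auc_score(labels, oof)))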
lgbm_pred = lgbmc.predict_proba(test)[:, 1]
submit = test_df[['SK_ID_CURR']].copy()
submit['TARGET'] = lgbm_pred
submit.to_csv('lgbm.csv', index=False)
Summary
Submission result: the final public leaderboard score was 0.78277. (Submitting to Kaggle requires a proxy from mainland China; the original screenshot is not reproduced.)