Data Mining: A Credit Card Prediction Problem

Problem Background and Objective

Background:

GAMMA Bank is a private bank that offers a range of banking products such as savings accounts, current accounts, investment products, and credit products. The bank also cross-sells products to its existing customers, reaching them through different communication channels such as TV and radio ads, email, online-banking recommendations, and mobile banking. In this scenario, GAMMA Bank wants to cross-sell its credit cards to existing customers. It has already identified a set of customers who are eligible for these cards, and it is now looking for help in identifying the customers likely to show the strongest interest in the recommended credit card.

Objective:

Using the customer attribute data collected by the bank, predict whether a customer is interested in the credit card currently being promoted.

Data

Training set

Test set

Problem Description

Analysis and Implementation

Feature Engineering

The dataset contains the fields ID, Gender, Age, Region_Code, Occupation, Channel_Code, Vintage, Credit_Product, Avg_Account_Balance, Is_Active, and Is_Lead (the target).

Clearly, the ID field cannot bear any relationship to the target, so it is dropped outright. The region field is just "RG" followed by digits, so stripping the "RG" prefix turns Region_Code into a numeric feature.

# Drop the uninformative ID column
train = train.drop(["ID"], axis=1)
# Strip the "RG" prefix and convert Region_Code to an integer feature
train['Region_Code'] = train['Region_Code'].apply(lambda x: x[2:]).astype('int64')

Next, let's visualize the fields that might be related to the target.
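
The original plots are not reproduced here. As a minimal sketch (assuming seaborn is available and train is still the raw DataFrame at this point), bar plots of the Is_Lead rate per category could be produced like this:

import matplotlib.pyplot as plt
import seaborn as sns

# For each categorical field, plot the mean of Is_Lead (i.e., the lead rate) per category
for col in ["Gender", "Occupation", "Channel_Code", "Is_Active", "Credit_Product"]:
    sns.barplot(x=col, y="Is_Lead", data=train)
    plt.title(f"Is_Lead rate by {col}")
    plt.show()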

These two fields could probably be processed in some way as well, but I don't know how (´;ω;)

As the plots show, Is_Lead needs to be dealt with: samples with Is_Lead = 1 are far too scarce, so the majority class has to be undersampled.

from sklearn.utils import resample
# Undersample the majority class
df_majority = train[train['Is_Lead'] == 0]
df_minority = train[train['Is_Lead'] == 1]
df_majority_undersampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=0)
# Combine the minority class with the undersampled majority class
df_undersampled = pd.concat([df_minority, df_majority_undersampled])
df_undersampled['Is_Lead'].value_counts()
train = df_undersampled

First, convert the string-valued fields into numeric features.

# Encode Gender, Credit_Product, and Is_Active as 0/1
train["Gender"] = train["Gender"].replace({"Male": 1, "Female": 0}).astype("int32")
train["Is_Active"] = train["Is_Active"].replace({"Yes": 1, "No": 0}).astype("int32")
# Credit_Product contains NaN, so no astype here
train["Credit_Product"] = train["Credit_Product"].replace({"Yes": 1, "No": 0})
# One-hot encode Occupation, keeping Is_Lead as the last column
encoded_data = pd.get_dummies(train["Occupation"], prefix="Occupation").replace({True: 1, False: 0})
train = pd.concat([train.iloc[:, :-1], encoded_data, train.iloc[:, -1]], axis=1)
train = train.drop(["Occupation"], axis=1)
# One-hot encode Channel_Code
train = pd.get_dummies(train, columns=['Channel_Code'], prefix=['X']).replace({True: 1, False: 0})
# get_dummies appended four dummy columns after Is_Lead, so Is_Lead is now the fifth column from the end
column_to_move = train.columns[-5]
# Save that column
column_data = train[column_to_move]
# Drop it from its current position
train = train.drop(column_to_move, axis=1)
# Re-append it so Is_Lead is the last column again
train[column_to_move] = column_data

After that, I wanted to look at how each feature correlates with the likelihood of becoming a credit card lead.

df_corr_lead = train.corr()['Is_Lead'].sort_values(ascending=False)
df_corr_lead = df_corr_lead.drop(['Is_Lead'], axis=0)

It turns out that Channel_Code, Vintage, and Credit_Product are the features most closely related to Is_Lead.

Now let's check whether there are any missing values.

# Check each column for missing values
pd.isnull(train).any()

Result:

Gender                      False
Age                         False
Region_Code                 False
Vintage                     False
Credit_Product               True
Avg_Account_Balance         False
Is_Active                   False
Occupation_Entrepreneur     False
Occupation_Other            False
Occupation_Salaried         False
Occupation_Self_Employed    False
X_X1                        False
X_X2                        False
X_X3                        False
X_X4                        False
Is_Lead                     False
dtype: bool

Oops, Credit_Product actually has missing values, and since it is so closely tied to Is_Lead, those values definitely need to be predicted.

Handling Credit_Product

First, split the data into the rows where Credit_Product is missing and the rows where it is present.

# Rows where Credit_Product is missing (to be filled in later)
missing_credit_product = train[train["Credit_Product"].isnull()]
# Rows where Credit_Product is present (used for training)
non_missing_credit_product = train[train["Credit_Product"].notnull()]

Then check how each feature correlates with Credit_Product (a sketch of this check follows below).

None of the correlations stand out in particular, so just train on all of the features.
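
The correlation check itself isn't shown in the original; it presumably mirrors the Is_Lead check above, roughly:

# Sketch of the correlation check for Credit_Product (df_corr_cp is a hypothetical name)
df_corr_cp = non_missing_credit_product.corr()['Credit_Product'].sort_values(ascending=False)
df_corr_cp = df_corr_cp.drop(['Credit_Product'], axis=0)
print(df_corr_cp)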

PS: I also tried undersampling the majority class of the Credit_Product field, but the AUC actually dropped; presumably there were too few samples left after undersampling.
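
For reference, a sketch of that discarded experiment, mirroring the Is_Lead undersampling above (variable names hypothetical):

from sklearn.utils import resample
# Undersample whichever Credit_Product class is larger (this was ultimately not used)
counts = non_missing_credit_product["Credit_Product"].value_counts()
cp_majority = non_missing_credit_product[non_missing_credit_product["Credit_Product"] == counts.idxmax()]
cp_minority = non_missing_credit_product[non_missing_credit_product["Credit_Product"] == counts.idxmin()]
cp_majority_undersampled = resample(cp_majority, replace=False, n_samples=len(cp_minority), random_state=0)
cp_balanced = pd.concat([cp_minority, cp_majority_undersampled])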

# Split the dataset
from sklearn.model_selection import train_test_split
data = non_missing_credit_product.drop(["Is_Lead", "Credit_Product"], axis=1)
x_train, x_test, y_train, y_test = train_test_split(data, non_missing_credit_product["Credit_Product"], test_size=0.2, random_state=22)
# Standardize the training and test sets
from sklearn.preprocessing import StandardScaler
transfer = StandardScaler()
x_train_new = transfer.fit_transform(x_train)
x_test_new = transfer.transform(x_test)

I tested KNN, logistic regression, random forest, and LightGBM as models; LightGBM worked best.
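
The comparison code isn't shown in the original; a minimal sketch of how it might have looked, with default settings assumed:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Fit each candidate model and compare test-set AUC (hyperparameters assumed)
candidates = {
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=40),
}
for name, model in candidates.items():
    model.fit(x_train_new, y_train)
    prob = model.predict_proba(x_test_new)[:, 1]
    print(name, roc_auc_score(y_test, prob))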

import lightgbm as lgb
from sklearn.metrics import roc_auc_score
# Create the LightGBM classifier
lgb_classifier = lgb.LGBMClassifier(n_estimators=80, random_state=40)
# Train it
lgb_classifier.fit(x_train_new, y_train)
# Predict on the test set: probability that each sample belongs to the positive class
y_pred_prob = lgb_classifier.predict_proba(x_test_new)[:, 1]
# Compute the AUC
auc_score = roc_auc_score(y_test, y_pred_prob)
print("AUC Score:", auc_score)

AUC Score: 0.7671483718923551

After filling in the missing Credit_Product values with these predictions, we can move on to predicting Is_Lead.
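
The fill step itself isn't shown in the original; a plausible sketch, reusing the scaler and classifier trained above:

# Predict the missing Credit_Product entries with the model trained above (sketch)
x_missing = missing_credit_product.drop(["Is_Lead", "Credit_Product"], axis=1)
x_missing_new = transfer.transform(x_missing)
missing_credit_product = missing_credit_product.copy()
missing_credit_product["Credit_Product"] = lgb_classifier.predict(x_missing_new)
# Reassemble the full training set with the imputed values
train = pd.concat([non_missing_credit_product, missing_credit_product])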

Predicting Is_Lead

# Now predict the actual target
data = train.drop(["Is_Lead"], axis=1)
x_train, x_test, y_train, y_test = train_test_split(data, train["Is_Lead"], test_size=0.25, random_state=42)
# Standardize the training and test sets
transfer = StandardScaler()
x_train_new = transfer.fit_transform(x_train)
x_test_new = transfer.transform(x_test)

The model is again LightGBM, but this time tuned with Bayesian optimization.

from bayes_opt import BayesianOptimization

# Define the function to optimize (AUC score) using Bayesian optimization
def lgb_cv(n_estimators, learning_rate, max_depth, num_leaves, min_child_samples, subsample, colsample_bytree):
    # Convert integer hyperparameters to integers
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    num_leaves = int(num_leaves)
    min_child_samples = int(min_child_samples)
    # Create a LightGBM classifier with the given hyperparameters
    lgb_classifier = lgb.LGBMClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        num_leaves=num_leaves,
        min_child_samples=min_child_samples,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        random_state=40
    )
    # Train the classifier on the training data
    lgb_classifier.fit(x_train_new, y_train)
    # Make predictions on the test data
    y_pred_prob = lgb_classifier.predict_proba(x_test_new)[:, 1]
    # Calculate the AUC score
    auc_score = roc_auc_score(y_test, y_pred_prob)
    return auc_score

# Define the hyperparameter ranges for Bayesian optimization
pbounds = {
    'n_estimators': (150, 220),
    'learning_rate': (0.02, 0.09),
    'max_depth': (10, 15),
    'num_leaves': (20, 100),
    'min_child_samples': (5, 50),
    'subsample': (0.5, 0.7),
    'colsample_bytree': (0.6, 1.0)
}

# Create the BayesianOptimization object with the function to optimize and the hyperparameter bounds
optimizer = BayesianOptimization(
    f=lgb_cv,
    pbounds=pbounds,
    random_state=42
)

# Perform Bayesian optimization
optimizer.maximize(init_points=10, n_iter=100)

# Get the optimal hyperparameters and corresponding AUC score
best_params = optimizer.max['params']
best_auc = optimizer.max['target']
print("Optimal Hyperparameters:")
print(best_params)
print("Best AUC Score:", best_auc)

Optimal hyperparameters:

'colsample_bytree': 0.610865775300284
'learning_rate': 0.0805284415086373
'max_depth': 13.707039129713081
'min_child_samples': 7.262616569392997
'n_estimators': 215.9489533887009
'num_leaves': 87.00033587605007
'subsample': 0.620344288616325

Best AUC Score: 0.8755665739616039
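
Note that the optimizer reports the integer-valued hyperparameters as floats; a final model would cast them back to int, just as inside lgb_cv. A sketch:

# Hypothetical final fit with the tuned hyperparameters
final_model = lgb.LGBMClassifier(
    n_estimators=int(best_params['n_estimators']),
    learning_rate=best_params['learning_rate'],
    max_depth=int(best_params['max_depth']),
    num_leaves=int(best_params['num_leaves']),
    min_child_samples=int(best_params['min_child_samples']),
    subsample=best_params['subsample'],
    colsample_bytree=best_params['colsample_bytree'],
    random_state=40
)
final_model.fit(x_train_new, y_train)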

Postscript

My takeaway from this problem is that feature engineering matters a lot. At first I simply filled Credit_Product with its mean of 0.333333 and did nothing else, and the results were poor. After switching to predicting the missing values, even though that model's AUC was only about 0.76, the improvement in the final AUC was quite noticeable.
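
For reference, that discarded baseline was roughly a simple mean fill:

# Discarded baseline: fill missing Credit_Product with the column mean (about 0.333333)
train["Credit_Product"] = train["Credit_Product"].fillna(train["Credit_Product"].mean())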

Source code