Data Mining: A Credit Card Prediction Problem

Problem Background and Objective

Background:

GAMMA Bank is a private bank that offers a range of banking products such as savings accounts, current accounts, investment products, and credit products. The bank also cross-sells products to its existing customers, reaching them through different communication channels such as TV and radio ads, email, online-banking recommendations, and mobile banking. In this scenario, GAMMA Bank wants to cross-sell its credit cards to existing customers. It has already identified a set of customers who are eligible for these cards, and it is now looking for help in identifying the customers likely to show the strongest interest in the recommended credit card.

Objective:

Using the customer attribute data collected by the bank, predict whether a customer is interested in the credit card currently being promoted.

Data

Training set

Test set

Problem Description

Analysis and Implementation

Feature Engineering

The dataset contains the fields ID, Gender, Age, Region_Code, Occupation, Channel_Code, Vintage, Credit_Product, Avg_Account_Balance, Is_Active, and Is_Lead (the target).

Clearly, the ID field cannot bear any relationship to the target, so it is dropped outright. The region field is just "RG" followed by digits, so stripping the "RG" prefix turns Region_Code into a numeric feature.

# Drop the uninformative ID column
train = train.drop(["ID"], axis=1)
# Strip the "RG" prefix and convert Region_Code to an integer feature
train['Region_Code'] = train['Region_Code'].apply(lambda x: x[2:]).astype('int64')

Next, let's visualize the fields that might be related to the target.
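
The original plots are not reproduced here. As a minimal sketch (assuming seaborn is available and train is still the raw DataFrame at this point), bar plots of the Is_Lead rate per category could be produced like this:

import matplotlib.pyplot as plt
import seaborn as sns

# For each categorical field, plot the mean of Is_Lead (i.e., the lead rate) per category
for col in ["Gender", "Occupation", "Channel_Code", "Is_Active", "Credit_Product"]:
    sns.barplot(x=col, y="Is_Lead", data=train)
    plt.title(f"Is_Lead rate by {col}")
    plt.show()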

These two fields could probably be processed in some way as well, but I don't know how (´;ω;)

As the plots show, Is_Lead needs to be dealt with: samples with Is_Lead = 1 are far too scarce, so the majority class has to be undersampled.

from sklearn.utils import resample
# Undersample the majority class
df_majority = train[train['Is_Lead'] == 0]
df_minority = train[train['Is_Lead'] == 1]
df_majority_undersampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=0)
# Combine the minority class with the undersampled majority class
df_undersampled = pd.concat([df_minority, df_majority_undersampled])
df_undersampled['Is_Lead'].value_counts()
train = df_undersampled

First, convert the string-valued fields into numeric features.

# Encode Gender, Credit_Product, and Is_Active as 0/1
train["Gender"] = train["Gender"].replace({"Male": 1, "Female": 0}).astype("int32")
train["Is_Active"] = train["Is_Active"].replace({"Yes": 1, "No": 0}).astype("int32")
# Credit_Product contains NaN, so no astype here
train["Credit_Product"] = train["Credit_Product"].replace({"Yes": 1, "No": 0})
# One-hot encode Occupation, keeping Is_Lead as the last column
encoded_data = pd.get_dummies(train["Occupation"], prefix="Occupation").replace({True: 1, False: 0})
train = pd.concat([train.iloc[:, :-1], encoded_data, train.iloc[:, -1]], axis=1)
train = train.drop(["Occupation"], axis=1)
# One-hot encode Channel_Code
train = pd.get_dummies(train, columns=['Channel_Code'], prefix=['X']).replace({True: 1, False: 0})
# get_dummies appended four dummy columns after Is_Lead, so Is_Lead is now the fifth column from the end
column_to_move = train.columns[-5]
# Save that column
column_data = train[column_to_move]
# Drop it from its current position
train = train.drop(column_to_move, axis=1)
# Re-append it so Is_Lead is the last column again
train[column_to_move] = column_data

After that, I wanted to look at how each feature correlates with the likelihood of becoming a credit card lead.

df_corr_lead = train.corr()['Is_Lead'].sort_values(ascending=False)
df_corr_lead = df_corr_lead.drop(['Is_Lead'], axis=0)

It turns out that Channel_Code, Vintage, and Credit_Product are the features most closely related to Is_Lead.

Now let's check whether there are any missing values.

# Check each column for missing values
pd.isnull(train).any()

Result:

Gender                      False
Age                         False
Region_Code                 False
Vintage                     False
Credit_Product               True
Avg_Account_Balance         False
Is_Active                   False
Occupation_Entrepreneur     False
Occupation_Other            False
Occupation_Salaried         False
Occupation_Self_Employed    False
X_X1                        False
X_X2                        False
X_X3                        False
X_X4                        False
Is_Lead                     False
dtype: bool

Oops, Credit_Product actually has missing values, and since it is so closely tied to Is_Lead, those values definitely need to be predicted.

Handling Credit_Product

First, split the data into the rows where Credit_Product is missing and the rows where it is present.

# Rows where Credit_Product is missing (to be filled in later)
missing_credit_product = train[train["Credit_Product"].isnull()]
# Rows where Credit_Product is present (used for training)
non_missing_credit_product = train[train["Credit_Product"].notnull()]

Then check how each feature correlates with Credit_Product (a sketch of this check follows below).

None of the correlations stand out in particular, so just train on all of the features.
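
The correlation check itself isn't shown in the original; it presumably mirrors the Is_Lead check above, roughly:

# Sketch of the correlation check for Credit_Product (df_corr_cp is a hypothetical name)
df_corr_cp = non_missing_credit_product.corr()['Credit_Product'].sort_values(ascending=False)
df_corr_cp = df_corr_cp.drop(['Credit_Product'], axis=0)
print(df_corr_cp)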

PS: I also tried undersampling the majority class of the Credit_Product field, but the AUC actually dropped; presumably there were too few samples left after undersampling.
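
For reference, a sketch of that discarded experiment, mirroring the Is_Lead undersampling above (variable names hypothetical):

from sklearn.utils import resample
# Undersample whichever Credit_Product class is larger (this was ultimately not used)
counts = non_missing_credit_product["Credit_Product"].value_counts()
cp_majority = non_missing_credit_product[non_missing_credit_product["Credit_Product"] == counts.idxmax()]
cp_minority = non_missing_credit_product[non_missing_credit_product["Credit_Product"] == counts.idxmin()]
cp_majority_undersampled = resample(cp_majority, replace=False, n_samples=len(cp_minority), random_state=0)
cp_balanced = pd.concat([cp_minority, cp_majority_undersampled])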

# Split the dataset
from sklearn.model_selection import train_test_split
data = non_missing_credit_product.drop(["Is_Lead", "Credit_Product"], axis=1)
x_train, x_test, y_train, y_test = train_test_split(data, non_missing_credit_product["Credit_Product"], test_size=0.2, random_state=22)
# Standardize the training and test sets
from sklearn.preprocessing import StandardScaler
transfer = StandardScaler()
x_train_new = transfer.fit_transform(x_train)
x_test_new = transfer.transform(x_test)

I tested KNN, logistic regression, random forest, and LightGBM as models; LightGBM worked best.
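
The comparison code isn't shown in the original; a minimal sketch of how it might have looked, with default settings assumed:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Fit each candidate model and compare test-set AUC (hyperparameters assumed)
candidates = {
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=40),
}
for name, model in candidates.items():
    model.fit(x_train_new, y_train)
    prob = model.predict_proba(x_test_new)[:, 1]
    print(name, roc_auc_score(y_test, prob))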

import lightgbm as lgb
from sklearn.metrics import roc_auc_score
# Create the LightGBM classifier
lgb_classifier = lgb.LGBMClassifier(n_estimators=80, random_state=40)
# Train it
lgb_classifier.fit(x_train_new, y_train)
# Predict on the test set: probability that each sample belongs to the positive class
y_pred_prob = lgb_classifier.predict_proba(x_test_new)[:, 1]
# Compute the AUC
auc_score = roc_auc_score(y_test, y_pred_prob)
print("AUC Score:", auc_score)

AUC Score: 0.7671483718923551

After filling in the missing Credit_Product values with these predictions, we can move on to predicting Is_Lead.
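
The fill step itself isn't shown in the original; a plausible sketch, reusing the scaler and classifier trained above:

# Predict the missing Credit_Product entries with the model trained above (sketch)
x_missing = missing_credit_product.drop(["Is_Lead", "Credit_Product"], axis=1)
x_missing_new = transfer.transform(x_missing)
missing_credit_product = missing_credit_product.copy()
missing_credit_product["Credit_Product"] = lgb_classifier.predict(x_missing_new)
# Reassemble the full training set with the imputed values
train = pd.concat([non_missing_credit_product, missing_credit_product])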

Predicting Is_Lead

# Now predict the actual target
data = train.drop(["Is_Lead"], axis=1)
x_train, x_test, y_train, y_test = train_test_split(data, train["Is_Lead"], test_size=0.25, random_state=42)
# Standardize the training and test sets
transfer = StandardScaler()
x_train_new = transfer.fit_transform(x_train)
x_test_new = transfer.transform(x_test)

The model is again LightGBM, but this time tuned with Bayesian optimization.

from bayes_opt import BayesianOptimization

# Define the function to optimize (AUC score) using Bayesian optimization
def lgb_cv(n_estimators, learning_rate, max_depth, num_leaves, min_child_samples, subsample, colsample_bytree):
    # Convert integer hyperparameters to integers
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    num_leaves = int(num_leaves)
    min_child_samples = int(min_child_samples)
    # Create a LightGBM classifier with the given hyperparameters
    lgb_classifier = lgb.LGBMClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        num_leaves=num_leaves,
        min_child_samples=min_child_samples,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        random_state=40
    )
    # Train the classifier on the training data
    lgb_classifier.fit(x_train_new, y_train)
    # Make predictions on the test data
    y_pred_prob = lgb_classifier.predict_proba(x_test_new)[:, 1]
    # Calculate the AUC score
    auc_score = roc_auc_score(y_test, y_pred_prob)
    return auc_score

# Define the hyperparameter ranges for Bayesian optimization
pbounds = {
    'n_estimators': (150, 220),
    'learning_rate': (0.02, 0.09),
    'max_depth': (10, 15),
    'num_leaves': (20, 100),
    'min_child_samples': (5, 50),
    'subsample': (0.5, 0.7),
    'colsample_bytree': (0.6, 1.0)
}

# Create the BayesianOptimization object with the function to optimize and the hyperparameter bounds
optimizer = BayesianOptimization(
    f=lgb_cv,
    pbounds=pbounds,
    random_state=42
)

# Perform Bayesian optimization
optimizer.maximize(init_points=10, n_iter=100)

# Get the optimal hyperparameters and corresponding AUC score
best_params = optimizer.max['params']
best_auc = optimizer.max['target']
print("Optimal Hyperparameters:")
print(best_params)
print("Best AUC Score:", best_auc)

Optimal hyperparameters:

'colsample_bytree': 0.610865775300284
'learning_rate': 0.0805284415086373
'max_depth': 13.707039129713081
'min_child_samples': 7.262616569392997
'n_estimators': 215.9489533887009
'num_leaves': 87.00033587605007
'subsample': 0.620344288616325

Best AUC Score: 0.8755665739616039
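
Note that the optimizer reports the integer-valued hyperparameters as floats; a final model would cast them back to int, just as inside lgb_cv. A sketch:

# Hypothetical final fit with the tuned hyperparameters
final_model = lgb.LGBMClassifier(
    n_estimators=int(best_params['n_estimators']),
    learning_rate=best_params['learning_rate'],
    max_depth=int(best_params['max_depth']),
    num_leaves=int(best_params['num_leaves']),
    min_child_samples=int(best_params['min_child_samples']),
    subsample=best_params['subsample'],
    colsample_bytree=best_params['colsample_bytree'],
    random_state=40
)
final_model.fit(x_train_new, y_train)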

Postscript

My takeaway from this problem is that feature engineering matters a lot. At first I simply filled Credit_Product with its mean of 0.333333 and did nothing else, and the results were poor. After switching to predicting the missing values, even though that model's AUC was only about 0.76, the improvement in the final AUC was quite noticeable.
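
For reference, that discarded baseline was roughly a simple mean fill:

# Discarded baseline: fill missing Credit_Product with the column mean (about 0.333333)
train["Credit_Product"] = train["Credit_Product"].fillna(train["Credit_Product"].mean())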

Source code