First, a brief overview of each model.
Logistic regression
It is a linear model suited to binary classification: the linear score is passed through the sigmoid function, which maps it to a probability in (0, 1), and thresholding that probability gives the 0/1 prediction. Its advantages are numerous. Training and prediction are fast, since the computation scales only with the number of features, and the memory footprint is small, since only one weight per feature dimension needs to be stored. It also has clear drawbacks: missing values and outliers must be handled in advance, because the model cannot deal with missing values itself.
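As a quick illustration of the points above, here is a minimal sketch (not from the original notebook, using synthetic data) of a binary classifier built with scikit-learn's LogisticRegression: the linear score goes through the sigmoid, and thresholding the resulting probability gives the 0/1 label.

```python
# Minimal sketch on synthetic data (illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)    # linear model with a sigmoid link
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]      # sigmoid output: probability in (0, 1)
pred = (proba >= 0.5).astype(int)          # threshold at 0.5 to get the 0/1 label
print('accuracy:', (pred == y_te).mean())
```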
Decision tree
Its biggest advantage is that it is very intuitive once visualized: you can see exactly which features drive each split. The data also needs little preprocessing: no normalization and no imputation of missing values are required. Decision trees come in two flavors, regression trees and classification trees. The main drawback is equally clear: they overfit very easily, which is why many pruning algorithms exist, broadly divided into pre-pruning and post-pruning. And because tree construction is greedy, it tends to end up in a local optimum; techniques such as simulated annealing can be used to escape local optima.
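The pre-/post-pruning distinction can be made concrete with scikit-learn. The following sketch (not from the original notebook, synthetic data) limits growth up front with max_depth/min_samples_leaf and, alternatively, prunes a fully grown tree with cost-complexity pruning (ccp_alpha).

```python
# Minimal sketch on synthetic data (illustration only): two ways to fight overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Pre-pruning: stop the tree from growing too deep in the first place.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0)
# Post-pruning: grow fully, then apply cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.001, random_state=0)

for name, tree in [('pre-pruned', pre_pruned), ('post-pruned', post_pruned)]:
    tree.fit(X_tr, y_tr)
    print(name, 'test accuracy:', tree.score(X_te, y_te))
```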
Ensemble Model
An ensemble completes the learning task by combining multiple learners. By combining several weak learners into one strong classifier, ensemble methods generally achieve better generalization than any single classifier. The two main families are Bagging and Boosting. Both combine existing classification or regression algorithms into a stronger classifier; they differ only in how the base learners are combined, and therefore in the final effect. A common Bagging-based model is the random forest; Boosting-based models include AdaBoost, GBDT, XGBoost, and LightGBM.
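To make the Bagging/Boosting contrast concrete, here is a small sketch (illustration only, synthetic data): a random forest as a Bagging-style ensemble and AdaBoost as a Boosting-style ensemble, both scored with cross-validated AUC.

```python
# Minimal sketch on synthetic data (illustration only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

bagging_model = RandomForestClassifier(n_estimators=100, random_state=0)  # Bagging idea
boosting_model = AdaBoostClassifier(n_estimators=100, random_state=0)     # Boosting idea

for name, model in [('RandomForest', bagging_model), ('AdaBoost', boosting_model)]:
    auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
    print('{} CV AUC: {:.4f}'.format(name, auc))
```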
Model evaluation metric
AUC is used as the evaluation metric here. So what is ROC? It builds on the confusion matrix, precision, and recall from binary classification; Chapter 2 of the "watermelon book" (Zhou Zhihua's Machine Learning) covers these in detail, so familiarity with them is assumed here. Now for AUC: in logistic regression, positives and negatives are separated by a threshold on the predicted probability, with samples above the threshold labeled positive and those below labeled negative. Lowering the threshold lets more samples be classified as positive, which raises the recognition rate for positives but also causes more negatives to be misclassified as positive. The ROC curve visualizes this trade-off: each classification threshold yields a point in ROC space, and connecting these points forms the ROC curve, with the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis. In general the curve should lie above the diagonal from (0,0) to (1,1). Four notable points on the ROC curve:
- (0,1): FPR = 0 and TPR = 1, which means FP = 0 and FN = 0: every sample is classified correctly;
- (1,0): FPR = 1 and TPR = 0, the worst possible classifier, which gets every prediction wrong;
- (0,0): FPR = TPR = 0, so FP = TP = 0: the classifier predicts every instance as negative;
- (1,1): the classifier predicts every instance as positive.

In short: the closer the ROC curve is to the top-left corner, the better the classifier performs and the better it generalizes. In addition, a reasonably smooth ROC curve is usually a sign that there is no serious overfitting.
But given two models, how do we decide which one generalizes better? There are two main criteria:

- If model A's ROC curve completely encloses model B's, model A is considered better than model B;
- If the two curves cross, we compare the area enclosed by each ROC curve and the axes: the larger the area, the better the model. This area is called the AUC (Area Under the ROC Curve).
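Before moving to the real data, here is a toy sketch (illustration only, made-up labels and scores) of how the FPR/TPR points and the AUC are computed with scikit-learn:

```python
# Toy example (illustration only): ROC points and AUC from a few scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # one (FPR, TPR) point per threshold
print('AUC =', roc_auc_score(y_true, y_score))
```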
import pandas as pd
import numpy as np
import warnings
import os
import seaborn as sns
import matplotlib.pyplot as plt
"""
sns 相关设置
@return:
"""
# 声明使用 Seaborn 样式
sns.set()
# 有五种seaborn的绘图风格,它们分别是:darkgrid, whitegrid, dark, white, ticks。默认的主题是darkgrid。
sns.set_style("whitegrid")
# 有四个预置的环境,按大小从小到大排列分别为:paper, notebook, talk, poster。其中,notebook是默认的。
sns.set_context('talk')
# 中文字体设置-黑体
plt.rcParams['-serif'] = ['SimHei']
# 解决保存图像是负号'-'显示为方块的问题
plt.rcParams['axes.unicode_minus'] = False
# 解决Seaborn中文显示问题并调整字体大小
sns.set(font='SimHei')
# Data compression: down-cast numeric columns to reduce memory usage
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum()
    print('Memory usage of dataframe is {:.2f} bytes'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} bytes'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
data = pd.read_csv('data_for_model01.csv')
data = reduce_mem_usage(data)
Memory usage of dataframe is 793236320.00 bytes
Memory usage after optimization is: 181245298.00 bytes
Decreased by 77.2%
data.head()
|   | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | annualIncome | ... | grade_to_std_n11 | grade_to_mean_n12 | grade_to_std_n12 | grade_to_mean_n13 | grade_to_std_n13 | grade_to_mean_n14 | grade_to_std_n14 | sample | n2.2 | n2.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35008.0 | 5 | 19.515625 | 918.0000 | 5 | 21 | 161280 | 2.0 | 2 | 110000.0 | ... | 4.011719 | 1.852539 | 4.011719 | 1.857422 | 4.003906 | 1.856445 | 3.992188 | train | NaN | NaN |
| 1 | 18000.0 | 5 | 18.484375 | 462.0000 | 4 | 16 | 89538 | 5.0 | 0 | 46000.0 | ... | 3.207031 | 1.482422 | 3.207031 | 1.486328 | 3.205078 | 1.485352 | 3.193359 | train | NaN | NaN |
| 2 | 12000.0 | 5 | 16.984375 | 298.2500 | 4 | 17 | 159367 | 8.0 | 0 | 74000.0 | ... | 3.207031 | 1.482422 | 3.207031 | 1.486328 | 3.205078 | 1.315430 | 3.146484 | train | NaN | NaN |
| 3 | 2050.0 | 3 | 7.691406 | 63.9375 | 1 | 3 | 59830 | 9.0 | 0 | 35000.0 | ... | 0.801758 | 0.370605 | 0.801758 | 0.371582 | 0.801270 | 0.344238 | 0.793457 | train | NaN | NaN |
| 4 | 11504.0 | 3 | 14.976562 | 398.5000 | 3 | 12 | 85242 | 1.0 | 1 | 30000.0 | ... | 2.406250 | 1.111328 | 2.406250 | 1.114258 | 2.402344 | 1.114258 | 2.394531 | train | NaN | NaN |

5 rows × 122 columns
from sklearn.model_selection import KFold
# Separate the dataset into train and test parts for cross-validation
X_train = data.loc[data['sample']=='train', :].drop(['isDefault', 'sample'], axis=1)
X_test = data.loc[data['sample']=='test', :].drop(['isDefault', 'sample'], axis=1)
y_train = data.loc[data['sample']=='train', 'isDefault']
# 5-fold cross-validation
folds = 5
seed = 2020
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
"""对训练集数据进行划分,分成训练集和验证集,并进行相应的操作"""
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# 数据集划分
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'learning_rate': 0.1,
    'metric': 'auc',
    'min_child_weight': 1e-3,
    'num_leaves': 31,
    'max_depth': -1,
    'reg_lambda': 0,
    'reg_alpha': 0,
    'feature_fraction': 1,
    'bagging_fraction': 1,
    'bagging_freq': 0,
    'seed': 2020,
    'nthread': 8,
    'verbose': -1,
}
"""使用训练集数据进行模型训练"""
model = lgb.train(params, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=20000, verbose_eval=1000, early_stopping_rounds=200)
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[330] valid_0's auc: 0.731887
from sklearn import metrics
from sklearn.metrics import roc_auc_score
"""预测并计算roc的相关指标"""
val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('AUC of the untuned LightGBM model on the validation set: {}'.format(roc_auc))
"""画出roc曲线图"""
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label = 'Val AUC = %0.4f' % roc_auc)
plt.ylim(0,1)
plt.xlim(0,1)
plt.legend(loc='best')
plt.title('ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Plot the diagonal
plt.plot([0,1],[0,1],'r--')
plt.show()
AUC of the untuned LightGBM model on the validation set: 0.7318871300593701
[Figure: validation ROC curve of the untuned LightGBM model (output_7_1.png)]
import lightgbm as lgb
"""LightGBM modeling and prediction with 5-fold cross-validation"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'learning_rate': 0.1,
        'metric': 'auc',
        'min_child_weight': 1e-3,
        'num_leaves': 31,
        'max_depth': -1,
        'reg_lambda': 0,
        'reg_alpha': 0,
        'feature_fraction': 1,
        'bagging_fraction': 1,
        'bagging_freq': 0,
        'seed': 2020,
        'nthread': 8,
        'verbose': -1,
    }
    model = lgb.train(params, train_set=train_matrix, num_boost_round=20000, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[308] valid_0's auc: 0.729253
[0.729252686605049]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[337] valid_0's auc: 0.730723
[0.729252686605049, 0.7307233610934907]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[527] valid_0's auc: 0.732105
[0.729252686605049, 0.7307233610934907, 0.7321048628412448]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[381] valid_0's auc: 0.727511
[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
Early stopping, best iteration is:
[462] valid_0's auc: 0.732217
[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779, 0.7322174754202134]
lgb_score_list:[0.729252686605049, 0.7307233610934907, 0.7321048628412448, 0.7275111359476779, 0.7322174754202134]
lgb_score_mean:0.7303619043815351
lgb_score_std:0.0017871174424543119
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0,
                       feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001,
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
    # 5-fold stratified cross-validation
    cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
    model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                   n_estimators=n_estimators,
                                   num_leaves=num_leaves,
                                   max_depth=max_depth,
                                   bagging_fraction=bagging_fraction,
                                   feature_fraction=feature_fraction,
                                   bagging_freq=bagging_freq,
                                   min_data_in_leaf=min_data_in_leaf,
                                   min_child_weight=min_child_weight,
                                   min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda,
                                   reg_alpha=reg_alpha,
                                   n_jobs=8
                                   )
    grid_search = GridSearchCV(estimator=model_lgb,
                               cv=cv_fold,
                               param_grid=param_grid,
                               scoring='roc_auc'
                               )
    grid_search.fit(X_train, y_train)
    print('Current best parameters: {}'.format(grid_search.best_params_))
    print('Current best AUC score: {}'.format(grid_search.best_score_))
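The notebook does not show an actual call to get_best_cv_params, so the following usage is a hypothetical sketch: the grid below (lgb_params_grid is a made-up name) would tune num_leaves and max_depth first, while the remaining arguments keep the defaults defined above.

```python
# Hypothetical usage sketch: tune num_leaves and max_depth first.
lgb_params_grid = {
    'num_leaves': range(10, 80, 5),
    'max_depth': range(3, 10, 2),
}
get_best_cv_params(learning_rate=0.1, n_estimators=581, param_grid=lgb_params_grid)
```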
# 5-fold stratified cross-validation
from sklearn.model_selection import KFold, StratifiedKFold
cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
# LightGBM Dataset over the full training set, used by lgb.cv
lgb_train = lgb.Dataset(X_train, label=y_train)
final_params = {
    'boosting_type': 'gbdt',
    'learning_rate': 0.01,
    'num_leaves': 29,
    'max_depth': 7,
    'min_data_in_leaf': 45,
    'min_child_weight': 0.001,
    'bagging_fraction': 0.9,
    'feature_fraction': 0.9,
    'bagging_freq': 40,
    'min_split_gain': 0,
    'reg_lambda': 0,
    'reg_alpha': 0,
    'nthread': 6
}
cv_result = lgb.cv(train_set=lgb_train,
                   early_stopping_rounds=20,
                   num_boost_round=5000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   seed=0,
                   )
print('Number of boosting rounds: {}'.format(len(cv_result['auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['auc-mean'])))
pip install bayesian-optimization
Successfully installed bayesian-optimization-1.2.0
Note: you may need to restart the kernel to use updated packages.
from sklearn.model_selection import cross_val_score

"""Define the objective function to optimize"""
def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq, min_data_in_leaf,
              min_child_weight, min_split_gain, reg_lambda, reg_alpha):
    # Build the model
    model_lgb = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc',
                                   learning_rate=0.1, n_estimators=5000,
                                   num_leaves=int(num_leaves), max_depth=int(max_depth),
                                   bagging_fraction=round(bagging_fraction, 2), feature_fraction=round(feature_fraction, 2),
                                   bagging_freq=int(bagging_freq), min_data_in_leaf=int(min_data_in_leaf),
                                   min_child_weight=min_child_weight, min_split_gain=min_split_gain,
                                   reg_lambda=reg_lambda, reg_alpha=reg_alpha,
                                   n_jobs=8
                                   )
    # Mean cross-validated AUC is the value the Bayesian optimizer maximizes
    val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring='roc_auc').mean()
    return val
from bayes_opt import BayesianOptimization

"""Define the parameter search space"""
bayes_lgb = BayesianOptimization(
    rf_cv_lgb,
    {
        'num_leaves': (10, 200),
        'max_depth': (3, 20),
        'bagging_fraction': (0.5, 1.0),
        'feature_fraction': (0.5, 1.0),
        'bagging_freq': (0, 100),
        'min_data_in_leaf': (10, 100),
        'min_child_weight': (0, 10),
        'min_split_gain': (0.0, 1.0),
        'reg_alpha': (0.0, 10),
        'reg_lambda': (0.0, 10),
    }
)

"""Run the optimization"""
bayes_lgb.maximize(n_iter=10)
|   iter   |  target  | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
-----------------------------------------------------------------------------------------------------------------------------------------------
|    1     |  0.7171  |  0.5841   |  45.89    |  0.9789   |  15.1     |  4.607    |  48.88    |  0.4838   |  16.29    |  1.699    |  1.449    |
[The Bayesian optimization run was interrupted manually (KeyboardInterrupt) after the first iteration.]
bayes_lgb.max
{'target': 0.7170845006643078,
'params': {'bagging_fraction': 0.5841220522935171,
'bagging_freq': 45.89371469870785,
'feature_fraction': 0.9788842825399383,
'max_depth': 15.098220845321368,
'min_child_weight': 4.606814369239687,
'min_data_in_leaf': 48.875222916404226,
'min_split_gain': 0.4837879568993534,
'num_leaves': 16.292948242912633,
'reg_alpha': 1.699317625022757,
'reg_lambda': 1.4494033099871717}}
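BayesianOptimization searches over continuous values, so the best parameters have to be cast back to the types LightGBM expects before they can be used. The base_params_lgb values below appear to come from a longer optimization run than the single interrupted iteration shown above; the conversion step itself might look like this sketch (the dict name tuned_params is made up):

```python
# Sketch: convert the optimizer's continuous values into usable LightGBM parameters.
best = bayes_lgb.max['params']
tuned_params = {
    'num_leaves': int(round(best['num_leaves'])),
    'max_depth': int(round(best['max_depth'])),
    'min_data_in_leaf': int(round(best['min_data_in_leaf'])),
    'bagging_freq': int(round(best['bagging_freq'])),
    'bagging_fraction': round(best['bagging_fraction'], 2),
    'feature_fraction': round(best['feature_fraction'], 2),
    'min_child_weight': round(best['min_child_weight'], 1),
    'min_split_gain': round(best['min_split_gain'], 1),
    'reg_lambda': int(round(best['reg_lambda'])),
    'reg_alpha': int(round(best['reg_alpha'])),
}
print(tuned_params)
```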
base_params_lgb = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 14,
    'max_depth': 19,
    'min_data_in_leaf': 37,
    'min_child_weight': 1.6,
    'bagging_fraction': 0.98,
    'feature_fraction': 0.69,
    'bagging_freq': 96,
    'reg_lambda': 9,
    'reg_alpha': 7,
    'min_split_gain': 0.4,
    'nthread': 8,
    'seed': 2020,
    'verbose': -1,
}
cv_result_lgb = lgb.cv(
    train_set=train_matrix,
    early_stopping_rounds=1000,
    num_boost_round=20000,
    nfold=5,
    stratified=True,
    shuffle=True,
    params=base_params_lgb,
    metrics='auc',
    seed=0
)
print('Number of boosting rounds: {}'.format(len(cv_result_lgb['auc-mean'])))
print('AUC of the final model: {}'.format(max(cv_result_lgb['auc-mean'])))
import lightgbm as lgb
"""LightGBM modeling and prediction with 5-fold cross-validation, using the tuned parameters"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.01,
        'num_leaves': 14,
        'max_depth': 19,
        'min_data_in_leaf': 37,
        'min_child_weight': 1.6,
        'bagging_fraction': 0.98,
        'feature_fraction': 0.69,
        'bagging_freq': 96,
        'reg_lambda': 9,
        'reg_alpha': 7,
        'min_split_gain': 0.4,
        'nthread': 8,
        'seed': 2020,
    }
    model = lgb.train(params, train_set=train_matrix, num_boost_round=14269, valid_sets=valid_matrix, verbose_eval=1000, early_stopping_rounds=200)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    cv_scores.append(roc_auc_score(y_val, val_pred))
    print(cv_scores)

print("lgb_score_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
base_params_lgb = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 14,
    'max_depth': 19,
    'min_data_in_leaf': 37,
    'min_child_weight': 1.6,
    'bagging_fraction': 0.98,
    'feature_fraction': 0.69,
    'bagging_freq': 96,
    'reg_lambda': 9,
    'reg_alpha': 7,
    'min_split_gain': 0.4,
    'nthread': 8,
    'seed': 2020,
}
"""使用训练集数据进行模型训练"""
final_model_lgb = lgb.train(base_params_lgb, train_set=train_matrix, valid_sets=valid_matrix, num_boost_round=13000, verbose_eval=1000, early_stopping_rounds=200)
"""预测并计算roc的相关指标"""
val_pre_lgb = final_model_lgb.predict(X_val)
fpr, tpr, threshold = metrics.roc_curve(y_val, val_pre_lgb)
roc_auc = metrics.auc(fpr, tpr)
print('调参后lightgbm单模型在验证集上的AUC:{}'.format(roc_auc))
"""画出roc曲线图"""
plt.figure(figsize=(8, 8))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label = 'Val AUC = %0.4f' % roc_auc)
plt.ylim(0,1)
plt.xlim(0,1)
plt.legend(loc='best')
plt.title('ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# Plot the diagonal
plt.plot([0,1],[0,1],'r--')
plt.show()
import pickle
pickle.dump(final_model_lgb, open('dataset/model_lgb_best.pkl', 'wb'))
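A possible follow-up (not shown in the original notebook; only the pickle path above is taken from it) is to reload the saved model and score the test split that was set aside earlier:

```python
# Sketch: reload the saved model and predict default probabilities for the test set.
import pickle

with open('dataset/model_lgb_best.pkl', 'rb') as f:
    model_lgb_best = pickle.load(f)

test_pred = model_lgb_best.predict(X_test)   # probabilities for the positive class
```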