How the Gini Coefficient Works

Reposted from: https://blog.csdn.net/u010665216/article/details/78528261

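First, we construct a sample competition result directly, consisting of the actual values and the predicted values:

import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate
import scipy.interpolate

predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]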

We sort the (actual, prediction) pairs by predicted value in ascending order:

data = zip(actual, predictions)
sorted_data = sorted(data, key=lambda d: d[1])
sorted_actual = [d[0] for d in sorted_data]
print('Sorted Actual Values', sorted_actual)

We then take the cumulative sum of the sorted actual values and plot it:

cumulative_actual = np.cumsum(sorted_actual)
cumulative_index = np.arange(1, len(cumulative_actual)+1)
plt.plot(cumulative_index, cumulative_actual)
plt.xlabel('Cumulative Number of Predictions')
plt.ylabel('Cumulative Actual Values')
plt.show()

We normalize both axes to the range [0, 1] and draw the 45° line:

cumulative_actual_shares = cumulative_actual / sum(actual)
cumulative_index_shares = cumulative_index / len(predictions)

# Add (0, 0) to the plot
x_values = [0] + list(cumulative_index_shares)
y_values = [0] + list(cumulative_actual_shares)

# Display the 45° line stacked on top of the y values
diagonal = [x - y for (x, y) in zip(x_values, y_values)]
plt.stackplot(x_values, y_values, diagonal)
plt.xlabel('Cumulative Share of Predictions')
plt.ylabel('Cumulative Share of Actual Values')
plt.show()

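Computing the area of the orange region, i.e. the area between the curve and the 45° line:

# Interpolate the curve, integrate it, and subtract from the area under the diagonal (0.5)
fy = scipy.interpolate.interp1d(x_values, y_values)
blue_area, _ = scipy.integrate.quad(fy, 0, 1, points=x_values)
orange_area = 0.5 - blue_area
print('Orange Area: %.3f' % orange_area)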
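Because the curve is piecewise linear, its area can also be obtained exactly with the trapezoid rule, without interpolation or adaptive quadrature. The following is a minimal sketch of that cross-check and is not part of the original post; the names `order`, `x`, `y`, `blue_area_trapz` and `orange_area_trapz` are chosen here for illustration.

```python
import numpy as np

predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Sort the actual values by ascending prediction (stable, like sorted()).
order = np.argsort(predictions, kind="stable")
sorted_actual = np.asarray(actual, dtype=float)[order]

# Curve points on [0, 1] x [0, 1], including the origin.
y = np.concatenate(([0.0], np.cumsum(sorted_actual) / sorted_actual.sum()))
x = np.linspace(0.0, 1.0, len(y))

# Trapezoid rule written out by hand: segment width times mean segment height.
blue_area_trapz = float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))
orange_area_trapz = 0.5 - blue_area_trapz
print('Orange Area (trapezoid): %.3f' % orange_area_trapz)
```

For the sample data this prints `Orange Area (trapezoid): 0.189` (exactly 17/90), which should agree with the `quad`-based result up to integration tolerance.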

The maximum possible Gini coefficient:

Above, we sorted the actual values by the predictions and obtained one Gini coefficient. Now we sort the actual values by themselves, which corresponds to a perfect ordering and gives the maximum possible Gini coefficient:

cumulative_actual_shares_perfect = np.cumsum(sorted(actual)) / sum(actual)
y_values_perfect = [0] + list(cumulative_actual_shares_perfect)

# Display the 45° line stacked on top of the y values
diagonal = [x - y for (x, y) in zip(x_values, y_values_perfect)]
plt.stackplot(x_values, y_values_perfect, diagonal)
plt.xlabel('Cumulative Share of Predictions')
plt.ylabel('Cumulative Share of Actual Values')
plt.show()

# Integrate the curve function
fy = scipy.interpolate.interp1d(x_values, y_values_perfect)
blue_area, _ = scipy.integrate.quad(fy, 0, 1, points=x_values)
orange_area = 0.5 - blue_area
print('Orange Area: %.3f' % orange_area)
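For a 0/1 target such as `actual`, the perfect-ordering area can also be checked by hand. The following is a short worked calculation, not part of the original post; the symbol $p$ (the share of positive samples) is introduced here for illustration. Sorted by itself, the curve stays at 0 until $x = 1 - p$ and then rises in a straight line to $(1, 1)$, so the blue region is a triangle with base $p$ and height 1:

```latex
\begin{aligned}
p &= \frac{6}{15} = 0.4, \\
\text{blue area} &= \frac{1}{2} \cdot p \cdot 1 = 0.2, \\
\text{orange area} &= \frac{1}{2} - \frac{p}{2} = \frac{1 - p}{2} = 0.3 .
\end{aligned}
```

Under this reading, the maximum depends only on the class balance of the actual values, not on any model.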

An implementation of the scoring metric as used in data mining:

def gini(actual, pred):
    assert len(actual) == len(pred)
    # Columns: actual value, prediction, original row index
    arr = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction, descending; ties are broken by original row index
    arr = arr[np.lexsort((arr[:, 2], -1 * arr[:, 1]))]
    totalLosses = arr[:, 0].sum()
    # Sum of the cumulative shares of the actual values
    giniSum = arr[:, 0].cumsum().sum() / totalLosses
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

def gini_normalized(actual, pred):
    # Divide by the Gini of a perfect ordering so that the best score is 1
    return gini(actual, pred) / gini(actual, actual)

gini_predictions = gini(actual, predictions)
gini_max = gini(actual, actual)
ngini = gini_normalized(actual, predictions)
print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions, gini_max, ngini))
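The arithmetic inside `gini` can be read as the trapezoid rule in closed form. A sketch of that reading follows; the symbols $n$, $c_i$ and $S$ are introduced here for illustration and do not appear in the original code. Let $n$ be the number of samples, $c_i$ the cumulative share of the actual values after the $i$-th sample in descending order of prediction (so $c_0 = 0$ and $c_n = 1$), and $S = \sum_{i=1}^{n} c_i$, which is the value first assigned to `giniSum`.

```latex
\begin{aligned}
\text{area under curve} &= \sum_{i=1}^{n} \frac{1}{n} \cdot \frac{c_{i-1} + c_i}{2}
  = \frac{1}{n}\left(S - \frac{1}{2}\right), \\
\text{Gini} &= \frac{1}{n}\left(S - \frac{1}{2}\right) - \frac{1}{2}
  = \frac{1}{n}\left(S - \frac{n + 1}{2}\right).
\end{aligned}
```

For the sample data, $S = 65/6$ and $n = 15$, giving $(65/6 - 8)/15 \approx 0.189$, the same as the orange area computed from the plot. Because the sort here is descending, the curve lies above the 45° line, which is why the diagonal's area is subtracted from the curve's area rather than the other way round.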
