使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

逻辑斯谛回归

关于逻辑斯谛回归，这篇文章http://blog.****.net/zouxy09/article/details/20319673 讲的很好；Andrew Ng的机器学习公开课也很不错（中文笔记也很好http://blog.****.net/stdcoutzyx ）；还有《机器学习实战》，都是不错资料。

在逻辑斯谛回归中，因为使用梯度上升(gradient ascent)收敛较慢，固本文采用牛顿法(Newton’s Method)进行参数求解，试验发现通常迭代10次左右就可达到收敛，而梯度上升法则需要迭代上百甚至上千次，当然实际的迭代次数也要视实际数据而定。

牛顿法

牛顿法与梯度下降法的功能一样，都是最优化的常用方法。
对于一个函数，如果要求函数值为0时的值，如图所示：

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

先随机选一个点，然后求出该点的切线，即导数，延长切线与横轴相交，以相交时的的值作为下一次迭代的值，更新规则如下

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

对于逻辑斯谛回归，需要求的是似然函数L(θ)的最大值，当L(θ)的导数L’(θ)为0时即为L(θ)的最大值，即求L’(θ)=0的参数，则可使用牛顿法进行求解，此时参数更新规则为

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

使用牛顿法的另一个好处是不需要像梯度法一样指定学习率（即步长）。但是牛顿法需要对二阶导（Hessian矩阵）进行求逆，不过随着拟牛顿法（BFGS）以及限域拟牛顿法（LBFGS）的提出，大大减少了求逆的计算量，不过在本文还是使用牛顿法进行参数求解。

牛顿法求解逻辑斯谛回归参数

迭代中需要进行的主要步骤包括如下：

(1) 初始化参数θ

(2) 获取数据x

(3) 对数据进行预测h

(4) 得到对数似然函数L(θ)

(5) 根据L(θ)计算梯度g

(6) 根据L(θ)计算Hessian矩阵H

(7) 更新参数θ

具体计算

(1) 初始化θ=(b, θ⁽¹⁾, θ⁽²⁾, …, θ⁽ⁿ⁾)^T，初始时θ=(0, 0, 0, …, 0)^T

(2) 获取x=(1, x⁽¹⁾, x⁽²⁾, …, x⁽ⁿ⁾)^T，其中x⁽¹⁾到x⁽ⁿ⁾为数据的n个特征的值

(3) 使用逻辑斯谛函数计算预测值h

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

(4) 得到似然函数

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

对数似然函数为

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

(5) 根据L(θ)计算梯度，一个维度的梯度计算公式为

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

n+1维的梯度向量为

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

(6) 根据L(θ)计算Hessian矩阵H，其中j行k列的值为

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

(n+1)*(n+1)的Hessian矩阵为

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

(7) 更新参数θ，θ = θ – H^-1g

本文使用的迭代规则是求出m个样本的梯度g以及Hessian矩阵H，对m个g和H求平均，然后更新θ，伪代码：

while(iterNum)                       //迭代次数
{
<span style="white-space:pre"></span>for i range(1,m)                 //m个样本
<span style="white-space:pre"></span>{

<span style="white-space:pre"></span>(2) (3) (4) (5) (6)      //具体计算的步骤标号}g = (1/m)gH = (1/m)Hθ = θ – H-1giterNum--}

测试

(1)使用Andrew Ng教授的成绩数据进行测试，对原始数据进行了一些预处理http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex4materials/ex4Data.zip

测试效果如下：

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

(2)使用马疝气病数据http://archive.ics.uci.edu/ml/datasets/Horse+Colic 进行了测试，迭代7次后收敛，分类错误率0.253731，使用的数据是《机器学习实战》中预处理后的数据。

tie dai ma

依赖包括numpy库和matplotlib库。

代码比较乱，因为开始时并没有使用numpy库，所以向量的点积、相加等计算都是自己实现的。后来使用牛顿法需要求矩阵逆，自己偷懒不想写，所以用了numpy。中间还有一些计算本来可以使用numpy中的方法直接进行计算，但可能是对矩阵的初始化方法不对，始终无法用numpy计算矩阵乘法，无奈只能自己用笨方法进行计算。代码较乱，做好瞎眼准备，ready! go!

# -*- coding: gb18030 -*-
__author__ = 'jinyu'
from numpy import *
import math

##
# 读取训练数据
# #
def loadDataSet(dataFileName):
    dataMat = []
    labelMat = []
    fr = open(dataFileName)
    for line in fr:
        lineArr = line.strip().split()
        dataMat.append([1.0, 10.0*float(lineArr[0]), 10.0*float(lineArr[1])])
        labelMat.append(int(lineArr[2]))

    return dataMat, labelMat
##
# sigmod函数
# #
def sigmoid(x):
    return 1.0 / (1+math.exp(-x))

##
# 梯度上升法
# #
def gradientAscent(dataMat, labelMat, alpha):
    m = len(dataMat)        #训练集个数
    n = len(dataMat[0])     #数据特征纬度
    theta = [0] * n

    iter = 1000
    while(iter):
        for i in range(m):
            hypothesis = sigmoid(computeDotProduct(dataMat[i], theta))
            error = labelMat[i] - hypothesis
            gradient = computeTimesVect(dataMat[i], error)
            theta = computeVectPlus(theta, computeTimesVect(gradient, alpha))
        iter -= 1
    return theta

##
# 牛顿法
# #
def newtonMethod(dataMat, labelMat, iterNum=10):
    m = len(dataMat)        #训练集个数
    n = len(dataMat[0])     #数据特征纬度
    theta = [0.0] * n

    while(iterNum):
        gradientSum = [0.0] * n
        hessianMatSum = [[0.0] * n]*n
        for i in range(m):
            try:
                hypothesis = sigmoid(computeDotProduct(dataMat[i], theta))
            except:
                continue
            error = labelMat[i] - hypothesis
            gradient = computeTimesVect(dataMat[i], error/m)
            gradientSum = computeVectPlus(gradientSum, gradient)
            hessian = computeHessianMatrix(dataMat[i], hypothesis/m)
            for j in range(n):
                hessianMatSum[j] = computeVectPlus(hessianMatSum[j], hessian[j])

        #计算hessian矩阵的逆矩阵有可能异常，如果捕获异常则忽略此轮迭代
        try:
            hessianMatInv = mat(hessianMatSum).I.tolist()
        except:
            continue
        for k in range(n):
            theta[k] -= computeDotProduct(hessianMatInv[k], gradientSum)

        iterNum -= 1
    return theta

##
# 计算hessian矩阵
# #
def computeHessianMatrix(data, hypothesis):
    hessianMatrix = []
    n = len(data)

    for i in range(n):
        row = []
        for j in range(n):
            row.append(-data[i]*data[j]*(1-hypothesis)*hypothesis)
        hessianMatrix.append(row)
    return hessianMatrix

##
# 计算两个向量的点积
# #
def computeDotProduct(a, b):
    if len(a) != len(b):
        return False
    n = len(a)
    dotProduct = 0
    for i in range(n):
        dotProduct += a[i] * b[i]
    return dotProduct

##
# 计算两个向量的和
# #
def computeVectPlus(a, b):
    if len(a) != len(b):
        return False
    n = len(a)
    sum = []
    for i in range(n):
        sum.append(a[i]+b[i])
    return sum

##
# 计算某个向量的n倍
# #
def computeTimesVect(vect, n):
    nTimesVect = []
    for i in range(len(vect)):
        nTimesVect.append(n * vect[i])
    return nTimesVect

def plotBestFit(dataMat, labelMat, weights):
    import matplotlib.pyplot as plt
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i])== 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(0.0, 70.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2');
    plt.show()

def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0

def colicTest(tainFileName, testFileName):
    frTrain = open(tainFileName); frTest = open(testFileName)
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = newtonMethod(trainingSet, trainingLabels, 10)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights))!= int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print "the error rate of this test is: %f" % errorRate
    return errorRate

dataMat, labelMat = loadDataSet("ex4x.dat")
theta = newtonMethod(dataMat, labelMat, 10)
print theta
plotBestFit(dataMat, labelMat, array(theta))

#colicTest('horseColicTraining.txt', 'horseColicTest.txt')

代码和数据

代码和数据可以在https://github.com/0zone/LogisticRegression 下载

秒客网

使用牛顿法确定逻辑斯谛回归（Logistic Regression）最佳回归系数

相关文章