Movie Review Classification with Naive Bayes

Date: 2023-02-14 12:25:43

Before We Classify

  • Given a movie review (a piece of text), we want to determine whether its tone is positive (+1) or negative (-1). This article uses a naive Bayes classifier to solve the problem. Naive Bayes works by computing the probability that a sample belongs to each class. The computation is based on Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B), which gives the probability of A given B.
# Here's a running history for the past week.
# For each day, it contains whether or not the person ran, and whether or not they were tired.
days = [["ran", "was tired"], ["ran", "was not tired"], ["didn't run", "was tired"], ["ran", "was tired"], ["didn't run", "was not tired"], ["ran", "was not tired"], ["ran", "was tired"]]


# This is P(A): the probability of being tired.
prob_tired = len([d for d in days if d[1] == "was tired"]) / len(days)
# This is P(B): the probability of running.
prob_ran = len([d for d in days if d[0] == "ran"]) / len(days)
# This is P(B|A): the probability of running given that you were tired.
prob_ran_given_tired = len([d for d in days if d[0] == "ran" and d[1] == "was tired"]) / len([d for d in days if d[1] == "was tired"])

# Now we can calculate P(A|B).
prob_tired_given_ran = (prob_ran_given_tired * prob_tired) / prob_ran

print("Probability of being tired given that you ran: {0}".format(prob_tired_given_ran))
'''
Probability of being tired given that you ran: 0.6
'''
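  • As a sanity check, we can estimate the same probability directly from the days list by counting, with no Bayes' theorem involved; the two approaches agree on this data:
# Directly estimate P(tired | ran): restrict to the days with a run,
# then take the fraction of those days on which the person was tired.
ran_days = [d for d in days if d[0] == "ran"]
print("Direct estimate: {0}".format(len([d for d in ran_days if d[1] == "was tired"]) / len(ran_days)))
'''
Direct estimate: 0.6
'''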

Naive Bayes Intro

  • The previous example had a single attribute, running, with tiredness as the target variable, so Bayes' theorem P(A|B) = P(B|A)P(A) / P(B) could be applied directly. With more than one attribute, estimating P(B|A) for the joint combination of attribute values becomes difficult, which is where naive Bayes comes in. Naive Bayes adds a conditional independence assumption, giving the formula below:

P(y|x1, x2, …, xn) = P(y) · P(x1|y) · P(x2|y) · … · P(xn|y) / P(x1, x2, …, xn)

  • The following example has two attributes: whether the person ran and whether they woke up early. Given the sample ["ran", "didn't wake up early"], we predict whether the person was tired:
# Here's our data, but with "woke up early" or "didn't wake up early" added.
days = [["ran", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["ran", "was tired", "woke up early"]]

# We're trying to predict whether or not the person was tired on this day.
new_day = ["ran", "didn't wake up early"]

def calc_y_probability(y_label, days):
    # P(y): the fraction of days with this label.
    return len([d for d in days if d[1] == y_label]) / len(days)

def calc_ran_probability_given_y(ran_label, y_label, days):
    # P(ran status | y): among days with this label, the fraction with this running status.
    y_days = [d for d in days if d[1] == y_label]
    return len([d for d in y_days if d[0] == ran_label]) / len(y_days)

def calc_woke_early_probability_given_y(woke_label, y_label, days):
    # P(wake-up status | y): among days with this label, the fraction with this wake-up status.
    y_days = [d for d in days if d[1] == y_label]
    return len([d for d in y_days if d[2] == woke_label]) / len(y_days)
# P(x): the probability of this combination of attribute values occurring (the denominator).
denominator = len([d for d in days if d[0] == new_day[0] and d[2] == new_day[1]]) / len(days)
# Plug all the values into our formula: multiply the class (y) probability by the probability of each x-value occurring given that class.
prob_tired = (calc_y_probability("was tired", days) * calc_ran_probability_given_y(new_day[0], "was tired", days) * calc_woke_early_probability_given_y(new_day[1], "was tired", days)) / denominator

prob_not_tired = (calc_y_probability("was not tired", days) * calc_ran_probability_given_y(new_day[0], "was not tired", days) * calc_woke_early_probability_given_y(new_day[1], "was not tired", days)) / denominator

# Make a classification decision based on the probabilities.
classification = "was tired"
if prob_not_tired > prob_tired:
    classification = "was not tired"
print("Final classification for new day: {0}. Tired probability: {1:.3f}. Not tired probability: {2:.3f}.".format(classification, prob_tired, prob_not_tired))
'''
Final classification for new day: was not tired. Tired probability: 0.200. Not tired probability: 0.667.
'''
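  • Because every attribute combination appears in this toy dataset, we can also compute the exact conditional probability by counting the days that match both attribute values, with no independence assumption; both estimates put "was not tired" ahead:
# Exact P(tired | ran, didn't wake up early), counting matching days directly.
matching_days = [d for d in days if d[0] == new_day[0] and d[2] == new_day[1]]
print("Exact tired probability: {0:.3f}".format(len([d for d in matching_days if d[1] == "was tired"]) / len(matching_days)))
'''
Exact tired probability: 0.333
'''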

Finding Word Counts

  • The formula above can be simplified slightly. When scoring a sample against the positive and the negative class, the denominator in both expressions is the same probability of the sample itself, and since all we need is a comparison between the two class scores, the denominator can be dropped. For text classification, the feature values are typically word frequencies:
'''
Sample review:
[['plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what\'s the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt',
'-1'],...
'''

# A nice python class that lets you count how many times items occur in a list
from collections import Counter
import csv
import re

# Read in the training data.
with open("train.csv", 'r') as file:
reviews = list(csv.reader(file))

def get_text(reviews, score):
    # Join together the text in the reviews for a particular tone.
    # We lowercase to avoid "Not" and "not" being seen as different words, for example.
    return " ".join([r[0].lower() for r in reviews if r[1] == str(score)])

def count_text(text):
    # Split text into words based on whitespace. Simple but effective.
    words = re.split(r"\s+", text)
    # Count up the occurrence of each word.
    return Counter(words)

negative_text = get_text(reviews, -1)
positive_text = get_text(reviews, 1)
# Generate word counts for negative tone.
negative_counts = count_text(negative_text)
# Generate word counts for positive tone.
positive_counts = count_text(positive_text)

print("Negative text sample: {0}".format(negative_text[:100]))
print("Positive text sample: {0}".format(positive_text[:100]))
'''
Negative text sample: plot : two teen couples go to a church party drink and then drive . they get into an accident . one
Positive text sample: films adapted from comic books have had plenty of success whether they're about superheroes ( batman
'''
  • Counter takes a list of items and returns how many times each item occurs, as a dict-like mapping.
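  • A quick illustration of Counter on a small list:
from collections import Counter

print(Counter(["ran", "ran", "didn't run"]))
'''
Counter({'ran': 2, "didn't run": 1})
'''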

Making Predictions

  • To classify a sample, we compute its score for each class: P(A|B) ∝ P(B|A)P(A) = P(w1, w2, …|A)P(A) = P(w1|A)P(w2|A)…P(A), where P(wi|A) is the probability of word wi occurring in class A. To avoid any P(wi|A) being 0, we apply Laplace smoothing: add 1 to the numerator and a smoothing count to the denominator (the code below uses the number of reviews in the class).
import re
from collections import Counter

def get_y_count(score):
    # Compute the count of each classification occurring in the data.
    return len([r for r in reviews if r[1] == str(score)])

# We need these counts to use for smoothing when computing the prediction.
positive_review_count = get_y_count(1)
negative_review_count = get_y_count(-1)

# These are the class probabilities (we saw them in the formula as P(y)).
prob_positive = positive_review_count / len(reviews)
prob_negative = negative_review_count / len(reviews)

def make_class_prediction(text, counts, class_prob, class_count):
    prediction = 1
    text_counts = Counter(re.split(r"\s+", text))
    for word in text_counts:
        # For every word in the text, get the number of times that word occurred in the
        # reviews for the given class, add 1 to smooth the value, and divide by the total
        # number of words in the class (plus class_count to also smooth the denominator) -- Laplace smoothing.
        # Smoothing ensures that we don't multiply the prediction by 0 if the word didn't exist in the training data.
        # We also weight each word's smoothed probability by how often it appears in the text being classified.
        prediction *= text_counts.get(word) * ((counts.get(word, 0) + 1) / (sum(counts.values()) + class_count))
    # Now we multiply by the probability of the class existing in the documents.
    return prediction * class_prob

# As you can see, we can now generate probabilities for which class a given review is part of.
# The probabilities themselves aren't very useful -- we make our classification decision based on which value is greater.
print("Review: {0}".format(reviews[0][0]))
print("Negative prediction: {0}".format(make_class_prediction(reviews[0][0], negative_counts, prob_negative, negative_review_count)))
print("Positive prediction: {0}".format(make_class_prediction(reviews[0][0], positive_counts, prob_positive, positive_review_count)))
'''
Review: plot : two teen couples go to a church party drink and then drive . they get into an accident . one of the guys dies but his girlfriend continues to see him in her life and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea but presents it in a very bad package . which is what makes this review an even harder one to write since i generally applaud films which attempt
Negative prediction: 3.0050530362356515e-221
Positive prediction: 1.3071705466906787e-226
'''
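  • Those scores underflow toward 0 very quickly (note the e-221 and e-226 exponents), so for longer documents it is safer to score in log space. Below is a minimal sketch, using the standard multinomial weighting in which each word's log-probability is added once per occurrence (a slight change from the linear count factor above); the class with the larger log score wins, exactly as before:
import math
import re
from collections import Counter

def make_log_class_prediction(text, counts, class_prob, class_count):
    # Sum log probabilities instead of multiplying raw probabilities, which avoids underflow.
    log_prediction = math.log(class_prob)
    text_counts = Counter(re.split(r"\s+", text))
    total_words = sum(counts.values()) + class_count
    for word in text_counts:
        # Each word contributes its smoothed log-probability, weighted by its count in the text.
        log_prediction += text_counts[word] * math.log((counts.get(word, 0) + 1) / total_words)
    return log_prediction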

Predicting The Test Set

  • Compare the positive and negative probabilities to produce the final class label.
import csv

def make_decision(text, make_class_prediction):
    # Compute the negative and positive probabilities.
    negative_prediction = make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
    positive_prediction = make_class_prediction(text, positive_counts, prob_positive, positive_review_count)

    # We assign a classification based on which probability is greater.
    if negative_prediction > positive_prediction:
        return -1
    return 1

with open("test.csv", 'r') as file:
test = list(csv.reader(file))

predictions = [make_decision(r[0], make_class_prediction) for r in test]
'''
predictions : list (<class 'list'>)
[-1,
-1,
-1,
1,
'''

Computing Error
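
  • Compare the predictions with the actual labels from the test set, using the area under the ROC curve (AUC) as the metric.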

actual = [int(r[1]) for r in test]

from sklearn import metrics

# Generate the roc curve using scikits-learn.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)

# Measure the area under the curve. The closer to 1, the "better" the predictions.
print("AUC of the predictions: {0}".format(metrics.auc(fpr, tpr)))
'''
AUC of the predictions: 0.680701754385965
'''
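  • If you also want plain accuracy alongside AUC, scikit-learn computes it directly from the same actual and predictions lists:
from sklearn.metrics import accuracy_score

# Fraction of test reviews whose predicted label matches the actual label.
print("Accuracy of the predictions: {0}".format(accuracy_score(actual, predictions)))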

A Faster Way To Predict

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Generate counts from text using a vectorizer. There are other vectorizers available, and lots of options you can set.
# This performs our step of computing word counts.
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a naive bayes model to the training data.
# This will train the model using the word counts we computed, and the existing classifications in the training set.
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews])

# Now we can use the model to predict classifications for our test features.
predictions = nb.predict(test_features)

# Compute the error. It differs slightly from our own model because scikit-learn's internals work differently from our implementation.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
'''
Multinomial naive bayes AUC: 0.6509287925696594
'''
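
  • As a usage example, the fitted vectorizer and model can label new, unseen text (the review string below is made up for illustration):
# Vectorize a new review with the same vocabulary, then predict its label (-1 or 1).
new_review = ["this movie was a waste of two hours"]
print(nb.predict(vectorizer.transform(new_review)))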