Accuracy, Precision, Recall, and F1 Score in scikit-learn

scikit-learn has built-in implementations of the model evaluation metrics we introduced earlier: accuracy, precision, recall, and the F1 score.

Now let's put them to work. First, recall how we built the logistic regression model on the Titanic dataset:

from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.read_csv('https://kingsmai.github.io/uploads/@files/datasets/titanic/train.csv')
# data preprocessing: fill missing Age values with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'SibSp', 'Parch', 'Fare']].values
y = df['Survived'].values
model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

We read train.csv into the DataFrame df and preprocess it: missing values in the Age column are filled with the mean age, and a boolean male column records whether each passenger is male. We then build the feature matrix X and the target array y, create a LogisticRegression model, fit it on the features and targets, and finally store the model's predictions in y_pred.
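Before moving on, it can help to sanity-check the preprocessing. Here is a minimal sketch, assuming the script above has just been run (the expected shapes come from the standard 891-row Kaggle train.csv):

print(df['Age'].isna().sum())  # should print 0: no missing ages remain after filling
print(X.shape, y.shape)        # expected (891, 6) and (891,): 891 passengers, 6 features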

Now let's use the metric functions. First, import them:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Each metric function takes two one-dimensional numpy arrays: the true target values and the predicted target values.
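As a quick illustration of the call signature, here is a minimal sketch with made-up labels (not the Titanic data):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # actual target values
y_hat = [1, 0, 0, 1, 1, 1]   # predicted target values
print(accuracy_score(y_true, y_hat))   # 4 of 6 correct -> 0.666...
print(precision_score(y_true, y_hat))  # TP=3, FP=1 -> 0.75
print(recall_score(y_true, y_hat))     # TP=3, FN=1 -> 0.75
print(f1_score(y_true, y_hat))         # harmonic mean of 0.75 and 0.75 -> 0.75

Since we already have the true targets y and the predictions y_pred, the full Titanic script with the metric calls added looks like this: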

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

df = pd.read_csv('https://kingsmai.github.io/uploads/@files/datasets/titanic/train.csv')
# data preprocessing: fill missing Age values with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'SibSp', 'Parch', 'Fare']].values
y = df['Survived'].values
model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

print("accuracy:", accuracy_score(y, y_pred))
print("precision:", precision_score(y, y_pred))
print("recall:", recall_score(y, y_pred))
print("f1 score:", f1_score(y, y_pred))

Output:

accuracy: 0.797979797979798
precision: 0.7515527950310559
recall: 0.7076023391812866
f1 score: 0.7289156626506023

The accuracy is about 80%, meaning roughly 80% of the model's predictions are correct. The precision is 75%: of the passengers the model predicted as positive (survived), 75% actually survived. The recall is about 71%: of all the passengers who actually survived, the model correctly identified 71%. The F1 score is 73%, the harmonic mean of precision and recall.
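To see exactly where these numbers come from, we can recompute them from the four raw counts. A minimal sketch, assuming y and y_pred from the script above are still in scope:

import numpy as np

tp = np.sum((y == 1) & (y_pred == 1))  # true positives
fp = np.sum((y == 0) & (y_pred == 1))  # false positives
fn = np.sum((y == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y == 0) & (y_pred == 0))  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print((tp + tn) / (tp + fp + fn + tn))                # accuracy
print(precision, recall)                              # precision and recall
print(2 * precision * recall / (precision + recall))  # F1: harmonic mean of the two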

On its own, a metric value for a single model does not tell us much. Depending on how hard the problem is, 60% might be a good score for one problem and 90% for another. Metrics become useful when we compare different models so we can choose the best one.
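As a preview of that workflow, here is a hypothetical comparison of two logistic regression models trained on the same data: one with all six features, and one with a reduced feature set chosen purely for illustration.

# Model 1: all six features (X and y as defined above)
model_all = LogisticRegression()
model_all.fit(X, y)

# Model 2: a smaller, hypothetical feature set for comparison
X_small = df[['Pclass', 'male', 'Age']].values
model_small = LogisticRegression()
model_small.fit(X_small, y)

print("all features f1:  ", f1_score(y, model_all.predict(X)))
print("three features f1:", f1_score(y, model_small.predict(X_small)))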

scikit-learn also bundles these per-class metrics into a single summary, classification_report:

from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

The report above shows the precision, recall, and F1 score for each class, along with each class's support (the number of true instances) and the overall accuracy.
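If you would rather work with these numbers programmatically than read them off the printed table, classification_report also accepts output_dict=True and returns a nested dictionary keyed by class label:

report = classification_report(y, y_pred, output_dict=True)
print(report['1']['precision'])  # precision for the positive class (label 1)
print(report['1']['recall'])     # recall for the positive class
print(report['1']['f1-score'])   # note the key is 'f1-score', with a hyphen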

The Confusion Matrix in scikit-learn

scikit-learn provides a confusion_matrix function that returns the four counts of the confusion matrix (true positives, false positives, false negatives, and true negatives). With y as the true target values and y_pred as the predictions, we can call confusion_matrix as follows:

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))

The output is:

[[469  80]
 [100 242]]
# layout of the returned matrix:
# [[TN FP]
#  [FN TP]]

scikit-learn orders the matrix so that the negative class comes first. The structure of the output is:

                    Predicted Negative    Predicted Positive
Actual Negative            469                    80
Actual Positive            100                   242

By contrast, this is how we usually draw a confusion matrix:

                     Actual Positive    Actual Negative
Predicted Positive         242                 80
Predicted Negative         100                469

Since the negative class corresponds to 0 and the positive class to 1, scikit-learn orders them that way. Make sure you double-check that you are interpreting the values correctly!
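Because this ordering is easy to get backwards, a convenient pattern for binary problems (shown in the scikit-learn documentation) is to unpack the four counts directly:

from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix in row-major order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)  # 469 80 100 242 for the model above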

A closer look at how scikit-learn orders and flips the confusion matrix

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')  # a different hosted copy of the Titanic data; note the slightly different column names below
df['Male'] = df['Sex'] == 'male'

X = df[['Pclass', 'Male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values
y = df['Survived'].values

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

# TP = True Positive
# FP = False Positive
# TN = True Negative
# FN = False Negative

print(confusion_matrix(y, y_pred)) # output confusion matrix, actual targets 'y' passed first, predicted targets 'y_pred' passed second
print()

# The scikit-learn convention for the confusion matrix is to show predicted labels along the columns and actual labels along the rows, with 0 treated as the negative class and 1 as the positive class.
# By default the labels are listed in sorted order, so 0 (negative) comes before 1 (positive): 0 is the first row and first column of our confusion matrix.

# Therefore, the confusion matrix shows:

# A   0   TN | FP
# c      ____|____
# t          |
# u   1   FN | TP
# a
# l       0    1
#    P r e d i c t e d


# The automatic ordering of 0 and 1 in the confusion matrix can be overridden with the 'labels=' parameter.
# Let's make 1 appear first with 'labels=[1, 0]':
print(confusion_matrix(y, y_pred, labels=[1, 0]))
print()

# Now, the confusion matrix shows:

# A   1   TP | FN
# c      ____|____
# t          |
# u   0   FP | TN
# a
# l       1    0
#    P r e d i c t e d


# To flip the matrix so that predicted labels run along the rows and actual labels along the columns, pass the predicted targets first and the actual targets second.
print(confusion_matrix(y_pred, y, labels=[1, 0]))  # 'y_pred' passed first, 'y' passed second

# Now, the confusion matrix shows:

# P
# r
# e   1   TP | FP
# d      ____|____
# i          |
# c   0   FN | TN
# t
# e
# d       1    0
#    A c t u a l


# github.com/alandavidgrunberg