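Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected more than 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area.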

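The key challenge in detection is classifying tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete an analysis that classifies these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) dataset.^[1]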

Introducing the Breast Cancer Dataset

Now that we have the tools to build a Logistic Regression model for a classification dataset, we will introduce a new dataset.

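In the breast cancer dataset, each data point holds measurements taken from an image of a breast mass, together with whether or not the mass is cancerous. Our goal is to use these measurements to predict whether a mass is cancerous.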

This dataset is built into scikit-learn, so we do not need to read in a csv file. See the official scikit-learn documentation for the breast cancer dataset.

Let's start by loading the dataset and taking a look at the data and how it is formatted.

from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()

The returned object (which we stored in the cancer_data variable) behaves like a Python dictionary. We can see the available keys with the keys method.

print(cancer_data.keys())
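
As a small illustration of the dictionary-like behaviour described above, the sketch below walks over the keys and prints the type of each value. It also shows that the same entry can be read as an attribute; the variable name same_object is chosen here for the example only.

```python
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Walk over the keys and show what kind of value each one holds.
for key in cancer_data.keys():
    value = cancer_data[key]
    print(key, type(value).__name__)

# Dictionary-style and attribute-style access return the same object.
same_object = cancer_data['data'] is cancer_data.data
print(same_object)
```

The rest of this walkthrough uses the dictionary style, for example cancer_data['data'].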

We start with DESCR, which gives a detailed description of the dataset.

print(cancer_data['DESCR'])

It returns the following:

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

|details-start|
**References**
|details-split|

- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
  for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
  Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
  San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
  prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
  July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
  to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
  163-171.

|details-end|

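We can see that there are 30 features, 569 data points, and a target that is either malignant (cancerous) or benign (not cancerous). For each data point we have measurements of the breast mass (radius, texture, perimeter, and so on). For each of these 10 measurements, three values were computed: the mean, the standard error, and the worst value. This gives 10 * 3 = 30 features in total.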
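
The 10 * 3 breakdown can be checked against the column names themselves. This is a minimal sketch that relies on the naming pattern scikit-learn uses for this dataset ('mean ...', '... error', 'worst ...'); the list names are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer

names = list(load_breast_cancer().feature_names)

# Group the 30 column names by the kind of statistic they hold.
mean_cols = [n for n in names if n.startswith('mean ')]
error_cols = [n for n in names if n.endswith(' error')]
worst_cols = [n for n in names if n.startswith('worst ')]

print(len(mean_cols), len(error_cols), len(worst_cols))  # 10 10 10
```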

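In the breast cancer dataset, several features are calculated from other columns. The process of deciding which additional features to calculate is called feature engineering.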
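
To make the idea concrete, here is a toy sketch of one engineered feature. It uses the compactness formula quoted in DESCR (perimeter^2 / area - 1.0) on made-up numbers; the nuclei table and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical perimeter and area measurements for three cell nuclei.
nuclei = pd.DataFrame({
    'perimeter': [10.0, 12.0, 20.0],
    'area': [20.0, 36.0, 50.0],
})

# Engineered feature, following the formula quoted in DESCR:
# compactness = perimeter^2 / area - 1.0
nuclei['compactness'] = nuclei['perimeter'] ** 2 / nuclei['area'] - 1.0
print(nuclei)
```

The new column is derived entirely from the two existing ones, which is what makes it an engineered feature rather than a new measurement.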

Code for Importing the Breast Cancer Dataset

You can run this code on your own machine to see the results:

import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
print(cancer_data.keys())
print(cancer_data['DESCR'])

Loading the Data into Pandas

Let's pull the feature and target data out of the cancer_data object.

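First, the feature data is stored under the data key. When we look at it, we see that it is a numpy array with 569 rows and 30 columns, because we have 569 data points and 30 features. We can verify this with the shape attribute.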

cancer_data['data'].shape

To put this into a Pandas DataFrame and make it more readable, we need the column names, which are stored under the feature_names key.

cancer_data['feature_names']

Now we can create a Pandas DataFrame with all of the feature data.

df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])
print(df.head())

The output is as follows:

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0                 0.07871  ...         25.38          17.33           184.60   
1                 0.05667  ...         24.99          23.41           158.80   
2                 0.05999  ...         23.57          25.53           152.50   
3                 0.09744  ...         14.91          26.50            98.87   
4                 0.05883  ...         22.54          16.67           152.20   

   worst area  worst smoothness  worst compactness  worst concavity  \
0      2019.0            0.1622             0.6656           0.7119   
1      1956.0            0.1238             0.1866           0.2416   
2      1709.0            0.1444             0.4245           0.4504   
3       567.7            0.2098             0.8663           0.6869   
4      1575.0            0.1374             0.2050           0.4000   

   worst concave points  worst symmetry  worst fractal dimension  
0                0.2654          0.4601                  0.11890  
1                0.1860          0.2750                  0.08902  
2                0.2430          0.3613                  0.08758  
3                0.2575          0.6638                  0.17300  
4                0.1625          0.2364                  0.07678

We can see that the DataFrame has 30 columns, since we have 30 features. We used the head method, so the result shows only the first 5 data points.

We still need to put the target data into our DataFrame; it can be found under the target key. We can see that the target is a one-dimensional numpy array of 1s and 0s.

cancer_data['target']

If we look at the shape of the array, we see that it is a one-dimensional array with 569 values (one for each data point).

cancer_data['target'].shape

To interpret these 1s and 0s, we need to know which of them stands for benign and which for malignant. The target_names key tells us this.

cancer_data['target_names']

This gives the array ['malignant', 'benign'], which tells us that 0 means malignant and 1 means benign. Let's add this data to the Pandas DataFrame.

df['target'] = cancer_data['target']
df.head()

Complete Code for Loading the Data into Pandas

import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])
df['target'] = cancer_data['target']
print(df.head())

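When loading data into Pandas, the most important thing is to double-check that boolean columns and other target values are being interpreted correctly.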
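
One way to run that double-check is to translate the 0/1 codes into class names and compare the counts with the class distribution stated in DESCR (212 malignant, 357 benign). The sketch below does this; the label column and the counts variable are names chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])
df['target'] = cancer_data['target']

# Translate each 0/1 code into its class name by indexing target_names.
# 'label' is a helper column added only for this check.
df['label'] = cancer_data['target_names'][df['target'].values]

# Count how many rows fall into each class.
counts = df['label'].value_counts()
print(counts)
```

If the counts came out swapped, that would be a sign that 0 and 1 were being read the wrong way round.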

Building a Logistic Regression Model

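Now that we have looked at our data and put it into a convenient format, we can build our feature matrix X and target array y so that we can build a Logistic Regression model.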

X = df[cancer_data.feature_names].values
y = df['target'].values

Now we create a Logistic Regression object and use the fit method to build the model.

model = LogisticRegression()
model.fit(X, y)

Don't forget the import (from sklearn.linear_model import LogisticRegression)!

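When we run this code we get a Convergence Warning. This means the solver reached its iteration limit before it settled on an optimal solution. One option is to increase the number of iterations. You can also switch to a different solver, which is what we will do. The solver is the algorithm the model uses to find the coefficients of the linear equation. You can see the available solvers in the Logistic Regression documentation.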

model = LogisticRegression(solver='liblinear')
model.fit(X, y)
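
For comparison, the other option mentioned above, raising the iteration limit while keeping the default solver, might look like the sketch below. The value 10000 is an arbitrary, generous limit picked for this example, and model_more_iter is a name chosen here so that it does not clash with model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Keep the default solver but allow many more iterations than the default of 100.
# 10000 is an assumed limit for this sketch, not a recommended setting.
model_more_iter = LogisticRegression(max_iter=10000)
model_more_iter.fit(X, y)

print(model_more_iter.n_iter_)      # iterations the solver actually used
print(model_more_iter.score(X, y))  # accuracy on the training data
```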

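Let's see what the model predicts for the first data point in the dataset. Recall that the predict method takes a 2-dimensional array, so we must put the data point in a list.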

model.predict([X[0]])

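The result is a class label, where 0 means malignant and 1 means benign. For the first data point the model predicts 0, so it classifies this mass as malignant.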
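
As a hedged extension of the single prediction above, the sketch below predicts the first five data points at once, prints predicted and true class names side by side, and uses predict_proba to show the class probabilities behind the first prediction. The variable names predicted and proba are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer_data = load_breast_cancer()
X, y = cancer_data['data'], cancer_data['target']

model = LogisticRegression(solver='liblinear')
model.fit(X, y)

# A slice of rows is already 2-dimensional, so no extra list is needed.
predicted = model.predict(X[:5])
print(cancer_data['target_names'][predicted])  # predicted class names
print(cancer_data['target_names'][y[:5]])      # true class names

# One row per sample: [P(class 0 = malignant), P(class 1 = benign)].
proba = model.predict_proba([X[0]])
print(proba)
```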

To see how well the model performs over the whole dataset, we use the score method to see the accuracy of the model.

model.score(X, y)

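We see that the model gets about 96% of the data points correct. Note that this is accuracy on the same data the model was trained on.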
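
For a classifier, the score method reports accuracy, meaning the fraction of data points whose predicted label matches the true label. The sketch below recomputes that fraction by hand to show that the two numbers agree; manual_accuracy is a name chosen here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

model = LogisticRegression(solver='liblinear')
model.fit(X, y)

# Accuracy by hand: the share of predictions that equal the true labels.
y_pred = model.predict(X)
manual_accuracy = np.mean(y_pred == y)

print(manual_accuracy, model.score(X, y))
```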

With the tools we have developed, we can build a model for any classification dataset.

Complete Code for Building a Logistic Regression Model

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])
df['target'] = cancer_data['target']

X = df[cancer_data.feature_names].values
y = df['target'].values

model = LogisticRegression(solver='liblinear')
model.fit(X, y)
print("prediction for datapoint 0:", model.predict([X[0]]))
print(model.score(X, y))