ML Classification 2: Classifying Benign and Malignant Breast Cancer Tumors

Source code

# ML-Classifier-2-breast_cancer_wisconsin
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo

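# fetch dataset (id=17 is the Diagnostic dataset; it is only used to inspect metadata,
# the models below are trained on the Original dataset read from the CSV file)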
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
#print(breast_cancer_wisconsin_diagnostic.metadata)

# variable information
#print(breast_cancer_wisconsin_diagnostic.variables)

column_names = ['Sample code number',
                'Clump Thickness',
                'Uniformity of Cell Size',
                'Uniformity of Cell Shape',
                'Marginal Adhesion',
                'Single Epithelial Cell Size',
                'Bare Nuclei',
                'Bland Chromatin',
                'Normal Nucleoli',
                'Mitoses',
                'Class']

data = pd.read_csv('/home/zero/下载/breast+cancer+wisconsin+original/breast-cancer-wisconsin.data',names=column_names)
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna(how='any')
print(data.shape)
print(data)

# prepare for training
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data[column_names[1:10]],
    data[column_names[10]],
    test_size = 0.25, random_state=33)
print("X_train shape:{}".format(X_train.shape))
print("Y_train shape:{}".format(y_train.shape))
print("X_test shape:{}".format(X_test.shape))
print("Y_test shape:{}".format(y_test.shape))
print(y_train.value_counts())
print(y_test.value_counts())

# train with LogisticRegression and SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
sgdc = SGDClassifier()

lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)

print(lr_y_predict)
print(sgdc_y_predict)

# accuracy
from sklearn.metrics import classification_report

print('Accuracy of LR Classifier:{}'.format(lr.score(X_test,y_test)))
print('Accuracy of SGD Classifier:{}'.format(sgdc.score(X_test,y_test)))

print(classification_report(y_test, lr_y_predict, target_names=['Benign','Malignant']))
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign','Malignant']))

output

/home/zero/anaconda3/envs/AI/bin/python /home/zero/PycharmProjects/ML/main.py
--------------------------------------------------
(683, 11)
     Sample code number  Clump Thickness  ...  Mitoses  Class
0               1000025                5  ...        1      2
1               1002945                5  ...        1      2
2               1015425                3  ...        1      2
3               1016277                6  ...        1      2
4               1017023                4  ...        1      2
..                  ...              ...  ...      ...    ...
694              776715                3  ...        1      2
695              841769                2  ...        1      2
696              888820                5  ...        2      4
697              897471                4  ...        1      4
698              897471                4  ...        1      4

[683 rows x 11 columns]
X_train shape:(512, 9)
Y_train shape:(512,)
X_test shape:(171, 9)
Y_test shape:(171,)
Class
2    344
4    168
Name: count, dtype: int64
Class
2    100
4     71
Name: count, dtype: int64
[2 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 2 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
 4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 2
 2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
 2 2 2 2 2 2 4 2 2 4 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
 2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
[4 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 4 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
 4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 4
 2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
 2 2 2 2 2 2 4 2 2 4 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
 2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
Accuracy of LR Classifier:0.9883040935672515
Accuracy of SGD Classifier:0.9824561403508771
              precision    recall  f1-score   support

      Benign       0.99      0.99      0.99       100
   Malignant       0.99      0.99      0.99        71

    accuracy                           0.99       171
   macro avg       0.99      0.99      0.99       171
weighted avg       0.99      0.99      0.99       171

              precision    recall  f1-score   support

      Benign       1.00      0.97      0.98       100
   Malignant       0.96      1.00      0.98        71

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171


Process finished with exit code 0

Code walkthrough

1. Understanding the dataset

https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original

Breast Cancer Wisconsin (Original)

Additional Information

Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:

Group 1: 367 instances (January 1989)

Group 2: 70 instances (October 1989)

Group 3: 31 instances (February 1990)

Group 4: 17 instances (April 1990)

Group 5: 48 instances (August 1990)

Group 6: 49 instances (Updated January 1991)

Group 7: 31 instances (June 1991)

Group 8: 86 instances (November 1991)

Total: 699 points (as of the donated database on 15 July 1992)

Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has only 367 instances. This is because it originally contained 369 instances; 2 were removed. The following statements summarizes changes to the original Group 1's set of data:

##### Group 1 : 367 points: 200B 167M (January 1989)

##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805

##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record

##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial

##### : Changed 0 to 1 in field 6 of sample 1219406

##### : Changed 0 to 1 in field 8 of following sample:

##### : 1182404,2,3,1,1,1,2,0,1,1,1

Donated on 7/14/1992

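Printing dataset information with code

Note that fetch_ucirepo(id=17) returns the Breast Cancer Wisconsin (Diagnostic) dataset (569 instances, 30 features), not the Original dataset described above. The rest of this article trains on the Original dataset, which is read from a local CSV file.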

from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
print(breast_cancer_wisconsin_diagnostic.metadata)

# variable information
print(breast_cancer_wisconsin_diagnostic.variables)

------output------
{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'published_in': 'Electronic imaging', 'year': 1993, 'url': 'https://www.semanticscholar.org/paper/53f0fbb425bc14468eb3bf96b2e1d41ba8087f36', 'doi': '10.1117/12.148698'}, 'additional_info': {'summary': 'Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/\r\n\r\nSeparating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree.  Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.\r\n\r\nThe actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. 
Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].\r\n\r\nThis database is also available through the UW CS ftp server:\r\nftp ftp.cs.wisc.edu\r\ncd math-prog/cpo-dataset/machine-learn/WDBC/', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': '1) ID number\r\n2) Diagnosis (M = malignant, B = benign)\r\n3-32)\r\n\r\nTen real-valued features are computed for each cell nucleus:\r\n\r\n\ta) radius (mean of distances from center to points on the perimeter)\r\n\tb) texture (standard deviation of gray-scale values)\r\n\tc) perimeter\r\n\td) area\r\n\te) smoothness (local variation in radius lengths)\r\n\tf) compactness (perimeter^2 / area - 1.0)\r\n\tg) concavity (severity of concave portions of the contour)\r\n\th) concave points (number of concave portions of the contour)\r\n\ti) symmetry \r\n\tj) fractal dimension ("coastline approximation" - 1)', 'citation': None}}
                  name     role         type  ... description units missing_values
0                   ID       ID  Categorical  ...        None  None             no
1            Diagnosis   Target  Categorical  ...        None  None             no
2              radius1  Feature   Continuous  ...        None  None             no
3             texture1  Feature   Continuous  ...        None  None             no
4           perimeter1  Feature   Continuous  ...        None  None             no
5                area1  Feature   Continuous  ...        None  None             no
6          smoothness1  Feature   Continuous  ...        None  None             no
7         compactness1  Feature   Continuous  ...        None  None             no
8           concavity1  Feature   Continuous  ...        None  None             no
9      concave_points1  Feature   Continuous  ...        None  None             no
10           symmetry1  Feature   Continuous  ...        None  None             no
11  fractal_dimension1  Feature   Continuous  ...        None  None             no
12             radius2  Feature   Continuous  ...        None  None             no
13            texture2  Feature   Continuous  ...        None  None             no
14          perimeter2  Feature   Continuous  ...        None  None             no
15               area2  Feature   Continuous  ...        None  None             no
16         smoothness2  Feature   Continuous  ...        None  None             no
17        compactness2  Feature   Continuous  ...        None  None             no
18          concavity2  Feature   Continuous  ...        None  None             no
19     concave_points2  Feature   Continuous  ...        None  None             no
20           symmetry2  Feature   Continuous  ...        None  None             no
21  fractal_dimension2  Feature   Continuous  ...        None  None             no
22             radius3  Feature   Continuous  ...        None  None             no
23            texture3  Feature   Continuous  ...        None  None             no
24          perimeter3  Feature   Continuous  ...        None  None             no
25               area3  Feature   Continuous  ...        None  None             no
26         smoothness3  Feature   Continuous  ...        None  None             no
27        compactness3  Feature   Continuous  ...        None  None             no
28          concavity3  Feature   Continuous  ...        None  None             no
29     concave_points3  Feature   Continuous  ...        None  None             no
30           symmetry3  Feature   Continuous  ...        None  None             no
31  fractal_dimension3  Feature   Continuous  ...        None  None             no

[32 rows x 7 columns]

2. Loading and preprocessing the dataset

import pandas as pd
import numpy as np

# create the list of column names
column_names = ['Sample code number',
                'Clump Thickness',
                'Uniformity of Cell Size',
                'Uniformity of Cell Shape',
                'Marginal Adhesion',
                'Single Epithelial Cell Size',
                'Bare Nuclei',
                'Bland Chromatin',
                'Normal Nucleoli',
                'Mitoses',
                'Class']

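# read the data file with pandas.read_csv
data = pd.read_csv('/home/zero/下载/breast+cancer+wisconsin+original/breast-cancer-wisconsin.data',names=column_names)
# replace '?' with the standard missing-value marker (NaN)
data = data.replace(to_replace='?', value=np.nan)
# drop every row that contains a missing value
data = data.dropna(how='any')
# print the shape of data (number of rows and columns)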
print(data.shape)
print(data)

(683, 11) # number of samples and columns
     Sample code number  Clump Thickness  ...  Mitoses  Class
0               1000025                5  ...        1      2
1               1002945                5  ...        1      2
2               1015425                3  ...        1      2
3               1016277                6  ...        1      2
4               1017023                4  ...        1      2
..                  ...              ...  ...      ...    ...
694              776715                3  ...        1      2
695              841769                2  ...        1      2
696              888820                5  ...        2      4
697              897471                4  ...        1      4
698              897471                4  ...        1      4

[683 rows x 11 columns] # dimensions
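One detail worth checking at this step: because the raw file marks missing values with '?', pandas most likely reads the Bare Nuclei column as text, and it stays text after replace and dropna. The sketch below shows an alternative that parses '?' as NaN while reading, so the column is numeric from the start. The three inline rows are hypothetical stand-ins in the same comma-separated layout, not real records from the dataset.

```python
import io
import pandas as pd

column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

# Hypothetical rows in the same layout as the data file;
# the second row has a '?' in the Bare Nuclei field.
sample = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,4,5,1,2,?,7,3,1,4\n"
    "3,6,8,8,1,3,4,3,7,1,2\n"
)

# na_values='?' tells pandas to treat '?' as NaN while parsing,
# so the column is read as numbers instead of strings.
data = pd.read_csv(sample, names=column_names, na_values='?')

# Drop rows with any missing value, as in the original code.
data = data.dropna(how='any')

print(data.shape)
print(data['Bare Nuclei'].dtype)
```

With the real file, the same na_values argument could replace the separate replace call; the dropna step stays unchanged.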

3. Splitting and preparing the dataset

Split the original dataset into a training set (75%) and a test set (25%).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data[column_names[1:10]],
    data[column_names[10]],
    test_size = 0.25, random_state=33)
print("X_train shape:{}".format(X_train.shape))
print("Y_train shape:{}".format(y_train.shape))
print("X_test shape:{}".format(X_test.shape))
print("Y_test shape:{}".format(y_test.shape))
print(y_train.value_counts())
print(y_test.value_counts())

------output------
X_train shape:(512, 9)
Y_train shape:(512,)
X_test shape:(171, 9)
Y_test shape:(171,)
Class
2    344
4    168
Name: count, dtype: int64
Class
2    100
4     71
Name: count, dtype: int64
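The counts above show that the random split does not keep the class ratio identical: benign samples make up 344/512 (about 67%) of the training set but 100/171 (about 58%) of the test set. If a more even ratio is wanted, train_test_split accepts a stratify argument. The sketch below illustrates this on synthetic stand-in data that only reuses the overall class counts from the output (444 benign, 239 malignant); the feature values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the cleaned dataset,
# but random feature values in 1..10 (not the real measurements).
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(683, 9))
y = np.array([2] * 444 + [4] * 239)

# stratify=y asks for (approximately) the same class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33, stratify=y)

train_ratio = np.mean(y_train == 2)  # share of benign samples in the training set
test_ratio = np.mean(y_test == 2)    # share of benign samples in the test set
print(round(train_ratio, 3), round(test_ratio, 3))
```

Whether stratification is needed here is a judgment call; with a dataset of this size the unstratified split still leaves plenty of samples of both classes in each part.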

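4. Training the models (LogisticRegression and SGDClassifier) and making predictions

Standardize the features first: the scaler is fitted on the training set only, and the same transformation is then applied to the test set.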

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
sgdc = SGDClassifier()

lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)

print(lr_y_predict)
print(sgdc_y_predict)

------output------
[2 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 2 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
 4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 2
 2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
 2 2 2 2 2 2 4 2 2 4 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
 2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
[2 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 2 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
 4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 2
 2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
 2 2 2 2 2 2 4 2 2 2 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
 2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
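SGDClassifier shuffles the training data and is created here without a fixed random_state, so its predictions can change from run to run. This is probably why the SGD predictions and report in the full listing differ slightly from the ones in this walkthrough. The sketch below shows one way to make the run repeatable, and it also wraps the scaler and the classifier in a pipeline so the scaling step cannot be accidentally fitted on test data. It uses synthetic data from make_classification rather than the Wisconsin dataset, and the seed value 33 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Wisconsin dataset): 683 samples, 9 features.
X, y = make_classification(n_samples=683, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

def fit_predict(seed):
    # The pipeline fits the scaler on the training data only and reuses
    # the learned mean and standard deviation when predicting.
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=seed))
    model.fit(X_train, y_train)
    return model.predict(X_test)

first = fit_predict(seed=33)
second = fit_predict(seed=33)

# With the same seed, both runs should give identical predictions.
print((first == second).all())
```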

5. Evaluating the models

# accuracy
from sklearn.metrics import classification_report

print('Accuracy of LR Classifier:{}'.format(lr.score(X_test,y_test)))
print('Accuracy of SGD Classifier:{}'.format(sgdc.score(X_test,y_test)))

print(classification_report(y_test, lr_y_predict, target_names=['Benign','Malignant']))
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign','Malignant']))

Accuracy of LR Classifier:0.9883040935672515
Accuracy of SGD Classifier:0.9824561403508771
              precision    recall  f1-score   support

      Benign       0.99      0.99      0.99       100
   Malignant       0.99      0.99      0.99        71

    accuracy                           0.99       171
   macro avg       0.99      0.99      0.99       171
weighted avg       0.99      0.99      0.99       171

              precision    recall  f1-score   support

      Benign       0.98      0.99      0.99       100
   Malignant       0.99      0.97      0.98        71

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171
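The precision and recall columns in the reports above come from counting the four kinds of outcome in a confusion matrix. The sketch below recomputes them by hand on a small set of hypothetical labels (not taken from the run above), using the dataset's encoding of 2 for benign and 4 for malignant and treating malignant as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 2 = benign, 4 = malignant.
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 4, 4, 2])
y_pred = np.array([2, 2, 4, 2, 4, 4, 2, 4, 4, 2])

# Rows are true classes, columns are predicted classes, in the order [2, 4].
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
tn, fp, fn, tp = cm.ravel()  # malignant (4) is treated as the positive class

precision = tp / (tp + fp)       # of the predicted malignant, how many were malignant
recall = tp / (tp + fn)          # of the truly malignant, how many were found
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were correct

print(cm)
print(precision, recall, accuracy)
```

For a diagnostic task, the recall of the malignant class is usually the number to watch, since a false negative means a malignant tumor was labeled benign.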

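Summary

This article walks through the code for classifying breast cancer tumors as benign or malignant and presents the resulting output.

The most important part is the classification method. This article uses two classifiers: LogisticRegression (the logistic regression model) and SGDClassifier. Both are used through direct API calls; the concrete implementation of the two algorithms is left for further study.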
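As a starting point for that further study, here is a minimal NumPy sketch of logistic regression trained with per-sample (stochastic) gradient steps. It is not how scikit-learn implements either class: LogisticRegression uses dedicated solvers, and SGDClassifier with its default settings optimizes a hinge loss (a linear SVM) rather than the log loss. The function name fit_logistic_sgd, the learning rate, the epoch count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, lr=0.1, epochs=50, seed=0):
    """Fit weights and bias with per-sample gradient steps on the log loss.

    y must contain 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a new random order each epoch.
        for i in rng.permutation(len(y)):
            p = sigmoid(X[i] @ w + b)  # predicted probability of class 1
            error = p - y[i]           # gradient of the log loss w.r.t. the linear score
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# Synthetic toy data: two well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = fit_logistic_sgd(X, y)
pred = (sigmoid(X @ w + b) >= 0.5).astype(int)
accuracy = (pred == y).mean()
print(accuracy)
```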