ML-Classification-2: Benign vs. Malignant Breast Cancer Tumor Classification

Source code

# ML-Classifier-2-breast_cancer_wisconsin
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo

# fetch dataset (id=17 is the Diagnostic dataset; X and y below are not used by the rest of this script)
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
#print(breast_cancer_wisconsin_diagnostic.metadata)

# variable information
#print(breast_cancer_wisconsin_diagnostic.variables)

column_names = ['Sample code number',
'Clump Thickness',
'Uniformity of Cell Size',
'Uniformity of Cell Shape',
'Marginal Adhesion',
'Single Epithelial Cell Size',
'Bare Nuclei',
'Bland Chromatin',
'Normal Nucleoli',
'Mitoses',
'Class']

data = pd.read_csv('/home/zero/下载/breast+cancer+wisconsin+original/breast-cancer-wisconsin.data',names=column_names)
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna(how='any')
print(data.shape)
print(data)

# prepare for training
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data[column_names[1:10]],
data[column_names[10]],
test_size = 0.25, random_state=33)
print("X_train shape:{}".format(X_train.shape))
print("Y_train shape:{}".format(y_train.shape))
print("X_test shape:{}".format(X_test.shape))
print("Y_test shape:{}".format(y_test.shape))
print(y_train.value_counts())
print(y_test.value_counts())

# try LogisticRegression and SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
sgdc = SGDClassifier()

lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)

print(lr_y_predict)
print(sgdc_y_predict)

# accuracy
from sklearn.metrics import classification_report

print('Accuracy of LR Classifier:{}'.format(lr.score(X_test,y_test)))
print('Accuracy of SGD Classifier:{}'.format(sgdc.score(X_test,y_test)))

print(classification_report(y_test, lr_y_predict, target_names=['Benign','Malignant']))
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign','Malignant']))

output

/home/zero/anaconda3/envs/AI/bin/python /home/zero/PycharmProjects/ML/main.py
--------------------------------------------------
(683, 11)
     Sample code number  Clump Thickness  ...  Mitoses  Class
0               1000025                5  ...        1      2
1               1002945                5  ...        1      2
2               1015425                3  ...        1      2
3               1016277                6  ...        1      2
4               1017023                4  ...        1      2
..                  ...              ...  ...      ...    ...
694              776715                3  ...        1      2
695              841769                2  ...        1      2
696              888820                5  ...        2      4
697              897471                4  ...        1      4
698              897471                4  ...        1      4

[683 rows x 11 columns]
X_train shape:(512, 9)
Y_train shape:(512,)
X_test shape:(171, 9)
Y_test shape:(171,)
Class
2 344
4 168
Name: count, dtype: int64
Class
2 100
4 71
Name: count, dtype: int64
[2 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 2 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 2
2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
2 2 2 2 2 2 4 2 2 4 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
[4 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 4 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 4
2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
2 2 2 2 2 2 4 2 2 4 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
Accuracy of LR Classifier:0.9883040935672515
Accuracy of SGD Classifier:0.9824561403508771
              precision    recall  f1-score   support

      Benign       0.99      0.99      0.99       100
   Malignant       0.99      0.99      0.99        71

    accuracy                           0.99       171
   macro avg       0.99      0.99      0.99       171
weighted avg       0.99      0.99      0.99       171

              precision    recall  f1-score   support

      Benign       1.00      0.97      0.98       100
   Malignant       0.96      1.00      0.98        71

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171


Process finished with exit code 0

Code walkthrough

1. Getting to know the dataset

https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original

Breast Cancer Wisconsin (Original)

Additional Information

Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:

Group 1: 367 instances (January 1989)

Group 2: 70 instances (October 1989)

Group 3: 31 instances (February 1990)

Group 4: 17 instances (April 1990)

Group 5: 48 instances (August 1990)

Group 6: 49 instances (Updated January 1991)

Group 7: 31 instances (June 1991)

Group 8: 86 instances (November 1991)


Total: 699 points (as of the donated database on 15 July 1992)

Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has only 367 instances. This is because it originally contained 369 instances; 2 were removed. The following statements summarize changes to the original Group 1's set of data:

##### Group 1 : 367 points: 200B 167M (January 1989)

##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805

##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record

##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial

##### : Changed 0 to 1 in field 6 of sample 1219406

##### : Changed 0 to 1 in field 8 of following sample:

##### : 1182404,2,3,1,1,1,2,0,1,1,1

Donated on 7/14/1992

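Printing the dataset information with code. Note that this snippet fetches id=17, which is Breast Cancer Wisconsin (Diagnostic), a different dataset from the Original one (dataset 15) described above and used in the rest of this article.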

from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
print(breast_cancer_wisconsin_diagnostic.metadata)

# variable information
print(breast_cancer_wisconsin_diagnostic.variables)

output

------output------
{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'published_in': 'Electronic imaging', 'year': 1993, 'url': 'https://www.semanticscholar.org/paper/53f0fbb425bc14468eb3bf96b2e1d41ba8087f36', 'doi': '10.1117/12.148698'}, 'additional_info': {'summary': 'Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/\r\n\r\nSeparating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.\r\n\r\nThe actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. 
Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].\r\n\r\nThis database is also available through the UW CS ftp server:\r\nftp ftp.cs.wisc.edu\r\ncd math-prog/cpo-dataset/machine-learn/WDBC/', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': '1) ID number\r\n2) Diagnosis (M = malignant, B = benign)\r\n3-32)\r\n\r\nTen real-valued features are computed for each cell nucleus:\r\n\r\n\ta) radius (mean of distances from center to points on the perimeter)\r\n\tb) texture (standard deviation of gray-scale values)\r\n\tc) perimeter\r\n\td) area\r\n\te) smoothness (local variation in radius lengths)\r\n\tf) compactness (perimeter^2 / area - 1.0)\r\n\tg) concavity (severity of concave portions of the contour)\r\n\th) concave points (number of concave portions of the contour)\r\n\ti) symmetry \r\n\tj) fractal dimension ("coastline approximation" - 1)', 'citation': None}}
                  name     role         type  ...  description  units  missing_values
0                   ID       ID  Categorical  ...         None   None              no
1            Diagnosis   Target  Categorical  ...         None   None              no
2              radius1  Feature   Continuous  ...         None   None              no
3             texture1  Feature   Continuous  ...         None   None              no
4           perimeter1  Feature   Continuous  ...         None   None              no
5                area1  Feature   Continuous  ...         None   None              no
6          smoothness1  Feature   Continuous  ...         None   None              no
7         compactness1  Feature   Continuous  ...         None   None              no
8           concavity1  Feature   Continuous  ...         None   None              no
9      concave_points1  Feature   Continuous  ...         None   None              no
10           symmetry1  Feature   Continuous  ...         None   None              no
11  fractal_dimension1  Feature   Continuous  ...         None   None              no
12             radius2  Feature   Continuous  ...         None   None              no
13            texture2  Feature   Continuous  ...         None   None              no
14          perimeter2  Feature   Continuous  ...         None   None              no
15               area2  Feature   Continuous  ...         None   None              no
16         smoothness2  Feature   Continuous  ...         None   None              no
17        compactness2  Feature   Continuous  ...         None   None              no
18          concavity2  Feature   Continuous  ...         None   None              no
19     concave_points2  Feature   Continuous  ...         None   None              no
20           symmetry2  Feature   Continuous  ...         None   None              no
21  fractal_dimension2  Feature   Continuous  ...         None   None              no
22             radius3  Feature   Continuous  ...         None   None              no
23            texture3  Feature   Continuous  ...         None   None              no
24          perimeter3  Feature   Continuous  ...         None   None              no
25               area3  Feature   Continuous  ...         None   None              no
26         smoothness3  Feature   Continuous  ...         None   None              no
27        compactness3  Feature   Continuous  ...         None   None              no
28          concavity3  Feature   Continuous  ...         None   None              no
29     concave_points3  Feature   Continuous  ...         None   None              no
30           symmetry3  Feature   Continuous  ...         None   None              no
31  fractal_dimension3  Feature   Continuous  ...         None   None              no

[32 rows x 7 columns]

2. Loading and preprocessing the dataset

import pandas as pd
import numpy as np

# Create the list of column names
column_names = ['Sample code number',
'Clump Thickness',
'Uniformity of Cell Size',
'Uniformity of Cell Shape',
'Marginal Adhesion',
'Single Epithelial Cell Size',
'Bare Nuclei',
'Bland Chromatin',
'Normal Nucleoli',
'Mitoses',
'Class']

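# Read the data file with pandas.read_csv
data = pd.read_csv('/home/zero/下载/breast+cancer+wisconsin+original/breast-cancer-wisconsin.data',names=column_names)
# Replace '?' with the standard missing-value marker (NaN)
data = data.replace(to_replace='?', value=np.nan)
# Drop every row that contains a missing value
data = data.dropna(how='any')
# Print the shape of data (number of rows and columns)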
print(data.shape)
print(data)
------output------
(683, 11)  # number of samples and columns
     Sample code number  Clump Thickness  ...  Mitoses  Class
0               1000025                5  ...        1      2
1               1002945                5  ...        1      2
2               1015425                3  ...        1      2
3               1016277                6  ...        1      2
4               1017023                4  ...        1      2
..                  ...              ...  ...      ...    ...
694              776715                3  ...        1      2
695              841769                2  ...        1      2
696              888820                5  ...        2      4
697              897471                4  ...        1      4
698              897471                4  ...        1      4

[683 rows x 11 columns]  # dimensions
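
The replace-then-drop step above can also be folded into the read itself. The following is a minimal sketch, assuming a few made-up rows in the same comma-separated layout (they are not taken from the real file): passing `na_values='?'` to `pd.read_csv` marks `?` as missing while parsing, so the `Bare Nuclei` column is read as numbers rather than strings.

```python
import io

import pandas as pd

# Hypothetical rows in the same layout as the data file; the values are
# invented for illustration and do not come from the real dataset.
sample_csv = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,10,10,8,7,?,9,7,1,4\n"
    "3,4,2,1,1,2,3,3,1,1,2\n"
)

column_names = ['Sample code number', 'Clump Thickness',
                'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
                'Mitoses', 'Class']

# na_values='?' turns every '?' into NaN during parsing.
sample = pd.read_csv(sample_csv, names=column_names, na_values='?')
# Drop the row that contains the missing value.
sample = sample.dropna(how='any')

print(sample.shape)
print(sample['Bare Nuclei'].dtype)
```

With this variant the column has a numeric dtype from the start, which avoids relying on later steps to convert string digits.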

3. Splitting and preparing the dataset

Split the original dataset into a training set and a test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data[column_names[1:10]],
data[column_names[10]],
test_size = 0.25, random_state=33)
print("X_train shape:{}".format(X_train.shape))
print("Y_train shape:{}".format(y_train.shape))
print("X_test shape:{}".format(X_test.shape))
print("Y_test shape:{}".format(y_test.shape))
print(y_train.value_counts())
print(y_test.value_counts())

------output------
X_train shape:(512, 9)
Y_train shape:(512,)
X_test shape:(171, 9)
Y_test shape:(171,)
Class
2 344
4 168
Name: count, dtype: int64
Class
2 100
4 71
Name: count, dtype: int64
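
In the output above, class 4 makes up 168 of 512 training samples (about 33%) but 71 of 171 test samples (about 42%), so the random split did not preserve the class ratio. One possible remedy is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data with the same class counts (444 and 239); the names `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 683 samples, 9 integer features in 1..10,
# and labels 2/4 with the same class counts as the cleaned dataset.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(683, 9))
y_demo = np.array([2] * 444 + [4] * 239)

# stratify=y_demo keeps the class proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33, stratify=y_demo)

train_frac = (y_tr == 4).mean()
test_frac = (y_te == 4).mean()
print(len(y_tr), len(y_te))
print(train_frac, test_frac)
```

Whether stratification changes the reported accuracy on the real data is not something this sketch shows; it only illustrates how the class shares in the two parts can be kept aligned.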

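4. Training the models (LogisticRegression and SGDClassifier) and making predictions

Standardize the features first: the scaler is fit on the training set only and then applied unchanged to the test set.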
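
As a small illustration of what standardization does, the sketch below compares `StandardScaler` with the by-hand computation z = (x - mean) / std on made-up numbers. The arrays `train` and `test` are hypothetical and unrelated to the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up training and test matrices with two features each.
train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
test = np.array([[3.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same computation by hand, using the population standard deviation.
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(train_scaled.mean(axis=0))
print(test_scaled)
print(manual_test)
```

Using the training statistics for the test set keeps information about the test data out of the preprocessing step.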

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

lr = LogisticRegression()
sgdc = SGDClassifier()

lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)

print(lr_y_predict)
print(sgdc_y_predict)

------output------
[2 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 2 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 2
2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
2 2 2 2 2 2 4 2 2 4 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
[2 2 4 4 2 2 2 4 2 2 2 2 4 2 4 4 4 4 4 2 2 4 4 2 4 4 2 2 4 4 4 4 4 4 4 4 2
4 4 4 4 4 2 4 2 2 4 2 2 4 4 2 2 2 4 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 4 2 2 2
2 2 2 4 4 2 2 2 4 2 2 2 4 2 4 2 4 4 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 2 2 4
2 2 2 2 2 2 4 2 2 2 4 2 4 2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 2 4 2 4 2 2 2 2
2 4 4 2 4 4 2 4 2 2 2 2 4 4 4 2 4 2 2 4 2 4 4]
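
The SGDClassifier predictions printed here differ in a few positions from those in the full listing at the top of the article. That is consistent with `SGDClassifier` shuffling the training data randomly when no `random_state` is given. A minimal sketch of one way to make runs repeatable follows, assuming synthetic stand-in data (`X_demo`, `y_demo`) and using `make_pipeline` to chain scaling and classification; it is an illustration, not the article's original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the real dataset:
# two noisy clusters labelled 2 and 4, with 9 features each.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(2.0, 1.0, size=(100, 9)),
                    rng.normal(7.0, 1.0, size=(100, 9))])
y_demo = np.array([2] * 100 + [4] * 100)

# Each pipeline standardizes the features and then fits the classifier.
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression())
sgd_pipe_a = make_pipeline(StandardScaler(), SGDClassifier(random_state=33))
sgd_pipe_b = make_pipeline(StandardScaler(), SGDClassifier(random_state=33))

lr_pipe.fit(X_demo, y_demo)
pred_a = sgd_pipe_a.fit(X_demo, y_demo).predict(X_demo)
pred_b = sgd_pipe_b.fit(X_demo, y_demo).predict(X_demo)

# With the same random_state, the two SGD runs should agree exactly.
print(np.array_equal(pred_a, pred_b))
print(lr_pipe.score(X_demo, y_demo))
```

The value 33 simply mirrors the `random_state` used for the split; any fixed integer would serve the same purpose.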

5. Evaluating the models

# accuracy
from sklearn.metrics import classification_report

print('Accuracy of LR Classifier:{}'.format(lr.score(X_test,y_test)))
print('Accuracy of SGD Classifier:{}'.format(sgdc.score(X_test,y_test)))

print(classification_report(y_test, lr_y_predict, target_names=['Benign','Malignant']))
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign','Malignant']))
------output------
Accuracy of LR Classifier:0.9883040935672515
Accuracy of SGD Classifier:0.9824561403508771
              precision    recall  f1-score   support

      Benign       0.99      0.99      0.99       100
   Malignant       0.99      0.99      0.99        71

    accuracy                           0.99       171
   macro avg       0.99      0.99      0.99       171
weighted avg       0.99      0.99      0.99       171

              precision    recall  f1-score   support

      Benign       0.98      0.99      0.99       100
   Malignant       0.99      0.97      0.98        71

    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171
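
The precision and recall columns in these reports come from counts of correct and incorrect predictions per class. The sketch below derives them for the malignant class (label 4) from a confusion matrix, using made-up label arrays (`y_true`, `y_pred`) rather than the article's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (2 = benign, 4 = malignant).
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 2, 4, 4])
y_pred = np.array([2, 2, 4, 2, 4, 4, 2, 2, 4, 4])

# Rows are true classes, columns are predicted classes, in the order [2, 4].
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
tn, fp, fn, tp = cm.ravel()

# Precision: share of predicted malignant cases that are truly malignant.
precision = tp / (tp + fp)
# Recall: share of truly malignant cases that were predicted malignant.
recall = tp / (tp + fn)

print(cm)
print(precision, precision_score(y_true, y_pred, pos_label=4))
print(recall, recall_score(y_true, y_pred, pos_label=4))
```

For a tumor classifier, recall on the malignant class is often the number to watch, since a false negative means a malignant case reported as benign.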

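Summary

This article walks through the code and the data for the benign/malignant breast cancer tumor classification problem.

The most important part is the classification method. Two classifiers are used here: LogisticRegression (the logistic regression model) and SGDClassifier. Both are applied through direct API calls; how the two algorithms are actually implemented is left for further study.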
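
As a starting point for that further study, here is a minimal NumPy sketch of logistic regression trained with stochastic gradient descent. It is an illustration under stated assumptions, not how scikit-learn implements either estimator: the function name `fit_logistic_sgd`, the learning rate, the epoch count and the synthetic data are all hypothetical. Note also that `SGDClassifier` uses the hinge loss by default, whereas this sketch uses the log-loss.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_sgd(X, y, lr=0.1, epochs=50, seed=0):
    """Fit weights w and bias b by SGD on the log-loss; y must be 0/1."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a new random order each epoch.
        for i in rng.permutation(n_samples):
            p = sigmoid(X[i] @ w + b)
            grad = p - y[i]  # derivative of the log-loss w.r.t. the score
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b


# Synthetic stand-in data: two well-separated 2-D clusters labelled 0 and 1.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
                    rng.normal(2.0, 1.0, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

w, b = fit_logistic_sgd(X_demo, y_demo)
pred = (sigmoid(X_demo @ w + b) >= 0.5).astype(int)
accuracy = (pred == y_demo).mean()
print(accuracy)
```

Each update moves the weights against the gradient of the loss for a single sample, which is the core idea that the SGD in `SGDClassifier` builds on.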