data = pd.read_csv('/home/zero/下载/breast+cancer+wisconsin+original/breast-cancer-wisconsin.data',names=column_names) data = data.replace(to_replace='?', value=np.nan) data = data.dropna(how='any') print(data.shape) print(data)
# repare for the training #from sklearn.cross_validation import train_test_split from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( data[column_names[1:10]], data[column_names[10]], test_size = 0.25, random_state=33) print("X_train shape:{}".format(X_train.shape)) print("Y_train shape:{}".format(y_train.shape)) print("X_test shape:{}".format(X_test.shape)) print("Y_test shape:{}".format(y_test.shape)) print(y_train.value_counts()) print(y_test.value_counts())
# program try with Logistic and SGDClassifier from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression from sklearn.linear_model import SGDClassifier
ss = StandardScaler() X_train = ss.fit_transform(X_train) X_test = ss.transform(X_test)
Samples arrive periodically as Dr. Wolberg reports his clinical
cases. The database therefore reflects this chronological grouping of
the data. This grouping information appears immediately below, having
been removed from the data itself:
Group 1: 367 instances (January 1989)
Group 2: 70 instances (October 1989)
Group 3: 31 instances (February 1990)
Group 4: 17 instances (April 1990)
Group 5: 48 instances (August 1990)
Group 6: 49 instances (Updated January 1991)
Group 7: 31 instances (June 1991)
Group 8: 86 instances (November 1991)
Total: 699 points (as of the donated datbase on 15 July 1992)
Note that the results summarized above in Past Usage refer to a
dataset of size 369, while Group 1 has only 367 instances. This is
because it originally contained 369 instances; 2 were removed. The
following statements summarizes changes to the original Group 1's set of
data:
##### Group 1 : 367 points: 200B 167M (January 1989)
##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185
& 1187805
##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no
record
##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
##### : Changed 0 to 1 in field 6 of sample 1219406
##### : Changed 0 to 1 in field 8 of following sample:
------output------ {'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'published_in': 'Electronic imaging', 'year': 1993, 'url': 'https://www.semanticscholar.org/paper/53f0fbb425bc14468eb3bf96b2e1d41ba8087f36', 'doi': '10.1117/12.148698'}, 'additional_info': {'summary': 'Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/\r\n\r\nSeparating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.\r\n\r\nThe actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].\r\n\r\nThis database is also available through the UW CS ftp server:\r\nftp ftp.cs.wisc.edu\r\ncd math-prog/cpo-dataset/machine-learn/WDBC/', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': '1) ID number\r\n2) Diagnosis (M = malignant, B = benign)\r\n3-32)\r\n\r\nTen real-valued features are computed for each cell nucleus:\r\n\r\n\ta) radius (mean of distances from center to points on the perimeter)\r\n\tb) texture (standard deviation of gray-scale values)\r\n\tc) perimeter\r\n\td) area\r\n\te) smoothness (local variation in radius lengths)\r\n\tf) compactness (perimeter^2 / area - 1.0)\r\n\tg) concavity (severity of concave portions of the contour)\r\n\th) concave points (number of concave portions of the contour)\r\n\ti) symmetry \r\n\tj) fractal dimension ("coastline approximation" - 1)', 'citation': None}} name role type ... description units missing_values 0 ID ID Categorical ... None None no 1 Diagnosis Target Categorical ... None None no 2 radius1 Feature Continuous ... None None no 3 texture1 Feature Continuous ... None None no 4 perimeter1 Feature Continuous ... None None no 5 area1 Feature Continuous ... None None no 6 smoothness1 Feature Continuous ... None None no 7 compactness1 Feature Continuous ... None None no 8 concavity1 Feature Continuous ... None None no 9 concave_points1 Feature Continuous ... None None no 10 symmetry1 Feature Continuous ... None None no 11 fractal_dimension1 Feature Continuous ... None None no 12 radius2 Feature Continuous ... None None no 13 texture2 Feature Continuous ... None None no 14 perimeter2 Feature Continuous ... None None no 15 area2 Feature Continuous ... None None no 16 smoothness2 Feature Continuous ... None None no 17 compactness2 Feature Continuous ... None None no 18 concavity2 Feature Continuous ... None None no 19 concave_points2 Feature Continuous ... None None no 20 symmetry2 Feature Continuous ... None None no 21 fractal_dimension2 Feature Continuous ... None None no 22 radius3 Feature Continuous ... None None no 23 texture3 Feature Continuous ... None None no 24 perimeter3 Feature Continuous ... None None no 25 area3 Feature Continuous ... None None no 26 smoothness3 Feature Continuous ... None None no 27 compactness3 Feature Continuous ... None None no 28 concavity3 Feature Continuous ... None None no 29 concave_points3 Feature Continuous ... None None no 30 symmetry3 Feature Continuous ... None None no 31 fractal_dimension3 Feature Continuous ... None None no