Cancer cell classification is a crucial task in medical diagnostics, aiding in the early detection and treatment of various cancers. Machine learning techniques, particularly those provided by Scikit-learn, can significantly improve the accuracy and efficiency of these classifications. In this article, we will walk through how to use Python’s Scikit-learn library to classify cancer cells.
Understanding the Dataset
For this tutorial, we’ll use the Breast Cancer Wisconsin dataset, which is available in Scikit-learn’s datasets module. This dataset includes various features derived from digitized images of fine needle aspirate (FNA) of breast masses. Each instance has 30 features and a target variable indicating whether the tumor is malignant or benign.
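For orientation, a two-line check confirms the dataset’s size and label encoding (the class names come straight from Scikit-learn’s bundled copy):

```python
from sklearn.datasets import load_breast_cancer

# Quick look at the bundled dataset: 569 samples, 30 numeric features,
# and two classes encoded as 0 = malignant, 1 = benign.
data = load_breast_cancer()
print(data.data.shape)     # (569, 30)
print(data.target_names)   # ['malignant' 'benign']
```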
Steps to Classify Cancer Cells
- Import Required Libraries
- Load and Explore the Dataset
- Data Preprocessing
- Split the Data into Training and Test Sets
- Train a Classification Model
- Evaluate the Model
- Make Predictions
Let’s go through each of these steps in detail.
Step 1: Import Required Libraries
First, we need to import the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Step 2: Load and Explore the Dataset
Next, we load the Breast Cancer Wisconsin dataset and take a look at its structure.
# Load the dataset
cancer_data = load_breast_cancer()
# Convert to a DataFrame
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target
# Display the first few rows
print(df.head())
Step 3: Data Preprocessing
Before training the model, we need to preprocess the data. This involves scaling the features to ensure that each feature contributes equally to the result.
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 4: Train a Classification Model
We will use the K-Nearest Neighbors (KNN) classifier for this task. KNN is a simple yet effective classification algorithm: it labels a sample by majority vote among its closest training points.
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
# Train the model
knn.fit(X_train, y_train)
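n_neighbors=5 is a common default, but k is worth tuning. As a sketch (separate from the tutorial’s train/test split), a pipeline plus cross-validation compares a few candidate values of k; the pipeline re-fits the scaler inside each fold, so no scaling statistics leak from the held-out fold into training:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Compare a few values of k with 5-fold cross-validation.
X, y = load_breast_cancer(return_X_y=True)
results = {}
for k in (3, 5, 7, 9):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy {results[k]:.3f}")
```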
Step 5: Evaluate the Model
After training the model, we need to evaluate its performance using the test set.
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
Step 6: Make Predictions
Finally, we can use the trained model to make predictions on new data.
# Example of making a prediction on a new sample (as a DataFrame with the
# original feature names, matching the data the scaler was fitted on)
sample_data = pd.DataFrame([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
                             1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
                             25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]],
                           columns=cancer_data.feature_names)
# Standardize the sample data
sample_data = scaler.transform(sample_data)
# Make the prediction
prediction = knn.predict(sample_data)
print("Prediction (0 = malignant, 1 = benign):", prediction)
Full Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load the dataset
cancer_data = load_breast_cancer()
# Convert to a DataFrame
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target
# Display the first few rows
print(df.head())
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
# Train the model
knn.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
# Example of making a prediction on a new sample
sample_data = pd.DataFrame([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]],
columns=cancer_data.feature_names)
# Standardize the sample data
sample_data = scaler.transform(sample_data)
# Make the prediction
prediction = knn.predict(sample_data)
print("Prediction (0 = malignant, 1 = benign):", prediction)
Output:

Let us check out the explanation of the above code first and then the explanation of the output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
These lines import various libraries needed for the task:
- numpy and pandas for data handling and manipulation.
- matplotlib.pyplot and seaborn for data visualization.
- sklearn.datasets, sklearn.model_selection, sklearn.preprocessing, sklearn.neighbors, and sklearn.metrics for machine learning tasks, including loading datasets, splitting data, scaling features, building models, and evaluating models.
# Load the dataset
cancer_data = load_breast_cancer()
This line loads the breast cancer dataset from Scikit-learn, which contains data about breast cancer cases including features and labels indicating whether a tumor is malignant or benign.
# Convert to a DataFrame
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target
Here, the dataset is converted into a pandas DataFrame for easier data manipulation. The features of the dataset are stored in df, and a new column called target is added to the DataFrame, containing the labels (0 for malignant and 1 for benign).
# Display the first few rows
print(df.head())
This line prints the first few rows of the DataFrame to give a quick look at the data structure and content.
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']
The features (data points) and the target variable (labels) are separated into X and y. X contains all the columns except target, while y contains only the target column.
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The data is split into training and test sets using an 80-20 split. 80% of the data is used for training the model, and 20% is used for testing it. The random_state=42 ensures reproducibility of the split.
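A toy illustration of one optional refinement (not used in the tutorial code): passing stratify=y to train_test_split preserves the class ratio in both the train and test portions. X_demo and y_demo below are made-up data, not the cancer dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with a 50/50 class balance.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# stratify=y_demo keeps the 50/50 ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # both 0.5: one sample of each class in the test set
```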
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
The features are standardized to have a mean of 0 and a standard deviation of 1, which helps improve the performance of many machine learning algorithms. The scaler is fitted on the training data and then applied to both the training and test data.
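This property is easy to verify on a tiny array (toy numbers, not the cancer data): after fitting and transforming, every column has mean approximately 0 and standard deviation approximately 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```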
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
A K-Nearest Neighbors classifier is initialized with n_neighbors=5, meaning the algorithm will consider the 5 nearest data points when making a prediction.
# Train the model
knn.fit(X_train, y_train)
The KNN model is trained using the training data (X_train and y_train).
# Make predictions on the test set
y_pred = knn.predict(X_test)
The trained model makes predictions on the test data.
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
The model’s performance is evaluated using several metrics:
- Confusion Matrix: A table that describes the performance of a classification model.
- Classification Report: Provides precision, recall, F1 score, and support for the model.
- Accuracy Score: The ratio of correctly predicted instances to the total instances.
# Example of making a prediction on a new sample
sample_data = pd.DataFrame([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]],
columns=cancer_data.feature_names)
A new sample of data is created as a pandas DataFrame. This sample is structured similarly to the training data, with the same feature names.
# Standardize the sample data
sample_data = scaler.transform(sample_data)
The new sample data is standardized using the same scaler that was fitted on the training data.
# Make the prediction
prediction = knn.predict(sample_data)
print("Prediction (0 = malignant, 1 = benign):", prediction)
The model makes a prediction on the new sample data, and the prediction is printed out. The output indicates whether the tumor is predicted to be malignant (0) or benign (1).
Now, let us understand the output:
DataFrame Display
   mean radius  mean texture  ...  worst fractal dimension  target
0        17.99         10.38  ...                  0.11890       0
1        20.57         17.77  ...                  0.08902       0
2        19.69         21.25  ...                  0.08758       0
3        11.42         20.38  ...                  0.17300       0
4        20.29         14.34  ...                  0.07678       0

[5 rows x 31 columns]
This part shows the first 5 rows of the DataFrame containing the breast cancer dataset. Each row represents a different patient’s tumor data. Each column represents a different feature measured from the tumor, such as “mean radius”, “mean texture”, etc. The column “target” indicates whether the tumor is malignant (0) or benign (1). There are 31 columns in total (30 features plus the target).
Confusion Matrix
Confusion Matrix:
[[40  3]
 [ 3 68]]
The confusion matrix is a table used to evaluate the performance of a classification model. It compares the predicted classifications to the actual classifications: each row corresponds to an actual class (row 0 = malignant, row 1 = benign) and each column to a predicted class. Treating malignant (class 0) as the positive class:
- The first row, [40 3], tells us: 40 malignant cases were correctly predicted as malignant (true positives), and 3 were incorrectly predicted as benign (false negatives).
- The second row, [3 68], tells us: 3 benign cases were incorrectly predicted as malignant (false positives), and 68 were correctly predicted as benign (true negatives).
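These counts also let us reproduce the per-class metrics by hand. A small arithmetic check, using the counts from the matrix above and treating malignant (class 0) as the positive class:

```python
# Counts read straight off the confusion matrix.
cm = [[40, 3],
      [3, 68]]
tp, fn = cm[0]   # actual malignant: predicted malignant / predicted benign
fp, tn = cm[1]   # actual benign:    predicted malignant / predicted benign

recall_malignant = tp / (tp + fn)     # 40 / 43
precision_malignant = tp / (tp + fp)  # 40 / 43
print(round(recall_malignant, 2))     # 0.93
print(round(precision_malignant, 2))  # 0.93
```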
Classification Report
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114
The classification report provides several metrics to evaluate the model’s performance:
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- For malignant tumors (0): 0.93 (93%)
- For benign tumors (1): 0.96 (96%)
- Recall: The ratio of correctly predicted positive observations to all observations in the actual class.
- For malignant tumors (0): 0.93 (93%)
- For benign tumors (1): 0.96 (96%)
- F1-Score: The harmonic mean of precision and recall. It considers both false positives and false negatives.
- For malignant tumors (0): 0.93 (93%)
- For benign tumors (1): 0.96 (96%)
- Support: The number of actual occurrences of the class in the dataset.
- For malignant tumors (0): 43
- For benign tumors (1): 71
- Accuracy: The overall ratio of correctly predicted observations to the total observations: 0.95 (95%).
- Macro Average: The average of precision, recall, and F1-score for both classes (0 and 1).
- Precision: 0.94 (94%)
- Recall: 0.94 (94%)
- F1-Score: 0.94 (94%)
- Weighted Average: The average of precision, recall, and F1-score, weighted by the number of true instances for each label.
- Precision: 0.95 (95%)
- Recall: 0.95 (95%)
- F1-Score: 0.95 (95%)
Accuracy Score
Accuracy Score:
0.9473684210526315
The accuracy score is the proportion of true results (both true positives and true negatives) among the total number of cases examined. Here, it is approximately 0.947 (94.74%), indicating the model correctly classified about 94.74% of the test cases.
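The same figure can be recovered by hand from the confusion matrix: correct predictions sit on the diagonal, and the total is all test samples.

```python
# Accuracy is the confusion-matrix diagonal over the total count.
correct = 40 + 68          # true positives + true negatives
total = 40 + 3 + 3 + 68    # all 114 test samples
print(correct / total)     # 0.9473684210526315
```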
Prediction on New Data
Prediction (0 = malignant, 1 = benign): [0]
This line shows the prediction result for a new sample of tumor data. The model predicted [0], which means the tumor is classified as malignant.
Conclusion
In this article, we explored how to use Python’s Scikit-learn library to classify cancer cells using the Breast Cancer Wisconsin dataset. We went through the steps of loading and exploring the dataset, preprocessing the data, training a K-Nearest Neighbors classifier, evaluating the model, and making predictions. Machine learning techniques like these can play a vital role in medical diagnostics, potentially saving lives by enabling early detection of diseases.