Cancer cell classification is a crucial task in medical diagnostics, aiding in the early detection and treatment of various cancers. Machine learning techniques, particularly those provided by Scikit-learn, can significantly improve the accuracy and efficiency of these classifications. In this article, we will walk through how to use Python’s Scikit-learn library to classify cancer cells.
Understanding the Dataset
For this tutorial, we’ll use the Breast Cancer Wisconsin dataset, which is available in Scikit-learn’s datasets module. This dataset includes various features derived from digitized images of fine needle aspirate (FNA) of breast masses. Each instance has 30 features and a target variable indicating whether the tumor is malignant or benign.
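For orientation, a two-line check confirms the dataset’s size and label encoding (the class names come straight from Scikit-learn’s bundled copy):

```python
from sklearn.datasets import load_breast_cancer

# Quick look at the bundled dataset: 569 samples, 30 numeric features,
# and two classes encoded as 0 = malignant, 1 = benign.
data = load_breast_cancer()
print(data.data.shape)     # (569, 30)
print(data.target_names)   # ['malignant' 'benign']
```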
Steps to Classify Cancer Cells
- Import Required Libraries
- Load and Explore the Dataset
- Data Preprocessing
- Split the Data into Training and Test Sets
- Train a Classification Model
- Evaluate the Model
- Make Predictions
Let’s go through each of these steps in detail.
Step 1: Import Required Libraries
First, we need to import the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Step 2: Load and Explore the Dataset
Next, we load the Breast Cancer Wisconsin dataset and take a look at its structure.
# Load the dataset
cancer_data = load_breast_cancer()
# Convert to a DataFrame
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target
# Display the first few rows
print(df.head())
Step 3: Data Preprocessing
Before training the model, we need to preprocess the data. This involves scaling the features to ensure that each feature contributes equally to the result.
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 4: Train a Classification Model
We will use the K-Nearest Neighbors (KNN) classifier for this task. KNN is a simple yet effective classification algorithm: it labels a sample by majority vote among its closest training points.
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
# Train the model
knn.fit(X_train, y_train)
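n_neighbors=5 is a common default, but k is worth tuning. As a sketch (separate from the tutorial’s train/test split), a pipeline plus cross-validation compares a few candidate values of k; the pipeline re-fits the scaler inside each fold, so no scaling statistics leak from the held-out fold into training:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Compare a few values of k with 5-fold cross-validation.
X, y = load_breast_cancer(return_X_y=True)
results = {}
for k in (3, 5, 7, 9):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy {results[k]:.3f}")
```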
Step 5: Evaluate the Model
After training the model, we need to evaluate its performance using the test set.
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
Step 6: Make Predictions
Finally, we can use the trained model to make predictions on new data.
# Example of making a prediction on a new sample (as a DataFrame with the
# original feature names, matching the data the scaler was fitted on)
sample_data = pd.DataFrame([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
                             1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
                             25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]],
                           columns=cancer_data.feature_names)
# Standardize the sample data
sample_data = scaler.transform(sample_data)
# Make the prediction
prediction = knn.predict(sample_data)
print("Prediction (0 = malignant, 1 = benign):", prediction)
Full Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load the dataset
cancer_data = load_breast_cancer()
# Convert to a DataFrame
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target
# Display the first few rows
print(df.head())
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
# Train the model
knn.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
# Example of making a prediction on a new sample
sample_data = pd.DataFrame([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]],
columns=cancer_data.feature_names)
# Standardize the sample data
sample_data = scaler.transform(sample_data)
# Make the prediction
prediction = knn.predict(sample_data)
print("Prediction (0 = malignant, 1 = benign):", prediction)
Output:

Let us check out the explanation of the above code first and then the explanation of the output:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
These lines import various libraries needed for the task:
- numpy and pandas for data handling and manipulation.
- matplotlib.pyplot and seaborn for data visualization.
- sklearn.datasets, sklearn.model_selection, sklearn.preprocessing, sklearn.neighbors, and sklearn.metrics for machine learning tasks, including loading datasets, splitting data, scaling features, building models, and evaluating models.
# Load the dataset
cancer_data = load_breast_cancer()
This line loads the breast cancer dataset from Scikit-learn, which contains data about breast cancer cases including features and labels indicating whether a tumor is malignant or benign.
# Convert to a DataFrame
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = cancer_data.target
Here, the dataset is converted into a pandas DataFrame for easier data manipulation. The features of the dataset are stored in df, and a new column called target is added to the DataFrame, containing the labels (0 for malignant and 1 for benign).
# Display the first few rows
print(df.head())
This line prints the first few rows of the DataFrame to give a quick look at the data structure and content.
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']
The features (data points) and the target variable (labels) are separated into X and y. X contains all the columns except target, while y contains only the target column.
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The data is split into training and test sets using an 80-20 split. 80% of the data is used for training the model, and 20% is used for testing it. The random_state=42 ensures reproducibility of the split.
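A toy illustration of one optional refinement (not used in the tutorial code): passing stratify=y to train_test_split preserves the class ratio in both the train and test portions. X_demo and y_demo below are made-up data, not the cancer dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with a 50/50 class balance.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# stratify=y_demo keeps the 50/50 ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # both 0.5: one sample of each class in the test set
```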
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
The features are standardized to have a mean of 0 and a standard deviation of 1, which helps improve the performance of many machine learning algorithms. The scaler is fitted on the training data and then applied to both the training and test data.
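This property is easy to verify on a tiny array (toy numbers, not the cancer data): after fitting and transforming, every column has mean approximately 0 and standard deviation approximately 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```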
# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
A K-Nearest Neighbors classifier is initialized with n_neighbors=5, meaning the algorithm will consider the 5 nearest data points when making a prediction.
# Train the model
knn.fit(X_train, y_train)
The KNN model is trained using the training data (X_train and y_train).
# Make predictions on the test set
y_pred = knn.predict(X_test)
The trained model makes predictions on the test data.
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
The model’s performance is evaluated using several metrics:
- Confusion Matrix: A table that describes the performance of a classification model.
- Classification Report: Provides precision, recall, F1 score, and support for the model.
- Accuracy Score: The ratio of correctly predicted instances to the total instances.
# Example of making a prediction on a new sample
sample_data = pd.DataFrame([[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189]],
columns=cancer_data.feature_names)
A new sample of data is created as a pandas DataFrame. This sample is structured similarly to the training data, with the same feature names.
# Standardize the sample data
sample_data = scaler.transform(sample_data)
The new sample data is standardized using the same scaler that was fitted on the training data.
# Make the prediction
prediction = knn.predict(sample_data)
print("Prediction (0 = malignant, 1 = benign):", prediction)
The model makes a prediction on the new sample data, and the prediction is printed out. The output indicates whether the tumor is predicted to be malignant (0) or benign (1).
Now, let us understand the output:
DataFrame Display
   mean radius  mean texture  ...  worst fractal dimension  target
0        17.99         10.38  ...                  0.11890       0
1        20.57         17.77  ...                  0.08902       0
2        19.69         21.25  ...                  0.08758       0
3        11.42         20.38  ...                  0.17300       0
4        20.29         14.34  ...                  0.07678       0

[5 rows x 31 columns]
This part shows the first 5 rows of the DataFrame containing the breast cancer dataset. Each row represents a different patient’s tumor data. Each column represents a different feature measured from the tumor, such as “mean radius”, “mean texture”, etc. The column “target” indicates whether the tumor is malignant (0) or benign (1). There are 31 columns in total (30 features plus the target).
Confusion Matrix
Confusion Matrix:
[[40  3]
 [ 3 68]]
The confusion matrix is a table used to evaluate the performance of a classification model. It compares the predicted classifications to the actual classifications: each row corresponds to an actual class (row 0 = malignant, row 1 = benign) and each column to a predicted class. Treating malignant (class 0) as the positive class:
- The first row, [40 3], tells us: 40 malignant cases were correctly predicted as malignant (true positives), and 3 were incorrectly predicted as benign (false negatives).
- The second row, [3 68], tells us: 3 benign cases were incorrectly predicted as malignant (false positives), and 68 were correctly predicted as benign (true negatives).
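These counts also let us reproduce the per-class metrics by hand. A small arithmetic check, using the counts from the matrix above and treating malignant (class 0) as the positive class:

```python
# Counts read straight off the confusion matrix.
cm = [[40, 3],
      [3, 68]]
tp, fn = cm[0]   # actual malignant: predicted malignant / predicted benign
fp, tn = cm[1]   # actual benign:    predicted malignant / predicted benign

recall_malignant = tp / (tp + fn)     # 40 / 43
precision_malignant = tp / (tp + fp)  # 40 / 43
print(round(recall_malignant, 2))     # 0.93
print(round(precision_malignant, 2))  # 0.93
```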
Classification Report
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114
The classification report provides several metrics to evaluate the model’s performance:
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- For malignant tumors (0): 0.93 (93%)
- For benign tumors (1): 0.96 (96%)
- Recall: The ratio of correctly predicted positive observations to all observations in the actual class.
- For malignant tumors (0): 0.93 (93%)
- For benign tumors (1): 0.96 (96%)
- F1-Score: The harmonic mean of precision and recall. It considers both false positives and false negatives.
- For malignant tumors (0): 0.93 (93%)
- For benign tumors (1): 0.96 (96%)
- Support: The number of actual occurrences of the class in the dataset.
- For malignant tumors (0): 43
- For benign tumors (1): 71
- Accuracy: The overall ratio of correctly predicted observations to the total observations: 0.95 (95%).
- Macro Average: The average of precision, recall, and F1-score for both classes (0 and 1).
- Precision: 0.94 (94%)
- Recall: 0.94 (94%)
- F1-Score: 0.94 (94%)
- Weighted Average: The average of precision, recall, and F1-score, weighted by the number of true instances for each label.
- Precision: 0.95 (95%)
- Recall: 0.95 (95%)
- F1-Score: 0.95 (95%)
Accuracy Score
Accuracy Score:
0.9473684210526315
The accuracy score is the proportion of true results (both true positives and true negatives) among the total number of cases examined. Here, it is approximately 0.947 (94.74%), indicating the model correctly classified about 94.74% of the test cases.
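The same figure can be recovered by hand from the confusion matrix: correct predictions sit on the diagonal, and the total is all test samples.

```python
# Accuracy is the confusion-matrix diagonal over the total count.
correct = 40 + 68          # true positives + true negatives
total = 40 + 3 + 3 + 68    # all 114 test samples
print(correct / total)     # 0.9473684210526315
```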
Prediction on New Data
Prediction (0 = malignant, 1 = benign): [0]
This line shows the prediction result for a new sample of tumor data. The model predicted [0], which means the tumor is classified as malignant.
Conclusion
In this article, we explored how to use Python’s Scikit-learn library to classify cancer cells using the Breast Cancer Wisconsin dataset. We went through the steps of loading and exploring the dataset, preprocessing the data, training a K-Nearest Neighbors classifier, evaluating the model, and making predictions. Machine learning techniques like these can play a vital role in medical diagnostics, potentially saving lives by enabling early detection of diseases.