Commit ce745944 authored by Jannis Klinkenberg

added clustering example using scikit-learn

parent 28797f8b
# Scikit-Learn - Clustering Example
## Interactive usage
To interactively work on, debug, and execute code, you can either use our [HPC JupyterHub](https://jupyterhub.hpc.itc.rwth-aachen.de:9651/) (more information in our [Help](https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/689934fec5a34c909c54606f6bc2e827/)) or use regular shell sessions on the cluster frontends or within interactive batch jobs.
As an example, in an interactive session on our HPC cluster, execute the following:
```bash
# load the container module on the cluster
module load datascience-notebook
# run the container and get a shell inside
# more information on Apptainer flags: https://apptainer.org/docs/user/main/cli/apptainer_shell.html
apptainer shell -e ${DATASCIENCENOTEBOOK_IMAGE}
# within the newly opened shell in the container
Apptainer> python scikit-learn_clustering.py
```
## Batch usage
To asynchronously execute scripts in batch mode, this folder contains an example Slurm batch script (shown at the end of this page), which runs the Python script inside the container.
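A minimal sketch of submitting such a job, assuming the batch script is saved as `clustering_job.sh` (a placeholder name; use the actual file name of the batch script in this folder):

```bash
# submit the batch script to Slurm (clustering_job.sh is a placeholder name)
sbatch clustering_job.sh
# check the status of your own pending/running jobs
squeue -u $USER
```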
# Clustering Iris Dataset
### Step 1: Load desired Python modules
import time
import matplotlib.pyplot as plt
from sklearn import cluster, datasets
from sklearn.decomposition import PCA
### Step 2: Loading the dataset
# load the dataset
dataset = datasets.load_iris()
# print desired information
print(f"Data set description:\n{dataset.DESCR}")
print(f"Shape of feature / training data: {dataset.data.shape}")
print(f"Shape of label data: {dataset.target.shape}")
print(f"Feature names: {dataset.feature_names}")
print(f"Target names: {dataset.target_names}")
### Step 3: Train a KMeans clustering model
# create and initialize the clustering model
model = cluster.KMeans(n_clusters=3, init="k-means++", random_state=42)
# train / fit the model and measure the wall-clock time
start_time = time.time()
model = model.fit(dataset.data)
elapsed_time = time.time() - start_time
print(f"Elapsed time for training the KMeans model: {elapsed_time} sec")
### Step 4: Visualization of results + comparison to original classes
# transform data to new 2D feature space
pca = PCA(n_components=2)
X_pca = pca.fit(dataset.data).transform(dataset.data)
# define class colors
colors = ["navy", "turquoise", "darkorange"]
# =================================================================
# == plot original classes
# == using 2D feature space representation
# =================================================================
fig1 = plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], dataset.target_names):
    plt.scatter(
        X_pca[dataset.target == i, 0],  # x coordinates in new 2D feature space
        X_pca[dataset.target == i, 1],  # y coordinates in new 2D feature space
        color=color, alpha=0.8, lw=2,
        label=target_name
    )
plt.title("Original IRIS dataset classes (after applying PCA)")
plt.legend(loc="best", shadow=False, scatterpoints=1)
fig1.savefig("plot_original_classes_2D_PCA_representation.png", dpi=None, facecolor='w', edgecolor='w',
             format="png", transparent=False, bbox_inches='tight', pad_inches=0, metadata=None)
# =================================================================
# == plot classes resulting from clustering
# == using 2D feature space representation
# =================================================================
# get cluster numbers for the different data samples
y_pred = model.predict(dataset.data)
# plot
fig2 = plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], ["Cluster 0", "Cluster 1", "Cluster 2"]):
    plt.scatter(
        X_pca[y_pred == i, 0],  # x coordinates in new 2D feature space
        X_pca[y_pred == i, 1],  # y coordinates in new 2D feature space
        color=color, alpha=0.8, lw=2,
        label=target_name
    )
plt.title("Clustered IRIS dataset classes (after applying PCA)")
plt.legend(loc="best", shadow=False, scatterpoints=1)
# plt.show()
fig2.savefig("plot_clustering_classes_2D_PCA_representation.png", dpi=None, facecolor='w', edgecolor='w',
             format="png", transparent=False, bbox_inches='tight', pad_inches=0, metadata=None)
#!/usr/bin/zsh
############################################################
### Slurm flags
############################################################
#SBATCH --time=00:15:00
#SBATCH --partition=c23ms
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
############################################################
### Load modules or software
############################################################
# load module for datascience-notebook container
module load datascience-notebook
module list
############################################################
### Parameters and Settings
############################################################
# print some information about current system
echo "Job nodes: ${SLURM_JOB_NODELIST}"
echo "Current machine: $(hostname)"
############################################################
### Execution (Model Training)
############################################################
# run the python script inside the container
apptainer exec -e ${DATASCIENCENOTEBOOK_IMAGE} \
    bash -c "python scikit-learn_clustering.py"