MaldiAMRKit - Exploration & Similarity#

This notebook demonstrates how to use MaldiAMRKit’s exploratory visualisation and spectral similarity functions to inspect MALDI-TOF datasets before building classifiers.

It covers:

  • PCA scatter plots coloured by metadata

  • t-SNE embeddings

  • UMAP embeddings (requires pip install maldiamrkit[batch])

  • Spectral similarity: pairwise distance matrices, heatmaps, clustering, and dendrograms

  • Brief introduction to batch effect correction with combatlearn

Dataset#

This notebook uses the full multi-site version of the MALDI-Kleb-AI dataset (Rocchi et al., 2026; Zenodo DOI 10.5281/zenodo.17405072) - all three Italian centres (Rome, Milan, Catania), about 770 spectra. The other notebooks restrict themselves to the Rome sub-cohort, but exploration is precisely where you want the full multi-centre cohort: unsupervised methods will quickly tell you whether site / batch effects dominate the resistance signal. The helper in `notebooks/_demo.py <_demo.py>`__ caches the 370 MB tarball under ~/.cache/maldiamrkit/ on first use.

[1]:
import pathlib
import sys

from maldiamrkit.visualization import plot_pca, plot_tsne

sys.path.insert(0, str(pathlib.Path.cwd()))  # _demo.py sits next to this notebook
from _demo import load_maldi_kleb_ai
/home/ettore/.venvs/maldiamrkit/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Load the Dataset#

Exploration is the one notebook where the full multi-site cohort is more interesting than the Rome-only slice: with three acquisition centres (Rome, Milan, Catania) we can see whether unsupervised structure tracks the resistance phenotype or the acquisition site. Pass city=None to load all three centres.

[2]:
ds = load_maldi_kleb_ai(antibiotic="Amikacin", city=None, verbose=True)
data = ds.maldi_set
labels = ds.meta["Amikacin"]
sites = ds.meta["City"]
print(f"Samples: {ds.X.shape[0]}, Features: {ds.X.shape[1]}")
print(f"Labels: {labels.value_counts().to_dict()}")
print(f"Sites:  {sites.value_counts().to_dict()}")
Processing spectra: 100%|██████████| 743/743 [00:00<00:00, 4637.49spectrum/s]
Samples: 741, Features: 6000
Labels: {'S': 372, 'R': 369}
Sites:  {'Rome': 470, 'Milan': 184, 'Catania': 87}

PCA#

PCA is the simplest starting point. Axis labels automatically show the explained variance percentage.

[3]:
_ = plot_pca(ds.X, color_by=labels)
../../_images/tutorials_notebooks_05_exploration_6_0.png

You can pass any matplotlib keyword arguments through **pca_kwargs, or provide a custom palette:

[4]:
_ = plot_pca(
    ds.X,
    color_by=labels,
    palette={"S": "#2196F3", "R": "#F44336"},
    title="PCA - Amikacin Resistance (multi-site)",
    alpha=0.8,
    s=35,
)
../../_images/tutorials_notebooks_05_exploration_8_0.png

Now colour by the acquisition centre instead - if the three sites separate, you are seeing batch / instrument effects rather than biology, and the unsupervised plots are not telling you about resistance.

[5]:
_ = plot_pca(
    ds.X,
    color_by=sites,
    title="PCA - Acquisition site",
    alpha=0.8,
    s=35,
)
../../_images/tutorials_notebooks_05_exploration_10_0.png

t-SNE#

For a non-linear embedding, use plot_tsne. With ~470 samples, perplexity in the 20-40 range is a reasonable default.

[6]:
_ = plot_tsne(ds.X, color_by=labels, perplexity=30, random_state=42)
../../_images/tutorials_notebooks_05_exploration_12_0.png

UMAP#

UMAP requires the optional umap-learn package:

pip install maldiamrkit[batch]
[7]:
try:
    from maldiamrkit.visualization import plot_umap

    _ = plot_umap(ds.X, color_by=labels, n_neighbors=15, random_state=42)
except ImportError as e:
    print(e)
/home/ettore/.venvs/maldiamrkit/lib/python3.10/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
../../_images/tutorials_notebooks_05_exploration_14_1.png

Spectral Similarity#

MaldiAMRKit provides spectral distance metrics, pairwise distance matrix computation, clustering algorithms, and dedicated plots for similarity analysis.

Available metrics (see SpectralMetric):

  • cosine - cosine distance (binned spectra)

  • spectral_contrast_angle - spectral contrast angle (binned spectra)

  • pearson - 1 - Pearson correlation (binned spectra)

  • wasserstein - Wasserstein / earth mover’s distance (raw spectra)

  • dtw - dynamic time warping distance (raw spectra)

For the pairwise heatmap and dendrogram we subsample to keep the figures readable; the same calls scale to the full 470-spectrum matrix.

[8]:
from maldiamrkit.similarity import (
    cluster_metadata_concordance,
    cluster_spectra,
    hierarchical_clustering,
    pairwise_distances,
    plot_dendrogram,
    plot_distance_heatmap,
    silhouette_scores,
)

Pairwise Distance Matrix#

Compute a symmetric distance matrix from the binned feature matrix. For binned spectra, cosine distance is a natural choice.

[9]:
import numpy as np

rng = np.random.default_rng(0)
subset_idx = rng.choice(ds.X.shape[0], size=80, replace=False)
X_subset = ds.X.iloc[subset_idx]
labels_subset = labels.iloc[subset_idx]

D = pairwise_distances(X_subset, metric="cosine")
print(f"Distance matrix shape: {D.shape}")
print(f"Range: [{D[D > 0].min():.4f}, {D.max():.4f}]")
Distance matrix shape: (80, 80)
Range: [0.0449, 0.8068]

Distance Heatmap#

Visualise the pairwise distance matrix. Darker values indicate more similar spectra.

[10]:
_ = plot_distance_heatmap(
    D,
    labels=X_subset.index.tolist(),
    title="Cosine Distance Heatmap (80-spectrum subset)",
)
../../_images/tutorials_notebooks_05_exploration_20_0.png

Hierarchical Clustering & Dendrogram#

Build a linkage matrix and visualise the cluster hierarchy.

[11]:
linkage = hierarchical_clustering(D, method="average")
_ = plot_dendrogram(
    linkage,
    labels=X_subset.index.tolist(),
    title="Hierarchical Clustering (average linkage)",
)
../../_images/tutorials_notebooks_05_exploration_22_0.png

Cluster Assignment & Evaluation#

Use cluster_spectra for a one-call clustering workflow. We ask for three clusters - matching the three acquisition centres - and then score the partition against both the resistance phenotype and the site label with silhouette scores and metadata concordance (ARI, NMI). This is a clean diagnostic: if clusters track the site but not the phenotype, batch effects are dominating the unsupervised signal - a classic motivator for ComBat-style correction.

We use method='kmedoids' here: with very uneven cluster sizes (Rome alone is ~470 spectra) average-linkage hierarchical clustering tends to peel off a couple of singletons instead of recovering the three centres. K-medoids respects the requested cluster count and produces balanced groups that closely match the site labels.

[12]:
import pandas as pd

D_full = pairwise_distances(ds.X, metric="cosine")
cluster_labels = cluster_spectra(
    D_full,
    method="kmedoids",
    n_clusters=3,
    random_state=42,
)

sil = silhouette_scores(D_full, cluster_labels)
print(f"Silhouette score:               {sil:.3f}")

conc_pheno = cluster_metadata_concordance(cluster_labels, labels)
print("\nConcordance vs. Amikacin (R/S):")
print(f"  Adjusted Rand Index:          {conc_pheno['adjusted_rand_index']:.3f}")
print(f"  Normalized Mutual Info:       {conc_pheno['normalized_mutual_info']:.3f}")

conc_site = cluster_metadata_concordance(cluster_labels, sites)
print("\nConcordance vs. acquisition site (City):")
print(f"  Adjusted Rand Index:          {conc_site['adjusted_rand_index']:.3f}")
print(f"  Normalized Mutual Info:       {conc_site['normalized_mutual_info']:.3f}")

print("\nClusters x City:")
pd.crosstab(
    pd.Series(cluster_labels, index=sites.index, name="Cluster"),
    sites,
)
Silhouette score:               0.294

Concordance vs. Amikacin (R/S):
  Adjusted Rand Index:          0.029
  Normalized Mutual Info:       0.042

Concordance vs. acquisition site (City):
  Adjusted Rand Index:          0.688
  Normalized Mutual Info:       0.541

Clusters x City:
[12]:
City Catania Milan Rome
Cluster
0 7 7 439
1 80 110 21
2 0 67 10

Comparing Metrics#

Try different distance metrics to see which one recovers the site structure most cleanly. We compare each metric against both the resistance phenotype and the acquisition site.

[13]:
results = []
for metric in ["cosine", "spectral_contrast_angle", "pearson"]:
    D_m = pairwise_distances(ds.X, metric=metric)
    cl = cluster_spectra(D_m, method="kmedoids", n_clusters=3, random_state=42)
    sil_m = silhouette_scores(D_m, cl)
    conc_pheno = cluster_metadata_concordance(cl, labels)
    conc_site = cluster_metadata_concordance(cl, sites)
    results.append(
        {
            "metric": metric,
            "silhouette": sil_m,
            "ARI_pheno": conc_pheno["adjusted_rand_index"],
            "NMI_pheno": conc_pheno["normalized_mutual_info"],
            "ARI_site": conc_site["adjusted_rand_index"],
            "NMI_site": conc_site["normalized_mutual_info"],
        }
    )

pd.DataFrame(results).set_index("metric")
[13]:
silhouette ARI_pheno NMI_pheno ARI_site NMI_site
metric
cosine 0.294047 0.029394 0.041665 0.687924 0.541444
spectral_contrast_angle 0.152448 0.019211 0.016085 0.449982 0.410990
pearson 0.250229 0.016489 0.013949 0.391881 0.361815

Batch Effect Correction#

When combining data from multiple sites or instruments, batch effects can dominate the signal. MaldiAMRKit on its own focuses on the single- site case (that is why this notebook restricts to Rome). For multi- centre harmonisation use combatlearn directly, or use the dedicated MaldiBatchKit sister package which builds on top of it:

from combatlearn import Combat

combat = Combat(method='fortin')
combat.fit(X, y=batch_labels)
X_corrected = combat.transform(X, y=batch_labels)

Install with pip install maldiamrkit[batch]. See the combatlearn documentation for full usage.

Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). doi:10.1186/s12866-025-04657-2