MaldiAMRKit - Drift Monitoring#

This notebook demonstrates maldiamrkit.drift.DriftMonitor, which anchors a baseline on the earliest timestamps and tracks temporal drift through three complementary views:

  • Reference similarity - distance of per-window median spectrum to the baseline reference (no labels needed)

  • PCA centroid trajectory - how windows move in a baseline-fitted PCA space

  • Peak-selection stability + effect-size drift - Jaccard overlap of top-k discriminative peaks per window vs. baseline, and Cohen’s d of specific peaks over time

Dataset#

These notebooks use the MALDI-Kleb-AI dataset (Rocchi et al., 2026; Zenodo DOI 10.5281/zenodo.17405072), a curated archive of MALDI-TOF spectra of Klebsiella clinical isolates from three Italian centres (Rome, Milan, Catania) with Amikacin / Meropenem resistance annotations. For simplicity we restrict the demo to the Rome sub-cohort (single site, no batch correction needed). The helper in `notebooks/_demo.py <_demo.py>`__ caches the 370 MB tarball under ~/.cache/maldiamrkit/ (or $MALDIAMRKIT_CACHE_DIR) on first use.

Load the Dataset#

We load the Rome cohort. The Zenodo metadata does not expose an acquisition timestamp, so we simulate dates by spreading the real spectra uniformly across a six-month window. The spectra themselves are real clinical isolates; only the time axis is synthetic, which lets us demonstrate the DriftMonitor API end-to-end without fabricating data.

[1]:
import pathlib
import sys

import numpy as np
import pandas as pd

from maldiamrkit.susceptibility import LabelEncoder

sys.path.insert(0, str(pathlib.Path.cwd()))  # _demo.py sits next to this notebook
from _demo import load_maldi_kleb_ai

ds = load_maldi_kleb_ai(antibiotic="Amikacin", verbose=True)
encoder = LabelEncoder(intermediate="nan")
y = (
    pd.Series(
        encoder.fit_transform(ds.meta["Amikacin"]),
        index=ds.meta.index,
        name="Amikacin",
    )
    .dropna()
    .astype(int)
)
X = ds.X.loc[y.index]
print(f"Rows: {X.shape}, class counts: {y.value_counts().to_dict()}")
/home/ettore/.venvs/maldiamrkit/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Processing spectra: 100%|██████████| 472/472 [00:00<00:00, 4947.34spectrum/s]
Rows: (470, 6000), class counts: {1: 273, 0: 197}

Attach a Synthetic Acquisition Date#

We spread the spectra evenly across 180 days, ordered by sample ID, and wrap them in a tiny shim exposing .X, .meta, and .get_y_single()

  • the only attributes DriftMonitor needs.

[2]:
from dataclasses import dataclass


@dataclass
class TimestampedSet:
    """Adapter that exposes the small surface DriftMonitor expects."""

    X: pd.DataFrame
    meta: pd.DataFrame

    def get_y_single(self, antibiotic: str | None = None) -> pd.Series:
        """Return the label column for ``antibiotic`` (default Amikacin)."""
        col = antibiotic or "Amikacin"
        return self.meta.loc[self.X.index, col]


ordered = X.sort_index()
start = pd.Timestamp("2025-01-01")
days = np.linspace(0, 180, num=len(ordered), dtype=int)
dates = [start + pd.Timedelta(days=int(d)) for d in days]

meta_full = pd.DataFrame(
    {
        "Amikacin": y.loc[ordered.index].astype(int),
        "acquisition_date": dates,
    },
    index=ordered.index,
)
dataset = TimestampedSet(X=ordered, meta=meta_full)
lo = meta_full["acquisition_date"].min().date()
hi = meta_full["acquisition_date"].max().date()
print(f"Date range: {lo}{hi}")
print("Class counts:", meta_full["Amikacin"].value_counts().to_dict())
Date range: 2025-01-01 → 2025-06-30
Class counts: {1: 273, 0: 197}

Fit the DriftMonitor#

By default, the baseline is the earliest 20% of sorted timestamps. Below we use a 30-day sliding window.

[3]:
from maldiamrkit.drift import DriftMonitor

monitor = DriftMonitor(
    time_column="acquisition_date",
    window="30D",
    metric="cosine",
    min_samples=5,
).fit(dataset)
print(f"Baseline cutoff: {monitor.baseline_end_.date()}")
Baseline cutoff: 2025-02-05

1. Reference Similarity#

For each 30-day window we compute the median spectrum and measure its cosine distance to the baseline reference (element-wise median of baseline rows). A rising line signals that the population is drifting away from the baseline fingerprint.

[4]:
from maldiamrkit.drift import plot_reference_drift

ref_df = monitor.monitor(dataset)
ref_df.head()
[4]:
window_start window_end n_spectra distance_to_reference
0 2025-02-06 2025-03-08 78 0.100245
1 2025-03-08 2025-04-07 79 0.030005
2 2025-04-07 2025-05-07 78 0.068501
3 2025-05-07 2025-06-06 78 0.063464
4 2025-06-06 2025-07-06 63 0.157319
[5]:
_ = plot_reference_drift(
    ref_df,
    title="Cosine distance to baseline reference - Rome cohort",
)
../../_images/tutorials_notebooks_07_drift_monitoring_10_0.png

2. PCA Centroid Trajectory#

monitor_pca projects each window’s rows onto a PCA space fitted on the baseline, then reports the per-window centroid (PC1, PC2) and dispersion (mean distance from centroid).

plot_pca_drift draws arrows between consecutive centroids and colours each point by time (early → late).

[6]:
from maldiamrkit.drift import plot_pca_drift

pca_df = monitor.monitor_pca(dataset)
pca_df.head()
[6]:
window_start window_end centroid_pc1 centroid_pc2 dispersion n_spectra
0 2025-02-06 2025-03-08 15.638410 7.322260 28.561470 78
1 2025-03-08 2025-04-07 11.465971 0.634700 29.147355 79
2 2025-04-07 2025-05-07 17.153615 2.299735 30.926929 78
3 2025-05-07 2025-06-06 16.887356 -1.988096 35.732134 78
4 2025-06-06 2025-07-06 3.947033 0.726848 33.505330 63
[7]:
_ = plot_pca_drift(pca_df, title="PCA centroid trajectory - Rome cohort")
../../_images/tutorials_notebooks_07_drift_monitoring_13_0.png

3. Peak-Selection Stability#

For each window, monitor_peak_stability fits a fresh DifferentialAnalysis and computes the Jaccard overlap of its top-k peaks with the baseline top-k. A falling curve means discriminative peaks are shifting over time - a red flag for biomarker studies.

[8]:
from maldiamrkit.differential import DifferentialAnalysis
from maldiamrkit.drift import plot_peak_stability

baseline_analysis = DifferentialAnalysis(ordered, meta_full["Amikacin"]).run()
stability_df = monitor.monitor_peak_stability(
    dataset,
    baseline_analysis,
    antibiotic="Amikacin",
    n_top=15,
)
stability_df.head()
[8]:
window_start stability_score n_spectra
0 2025-02-06 0.000000 78
1 2025-03-08 0.071429 79
2 2025-04-07 0.000000 78
3 2025-05-07 0.000000 78
4 2025-06-06 0.200000 63
[9]:
_ = plot_peak_stability(
    stability_df,
    title="Top-15 peak-selection Jaccard vs. baseline",
)
../../_images/tutorials_notebooks_07_drift_monitoring_16_0.png

4. Per-Peak Effect-Size Drift#

If you already have a short list of peaks of interest (e.g., from the baseline analysis), monitor_effect_sizes tracks their Cohen’s d between R and S in each window. A peak whose d drops toward zero (or flips sign) is losing discriminative power.

[10]:
from maldiamrkit.drift import plot_effect_size_drift

top5 = baseline_analysis.top_peaks(n=5)["mz_bin"].astype(str).tolist()
effect_df = monitor.monitor_effect_sizes(
    dataset,
    peaks=top5,
    antibiotic="Amikacin",
)
effect_df.head()
[10]:
window_start 4703.0 4712.0 4715.0 11114.0 11108.0
0 2025-02-06 0.556368 0.352241 0.450814 0.401412 0.537368
1 2025-03-08 0.715240 0.448572 0.488254 0.763298 0.807354
2 2025-04-07 0.509200 0.468215 0.467595 0.612715 0.360646
3 2025-05-07 0.616484 0.580740 0.681273 0.528699 0.634580
4 2025-06-06 0.752635 0.653076 0.665924 0.785004 0.589966
[11]:
_ = plot_effect_size_drift(
    effect_df,
    title="Cohen's d over time (top-5 baseline peaks)",
)
../../_images/tutorials_notebooks_07_drift_monitoring_19_0.png

See Also#