MaldiAMRKit - Spectral Alignment

This notebook covers spectral alignment (warping) methods to correct for mass calibration drift.

Import Libraries

[1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from maldiamrkit import MaldiSet
from maldiamrkit.alignment import RawWarping, Warping, create_raw_input

Load Dataset

[2]:
data = MaldiSet.from_directory(
    "../data/",
    "../data/metadata/metadata.csv",
    aggregate_by=dict(antibiotics="Drug"),
)
X = data.X
# Binary target: susceptible (S) = 0; intermediate (I) and resistant (R) = 1
y = data.y["Drug"].map({"S": 0, "I": 1, "R": 1})

print(f"Features shape: {X.shape}")
Features shape: (29, 6000)

Warping Methods

MaldiAMRKit supports multiple alignment methods:

  • shift: Global median shift (fast, simple)

  • linear: Least-squares linear transformation

  • piecewise: Local shifts across spectrum segments (most flexible)

  • dtw: Dynamic Time Warping (best for non-linear drift, slowest)
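
To make the first two options concrete, here is a minimal NumPy sketch of what a global median shift and a least-squares linear correction could look like on matched peak positions. It is an illustration only, not MaldiAMRKit's implementation; the peak values and the helper names `median_shift` and `linear_fit` are hypothetical.

```python
import numpy as np

# Hypothetical matched peak positions (m/z): reference vs. one drifted spectrum.
ref_peaks = np.array([3000.0, 5000.0, 7000.0, 9000.0])
obs_peaks = np.array([3001.5, 5002.5, 7003.5, 9004.5])


def median_shift(obs, ref):
    """Global shift: a single constant offset, the median peak difference."""
    return np.median(ref - obs)


def linear_fit(obs, ref):
    """Linear correction: least-squares slope and intercept mapping obs to ref."""
    slope, intercept = np.polyfit(obs, ref, deg=1)
    return slope, intercept


shift = median_shift(obs_peaks, ref_peaks)
slope, intercept = linear_fit(obs_peaks, ref_peaks)

shifted = obs_peaks + shift
linear = slope * obs_peaks + intercept

print(f"Max residual after shift:  {np.abs(shifted - ref_peaks).max():.3f} Da")
print(f"Max residual after linear: {np.abs(linear - ref_peaks).max():.3f} Da")
```

In this toy case the drift grows with m/z, so a single offset leaves a residual at both ends of the range while the linear fit removes it.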

[3]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Shift method (fastest)
pipe_shift = Pipeline(
    [
        ("warp", Warping(method="shift")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe_shift, X, y, cv=cv, scoring="roc_auc")
print(f"Shift - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
Shift - CV ROC AUC: 0.400 +/- 0.255
[4]:
# Linear method
pipe_linear = Pipeline(
    [
        ("warp", Warping(method="linear")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe_linear, X, y, cv=cv, scoring="roc_auc")
print(f"Linear - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
Linear - CV ROC AUC: 0.400 +/- 0.289
[5]:
# Piecewise method (often best trade-off)
pipe_piecewise = Pipeline(
    [
        ("warp", Warping(method="piecewise", n_segments=10)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe_piecewise, X, y, cv=cv, scoring="roc_auc")
print(f"Piecewise - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
Piecewise - CV ROC AUC: 0.400 +/- 0.289
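
As a rough picture of what "local shifts across spectrum segments" means, the sketch below splits a binned spectrum into equal segments and gives each one its own integer shift, chosen to maximize overlap with a reference. This is a simplified stand-in written for illustration, not the library's algorithm; `best_shift`, `piecewise_align`, and the synthetic Gaussian peaks are all hypothetical.

```python
import numpy as np


def best_shift(segment, ref_segment, max_shift):
    """Return the integer bin shift that maximizes overlap with the reference."""
    shifts = range(-max_shift, max_shift + 1)
    scores = [np.dot(np.roll(segment, s), ref_segment) for s in shifts]
    return shifts[int(np.argmax(scores))]


def piecewise_align(spectrum, reference, n_segments=4, max_shift=5):
    """Align each segment independently with its own integer shift."""
    aligned = np.empty_like(spectrum)
    bounds = np.linspace(0, len(spectrum), n_segments + 1, dtype=int)
    for start, stop in zip(bounds[:-1], bounds[1:]):
        s = best_shift(spectrum[start:stop], reference[start:stop], max_shift)
        # np.roll wraps around at segment edges; acceptable for a sketch
        # because the synthetic peaks sit far from the boundaries.
        aligned[start:stop] = np.roll(spectrum[start:stop], s)
    return aligned


# Synthetic data: one peak per segment, each drifted by a different amount.
bins = np.arange(200)
centers = [25, 75, 125, 175]
drifts = [1, 2, 3, 4]
reference = sum(np.exp(-0.5 * ((bins - c) / 2.0) ** 2) for c in centers)
spectrum = sum(
    np.exp(-0.5 * ((bins - (c + d)) / 2.0) ** 2) for c, d in zip(centers, drifts)
)

aligned = piecewise_align(spectrum, reference, n_segments=4, max_shift=5)
corr_before = np.corrcoef(spectrum, reference)[0, 1]
corr_after = np.corrcoef(aligned, reference)[0, 1]
print(f"Correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

Because the drift differs per segment here, no single global shift could align all four peaks at once, which is the case the piecewise method targets.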

Alignment Quality Assessment

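Use get_alignment_quality() to measure how well spectra were aligned to the reference. It returns one row per spectrum with the correlation and RMSE against the reference before and after warping, plus the change in correlation (improvement).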

[6]:
# Fit warping and check alignment quality
warper = Warping(method="piecewise", n_segments=10)
warper.fit(X)
X_aligned = warper.transform(X)

# Get alignment quality metrics
quality = warper.get_alignment_quality(X, X_aligned)
print(f"Mean correlation improvement: {quality['improvement'].mean():.4f}")
quality.head()
Mean correlation improvement: 0.0056
[6]:
     correlation_before  correlation_after  improvement  rmse_before  rmse_after
10s            0.850781           0.850781     0.000000     0.000137    0.000137
11s            0.854397           0.854397     0.000000     0.000185    0.000185
12s            0.898360           0.898360     0.000000     0.000192    0.000192
13s            0.817404           0.817404     0.000000     0.000240    0.000240
14s            0.825112           0.825087    -0.000024     0.000177    0.000177
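
The column names in this table suggest Pearson correlation and root-mean-square error against a reference spectrum, with improvement equal to the correlation after minus the correlation before (which matches the 14s row). The sketch below shows how such metrics could be computed with NumPy for a single spectrum. It is an assumption about the definitions, not the library's code; `alignment_metrics` and the synthetic peak are hypothetical.

```python
import numpy as np


def alignment_metrics(before, after, reference):
    """Correlation and RMSE of one spectrum against a reference, before and after."""
    corr_before = np.corrcoef(before, reference)[0, 1]
    corr_after = np.corrcoef(after, reference)[0, 1]
    return {
        "correlation_before": corr_before,
        "correlation_after": corr_after,
        "improvement": corr_after - corr_before,
        "rmse_before": np.sqrt(np.mean((before - reference) ** 2)),
        "rmse_after": np.sqrt(np.mean((after - reference) ** 2)),
    }


# Hypothetical data: a single Gaussian peak, observed two bins off.
bins = np.arange(100)
reference = np.exp(-0.5 * ((bins - 50) / 3.0) ** 2)
before = np.exp(-0.5 * ((bins - 52) / 3.0) ** 2)
after = np.roll(before, -2)  # pretend the warper undid the two-bin drift

metrics = alignment_metrics(before, after, reference)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions, an improvement of zero means the warp left the spectrum's correlation with the reference unchanged, and a negative value means it got slightly worse.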

Raw Spectra Warping

RawWarping performs alignment at full m/z resolution (before binning) for higher precision. It loads raw spectra files during fit/transform and outputs properly binned data.

Key workflow:

  1. Use create_raw_input() to create an input DataFrame of file paths

  2. Pass this DataFrame to RawWarping in your pipeline

  3. Get properly binned, aligned spectra as output

This design makes RawWarping fully compatible with sklearn pipelines.

[7]:
# Create input DataFrame from raw spectra directory
X_raw = create_raw_input("../data/")
print(f"Input DataFrame shape: {X_raw.shape}")
print(f"Columns: {X_raw.columns.tolist()}")
X_raw.head()
Input DataFrame shape: (29, 1)
Columns: ['path']
[7]:
                path
10s  ../data/10s.txt
11s  ../data/11s.txt
12s  ../data/12s.txt
13s  ../data/13s.txt
14s  ../data/14s.txt
[8]:
# RawWarping in a pipeline - outputs binned spectra
raw_warper = RawWarping(
    method="piecewise",
    bin_width=3,
    max_shift_da=10.0,
    n_segments=5,
)

# Fit and transform - loads raw files, warps at full resolution, bins output
raw_warper.fit(X_raw)
X_raw_aligned = raw_warper.transform(X_raw)
print(f"Input shape:  {X_raw.shape} (single 'path' column)")
print(f"Output shape: {X_raw_aligned.shape} (binned spectra)")
print(
    f"Output columns are m/z bin starting points: {X_raw_aligned.columns[:5].tolist()}..."
)
Input shape:  (29, 1) (single 'path' column)
Output shape: (29, 6000) (binned spectra)
Output columns are m/z bin starting points: ['2000', '2003', '2006', '2009', '2012']...
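
The bin labels printed above are consistent with fixed 3 Da bins starting at m/z 2000. The sketch below shows one way such binning could work with NumPy. The upper bound of 20000 is inferred from 6000 columns times 3 Da and is an assumption, as are the function name `bin_spectrum` and the choice to sum intensities within each bin.

```python
import numpy as np


def bin_spectrum(mz, intensity, mz_min=2000, mz_max=20000, bin_width=3):
    """Sum intensities into fixed-width m/z bins; returns (bin_starts, binned)."""
    edges = np.arange(mz_min, mz_max + bin_width, bin_width)
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    return edges[:-1], binned


# Hypothetical raw spectrum with a handful of (m/z, intensity) points.
mz = np.array([2001.2, 2004.9, 2005.1, 19999.0])
intensity = np.array([10.0, 5.0, 7.0, 1.0])

bin_starts, binned = bin_spectrum(mz, intensity)
print(binned.shape)     # (6000,)
print(bin_starts[:5])   # [2000 2003 2006 2009 2012]
print(binned[:2])       # [10. 12.]
```

Aligning before this step matters because two peaks that differ by a fraction of a dalton can land in different bins once the axis is discretized.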

Parallelization

Use the n_jobs parameter to enable parallel processing; n_jobs=-1 uses all available cores.

[9]:
# Parallel warping (use all cores)
warper_parallel = Warping(method="piecewise", n_segments=10, n_jobs=-1)
warper_parallel.fit(X)
X_aligned_parallel = warper_parallel.transform(X)
print(f"Aligned {len(X)} spectra")
Aligned 29 spectra
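
Alignment parallelizes well because each spectrum is warped against the reference independently of the others. The sketch below illustrates that idea with the standard library's `concurrent.futures`. It does not show how MaldiAMRKit implements n_jobs; `align_one`, `align_all`, and the synthetic spectra are hypothetical, and a thread pool is used only to keep the example short.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def align_one(spectrum, reference, max_shift=5):
    """Shift one spectrum by the integer offset that best matches the reference."""
    shifts = range(-max_shift, max_shift + 1)
    scores = [np.dot(np.roll(spectrum, s), reference) for s in shifts]
    return np.roll(spectrum, shifts[int(np.argmax(scores))])


def align_all(spectra, reference, n_jobs=-1):
    """Align spectra independently; n_jobs=-1 is read as 'use every core'."""
    workers = (os.cpu_count() or 1) if n_jobs == -1 else n_jobs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = pool.map(lambda s: align_one(s, reference), spectra)
        return np.array(list(rows))


# Synthetic data: the same peak, drifted by a different offset in each row.
bins = np.arange(100)
reference = np.exp(-0.5 * ((bins - 50) / 3.0) ** 2)
spectra = np.array([np.roll(reference, k) for k in (-3, -1, 0, 2, 4)])

aligned = align_all(spectra, reference, n_jobs=-1)
print(f"Aligned {len(aligned)} spectra")
```

The result is the same as a sequential loop over the rows; only the wall-clock time changes.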

RawWarping in sklearn Pipeline

Because RawWarping accepts a path-based DataFrame and outputs binned spectra, it can serve as the first step of a sklearn pipeline, so the warper is refit on the training split of each cross-validation fold.

[10]:
# Full pipeline: raw spectra -> alignment -> scaling -> classification
pipe_raw = Pipeline(
    [
        ("warp", RawWarping(method="piecewise", bin_width=3, n_segments=5)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

# Cross-validation with RawWarping pipeline
# Note: X_raw contains file paths, y contains labels
scores = cross_val_score(pipe_raw, X_raw, y, cv=cv, scoring="roc_auc")
print(f"RawWarping Pipeline - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
RawWarping Pipeline - CV ROC AUC: 0.375 +/- 0.250