MaldiAMRKit - Spectral Alignment#
This notebook covers spectral alignment (warping) methods to correct for mass calibration drift.
Import Libraries#
[1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from maldiamrkit import MaldiSet
from maldiamrkit.alignment import RawWarping, Warping, create_raw_input
Load Dataset#
[2]:
data = MaldiSet.from_directory(
    "../data/",
    "../data/metadata/metadata.csv",
    aggregate_by=dict(antibiotics="Drug"),
)
X = data.X
y = data.y["Drug"].map({"S": 0, "I": 1, "R": 1})
print(f"Features shape: {X.shape}")
Features shape: (29, 6000)
Warping Methods#
MaldiAMRKit supports multiple alignment methods:
shift: Global median shift (fast, simple)
linear: Least-squares linear transformation
piecewise: Local shifts across spectrum segments (most flexible)
dtw: Dynamic Time Warping (best for non-linear drift, slowest)
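The dtw option is listed above but not run in this notebook, so here is a minimal sketch of classic dynamic time warping on two toy signals. It is a textbook formulation written for illustration, not MaldiAMRKit's implementation; the function name `dtw_path` and the toy arrays are assumptions.

```python
import numpy as np


def dtw_path(a, b):
    """Classic DTW: return the total warping cost and the optimal index path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])

    # Backtrack from the end to recover which samples were matched
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return float(cost[n, m]), path[::-1]


# Toy signals: the same single peak, displaced by one sample
a = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

distance, path = dtw_path(a, b)
pointwise = float(np.sum(np.abs(a - b)))
print(f"DTW cost: {distance}, pointwise cost: {pointwise}")
```

The cost table has one cell per pair of samples, so the work grows with the product of the two signal lengths. That is consistent with dtw being described as the slowest option.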
[3]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Shift method (fastest)
pipe_shift = Pipeline(
    [
        ("warp", Warping(method="shift")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)
scores = cross_val_score(pipe_shift, X, y, cv=cv, scoring="roc_auc")
print(f"Shift - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
Shift - CV ROC AUC: 0.400 +/- 0.255
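To make the idea of a global median shift concrete, the sketch below estimates one offset from matched peak positions. The peak values are invented for illustration, and how MaldiAMRKit detects and matches peaks is not shown here, so treat this as a conceptual example rather than the library's code.

```python
import numpy as np

# Hypothetical matched peak positions (m/z, in Da): reference vs. one drifted spectrum
ref_peaks = np.array([3000.0, 4500.0, 6200.0, 7800.0, 9500.0])
obs_peaks = np.array([3001.9, 4502.1, 6202.0, 7801.8, 9509.0])  # last match is an outlier

offsets = obs_peaks - ref_peaks

# A single global shift: the median offset is robust to the one bad match
shift = float(np.median(offsets))
mean_shift = float(np.mean(offsets))  # shown only for comparison

corrected = obs_peaks - shift
print(f"Median shift: {shift:.2f} Da, mean shift: {mean_shift:.2f} Da")
```

Using the median rather than the mean keeps a single mismatched peak from dominating the estimate.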
[4]:
# Linear method
pipe_linear = Pipeline(
    [
        ("warp", Warping(method="linear")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)
scores = cross_val_score(pipe_linear, X, y, cv=cv, scoring="roc_auc")
print(f"Linear - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
Linear - CV ROC AUC: 0.400 +/- 0.289
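A least-squares linear transformation can be sketched as fitting a slope and intercept that map observed peak positions back onto reference positions. The peak values and the simulated drift below are assumptions made for illustration; the library's own fitting procedure may differ.

```python
import numpy as np

# Hypothetical reference peak positions (m/z, in Da)
ref_peaks = np.array([3000.0, 4500.0, 6200.0, 7800.0, 9500.0])

# Simulated calibration drift: a 0.05% stretch plus a 1 Da offset
obs_peaks = 1.0005 * ref_peaks + 1.0

# Least-squares fit of: reference = a * observed + b
a, b = np.polyfit(obs_peaks, ref_peaks, deg=1)
corrected = a * obs_peaks + b
linear_residual = float(np.max(np.abs(corrected - ref_peaks)))

# For comparison: a single global shift cannot undo the stretch
shift_only = obs_peaks - np.median(obs_peaks - ref_peaks)
shift_residual = float(np.max(np.abs(shift_only - ref_peaks)))

print(f"Max residual after linear fit: {linear_residual:.2e} Da")
print(f"Max residual after shift only: {shift_residual:.2f} Da")
```

When the drift scales with m/z, the linear fit removes it while a constant shift leaves an error at both ends of the mass range.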
[5]:
# Piecewise method (often best trade-off)
pipe_piecewise = Pipeline(
    [
        ("warp", Warping(method="piecewise", n_segments=10)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)
scores = cross_val_score(pipe_piecewise, X, y, cv=cv, scoring="roc_auc")
print(f"Piecewise - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
Piecewise - CV ROC AUC: 0.400 +/- 0.289
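The piecewise idea, one local shift per spectrum segment, can be sketched as follows. The helper `piecewise_shifts`, the peak values, and the 2000 to 20000 m/z range are assumptions for illustration, and the real method may handle empty segments or interpolate between segments differently.

```python
import numpy as np


def piecewise_shifts(ref_peaks, obs_peaks, mz_min, mz_max, n_segments):
    """Median offset of matched peaks within each equal-width m/z segment."""
    edges = np.linspace(mz_min, mz_max, n_segments + 1)
    # Segment index of each reference peak, clipped so mz_max lands in the last segment
    idx = np.clip(np.digitize(ref_peaks, edges) - 1, 0, n_segments - 1)
    offsets = obs_peaks - ref_peaks
    shifts = np.zeros(n_segments)
    for s in range(n_segments):
        in_segment = idx == s
        if in_segment.any():  # segments without peaks keep a zero shift
            shifts[s] = np.median(offsets[in_segment])
    return shifts, idx


# Hypothetical matched peaks whose drift grows with m/z
ref_peaks = np.array([2500.0, 3500.0, 9000.0, 11000.0, 17000.0, 19000.0])
obs_peaks = ref_peaks + np.array([1.0, 1.0, 2.0, 2.0, 4.0, 4.0])

shifts, idx = piecewise_shifts(ref_peaks, obs_peaks, 2000.0, 20000.0, n_segments=3)
corrected = obs_peaks - shifts[idx]
print(f"Per-segment shifts (Da): {shifts.tolist()}")
```

More segments follow local drift more closely, but each segment then has fewer peaks from which to estimate its shift.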
Alignment Quality Assessment#
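Use get_alignment_quality() to measure how well spectra were aligned to the reference. It returns one row per spectrum with the correlation and RMSE before and after alignment, plus the change in correlation.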
[6]:
# Fit warping and check alignment quality
warper = Warping(method="piecewise", n_segments=10)
warper.fit(X)
X_aligned = warper.transform(X)
# Get alignment quality metrics
quality = warper.get_alignment_quality(X, X_aligned)
print(f"Mean correlation improvement: {quality['improvement'].mean():.4f}")
quality.head()
Mean correlation improvement: 0.0056
[6]:
| | correlation_before | correlation_after | improvement | rmse_before | rmse_after |
|---|---|---|---|---|---|
| 10s | 0.850781 | 0.850781 | 0.000000 | 0.000137 | 0.000137 |
| 11s | 0.854397 | 0.854397 | 0.000000 | 0.000185 | 0.000185 |
| 12s | 0.898360 | 0.898360 | 0.000000 | 0.000192 | 0.000192 |
| 13s | 0.817404 | 0.817404 | 0.000000 | 0.000240 | 0.000240 |
| 14s | 0.825112 | 0.825087 | -0.000024 | 0.000177 | 0.000177 |
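The columns in this table can be reproduced in spirit with a few lines of NumPy. The sketch below assumes Pearson correlation and root-mean-square error against a reference spectrum, with improvement taken as correlation_after minus correlation_before, which matches the rows shown. The function name and the synthetic spectra are illustrative, and the library's exact definitions may differ.

```python
import numpy as np


def alignment_quality(spectrum_before, spectrum_after, reference):
    """Correlation and RMSE against a reference, before and after alignment."""

    def corr(a, b):
        return float(np.corrcoef(a, b)[0, 1])

    def rmse(a, b):
        return float(np.sqrt(np.mean((a - b) ** 2)))

    before = corr(spectrum_before, reference)
    after = corr(spectrum_after, reference)
    return {
        "correlation_before": before,
        "correlation_after": after,
        "improvement": after - before,
        "rmse_before": rmse(spectrum_before, reference),
        "rmse_after": rmse(spectrum_after, reference),
    }


# Synthetic example: one Gaussian peak, drifted by 4 bins and then shifted back
bins = np.arange(200)
reference = np.exp(-0.5 * ((bins - 100) / 3.0) ** 2)
drifted = np.roll(reference, 4)
realigned = np.roll(drifted, -4)

quality_row = alignment_quality(drifted, realigned, reference)
print({name: round(value, 4) for name, value in quality_row.items()})
```

An improvement near zero, as in the rows above, means the aligned spectrum correlates with the reference about as well as the unaligned one did.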
Raw Spectra Warping#
RawWarping performs alignment at full m/z resolution (before binning) for higher precision. It loads raw spectra files during fit/transform and outputs properly binned data.
Key workflow:
Use create_raw_input() to create an input DataFrame with file paths
Pass this DataFrame to RawWarping in your pipeline
Get properly binned, aligned spectra as output
This design makes RawWarping fully compatible with sklearn pipelines.
[7]:
# Create input DataFrame from raw spectra directory
X_raw = create_raw_input("../data/")
print(f"Input DataFrame shape: {X_raw.shape}")
print(f"Columns: {X_raw.columns.tolist()}")
X_raw.head()
Input DataFrame shape: (29, 1)
Columns: ['path']
[7]:
| | path |
|---|---|
| 10s | ../data/10s.txt |
| 11s | ../data/11s.txt |
| 12s | ../data/12s.txt |
| 13s | ../data/13s.txt |
| 14s | ../data/14s.txt |
[8]:
# RawWarping in a pipeline - outputs binned spectra
raw_warper = RawWarping(
    method="piecewise",
    bin_width=3,
    max_shift_da=10.0,
    n_segments=5,
)
# Fit and transform - loads raw files, warps at full resolution, bins output
raw_warper.fit(X_raw)
X_raw_aligned = raw_warper.transform(X_raw)
print(f"Input shape: {X_raw.shape} (single 'path' column)")
print(f"Output shape: {X_raw_aligned.shape} (binned spectra)")
print(
    f"Output columns are m/z bin starting points: {X_raw_aligned.columns[:5].tolist()}..."
)
Input shape: (29, 1) (single 'path' column)
Output shape: (29, 6000) (binned spectra)
Output columns are m/z bin starting points: ['2000', '2003', '2006', '2009', '2012']...
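To show how a raw spectrum could end up as 6000 columns labelled by bin start, here is a minimal binning sketch. It assumes a 2000 to 20000 m/z range (inferred from the output above) and sums intensities within each bin. Both are assumptions, as is the helper name `bin_spectrum`; the library may aggregate differently.

```python
import numpy as np


def bin_spectrum(mz, intensity, mz_min=2000, mz_max=20000, bin_width=3):
    """Sum intensities into fixed-width m/z bins labelled by their starting point."""
    edges = np.arange(mz_min, mz_max + bin_width, bin_width)
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    labels = [str(int(edge)) for edge in edges[:-1]]
    return labels, binned


# Hypothetical raw spectrum: (m/z, intensity) pairs at full resolution
mz = np.array([2000.5, 2001.2, 2004.0, 2011.9, 19999.0])
intensity = np.array([1.0, 2.0, 5.0, 3.0, 4.0])

labels, binned = bin_spectrum(mz, intensity)
print(f"Number of bins: {len(binned)}")
print(f"First labels: {labels[:5]}")
print(f"First bin values: {binned[:5].tolist()}")
```

With a bin width of 3 over this range, the sketch produces 6000 bins whose first labels are '2000', '2003', '2006', matching the column names printed above.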
Parallelization#
Use the n_jobs parameter to enable parallel processing for faster computation; n_jobs=-1 uses all available cores.
[9]:
# Parallel warping (use all cores)
warper_parallel = Warping(method="piecewise", n_segments=10, n_jobs=-1)
warper_parallel.fit(X)
X_aligned_parallel = warper_parallel.transform(X)
print(f"Aligned {len(X)} spectra")
Aligned 29 spectra
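Per-spectrum alignment parallelizes well because each spectrum is warped against the same reference independently. The sketch below illustrates that pattern with the standard library. It uses a simple cross-correlation shift, not the library's piecewise method, and the helpers `resolve_n_jobs` and `align_one` are hypothetical. How MaldiAMRKit schedules its workers internally is not shown in this notebook.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def resolve_n_jobs(n_jobs):
    """Map the n_jobs convention (-1 means all cores) to a worker count."""
    return (os.cpu_count() or 1) if n_jobs == -1 else max(1, n_jobs)


def align_one(spectrum, reference, max_shift=10):
    """Shift one spectrum by the integer lag that maximizes overlap with the reference."""
    lags = list(range(-max_shift, max_shift + 1))
    scores = [np.dot(np.roll(spectrum, -lag), reference) for lag in lags]
    best_lag = lags[int(np.argmax(scores))]
    return np.roll(spectrum, -best_lag)


# Synthetic data: one reference peak and five copies drifted by different amounts
bins = np.arange(300)
reference = np.exp(-0.5 * ((bins - 150) / 3.0) ** 2)
spectra = [np.roll(reference, k) for k in (-4, -1, 0, 2, 5)]

# Each spectrum is an independent task, so the work maps cleanly onto a pool
with ThreadPoolExecutor(max_workers=resolve_n_jobs(-1)) as pool:
    aligned = list(pool.map(lambda s: align_one(s, reference), spectra))

print(f"Aligned {len(aligned)} spectra")
```

With only a handful of spectra the pool overhead can outweigh the gain; parallelism tends to pay off as the number of spectra or the cost per alignment grows.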
RawWarping in sklearn Pipeline#
Since RawWarping accepts a path-based DataFrame and outputs binned spectra, it integrates seamlessly into sklearn pipelines.
[10]:
# Full pipeline: raw spectra -> alignment -> scaling -> classification
pipe_raw = Pipeline(
    [
        ("warp", RawWarping(method="piecewise", bin_width=3, n_segments=5)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)
# Cross-validation with RawWarping pipeline
# Note: X_raw contains file paths, y contains labels
scores = cross_val_score(pipe_raw, X_raw, y, cv=cv, scoring="roc_auc")
print(f"RawWarping Pipeline - CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
RawWarping Pipeline - CV ROC AUC: 0.375 +/- 0.250