MaldiAMRKit - Peak Detection#

This notebook covers peak detection methods including local maxima and persistent homology.

Import Libraries#

[ ]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from maldiamrkit import MaldiSet
from maldiamrkit.detection import MaldiPeakDetector
from maldiamrkit.visualization import plot_peaks

Load Dataset#

[2]:

data = MaldiSet.from_directory(
    "../data/",
    "../data/metadata/metadata.csv",
    aggregate_by=dict(antibiotics="Drug"),
)
X = data.X
y = data.y["Drug"].map({"S": 0, "I": 1, "R": 1})

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")

Features shape: (29, 6000)
Labels shape: (29,)

Peak Detection with Local Maxima#

The MaldiPeakDetector uses local maxima detection by default. It’s fast and works well for most cases.

[3]:

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipe = Pipeline(
    [
        ("peaks", MaldiPeakDetector(binary=False, prominence=1e-8)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

CV ROC AUC: 0.417 +/- 0.260

Peak Detection with Persistent Homology#

Persistent homology (method="ph") is a topological approach that can better handle noise. It’s slower but often more robust.

Note: Requires the gudhi package.

[4]:

peak_detector = MaldiPeakDetector(
    method="ph",
    binary=False,
    persistence_threshold=5e-4,
)

# Fit on a subset and get statistics
peak_detector.fit(X.iloc[:5])
peak_detector.get_peak_statistics(X.iloc[:5])

[4]:

	n_peaks	mean_intensity	max_intensity
10s	40	0.001594	0.005164
11s	43	0.001695	0.006086
12s	48	0.001844	0.007017
13s	56	0.001946	0.007615
14s	49	0.001784	0.005047

Visualize Detected Peaks#

[ ]:

_ = plot_peaks(peak_detector, X.iloc[:1])

Binary vs Intensity Mode#

The binary parameter controls whether peak positions are returned as binary (0/1) or preserve original intensities.

[6]:

# Binary mode - peaks are marked as 1, non-peaks as 0
detector_binary = MaldiPeakDetector(binary=True, prominence=1e-4)
peaks_binary = detector_binary.fit_transform(X.iloc[:1])
print(f"Binary mode unique values: {peaks_binary.values.flatten()[:20]}")

# Intensity mode - preserves original intensities at peak positions
detector_intensity = MaldiPeakDetector(binary=False, prominence=1e-4)
peaks_intensity = detector_intensity.fit_transform(X.iloc[:1])
print(
    f"Intensity mode - non-zero values: {peaks_intensity.values[peaks_intensity.values > 0][:10]}"
)

Binary mode unique values: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
Intensity mode - non-zero values: [0.0006844  0.00347906 0.00088132 0.00204877 0.00098946 0.00162945
 0.00086379 0.00157396 0.00082802 0.00031478]

Parallelization#

Use n_jobs parameter to enable parallel processing for faster computation on large datasets.

[7]:

# Sequential processing
detector_seq = MaldiPeakDetector(prominence=1e-4, n_jobs=1)

# Parallel processing (use all cores)
detector_par = MaldiPeakDetector(prominence=1e-4, n_jobs=-1)

# Both produce the same results
result = detector_par.fit_transform(X)
print(f"Processed {len(X)} spectra")

Processed 29 spectra