Alignment Module

Spectral alignment and warping transformers.

Both Warping and RawWarping support parallel processing via the n_jobs parameter. Use n_jobs=-1 to utilize all available CPU cores.

Warping (Binned Spectra)

class maldiamrkit.alignment.Warping(peak_detector=None, reference='median', method=AlignmentMethod.shift, n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, min_reference_peaks=5, n_jobs=1)

Bases: BaseEstimator, TransformerMixin

Align MALDI-TOF spectra to a reference using different strategies.

Supports multiple alignment methods for correcting mass calibration drift in binned spectra.

Parameters:

peak_detector (MaldiPeakDetector, optional) – Peak detector used to find peaks in spectra. If None, a default detector is created with binary=True and prominence=1e-5.
reference (str or int, default="median") – How to choose the reference spectrum:
- "median" : median spectrum across all samples
- int : use that row index as reference
method (str or AlignmentMethod, default="shift") – Alignment method:
- "shift" : global median shift
- "linear" : least-squares linear transform
- "piecewise" : local median shifts across segments
- "dtw" : dynamic time warping
- "quadratic" : quadratic polynomial fit on matched peak pairs
- "cubic" : cubic polynomial fit on matched peak pairs
- "lowess" : LOWESS (Cleveland 1979) non-linear warping
n_segments (int, default=5) – Number of segments for piecewise warping.
max_shift (int, default=50) – Maximum allowed shift in bins (used as a fallback for the shift, linear, polynomial, and LOWESS methods when too few peaks match).
dtw_radius (int, default=10) – Radius constraint for DTW to limit warping path search space.
smooth_sigma (float, default=2.0) – Gaussian smoothing parameter for piecewise segment shifts.
lowess_frac (float, default=0.3) – LOWESS smoothing bandwidth (fraction of matched peaks used for each local fit). Applies when method="lowess".
lowess_it (int, default=3) – Number of LOWESS robustness iterations. Applies when method="lowess".
min_reference_peaks (int, default=5) – Minimum number of peaks expected in reference for quality check.
n_jobs (int, default=1) – Number of parallel jobs for transform. Use -1 for all available cores, 1 for sequential processing. Parallelization is particularly beneficial for the “dtw” method which is CPU-intensive.
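
The cost of "dtw" and the role of dtw_radius can be pictured with a band-constrained dynamic time warping sketch. This is a generic Sakoe-Chiba band implementation in NumPy, not the code maldiamrkit runs; the function name dtw_cost and the toy spectra are assumptions made for illustration.

```python
import numpy as np

def dtw_cost(a, b, radius):
    """Banded DTW distance between two 1-D arrays (Sakoe-Chiba band)."""
    n, m = len(a), len(b)
    # Widen the band if the lengths differ so the final cell stays reachable.
    r = max(radius, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= r are evaluated: O(n * r) instead of O(n * m).
        lo, hi = max(1, i - r), min(m, i + r)
        for j in range(lo, hi + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

ref = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
shifted = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # same peak, one bin later

dtw_no_warp = dtw_cost(shifted, ref, radius=0)      # band of width 0: plain L1 distance
dtw_within_band = dtw_cost(shifted, ref, radius=1)  # one bin of warping allowed
```

With radius=0 no warping is possible and the two misplaced peaks are both penalised; with radius=1 the path can absorb the one-bin drift. A larger radius tolerates larger drift at a proportionally higher cost per spectrum, which is where running spectra in parallel helps.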

Variables:

ref_spec (np.ndarray) – The fitted reference spectrum (stored after fit()).
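
As a rough picture of the "shift" method, the sketch below moves a binned spectrum by the median offset between its peaks and the nearest reference peaks. It is a minimal NumPy illustration under stated assumptions (nearest-neighbour peak matching, zero padding at the edges, no shift when nothing matches); the function name median_shift is hypothetical and the library's actual matching and fallback rules may differ.

```python
import numpy as np

def median_shift(row, peaks, ref_peaks, max_shift=50):
    """Shift `row` by the median offset from each peak to its nearest reference peak."""
    row = np.asarray(row, dtype=float)
    peaks = np.asarray(peaks, dtype=int)
    ref_peaks = np.asarray(ref_peaks, dtype=int)
    if peaks.size == 0 or ref_peaks.size == 0:
        return row.copy()  # assumption: nothing to match, leave the spectrum as is
    # Nearest reference peak (in bins) for every sample peak.
    nearest = ref_peaks[np.abs(ref_peaks[None, :] - peaks[:, None]).argmin(axis=1)]
    offsets = nearest - peaks
    offsets = offsets[np.abs(offsets) <= max_shift]  # discard implausible matches
    if offsets.size == 0:
        return row.copy()
    shift = int(np.round(np.median(offsets)))
    aligned = np.zeros_like(row)
    if shift > 0:
        aligned[shift:] = row[:-shift]
    elif shift < 0:
        aligned[:shift] = row[-shift:]
    else:
        aligned[:] = row
    return aligned

row = np.zeros(20)
row[[5, 12]] = 1.0                      # sample peaks at bins 5 and 12
aligned = median_shift(row, peaks=[5, 12], ref_peaks=[3, 10])
```

Here both peaks sit two bins to the right of the reference, so the whole row is moved two bins to the left.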

Examples

>>> from maldiamrkit.alignment import Warping
>>> warper = Warping(method="shift")
>>> warper.fit(X_train)
>>> X_aligned = warper.transform(X_test)

__init__(peak_detector=None, reference='median', method=AlignmentMethod.shift, n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, min_reference_peaks=5, n_jobs=1)

Parameters:

peak_detector (MaldiPeakDetector | None)
reference (str | int)
method (str | AlignmentMethod)
n_segments (int)
max_shift (int)
dtw_radius (int)
smooth_sigma (float)
lowess_frac (float)
lowess_it (int)
min_reference_peaks (int)
n_jobs (int)

Return type:

None

fit(X, y=None)

Fit the transformer by selecting or computing the reference spectrum.

Parameters:

X (pd.DataFrame) – Input spectra with shape (n_samples, n_bins).
y (array-like, optional) – Target values (ignored).

Returns:

self – Fitted transformer.

Return type:

Warping

Raises:

ValueError – If the input DataFrame is empty, the reference index is out of bounds, the reference specifier is unsupported, the warping method is unknown, or parameters are invalid.

transform(X)

Transform spectra by aligning them to the reference.

Parameters:

X (pd.DataFrame) – Input spectra with shape (n_samples, n_bins).

Returns:

X_aligned – Aligned spectra with same shape as input.

Return type:

pd.DataFrame

Raises:

RuntimeError – If the transformer has not been fitted.
ValueError – If the number of features in X does not match the reference spectrum length.

get_alignment_quality(X_original, X_aligned=None)

Compute alignment quality metrics.

Parameters:

X_original (pd.DataFrame) – Original (unaligned) spectra.
X_aligned (pd.DataFrame, optional) – Aligned spectra. If None, they are computed by calling transform().

Returns:

Quality metrics with columns:
- correlation_before: Pearson correlation with reference (before)
- correlation_after: Pearson correlation with reference (after)
- improvement: correlation_after - correlation_before
- rmse_before: RMSE with reference (before)
- rmse_after: RMSE with reference (after)
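
The listed metrics can be reproduced for a single spectrum with a few lines of NumPy. This is a sketch of the definitions only (Pearson correlation and root-mean-square error against the reference); the function name alignment_quality and the synthetic Gaussian peak are assumptions, and the library returns one such row per sample in a DataFrame.

```python
import numpy as np

def alignment_quality(ref, before, after):
    """Metrics for one spectrum; all inputs are 1-D arrays of equal length."""
    corr_before = float(np.corrcoef(ref, before)[0, 1])
    corr_after = float(np.corrcoef(ref, after)[0, 1])
    return {
        "correlation_before": corr_before,
        "correlation_after": corr_after,
        "improvement": corr_after - corr_before,
        "rmse_before": float(np.sqrt(np.mean((ref - before) ** 2))),
        "rmse_after": float(np.sqrt(np.mean((ref - after) ** 2))),
    }

bins = np.arange(50)
ref = np.exp(-0.5 * ((bins - 25) / 2.0) ** 2)  # single Gaussian peak at bin 25
before = np.roll(ref, 3)                        # same peak, drifted by 3 bins
after = ref.copy()                              # idealised perfect alignment
metrics = alignment_quality(ref, before, after)
```

A positive improvement together with a lower rmse_after indicates that warping moved the spectrum closer to the reference.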

Return type:

pd.DataFrame

Raises:

RuntimeError – If the transformer has not been fitted.

RawWarping (Full Resolution)

class maldiamrkit.alignment.RawWarping(method=AlignmentMethod.shift, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, max_shift_da=50.0, n_segments=5, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, reference='median', pipeline=None, peak_detector=None, min_reference_peaks=5, interp_step=0.5, n_jobs=1)

Bases: BaseEstimator, TransformerMixin

Align MALDI-TOF spectra using raw (full resolution) data.

Unlike Warping (which operates on binned data), RawWarping:
- Loads original raw spectra from file paths
- Performs warping at full m/z resolution
- Outputs binned spectra for pipeline compatibility

This approach provides more accurate alignment by avoiding binning artifacts during the warping process.

Parameters:

method (str or AlignmentMethod, default="shift") – Warping method:
- "shift" : global m/z shift in Daltons
- "linear" : linear m/z transformation (mz' = a*mz + b)
- "piecewise" : segment-wise m/z shifts with smoothing
- "dtw" : dynamic time warping
- "quadratic" : quadratic polynomial fit on matched peak pairs
- "cubic" : cubic polynomial fit on matched peak pairs
- "lowess" : LOWESS (Cleveland 1979) non-linear warping
bin_width (float, default=3) – Width of output bins in Daltons.
bin_method (str or BinningMethod, default="uniform") – Binning method. One of "uniform", "proportional", "adaptive", "custom".
bin_kwargs (dict, optional) – Additional keyword arguments for binning.
max_shift_da (float, default=50.0) – Maximum allowed shift in Daltons.
n_segments (int, default=5) – Number of segments for piecewise warping.
dtw_radius (int, default=10) – Radius constraint for DTW.
smooth_sigma (float, default=2.0) – Gaussian smoothing for piecewise transitions.
lowess_frac (float, default=0.3) – LOWESS smoothing bandwidth (fraction of matched peaks used for each local fit). Applies when method="lowess".
lowess_it (int, default=3) – Number of LOWESS robustness iterations. Applies when method="lowess".
reference (str or int, default="median") – Reference selection: “median” or int index.
pipeline (PreprocessingPipeline, optional) – Settings for preprocessing raw spectra.
peak_detector (MaldiPeakDetector, optional) – Peak detector used to find peaks in spectra. If None, a default detector is created with binary=True and prominence=1e-5.
min_reference_peaks (int, default=5) – Minimum peaks expected in reference.
interp_step (float, default=0.5) – Step size in Daltons for the common m/z grid used when computing a median reference spectrum.
n_jobs (int, default=1) – Number of parallel jobs for transform. Use -1 for all available cores, 1 for sequential processing.

Variables:

ref_mz (np.ndarray) – Reference spectrum m/z values (after fit).
ref_intensity (np.ndarray) – Reference spectrum intensities (after fit).
ref_peaks_mz (np.ndarray) – Peak m/z positions in reference spectrum.
output_columns (pd.Index) – Column names for output DataFrame (m/z bin centers).
pipeline (PreprocessingPipeline) – Preprocessing configuration used.

Examples

>>> from maldiamrkit.alignment import RawWarping, create_raw_input
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # Create input DataFrame from directory
>>> X_raw = create_raw_input("spectra/")
>>>
>>> # Use in sklearn pipeline
>>> pipe = Pipeline([
...     ("warp", RawWarping(method="piecewise", bin_width=3)),
...     ("scaler", StandardScaler()),
...     ("clf", RandomForestClassifier())
... ])
>>> pipe.fit(X_raw, y)

Notes

Input DataFrame X must have:
- Index: sample IDs
- Column "path": paths to raw spectrum files

Use create_raw_input() to easily create this DataFrame from a directory.
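
The median reference mentioned under interp_step can be pictured as follows: each raw spectrum has its own m/z axis, so the spectra are first interpolated onto one common grid and the median is then taken bin by bin. This is a NumPy sketch under stated assumptions (linear interpolation, grid restricted to the m/z range shared by all spectra); the function name median_reference is hypothetical and the library's grid construction may differ.

```python
import numpy as np

def median_reference(spectra, interp_step=0.5):
    """Median spectrum on a common grid.

    spectra: list of (mz, intensity) array pairs, each with increasing mz.
    """
    # Assumption: use only the m/z range covered by every spectrum.
    lo = max(mz[0] for mz, _ in spectra)
    hi = min(mz[-1] for mz, _ in spectra)
    grid = np.arange(lo, hi + interp_step / 2, interp_step)
    # Resample every spectrum onto the shared grid, then take the per-point median.
    stacked = np.vstack([np.interp(grid, mz, inten) for mz, inten in spectra])
    return grid, np.median(stacked, axis=0)

spectra = [
    (np.linspace(2000, 2010, 21), np.full(21, 1.0)),  # 0.5 Da spacing
    (np.linspace(1999, 2011, 13), np.full(13, 2.0)),  # 1 Da spacing, wider range
    (np.linspace(2000, 2010, 11), np.full(11, 9.0)),  # 1 Da spacing
]
ref_mz, ref_intensity = median_reference(spectra, interp_step=0.5)
```

A smaller interp_step preserves more of the raw resolution in the reference at the price of a longer grid.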

__init__(method=AlignmentMethod.shift, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, max_shift_da=50.0, n_segments=5, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, reference='median', pipeline=None, peak_detector=None, min_reference_peaks=5, interp_step=0.5, n_jobs=1)

Parameters:

method (str | AlignmentMethod)
bin_width (float)
bin_method (str | BinningMethod)
bin_kwargs (dict | None)
max_shift_da (float)
n_segments (int)
dtw_radius (int)
smooth_sigma (float)
lowess_frac (float)
lowess_it (int)
reference (str | int)
pipeline (PreprocessingPipeline | None)
peak_detector (MaldiPeakDetector | None)
min_reference_peaks (int)
interp_step (float)
n_jobs (int)

Return type:

None

fit(X, y=None)

Fit by computing the reference spectrum from raw data.

Parameters:

X (pd.DataFrame) – Input DataFrame with sample IDs as index and a “path” column containing paths to raw spectrum files. Use create_raw_input() to easily create this DataFrame.
y (array-like, optional) – Target values (ignored).

Returns:

self – Fitted transformer.

Return type:

RawWarping

Raises:

ValueError – If the input DataFrame is empty, lacks a ‘path’ column, or uses an unknown warping method.

transform(X)

Transform spectra by loading raw data, warping, and binning.

Parameters:

X (pd.DataFrame) – Input DataFrame with sample IDs as index and a “path” column containing paths to raw spectrum files.

Returns:

X_aligned – Aligned and binned spectra with sample IDs as index and m/z bin centers as columns.

Return type:

pd.DataFrame

Raises:

RuntimeError – If the transformer has not been fitted.
ValueError – If the input DataFrame lacks a ‘path’ column.

get_alignment_quality(X, X_aligned=None)

Compute alignment quality metrics.

Parameters:

X (pd.DataFrame) – Input DataFrame with “path” column.
X_aligned (pd.DataFrame, optional) – Aligned spectra. If None, they are computed by calling transform().

Returns:

Quality metrics per sample with columns:
- correlation_before: Pearson correlation with reference (before)
- correlation_after: Pearson correlation with reference (after)
- improvement: correlation_after - correlation_before

Return type:

pd.DataFrame

Raises:

RuntimeError – If the transformer has not been fitted.

Alignment Strategies

class maldiamrkit.alignment.AlignmentStrategy

Base class for alignment strategies.

abstractmethod align_binned(row, peaks, ref_peaks, mz_axis)

Align a binned spectrum row to the reference.

Parameters:

row (np.ndarray) – Intensity values of the spectrum to align.
peaks (np.ndarray) – Detected peak indices in row.
ref_peaks (np.ndarray) – Detected peak indices in the reference spectrum.
mz_axis (np.ndarray) – Array of bin positions (e.g. np.arange(len(row))).

Returns:

Aligned intensity array with the same length as row.

Return type:

np.ndarray

abstractmethod align_raw(mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity)

Align a raw spectrum to the reference.

Parameters:

mz (np.ndarray) – m/z values of the spectrum to align.
intensity (np.ndarray) – Intensity values of the spectrum to align.
peaks_mz (np.ndarray) – Detected peak m/z positions in the sample spectrum.
ref_peaks_mz (np.ndarray) – Detected peak m/z positions in the reference spectrum.
ref_mz (np.ndarray) – m/z values of the reference spectrum.
ref_intensity (np.ndarray) – Intensity values of the reference spectrum.

Returns:

aligned_mz (np.ndarray) – Aligned m/z values.
aligned_intensity (np.ndarray) – Aligned intensity values.

Return type:

tuple[ndarray, ndarray]
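
The two-method contract can be illustrated with a self-contained stand-in. StrategySketch below is not maldiamrkit.alignment.AlignmentStrategy; it is a hypothetical abstract class that copies only the two documented signatures, and IdentityStrategy is a deliberately trivial implementation that shows the expected return shapes.

```python
from abc import ABC, abstractmethod

import numpy as np

class StrategySketch(ABC):
    """Stand-in mirroring the two documented abstract methods (not the library class)."""

    @abstractmethod
    def align_binned(self, row, peaks, ref_peaks, mz_axis):
        """Return an intensity array with the same length as `row`."""

    @abstractmethod
    def align_raw(self, mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity):
        """Return a tuple (aligned_mz, aligned_intensity)."""

class IdentityStrategy(StrategySketch):
    """No-op strategy: returns copies of its inputs unchanged."""

    def align_binned(self, row, peaks, ref_peaks, mz_axis):
        return np.asarray(row, dtype=float).copy()

    def align_raw(self, mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity):
        return np.asarray(mz, dtype=float).copy(), np.asarray(intensity, dtype=float).copy()

strategy = IdentityStrategy()
row = np.array([0.0, 1.0, 0.0])
aligned_row = strategy.align_binned(row, np.array([1]), np.array([1]), np.arange(3))
aligned_mz, aligned_intensity = strategy.align_raw(
    np.array([2000.0, 2001.0]), np.array([5.0, 7.0]),
    np.array([2001.0]), np.array([2001.0]),
    np.array([2000.0, 2001.0]), np.array([4.0, 8.0]),
)
```

A subclass that omits either method cannot be instantiated, which is what the abstractmethod markers on both methods imply.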

Alignment Methods

class maldiamrkit.alignment.AlignmentMethod(value)

Bases: str, Enum

Supported alignment/warping methods.

Variables:

shift (str) – Rigid global shift alignment.
linear (str) – Linear (affine) recalibration.
piecewise (str) – Piecewise-linear recalibration.
dtw (str) – Dynamic time warping alignment.
quadratic (str) – Quadratic polynomial recalibration.
cubic (str) – Cubic polynomial recalibration.
lowess (str) – Non-linear LOWESS (Cleveland 1979) recalibration.

shift = 'shift'

linear = 'linear'

piecewise = 'piecewise'

dtw = 'dtw'

quadratic = 'quadratic'

cubic = 'cubic'

lowess = 'lowess'
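
Because the enum derives from both str and Enum, plain strings and enum members are interchangeable in comparisons, which is why the method parameter accepts either. The sketch below uses a hypothetical stand-in enum (MethodSketch, with only three members) and a hypothetical normalize helper to show standard Python behaviour; it does not reproduce the library's own validation code or error messages.

```python
from enum import Enum

class MethodSketch(str, Enum):
    """Stand-in with the same str/Enum bases as the documented enum."""
    shift = "shift"
    linear = "linear"
    dtw = "dtw"

def normalize(method):
    """Accept a string or a member and return the enum member."""
    try:
        return MethodSketch(method)
    except ValueError:
        valid = ", ".join(m.value for m in MethodSketch)
        raise ValueError(f"Unknown method {method!r}; expected one of: {valid}") from None

from_string = normalize("shift")
from_member = normalize(MethodSketch.shift)
```

Both calls return the same member, and a member compares equal to its string value, so user code can keep passing plain strings.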

Utility Functions

maldiamrkit.alignment.create_raw_input(spectra_dir, sample_ids=None, file_extension='.txt', duplicate_strategy=DuplicateStrategy.first)

Create input DataFrame for RawWarping from a directory of spectrum files.

This utility function creates a DataFrame suitable for use with RawWarping in sklearn pipelines. The DataFrame has sample IDs as index and file paths as values.

Parameters:

spectra_dir (str or Path) – Directory containing raw spectrum files.
sample_ids (list of str, optional) – List of sample IDs. If None, discovers all files matching the extension in spectra_dir and uses filenames (without extension) as sample IDs.
file_extension (str, default=".txt") – File extension for spectrum files.
duplicate_strategy (str or DuplicateStrategy, default="first") – How to handle duplicate sample IDs (e.g. the same sample appearing in multiple year subdirectories):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate with _repN suffixes.
- "average" – keep all replicates and add an _original_id column so that RawWarping.transform() can group and average them.

Returns:

DataFrame with:
- Index: sample IDs
- Column "path": full paths to spectrum files

Return type:

pd.DataFrame

Raises:

ValueError – If no files with the specified extension are found in the directory.
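
The effect of the duplicate_strategy options on a list of discovered files can be sketched in pure Python. The helper resolve_duplicates is hypothetical, and two details are assumptions rather than documented behaviour: "drop" is read as discarding every occurrence of a duplicated ID, and _repN numbering is taken to start at 1. The "average" strategy is left out because it depends on the _original_id column handled inside RawWarping.transform().

```python
from collections import defaultdict

def resolve_duplicates(pairs, strategy="first"):
    """Map sample IDs to paths.

    pairs: list of (sample_id, path) tuples in discovery order.
    """
    groups = defaultdict(list)
    for sample_id, path in pairs:
        groups[sample_id].append(path)

    resolved = {}
    for sample_id, paths in groups.items():
        if strategy == "first":
            resolved[sample_id] = paths[0]
        elif strategy == "last":
            resolved[sample_id] = paths[-1]
        elif strategy == "drop":
            # Assumption: a duplicated ID is removed entirely.
            if len(paths) == 1:
                resolved[sample_id] = paths[0]
        elif strategy == "keep_all":
            if len(paths) == 1:
                resolved[sample_id] = paths[0]
            else:
                # Assumption: replicate suffixes count from 1.
                for n, path in enumerate(paths, start=1):
                    resolved[f"{sample_id}_rep{n}"] = path
        else:
            raise ValueError(f"Unsupported strategy: {strategy!r}")
    return resolved

found = [("s1", "2019/s1.txt"), ("s2", "2019/s2.txt"), ("s1", "2020/s1.txt")]
by_strategy = {s: resolve_duplicates(found, s) for s in ("first", "last", "drop", "keep_all")}
```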

Examples

>>> from maldiamrkit.alignment import RawWarping, create_raw_input
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # Discover all .txt files in directory
>>> X_raw = create_raw_input("spectra/")
>>>
>>> # Specify sample IDs explicitly
>>> X_raw = create_raw_input("spectra/", sample_ids=["s1", "s2", "s3"])
>>>
>>> # Use in pipeline
>>> pipe = Pipeline([
...     ("warp", RawWarping(method="piecewise")),
...     ("scaler", StandardScaler()),
... ])
>>> X_binned = pipe.fit_transform(X_raw)

maldiamrkit.alignment.align_peaks(peaks, ref_peaks_mz, *, method='shift', max_shift_da=50.0, n_segments=5, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3)

Align a peak set’s m/z positions to a reference peak list.

Fit-free: ref_peaks_mz is supplied by the caller. Builds the requested warping transform from the matched (sample, reference) peak pairs and applies it to the peak m/z; intensities are unchanged.

Parameters:

peaks (PeakSet) – Peak set to align.
ref_peaks_mz (array-like) – Reference peak m/z positions to align to.
method ({"shift", "linear", "piecewise", "quadratic", "cubic", "lowess"}, default="shift") – Warping strategy. "dtw" is unsupported for peak sets because it resamples onto a dense grid and would destroy the set structure.
max_shift_da (float, default=50.0) – Maximum allowed shift in Daltons (used by the shift fallback and the rigid/linear strategies).
n_segments (int, default=5) – Number of segments for method="piecewise".
smooth_sigma (float, default=2.0) – Gaussian smoothing (Da) for piecewise transitions.
lowess_frac (float, default=0.3) – LOWESS bandwidth for method="lowess".
lowess_it (int, default=3) – LOWESS robustness iterations for method="lowess".

Returns:

A new peak set with warped m/z and unchanged intensities.

Return type:

PeakSet

Raises:

ValueError – If method="dtw".

Notes

"shift", "linear", "quadratic", "cubic" and "lowess" operate directly on the matched peak m/z and are the well-behaved choices for sparse peak sets. "piecewise" derives its Gaussian smoothing width from the median spacing of the input points; on a sparse peak set that spacing is large, so the smoothing is effectively inactive and the warp reduces to an unsmoothed piecewise-linear transform.
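
The idea of fitting a transform on matched (sample, reference) peak pairs and applying it to the peak m/z only can be sketched with NumPy. The helpers match_peaks and warp_mz are hypothetical and work on bare m/z arrays rather than a PeakSet; nearest-neighbour matching within max_shift_da and the "return unchanged when too few pairs match" fallback are assumptions, not the library's documented rules.

```python
import numpy as np

def match_peaks(peaks_mz, ref_peaks_mz, max_shift_da=50.0):
    """Pair each sample peak with its nearest reference peak within max_shift_da."""
    peaks_mz = np.asarray(peaks_mz, dtype=float)
    ref = np.asarray(ref_peaks_mz, dtype=float)
    idx = np.abs(ref[None, :] - peaks_mz[:, None]).argmin(axis=1)
    keep = np.abs(ref[idx] - peaks_mz) <= max_shift_da
    return peaks_mz[keep], ref[idx][keep]

def warp_mz(peaks_mz, ref_peaks_mz, degree=1, max_shift_da=50.0):
    """Fit a polynomial sample->reference map (degree 1, 2 or 3) and apply it."""
    peaks_mz = np.asarray(peaks_mz, dtype=float)
    sample, target = match_peaks(peaks_mz, ref_peaks_mz, max_shift_da)
    if sample.size <= degree:
        return peaks_mz.copy()  # assumption: too few pairs, leave m/z untouched
    coeffs = np.polyfit(sample, target, deg=degree)
    return np.polyval(coeffs, peaks_mz)

ref_peaks = np.array([2000.0, 4000.0, 6000.0, 8000.0])
sample_peaks = ref_peaks * 1.001 + 2.0           # simulated linear calibration drift
intensities = np.array([10.0, 40.0, 25.0, 5.0])  # carried along unchanged
warped = warp_mz(sample_peaks, ref_peaks, degree=1)
```

Only the m/z positions change; the intensities array is never passed to the warp, which mirrors the statement that intensities are left unchanged.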

Example Usage

from maldiamrkit.alignment import Warping, RawWarping, create_raw_input
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Alignment on binned data (X_binned: an existing DataFrame of binned spectra, shape (n_samples, n_bins))
warper = Warping(method="piecewise", n_jobs=-1)
X_aligned = warper.fit_transform(X_binned)

# Raw warping: create input from directory, get binned output
X_raw = create_raw_input("spectra/")  # DataFrame with file paths
raw_warper = RawWarping(method="piecewise", bin_width=3, n_jobs=-1)
X_binned = raw_warper.fit_transform(X_raw)

# Use in sklearn pipeline
pipe = Pipeline([
    ("warp", RawWarping(method="piecewise", bin_width=3)),
    ("scaler", StandardScaler()),
])
X_processed = pipe.fit_transform(X_raw)