Alignment Module

Spectral alignment and warping transformers.

Both Warping and RawWarping support parallel processing via the n_jobs parameter. Use n_jobs=-1 to utilize all available CPU cores.
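
As a rough illustration of what an n_jobs value usually means, the sketch below resolves it to a worker count using the joblib-style convention (-1 means all cores, -2 all but one) and maps a per-row function over the spectra. Both helper names are hypothetical, and threads are used only to keep the example in the standard library; how maldiamrkit dispatches work internally is not stated here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs: int) -> int:
    """Map an n_jobs value to a worker count (joblib-style convention, assumed)."""
    cpu_count = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, and so on; never below 1
        return max(1, cpu_count + 1 + n_jobs)
    return n_jobs


def map_rows(func, rows, n_jobs=1):
    """Apply func to each row, sequentially or with a pool of workers."""
    workers = resolve_n_jobs(n_jobs)
    if workers == 1:
        return [func(row) for row in rows]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order
        return list(pool.map(func, rows))


rows = [[1, 2, 3], [4, 5, 6]]
totals = map_rows(sum, rows, n_jobs=-1)
```

Because rows are processed independently, the result order matches the input order regardless of the worker count.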

Warping (Binned Spectra)

class maldiamrkit.alignment.Warping(peak_detector=None, reference='median', method=AlignmentMethod.shift, n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, min_reference_peaks=5, n_jobs=1)

Bases: BaseEstimator, TransformerMixin

Align MALDI-TOF spectra to a reference using different strategies.

Supports multiple alignment methods for correcting mass calibration drift in binned spectra.

Parameters:
  • peak_detector (MaldiPeakDetector, optional) – Peak detector used to find peaks in spectra. If None, a default detector is created with binary=True and prominence=1e-5.

  • reference (str or int, default="median") – How to choose the reference spectrum:

    • "median" – median spectrum across all samples.

    • int – use that row index as reference.

  • method (str or AlignmentMethod, default="shift") – Alignment method:

    • "shift" – global median shift.

    • "linear" – least-squares linear transform.

    • "piecewise" – local median shifts across segments.

    • "dtw" – dynamic time warping.

    • "quadratic" – quadratic polynomial fit on matched peak pairs.

    • "cubic" – cubic polynomial fit on matched peak pairs.

    • "lowess" – LOWESS (Cleveland 1979) non-linear warping.

  • n_segments (int, default=5) – Number of segments for piecewise warping.

  • max_shift (int, default=50) – Max allowed shift in bins (used as fallback for shift / linear / polynomial / LOWESS methods when too few peaks match).

  • dtw_radius (int, default=10) – Radius constraint for DTW to limit warping path search space.

  • smooth_sigma (float, default=2.0) – Gaussian smoothing parameter for piecewise segment shifts.

  • lowess_frac (float, default=0.3) – LOWESS smoothing bandwidth (fraction of matched peaks used for each local fit). Applies when method="lowess".

  • lowess_it (int, default=3) – Number of LOWESS robustness iterations. Applies when method="lowess".

  • min_reference_peaks (int, default=5) – Minimum number of peaks expected in reference for quality check.

  • n_jobs (int, default=1) – Number of parallel jobs for transform. Use -1 for all available cores, 1 for sequential processing. Parallelization is particularly beneficial for the “dtw” method which is CPU-intensive.

Variables:

ref_spec (np.ndarray) – The fitted reference spectrum (stored after fit()).

Examples

>>> from maldiamrkit.alignment import Warping
>>> warper = Warping(method="shift")
>>> warper.fit(X_train)
>>> X_aligned = warper.transform(X_test)
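
To make the "shift" method concrete, here is a minimal NumPy sketch of a global median shift on one binned spectrum. It pairs each sample peak with its nearest reference peak, takes the median offset, and moves the intensities by that many bins. The function names and the `tolerance` parameter are hypothetical, and the matching rule is an assumption; this is not the library's implementation.

```python
import numpy as np


def median_shift(peaks, ref_peaks, tolerance=50):
    """Median offset (in bins) from each sample peak to its nearest reference peak."""
    peaks = np.asarray(peaks)
    ref_peaks = np.asarray(ref_peaks)
    if peaks.size == 0 or ref_peaks.size == 0:
        return 0
    # Signed distance from every sample peak to every reference peak
    diffs = ref_peaks[None, :] - peaks[:, None]
    nearest = diffs[np.arange(peaks.size), np.abs(diffs).argmin(axis=1)]
    # Discard pairs that are farther apart than the tolerance (assumed rule)
    nearest = nearest[np.abs(nearest) <= tolerance]
    return int(np.round(np.median(nearest))) if nearest.size else 0


def apply_shift(row, shift):
    """Move intensities by `shift` bins, zero-filling the vacated edge."""
    out = np.zeros_like(row)
    if shift > 0:
        out[shift:] = row[:-shift]
    elif shift < 0:
        out[:shift] = row[-shift:]
    else:
        out[:] = row
    return out


ref = np.zeros(100)
ref[[20, 50, 80]] = 1.0
sample = np.zeros(100)
sample[[17, 47, 77]] = 1.0  # same peaks, drifted 3 bins to the left

shift = median_shift(np.flatnonzero(sample), np.flatnonzero(ref))
aligned = apply_shift(sample, shift)
```

In this synthetic case every peak pair is offset by the same amount, so the median shift restores the reference exactly.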

__init__(peak_detector=None, reference='median', method=AlignmentMethod.shift, n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, min_reference_peaks=5, n_jobs=1)

Return type:

None

fit(X, y=None)

Fit the transformer by selecting or computing the reference spectrum.

Parameters:
  • X (pd.DataFrame) – Input spectra with shape (n_samples, n_bins).

  • y (array-like, optional) – Target values (ignored).

Returns:

self – Fitted transformer.

Return type:

Warping

Raises:

ValueError – If the input DataFrame is empty, the reference index is out of bounds, the reference specifier is unsupported, the warping method is unknown, or parameters are invalid.

transform(X)

Transform spectra by aligning them to the reference.

Parameters:

X (pd.DataFrame) – Input spectra with shape (n_samples, n_bins).

Returns:

X_aligned – Aligned spectra with same shape as input.

Return type:

pd.DataFrame

Raises:
  • RuntimeError – If the transformer has not been fitted.

  • ValueError – If the number of features in X does not match the reference spectrum length.
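
For the "dtw" method, dtw_radius limits how far the warping path may stray from the diagonal. The sketch below is a textbook banded dynamic time warping (Sakoe-Chiba band) in NumPy, followed by a simple warp that averages the sample values matched to each reference bin. It illustrates the technique only; the function names are hypothetical and the library's own DTW routine may differ.

```python
import numpy as np


def dtw_path(x, y, radius=10):
    """Banded DTW: return the optimal warping path as (i, j) index pairs."""
    n, m = len(x), len(y)
    radius = max(radius, abs(n - m))  # the band must reach the end corner
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - radius), min(m, i + radius)
        for j in range(lo, hi + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end corner along the cheapest predecessors
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        k = int(np.argmin(steps))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def warp_to_reference(sample, reference, radius=10):
    """Average the sample values that the DTW path maps onto each reference bin."""
    sums = np.zeros(len(reference))
    counts = np.zeros(len(reference))
    for i, j in dtw_path(sample, reference, radius):
        sums[j] += sample[i]
        counts[j] += 1
    return sums / counts  # every reference bin is visited at least once


reference = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
sample = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # peak one bin late
aligned = warp_to_reference(sample, reference, radius=2)
```

The nested loops visit only cells inside the band, so the cost grows with n × radius rather than n × m, which is why a small radius keeps this method tractable on long spectra.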

get_alignment_quality(X_original, X_aligned=None)

Compute alignment quality metrics.

Parameters:
  • X_original (pd.DataFrame) – Original (unaligned) spectra.

  • X_aligned (pd.DataFrame, optional) – Aligned spectra. If None, they are computed by calling transform() on X_original.

Returns:

Quality metrics with columns:

  • correlation_before – Pearson correlation with reference (before alignment).

  • correlation_after – Pearson correlation with reference (after alignment).

  • improvement – correlation_after - correlation_before.

  • rmse_before – RMSE with reference (before alignment).

  • rmse_after – RMSE with reference (after alignment).

Return type:

pd.DataFrame

Raises:

RuntimeError – If the transformer has not been fitted.
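
The metrics listed above are standard quantities, and the following sketch shows how they could be computed for a single spectrum with NumPy. The function name and the dict return type are assumptions made for brevity; the documented method returns a pd.DataFrame with one row per sample.

```python
import numpy as np


def alignment_quality(reference, before, after):
    """Pearson correlation and RMSE against the reference, before and after alignment."""

    def pearson(a, b):
        return float(np.corrcoef(a, b)[0, 1])

    def rmse(a, b):
        return float(np.sqrt(np.mean((a - b) ** 2)))

    corr_before = pearson(before, reference)
    corr_after = pearson(after, reference)
    return {
        "correlation_before": corr_before,
        "correlation_after": corr_after,
        "improvement": corr_after - corr_before,
        "rmse_before": rmse(before, reference),
        "rmse_after": rmse(after, reference),
    }


reference = np.array([0.0, 1.0, 0.0, 0.0])
before = np.array([0.0, 0.0, 1.0, 0.0])  # peak one bin off
after = np.array([0.0, 1.0, 0.0, 0.0])   # peak back in place

metrics = alignment_quality(reference, before, after)
```

A positive improvement together with a lower rmse_after indicates that alignment moved the spectrum closer to the reference.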

RawWarping (Full Resolution)

class maldiamrkit.alignment.RawWarping(method=AlignmentMethod.shift, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, max_shift_da=50.0, n_segments=5, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, reference='median', pipeline=None, peak_detector=None, min_reference_peaks=5, interp_step=0.5, n_jobs=1)

Bases: BaseEstimator, TransformerMixin

Align MALDI-TOF spectra using raw (full resolution) data.

Unlike Warping (which operates on binned data), RawWarping:

  • Loads original raw spectra from file paths.

  • Performs warping at full m/z resolution.

  • Outputs binned spectra for pipeline compatibility.

This approach provides more accurate alignment by avoiding binning artifacts during the warping process.
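
The warp-then-bin order can be sketched in a few lines of NumPy. Below, a linear recalibration (mz' = a*mz + b) is fitted by least squares on peak pairs that are assumed to be already matched, applied to the raw m/z axis, and only then summed into uniform 3 Da bins. The helper names and the 2000 to 20000 Da range are illustrative assumptions, not library defaults.

```python
import numpy as np


def fit_linear_warp(peaks_mz, ref_peaks_mz):
    """Least-squares fit of ref = a * mz + b on matched peak pairs."""
    a, b = np.polyfit(peaks_mz, ref_peaks_mz, deg=1)
    return a, b


def bin_uniform(mz, intensity, mz_min, mz_max, bin_width=3.0):
    """Sum intensities into uniform bins; return bin centers and binned values."""
    edges = np.arange(mz_min, mz_max + bin_width, bin_width)
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, binned


# Synthetic sample whose m/z axis is miscalibrated by a known linear drift
ref_peaks_mz = np.array([2001.5, 5000.5, 9002.5])
peaks_mz = 0.999 * ref_peaks_mz + 2.0

a, b = fit_linear_warp(peaks_mz, ref_peaks_mz)

mz = peaks_mz.copy()
intensity = np.array([1.0, 2.0, 3.0])
warped_mz = a * mz + b  # correction applied at full m/z resolution

centers, binned = bin_uniform(warped_mz, intensity, 2000.0, 20000.0, bin_width=3.0)
```

Correcting m/z before binning means a drift smaller than one bin width can still be removed, whereas a correction applied to already-binned data can only move intensity between bins.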

Parameters:
  • method (str or AlignmentMethod, default="shift") – Warping method:

    • "shift" – global m/z shift in Daltons.

    • "linear" – linear m/z transformation (mz' = a*mz + b).

    • "piecewise" – segment-wise m/z shifts with smoothing.

    • "dtw" – dynamic time warping.

    • "quadratic" – quadratic polynomial fit on matched peak pairs.

    • "cubic" – cubic polynomial fit on matched peak pairs.

    • "lowess" – LOWESS (Cleveland 1979) non-linear warping.

  • bin_width (float, default=3) – Width of output bins in Daltons.

  • bin_method (str or BinningMethod, default="uniform") – Binning method. One of "uniform", "proportional", "adaptive", "custom".

  • bin_kwargs (dict, optional) – Additional keyword arguments for binning.

  • max_shift_da (float, default=50.0) – Maximum allowed shift in Daltons.

  • n_segments (int, default=5) – Number of segments for piecewise warping.

  • dtw_radius (int, default=10) – Radius constraint for DTW.

  • smooth_sigma (float, default=2.0) – Gaussian smoothing for piecewise transitions.

  • lowess_frac (float, default=0.3) – LOWESS smoothing bandwidth (fraction of matched peaks used for each local fit). Applies when method="lowess".

  • lowess_it (int, default=3) – Number of LOWESS robustness iterations. Applies when method="lowess".

  • reference (str or int, default="median") – Reference selection: “median” or int index.

  • pipeline (PreprocessingPipeline, optional) – Settings for preprocessing raw spectra.

  • peak_detector (MaldiPeakDetector, optional) – Peak detector used to find peaks in spectra. If None, a default detector is created with binary=True and prominence=1e-5.

  • min_reference_peaks (int, default=5) – Minimum peaks expected in reference.

  • interp_step (float, default=0.5) – Step size in Daltons for the common m/z grid used when computing a median reference spectrum.

  • n_jobs (int, default=1) – Number of parallel jobs for transform. Use -1 for all available cores, 1 for sequential processing.

Variables:
  • ref_mz (np.ndarray) – Reference spectrum m/z values (after fit).

  • ref_intensity (np.ndarray) – Reference spectrum intensities (after fit).

  • ref_peaks_mz (np.ndarray) – Peak m/z positions in reference spectrum.

  • output_columns (pd.Index) – Column names for output DataFrame (m/z bin centers).

  • pipeline (PreprocessingPipeline) – Preprocessing configuration used.

Examples

>>> from maldiamrkit.alignment import RawWarping, create_raw_input
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # Create input DataFrame from directory
>>> X_raw = create_raw_input("spectra/")
>>>
>>> # Use in sklearn pipeline
>>> pipe = Pipeline([
...     ("warp", RawWarping(method="piecewise", bin_width=3)),
...     ("scaler", StandardScaler()),
...     ("clf", RandomForestClassifier())
... ])
>>> pipe.fit(X_raw, y)

Notes

Input DataFrame X must have:

  • Index: sample IDs.

  • Column "path": paths to raw spectrum files.

Use create_raw_input() to easily create this DataFrame from a directory.

__init__(method=AlignmentMethod.shift, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, max_shift_da=50.0, n_segments=5, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, reference='median', pipeline=None, peak_detector=None, min_reference_peaks=5, interp_step=0.5, n_jobs=1)

Return type:

None

fit(X, y=None)

Fit by computing the reference spectrum from raw data.

Parameters:
  • X (pd.DataFrame) – Input DataFrame with sample IDs as index and a “path” column containing paths to raw spectrum files. Use create_raw_input() to easily create this DataFrame.

  • y (array-like, optional) – Target values (ignored).

Returns:

self – Fitted transformer.

Return type:

RawWarping

Raises:

ValueError – If the input DataFrame is empty, lacks a ‘path’ column, or uses an unknown warping method.
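
With reference="median", raw spectra sampled on different m/z axes have to be brought onto a common grid before a median can be taken, which is where interp_step comes in. The sketch below interpolates each spectrum onto a grid with that step and takes the per-point median. The function name is hypothetical, and restricting the grid to the m/z range shared by all spectra is an assumption.

```python
import numpy as np


def median_reference(spectra, interp_step=0.5):
    """Median spectrum on a common m/z grid.

    `spectra` is a list of (mz, intensity) array pairs with ascending m/z.
    """
    # Assumed choice: cover only the range present in every spectrum
    lo = max(mz[0] for mz, _ in spectra)
    hi = min(mz[-1] for mz, _ in spectra)
    grid = np.arange(lo, hi + interp_step / 2, interp_step)
    stacked = np.vstack([np.interp(grid, mz, inten) for mz, inten in spectra])
    return grid, np.median(stacked, axis=0)


# Three spectra over the same range but with different sampling densities
spectra = [
    (np.linspace(2000.0, 2010.0, 21), np.full(21, 1.0)),
    (np.linspace(2000.0, 2010.0, 41), np.full(41, 2.0)),
    (np.linspace(2000.0, 2010.0, 11), np.full(11, 10.0)),
]

ref_mz, ref_intensity = median_reference(spectra, interp_step=0.5)
```

The median is robust to a single outlying spectrum: the constant-10 trace above does not pull the reference away from 2.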

transform(X)

Transform spectra by loading raw data, warping, and binning.

Parameters:

X (pd.DataFrame) – Input DataFrame with sample IDs as index and a “path” column containing paths to raw spectrum files.

Returns:

X_aligned – Aligned and binned spectra with sample IDs as index and m/z bin centers as columns.

Return type:

pd.DataFrame

Raises:
  • RuntimeError – If the transformer has not been fitted.

  • ValueError – If the input DataFrame lacks a ‘path’ column.

get_alignment_quality(X, X_aligned=None)

Compute alignment quality metrics.

Parameters:
  • X (pd.DataFrame) – Input DataFrame with “path” column.

  • X_aligned (pd.DataFrame, optional) – Aligned spectra. If None, they are computed by calling transform() on X.

Returns:

Quality metrics per sample with columns:

  • correlation_before – Pearson correlation with reference (before alignment).

  • correlation_after – Pearson correlation with reference (after alignment).

  • improvement – correlation_after - correlation_before.

Return type:

pd.DataFrame

Raises:

RuntimeError – If the transformer has not been fitted.

Alignment Strategies

class maldiamrkit.alignment.AlignmentStrategy

Base class for alignment strategies.

abstractmethod align_binned(row, peaks, ref_peaks, mz_axis)

Align a binned spectrum row to the reference.

Parameters:
  • row (np.ndarray) – Intensity values of the spectrum to align.

  • peaks (np.ndarray) – Detected peak indices in row.

  • ref_peaks (np.ndarray) – Detected peak indices in the reference spectrum.

  • mz_axis (np.ndarray) – Array of bin positions (e.g. np.arange(len(row))).

Returns:

Aligned intensity array with the same length as row.

Return type:

np.ndarray

abstractmethod align_raw(mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity)

Align a raw spectrum to the reference.

Parameters:
  • mz (np.ndarray) – m/z values of the spectrum to align.

  • intensity (np.ndarray) – Intensity values of the spectrum to align.

  • peaks_mz (np.ndarray) – Detected peak m/z positions in the sample spectrum.

  • ref_peaks_mz (np.ndarray) – Detected peak m/z positions in the reference spectrum.

  • ref_mz (np.ndarray) – m/z values of the reference spectrum.

  • ref_intensity (np.ndarray) – Intensity values of the reference spectrum.

Returns:

  • aligned_mz (np.ndarray) – Aligned m/z values.

  • aligned_intensity (np.ndarray) – Aligned intensity values.

Return type:

tuple[np.ndarray, np.ndarray]
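
To show what implementing the two abstract methods involves, the sketch below defines a local stand-in base class with the documented signatures and a toy strategy that applies a mean offset. `AlignmentStrategySketch` and `MeanOffsetStrategy` are hypothetical names, the peaks are assumed to be already paired in order, and nothing here is taken from the library's built-in strategies.

```python
from abc import ABC, abstractmethod

import numpy as np


class AlignmentStrategySketch(ABC):
    """Local stand-in mirroring the documented AlignmentStrategy interface."""

    @abstractmethod
    def align_binned(self, row, peaks, ref_peaks, mz_axis):
        ...

    @abstractmethod
    def align_raw(self, mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity):
        ...


class MeanOffsetStrategy(AlignmentStrategySketch):
    """Toy strategy: shift by the mean offset between paired peaks."""

    def align_binned(self, row, peaks, ref_peaks, mz_axis):
        n = min(len(peaks), len(ref_peaks))
        if n == 0:
            return row.copy()
        shift = float(np.mean(ref_peaks[:n] - peaks[:n]))
        # The aligned value at position p is read from position p - shift
        return np.interp(mz_axis - shift, mz_axis, row, left=0.0, right=0.0)

    def align_raw(self, mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity):
        n = min(len(peaks_mz), len(ref_peaks_mz))
        if n == 0:
            return mz.copy(), intensity.copy()
        offset = float(np.mean(ref_peaks_mz[:n] - peaks_mz[:n]))
        # Raw alignment moves the m/z axis and leaves intensities untouched
        return mz + offset, intensity.copy()


strategy = MeanOffsetStrategy()

row = np.zeros(10)
row[3] = 1.0
aligned_row = strategy.align_binned(row, np.array([3]), np.array([5]), np.arange(10))

new_mz, new_intensity = strategy.align_raw(
    np.array([1000.0, 2000.0]), np.array([4.0, 7.0]),
    np.array([1000.0]), np.array([1001.5]),
    ref_mz=np.array([1001.5]), ref_intensity=np.array([4.0]),
)
```

The binned variant returns an array of the same length as row, while the raw variant returns an (m/z, intensity) pair, matching the two documented return types.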

Alignment Methods

class maldiamrkit.alignment.AlignmentMethod(value)

Bases: str, Enum

Supported alignment/warping methods.

Variables:
  • shift (str) – Rigid global shift alignment.

  • linear (str) – Linear (affine) recalibration.

  • piecewise (str) – Piecewise-linear recalibration.

  • dtw (str) – Dynamic time warping alignment.

  • quadratic (str) – Quadratic polynomial recalibration.

  • cubic (str) – Cubic polynomial recalibration.

  • lowess (str) – Non-linear LOWESS (Cleveland 1979) recalibration.

shift = 'shift'
linear = 'linear'
piecewise = 'piecewise'
dtw = 'dtw'
quadratic = 'quadratic'
cubic = 'cubic'
lowess = 'lowess'
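
Because the enum derives from both str and Enum, members compare equal to plain strings and can be looked up by value. The sketch below uses a local stand-in class with the same members to show this behavior and a hypothetical `normalize_method` helper that accepts either form. Whether the library validates the method argument this way is an assumption.

```python
from enum import Enum


class AlignmentMethodSketch(str, Enum):
    """Local stand-in with the same members as the documented enum."""

    shift = "shift"
    linear = "linear"
    piecewise = "piecewise"
    dtw = "dtw"
    quadratic = "quadratic"
    cubic = "cubic"
    lowess = "lowess"


def normalize_method(method):
    """Accept a string or an enum member and return the enum member."""
    try:
        return AlignmentMethodSketch(method)
    except ValueError:
        valid = ", ".join(m.value for m in AlignmentMethodSketch)
        raise ValueError(f"Unknown method {method!r}; expected one of: {valid}") from None


from_string = normalize_method("dtw")
from_member = normalize_method(AlignmentMethodSketch.lowess)
equals_plain_string = AlignmentMethodSketch.shift == "shift"
```

This is why method="shift" and method=AlignmentMethod.shift can be used interchangeably in the constructors above.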

Utility Functions

maldiamrkit.alignment.create_raw_input(spectra_dir, sample_ids=None, file_extension='.txt', duplicate_strategy=DuplicateStrategy.first)

Create input DataFrame for RawWarping from a directory of spectrum files.

This utility function creates a DataFrame suitable for use with RawWarping in sklearn pipelines. The DataFrame has sample IDs as index and file paths as values.

Parameters:
  • spectra_dir (str or Path) – Directory containing raw spectrum files.

  • sample_ids (list of str, optional) – List of sample IDs. If None, discovers all files matching the extension in spectra_dir and uses filenames (without extension) as sample IDs.

  • file_extension (str, default=".txt") – File extension for spectrum files.

  • duplicate_strategy (str or DuplicateStrategy, default "first") –

    How to handle duplicate sample IDs (e.g. the same sample appearing in multiple year subdirectories):

    • "first" – keep the first occurrence (default).

    • "last" – keep the last occurrence.

    • "drop" – remove all duplicates.

    • "keep_all" – keep every replicate with _repN suffixes.

    • "average" – keep all replicates and add an _original_id column so that RawWarping.transform() can group and average them.

Returns:

DataFrame with:

  • Index: sample IDs.

  • Column "path": full paths to spectrum files.

Return type:

pd.DataFrame

Raises:

ValueError – If no files with the specified extension are found in the directory.

Examples

>>> from maldiamrkit.alignment import RawWarping, create_raw_input
>>>
>>> # Discover all .txt files in directory
>>> X_raw = create_raw_input("spectra/")
>>>
>>> # Specify sample IDs explicitly
>>> X_raw = create_raw_input("spectra/", sample_ids=["s1", "s2", "s3"])
>>>
>>> # Use in pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> pipe = Pipeline([
...     ("warp", RawWarping(method="piecewise")),
...     ("scaler", StandardScaler()),
... ])
>>> X_binned = pipe.fit_transform(X_raw)
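
The effect of duplicate_strategy is easiest to see on a small directory tree. The standard-library sketch below discovers files, uses the filename stem as the sample ID, and applies the "first", "last", and "drop" options. `discover_spectra` is a hypothetical helper, and several details are assumptions rather than documented behavior: the recursive search, the sorted-path ordering that defines "first" and "last", and reading "drop" as removing every copy of a duplicated ID.

```python
import tempfile
from collections import Counter
from pathlib import Path


def discover_spectra(spectra_dir, file_extension=".txt", duplicate_strategy="first"):
    """Map sample IDs (filename stems) to file paths, resolving duplicate IDs."""
    if duplicate_strategy not in {"first", "last", "drop"}:
        raise ValueError(f"Unsupported strategy in this sketch: {duplicate_strategy!r}")
    files = sorted(Path(spectra_dir).rglob(f"*{file_extension}"))
    if not files:
        raise ValueError(f"No {file_extension} files found in {spectra_dir}")
    counts = Counter(path.stem for path in files)
    records = {}
    for path in files:
        sample_id = path.stem
        if counts[sample_id] > 1:
            if duplicate_strategy == "drop":
                continue  # assumed: discard every copy of a duplicated ID
            if duplicate_strategy == "first" and sample_id in records:
                continue  # keep the path that was seen first
        records[sample_id] = str(path)  # "last" overwrites earlier paths
    return records


with tempfile.TemporaryDirectory() as root:
    for rel in ("2020/s1.txt", "2021/s1.txt", "2021/s2.txt"):
        target = Path(root) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("2000.0 1.0\n")
    first = discover_spectra(root, duplicate_strategy="first")
    last = discover_spectra(root, duplicate_strategy="last")
    dropped = discover_spectra(root, duplicate_strategy="drop")
```

Wrapping such a mapping in a DataFrame with the IDs as index and a single "path" column would give the input shape that RawWarping expects.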

Example Usage

from maldiamrkit.alignment import Warping, RawWarping, create_raw_input
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Alignment on binned data
# (X_binned is assumed to be an existing DataFrame of binned spectra, shape (n_samples, n_bins))
warper = Warping(method="piecewise", n_jobs=-1)
X_aligned = warper.fit_transform(X_binned)

# Raw warping: create input from directory, get binned output
X_raw = create_raw_input("spectra/")  # DataFrame with file paths
raw_warper = RawWarping(method="piecewise", bin_width=3, n_jobs=-1)
X_binned = raw_warper.fit_transform(X_raw)

# Use in sklearn pipeline
pipe = Pipeline([
    ("warp", RawWarping(method="piecewise", bin_width=3)),
    ("scaler", StandardScaler()),
])
X_processed = pipe.fit_transform(X_raw)