Alignment Module
Spectral alignment and warping transformers.
Both Warping and RawWarping support parallel processing via the n_jobs parameter.
Use n_jobs=-1 to utilize all available CPU cores.
Warping (Binned Spectra)
- class maldiamrkit.alignment.Warping(peak_detector=None, reference='median', method=AlignmentMethod.shift, n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, min_reference_peaks=5, n_jobs=1)
Bases: BaseEstimator, TransformerMixin
Align MALDI-TOF spectra to a reference using different strategies.
Supports multiple alignment methods for correcting mass calibration drift in binned spectra.
- Parameters:
peak_detector (MaldiPeakDetector, optional) – Peak detector used to find peaks in spectra. If None, a default detector is created with binary=True and prominence=1e-5.
reference (str or int, default="median") – How to choose the reference spectrum:
  - "median": median spectrum across all samples
  - int: use that row index as the reference
method (str, default="shift") – Alignment method:
  - "shift": global median shift
  - "linear": least-squares linear transform
  - "piecewise": local median shifts across segments
  - "dtw": dynamic time warping
  - "quadratic": quadratic polynomial fit on matched peak pairs
  - "cubic": cubic polynomial fit on matched peak pairs
  - "lowess": LOWESS (Cleveland 1979) non-linear warping
n_segments (int, default=5) – Number of segments for piecewise warping.
max_shift (int, default=50) – Maximum allowed shift in bins (used as a fallback for the shift, linear, polynomial, and LOWESS methods when too few peaks match).
dtw_radius (int, default=10) – Radius constraint for DTW to limit warping path search space.
smooth_sigma (float, default=2.0) – Gaussian smoothing parameter for piecewise segment shifts.
lowess_frac (float, default=0.3) – LOWESS smoothing bandwidth (fraction of matched peaks used for each local fit). Applies when method="lowess".
lowess_it (int, default=3) – Number of LOWESS robustness iterations. Applies when method="lowess".
min_reference_peaks (int, default=5) – Minimum number of peaks expected in the reference for the quality check.
n_jobs (int, default=1) – Number of parallel jobs for transform. Use -1 for all available cores or 1 for sequential processing. Parallelization is particularly beneficial for the "dtw" method, which is CPU-intensive.
- Variables:
ref_spec (np.ndarray) – The fitted reference spectrum (stored after fit()).
Examples
>>> from maldiamrkit.alignment import Warping
>>> warper = Warping(method="shift")
>>> warper.fit(X_train)
>>> X_aligned = warper.transform(X_test)
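To make the "shift" method (global median shift) more concrete, the following is a minimal pure-Python sketch of the idea on a single binned spectrum. It is not the library's implementation: the nearest-peak matching rule, the helper names (match_peaks, median_shift, apply_shift), the zero-fill at the edges, and the "no match means no shift" fallback are all assumptions made for illustration.

```python
from statistics import median


def match_peaks(peaks, ref_peaks, max_shift):
    """Pair each sample peak with its nearest reference peak within max_shift bins."""
    pairs = []
    for p in peaks:
        nearest = min(ref_peaks, key=lambda r: abs(r - p))
        if abs(nearest - p) <= max_shift:
            pairs.append((p, nearest))
    return pairs


def median_shift(peaks, ref_peaks, max_shift=50):
    """Return one global shift (in bins): the median offset of matched peaks."""
    pairs = match_peaks(peaks, ref_peaks, max_shift)
    if not pairs:
        return 0  # assumption: leave the spectrum untouched when nothing matches
    return round(median(ref - p for p, ref in pairs))


def apply_shift(row, shift):
    """Move every intensity by `shift` bins, zero-filling the vacated edge."""
    n = len(row)
    out = [0.0] * n
    for i, value in enumerate(row):
        j = i + shift
        if 0 <= j < n:
            out[j] = value
    return out


row = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
peaks = [2, 6]      # peak indices detected in `row`
ref_peaks = [3, 7]  # peak indices detected in the reference
shift = median_shift(peaks, ref_peaks, max_shift=3)
aligned = apply_shift(row, shift)
```

Using the median rather than the mean of the offsets keeps a single mismatched peak pair from dominating the estimated shift.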
- __init__(peak_detector=None, reference='median', method=AlignmentMethod.shift, n_segments=5, max_shift=50, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, min_reference_peaks=5, n_jobs=1)
- fit(X, y=None)
Fit the transformer by selecting or computing the reference spectrum.
- Parameters:
X (pd.DataFrame) – Input spectra with shape (n_samples, n_bins).
y (array-like, optional) – Target values (ignored).
- Returns:
self – Fitted transformer.
- Return type:
- Raises:
ValueError – If the input DataFrame is empty, the reference index is out of bounds, the reference specifier is unsupported, the warping method is unknown, or parameters are invalid.
- transform(X)
Transform spectra by aligning them to the reference.
- Parameters:
X (pd.DataFrame) – Input spectra with shape (n_samples, n_bins).
- Returns:
X_aligned – Aligned spectra with same shape as input.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If the transformer has not been fitted.
ValueError – If the number of features in X does not match the reference spectrum length.
- get_alignment_quality(X_original, X_aligned=None)
Compute alignment quality metrics.
- Parameters:
X_original (pd.DataFrame) – Original (unaligned) spectra.
X_aligned (pd.DataFrame, optional) – Aligned spectra. If None, will compute by calling transform().
- Returns:
Quality metrics with columns:
  - correlation_before: Pearson correlation with reference (before)
  - correlation_after: Pearson correlation with reference (after)
  - improvement: correlation_after - correlation_before
  - rmse_before: RMSE with reference (before)
  - rmse_after: RMSE with reference (after)
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If the transformer has not been fitted.
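The metric columns listed for get_alignment_quality() can be illustrated with a small pure-Python sketch. The helper names (pearson, rmse, quality_row) are hypothetical and the library's actual computation may differ in detail; the sketch only shows what each documented column measures for one spectrum.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation is undefined for a flat spectrum
    return cov / sqrt(var_a * var_b)


def rmse(a, b):
    """Root-mean-square error between two equal-length intensity vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def quality_row(reference, before, after):
    """Build one row of metrics using the documented column names."""
    corr_before = pearson(before, reference)
    corr_after = pearson(after, reference)
    return {
        "correlation_before": corr_before,
        "correlation_after": corr_after,
        "improvement": corr_after - corr_before,
        "rmse_before": rmse(before, reference),
        "rmse_after": rmse(after, reference),
    }


reference = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
before = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0]  # same peak, one bin too far right
after = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]   # peak moved back onto the reference
metrics = quality_row(reference, before, after)
```

A positive improvement together with a lower rmse_after indicates that the alignment moved the spectrum closer to the reference.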
RawWarping (Full Resolution)
- class maldiamrkit.alignment.RawWarping(method=AlignmentMethod.shift, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, max_shift_da=50.0, n_segments=5, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, reference='median', pipeline=None, peak_detector=None, min_reference_peaks=5, interp_step=0.5, n_jobs=1)
Bases: BaseEstimator, TransformerMixin
Align MALDI-TOF spectra using raw (full resolution) data.
Unlike Warping (which operates on binned data), RawWarping:
- Loads original raw spectra from file paths
- Performs warping at full m/z resolution
- Outputs binned spectra for pipeline compatibility
This approach provides more accurate alignment by avoiding binning artifacts during the warping process.
- Parameters:
method (str, default="shift") – Warping method:
  - "shift": global m/z shift in Daltons
  - "linear": linear m/z transformation (mz' = a*mz + b)
  - "piecewise": segment-wise m/z shifts with smoothing
  - "dtw": dynamic time warping
  - "quadratic": quadratic polynomial fit on matched peak pairs
  - "cubic": cubic polynomial fit on matched peak pairs
  - "lowess": LOWESS (Cleveland 1979) non-linear warping
bin_width (float, default=3) – Width of output bins in Daltons.
bin_method (str, default="uniform") – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
bin_kwargs (dict, optional) – Additional keyword arguments for binning.
max_shift_da (float, default=50.0) – Maximum allowed shift in Daltons.
n_segments (int, default=5) – Number of segments for piecewise warping.
dtw_radius (int, default=10) – Radius constraint for DTW.
smooth_sigma (float, default=2.0) – Gaussian smoothing for piecewise transitions.
lowess_frac (float, default=0.3) – LOWESS smoothing bandwidth (fraction of matched peaks used for each local fit). Applies when method="lowess".
lowess_it (int, default=3) – Number of LOWESS robustness iterations. Applies when method="lowess".
reference (str or int, default="median") – Reference selection: "median" or int index.
pipeline (PreprocessingPipeline, optional) – Settings for preprocessing raw spectra.
peak_detector (MaldiPeakDetector, optional) – Peak detector used to find peaks in spectra. If None, a default detector is created with binary=True and prominence=1e-5.
min_reference_peaks (int, default=5) – Minimum peaks expected in reference.
interp_step (float, default=0.5) – Step size in Daltons for the common m/z grid used when computing a median reference spectrum.
n_jobs (int, default=1) – Number of parallel jobs for transform. Use -1 for all available cores, 1 for sequential processing.
- Variables:
ref_mz (np.ndarray) – Reference spectrum m/z values (after fit).
ref_intensity (np.ndarray) – Reference spectrum intensities (after fit).
ref_peaks_mz (np.ndarray) – Peak m/z positions in reference spectrum.
output_columns (pd.Index) – Column names for output DataFrame (m/z bin centers).
pipeline (PreprocessingPipeline) – Preprocessing configuration used.
Examples
>>> from maldiamrkit.alignment import RawWarping, create_raw_input
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # Create input DataFrame from directory
>>> X_raw = create_raw_input("spectra/")
>>>
>>> # Use in sklearn pipeline (y holds one label per sample in X_raw)
>>> pipe = Pipeline([
...     ("warp", RawWarping(method="piecewise", bin_width=3)),
...     ("scaler", StandardScaler()),
...     ("clf", RandomForestClassifier()),
... ])
>>> pipe.fit(X_raw, y)
Notes
Input DataFrame X must have:
- Index: sample IDs
- Column "path": paths to raw spectrum files
Use create_raw_input() to easily create this DataFrame from a directory.
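The interp_step parameter refers to a common m/z grid used when computing a median reference from raw spectra. The following pure-Python sketch illustrates that idea under stated assumptions: the grid covers only the m/z range shared by all spectra, values are linearly interpolated, and the helper names (interpolate, median_reference) are hypothetical. The library's actual grid construction and interpolation may differ.

```python
from statistics import median


def interpolate(mz, intensity, x):
    """Linearly interpolate a spectrum at x (mz must be strictly increasing)."""
    if x < mz[0] or x > mz[-1]:
        return 0.0  # assumption: no signal outside the measured range
    for i in range(1, len(mz)):
        if x <= mz[i]:
            t = (x - mz[i - 1]) / (mz[i] - mz[i - 1])
            return intensity[i - 1] + t * (intensity[i] - intensity[i - 1])
    return intensity[-1]


def median_reference(spectra, interp_step=0.5):
    """Resample all spectra onto one grid and take the point-wise median."""
    lo = max(mz[0] for mz, _ in spectra)   # start of the shared m/z range
    hi = min(mz[-1] for mz, _ in spectra)  # end of the shared m/z range
    n_points = int((hi - lo) / interp_step) + 1
    grid = [lo + k * interp_step for k in range(n_points)]
    ref = [
        median(interpolate(mz, inten, x) for mz, inten in spectra)
        for x in grid
    ]
    return grid, ref


spectra = [
    ([2000.0, 2001.0, 2002.0], [0.0, 10.0, 0.0]),
    ([2000.0, 2001.0, 2002.0], [0.0, 20.0, 0.0]),
    ([2000.0, 2001.0, 2002.0], [0.0, 30.0, 0.0]),
]
grid, ref = median_reference(spectra, interp_step=0.5)
```

A smaller interp_step gives a finer reference at the cost of more interpolation work per spectrum.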
- __init__(method=AlignmentMethod.shift, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, max_shift_da=50.0, n_segments=5, dtw_radius=10, smooth_sigma=2.0, lowess_frac=0.3, lowess_it=3, reference='median', pipeline=None, peak_detector=None, min_reference_peaks=5, interp_step=0.5, n_jobs=1)
- Parameters:
method (str | AlignmentMethod)
bin_width (float)
bin_method (str | BinningMethod)
max_shift_da (float)
n_segments (int)
dtw_radius (int)
smooth_sigma (float)
lowess_frac (float)
lowess_it (int)
pipeline (PreprocessingPipeline | None)
peak_detector (MaldiPeakDetector | None)
min_reference_peaks (int)
interp_step (float)
n_jobs (int)
- Return type:
None
- fit(X, y=None)
Fit by computing the reference spectrum from raw data.
- Parameters:
X (pd.DataFrame) – Input DataFrame with sample IDs as index and a “path” column containing paths to raw spectrum files. Use create_raw_input() to easily create this DataFrame.
y (array-like, optional) – Target values (ignored).
- Returns:
self – Fitted transformer.
- Return type:
- Raises:
ValueError – If the input DataFrame is empty, lacks a ‘path’ column, or uses an unknown warping method.
- transform(X)
Transform spectra by loading raw data, warping, and binning.
- Parameters:
X (pd.DataFrame) – Input DataFrame with sample IDs as index and a “path” column containing paths to raw spectrum files.
- Returns:
X_aligned – Aligned and binned spectra with sample IDs as index and m/z bin centers as columns.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If the transformer has not been fitted.
ValueError – If the input DataFrame lacks a ‘path’ column.
- get_alignment_quality(X, X_aligned=None)
Compute alignment quality metrics.
- Parameters:
X (pd.DataFrame) – Input DataFrame with “path” column.
X_aligned (pd.DataFrame, optional) – Aligned spectra. If None, will compute via transform.
- Returns:
Quality metrics per sample with columns:
  - correlation_before: Pearson correlation with reference (before)
  - correlation_after: Pearson correlation with reference (after)
  - improvement: correlation_after - correlation_before
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If the transformer has not been fitted.
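The "linear" option of RawWarping is described above as a linear m/z transformation, mz' = a*mz + b. As a rough illustration, the coefficients can be estimated by ordinary least squares over matched peak pairs. The sketch below assumes the peaks are already matched one-to-one; the function name fit_linear and the identity fallback for degenerate input are hypothetical and not taken from the library.

```python
def fit_linear(peaks_mz, ref_peaks_mz):
    """Least-squares fit of ref ~ a * mz + b over matched peak pairs."""
    n = len(peaks_mz)
    if n < 2:
        return 1.0, 0.0  # assumption: identity transform when underdetermined
    mean_x = sum(peaks_mz) / n
    mean_y = sum(ref_peaks_mz) / n
    sxx = sum((x - mean_x) ** 2 for x in peaks_mz)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(peaks_mz, ref_peaks_mz)
    )
    if sxx == 0:
        return 1.0, 0.0  # all sample peaks coincide; slope is undefined
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Sample peaks that read slightly low, proportionally to m/z
peaks_mz = [3000.0, 6000.0, 9000.0]
ref_peaks_mz = [3001.5, 6003.0, 9004.5]

a, b = fit_linear(peaks_mz, ref_peaks_mz)
aligned_mz = [a * mz + b for mz in peaks_mz]  # apply mz' = a*mz + b
```

In this toy case the drift is purely multiplicative, so the fit recovers a slope slightly above 1 and an intercept close to 0.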
Alignment Strategies
- class maldiamrkit.alignment.AlignmentStrategy
Base class for alignment strategies.
- abstractmethod align_binned(row, peaks, ref_peaks, mz_axis)
Align a binned spectrum row to the reference.
- Parameters:
row (np.ndarray) – Intensity values of the spectrum to align.
peaks (np.ndarray) – Detected peak indices in row.
ref_peaks (np.ndarray) – Detected peak indices in the reference spectrum.
mz_axis (np.ndarray) – Array of bin positions (e.g. np.arange(len(row))).
- Returns:
Aligned intensity array with the same length as row.
- Return type:
np.ndarray
- abstractmethod align_raw(mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity)
Align a raw spectrum to the reference.
- Parameters:
mz (np.ndarray) – m/z values of the spectrum to align.
intensity (np.ndarray) – Intensity values of the spectrum to align.
peaks_mz (np.ndarray) – Detected peak m/z positions in the sample spectrum.
ref_peaks_mz (np.ndarray) – Detected peak m/z positions in the reference spectrum.
ref_mz (np.ndarray) – m/z values of the reference spectrum.
ref_intensity (np.ndarray) – Intensity values of the reference spectrum.
- Return type:
- Returns:
aligned_mz (np.ndarray) – Aligned m/z values.
aligned_intensity (np.ndarray) – Aligned intensity values.
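The two abstract methods define the contract a strategy has to fulfil: align_binned returns an intensity array of the same length as row, and align_raw returns aligned m/z and intensity values. The sketch below shows that contract with a stand-in base class so that it runs without the library installed. The stand-in only mirrors the documented signatures, FixedOffsetStrategy is a hypothetical example, and whether Warping or RawWarping accept user-defined strategies is not stated here.

```python
from abc import ABC, abstractmethod


class AlignmentStrategy(ABC):
    """Stand-in mirroring the documented abstract methods (not the real class)."""

    @abstractmethod
    def align_binned(self, row, peaks, ref_peaks, mz_axis): ...

    @abstractmethod
    def align_raw(self, mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity): ...


class FixedOffsetStrategy(AlignmentStrategy):
    """Hypothetical strategy that applies a constant, pre-set offset."""

    def __init__(self, offset_bins=0, offset_da=0.0):
        self.offset_bins = offset_bins  # used for binned spectra
        self.offset_da = offset_da      # used for raw spectra (Daltons)

    def align_binned(self, row, peaks, ref_peaks, mz_axis):
        # Same length as `row`; bins shifted past either edge are dropped.
        n = len(row)
        out = [0.0] * n
        for i, value in enumerate(row):
            j = i + self.offset_bins
            if 0 <= j < n:
                out[j] = value
        return out

    def align_raw(self, mz, intensity, peaks_mz, ref_peaks_mz, ref_mz, ref_intensity):
        # Only the m/z axis moves; intensities are returned unchanged.
        return [m + self.offset_da for m in mz], list(intensity)


strategy = FixedOffsetStrategy(offset_bins=1, offset_da=0.5)
aligned_row = strategy.align_binned(
    [0.0, 2.0, 0.0], peaks=[1], ref_peaks=[2], mz_axis=[0, 1, 2]
)
aligned_mz, aligned_intensity = strategy.align_raw(
    mz=[2000.0, 2001.0], intensity=[1.0, 3.0],
    peaks_mz=[2001.0], ref_peaks_mz=[2001.5],
    ref_mz=[2000.0, 2001.0], ref_intensity=[1.0, 3.0],
)
```

Because both methods are abstract, the base class itself cannot be instantiated; a subclass has to implement the binned and the raw variant.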
Alignment Methods
- class maldiamrkit.alignment.AlignmentMethod(value)
Supported alignment/warping methods.
- Variables:
shift (str) – Rigid global shift alignment.
linear (str) – Linear (affine) recalibration.
piecewise (str) – Piecewise-linear recalibration.
dtw (str) – Dynamic time warping alignment.
quadratic (str) – Quadratic polynomial recalibration.
cubic (str) – Cubic polynomial recalibration.
lowess (str) – Non-linear LOWESS (Cleveland 1979) recalibration.
- shift = 'shift'
- linear = 'linear'
- piecewise = 'piecewise'
- dtw = 'dtw'
- quadratic = 'quadratic'
- cubic = 'cubic'
- lowess = 'lowess'
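Since the method parameters accept either a string or an AlignmentMethod member, the lookup can be pictured as a string-valued enum. The following sketch uses a stand-in enum with the documented members; that the real class derives from both str and Enum is an assumption, and resolve_method is a hypothetical helper, not part of the library.

```python
from enum import Enum


class AlignmentMethod(str, Enum):
    """Stand-in with the documented members (base classes are an assumption)."""

    shift = "shift"
    linear = "linear"
    piecewise = "piecewise"
    dtw = "dtw"
    quadratic = "quadratic"
    cubic = "cubic"
    lowess = "lowess"


def resolve_method(method):
    """Accept a member or its string value; raise ValueError otherwise."""
    try:
        return AlignmentMethod(method)
    except ValueError:
        valid = ", ".join(m.value for m in AlignmentMethod)
        raise ValueError(
            f"Unknown warping method {method!r}; expected one of: {valid}"
        ) from None


from_string = resolve_method("dtw")
from_member = resolve_method(AlignmentMethod.lowess)
```

With a str mixin, members compare equal to their plain string values, which is one way method="shift" and method=AlignmentMethod.shift could behave identically.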
Utility Functions
- maldiamrkit.alignment.create_raw_input(spectra_dir, sample_ids=None, file_extension='.txt', duplicate_strategy=DuplicateStrategy.first)
Create input DataFrame for RawWarping from a directory of spectrum files.
This utility function creates a DataFrame suitable for use with RawWarping in sklearn pipelines. The DataFrame has sample IDs as index and file paths as values.
- Parameters:
spectra_dir (str or Path) – Directory containing raw spectrum files.
sample_ids (list of str, optional) – List of sample IDs. If None, discovers all files matching the extension in spectra_dir and uses filenames (without extension) as sample IDs.
file_extension (str, default=".txt") – File extension for spectrum files.
duplicate_strategy (str or DuplicateStrategy, default "first") – How to handle duplicate sample IDs (e.g. the same sample appearing in multiple year subdirectories):
  - "first" – keep the first occurrence (default).
  - "last" – keep the last occurrence.
  - "drop" – remove all duplicates.
  - "keep_all" – keep every replicate with _repN suffixes.
  - "average" – keep all replicates and add an _original_id column so that RawWarping.transform() can group and average them.
- Returns:
DataFrame with:
  - Index: sample IDs
  - Column "path": full paths to spectrum files
- Return type:
pd.DataFrame
- Raises:
ValueError – If no files with the specified extension are found in the directory.
Examples
>>> from maldiamrkit.alignment import RawWarping, create_raw_input
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> # Discover all .txt files in directory
>>> X_raw = create_raw_input("spectra/")
>>>
>>> # Specify sample IDs explicitly
>>> X_raw = create_raw_input("spectra/", sample_ids=["s1", "s2", "s3"])
>>>
>>> # Use in pipeline
>>> pipe = Pipeline([
...     ("warp", RawWarping(method="piecewise")),
...     ("scaler", StandardScaler()),
... ])
>>> X_binned = pipe.fit_transform(X_raw)
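The file discovery and the simpler duplicate strategies ("first", "last", "drop") can be pictured with the standard library alone. The sketch below is an approximation under stated assumptions: it searches subdirectories recursively, treats "first" and "last" as first and last in sorted path order, and returns a plain dict instead of a DataFrame. The function name discover_paths is hypothetical, and the "keep_all" and "average" strategies are not covered.

```python
from pathlib import Path
import tempfile


def discover_paths(spectra_dir, file_extension=".txt", duplicate_strategy="first"):
    """Map sample IDs (file stems) to paths, resolving duplicate IDs."""
    files = sorted(Path(spectra_dir).rglob(f"*{file_extension}"))
    if not files:
        raise ValueError(f"No {file_extension} files found in {spectra_dir}")

    grouped = {}
    for path in files:
        grouped.setdefault(path.stem, []).append(path)

    result = {}
    for sample_id, paths in grouped.items():
        if len(paths) == 1 or duplicate_strategy == "first":
            result[sample_id] = paths[0]
        elif duplicate_strategy == "last":
            result[sample_id] = paths[-1]
        elif duplicate_strategy == "drop":
            continue  # discard every copy of a duplicated ID
        else:
            raise ValueError(f"Unsupported duplicate_strategy: {duplicate_strategy!r}")
    return result


# Demo on a throwaway directory: sample "s1" appears in two year folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("2019/s1.txt", "2020/s1.txt", "2020/s2.txt"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("2000 1.0\n")

    first = discover_paths(root, duplicate_strategy="first")
    last = discover_paths(root, duplicate_strategy="last")
    dropped = discover_paths(root, duplicate_strategy="drop")
    first_year = first["s1"].parent.name
    last_year = last["s1"].parent.name
```

Sorting the discovered paths makes the outcome of "first" and "last" deterministic across file systems, which is a design choice of this sketch rather than documented behaviour.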
Example Usage
from maldiamrkit.alignment import Warping, RawWarping, create_raw_input
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Alignment on binned data
# (X_binned: pd.DataFrame of binned spectra, shape (n_samples, n_bins), loaded beforehand)
warper = Warping(method="piecewise", n_jobs=-1)
X_aligned = warper.fit_transform(X_binned)

# Raw warping: create input from directory, get binned output
X_raw = create_raw_input("spectra/")  # DataFrame with file paths
raw_warper = RawWarping(method="piecewise", bin_width=3, n_jobs=-1)
X_raw_aligned = raw_warper.fit_transform(X_raw)

# Use in sklearn pipeline
pipe = Pipeline([
    ("warp", RawWarping(method="piecewise", bin_width=3)),
    ("scaler", StandardScaler()),
])
X_processed = pipe.fit_transform(X_raw)