Preprocessing Module#
Functions for preprocessing MALDI-TOF spectra.
PreprocessingPipeline#
- class maldiamrkit.preprocessing.PreprocessingPipeline(steps)[source]#
Bases: object
Composable pipeline of preprocessing steps for MALDI-TOF spectra.
- Parameters:
steps (list of (str, transformer) tuples) – Named preprocessing steps. Each transformer must be callable, accepting and returning a pd.DataFrame with mass and intensity columns.
Examples
>>> pipe = PreprocessingPipeline.default()
>>> preprocessed = pipe(raw_spectrum_df)
- __call__(df)[source]#
Apply all preprocessing steps sequentially.
- Parameters:
df (pd.DataFrame) – Raw spectrum with mass and intensity columns.
- Returns:
Preprocessed spectrum.
- Return type:
pd.DataFrame
- classmethod default()[source]#
Return the standard preprocessing pipeline.
Steps: clip negatives -> sqrt transform -> Savitzky-Golay smoothing -> SNIP baseline -> m/z trim (2000-20000 Da) -> TIC normalization.
- Returns:
Default pipeline instance.
- Return type:
- to_dict()[source]#
Serialize the pipeline to a dictionary.
- Returns:
Dictionary representation suitable for JSON/YAML serialization.
- Return type:
- classmethod from_dict(d)[source]#
Reconstruct a pipeline from a dictionary.
- Parameters:
- Returns:
Reconstructed pipeline.
- Return type:
- classmethod from_json(path)[source]#
Load a pipeline from a JSON file.
- Parameters:
path (str or Path) – Input file path.
- Returns:
Reconstructed pipeline.
- Return type:
- to_yaml(path)[source]#
Save the pipeline configuration to a YAML file.
Requires pyyaml to be installed.
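The composition idea behind the pipeline can be sketched without the library. The example below is a minimal illustration, not MaldiAMRKit's implementation: `MiniPipeline`, `clip_negatives`, and `sqrt_transform` are hypothetical names, and a plain dict of lists stands in for the pd.DataFrame the real transformers accept and return.

```python
import math

# Hypothetical stand-in for a spectrum: a dict with "mass" and "intensity"
# lists. The real library operates on pandas DataFrames instead.

def clip_negatives(spec):
    """Replace negative intensities with zero."""
    return {"mass": list(spec["mass"]),
            "intensity": [max(0.0, y) for y in spec["intensity"]]}

def sqrt_transform(spec):
    """Square-root transform (expects non-negative intensities)."""
    return {"mass": list(spec["mass"]),
            "intensity": [math.sqrt(y) for y in spec["intensity"]]}

class MiniPipeline:
    """Illustrative pipeline: applies named callables in order."""

    def __init__(self, steps):
        self.steps = list(steps)  # list of (name, callable) tuples

    def __call__(self, spec):
        for _name, step in self.steps:
            spec = step(spec)
        return spec

pipe = MiniPipeline([("clip", clip_negatives), ("sqrt", sqrt_transform)])
raw = {"mass": [2000.0, 2001.0, 2002.0], "intensity": [-4.0, 9.0, 16.0]}
out = pipe(raw)
print(out["intensity"])  # [0.0, 3.0, 4.0]
```

Because every step has the same input and output shape, steps can be reordered or swapped without touching the pipeline class.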
The preprocess() function is a convenience wrapper around the pipeline:
- maldiamrkit.preprocessing.preprocess(df, pipeline=None)[source]#
Apply preprocessing pipeline to a raw MALDI-TOF spectrum.
By default applies: clip negatives -> sqrt transform -> Savitzky-Golay smoothing -> SNIP baseline -> m/z trim (2000-20000 Da) -> TIC normalization.
- Parameters:
df (pd.DataFrame) – Raw spectrum with columns ‘mass’ and ‘intensity’.
pipeline (PreprocessingPipeline, optional) – Custom pipeline. If None, uses
PreprocessingPipeline.default().
- Returns:
Preprocessed spectrum with columns ‘mass’ and ‘intensity’.
- Return type:
pd.DataFrame
See also
PreprocessingPipeline – Composable preprocessing pipeline class.
bin_spectrum – Bin preprocessed spectrum into m/z bins.
Examples
>>> from maldiamrkit.preprocessing import preprocess, PreprocessingPipeline
>>> preprocessed = preprocess(raw_df)
>>> preprocessed = preprocess(raw_df, PreprocessingPipeline.default())
Individual Transformers#
Each transformer is a callable operating on a DataFrame with mass and
intensity columns. They can be composed via PreprocessingPipeline.
- class maldiamrkit.preprocessing.ClipNegatives[source]#
Clip negative intensity values to zero.
- class maldiamrkit.preprocessing.SqrtTransform[source]#
Variance-stabilizing square root transformation.
- class maldiamrkit.preprocessing.LogTransform[source]#
Log1p intensity transformation (alternative to sqrt).
- class maldiamrkit.preprocessing.SavitzkyGolaySmooth(window_length=21, polyorder=2)[source]#
Savitzky-Golay smoothing filter.
- Parameters:
- __call__(df)[source]#
Apply Savitzky-Golay smoothing.
- Raises:
ValueError – If window_length exceeds the data length, or if window_length is not greater than polyorder.
- Parameters:
df (DataFrame)
- Return type:
- class maldiamrkit.preprocessing.MovingAverageSmooth(window_length=5)[source]#
Moving-average smoothing filter.
Applies a uniform (boxcar) moving average of length window_length using reflective boundary handling.
- Parameters:
window_length (int, default=5) – Length of the smoothing window. Must be an odd integer greater than or equal to 3.
- Raises:
ValueError – If window_length is not an odd integer >= 3, or if it exceeds the data length.
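A boxcar average with reflected edges can be sketched in pure Python as follows. This is an illustration under assumptions, not the library's code: `moving_average` is a hypothetical helper, and the exact reflection convention (here the edge sample is repeated) is assumed rather than taken from the source.

```python
def moving_average(values, window_length=5):
    """Boxcar smoothing with reflected edges (pure-Python illustration)."""
    if window_length < 3 or window_length % 2 == 0:
        raise ValueError("window_length must be an odd integer >= 3")
    n = len(values)
    if window_length > n:
        raise ValueError("window_length exceeds the data length")
    half = window_length // 2
    # Assumed padding convention: reflect about the array edge, repeating the
    # edge sample (d c b a | a b c d | d c b a).
    left = [values[i] for i in range(half - 1, -1, -1)]
    right = [values[n - 1 - i] for i in range(half)]
    padded = left + list(values) + right
    # Each output sample is the mean of window_length padded samples.
    return [sum(padded[i:i + window_length]) / window_length for i in range(n)]

smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window_length=3)
print(smoothed[1])  # 2.0
```

Padding by reflection keeps the output the same length as the input and avoids the artificial dips that zero padding would create at both ends.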
- class maldiamrkit.preprocessing.SNIPBaseline(half_window=40)[source]#
SNIP (Statistics-sensitive Non-linear Iterative Peak-clipping) baseline correction.
- Parameters:
half_window (int, default=40) – Half-window size for the SNIP algorithm.
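The core SNIP clipping step can be sketched as below. This is a simplified textbook variant, assumed for illustration: `snip_baseline` is a hypothetical helper, the window grows from 1 to half_window, and no intensity transform is applied beforehand. The library's actual iteration order and any pre-transform may differ.

```python
def snip_baseline(intensity, half_window=40):
    """Estimate a baseline by iterative peak clipping (simplified SNIP)."""
    baseline = list(intensity)
    n = len(baseline)
    for p in range(1, half_window + 1):
        previous = baseline[:]  # compare against the previous iteration only
        for i in range(p, n - p):
            # Mean of the two samples p steps to either side.
            mean_neighbors = 0.5 * (previous[i - p] + previous[i + p])
            # Clip the point down when it sticks out above its neighbors.
            if mean_neighbors < previous[i]:
                baseline[i] = mean_neighbors
    return baseline

signal = [1.0] * 5 + [10.0] + [1.0] * 5
baseline = snip_baseline(signal, half_window=2)
corrected = [s - b for s, b in zip(signal, baseline)]
print(corrected[5])  # 9.0
```

Narrow peaks are clipped away within a few iterations while broad, slowly varying background survives, which is why half_window should be at least as wide as the widest peak of interest.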
- class maldiamrkit.preprocessing.TopHatBaseline(half_window=100)[source]#
Morphological top-hat baseline subtraction.
Estimates the baseline by morphological grey-level opening of the intensity trace (erosion followed by dilation), then subtracts it from the spectrum and clips negative values to zero.
- Parameters:
half_window (int, default=100) – Half-width of the structuring element in bins. The full element size is 2 * half_window + 1. Must be a positive integer.
- Raises:
ValueError – If half_window is not a positive integer or exceeds the data length.
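Morphological opening is a rolling minimum followed by a rolling maximum, which the sketch below spells out. It is a pure-Python illustration, not the library's code: `tophat_subtract` and `_rolling` are hypothetical helpers, and truncating the window at the array edges is an assumed boundary rule.

```python
def _rolling(values, half_window, func):
    """Apply func over a centered window, truncated at the array edges."""
    n = len(values)
    return [func(values[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]

def tophat_subtract(intensity, half_window=100):
    """Subtract a baseline obtained by grey-level opening."""
    if not isinstance(half_window, int) or half_window < 1:
        raise ValueError("half_window must be a positive integer")
    if half_window > len(intensity):
        raise ValueError("half_window exceeds the data length")
    eroded = _rolling(intensity, half_window, min)  # erosion
    opened = _rolling(eroded, half_window, max)     # dilation of the erosion
    # Clip negative residuals to zero after subtraction.
    return [max(0.0, y - b) for y, b in zip(intensity, opened)]

corrected = tophat_subtract([1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0], half_window=1)
print(corrected)  # [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
```

Any feature narrower than the structuring element is removed by the erosion and therefore survives the subtraction as signal.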
- class maldiamrkit.preprocessing.ConvexHullBaseline[source]#
Parameter-free baseline from the lower convex hull of the spectrum.
Computes the convex hull of the (mass, intensity) points, extracts the lower hull (vertices traversed in ascending mass with minimum intensity), linearly interpolates it onto the full m/z axis, and subtracts the resulting baseline from the spectrum. Negative residuals are clipped to zero.
Notes
Requires at least three distinct points to form a hull. For shorter inputs the baseline is taken as the per-point minimum of the first and last intensities (degenerate “flat” hull).
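The lower-hull baseline can be sketched with a monotone-chain scan. This is an illustrative implementation, not the library's: `lower_hull` and `convex_hull_subtract` are hypothetical names, and the mass values are assumed to be strictly ascending.

```python
def lower_hull(points):
    """Lower convex hull of (x, y) points sorted by ascending x."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross > 0:  # counter-clockwise turn: hull[-1] stays
                break
            hull.pop()     # hull[-1] lies on or above the chord: drop it
        hull.append(p)
    return hull

def convex_hull_subtract(mass, intensity):
    """Subtract the interpolated lower hull and clip negatives to zero."""
    if len(mass) < 3:
        # Degenerate "flat" hull described in the notes above.
        flat = min(intensity[0], intensity[-1]) if intensity else 0.0
        return [max(0.0, y - flat) for y in intensity]
    hull = lower_hull(list(zip(mass, intensity)))
    corrected, j = [], 0
    for x, y in zip(mass, intensity):
        # Advance to the hull segment that contains x.
        while j < len(hull) - 2 and x > hull[j + 1][0]:
            j += 1
        (x1, y1), (x2, y2) = hull[j], hull[j + 1]
        baseline = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
        corrected.append(max(0.0, y - baseline))
    return corrected

mass = [0.0, 1.0, 2.0, 3.0, 4.0]
intensity = [1.0, 3.0, 1.5, 4.0, 2.0]
print(convex_hull_subtract(mass, intensity))  # [0.0, 1.75, 0.0, 2.25, 0.0]
```

In this toy input the hull reduces to the straight line from (0, 1) to (4, 2), so the baseline rises by 0.25 per unit of mass.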
- class maldiamrkit.preprocessing.MedianBaseline(half_window=100, iterations=1)[source]#
Rolling-median baseline subtraction.
Estimates the baseline via a rolling median filter applied iterations times, then subtracts it from the spectrum and clips negative values to zero.
- Parameters:
half_window (int, default=100) – Half-width of the median filter in bins. The full window size is 2 * half_window + 1. Must be a positive integer.
iterations (int, default=1) – Number of times the median filter is applied. Must be a positive integer. Additional iterations further flatten broad features at the cost of compute time.
- Raises:
ValueError – If half_window or iterations is not a positive integer, or if the filter window exceeds the data length.
- class maldiamrkit.preprocessing.MzTrimmer(mz_min=2000, mz_max=20000)[source]#
Trim spectrum to a specified m/z range.
- Parameters:
- Raises:
ValueError – If mz_min is greater than or equal to mz_max.
- class maldiamrkit.preprocessing.TICNormalizer[source]#
Total Ion Current normalization (intensities sum to 1).
- class maldiamrkit.preprocessing.MedianNormalizer[source]#
Normalize intensities by median value.
- class maldiamrkit.preprocessing.PQNNormalizer(reference=None)[source]#
Probabilistic Quotient Normalization.
First normalizes by TIC, then divides by the median of the quotient spectrum (sample / reference). The reference is typically the median spectrum across the dataset; it is not computed automatically, so without one only the TIC step is applied.
- Parameters:
reference (np.ndarray, list, or None, default=None) – Reference intensity vector. If None, uses TIC normalization only (the full PQN requires a reference from the dataset). Lists are converted to arrays internally.
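The two PQN stages can be sketched in pure Python. This is an illustration, not the library's code: `tic_normalize` and `pqn_normalize` are hypothetical helpers, the reference is assumed to be TIC-normalized already, and skipping zero-valued reference entries is an assumed detail.

```python
from statistics import median

def tic_normalize(intensity):
    """Scale intensities so they sum to 1 (left unchanged if the sum is 0)."""
    total = sum(intensity)
    if total <= 0:
        return list(intensity)
    return [y / total for y in intensity]

def pqn_normalize(intensity, reference=None):
    """TIC-normalize, then divide by the median sample/reference quotient."""
    sample = tic_normalize(intensity)
    if reference is None:
        return sample  # TIC only, as described for reference=None
    # Assumption: bins where the reference is zero are skipped.
    quotients = [s / r for s, r in zip(sample, reference) if r > 0]
    if not quotients:
        return sample
    factor = median(quotients)
    if factor <= 0:
        return sample
    return [s / factor for s in sample]

reference = [0.1, 0.2, 0.3, 0.4]
normalized = pqn_normalize([2.0, 4.0, 6.0, 28.0], reference)
print([round(v, 3) for v in normalized])  # [0.1, 0.2, 0.3, 1.4]
```

In the example one inflated peak halves the other bins under TIC alone, and the median quotient (0.5) restores them to the reference scale while the inflated peak stays elevated.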
- class maldiamrkit.preprocessing.MzMultiTrimmer(mz_ranges)[source]#
Keep only specific m/z ranges from the spectrum.
- Parameters:
mz_ranges (list of tuple[float, float]) – List of (mz_min, mz_max) ranges to keep.
- Raises:
ValueError – If mz_ranges is empty.
Pipeline Serialization#
Save and load pipeline configurations for reproducibility:
from maldiamrkit.preprocessing import PreprocessingPipeline
pipe = PreprocessingPipeline.default()
# Save to JSON
pipe.to_json("pipeline.json")
pipe = PreprocessingPipeline.from_json("pipeline.json")
# Save to YAML (requires pyyaml)
pipe.to_yaml("pipeline.yaml")
pipe = PreprocessingPipeline.from_yaml("pipeline.yaml")
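The dictionary layout the library writes is not documented here, so the following is only a sketch of how a step list could round-trip through JSON. Everything in it is hypothetical: the `Clip` and `Scale` steps, the `REGISTRY` lookup, and the `name`/`type`/`params` keys are assumptions made for illustration.

```python
import json

class Clip:
    """Hypothetical step: clip values below a lower bound."""
    def __init__(self, lower=0.0):
        self.lower = lower
    def __call__(self, values):
        return [max(self.lower, v) for v in values]

class Scale:
    """Hypothetical step: multiply values by a constant factor."""
    def __init__(self, factor=1.0):
        self.factor = factor
    def __call__(self, values):
        return [self.factor * v for v in values]

# Maps a serialized type name back to its class (assumed mechanism).
REGISTRY = {cls.__name__: cls for cls in (Clip, Scale)}

def steps_to_dict(steps):
    """Describe each step by name, class name, and constructor parameters."""
    return {"steps": [{"name": name,
                       "type": type(step).__name__,
                       "params": dict(vars(step))}
                      for name, step in steps]}

def steps_from_dict(config):
    """Rebuild the (name, step) list from the dictionary form."""
    return [(s["name"], REGISTRY[s["type"]](**s["params"]))
            for s in config["steps"]]

steps = [("clip", Clip(lower=0.0)), ("scale", Scale(factor=2.0))]
text = json.dumps(steps_to_dict(steps))
restored = steps_from_dict(json.loads(text))

values = [-1.0, 3.0]
for _name, step in restored:
    values = step(values)
print(values)  # [0.0, 6.0]
```

Storing only class names and parameters keeps the file human-readable and lets a saved configuration be diffed or version-controlled alongside results.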
Binning#
- maldiamrkit.preprocessing.bin_spectrum(df, mz_min=2000, mz_max=20000, bin_width=3, method=BinningMethod.uniform, custom_edges=None, adaptive_min_width=1.0, adaptive_max_width=10.0, adaptive_peak_prominence=None, adaptive_kde_bandwidth=None)[source]#
Bin spectrum intensities into m/z intervals.
Supports multiple binning strategies: uniform (fixed width), proportional (width scales linearly with m/z), adaptive (smaller bins in peak-dense regions), and custom (user-defined edges).
- Parameters:
df (pd.DataFrame) – Preprocessed spectrum with columns ‘mass’ and ‘intensity’.
mz_min (int, default=2000) – Lower m/z bound in Daltons.
mz_max (int, default=20000) – Upper m/z bound in Daltons.
bin_width (int or float, default=3) – Width of each bin in Daltons. For ‘uniform’, this is the fixed width. For ‘proportional’, this is the reference width at mz_min. Ignored for ‘adaptive’ and ‘custom’ methods.
method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
custom_edges (array-like, optional) – User-provided bin edges. Required if method=’custom’.
adaptive_min_width (float, default=1.0) – Minimum bin width in Daltons for adaptive binning.
adaptive_max_width (float, default=10.0) – Maximum bin width in Daltons for adaptive binning.
adaptive_peak_prominence (float or None, default=None) – Minimum prominence for peak detection in adaptive binning. If None, uses a MAD-based estimate (robust to outliers).
adaptive_kde_bandwidth (float or None, default=None) – Bandwidth for the Gaussian KDE in adaptive binning. If None, uses Silverman’s rule of thumb.
- Returns:
Tuple of (binned_spectrum, bin_metadata). binned_spectrum has columns ‘mass’ (bin start) and ‘intensity’. bin_metadata has columns ‘bin_index’, ‘bin_start’, ‘bin_end’, ‘bin_width’.
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
- Raises:
ValueError – If method is invalid, custom_edges is missing for ‘custom’ method, or bin_width < 1.
Examples
>>> from maldiamrkit.preprocessing import bin_spectrum
>>>
>>> # Uniform binning (default)
>>> binned, metadata = bin_spectrum(df, bin_width=3)
>>>
>>> # Proportional binning (width grows with m/z)
>>> binned, metadata = bin_spectrum(df, bin_width=3, method='proportional')
>>>
>>> # Adaptive binning
>>> binned, metadata = bin_spectrum(df, method='adaptive')
>>>
>>> # Custom binning
>>> edges = [2000, 5000, 10000, 15000, 20000]
>>> binned, metadata = bin_spectrum(df, method='custom', custom_edges=edges)
- maldiamrkit.preprocessing.get_bin_metadata(edges)[source]#
Generate bin metadata from edges.
- Parameters:
edges (np.ndarray) – Array of bin edges.
- Returns:
DataFrame with columns: bin_index, bin_start, bin_end, bin_width.
- Return type:
pd.DataFrame
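The relationship between edges, metadata, and binned intensities can be sketched in pure Python. This is not the library's implementation: `uniform_edges`, `bin_metadata`, and `bin_intensities` are hypothetical helpers, summing the intensities within a bin is an assumed aggregation, and half-open bins `[start, end)` are an assumed convention.

```python
import bisect

def uniform_edges(mz_min, mz_max, bin_width):
    """Fixed-width edges from mz_min up to mz_max (last bin may be narrower)."""
    if bin_width < 1:
        raise ValueError("bin_width must be >= 1")
    edges, k = [], 0
    while mz_min + k * bin_width < mz_max:
        edges.append(float(mz_min + k * bin_width))
        k += 1
    edges.append(float(mz_max))
    return edges

def bin_metadata(edges):
    """One record per bin: index, start, end, and width."""
    return [{"bin_index": i, "bin_start": lo, "bin_end": hi,
             "bin_width": hi - lo}
            for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:]))]

def bin_intensities(mass, intensity, edges):
    """Sum intensities per half-open bin [start, end); assumed aggregation."""
    sums = [0.0] * (len(edges) - 1)
    for m, y in zip(mass, intensity):
        if edges[0] <= m < edges[-1]:
            # Index of the last edge that is <= m.
            sums[bisect.bisect_right(edges, m) - 1] += y
    return sums

edges = uniform_edges(2000, 2009, 3)
metadata = bin_metadata(edges)
sums = bin_intensities([2000.5, 2002.9, 2003.0, 2008.0],
                       [1.0, 2.0, 3.0, 4.0], edges)
print(metadata[0])
# {'bin_index': 0, 'bin_start': 2000.0, 'bin_end': 2003.0, 'bin_width': 3.0}
print(sums)  # [3.0, 3.0, 4.0]
```

The same `bin_metadata` and `bin_intensities` helpers work for any monotonically increasing edge list, which is what makes a custom-edges mode straightforward to support.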
- class maldiamrkit.preprocessing.BinningMethod(value)[source]#
Supported binning methods.
- Variables:
- uniform = 'uniform'#
- proportional = 'proportional'#
- adaptive = 'adaptive'#
- custom = 'custom'#
Binning Methods#
MaldiAMRKit supports multiple binning strategies:
Uniform (default): Fixed-width bins across the m/z range.
spec.bin(bin_width=3) # 3 Da bins
Proportional: Bin width scales with m/z, matching instrument resolution.
spec.bin(bin_width=3, method="proportional")
Adaptive: Smaller bins in peak-dense regions, larger bins elsewhere.
spec.bin(method="adaptive", adaptive_min_width=1.0, adaptive_max_width=10.0)
Custom: User-defined bin edges for domain-specific analysis.
spec.bin(method="custom", custom_edges=[2000, 5000, 10000, 15000, 20000])
Bin metadata is available via the bin_metadata attribute:
print(spec.bin_metadata.head())
# bin_index bin_start bin_end bin_width
# 0 0 2000.0 2003.0 3.0
Quality Metrics#
- maldiamrkit.preprocessing.estimate_snr(spectrum, noise_region=(19500, 20000), signal_method=SignalMethod.max, n_top_peaks=10)[source]#
Estimate signal-to-noise ratio of a spectrum.
Uses median absolute deviation (MAD) in a noise region to estimate noise level. The signal level is determined by signal_method.
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
noise_region (tuple of (float, float), default=(19500, 20000)) – m/z range to use for noise estimation. Should be a region with minimal peaks (typically high m/z range).
signal_method (str, default="max") –
How to estimate the signal level:
"max": maximum intensity (standard approach).
"median_peaks": median intensity of the top n_top_peaks detected peaks. More robust to single outlier peaks.
n_top_peaks (int, default=10) – Number of top peaks to consider when
signal_method="median_peaks".
- Returns:
Estimated signal-to-noise ratio, capped at 1e6. Returns 1e6 when the noise standard deviation is zero or the configured noise region contains no data points.
- Return type:
- Raises:
ValueError – If signal_method is not one of ‘max’ or ‘median_peaks’.
Notes
The MAD-to-standard-deviation conversion factor (1.4826) assumes normally distributed noise.
Examples
>>> from maldiamrkit import MaldiSpectrum
>>> from maldiamrkit.preprocessing import estimate_snr
>>> spec = MaldiSpectrum("spectrum.txt").preprocess()
>>> snr = estimate_snr(spec)
>>> print(f"SNR: {snr:.1f}")
>>> snr_robust = estimate_snr(spec, signal_method="median_peaks")
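The "max" variant of the calculation can be sketched on plain lists. This is an illustration rather than the library's code: `estimate_snr_sketch` is a hypothetical helper, and treating the noise-region bounds as inclusive is an assumption.

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # MAD-to-standard-deviation factor for Gaussian noise
SNR_CAP = 1e6          # cap described in the Returns section above

def estimate_snr_sketch(mass, intensity, noise_region=(19500, 20000)):
    """Max intensity divided by a MAD-based noise estimate."""
    lo, hi = noise_region
    # Assumption: both bounds of the noise region are inclusive.
    noise = [y for m, y in zip(mass, intensity) if lo <= m <= hi]
    if not noise:
        return SNR_CAP
    center = median(noise)
    mad = median([abs(y - center) for y in noise])
    sigma = MAD_TO_SIGMA * mad
    if sigma == 0:
        return SNR_CAP
    return min(max(intensity) / sigma, SNR_CAP)

mass = [5000.0, 19600.0, 19700.0, 19800.0, 19900.0, 19950.0]
intensity = [100.0, 1.0, 2.0, 3.0, 4.0, 5.0]
snr = estimate_snr_sketch(mass, intensity)
print(round(snr, 2))  # 67.45
```

Here the noise samples have a median of 3 and a MAD of 1, so the noise estimate is 1.4826 and the ratio is 100 / 1.4826.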
- class maldiamrkit.preprocessing.SpectrumQuality(noise_region=(19500, 20000), peak_prominence=0.0001, signal_method=SignalMethod.max, n_top_peaks=10)[source]#
Bases: object
Comprehensive quality assessment for MALDI-TOF spectra.
Provides methods to compute various quality metrics for individual spectra, useful for quality control and filtering poor-quality acquisitions.
- Parameters:
noise_region (tuple of (float, float), default=(19500, 20000)) – m/z range to use for noise estimation. Should be a region with minimal peaks (typically high m/z range).
peak_prominence (float, default=1e-4) – Minimum prominence for peak detection.
signal_method (str, default="max") –
How to estimate the signal level for SNR calculation:
"max": use the maximum intensity (standard, but sensitive to single outlier peaks).
"median_peaks": use the median intensity of the top n_top_peaks detected peaks (more robust).
n_top_peaks (int, default=10) – Number of top peaks to consider when
signal_method="median_peaks".
Examples
>>> from maldiamrkit import MaldiSpectrum
>>> from maldiamrkit.preprocessing.quality import SpectrumQuality
>>> spec = MaldiSpectrum("spectrum.txt").preprocess()
>>> qc = SpectrumQuality(noise_region=(19500, 20000))
>>> report = qc.assess(spec)
>>> print(f"SNR: {report.snr:.1f}")
>>> print(f"TIC: {report.total_ion_count:.2e}")
>>> print(f"Peaks: {report.peak_count}")
- __init__(noise_region=(19500, 20000), peak_prominence=0.0001, signal_method=SignalMethod.max, n_top_peaks=10)[source]#
- estimate_noise_level(spectrum)[source]#
Estimate noise level using MAD in noise region.
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
- Returns:
Estimated noise standard deviation. Returns 0 if noise region is empty.
- Return type:
- estimate_mad_noise(spectrum, mz_region=None, constant=1.4826)[source]#
Estimate noise level via median absolute deviation (MAD).
Uses scipy.stats.median_abs_deviation() on the intensities in the selected m/z region and multiplies the raw MAD by constant. The default constant = 1.4826 = 1 / Phi^{-1}(3/4) rescales MAD to match the standard deviation of a Gaussian (Rousseeuw & Croux 1993), matching the convention used by estimate_noise_level().
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
mz_region (tuple of (float, float), optional) – m/z range to use for noise estimation. When None (the default), falls back to self.noise_region.
constant (float, default=1.4826) – Scale factor applied to the raw MAD. Use 1.4826 for a standard-normal-scaled estimator (equivalent to scale='normal' in scipy).
- Returns:
Estimated noise level. Returns 0.0 when the selected region contains no data points.
- Return type:
- estimate_baseline_fraction(spectrum)[source]#
Estimate fraction of intensity below noise floor.
This indicates how much of the spectrum is dominated by baseline rather than signal. High values suggest poor acquisition quality or excessive baseline.
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
- Returns:
Fraction of data points below 2x noise level (0 to 1).
- Return type:
- estimate_dynamic_range(spectrum)[source]#
Estimate dynamic range as log10 ratio of max to median signal.
Higher values indicate better separation between signal and background.
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
- Returns:
Log10 ratio of max to median intensity. Returns 0 if median is zero.
- Return type:
- count_peaks(spectrum)[source]#
Count the number of peaks in the spectrum.
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
- Returns:
Number of detected peaks.
- Return type:
- assess(spectrum)[source]#
Perform full quality assessment of a spectrum.
- Parameters:
spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.
- Returns:
Dataclass containing all quality metrics.
- Return type:
- class maldiamrkit.preprocessing.SpectrumQualityReport(snr, total_ion_count, peak_count, baseline_fraction, noise_level, dynamic_range)[source]#
Bases: object
Quality metrics report for a single MALDI-TOF spectrum.
- Variables:
snr (float) – Signal-to-noise ratio.
total_ion_count (float) – Sum of all intensities (total ion count).
peak_count (int) – Number of detected peaks.
baseline_fraction (float) – Fraction of data points below noise floor (baseline contamination).
noise_level (float) – Estimated noise level (standard deviation).
dynamic_range (float) – Log10 ratio of max to median signal intensity.
- Parameters:
- class maldiamrkit.preprocessing.SignalMethod(value)[source]#
Method for estimating the signal level in SNR computation.
- Variables:
- max = 'max'#
- median_peaks = 'median_peaks'#
Usage Example#
from maldiamrkit import MaldiSpectrum
from maldiamrkit.preprocessing import SpectrumQuality
# Assess spectrum quality
spec = MaldiSpectrum("spectrum.txt").preprocess()
qc = SpectrumQuality() # Uses high m/z region (19500-20000) by default
report = qc.assess(spec)
print(f"SNR: {report.snr:.1f}")
print(f"Peak count: {report.peak_count}")
print(f"Total ion count: {report.total_ion_count:.2e}")
print(f"Baseline fraction: {report.baseline_fraction:.2%}")
print(f"Dynamic range: {report.dynamic_range:.2f}")
Replicate Merging#
- maldiamrkit.preprocessing.merge_replicates(spectra, method=MergingMethod.mean, weights=None)[source]#
Merge replicate spectra into a single consensus spectrum.
- Parameters:
spectra (list of MaldiSpectrum) – Replicate spectra to merge.
method (str, default="mean") –
Merging strategy:
"mean": arithmetic mean (or weighted mean if weights is provided).
"median": element-wise median (weights is ignored).
weights (array-like of float, optional) – Per-replicate weights for the "mean" method (e.g. SNR values). Ignored when method="median". Must have the same length as spectra.
- Returns:
Merged spectrum with mass and intensity columns.
- Return type:
pd.DataFrame
- Raises:
ValueError – If spectra is empty, method is invalid, or weights length does not match spectra.
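The two merging strategies can be sketched on plain intensity lists. This is an illustration, not the library's code: `merge_replicates_sketch` is a hypothetical helper, and the replicates are assumed to share one m/z axis already (for example after binning).

```python
from statistics import median

def merge_replicates_sketch(replicates, method="mean", weights=None):
    """Merge equal-length intensity lists point by point."""
    if not replicates:
        raise ValueError("replicates must not be empty")
    if method == "median":
        # Element-wise median; weights are ignored.
        return [median(column) for column in zip(*replicates)]
    if method != "mean":
        raise ValueError("method must be 'mean' or 'median'")
    if weights is None:
        weights = [1.0] * len(replicates)
    if len(weights) != len(replicates):
        raise ValueError("weights must match the number of replicates")
    total = sum(weights)
    # Weighted mean per position; equal weights give the arithmetic mean.
    return [sum(w * y for w, y in zip(weights, column)) / total
            for column in zip(*replicates)]

reps = [[1.0, 2.0], [3.0, 4.0], [5.0, 12.0]]
print(merge_replicates_sketch(reps))                           # [3.0, 6.0]
print(merge_replicates_sketch(reps, method="median"))          # [3.0, 4.0]
print(merge_replicates_sketch(reps, weights=[1.0, 1.0, 2.0]))  # [3.5, 7.5]
```

The median result is unaffected by the large value in the third replicate, whereas the mean is pulled toward it, and weighting that replicate more heavily pulls it further.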
- maldiamrkit.preprocessing.detect_outlier_replicates(spectra, threshold=3.0)[source]#
Identify outlier replicates using correlation with the median spectrum.
Computes the Pearson correlation of each replicate against the element-wise median spectrum. Replicates whose correlation falls below median(corrs) - threshold * MAD(corrs) are flagged as outliers.
- Parameters:
spectra (list of MaldiSpectrum) – Replicate spectra.
threshold (float, default=3.0) – Number of MAD units below the median correlation to flag a replicate as an outlier.
- Returns:
Boolean array of length len(spectra). True means the replicate is kept; False means it is an outlier.
- Return type:
np.ndarray
- Raises:
ValueError – If spectra has fewer than 3 elements (need at least 3 to estimate spread).
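The flagging rule can be sketched on plain intensity lists. This is an illustration, not the library's code: `detect_outliers_sketch` and `pearson` are hypothetical helpers, the MAD is used unscaled (whether the library applies a scale factor is not stated here), and the replicates are assumed to share one m/z axis.

```python
import math
from statistics import median

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def detect_outliers_sketch(replicates, threshold=3.0):
    """Return a keep-mask: True = kept, False = flagged as outlier."""
    if len(replicates) < 3:
        raise ValueError("need at least 3 replicates to estimate spread")
    consensus = [median(column) for column in zip(*replicates)]
    corrs = [pearson(r, consensus) for r in replicates]
    center = median(corrs)
    # Assumption: raw (unscaled) MAD of the correlations.
    mad = median([abs(c - center) for c in corrs])
    cutoff = center - threshold * mad
    return [c >= cutoff for c in corrs]

reps = [
    [1.0, 5.0, 2.0, 8.0, 3.0],
    [1.2, 4.8, 2.1, 7.9, 3.3],
    [0.9, 5.3, 1.8, 8.2, 2.9],
    [1.1, 5.1, 2.2, 7.7, 3.1],
    [8.0, 1.0, 7.0, 0.5, 6.0],  # inverted pattern: the intended outlier
]
mask = detect_outliers_sketch(reps)
print(mask[-1])  # False
```

With very tight replicates the MAD becomes small and the cutoff strict, so a larger threshold may be appropriate when the remaining replicates agree almost perfectly.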