Preprocessing Module#

Functions for preprocessing MALDI-TOF spectra.

PreprocessingPipeline#

class maldiamrkit.preprocessing.PreprocessingPipeline(steps)[source]#

Bases: object

Composable pipeline of preprocessing steps for MALDI-TOF spectra.

Parameters:

steps (list of (str, transformer) tuples) – Named preprocessing steps. Each transformer must be callable, accepting and returning a pd.DataFrame with mass and intensity columns.

Examples

>>> pipe = PreprocessingPipeline.default()
>>> preprocessed = pipe(raw_spectrum_df)

__init__(steps)[source]#
Parameters:

steps (list[tuple[str, PreprocessingStep]])

__call__(df)[source]#

Apply all preprocessing steps sequentially.

Parameters:

df (pd.DataFrame) – Raw spectrum with mass and intensity columns.

Returns:

Preprocessed spectrum.

Return type:

pd.DataFrame
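
As a rough illustration of sequential application, the sketch below chains two hand-written steps over a small DataFrame. The helper names (clip_negatives, sqrt_transform, apply_steps) are hypothetical and not part of maldiamrkit; only the contract (a callable that accepts and returns a DataFrame with mass and intensity columns) comes from the description above.

```python
import numpy as np
import pandas as pd


def clip_negatives(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: set negative intensities to zero."""
    out = df.copy()
    out["intensity"] = out["intensity"].clip(lower=0)
    return out


def sqrt_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: take the square root of the intensities."""
    out = df.copy()
    out["intensity"] = np.sqrt(out["intensity"])
    return out


def apply_steps(df, steps):
    """Apply (name, callable) pairs in order, feeding each output to the next."""
    for _name, step in steps:
        df = step(df)
    return df


raw = pd.DataFrame(
    {"mass": [2000.0, 2001.0, 2002.0], "intensity": [-4.0, 9.0, 16.0]}
)
steps = [("clip", clip_negatives), ("sqrt", sqrt_transform)]
result = apply_steps(raw, steps)
print(result["intensity"].tolist())  # [0.0, 3.0, 4.0]
```

Order matters in this sketch: clipping runs before the square root so that no negative value reaches np.sqrt.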

classmethod default()[source]#

Return the standard preprocessing pipeline.

Steps: clip negatives -> sqrt transform -> Savitzky-Golay smoothing -> SNIP baseline -> m/z trim (2000-20000 Da) -> TIC normalization.

Returns:

Default pipeline instance.

Return type:

PreprocessingPipeline

get_step(name)[source]#

Get a step by name.

Parameters:

name (str) – Step name.

Returns:

The transformer associated with that name.

Return type:

object

Raises:

KeyError – If no step with that name exists.

property step_names: list[str]#

Return the names of all steps.

property mz_range: tuple[int, int]#

Extract (mz_min, mz_max) from the MzTrimmer step.

Returns:

The m/z range from the MzTrimmer step, or the default (2000, 20000) if no MzTrimmer is present.

Return type:

tuple[int, int]

to_dict()[source]#

Serialize the pipeline to a dictionary.

Returns:

Dictionary representation suitable for JSON/YAML serialization.

Return type:

dict

classmethod from_dict(d)[source]#

Reconstruct a pipeline from a dictionary.

Parameters:

d (dict) – Dictionary as produced by to_dict().

Returns:

Reconstructed pipeline.

Return type:

PreprocessingPipeline

to_json(path)[source]#

Save the pipeline configuration to a JSON file.

Parameters:

path (str or Path) – Output file path.

Return type:

None

classmethod from_json(path)[source]#

Load a pipeline from a JSON file.

Parameters:

path (str or Path) – Input file path.

Returns:

Reconstructed pipeline.

Return type:

PreprocessingPipeline

to_yaml(path)[source]#

Save the pipeline configuration to a YAML file.

Requires pyyaml to be installed.

Parameters:

path (str or Path) – Output file path.

Return type:

None

classmethod from_yaml(path)[source]#

Load a pipeline from a YAML file.

Requires pyyaml to be installed.

Parameters:

path (str or Path) – Input file path.

Returns:

Reconstructed pipeline.

Return type:

PreprocessingPipeline

The preprocess() function is a convenience wrapper around the pipeline:

maldiamrkit.preprocessing.preprocess(df, pipeline=None)[source]#

Apply preprocessing pipeline to a raw MALDI-TOF spectrum.

By default applies: clip negatives -> sqrt transform -> Savitzky-Golay smoothing -> SNIP baseline -> m/z trim (2000-20000 Da) -> TIC normalization.

Parameters:
  • df (pd.DataFrame) – Raw spectrum with columns ‘mass’ and ‘intensity’.

  • pipeline (PreprocessingPipeline, optional) – Custom pipeline. If None, uses PreprocessingPipeline.default().

Returns:

Preprocessed spectrum with columns ‘mass’ and ‘intensity’.

Return type:

pd.DataFrame

See also

PreprocessingPipeline

Composable preprocessing pipeline class.

bin_spectrum

Bin preprocessed spectrum into m/z bins.

Examples

>>> from maldiamrkit.preprocessing import preprocess, PreprocessingPipeline
>>> preprocessed = preprocess(raw_df)
>>> preprocessed = preprocess(raw_df, PreprocessingPipeline.default())

Individual Transformers#

Each transformer is a callable operating on a DataFrame with mass and intensity columns. They can be composed via PreprocessingPipeline.

class maldiamrkit.preprocessing.ClipNegatives[source]#

Clip negative intensity values to zero.

__call__(df)[source]#

Apply clipping to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.SqrtTransform[source]#

Variance-stabilizing square root transformation.

__call__(df)[source]#

Apply square-root transformation to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.LogTransform[source]#

Log1p intensity transformation (alternative to sqrt).

__call__(df)[source]#

Apply log1p transformation to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict
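
To make the difference between the two intensity transforms concrete, the following sketch applies plain NumPy equivalents to a few values. It works on a bare array rather than on the library's transformers, so it only illustrates the arithmetic.

```python
import numpy as np

intensity = np.array([0.0, 1.0, 100.0, 10000.0])

sqrt_scaled = np.sqrt(intensity)   # 0, 1, 10, 100
log_scaled = np.log1p(intensity)   # log(1 + x), so zero stays zero

# The logarithm compresses large intensities far more strongly than the
# square root does.
print(sqrt_scaled[-1] / sqrt_scaled[1])
print(log_scaled[-1] / log_scaled[1])
```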

class maldiamrkit.preprocessing.SavitzkyGolaySmooth(window_length=21, polyorder=2)[source]#

Savitzky-Golay smoothing filter.

Parameters:
  • window_length (int, default=21) – Length of the filter window. Must be a positive odd integer (per Savitzky & Golay 1964).

  • polyorder (int, default=2) – Order of the polynomial used to fit the samples.

__init__(window_length=21, polyorder=2)[source]#
Parameters:
  • window_length (int)

  • polyorder (int)

__call__(df)[source]#

Apply Savitzky-Golay smoothing.

Raises:

ValueError – If window_length exceeds the data length, or if window_length is not greater than polyorder.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.MovingAverageSmooth(window_length=5)[source]#

Moving-average smoothing filter.

Applies a uniform (boxcar) moving average of length window_length using reflective boundary handling.

Parameters:

window_length (int, default=5) – Length of the smoothing window. Must be an odd integer greater than or equal to 3.

Raises:

ValueError – If window_length is not an odd integer >= 3, or if it exceeds the data length.

__init__(window_length=5)[source]#
Parameters:

window_length (int)

__call__(df)[source]#

Apply moving-average smoothing to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict
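
A boxcar moving average with mirrored edges can be sketched with NumPy alone, as below. The function name moving_average is hypothetical, and the padding mode ("symmetric", which repeats the edge sample) is an assumption; the library's exact boundary convention may differ.

```python
import numpy as np


def moving_average(intensity, window_length=5):
    """Boxcar smoothing with mirrored edges (illustrative sketch)."""
    if window_length < 3 or window_length % 2 == 0:
        raise ValueError("window_length must be an odd integer >= 3")
    y = np.asarray(intensity, dtype=float)
    if window_length > y.size:
        raise ValueError("window_length exceeds the data length")
    half = window_length // 2
    # Mirror the signal at both ends so the output keeps the input length.
    padded = np.pad(y, half, mode="symmetric")
    kernel = np.full(window_length, 1.0 / window_length)
    return np.convolve(padded, kernel, mode="valid")


spike = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
smoothed = moving_average(spike, window_length=5)
print(smoothed)  # the spike is spread evenly over five points
```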

class maldiamrkit.preprocessing.SNIPBaseline(half_window=40)[source]#

SNIP (Statistics-sensitive Non-linear Iterative Peak-clipping) baseline correction.

Parameters:

half_window (int, default=40) – Half-window size for the SNIP algorithm.

__init__(half_window=40)[source]#
Parameters:

half_window (int)

__call__(df)[source]#

Apply SNIP baseline correction to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict
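
The core of SNIP is a repeated comparison of each point with the average of its two neighbours at a growing distance. The sketch below shows that clipping loop in its simplest form. It is a minimal illustration, not the library's implementation: the function name snip_baseline is hypothetical, the window grows from 1 to half_window, and refinements used by some SNIP variants (for example a log-log-sqrt transform or a decreasing window) are left out.

```python
import numpy as np


def snip_baseline(intensity, half_window=40):
    """Estimate a baseline by iterative peak clipping (basic SNIP sketch)."""
    base = np.asarray(intensity, dtype=float).copy()
    n = base.size
    for p in range(1, half_window + 1):
        if n - 2 * p <= 0:
            break  # window no longer fits inside the signal
        # Average of the neighbours p points to the left and right.
        neighbour_mean = 0.5 * (base[: n - 2 * p] + base[2 * p:])
        # Keep the smaller of the current value and the neighbour average.
        base[p: n - p] = np.minimum(base[p: n - p], neighbour_mean)
    return base


x = np.arange(200)
ramp = 0.01 * x                                        # slowly rising background
peak = 5.0 * np.exp(-0.5 * ((x - 100) / 3.0) ** 2)     # narrow peak at index 100
signal = ramp + peak

baseline = snip_baseline(signal, half_window=40)
corrected = signal - baseline
```

In this toy case the linear background is left untouched by the clipping, while the narrow peak is cut away from the baseline estimate, so the corrected peak keeps its height.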

class maldiamrkit.preprocessing.TopHatBaseline(half_window=100)[source]#

Morphological top-hat baseline subtraction.

Estimates the baseline by morphological grey-level opening of the intensity trace (erosion followed by dilation), then subtracts it from the spectrum and clips negative values to zero.

Parameters:

half_window (int, default=100) – Half-width of the structuring element in bins. The full element size is 2 * half_window + 1. Must be a positive integer.

Raises:

ValueError – If half_window is not a positive integer or exceeds the data length.

__init__(half_window=100)[source]#
Parameters:

half_window (int)

__call__(df)[source]#

Apply top-hat baseline subtraction to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict
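
The opening described above (erosion, then dilation) amounts to a rolling minimum followed by a rolling maximum. The sketch below implements that with NumPy; the function names and the edge padding are assumptions made for illustration and are not taken from the library.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_extreme(values, half_window, reducer):
    """Rolling min or max over a window of 2 * half_window + 1 points."""
    padded = np.pad(values, half_window, mode="edge")
    windows = sliding_window_view(padded, 2 * half_window + 1)
    return reducer(windows, axis=1)


def tophat_correct(intensity, half_window):
    """Subtract the morphological opening and clip negatives to zero."""
    y = np.asarray(intensity, dtype=float)
    eroded = rolling_extreme(y, half_window, np.min)       # erosion
    opened = rolling_extreme(eroded, half_window, np.max)  # dilation
    return np.clip(y - opened, 0.0, None)


signal = np.ones(21)   # flat background of 1.0
signal[10] = 4.0       # one narrow peak, 3.0 above the background
corrected = tophat_correct(signal, half_window=2)
```

Features narrower than the structuring element survive the subtraction, while the flat background is removed.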

class maldiamrkit.preprocessing.ConvexHullBaseline[source]#

Parameter-free baseline from the lower convex hull of the spectrum.

Computes the convex hull of the (mass, intensity) points, extracts the lower hull (vertices traversed in ascending mass with minimum intensity), linearly interpolates it onto the full m/z axis, and subtracts the resulting baseline from the spectrum. Negative residuals are clipped to zero.

Notes

Requires at least three distinct points to form a hull. For shorter inputs, the baseline falls back to the minimum of the first and last intensities at every point (a degenerate “flat” hull).

__call__(df)[source]#

Apply convex-hull baseline subtraction to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict
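
The lower-hull construction can be sketched with a monotone-chain scan followed by linear interpolation. This is an illustrative version under the assumption that mass is sorted in ascending order; the function name lower_hull_baseline is hypothetical.

```python
import numpy as np


def lower_hull_baseline(mass, intensity):
    """Interpolate the lower convex hull of (mass, intensity) points."""
    mass = np.asarray(mass, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    hull = []  # indices of the lower-hull vertices, in ascending mass
    for i in range(mass.size):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # Cross product of (a -> b) and (a -> i); a non-positive value
            # means b lies on or above the segment a -> i, so b is dropped.
            cross = (mass[b] - mass[a]) * (intensity[i] - intensity[a]) - (
                intensity[b] - intensity[a]
            ) * (mass[i] - mass[a])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(mass, mass[hull], intensity[hull])


mass = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
intensity = np.array([1.0, 3.0, 0.5, 4.0, 1.0])
baseline = lower_hull_baseline(mass, intensity)
corrected = np.clip(intensity - baseline, 0.0, None)
print(baseline)   # hull vertices are the points at mass 0, 2 and 4
```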

class maldiamrkit.preprocessing.MedianBaseline(half_window=100, iterations=1)[source]#

Rolling-median baseline subtraction.

Estimates the baseline via a rolling median filter applied iterations times, then subtracts it from the spectrum and clips negative values to zero.

Parameters:
  • half_window (int, default=100) – Half-width of the median filter in bins. The full window size is 2 * half_window + 1. Must be a positive integer.

  • iterations (int, default=1) – Number of times the median filter is applied. Must be a positive integer. Additional iterations further flatten broad features at the cost of compute time.

Raises:

ValueError – If half_window or iterations is not a positive integer, or if the filter window exceeds the data length.

__init__(half_window=100, iterations=1)[source]#
Parameters:
  • half_window (int)

  • iterations (int)

__call__(df)[source]#

Apply rolling-median baseline subtraction to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.MzTrimmer(mz_min=2000, mz_max=20000)[source]#

Trim spectrum to a specified m/z range.

Parameters:
  • mz_min (int, default=2000) – Lower m/z bound in Daltons.

  • mz_max (int, default=20000) – Upper m/z bound in Daltons.

Raises:

ValueError – If mz_min is greater than or equal to mz_max.

__init__(mz_min=2000, mz_max=20000)[source]#
Parameters:
  • mz_min (int)

  • mz_max (int)

__call__(df)[source]#

Apply m/z trimming to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.TICNormalizer[source]#

Total Ion Current normalization (intensities sum to 1).

__call__(df)[source]#

Apply TIC normalization to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.MedianNormalizer[source]#

Normalize intensities by median value.

__call__(df)[source]#

Apply median normalization to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

class maldiamrkit.preprocessing.PQNNormalizer(reference=None)[source]#

Probabilistic Quotient Normalization.

First normalizes by TIC, then divides by the median of the quotient spectrum (sample / reference). If no reference is provided, only the TIC normalization is applied; a full PQN needs a reference spectrum, typically the median spectrum across the dataset.

Parameters:

reference (np.ndarray, list, or None, default=None) – Reference intensity vector. If None, uses TIC normalization only (the full PQN requires a reference from the dataset). Lists are converted to arrays internally.

__init__(reference=None)[source]#
Parameters:

reference (ndarray | list | None)

__call__(df)[source]#

Apply PQN normalization to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict
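
The two PQN steps (TIC scaling, then division by the median quotient) can be written in a few lines of NumPy. The sketch below is a stand-alone illustration with a hypothetical function name; it assumes the sample and the reference share the same m/z axis and skips zero entries when forming quotients.

```python
import numpy as np


def pqn_normalize(intensity, reference):
    """Probabilistic quotient normalization of one intensity vector."""
    y = np.asarray(intensity, dtype=float)
    ref = np.asarray(reference, dtype=float)
    y = y / y.sum()                      # step 1: TIC normalization
    valid = (ref > 0) & (y > 0)          # avoid division by zero
    quotients = y[valid] / ref[valid]    # step 2: quotient spectrum
    return y / np.median(quotients)      # step 3: divide by median quotient


reference = np.array([0.1, 0.2, 0.3, 0.4])
# Same profile as the reference, except that the last peak is ten times larger.
sample = np.array([1.0, 2.0, 3.0, 40.0])
normalized = pqn_normalize(sample, reference)
print(normalized)
```

Because the median quotient ignores the single inflated peak, the three unchanged peaks line up with the reference again, which plain TIC scaling would not achieve.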

class maldiamrkit.preprocessing.MzMultiTrimmer(mz_ranges)[source]#

Keep only specific m/z ranges from the spectrum.

Parameters:

mz_ranges (list of tuple[float, float]) – List of (mz_min, mz_max) ranges to keep.

Raises:

ValueError – If mz_ranges is empty.

__init__(mz_ranges)[source]#
Parameters:

mz_ranges (list[tuple[float, float]])

__call__(df)[source]#

Apply m/z range subsetting to the spectrum.

Parameters:

df (DataFrame)

Return type:

DataFrame

to_dict()[source]#

Serialize transformer to a dictionary.

Return type:

dict

Pipeline Serialization#

Save and load pipeline configurations for reproducibility:

from maldiamrkit.preprocessing import PreprocessingPipeline

pipe = PreprocessingPipeline.default()

# Save to JSON, then load the configuration back
pipe.to_json("pipeline.json")
pipe = PreprocessingPipeline.from_json("pipeline.json")

# Save to YAML, then load the configuration back (requires pyyaml)
pipe.to_yaml("pipeline.yaml")
pipe = PreprocessingPipeline.from_yaml("pipeline.yaml")
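
For readers who want to see what a dictionary round trip of named steps can look like, here is a self-contained sketch using only the standard library. Everything in it is hypothetical: the Trim class, the "type" and "name" keys, and the REGISTRY lookup are invented for illustration, and the schema actually written by to_dict() and to_json() may be different.

```python
import json


class Trim:
    """Hypothetical transformer with two parameters."""

    def __init__(self, mz_min=2000, mz_max=20000):
        self.mz_min = mz_min
        self.mz_max = mz_max

    def to_dict(self):
        return {"type": "Trim", "mz_min": self.mz_min, "mz_max": self.mz_max}


# Maps the stored type name back to a class when loading.
REGISTRY = {"Trim": Trim}


def dump_steps(steps):
    """Serialize (name, transformer) pairs to a JSON string."""
    return json.dumps([{"name": name, **step.to_dict()} for name, step in steps])


def load_steps(text):
    """Rebuild (name, transformer) pairs from a JSON string."""
    steps = []
    for entry in json.loads(text):
        name = entry.pop("name")
        cls = REGISTRY[entry.pop("type")]
        steps.append((name, cls(**entry)))  # remaining keys are parameters
    return steps


original = [("trim", Trim(mz_min=3000, mz_max=15000))]
restored = load_steps(dump_steps(original))
print(restored[0][0], restored[0][1].mz_min, restored[0][1].mz_max)
```

Storing parameters rather than pickled objects keeps the file readable and lets the same configuration be reapplied to new data.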

Binning#

maldiamrkit.preprocessing.bin_spectrum(df, mz_min=2000, mz_max=20000, bin_width=3, method=BinningMethod.uniform, custom_edges=None, adaptive_min_width=1.0, adaptive_max_width=10.0, adaptive_peak_prominence=None, adaptive_kde_bandwidth=None)[source]#

Bin spectrum intensities into m/z intervals.

Supports multiple binning strategies: uniform (fixed width), proportional (width scales linearly with m/z), adaptive (smaller bins in peak-dense regions), and custom (user-defined edges).

Parameters:
  • df (pd.DataFrame) – Preprocessed spectrum with columns ‘mass’ and ‘intensity’.

  • mz_min (int, default=2000) – Lower m/z bound in Daltons.

  • mz_max (int, default=20000) – Upper m/z bound in Daltons.

  • bin_width (int or float, default=3) – Width of each bin in Daltons. For ‘uniform’, this is the fixed width. For ‘proportional’, this is the reference width at mz_min. Ignored for ‘adaptive’ and ‘custom’ methods.

  • method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.

  • custom_edges (array-like, optional) – User-provided bin edges. Required if method='custom'.

  • adaptive_min_width (float, default=1.0) – Minimum bin width in Daltons for adaptive binning.

  • adaptive_max_width (float, default=10.0) – Maximum bin width in Daltons for adaptive binning.

  • adaptive_peak_prominence (float or None, default=None) – Minimum prominence for peak detection in adaptive binning. If None, uses a MAD-based estimate (robust to outliers).

  • adaptive_kde_bandwidth (float or None, default=None) – Bandwidth for the Gaussian KDE in adaptive binning. If None, uses Silverman’s rule of thumb.

Returns:

Tuple of (binned_spectrum, bin_metadata). binned_spectrum has columns ‘mass’ (bin start) and ‘intensity’. bin_metadata has columns ‘bin_index’, ‘bin_start’, ‘bin_end’, ‘bin_width’.

Return type:

tuple[pd.DataFrame, pd.DataFrame]

Raises:

ValueError – If method is invalid, custom_edges is missing for ‘custom’ method, or bin_width < 1.

Examples

>>> from maldiamrkit.preprocessing import bin_spectrum
>>>
>>> # Uniform binning (default)
>>> binned, metadata = bin_spectrum(df, bin_width=3)
>>>
>>> # Proportional binning (width grows with m/z)
>>> binned, metadata = bin_spectrum(df, bin_width=3, method='proportional')
>>>
>>> # Adaptive binning
>>> binned, metadata = bin_spectrum(df, method='adaptive')
>>>
>>> # Custom binning
>>> edges = [2000, 5000, 10000, 15000, 20000]
>>> binned, metadata = bin_spectrum(df, method='custom', custom_edges=edges)

maldiamrkit.preprocessing.get_bin_metadata(edges)[source]#

Generate bin metadata from edges.

Parameters:

edges (np.ndarray) – Array of bin edges.

Returns:

DataFrame with columns: bin_index, bin_start, bin_end, bin_width.

Return type:

pd.DataFrame
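
The relationship between bin edges, binned intensities, and the metadata columns can be illustrated with NumPy. This sketch assumes that intensities falling into a bin are summed; how bin_spectrum aggregates values internally is not stated here, so treat the aggregation as an assumption.

```python
import numpy as np

# Uniform 3 Da edges over a tiny range: 2000, 2003, 2006, 2009, 2012.
edges = np.arange(2000, 2012 + 3, 3)

mass = np.array([2000.5, 2001.0, 2004.2, 2011.9])
intensity = np.array([1.0, 2.0, 3.0, 4.0])

# Sum the intensity of all points falling into each bin.
binned, _ = np.histogram(mass, bins=edges, weights=intensity)
# -> 3, 3, 0, 4 (summed intensity per bin)

# The metadata columns follow directly from the edges.
metadata = {
    "bin_index": np.arange(edges.size - 1),
    "bin_start": edges[:-1],
    "bin_end": edges[1:],
    "bin_width": np.diff(edges),
}
```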

class maldiamrkit.preprocessing.BinningMethod(value)[source]#

Bases: str, Enum

Supported binning methods.

Variables:
  • uniform (str) – Fixed-width bins across the m/z range.

  • proportional (str) – Bin width scales linearly with m/z.

  • adaptive (str) – Smaller bins in peak-dense regions.

  • custom (str) – User-provided bin edges.

uniform = 'uniform'#
proportional = 'proportional'#
adaptive = 'adaptive'#
custom = 'custom'#

Binning Methods#

MaldiAMRKit supports multiple binning strategies:

Uniform (default): Fixed-width bins across the m/z range.

spec.bin(bin_width=3)  # 3 Da bins

Proportional: Bin width scales with m/z, matching instrument resolution.

spec.bin(bin_width=3, method="proportional")

Adaptive: Smaller bins in peak-dense regions, larger bins elsewhere.

spec.bin(method="adaptive", adaptive_min_width=1.0, adaptive_max_width=10.0)

Custom: User-defined bin edges for domain-specific analysis.

spec.bin(method="custom", custom_edges=[2000, 5000, 10000, 15000, 20000])

Bin metadata is available via the bin_metadata attribute:

print(spec.bin_metadata.head())
#    bin_index  bin_start  bin_end  bin_width
# 0          0     2000.0   2003.0        3.0
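
One way to read the proportional method is that each bin is as wide as bin_width scaled by its start position relative to mz_min, which makes the edges grow geometrically. The sketch below builds edges under that assumption; the function name proportional_edges is hypothetical and the library may construct its edges differently.

```python
import numpy as np


def proportional_edges(mz_min, mz_max, bin_width):
    """Edges whose bin width grows linearly with the bin's start m/z."""
    ratio = 1.0 + bin_width / mz_min   # the width at mz_min equals bin_width
    edges = [float(mz_min)]
    while edges[-1] < mz_max:
        # Cap the final edge so the range ends exactly at mz_max.
        edges.append(min(edges[-1] * ratio, float(mz_max)))
    return np.array(edges)


edges = proportional_edges(2000, 20000, bin_width=3)
widths = np.diff(edges)
print(widths[0], widths[-2])  # narrow bins at low m/z, wide bins at high m/z
```

Apart from the last bin, which is truncated at mz_max, every bin is wider than the one before it.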

Quality Metrics#

maldiamrkit.preprocessing.estimate_snr(spectrum, noise_region=(19500, 20000), signal_method=SignalMethod.max, n_top_peaks=10)[source]#

Estimate signal-to-noise ratio of a spectrum.

Uses median absolute deviation (MAD) in a noise region to estimate noise level. The signal level is determined by signal_method.

Parameters:
  • spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

  • noise_region (tuple of (float, float), default=(19500, 20000)) – m/z range to use for noise estimation. Should be a region with minimal peaks (typically high m/z range).

  • signal_method (str, default="max") –

    How to estimate the signal level:

    • "max": maximum intensity (standard approach).

    • "median_peaks": median intensity of the top n_top_peaks detected peaks. More robust to single outlier peaks.

  • n_top_peaks (int, default=10) – Number of top peaks to consider when signal_method="median_peaks".

Returns:

Estimated signal-to-noise ratio, capped at 1e6. Returns 1e6 when the noise standard deviation is zero or the configured noise region contains no data points.

Return type:

float

Raises:

ValueError – If signal_method is not one of ‘max’ or ‘median_peaks’.

Notes

The MAD-to-standard-deviation conversion factor (1.4826) assumes normally distributed noise.
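
The computation described above for signal_method="max" can be sketched on plain arrays as follows. The function name snr_sketch is hypothetical, and treating the noise region bounds as inclusive is an assumption; the sketch mirrors only the documented behaviour (MAD-based noise, a cap at 1e6, and the cap returned for an empty region or zero noise).

```python
import numpy as np


def snr_sketch(mass, intensity, noise_region=(19500, 20000), cap=1e6):
    """Signal-to-noise ratio from the maximum intensity and MAD-based noise."""
    mass = np.asarray(mass, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    in_region = (mass >= noise_region[0]) & (mass <= noise_region[1])
    noise = intensity[in_region]
    if noise.size == 0:
        return cap                       # no data points in the noise region
    mad = np.median(np.abs(noise - np.median(noise)))
    sigma = 1.4826 * mad                 # MAD rescaled to a Gaussian std. dev.
    if sigma == 0:
        return cap                       # perfectly flat noise region
    return float(min(intensity.max() / sigma, cap))


mass = np.array([5000.0, 19600.0, 19700.0, 19800.0, 19900.0, 19950.0])
intensity = np.array([148.26, 1.0, 2.0, 3.0, 4.0, 5.0])
snr = snr_sketch(mass, intensity)        # MAD of 1..5 is 1, so SNR is about 100
```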

Examples

>>> from maldiamrkit import MaldiSpectrum
>>> from maldiamrkit.preprocessing import estimate_snr
>>> spec = MaldiSpectrum("spectrum.txt").preprocess()
>>> snr = estimate_snr(spec)
>>> print(f"SNR: {snr:.1f}")
>>> snr_robust = estimate_snr(spec, signal_method="median_peaks")

class maldiamrkit.preprocessing.SpectrumQuality(noise_region=(19500, 20000), peak_prominence=0.0001, signal_method=SignalMethod.max, n_top_peaks=10)[source]#

Bases: object

Comprehensive quality assessment for MALDI-TOF spectra.

Provides methods to compute various quality metrics for individual spectra, useful for quality control and filtering poor-quality acquisitions.

Parameters:
  • noise_region (tuple of (float, float), default=(19500, 20000)) – m/z range to use for noise estimation. Should be a region with minimal peaks (typically high m/z range).

  • peak_prominence (float, default=1e-4) – Minimum prominence for peak detection.

  • signal_method (str, default="max") –

    How to estimate the signal level for SNR calculation:

    • "max": use the maximum intensity (standard, but sensitive to single outlier peaks).

    • "median_peaks": use the median intensity of the top n_top_peaks detected peaks (more robust).

  • n_top_peaks (int, default=10) – Number of top peaks to consider when signal_method="median_peaks".

Examples

>>> from maldiamrkit import MaldiSpectrum
>>> from maldiamrkit.preprocessing.quality import SpectrumQuality
>>> spec = MaldiSpectrum("spectrum.txt").preprocess()
>>> qc = SpectrumQuality(noise_region=(19500, 20000))
>>> report = qc.assess(spec)
>>> print(f"SNR: {report.snr:.1f}")
>>> print(f"TIC: {report.total_ion_count:.2e}")
>>> print(f"Peaks: {report.peak_count}")
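
__init__(noise_region=(19500, 20000), peak_prominence=0.0001, signal_method=SignalMethod.max, n_top_peaks=10)[source]#
Parameters:
  • noise_region (tuple of (float, float))

  • peak_prominence (float)

  • signal_method (str)

  • n_top_peaks (int)

estimate_noise_level(spectrum)[source]#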

Estimate noise level using MAD in noise region.

Parameters:

spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

Returns:

Estimated noise standard deviation. Returns 0 if noise region is empty.

Return type:

float

estimate_mad_noise(spectrum, mz_region=None, constant=1.4826)[source]#

Estimate noise level via median absolute deviation (MAD).

Uses scipy.stats.median_abs_deviation() on the intensities in the selected m/z region and multiplies the raw MAD by constant. The default constant = 1.4826 = 1 / Phi^{-1}(3/4) rescales MAD to match the standard deviation of a Gaussian (Rousseeuw & Croux 1993), matching the convention used by estimate_noise_level().

Parameters:
  • spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

  • mz_region (tuple of (float, float), optional) – m/z range to use for noise estimation. When None (the default) falls back to self.noise_region.

  • constant (float, default=1.4826) – Scale factor applied to the raw MAD. Use 1.4826 for a standard-normal-scaled estimator (equivalent to scale='normal' in scipy).

Returns:

Estimated noise level. Returns 0.0 when the selected region contains no data points.

Return type:

float
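
The default constant can be reproduced from the standard normal quantile function with the standard library alone, which may help when checking a custom value:

```python
from statistics import NormalDist

# 1 / Phi^{-1}(3/4): the factor that rescales MAD to a Gaussian std. dev.
constant = 1.0 / NormalDist().inv_cdf(0.75)
print(round(constant, 4))  # 1.4826
```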

estimate_baseline_fraction(spectrum)[source]#

Estimate the fraction of data points whose intensity lies below the noise floor (2x the estimated noise level).

This indicates how much of the spectrum is dominated by baseline rather than signal. High values suggest poor acquisition quality or excessive baseline.

Parameters:

spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

Returns:

Fraction of data points below 2x noise level (0 to 1).

Return type:

float

estimate_dynamic_range(spectrum)[source]#

Estimate dynamic range as log10 ratio of max to median signal.

Higher values indicate better separation between signal and background.

Parameters:

spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

Returns:

Log10 ratio of max to median intensity. Returns 0 if median is zero.

Return type:

float
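
On a bare intensity array, the quantity described above reduces to a single logarithm. A minimal sketch with a hypothetical function name:

```python
import numpy as np


def dynamic_range_sketch(intensity):
    """log10(max / median), or 0.0 when the median is zero."""
    y = np.asarray(intensity, dtype=float)
    median = np.median(y)
    if median == 0:
        return 0.0
    return float(np.log10(y.max() / median))


# Median of 1 and maximum of 1000: three orders of magnitude.
value = dynamic_range_sketch([1.0, 1.0, 1.0, 1.0, 1000.0])
```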

count_peaks(spectrum)[source]#

Count the number of peaks in the spectrum.

Parameters:

spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

Returns:

Number of detected peaks.

Return type:

int

assess(spectrum)[source]#

Perform full quality assessment of a spectrum.

Parameters:

spectrum (MaldiSpectrum) – Spectrum to assess. Uses preprocessed data if available, otherwise raw.

Returns:

Dataclass containing all quality metrics.

Return type:

SpectrumQualityReport

class maldiamrkit.preprocessing.SpectrumQualityReport(snr, total_ion_count, peak_count, baseline_fraction, noise_level, dynamic_range)[source]#

Bases: object

Quality metrics report for a single MALDI-TOF spectrum.

Variables:
  • snr (float) – Signal-to-noise ratio.

  • total_ion_count (float) – Sum of all intensities (total ion count).

  • peak_count (int) – Number of detected peaks.

  • baseline_fraction (float) – Fraction of data points below noise floor (baseline contamination).

  • noise_level (float) – Estimated noise level (standard deviation).

  • dynamic_range (float) – Log10 ratio of max to median signal intensity.

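__init__(snr, total_ion_count, peak_count, baseline_fraction, noise_level, dynamic_range)#
Parameters:
  • snr (float)

  • total_ion_count (float)

  • peak_count (int)

  • baseline_fraction (float)

  • noise_level (float)

  • dynamic_range (float)

Return type: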

None

class maldiamrkit.preprocessing.SignalMethod(value)[source]#

Bases: str, Enum

Method for estimating the signal level in SNR computation.

Variables:
  • max (str) – Maximum intensity (standard approach).

  • median_peaks (str) – Median intensity of the top detected peaks (more robust).

max = 'max'#
median_peaks = 'median_peaks'#

Usage Example#

from maldiamrkit import MaldiSpectrum
from maldiamrkit.preprocessing import SpectrumQuality

# Assess spectrum quality
spec = MaldiSpectrum("spectrum.txt").preprocess()
qc = SpectrumQuality()  # Uses high m/z region (19500-20000) by default
report = qc.assess(spec)

print(f"SNR: {report.snr:.1f}")
print(f"Peak count: {report.peak_count}")
print(f"Total ion count: {report.total_ion_count:.2e}")
print(f"Baseline fraction: {report.baseline_fraction:.2%}")
print(f"Dynamic range: {report.dynamic_range:.2f}")

Replicate Merging#

maldiamrkit.preprocessing.merge_replicates(spectra, method=MergingMethod.mean, weights=None)[source]#

Merge replicate spectra into a single consensus spectrum.

Parameters:
  • spectra (list of MaldiSpectrum) – Replicate spectra to merge.

  • method (str, default="mean") –

    Merging strategy:

    • "mean": arithmetic mean (or weighted mean if weights is provided).

    • "median": element-wise median (weights is ignored).

  • weights (array-like of float, optional) – Per-replicate weights for the "mean" method (e.g. SNR values). Ignored when method="median". Must have the same length as spectra.

Returns:

Merged spectrum with mass and intensity columns.

Return type:

pd.DataFrame

Raises:

ValueError – If spectra is empty, method is invalid, or weights length does not match spectra.
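
The merging strategies can be illustrated on plain intensity vectors. The sketch below assumes that all replicates already share the same m/z axis; the function name merge_intensities is hypothetical and the code does not touch MaldiSpectrum objects.

```python
import numpy as np


def merge_intensities(replicates, method="mean", weights=None):
    """Merge equally long intensity vectors into one consensus vector."""
    if len(replicates) == 0:
        raise ValueError("replicates must not be empty")
    stack = np.vstack(replicates).astype(float)
    if method == "mean":
        if weights is not None and len(weights) != stack.shape[0]:
            raise ValueError("weights must match the number of replicates")
        # np.average falls back to a plain mean when weights is None.
        return np.average(stack, axis=0, weights=weights)
    if method == "median":
        return np.median(stack, axis=0)  # weights are ignored here
    raise ValueError(f"unknown method: {method!r}")


replicates = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [100.0, 6.0, 7.0]]
mean_merged = merge_intensities(replicates)
median_merged = merge_intensities(replicates, method="median")
weighted = merge_intensities(replicates, weights=[1.0, 1.0, 0.0])
```

The median is less affected by the single aberrant value (100.0) than the mean, and a zero weight removes the third replicate from the weighted mean entirely.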

maldiamrkit.preprocessing.detect_outlier_replicates(spectra, threshold=3.0)[source]#

Identify outlier replicates using correlation with the median spectrum.

Computes the Pearson correlation of each replicate against the element-wise median spectrum. Replicates whose correlation falls below median(corrs) - threshold * MAD(corrs) are flagged as outliers.

Parameters:
  • spectra (list of MaldiSpectrum) – Replicate spectra.

  • threshold (float, default=3.0) – Number of MAD units below the median correlation to flag a replicate as an outlier.

Returns:

Boolean array of length len(spectra). True means the replicate is kept; False means it is an outlier.

Return type:

np.ndarray

Raises:

ValueError – If spectra has fewer than 3 elements (need at least 3 to estimate spread).
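
The rule above (correlate each replicate with the element-wise median, then flag correlations below median(corrs) - threshold * MAD(corrs)) can be sketched on plain intensity vectors. The function name keep_mask is hypothetical, the MAD is used unscaled, and all replicates are assumed to share one m/z axis; these are assumptions of the sketch rather than confirmed details of the library.

```python
import numpy as np


def keep_mask(replicates, threshold=3.0):
    """Return True for replicates to keep, False for flagged outliers."""
    stack = np.vstack(replicates).astype(float)
    if stack.shape[0] < 3:
        raise ValueError("need at least 3 replicates to estimate spread")
    median_spectrum = np.median(stack, axis=0)
    # Pearson correlation of every replicate with the median spectrum.
    corrs = np.array(
        [np.corrcoef(row, median_spectrum)[0, 1] for row in stack]
    )
    center = np.median(corrs)
    mad = np.median(np.abs(corrs - center))
    return corrs >= center - threshold * mad


base = np.array([1.0, 5.0, 2.0, 8.0, 3.0, 9.0, 4.0])
replicates = [
    base,
    base + np.array([0.1, -0.1, 0.2, 0.0, -0.2, 0.1, 0.0]),
    base + np.array([-0.2, 0.1, 0.0, 0.1, 0.1, -0.1, 0.2]),
    1.1 * base,
    10.0 - base,   # inverted profile, anti-correlated with the others
]
keep = keep_mask(replicates)
print(keep[-1])  # False: the inverted replicate is flagged
```

When the good replicates agree very closely, the MAD becomes tiny and the cut-off sits just below the median correlation, so a larger threshold may be needed to avoid flagging near-identical replicates.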

class maldiamrkit.preprocessing.MergingMethod(value)[source]#

Bases: str, Enum

Supported replicate merging methods.

Variables:
  • mean (str) – Arithmetic mean (optionally weighted).

  • median (str) – Element-wise median.

mean = 'mean'#
median = 'median'#