Core Module#
Core data structures for MALDI-TOF mass spectrometry analysis.
MaldiSpectrum#
- class maldiamrkit.MaldiSpectrum(source, *, pipeline=None, verbose=False)[source]#
Bases: object
A single MALDI-TOF spectrum.
Provides methods for loading, preprocessing, binning, and visualizing individual mass spectra.
- Parameters:
source (str, Path, or pd.DataFrame) – Source of the spectrum data. Can be a file path or a DataFrame with columns ‘mass’ and ‘intensity’.
pipeline (PreprocessingPipeline, optional) – Preprocessing pipeline. If None, uses the default pipeline.
verbose (bool, default=False) – If True, print progress messages.
- Variables:
path (Path or None) – Path to the source file, if loaded from file.
id (str) – Identifier for the spectrum (filename stem or ‘in-memory’).
pipeline (PreprocessingPipeline) – Preprocessing pipeline.
- Raises:
ValueError – If the source DataFrame is empty or missing required columns (‘mass’, ‘intensity’).
TypeError – If the ‘mass’ or ‘intensity’ columns are not numeric, or if source is not a supported type.
Examples
>>> spec = MaldiSpectrum("raw/abc.txt")
>>> spec.preprocess()
>>> spec.bin(3)
>>> from maldiamrkit.visualization import plot_spectrum
>>> plot_spectrum(spec)
- property bin_width: int | float | None#
Return the bin width used for binning, or None if not binned.
- property bin_metadata: DataFrame#
Return bin metadata with bin boundaries and widths.
- Returns:
DataFrame with columns: bin_index, bin_start, bin_end, bin_width.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If bin() has not been called.
- property preprocessed: DataFrame#
Return the preprocessed spectrum.
- Raises:
RuntimeError – If preprocess() has not been called.
- property binned: DataFrame#
Return the binned spectrum.
- Raises:
RuntimeError – If bin() has not been called.
- preprocess()[source]#
Run preprocessing pipeline on the raw spectrum.
- Returns:
Self, for method chaining.
- Return type:
MaldiSpectrum
- bin(bin_width=3, method=BinningMethod.uniform, custom_edges=None, **kwargs)[source]#
Bin the spectrum into m/z intervals.
Automatically calls preprocess() if not already done. Supports multiple binning strategies.
- Parameters:
bin_width (int or float, default=3) – Width of each bin in Daltons. For ‘uniform’, this is the fixed width. For ‘proportional’, this is the reference width at mz_min. Ignored for ‘adaptive’ and ‘custom’ methods.
method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
custom_edges (array-like, optional) – User-provided bin edges. Required if method=’custom’.
**kwargs (dict) – Additional parameters for specific methods:
- adaptive_min_width : float, default=1.0
- adaptive_max_width : float, default=10.0
- Returns:
Self, for method chaining.
- Return type:
MaldiSpectrum
Examples
>>> spec.bin(3)  # uniform binning
>>> spec.bin(3, method='proportional')
>>> spec.bin(method='adaptive', adaptive_min_width=1.0, adaptive_max_width=10.0)
>>> spec.bin(method='custom', custom_edges=[2000, 5000, 10000, 20000])
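The difference between the ‘uniform’ and ‘proportional’ strategies can be illustrated with a small standalone sketch. It does not call maldiamrkit: the helper names uniform_edges and proportional_edges are hypothetical, and the proportional rule shown (each bin's width scales with its starting m/z relative to mz_min) is only an assumption about what "reference width at mz_min" means, not the library's confirmed implementation.

```python
def uniform_edges(mz_min, mz_max, bin_width=3):
    """Fixed-width bin edges covering [mz_min, mz_max]."""
    edges = [mz_min]
    while edges[-1] < mz_max:
        edges.append(edges[-1] + bin_width)
    return edges


def proportional_edges(mz_min, mz_max, bin_width=3):
    """Bin edges whose width grows in proportion to m/z (assumed rule)."""
    edges = [mz_min]
    while edges[-1] < mz_max:
        # The width equals bin_width at mz_min and scales linearly with m/z.
        width = bin_width * edges[-1] / mz_min
        edges.append(edges[-1] + width)
    return edges


uniform = uniform_edges(2000, 2012, bin_width=3)
proportional = proportional_edges(2000, 20000, bin_width=3)
```

Under this assumed rule the first proportional bin is 3 Da wide, and every later bin is wider than the one before it, so the high-mass end of the spectrum is covered with fewer, broader bins.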
- save(path, *, stage='binned', fmt='csv')[source]#
Save spectrum data to a file.
- Parameters:
- Raises:
ValueError – If stage is not one of ‘raw’, ‘preprocessed’, or ‘binned’, or if fmt is not one of ‘csv’ or ‘txt’.
RuntimeError – If the requested stage has not been computed yet.
- Return type:
- get_data(prefer='preprocessed')[source]#
Return spectrum data, preferring the requested processing stage.
- Parameters:
prefer (str, default="preprocessed") – Preferred stage: "preprocessed" or "binned". Falls back to raw data if the requested stage has not been computed.
- Returns:
Copy of the spectrum data at the best available stage.
- Return type:
pd.DataFrame
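The fallback behaviour described for get_data() can be sketched without the library. The class StagedData below is a hypothetical stand-in that stores rows as plain (mass, intensity) tuples instead of a DataFrame; it only illustrates the "preferred stage, else raw, always a copy" logic and makes no claim about how MaldiSpectrum stores its data internally.

```python
class StagedData:
    """Hypothetical stand-in holding up to three processing stages."""

    def __init__(self, raw, preprocessed=None, binned=None):
        self._stages = {
            "raw": raw,
            "preprocessed": preprocessed,
            "binned": binned,
        }

    def get_data(self, prefer="preprocessed"):
        data = self._stages.get(prefer)
        if data is None:
            # Requested stage not computed yet: fall back to the raw data.
            data = self._stages["raw"]
        # Return a copy so callers cannot modify the stored rows.
        return list(data)


raw_only = StagedData(raw=[(2000.0, 1.0), (2003.0, 4.0)])
processed = StagedData(raw=[(2000.0, 1.0)], preprocessed=[(2000.0, 0.5)])

fallback = raw_only.get_data("binned")   # no binned stage, so raw rows
preferred = processed.get_data()         # preprocessed rows are available
```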
MaldiSet#
- class maldiamrkit.MaldiSet(spectra, meta, *, aggregate_by=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, verbose=False)[source]#
Bases: object
A collection of MALDI-TOF spectra with metadata.
Provides methods for loading multiple spectra from a directory, filtering by metadata, and generating feature matrices for ML.
- Parameters:
spectra (list of MaldiSpectrum) – List of spectrum objects.
meta (pd.DataFrame) – Metadata DataFrame with ‘ID’ column matching spectrum IDs.
aggregate_by (dict, optional) – Dictionary specifying aggregation columns:
‘antibiotics’: str or list of antibiotic column names
‘species’: str, species value to filter by (metadata must have a column named ‘Species’)
All metadata columns are retained regardless of aggregate_by. If None, all spectra are included without antibiotic/species filtering.
bin_width (int, default=3) – Bin width for spectra.
bin_method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
bin_kwargs (dict, optional) – Additional keyword arguments for binning (e.g., custom_edges, adaptive_min_width).
verbose (bool, default=False) – If True, print progress messages.
- Variables:
spectra (list of MaldiSpectrum) – The spectrum objects.
antibiotics (list of str or None) – Antibiotic column names.
species (str or None) – Species value to filter by.
meta (pd.DataFrame) – Metadata indexed by ID (all columns retained).
Examples
>>> ds = MaldiSet.from_directory(
...     "spectra/", "meta.csv",
...     aggregate_by=dict(
...         antibiotics=["Ceftriaxone", "Ceftazidime"],
...         species="Escherichia coli",
...     )
... )
>>> ds.X.shape, ds.y.shape
- __init__(spectra, meta, *, aggregate_by=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, verbose=False)[source]#
- classmethod from_directory(spectra_dir, meta_file, *, aggregate_by=None, pipeline=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, n_jobs=-1, verbose=False)[source]#
Load spectra from a directory and metadata from a CSV file.
Only spectrum files whose filename stem matches an ID in the metadata are loaded, avoiding unnecessary I/O and preprocessing.
- Parameters:
spectra_dir (str or Path) – Directory containing spectrum .txt files.
meta_file (str or Path) – Path to CSV file with metadata.
aggregate_by (dict, optional) – Dictionary specifying aggregation columns:
‘antibiotics’: str or list of antibiotic column names
‘species’: str, species value to filter by (metadata must have a column named ‘Species’)
All metadata columns are retained regardless of aggregate_by. If None, all spectra matching metadata are loaded without antibiotic/species filtering.
pipeline (PreprocessingPipeline, optional) – Preprocessing pipeline. If None, uses the default pipeline.
bin_width (int, default=3) – Bin width for spectra.
bin_method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
bin_kwargs (dict, optional) – Additional keyword arguments for binning.
n_jobs (int, default=-1) – Number of parallel jobs for loading spectra. Use -1 for all available cores, 1 for sequential processing.
verbose (bool, default=False) – If True, print progress messages.
- Returns:
Dataset with loaded spectra and metadata.
- Return type:
MaldiSet
Notes
Files are sorted alphabetically before loading to ensure reproducibility across runs with different parallelization settings.
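The selection rule described above (load only files whose stem matches a metadata ID, in alphabetical order) can be sketched with the standard library alone. The helper select_spectrum_files is hypothetical and is not part of maldiamrkit; it assumes spectra are stored as .txt files directly inside the directory.

```python
from pathlib import Path
import tempfile


def select_spectrum_files(spectra_dir, ids):
    """Return .txt files whose stem is in `ids`, sorted alphabetically."""
    wanted = set(ids)
    return sorted(
        p for p in Path(spectra_dir).glob("*.txt") if p.stem in wanted
    )


# Demonstrate on a throwaway directory with three spectra and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s3.txt", "s1.txt", "s2.txt", "notes.md"):
        (Path(tmp) / name).write_text("2000 1.0\n")
    selected = [
        p.name for p in select_spectrum_files(tmp, ["s3", "s1", "missing"])
    ]
```

Here s2.txt is skipped because its stem is not among the IDs, notes.md is skipped because it is not a .txt file, and the ID "missing" simply matches nothing.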
- property spectra_paths: dict[str, Path]#
Return mapping from spectrum ID to file path.
- Returns:
Dictionary mapping spectrum IDs to their file paths. Only includes spectra that were loaded from files.
- Return type:
dict[str, Path]
- property bin_metadata: DataFrame#
Return bin metadata with bin boundaries and widths.
- Returns:
DataFrame with columns: bin_index, bin_start, bin_end, bin_width.
- Return type:
pd.DataFrame
Notes
If spectra have been binned, returns metadata from the first spectrum. Otherwise, computes metadata based on stored binning parameters.
- property X: DataFrame#
Return feature matrix (n_samples, n_features).
- Returns:
Feature matrix with samples as rows and m/z bins as columns. Filtered to configured subset (antibiotics, species).
- Return type:
pd.DataFrame
- Raises:
ValueError – If no spectra match metadata IDs, or if no samples remain after filtering by species.
- property y: DataFrame#
Return label matrix for all specified antibiotics.
- Returns:
Label matrix with one column per antibiotic.
- Return type:
pd.DataFrame
- Raises:
ValueError – If no antibiotics specified or none found in metadata.
- filter(*filters)[source]#
Return a new MaldiSet keeping only samples that pass all filters.
Filters are applied to the metadata rows (indexed by spectrum ID). Multiple filters can be combined with logical operators.
- Parameters:
*filters (SpectrumFilter) – One or more filter objects. Use &, |, ~ to compose complex predicates before passing them in.
- Returns:
A new dataset containing only the matching spectra.
- Return type:
MaldiSet
Examples
>>> from maldiamrkit.filters import SpeciesFilter, QualityFilter
>>> ds.filter(SpeciesFilter("Escherichia coli"))
>>> ds.filter(SpeciesFilter("E. coli") & QualityFilter(min_snr=5.0))
- get_y_single(antibiotic=None)[source]#
Return labels for a single antibiotic.
- Parameters:
antibiotic (str, optional) – Antibiotic column name. If None, uses the first antibiotic.
- Returns:
Classification labels.
- Return type:
pd.Series
- Raises:
ValueError – If antibiotic not specified or not found.
- save_spectra(output_dir, *, stage='preprocessed', fmt='txt')[source]#
Save individual spectra to a directory.
- Parameters:
- Raises:
ValueError – If stage or fmt is invalid.
- Return type:
Examples
>>> data = MaldiSet.from_directory("spectra/", "metadata.csv")
>>> data.save_spectra("processed/", stage="preprocessed", fmt="txt")
MaldiSet.from_directory() supports parallel loading via the n_jobs parameter:
from maldiamrkit import MaldiSet
# Parallel loading (use all cores)
data = MaldiSet.from_directory(
    "spectra/",
    "metadata.csv",
    n_jobs=-1,
)
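How an n_jobs value might translate into workers, and why sorted input keeps results reproducible, can be sketched as follows. This is not the library's loader: resolve_n_jobs and load_all are hypothetical helpers, the thread pool is a standard-library substitute for whatever backend maldiamrkit actually uses, and rejecting values below 1 other than -1 is a simplification made for this sketch.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Map -1 to all available cores; otherwise use the given count."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if n_jobs < 1:
        # Simplification for this sketch, not documented library behaviour.
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def load_all(names, loader, n_jobs=-1):
    """Apply `loader` to each name, sequentially or in a thread pool."""
    workers = resolve_n_jobs(n_jobs)
    if workers == 1:
        return [loader(name) for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whichever worker
        # finishes first.
        return list(pool.map(loader, names))


names = sorted(["s2", "s1", "s3"])
sequential = load_all(names, str.upper, n_jobs=1)
parallel = load_all(names, str.upper, n_jobs=-1)
```

Because the input list is sorted first and the map preserves input order, the sequential and parallel runs return the same sequence.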
Filters#
Composable filter system for selecting spectra from a MaldiSet.
Filters can be combined with & (and), | (or), and ~ (invert).
- class maldiamrkit.filters.SpectrumFilter[source]#
Bases: ABC
Base filter with operator overloading.
Subclasses must implement __call__(), which receives a single row of the metadata DataFrame (as a pandas.Series) and returns True to keep the sample.
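One way such a composable base class could be built is sketched below. The names RowFilter, _Combined and ColumnEquals are hypothetical, and rows are plain dicts rather than pandas.Series; the sketch shows the general operator-overloading pattern only and is not taken from the maldiamrkit source.

```python
from abc import ABC, abstractmethod


class RowFilter(ABC):
    """Hypothetical stand-in for a composable filter base class."""

    @abstractmethod
    def __call__(self, row):
        """Return True to keep the sample described by `row`."""

    def __and__(self, other):
        return _Combined(lambda row: self(row) and other(row))

    def __or__(self, other):
        return _Combined(lambda row: self(row) or other(row))

    def __invert__(self):
        return _Combined(lambda row: not self(row))


class _Combined(RowFilter):
    """Filter wrapping an arbitrary predicate built by the operators."""

    def __init__(self, predicate):
        self._predicate = predicate

    def __call__(self, row):
        return bool(self._predicate(row))


class ColumnEquals(RowFilter):
    """Keep rows whose `column` equals `value`."""

    def __init__(self, column, value):
        self.column, self.value = column, value

    def __call__(self, row):
        return row.get(self.column) == self.value


rows = [
    {"Species": "Escherichia coli", "batch_id": "batch_1"},
    {"Species": "Escherichia coli", "batch_id": "batch_2"},
    {"Species": "Staphylococcus aureus", "batch_id": "batch_1"},
]
# ~ binds tighter than &, so this reads "E. coli AND NOT batch_2".
keep = ColumnEquals("Species", "Escherichia coli") & ~ColumnEquals("batch_id", "batch_2")
kept = [row for row in rows if keep(row)]
```

Each operator returns a new filter object, so composed filters can themselves be combined further.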
- class maldiamrkit.filters.SpeciesFilter(species, column='Species')[source]#
Bases: SpectrumFilter
Filter by species name(s).
- Parameters:
- class maldiamrkit.filters.QualityFilter(min_snr=None, min_peaks=None, max_baseline_fraction=None)[source]#
Bases: SpectrumFilter
Filter by quality metrics stored in metadata columns.
- Parameters:
- class maldiamrkit.filters.DrugFilter(drug, status=None)[source]#
Bases: SpectrumFilter
Filter by antibiotic resistance status.
- Parameters:
Examples
>>> DrugFilter("Ceftriaxone")  # has data for this drug
>>> DrugFilter("Ceftriaxone", status="R")  # resistant only
>>> DrugFilter("Ceftriaxone", status=["R", "I"])  # resistant or intermediate
- class maldiamrkit.filters.MetadataFilter(column, condition)[source]#
Bases: SpectrumFilter
Filter by arbitrary metadata column condition.
- Parameters:
column (str) – Metadata column name.
condition (callable) – Function that takes a single value and returns bool.
Examples
>>> MetadataFilter("batch_id", lambda v: v == "batch_1")
>>> MetadataFilter("age", lambda v: v >= 18)
- __call__(meta_row)[source]#
Apply the filter condition to a metadata row.
- Raises:
ValueError – If the condition callable raises an exception when applied to the column value.
- Parameters:
meta_row (Series)
- Return type:
bool
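The error behaviour documented for __call__ (a failing condition surfaces as ValueError) could look roughly like the sketch below. ColumnCondition is a hypothetical stand-in operating on a plain dict; the error message wording and the use of exception chaining are assumptions, not the library's actual code.

```python
class ColumnCondition:
    """Hypothetical stand-in: apply a callable to one metadata column."""

    def __init__(self, column, condition):
        self.column = column
        self.condition = condition

    def __call__(self, meta_row):
        value = meta_row[self.column]
        try:
            return bool(self.condition(value))
        except Exception as exc:
            # Re-raise as ValueError, keeping the original error as the cause.
            raise ValueError(
                f"condition failed for column {self.column!r} "
                f"with value {value!r}"
            ) from exc


adult = ColumnCondition("age", lambda v: v >= 18)
ok = adult({"age": 42})

try:
    adult({"age": "unknown"})  # comparing str with int raises TypeError
except ValueError as err:
    message = str(err)
    cause = err.__cause__
```

Chaining with "from exc" keeps the original TypeError reachable through __cause__, which helps when tracking down which metadata value broke the condition.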
Filter Example#
from maldiamrkit.filters import SpeciesFilter, DrugFilter, QualityFilter, MetadataFilter
# Single species
f = SpeciesFilter("Escherichia coli")
# Multiple species with quality threshold
f = SpeciesFilter(["E. coli", "K. pneumoniae"]) & QualityFilter(min_snr=5.0)
# Filter by antibiotic resistance status
f = SpeciesFilter("E. coli") & DrugFilter("Ceftriaxone", status="R")
# Negate a filter
f = ~SpeciesFilter("Staphylococcus aureus")
# Custom metadata condition
f = MetadataFilter("batch_id", lambda v: v == "batch_1")
# Apply to a MaldiSet
filtered_ds = ds.filter(f)