Core Module#

Core data structures for MALDI-TOF mass spectrometry analysis.

MaldiSpectrum#

class maldiamrkit.MaldiSpectrum(source, *, pipeline=None, verbose=False)[source]#

Bases: object

A single MALDI-TOF spectrum.

Provides methods for loading, preprocessing, binning, and visualizing individual mass spectra.

Parameters:
  • source (str, Path, or pd.DataFrame) – Source of the spectrum data. Can be a file path or a DataFrame with columns ‘mass’ and ‘intensity’.

  • pipeline (PreprocessingPipeline, optional) – Preprocessing pipeline. If None, uses the default pipeline.

  • verbose (bool, default=False) – If True, print progress messages.

Variables:
  • path (Path or None) – Path to the source file, if loaded from file.

  • id (str) – Identifier for the spectrum (filename stem or ‘in-memory’).

  • pipeline (PreprocessingPipeline) – Preprocessing pipeline.

Raises:
  • ValueError – If the source DataFrame is empty or missing required columns (‘mass’, ‘intensity’).

  • TypeError – If the ‘mass’ or ‘intensity’ columns are not numeric, or if source is not a supported type.

Examples

>>> from maldiamrkit import MaldiSpectrum
>>> spec = MaldiSpectrum("raw/abc.txt")
>>> spec.preprocess()
>>> spec.bin(3)
>>> from maldiamrkit.visualization import plot_spectrum
>>> plot_spectrum(spec)

__init__(source, *, pipeline=None, verbose=False)[source]#

Return type:

None

property raw: DataFrame#

Return a copy of the raw spectrum data.

property bin_width: int | float | None#

Return the bin width used for binning, or None if not binned.

property bin_method: str | None#

Return the binning method used, or None if not binned.

property bin_metadata: DataFrame#

Return bin metadata with bin boundaries and widths.

Returns:

DataFrame with columns: bin_index, bin_start, bin_end, bin_width.

Return type:

pd.DataFrame

Raises:

RuntimeError – If bin() has not been called.
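
The columns listed above can be pictured with a small standalone sketch. It does not call maldiamrkit: the helper `bin_metadata_from_edges` is hypothetical and only mirrors the documented column names, assuming each bin is delimited by two consecutive edges.

```python
def bin_metadata_from_edges(edges):
    """Build one record per bin from a sorted list of bin edges.

    Hypothetical helper: it mirrors the documented columns
    (bin_index, bin_start, bin_end, bin_width), not the library's code.
    """
    rows = []
    for i, (start, end) in enumerate(zip(edges[:-1], edges[1:])):
        rows.append(
            {
                "bin_index": i,
                "bin_start": start,
                "bin_end": end,
                "bin_width": end - start,
            }
        )
    return rows


meta = bin_metadata_from_edges([2000, 5000, 10000, 20000])
for row in meta:
    print(row)
```

Four edges give three bins, so the table has one row fewer than the number of edges.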

property preprocessed: DataFrame#

Return the preprocessed spectrum.

Raises:

RuntimeError – If preprocess() has not been called.

property binned: DataFrame#

Return the binned spectrum.

Raises:

RuntimeError – If bin() has not been called.

preprocess()[source]#

Run preprocessing pipeline on the raw spectrum.

Returns:

Self, for method chaining.

Return type:

MaldiSpectrum
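
The two behaviours described so far, properties that raise RuntimeError before their stage exists and methods that return self for chaining, follow a common pattern. The toy class below is a hypothetical stand-in written for illustration, not the MaldiSpectrum implementation; its "preprocessing" step (scaling to a maximum of 1.0) is an arbitrary placeholder.

```python
class StagedSpectrum:
    """Toy stand-in for a spectrum with lazily computed stages (hypothetical)."""

    def __init__(self, raw):
        self._raw = list(raw)
        self._preprocessed = None

    @property
    def preprocessed(self):
        # Guard: refuse to return a stage that has not been computed yet.
        if self._preprocessed is None:
            raise RuntimeError("call preprocess() first")
        return list(self._preprocessed)

    def preprocess(self):
        # Placeholder step: scale intensities so the maximum is 1.0.
        peak = max(intensity for _, intensity in self._raw)
        self._preprocessed = [(mass, intensity / peak) for mass, intensity in self._raw]
        return self  # returning self is what enables chaining


spec = StagedSpectrum([(2000.0, 10.0), (2003.0, 40.0)])
try:
    spec.preprocessed
except RuntimeError as exc:
    print("guarded:", exc)

print(spec.preprocess().preprocessed)
```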

bin(bin_width=3, method=BinningMethod.uniform, custom_edges=None, **kwargs)[source]#

Bin the spectrum into m/z intervals.

Automatically calls preprocess() if not already done. Supports multiple binning strategies.

Parameters:
  • bin_width (int or float, default=3) – Width of each bin in Daltons. For ‘uniform’, this is the fixed width. For ‘proportional’, this is the reference width at mz_min. Ignored for ‘adaptive’ and ‘custom’ methods.

  • method (str or BinningMethod, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’. The signature default BinningMethod.uniform corresponds to ‘uniform’.

  • custom_edges (array-like, optional) – User-provided bin edges. Required if method=’custom’.

  • **kwargs (dict) –

    Additional parameters for specific methods:

    • adaptive_min_width : float, default=1.0

    • adaptive_max_width : float, default=10.0

Returns:

Self, for method chaining.

Return type:

MaldiSpectrum
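
To make the difference between ‘uniform’ and ‘proportional’ binning concrete, here is a standalone sketch that computes bin edges only. It is not the library's algorithm. In particular, the rule used for proportional bins (width at position m equals bin_width * m / mz_min) is an assumption chosen to match the statement that bin_width is the reference width at mz_min, and mz_max is a hypothetical upper bound.

```python
def uniform_edges(mz_min, mz_max, bin_width):
    """Edges spaced by a fixed width; the last edge is clipped to mz_max."""
    edges = [mz_min]
    while edges[-1] < mz_max:
        edges.append(min(edges[-1] + bin_width, mz_max))
    return edges


def proportional_edges(mz_min, mz_max, bin_width):
    """Edges whose spacing grows with m/z.

    ASSUMPTION: the width at position m is bin_width * m / mz_min, so the
    first bin has exactly bin_width. The library may use a different rule.
    """
    edges = [mz_min]
    while edges[-1] < mz_max:
        width = bin_width * edges[-1] / mz_min
        edges.append(min(edges[-1] + width, mz_max))
    return edges


print(uniform_edges(2000, 2010, 3))          # [2000, 2003, 2006, 2009, 2010]
print(proportional_edges(2000, 8000, 1000))  # [2000, 3000.0, 4500.0, 6750.0, 8000]
```

Under this assumption, proportional bins widen towards high m/z, so the same range is covered with fewer bins than a uniform grid of the same starting width.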

Examples

>>> spec.bin(3)  # uniform binning
>>> spec.bin(3, method='proportional')
>>> spec.bin(method='adaptive', adaptive_min_width=1.0, adaptive_max_width=10.0)
>>> spec.bin(method='custom', custom_edges=[2000, 5000, 10000, 20000])

save(path, *, stage='binned', fmt='csv')[source]#

Save spectrum data to a file.

Parameters:
  • path (str or Path) – Output file path.

  • stage (str, default="binned") – Which processing stage to save. One of "raw", "preprocessed", "binned".

  • fmt (str, default="csv") – Output format. "csv" for comma-separated, "txt" for tab-separated.

Raises:
  • ValueError – If stage is not one of ‘raw’, ‘preprocessed’, or ‘binned’, or if fmt is not one of ‘csv’ or ‘txt’.

  • RuntimeError – If the requested stage has not been computed yet.

Return type:

None
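
The validation and delimiter rules above can be sketched without the library. The function `write_spectrum` below is hypothetical, and the header row (`mass`, `intensity`) is an assumption based on the documented input columns. The stage argument is only validated here, because the toy function receives its rows directly.

```python
import csv
import io

DELIMITERS = {"csv": ",", "txt": "\t"}
STAGES = ("raw", "preprocessed", "binned")


def write_spectrum(rows, handle, *, stage="binned", fmt="csv"):
    """Write (mass, intensity) rows; hypothetical stand-in for save()."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    if fmt not in DELIMITERS:
        raise ValueError(f"fmt must be 'csv' or 'txt', got {fmt!r}")
    writer = csv.writer(handle, delimiter=DELIMITERS[fmt], lineterminator="\n")
    writer.writerow(["mass", "intensity"])  # assumed header
    writer.writerows(rows)


buf = io.StringIO()
write_spectrum([(2000, 0.5), (2003, 1.0)], buf, fmt="txt")
print(repr(buf.getvalue()))
```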

get_data(prefer='preprocessed')[source]#

Return spectrum data, preferring the requested processing stage.

Parameters:

prefer (str, default="preprocessed") – Preferred stage: "preprocessed" or "binned". Falls back to raw data if the requested stage has not been computed.

Returns:

Copy of the spectrum data at the best available stage.

Return type:

pd.DataFrame
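
The fallback described above can be illustrated with a short sketch. `pick_stage` is a hypothetical helper that works on plain lists, and it assumes the fallback goes straight to the raw data whenever the preferred stage is missing; the library's exact order is not stated here.

```python
def pick_stage(stages, prefer="preprocessed"):
    """Return (stage_name, copy_of_data) for the best available stage.

    `stages` maps stage names to data, with None for stages that have not
    been computed. ASSUMPTION: a missing preferred stage falls back
    directly to "raw".
    """
    data = stages.get(prefer)
    if data is None:
        return "raw", list(stages["raw"])
    return prefer, list(data)


stages = {"raw": [1.0, 2.0], "preprocessed": None, "binned": None}
print(pick_stage(stages))  # ('raw', [1.0, 2.0])

stages["preprocessed"] = [0.5, 1.0]
print(pick_stage(stages))  # ('preprocessed', [0.5, 1.0])
```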

property is_binned: bool#

Whether the spectrum has been binned.

property is_preprocessed: bool#

Whether the spectrum has been preprocessed.

property has_bin_metadata: bool#

Whether bin metadata is available (i.e. bin() has been called).

MaldiSet#

class maldiamrkit.MaldiSet(spectra, meta, *, aggregate_by=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, verbose=False)[source]#

Bases: object

A collection of MALDI-TOF spectra with metadata.

Provides methods for loading multiple spectra from a directory, filtering by metadata, and generating feature matrices for machine learning.

Parameters:
  • spectra (list of MaldiSpectrum) – List of spectrum objects.

  • meta (pd.DataFrame) – Metadata DataFrame with ‘ID’ column matching spectrum IDs.

  • aggregate_by (dict, optional) –

    Dictionary specifying aggregation columns:

    • ’antibiotics’: str or list of antibiotic column names

    • ’species’: str, species value to filter by (metadata must have a column named ‘Species’)

    All metadata columns are retained regardless of aggregate_by. If None, all spectra are included without antibiotic/species filtering.

  • bin_width (int, default=3) – Bin width for spectra.

  • bin_method (str or BinningMethod, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’. The signature default BinningMethod.uniform corresponds to ‘uniform’.

  • bin_kwargs (dict, optional) – Additional keyword arguments for binning (e.g., custom_edges, adaptive_min_width).

  • verbose (bool, default=False) – If True, print progress messages.

Variables:
  • spectra (list of MaldiSpectrum) – The spectrum objects.

  • antibiotics (list of str or None) – Antibiotic column names.

  • species (str or None) – Species value to filter by.

  • meta (pd.DataFrame) – Metadata indexed by ID (all columns retained).
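
The effect of aggregate_by can be pictured with plain dictionaries. The sketch below is not the MaldiSet implementation: `select_labels` is a hypothetical helper, and metadata rows are modelled as dicts keyed by spectrum ID instead of a DataFrame.

```python
def select_labels(meta, aggregate_by=None):
    """Filter rows by species and keep one label per requested antibiotic.

    Hypothetical helper. `meta` maps spectrum ID -> metadata row (a dict).
    """
    aggregate_by = aggregate_by or {}
    antibiotics = aggregate_by.get("antibiotics") or []
    if isinstance(antibiotics, str):  # a single name is allowed
        antibiotics = [antibiotics]
    species = aggregate_by.get("species")

    labels = {}
    for sample_id, row in meta.items():
        if species is not None and row.get("Species") != species:
            continue  # species filter uses the 'Species' column
        labels[sample_id] = {ab: row.get(ab) for ab in antibiotics}
    return labels


meta = {
    "s1": {"Species": "Escherichia coli", "Ceftriaxone": "R", "Ceftazidime": "S"},
    "s2": {"Species": "Klebsiella pneumoniae", "Ceftriaxone": "S", "Ceftazidime": "S"},
}
print(select_labels(meta, {"antibiotics": "Ceftriaxone", "species": "Escherichia coli"}))
```

With aggregate_by=None, every sample is kept and no antibiotic columns are selected.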

Examples

>>> from maldiamrkit import MaldiSet
>>> ds = MaldiSet.from_directory(
...     "spectra/", "meta.csv",
...     aggregate_by=dict(
...         antibiotics=["Ceftriaxone", "Ceftazidime"],
...         species="Escherichia coli",
...     )
... )
>>> ds.X.shape, ds.y.shape

__init__(spectra, meta, *, aggregate_by=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, verbose=False)[source]#

Return type:

None

classmethod from_directory(spectra_dir, meta_file, *, aggregate_by=None, pipeline=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, n_jobs=-1, verbose=False)[source]#

Load spectra from a directory and metadata from a CSV file.

Only spectrum files whose filename stem matches an ID in the metadata are loaded, avoiding unnecessary I/O and preprocessing.

Parameters:
  • spectra_dir (str or Path) – Directory containing spectrum .txt files.

  • meta_file (str or Path) – Path to CSV file with metadata.

  • aggregate_by (dict, optional) –

    Dictionary specifying aggregation columns:

    • ’antibiotics’: str or list of antibiotic column names

    • ’species’: str, species value to filter by (metadata must have a column named ‘Species’)

    All metadata columns are retained regardless of aggregate_by. If None, all spectra matching metadata are loaded without antibiotic/species filtering.

  • pipeline (PreprocessingPipeline, optional) – Preprocessing pipeline. If None, uses the default pipeline.

  • bin_width (int, default=3) – Bin width for spectra.

  • bin_method (str or BinningMethod, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’. The signature default BinningMethod.uniform corresponds to ‘uniform’.

  • bin_kwargs (dict, optional) – Additional keyword arguments for binning.

  • n_jobs (int, default=-1) – Number of parallel jobs for loading spectra. Use -1 for all available cores, 1 for sequential processing.

  • verbose (bool, default=False) – If True, print progress messages.

Returns:

Dataset with loaded spectra and metadata.

Return type:

MaldiSet

Notes

Files are sorted alphabetically before loading to ensure reproducibility across runs with different parallelization settings.
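
The selection and ordering rules above (only .txt files whose stem is a metadata ID, sorted alphabetically) can be sketched with the standard library. `select_spectrum_files` is a hypothetical helper, not part of maldiamrkit, and the file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def select_spectrum_files(spectra_dir, ids):
    """Return .txt files whose stem is a known ID, sorted by path (hypothetical)."""
    wanted = set(ids)
    return sorted(p for p in Path(spectra_dir).glob("*.txt") if p.stem in wanted)


with tempfile.TemporaryDirectory() as tmp:
    # Create a few dummy files; only s1 and s2 appear in the ID list.
    for name in ("s2.txt", "s1.txt", "unlisted.txt", "notes.md"):
        (Path(tmp) / name).write_text("2000\t1.0\n")
    selected = [p.name for p in select_spectrum_files(tmp, ["s1", "s2", "s3"])]

print(selected)  # ['s1.txt', 's2.txt']
```

Sorting before dispatching work means the resulting order does not depend on how many workers finish first.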

property spectra_paths: dict[str, Path]#

Return mapping from spectrum ID to file path.

Returns:

Dictionary mapping spectrum IDs to their file paths. Only includes spectra that were loaded from files.

Return type:

dict

property bin_metadata: DataFrame#

Return bin metadata with bin boundaries and widths.

Returns:

DataFrame with columns: bin_index, bin_start, bin_end, bin_width.

Return type:

pd.DataFrame

Notes

If spectra have been binned, returns metadata from the first spectrum. Otherwise, computes metadata based on stored binning parameters.

property X: DataFrame#

Return feature matrix (n_samples, n_features).

Returns:

Feature matrix with samples as rows and m/z bins as columns. Filtered to configured subset (antibiotics, species).

Return type:

pd.DataFrame

Raises:

ValueError – If no spectra match metadata IDs, or if no samples remain after filtering by species.

property y: DataFrame#

Return label matrix for all specified antibiotics.

Returns:

Label matrix with one column per antibiotic.

Return type:

pd.DataFrame

Raises:

ValueError – If no antibiotics are specified or none of them are found in the metadata.

filter(*filters)[source]#

Return a new MaldiSet keeping only samples that pass all filters.

Filters are applied to the metadata rows (indexed by spectrum ID). Multiple filters can be combined with logical operators.

Parameters:

*filters (SpectrumFilter) – One or more filter objects. Use &, |, ~ to compose complex predicates before passing them in.

Returns:

A new dataset containing only the matching spectra.

Return type:

MaldiSet

Examples

>>> from maldiamrkit.filters import SpeciesFilter, QualityFilter
>>> ds.filter(SpeciesFilter("Escherichia coli"))
>>> ds.filter(SpeciesFilter("E. coli") & QualityFilter(min_snr=5.0))

get_y_single(antibiotic=None)[source]#

Return labels for a single antibiotic.

Parameters:

antibiotic (str, optional) – Antibiotic column name. If None, uses the first antibiotic.

Returns:

Classification labels.

Return type:

pd.Series

Raises:

ValueError – If antibiotic not specified or not found.

to_csv(path)[source]#

Export the feature matrix to CSV.

Parameters:

path (str or Path) – Output file path.

Return type:

None

to_parquet(path)[source]#

Export the feature matrix to Parquet.

Parameters:

path (str or Path) – Output file path.

Return type:

None

save_spectra(output_dir, *, stage='preprocessed', fmt='txt')[source]#

Save individual spectra to a directory.

Parameters:
  • output_dir (str or Path) – Directory where spectra will be saved. Created if it does not exist.

  • stage (str, default="preprocessed") – Which processing stage to save. One of "raw", "preprocessed", "binned".

  • fmt (str, default="txt") – Output format. "csv" for comma-separated, "txt" for tab-separated.

Raises:

ValueError – If stage or fmt is invalid.

Return type:

None

Examples

>>> data = MaldiSet.from_directory("spectra/", "metadata.csv")
>>> data.save_spectra("processed/", stage="preprocessed", fmt="txt")

MaldiSet.from_directory() supports parallel loading via the n_jobs parameter:

from maldiamrkit import MaldiSet

# Parallel loading (use all cores)
data = MaldiSet.from_directory(
    "spectra/",
    "metadata.csv",
    n_jobs=-1
)

Filters#

Composable filter system for selecting spectra from a MaldiSet. Filters can be combined with & (and), | (or), and ~ (invert).

class maldiamrkit.filters.SpectrumFilter[source]#

Bases: ABC

Base filter with operator overloading.

Subclasses must implement __call__(), which receives a single row of the metadata DataFrame (as a pandas.Series) and returns True to keep the sample.

abstractmethod __call__(meta_row)[source]#

Return True if the sample should be kept.

Parameters:

meta_row (Series)

Return type:

bool
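
One common way to obtain composition with &, | and ~ is to overload the corresponding special methods on the base class. The sketch below shows that pattern in isolation. All class names (`RowFilter`, `_Combined`, `Equals`) are hypothetical, rows are plain dicts instead of pandas Series, and nothing here is taken from the maldiamrkit source.

```python
from abc import ABC, abstractmethod


class RowFilter(ABC):
    """Hypothetical base class with operator overloading."""

    @abstractmethod
    def __call__(self, meta_row):
        """Return True if the row should be kept."""

    def __and__(self, other):
        return _Combined(lambda row: self(row) and other(row))

    def __or__(self, other):
        return _Combined(lambda row: self(row) or other(row))

    def __invert__(self):
        return _Combined(lambda row: not self(row))


class _Combined(RowFilter):
    """Wraps a predicate built from other filters."""

    def __init__(self, func):
        self._func = func

    def __call__(self, meta_row):
        return bool(self._func(meta_row))


class Equals(RowFilter):
    """Keep rows where `column` equals `value`."""

    def __init__(self, column, value):
        self.column, self.value = column, value

    def __call__(self, meta_row):
        return meta_row.get(self.column) == self.value


# "E. coli samples that are not susceptible to Ceftriaxone"
f = Equals("Species", "Escherichia coli") & ~Equals("Ceftriaxone", "S")
print(f({"Species": "Escherichia coli", "Ceftriaxone": "R"}))  # True
print(f({"Species": "Escherichia coli", "Ceftriaxone": "S"}))  # False
```

Because ~ binds more tightly than & and |, negations need no extra parentheses; mixing & and | usually does.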

class maldiamrkit.filters.SpeciesFilter(species, column='Species')[source]#

Bases: SpectrumFilter

Filter by species name(s).

Parameters:
  • species (str or list of str) – Species name(s) to keep.

  • column (str, default="Species") – Metadata column containing species information.

__init__(species, column='Species')[source]#

Return type:

None

__call__(meta_row)[source]#

Return True if the row’s species is in the filter set.

Parameters:

meta_row (Series)

Return type:

bool

class maldiamrkit.filters.QualityFilter(min_snr=None, min_peaks=None, max_baseline_fraction=None)[source]#

Bases: SpectrumFilter

Filter by quality metrics stored in metadata columns.

Parameters:
  • min_snr (float, optional) – Minimum signal-to-noise ratio (column snr).

  • min_peaks (int, optional) – Minimum number of detected peaks (column n_peaks).

  • max_baseline_fraction (float, optional) – Maximum fraction of intensity in the baseline (column baseline_fraction).

__init__(min_snr=None, min_peaks=None, max_baseline_fraction=None)[source]#

Return type:

None

__call__(meta_row)[source]#

Return True if the row passes all quality thresholds.

Parameters:

meta_row (Series)

Return type:

bool
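
The "passes all thresholds" rule with optional limits can be sketched as follows. `passes_quality` is a hypothetical function operating on a dict row. It uses the documented column names (snr, n_peaks, baseline_fraction) and assumes those columns are present whenever the matching threshold is set.

```python
def passes_quality(row, min_snr=None, min_peaks=None, max_baseline_fraction=None):
    """Return True if the row meets every threshold that is not None."""
    if min_snr is not None and row["snr"] < min_snr:
        return False
    if min_peaks is not None and row["n_peaks"] < min_peaks:
        return False
    if (
        max_baseline_fraction is not None
        and row["baseline_fraction"] > max_baseline_fraction
    ):
        return False
    return True


row = {"snr": 7.5, "n_peaks": 42, "baseline_fraction": 0.3}
print(passes_quality(row, min_snr=5.0))                             # True
print(passes_quality(row, min_snr=5.0, max_baseline_fraction=0.2))  # False
```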

class maldiamrkit.filters.DrugFilter(drug, status=None)[source]#

Bases: SpectrumFilter

Filter by antibiotic resistance status.

Parameters:
  • drug (str) – Antibiotic column name in metadata.

  • status (str or list of str, optional) – Keep only samples with this resistance status (e.g. "R", "S", "I"). If None, keeps any sample where the drug column is not null.

Examples

>>> DrugFilter("Ceftriaxone")                    # has data for this drug
>>> DrugFilter("Ceftriaxone", status="R")        # resistant only
>>> DrugFilter("Ceftriaxone", status=["R", "I"]) # resistant or intermediate

__init__(drug, status=None)[source]#

Return type:

None

__call__(meta_row)[source]#

Return True if the sample matches the drug filter criteria.

Parameters:

meta_row (Series)

Return type:

bool
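
The matching rule described for DrugFilter (any non-null value when status is None, otherwise membership in the given status or list of statuses) can be sketched with plain Python. `drug_matches` and `is_null` are hypothetical helpers, and treating None and NaN as null is an assumption about how missing labels are stored.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (assumption)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drug_matches(row, drug, status=None):
    """Return True if the row has a usable label for `drug` matching `status`."""
    value = row.get(drug)
    if is_null(value):
        return False
    if status is None:
        return True  # any recorded result is enough
    allowed = {status} if isinstance(status, str) else set(status)
    return value in allowed


row = {"Ceftriaxone": "I", "Ceftazidime": float("nan")}
print(drug_matches(row, "Ceftriaxone"))                     # True
print(drug_matches(row, "Ceftriaxone", status="R"))         # False
print(drug_matches(row, "Ceftriaxone", status=["R", "I"]))  # True
print(drug_matches(row, "Ceftazidime"))                     # False
```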

class maldiamrkit.filters.MetadataFilter(column, condition)[source]#

Bases: SpectrumFilter

Filter by an arbitrary condition on a metadata column.

Parameters:
  • column (str) – Metadata column name.

  • condition (callable) – Function that takes a single value and returns bool.

Examples

>>> MetadataFilter("batch_id", lambda v: v == "batch_1")
>>> MetadataFilter("age", lambda v: v >= 18)

__init__(column, condition)[source]#

Return type:

None

__call__(meta_row)[source]#

Apply the filter condition to a metadata row.

Raises:

ValueError – If the condition callable raises an exception when applied to the column value.

Parameters:

meta_row (Series)

Return type:

bool
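
The error behaviour above, where a failure inside the user-supplied condition surfaces as a ValueError, is typically achieved by wrapping the call. The function `apply_condition` below is a hypothetical sketch of that pattern on a dict row; the message text is invented for illustration.

```python
def apply_condition(row, column, condition):
    """Apply `condition` to row[column], converting failures to ValueError."""
    value = row[column]  # lookup stays outside the try: only the callable is wrapped
    try:
        return bool(condition(value))
    except Exception as exc:
        raise ValueError(
            f"condition failed for column {column!r} with value {value!r}"
        ) from exc


print(apply_condition({"age": 20}, "age", lambda v: v >= 18))  # True

try:
    # Comparing None with an int raises TypeError inside the condition.
    apply_condition({"age": None}, "age", lambda v: v >= 18)
except ValueError as exc:
    print("wrapped:", exc)
```

Chaining with `from exc` keeps the original exception available for debugging.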

Filter Example#

from maldiamrkit.filters import SpeciesFilter, DrugFilter, QualityFilter, MetadataFilter

# Single species
f = SpeciesFilter("Escherichia coli")

# Multiple species with quality threshold
f = SpeciesFilter(["E. coli", "K. pneumoniae"]) & QualityFilter(min_snr=5.0)

# Filter by antibiotic resistance status
f = SpeciesFilter("E. coli") & DrugFilter("Ceftriaxone", status="R")

# Negate a filter
f = ~SpeciesFilter("Staphylococcus aureus")

# Custom metadata condition
f = MetadataFilter("batch_id", lambda v: v == "batch_1")

# Apply to a MaldiSet (ds is an already loaded MaldiSet)
filtered_ds = ds.filter(f)