Core Module#

Core data structures for MALDI-TOF mass spectrometry analysis.

MaldiSpectrum#

class maldiamrkit.MaldiSpectrum(source, *, pipeline=None, verbose=False)[source]#

Bases: object

A single MALDI-TOF spectrum.

Provides methods for loading, preprocessing, binning, and visualizing individual mass spectra.

Parameters:

source (str, Path, or pd.DataFrame) – Source of the spectrum data. Can be a file path or a DataFrame with columns ‘mass’ and ‘intensity’.
pipeline (PreprocessingPipeline, optional) – Preprocessing pipeline. If None, uses the default pipeline.
verbose (bool, default=False) – If True, print progress messages.

Variables:

path (Path or None) – Path to the source file, if loaded from file.
id (str) – Identifier for the spectrum (filename stem or ‘in-memory’).
pipeline (PreprocessingPipeline) – Preprocessing pipeline.

Raises:

ValueError – If the source DataFrame is empty or missing required columns (‘mass’, ‘intensity’).
TypeError – If the ‘mass’ or ‘intensity’ columns are not numeric, or if source is not a supported type.

Examples

>>> spec = MaldiSpectrum("raw/abc.txt")
>>> spec.preprocess()
>>> spec.bin(3)
>>> from maldiamrkit.visualization import plot_spectrum
>>> plot_spectrum(spec)

__init__(source, *, pipeline=None, verbose=False)[source]#

Parameters:

source (str | Path | DataFrame)
pipeline (PreprocessingPipeline | None)
verbose (bool)

Return type:

None

property raw: DataFrame#

Return a copy of the raw spectrum data.

property bin_width: int | float | None#

Return the bin width used for binning, or None if not binned.

property bin_method: str | None#

Return the binning method used, or None if not binned.

property bin_metadata: DataFrame#

Return bin metadata with bin boundaries and widths.

Returns: DataFrame with columns: bin_index, bin_start, bin_end, bin_width.
Return type: pd.DataFrame
Raises: RuntimeError – If bin() has not been called.

property preprocessed: DataFrame#

Return the preprocessed spectrum.

Raises: RuntimeError – If preprocess() has not been called.

property binned: DataFrame#

Return the binned spectrum.

Raises: RuntimeError – If bin() has not been called.

preprocess()[source]#

Run preprocessing pipeline on the raw spectrum.

Returns: Self, for method chaining.
Return type: MaldiSpectrum

bin(bin_width=3, method=BinningMethod.uniform, custom_edges=None, **kwargs)[source]#

Bin the spectrum into m/z intervals.

Automatically calls preprocess() if not already done. Supports multiple binning strategies.

Parameters:

bin_width (int or float, default=3) – Width of each bin in Daltons. For ‘uniform’, this is the fixed width. For ‘proportional’, this is the reference width at mz_min. Ignored for ‘adaptive’ and ‘custom’ methods.
method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
custom_edges (array-like, optional) – User-provided bin edges. Required if method=‘custom’.
**kwargs (dict) –
Additional parameters for specific methods:
- adaptive_min_width : float, default=1.0
- adaptive_max_width : float, default=10.0

Returns:

Self, for method chaining.

Return type:

MaldiSpectrum

Examples

>>> spec.bin(3)  # uniform binning
>>> spec.bin(3, method='proportional')
>>> spec.bin(method='adaptive', adaptive_min_width=1.0, adaptive_max_width=10.0)
>>> spec.bin(method='custom', custom_edges=[2000, 5000, 10000, 20000])
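To make the difference between ‘uniform’ and ‘proportional’ binning concrete, the following is a minimal sketch of how bin edges could be generated. It is not the library's implementation: the helper names `uniform_edges` and `proportional_edges`, the 2000 to 20000 Da range, and the rule that a bin's width scales linearly with its left edge (so that it equals bin_width at mz_min) are all assumptions made for illustration.

```python
def uniform_edges(mz_min, mz_max, bin_width):
    """Fixed-width bin edges covering [mz_min, mz_max]."""
    edges = [float(mz_min)]
    while edges[-1] < mz_max:
        # Clamp the final edge so the last bin ends exactly at mz_max.
        edges.append(min(edges[-1] + bin_width, float(mz_max)))
    return edges


def proportional_edges(mz_min, mz_max, bin_width):
    """Bin edges whose width grows with m/z.

    Assumed rule (hypothetical): width = bin_width * left_edge / mz_min,
    so the first bin has exactly the reference width.
    """
    edges = [float(mz_min)]
    while edges[-1] < mz_max:
        width = bin_width * edges[-1] / mz_min
        edges.append(min(edges[-1] + width, float(mz_max)))
    return edges


uniform = uniform_edges(2000, 20000, 3)
proportional = proportional_edges(2000, 20000, 3)

print(len(uniform) - 1, "uniform bins")
print(len(proportional) - 1, "proportional bins")
```

Under these assumptions the proportional scheme yields far fewer bins than the uniform one, because bins widen toward the high-mass end of the spectrum.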

save(path, *, stage='binned', fmt='csv')[source]#

Save spectrum data to a file.

Parameters:

path (str or Path) – Output file path.
stage (str, default="binned") – Which processing stage to save. One of "raw", "preprocessed", "binned".
fmt (str, default="csv") – Output format. "csv" for comma-separated, "txt" for tab-separated.

Raises:

ValueError – If stage is not one of ‘raw’, ‘preprocessed’, or ‘binned’, or if fmt is not one of ‘csv’ or ‘txt’.
RuntimeError – If the requested stage has not been computed yet.

Return type:

None

get_data(prefer='preprocessed')[source]#

Return spectrum data, preferring the requested processing stage.

Parameters: prefer (str, default="preprocessed") – Preferred stage: "preprocessed" or "binned". Falls back to raw data if the requested stage has not been computed.
Returns: Copy of the spectrum data at the best available stage.
Return type: pd.DataFrame

property is_binned: bool#

Whether the spectrum has been binned.

property is_preprocessed: bool#

Whether the spectrum has been preprocessed.

property has_bin_metadata: bool#

Whether bin metadata is available (i.e. bin() has been called).

MaldiSet#

class maldiamrkit.MaldiSet(spectra, meta, *, aggregate_by=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, verbose=False, isolate_pattern=None, isolate_column=None)[source]#

Bases: object

A collection of MALDI-TOF spectra with metadata.

Provides methods for loading multiple spectra from a directory, filtering by metadata, and generating feature matrices for ML.

Parameters:

spectra (list of MaldiSpectrum) – List of spectrum objects.
meta (pd.DataFrame) – Metadata DataFrame with ‘ID’ column matching spectrum IDs.
aggregate_by (dict, optional) –
Dictionary specifying aggregation columns:
- ‘antibiotics’: str or list of antibiotic column names
- ‘species’: str, species value to filter by (metadata must have a column named ‘Species’)
All metadata columns are retained regardless of aggregate_by. If None, all spectra are included without antibiotic/species filtering.
bin_width (int, default=3) – Bin width for spectra.
bin_method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
bin_kwargs (dict, optional) – Additional keyword arguments for binning (e.g., custom_edges, adaptive_min_width).
verbose (bool, default=False) – If True, print progress messages.
isolate_pattern (str | re.Pattern | None)
isolate_column (str | None)

Variables:

spectra (list of MaldiSpectrum) – The spectrum objects.
antibiotics (list of str or None) – Antibiotic column names.
species (str or None) – Species value to filter by.
meta (pd.DataFrame) – Metadata indexed by ID (all columns retained).

Examples

>>> ds = MaldiSet.from_directory(
...     "spectra/", "meta.csv",
...     aggregate_by=dict(
...         antibiotics=["Ceftriaxone", "Ceftazidime"],
...         species="Escherichia coli",
...     )
... )
>>> ds.X.shape, ds.y.shape

__init__(spectra, meta, *, aggregate_by=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, verbose=False, isolate_pattern=None, isolate_column=None)[source]#

Parameters:

spectra (list[MaldiSpectrum])
meta (DataFrame)
aggregate_by (dict[str, str | list[str]] | None)
bin_width (int)
bin_method (str | BinningMethod)
bin_kwargs (dict | None)
verbose (bool)
isolate_pattern (str | Pattern | None)
isolate_column (str | None)

Return type:

None

classmethod from_directory(spectra_dir, meta_file, *, aggregate_by=None, pipeline=None, bin_width=3, bin_method=BinningMethod.uniform, bin_kwargs=None, n_jobs=-1, verbose=False)[source]#

Load spectra from a directory and metadata from a CSV file.

Only spectrum files whose filename stem matches an ID in the metadata are loaded, avoiding unnecessary I/O and preprocessing.

Parameters:

spectra_dir (str or Path) – Directory containing spectrum .txt files.
meta_file (str or Path) – Path to CSV file with metadata.
aggregate_by (dict, optional) –
Dictionary specifying aggregation columns:
- ‘antibiotics’: str or list of antibiotic column names
- ‘species’: str, species value to filter by (metadata must have a column named ‘Species’)
All metadata columns are retained regardless of aggregate_by. If None, all spectra matching metadata are loaded without antibiotic/species filtering.
pipeline (PreprocessingPipeline, optional) – Preprocessing pipeline. If None, uses the default pipeline.
bin_width (int, default=3) – Bin width for spectra.
bin_method (str, default='uniform') – Binning method. One of ‘uniform’, ‘proportional’, ‘adaptive’, ‘custom’.
bin_kwargs (dict, optional) – Additional keyword arguments for binning.
n_jobs (int, default=-1) – Number of parallel jobs for loading spectra. Use -1 for all available cores, 1 for sequential processing.
verbose (bool, default=False) – If True, print progress messages.

Returns:

Dataset with loaded spectra and metadata.

Return type:

MaldiSet

Notes

Files are sorted alphabetically before loading to ensure reproducibility across runs with different parallelization settings.
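The selection and ordering described above can be sketched with the standard library alone. This is an illustration rather than the library's code: `select_spectrum_files` is a hypothetical helper, and the file names and metadata IDs are made up.

```python
import tempfile
from pathlib import Path


def select_spectrum_files(spectra_dir, meta_ids):
    """Return .txt files whose stem is a known metadata ID, sorted by name."""
    wanted = set(meta_ids)
    matches = (p for p in Path(spectra_dir).glob("*.txt") if p.stem in wanted)
    # Sorting makes the load order independent of filesystem listing order
    # and of how work is later distributed across parallel jobs.
    return sorted(matches, key=lambda p: p.name)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical directory content: three spectra and one unrelated file.
    for name in ("s3.txt", "s1.txt", "s2.txt", "notes.md"):
        (Path(tmp) / name).write_text("mass\tintensity\n")

    selected = [p.name for p in select_spectrum_files(tmp, ["s3", "s1", "missing"])]

print(selected)
```

Files without a matching metadata ID (`s2.txt`) and IDs without a file (`missing`) are both skipped, so nothing is read that would later be discarded.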

property spectra_paths: dict[str, Path]#

Return mapping from spectrum ID to file path.

Returns: Dictionary mapping spectrum IDs to their file paths. Only includes spectra that were loaded from files.
Return type: dict

property bin_metadata: DataFrame#

Return bin metadata with bin boundaries and widths.

Returns: DataFrame with columns: bin_index, bin_start, bin_end, bin_width.
Return type: pd.DataFrame

Notes

If spectra have been binned, returns metadata from the first spectrum. Otherwise, computes metadata based on stored binning parameters.

property X: DataFrame#

Return feature matrix (n_samples, n_features).

Returns: Feature matrix with samples as rows and m/z bins as columns. Filtered to configured subset (antibiotics, species).
Return type: pd.DataFrame
Raises: ValueError – If no spectra match metadata IDs, or if no samples remain after filtering by species.

property y: DataFrame#

Return label matrix for all specified antibiotics.

Returns: Label matrix with one column per antibiotic.
Return type: pd.DataFrame
Raises: ValueError – If no antibiotics are specified or none are found in the metadata.

filter(*filters)[source]#

Return a new MaldiSet keeping only samples that pass all filters.

Filters are applied to the metadata rows (indexed by spectrum ID). Multiple filters can be combined with logical operators.

Parameters: *filters (SpectrumFilter) – One or more filter objects. Use &, |, ~ to compose complex predicates before passing them in.
Returns: A new dataset containing only the matching spectra.
Return type: MaldiSet

Examples

>>> from maldiamrkit.filters import SpeciesFilter, QualityFilter
>>> ds.filter(SpeciesFilter("Escherichia coli"))
>>> ds.filter(SpeciesFilter("E. coli") & QualityFilter(min_snr=5.0))

get_y_single(antibiotic=None)[source]#

Return labels for a single antibiotic.

Parameters: antibiotic (str, optional) – Antibiotic column name. If None, uses the first antibiotic.
Returns: Classification labels.
Return type: pd.Series
Raises: ValueError – If the antibiotic is not specified or not found.

isolate_ids(*, pattern=<object object>, column=<object object>)[source]#

Per-isolate group IDs aligned to the feature-matrix rows.

Technical replicates of one biological isolate share an underlying sample. Recovering a per-isolate label lets you pass groups= to a group-aware cross-validator so replicates of the same isolate never span a train/test split (replicate leakage).

The source of the label is resolved in this order:

1. an explicit column= argument (read from metadata), else
2. an explicit pattern= argument (regex suffix stripped off the ID), else
3. the isolate_column / replicate_pattern stamped by the source layout when the set was loaded (e.g. _MALDI<N> for DRIAMS), else
4. the DRIAMS _MALDI<N> suffix as a last-resort fallback.

If the resolved pattern strips nothing from any ID – every spectrum becomes its own group, giving no leakage protection – a warning is emitted pointing at pattern= / column=.

Parameters:

pattern (str or compiled regex, optional) – Replicate suffix to strip from each spectrum ID. Overrides the layout-stamped pattern; ignored if column is given.
column (str or None, optional) – Read isolate IDs from this metadata column instead of deriving them from the spectrum-ID index.

Returns:

Isolate IDs indexed by spectrum ID, aligned to self.X rows.

Return type:

pd.Series
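The pattern-based branch of this resolution can be sketched as follows. This is a stand-alone illustration, not the method itself: the function `strip_replicate_suffix`, the regex `_MALDI\d+$` (one plausible spelling of the _MALDI<N> suffix), and the sample IDs are assumptions.

```python
import re
import warnings


def strip_replicate_suffix(spectrum_ids, pattern=r"_MALDI\d+$"):
    """Map each spectrum ID to an isolate ID by removing a replicate suffix."""
    regex = re.compile(pattern)
    isolate = {sid: regex.sub("", sid) for sid in spectrum_ids}
    # If nothing was stripped, each spectrum forms its own group and
    # group-aware splitting gives no protection against replicate leakage.
    if all(group == sid for sid, group in isolate.items()):
        warnings.warn("pattern stripped nothing; every spectrum is its own group")
    return isolate


# Hypothetical spectrum IDs: two replicates of one isolate, one of another.
ids = ["iso01_MALDI1", "iso01_MALDI2", "iso02_MALDI1"]
isolate_by_id = strip_replicate_suffix(ids)
print(isolate_by_id)
```

Both replicates of `iso01` end up with the same label, which is what a group-aware cross-validator needs in order to keep them on the same side of a split.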

property groups: ndarray#

Isolate group labels (ndarray) aligned to X rows.

Convenience over isolate_ids() using the replicate pattern stamped by the source layout (or the DRIAMS fallback); pass directly as groups= to a group-aware splitter (e.g. StratifiedGroupKFold).
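To show why group labels matter, here is a deliberately simple group-aware holdout written in plain Python. It is a stand-in for a real splitter such as StratifiedGroupKFold, not a replacement: it does no stratification or shuffling, and `group_holdout` and the labels below are hypothetical.

```python
def group_holdout(groups, test_fraction=0.5):
    """Hold out whole groups until about test_fraction of samples is reached."""
    target = test_fraction * len(groups)
    test_groups, held_out = set(), 0
    for group in sorted(set(groups)):
        if held_out >= target:
            break
        test_groups.add(group)
        held_out += groups.count(group)
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# One label per feature-matrix row; replicates share their isolate's label.
groups = ["iso01", "iso01", "iso02", "iso03", "iso03", "iso03"]
train_idx, test_idx = group_holdout(groups)

train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
print(train_groups & test_groups)  # empty: no isolate spans both sides
```

Because rows are assigned by group rather than individually, replicates of one isolate never land on both sides of the split.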

to_csv(path)[source]#

Export the feature matrix to CSV.

Parameters: path (str or Path) – Output file path.
Return type: None

to_parquet(path)[source]#

Export the feature matrix to Parquet.

Parameters: path (str or Path) – Output file path.
Return type: None

save_spectra(output_dir, *, stage='preprocessed', fmt='txt')[source]#

Save individual spectra to a directory.

Parameters:

output_dir (str or Path) – Directory where spectra will be saved. Created if it does not exist.
stage (str, default="preprocessed") – Which processing stage to save. One of "raw", "preprocessed", "binned".
fmt (str, default="txt") – Output format. "csv" for comma-separated, "txt" for tab-separated.

Raises:

ValueError – If stage or fmt is invalid.

Return type:

None

Examples

>>> data = MaldiSet.from_directory("spectra/", "metadata.csv")
>>> data.save_spectra("processed/", stage="preprocessed", fmt="txt")

MaldiSet.from_directory() supports parallel loading via the n_jobs parameter:

from maldiamrkit import MaldiSet

# Parallel loading (use all cores)
data = MaldiSet.from_directory(
    "spectra/",
    "metadata.csv",
    n_jobs=-1
)

Filters#

Composable filter system for selecting spectra from a MaldiSet. Filters can be combined with & (and), | (or), and ~ (invert).

class maldiamrkit.filters.SpectrumFilter[source]#

Bases: ABC

Base filter with operator overloading.

Subclasses must implement __call__() which receives a single row of the metadata DataFrame (as a pandas.Series) and returns True to keep the sample.

abstractmethod __call__(meta_row)[source]#

Return True if the sample should be kept.

Parameters: meta_row (Series)
Return type: bool
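The composition behaviour (&, |, ~) can be obtained through operator overloading on an abstract base class. The sketch below shows one way to do it and does not reproduce the library's source: `RowFilter`, `_Combined`, and `ColumnEquals` are hypothetical stand-ins, and plain dicts take the place of pandas.Series rows so the example needs no third-party packages.

```python
from abc import ABC, abstractmethod


class RowFilter(ABC):
    """Stand-in for a composable filter (not the maldiamrkit class)."""

    @abstractmethod
    def __call__(self, meta_row) -> bool:
        """Return True if the sample should be kept."""

    def __and__(self, other):
        return _Combined(lambda row: self(row) and other(row))

    def __or__(self, other):
        return _Combined(lambda row: self(row) or other(row))

    def __invert__(self):
        return _Combined(lambda row: not self(row))


class _Combined(RowFilter):
    """Filter built from a predicate; produced by &, | and ~."""

    def __init__(self, predicate):
        self._predicate = predicate

    def __call__(self, meta_row) -> bool:
        return bool(self._predicate(meta_row))


class ColumnEquals(RowFilter):
    """Keep rows whose column equals a given value (hypothetical example)."""

    def __init__(self, column, value):
        self.column, self.value = column, value

    def __call__(self, meta_row) -> bool:
        return meta_row.get(self.column) == self.value


# Hypothetical metadata, keyed by spectrum ID.
rows = {
    "s1": {"Species": "Escherichia coli", "batch_id": "batch_1"},
    "s2": {"Species": "Escherichia coli", "batch_id": "batch_2"},
    "s3": {"Species": "Staphylococcus aureus", "batch_id": "batch_1"},
}

keep = ColumnEquals("Species", "Escherichia coli") & ~ColumnEquals("batch_id", "batch_2")
kept_ids = [sid for sid, row in rows.items() if keep(row)]
print(kept_ids)
```

Since ~ binds more tightly than & in Python, the expression reads as "species matches and batch is not batch_2" without extra parentheses.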

class maldiamrkit.filters.SpeciesFilter(species, column='Species')[source]#

Bases: SpectrumFilter

Filter by species name(s).

Parameters:

species (str or list of str) – Species name(s) to keep.
column (str, default="Species") – Metadata column containing species information.

__init__(species, column='Species')[source]#

Parameters:

species (Union[str, Sequence[str]])
column (str)

Return type:

None

__call__(meta_row)[source]#

Return True if the row’s species is in the filter set.

Parameters: meta_row (Series)
Return type: bool

class maldiamrkit.filters.QualityFilter(min_snr=None, min_peaks=None, max_baseline_fraction=None)[source]#

Bases: SpectrumFilter

Filter by quality metrics stored in metadata columns.

Parameters:

min_snr (float, optional) – Minimum signal-to-noise ratio (column snr).
min_peaks (int, optional) – Minimum number of detected peaks (column n_peaks).
max_baseline_fraction (float, optional) – Maximum fraction of intensity in the baseline (column baseline_fraction).

__init__(min_snr=None, min_peaks=None, max_baseline_fraction=None)[source]#

Parameters:

min_snr (float | None)
min_peaks (int | None)
max_baseline_fraction (float | None)

Return type:

None

__call__(meta_row)[source]#

Return True if the row passes all quality thresholds.

Parameters: meta_row (Series)
Return type: bool

class maldiamrkit.filters.DrugFilter(drug, status=None)[source]#

Bases: SpectrumFilter

Filter by antibiotic resistance status.

Parameters:

drug (str) – Antibiotic column name in metadata.
status (str or list of str, optional) – Keep only samples with this resistance status (e.g. "R", "S", "I"). If None, keeps any sample where the drug column is not null.

Examples

>>> DrugFilter("Ceftriaxone")                    # has data for this drug
>>> DrugFilter("Ceftriaxone", status="R")        # resistant only
>>> DrugFilter("Ceftriaxone", status=["R", "I"]) # resistant or intermediate

__init__(drug, status=None)[source]#

Parameters:

drug (str)
status (Union[str, Sequence[str], None])

Return type:

None

__call__(meta_row)[source]#

Return True if the sample matches the drug filter criteria.

Parameters: meta_row (Series)
Return type: bool

class maldiamrkit.filters.MetadataFilter(column, condition)[source]#

Bases: SpectrumFilter

Filter by arbitrary metadata column condition.

Parameters:

column (str) – Metadata column name.
condition (callable) – Function that takes a single value and returns bool.

Examples

>>> MetadataFilter("batch_id", lambda v: v == "batch_1")
>>> MetadataFilter("age", lambda v: v >= 18)

__init__(column, condition)[source]#

Parameters:

column (str)
condition (Callable[[Any], bool])

Return type:

None

__call__(meta_row)[source]#

Apply the filter condition to a metadata row.

Parameters: meta_row (Series)
Return type: bool
Raises: ValueError – If the condition callable raises an exception when applied to the column value.
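The error handling described for __call__() can be pictured as wrapping the user-supplied callable and re-raising any failure as a ValueError. The following is a sketch of that pattern under stated assumptions, not the library's code: `apply_condition` is a hypothetical helper, the error message wording is invented, and a dict stands in for the metadata row.

```python
def apply_condition(meta_row, column, condition):
    """Evaluate condition on one column value, reporting failures as ValueError."""
    value = meta_row[column]
    try:
        return bool(condition(value))
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise ValueError(
            f"condition failed for column {column!r} with value {value!r}"
        ) from exc


adult = apply_condition({"age": 30}, "age", lambda v: v >= 18)

# A missing value makes the comparison itself fail (None >= 18 is a TypeError).
try:
    apply_condition({"age": None}, "age", lambda v: v >= 18)
    error = None
except ValueError as exc:
    error = exc

print(adult, repr(error))
```

Conditions that may see missing values can guard against them explicitly, for example `lambda v: v is not None and v >= 18`.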

Filter Example#

from maldiamrkit.filters import SpeciesFilter, DrugFilter, QualityFilter, MetadataFilter

# Single species
f = SpeciesFilter("Escherichia coli")

# Multiple species with quality threshold
f = SpeciesFilter(["E. coli", "K. pneumoniae"]) & QualityFilter(min_snr=5.0)

# Filter by antibiotic resistance status
f = SpeciesFilter("E. coli") & DrugFilter("Ceftriaxone", status="R")

# Negate a filter
f = ~SpeciesFilter("Staphylococcus aureus")

# Custom metadata condition
f = MetadataFilter("batch_id", lambda v: v == "batch_1")

# Apply to a MaldiSet
filtered_ds = ds.filter(f)