Dataset Builder & Loader

Build and load datasets using pluggable layout adapters.

DatasetBuilder

class maldiamrkit.data.DatasetBuilder(layout, output_dir, *, name=None, id_column='code', pipeline=None, bin_width=3, extra_handlers=None, metadata_dir='id', metadata_suffix='_clean.csv', n_jobs=-1, on_error='warn')[source]#

Build a standardised dataset from any supported input layout.

Parameters:
  • layout (InputLayout) – Input data layout adapter.

  • output_dir (str or Path) – Root directory for the standardised output.

  • name (str or None) – Dataset name (defaults to output_dir name).

  • id_column (str, default="code") – Column name for the spectrum identifier in the output metadata.

  • pipeline (PreprocessingPipeline or None) – Preprocessing pipeline. None uses the default.

  • bin_width (int or float, default=3) – Bin width in Daltons for the default binned output.

  • extra_handlers (list of ProcessingHandler or None) – Additional processing outputs.

  • metadata_dir (str, default="id") – Subdirectory name for metadata CSV output.

  • metadata_suffix (str, default="_clean.csv") – Filename suffix for metadata CSV output.

  • n_jobs (int, default=-1) – Number of parallel workers.

  • on_error (str, default="warn") – Error handling: "warn", "raise", or "skip".
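The three on_error values can be pictured with a small stand-alone loop. This is only a sketch: the docs above do not spell out the semantics, so it assumes that "raise" propagates the exception, "warn" emits a warning and continues, and "skip" continues silently. The names process_all and parse_intensity are hypothetical and not part of maldiamrkit.

```python
import warnings


def process_all(items, process_one, on_error="warn"):
    """Apply process_one to each (id, payload) pair under an error policy.

    Assumed semantics (not confirmed by the reference above):
    "raise" re-raises, "warn" warns and continues, "skip" continues silently.
    """
    if on_error not in {"warn", "raise", "skip"}:
        raise ValueError(f"unknown on_error: {on_error!r}")
    succeeded, failed_ids = 0, []
    for item_id, payload in items:
        try:
            process_one(payload)
            succeeded += 1
        except Exception as exc:
            if on_error == "raise":
                raise
            if on_error == "warn":
                warnings.warn(f"{item_id}: {exc}")
            failed_ids.append(item_id)
    # Counts shaped like the BuildReport fields documented on this page
    return {
        "total": succeeded + len(failed_ids),
        "succeeded": succeeded,
        "failed": len(failed_ids),
        "failed_ids": failed_ids,
    }


def parse_intensity(text):
    # Hypothetical per-item step that fails on malformed input
    return float(text)


items = [("s1", "1.5"), ("s2", "not-a-number"), ("s3", "2.0")]
summary = process_all(items, parse_intensity, on_error="skip")
```

Under these assumptions, summary records one failure for "s2" while the other two items are still processed.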

Examples

>>> from maldiamrkit.data import DatasetBuilder, FlatLayout
>>> layout = FlatLayout("spectra/", "meta.csv")
>>> builder = DatasetBuilder(layout, "output/")
>>> report = builder.build()

__init__(layout, output_dir, *, name=None, id_column='code', pipeline=None, bin_width=3, extra_handlers=None, metadata_dir='id', metadata_suffix='_clean.csv', n_jobs=-1, on_error='warn')[source]#
Return type:

None

build()[source]#

Execute the build pipeline.

Returns:

Summary of the build.

Return type:

BuildReport

DatasetLoader

class maldiamrkit.data.DatasetLoader(layout, *, stage=None, n_jobs=-1, verbose=False)[source]#

Load a dataset into a MaldiSet.

Parameters:
  • layout (DatasetLayout) – Dataset navigation adapter (e.g. DRIAMSLayout or MARISMaLayout).

  • stage (str or None) – Processing stage to load. None triggers auto-detection via the layout.

  • n_jobs (int, default=-1) – Number of parallel workers for spectrum loading.

  • verbose (bool, default=False) – If True, show tqdm progress bars during spectrum loading and pass verbose through to MaldiSet.

Examples

>>> from maldiamrkit.data import DatasetLoader, DRIAMSLayout
>>> layout = DRIAMSLayout("output/my_dataset")
>>> loader = DatasetLoader(layout)
>>> ds = loader.load(aggregate_by=dict(antibiotics="Ceftriaxone"))

__init__(layout, *, stage=None, n_jobs=-1, verbose=False)[source]#
Return type:

None

load(aggregate_by=None)[source]#

Load the dataset.

Parameters:

aggregate_by (dict, optional) – Passed through to MaldiSet.

Returns:

Dataset with loaded spectra and metadata.

Return type:

MaldiSet

Input Layouts

class maldiamrkit.data.InputLayout[source]#

Abstract adapter for discovering spectra and metadata.

abstractmethod discover_spectra()[source]#

Return paths to all spectrum sources (files or directories).

Return type:

list[Path]

abstractmethod discover_metadata()[source]#

Return the metadata as a DataFrame with an 'ID' column.

Return type:

DataFrame

abstractmethod get_id(spectrum_path)[source]#

Extract the spectrum identifier from a path.

Parameters:

spectrum_path (Path)

Return type:

str

abstractmethod get_year(spectrum_id)[source]#

Return the year for a spectrum, or None.

Parameters:

spectrum_id (str)

Return type:

str | None
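To make the four-method contract concrete, the sketch below defines a local stand-in for the abstract class and a toy adapter on top of it. It does not import maldiamrkit: InputLayoutSketch and ListLayout are hypothetical names, and a list of dicts stands in for the DataFrame that discover_metadata() is documented to return.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class InputLayoutSketch(ABC):
    """Local stand-in mirroring the four documented abstract methods."""

    @abstractmethod
    def discover_spectra(self): ...

    @abstractmethod
    def discover_metadata(self): ...

    @abstractmethod
    def get_id(self, spectrum_path): ...

    @abstractmethod
    def get_year(self, spectrum_id): ...


class ListLayout(InputLayoutSketch):
    """Hypothetical adapter over an explicit list of paths and a year lookup."""

    def __init__(self, paths, years):
        self._paths = [Path(p) for p in paths]
        self._years = dict(years)

    def discover_spectra(self):
        return list(self._paths)

    def discover_metadata(self):
        # Rows of dicts stand in for a DataFrame with an 'ID' column
        return [{"ID": self.get_id(p)} for p in self._paths]

    def get_id(self, spectrum_path):
        # Same convention FlatLayout documents: the filename stem is the ID
        return Path(spectrum_path).stem

    def get_year(self, spectrum_id):
        # None when no year is known for this spectrum
        return self._years.get(spectrum_id)


layout = ListLayout(["2019/a1.txt", "b2.txt"], {"a1": "2019"})
ids = [layout.get_id(p) for p in layout.discover_spectra()]
```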

class maldiamrkit.data.FlatLayout(spectra_dir, metadata_csv, *, id_column='ID', year_column=None)[source]#

Flat directory of pre-exported text spectrum files + metadata CSV.

Suitable for datasets where spectra are already exported as text files.

Parameters:
  • spectra_dir (str or Path) – Directory containing spectrum text files (flat or with year subfolders).

  • metadata_csv (str or Path) – CSV with an ID column, species, and antibiotic columns.

  • id_column (str, default="ID") – Column name for the spectrum identifier in the metadata.

  • year_column (str or None) – Column to extract year from, or None for flat layout.

__init__(spectra_dir, metadata_csv, *, id_column='ID', year_column=None)[source]#
Return type:

None

discover_spectra()[source]#

Glob for .txt files, flat or with year subfolders.

Return type:

list[Path]
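A rough picture of "flat or with year subfolders" is a pair of glob patterns. The sketch below assumes exactly one optional level of subfolders and is not the library's implementation; find_text_spectra is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_text_spectra(spectra_dir):
    """Collect .txt files directly in spectra_dir or one folder below it."""
    root = Path(spectra_dir)
    flat = root.glob("*.txt")      # spectra_dir/a.txt
    nested = root.glob("*/*.txt")  # spectra_dir/2019/b.txt
    return sorted(p for p in (*flat, *nested) if p.is_file())


# Build a throwaway directory with one flat and one nested spectrum file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2019").mkdir()
    (root / "a.txt").write_text("2000 1.0\n")
    (root / "2019" / "b.txt").write_text("2000 2.0\n")
    (root / "notes.csv").write_text("ignored\n")
    found = [p.relative_to(root).as_posix() for p in find_text_spectra(root)]
```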

discover_metadata()[source]#

Read metadata CSV and normalise the ID column.

Return type:

DataFrame

get_id(spectrum_path)[source]#

Filename stem is the spectrum ID.

Parameters:

spectrum_path (Path)

Return type:

str

get_year(spectrum_id)[source]#

Year from the metadata column, or None.

Parameters:

spectrum_id (str)

Return type:

str | None

class maldiamrkit.data.BrukerTreeLayout(root_dir, metadata_csv, *, id_column='Identifier', year_column='Year', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, validate=True)[source]#

Hierarchical directory tree containing raw Bruker binary data.

Suitable for datasets where spectra are stored as Bruker fid/acqus binaries in a hierarchical directory tree. The metadata CSV must contain a column with relative paths pointing to the Bruker data directories.

Parameters:
  • root_dir (str or Path) – Root directory of the dataset.

  • metadata_csv (str or Path) – Metadata CSV with columns for identifier, path to Bruker data, and (optionally) year and target position.

  • id_column (str, default="Identifier") – Column for specimen identifier.

  • year_column (str, default="Year") – Column for year.

  • path_column (str, default="Path") – Column with relative path to the Bruker directory.

  • target_position_column (str, default="target_position") – Column for the plate target position.

  • duplicate_strategy (str or DuplicateStrategy, default="first") –

    How to handle duplicate specimen identifiers (e.g. the same sample measured at multiple MALDI target positions):

    • "first" – keep the first occurrence (default).

    • "last" – keep the last occurrence.

    • "drop" – remove all duplicates.

    • "keep_all" – keep every replicate, appending the target-position value to the ID ({identifier}_{target_position}).

    • "average" – tag replicates for downstream averaging (adds _original_id column).

  • validate (bool, default=True) – If True, skip empty spectra (all-zero fid) and warn on duplicate spectra (SHA256 hash matching).
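The two checks described for validate can be sketched on raw byte payloads. This is an illustration only: how the library reads the fid file and which bytes it hashes are not stated here, so validate_payloads is a hypothetical helper working on an in-memory dict.

```python
import hashlib


def validate_payloads(payloads):
    """Split payloads (id -> bytes) into kept IDs, empty IDs and duplicate pairs."""
    seen = {}  # SHA256 digest -> first ID that produced it
    kept, empty, duplicates = [], [], []
    for spec_id, raw in payloads.items():
        if not any(raw):
            # Every byte is zero (or there are no bytes): treat as empty and skip
            empty.append(spec_id)
            continue
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:
            # Duplicates are only reported here, mirroring "warn on duplicate spectra"
            duplicates.append((seen[digest], spec_id))
        else:
            seen[digest] = spec_id
        kept.append(spec_id)
    return kept, empty, duplicates


payloads = {"a": b"\x01\x02", "b": bytes(4), "c": b"\x01\x02"}
kept, empty, duplicates = validate_payloads(payloads)
```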

__init__(root_dir, metadata_csv, *, id_column='Identifier', year_column='Year', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, validate=True)[source]#
Return type:

None

discover_spectra()[source]#

Resolve Bruker directories from metadata paths.

Applies duplicate_strategy to handle specimens that appear at multiple target positions. Optionally validates for empty and duplicate spectra.

Return type:

list[Path]

discover_metadata()[source]#

Read the metadata CSV and normalise the ID column.

Return type:

DataFrame

get_id(spectrum_path)[source]#

Look up ID from the path mapping built during discovery.

Parameters:

spectrum_path (Path)

Return type:

str

get_year(spectrum_id)[source]#

Year from the metadata.

Parameters:

spectrum_id (str)

Return type:

str | None

Dataset Layouts

class maldiamrkit.data.DatasetLayout[source]#

Abstract adapter for navigating and loading from a dataset.

abstractmethod discover_metadata()[source]#

Load the metadata and return a DataFrame with an 'ID' column.

Return type:

DataFrame

abstractmethod collect_spectrum_files(stage, year)[source]#

Return paths to spectrum files for the given stage/year.

Return type:

list[Path]

abstractmethod detect_stage()[source]#

Auto-detect the best available processing stage.

Return type:

str

postprocess_spectrum(spec, *, stage=None)[source]#

Apply dataset-specific fix-ups to a freshly-loaded spectrum.

Default is a no-op. Layouts whose on-disk format deviates from the (mass, intensity) convention assumed by read_spectrum() can override this to reshape the spectrum. Called by DatasetLoader after each file is loaded.

Return type:

MaldiSpectrum

class maldiamrkit.data.DRIAMSLayout(dataset_dir, *, id_column=<auto>, species_column=None, year=None, metadata_dir=<auto>, metadata_suffix=<auto>, spectrum_ext=<auto>, duplicate_strategy=DuplicateStrategy.first, id_transform=None, mz_min=<auto>, mz_max=<auto>, normalize_tic=False)[source]#

Navigate a DRIAMS-like dataset structure.

Works with both the output of DatasetBuilder and the original DRIAMS-A/B/C/D datasets.

Parameters:
  • dataset_dir (str or Path) – Root of the dataset.

  • id_column (str or None) – Metadata column for spectrum IDs. None triggers auto-detection ('code' > 'ID' > first column).

  • species_column (str or None) – Metadata column for species names. None triggers auto-detection (case-insensitive match for 'species'). The column is renamed to 'Species' for downstream use.

  • year (str, int, or None) – Restrict to a single year.

  • metadata_dir (str, default="id") – Subdirectory name containing metadata CSV files.

  • metadata_suffix (str, default="_clean.csv") – Filename suffix for metadata CSV files.

  • spectrum_ext (str, default=".txt") – File extension for spectrum files (including the dot).

  • duplicate_strategy (str or DuplicateStrategy, default="first") –

    How to handle duplicate spectrum IDs (e.g. the same sample appearing in multiple year subdirectories):

    • "first" – keep the first occurrence (default).

    • "last" – keep the last occurrence.

    • "drop" – remove all duplicates.

    • "keep_all" – keep every replicate with _repN suffixes.

    • "average" – tag replicates for downstream averaging.

  • id_transform (callable, optional) –

    Function mapping raw ID strings to a canonical sample identifier. When set, duplicates are detected on the transformed identifier rather than the raw one – so technical-replicate files that share an underlying sample (e.g. DRIAMS UUID_MALDI1 / UUID_MALDI2) are recognized as duplicates by duplicate_strategy. The raw ID column is preserved for spectrum-file matching; only deduplication uses the transformed key. Typical DRIAMS usage:

    import re
    DRIAMSLayout(
        ...,
        id_transform=lambda s: re.sub(r"_MALDI\d+$", "", s),
        duplicate_strategy="first",   # or "average"
    )
    

    Leaving this at None preserves the legacy behaviour (each replicate counted as a distinct row). A one-time warning is emitted when _MALDI<N>-suffixed IDs are detected and id_transform is None, pointing at this kwarg; the warning can be silenced by passing id_transform=str if the per-replicate semantics are intentional.

  • mz_min (float, default=2000.0) – Lower m/z edge to assign to bin index 0 when a binned_N/ stage is loaded. Only consulted by postprocess_spectrum().

  • mz_max (float, default=19997.0) – Upper m/z edge assigned to bin index N-1.

  • normalize_tic (bool, default=False) – When True, re-apply TIC normalization (intensity <- intensity / sum(intensity)) to every loaded spectrum in postprocess_spectrum(). This is useful because the published DRIAMS / MS-UMG binned_6000/ files do not sum to 1.0 on disk (empirically ~1.29 and ~1.36 respectively), even though the DRIAMS preprocessing script calls calibrateIntensity(method="TIC") before trimming. The cause lies somewhere in the upstream pipeline (the MALDIquant version or an implicit scaling step) and has not been reproduced here. Enabling this kwarg gives sum=1.0 per spectrum, aligning DRIAMS / MS-UMG with flat-text datasets whose preprocessing pipeline already produces TIC=1.
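The id_transform behaviour described in the parameter list (deduplicate on a canonical key, keep the raw ID) can be sketched without the library. The helper names below are hypothetical, and only the "first" strategy is shown.

```python
import re


def strip_maldi_suffix(raw_id):
    """Map e.g. 'abc_MALDI2' to the canonical sample key 'abc'."""
    return re.sub(r"_MALDI\d+$", "", raw_id)


def dedupe_first(raw_ids, id_transform=None):
    """Keep the first raw ID seen for each canonical key."""
    seen, kept = set(), []
    for raw in raw_ids:
        key = id_transform(raw) if id_transform else raw
        if key not in seen:
            seen.add(key)
            kept.append(raw)  # the raw ID is kept so it still matches the file name
    return kept


raw_ids = ["abc_MALDI1", "abc_MALDI2", "def_MALDI1"]
without_transform = dedupe_first(raw_ids)
with_transform = dedupe_first(raw_ids, id_transform=strip_maldi_suffix)
```

Without a transform all three raw IDs are distinct, so nothing is dropped. With the transform, the two abc replicates share a key and only the first survives.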

__init__(dataset_dir, *, id_column=<auto>, species_column=None, year=None, metadata_dir=<auto>, metadata_suffix=<auto>, spectrum_ext=<auto>, duplicate_strategy=DuplicateStrategy.first, id_transform=None, mz_min=<auto>, mz_max=<auto>, normalize_tic=False)[source]#

Initialise the layout.

Several kwargs accept the sentinel _AUTO as their default. When _AUTO, the value is filled from site_info.json at the dataset root (if present) and otherwise falls back to the library-level default. Explicit kwargs always win. Fields with per-call semantics (year, species_column, id_transform, duplicate_strategy, normalize_tic) stay user-controlled and are never read from the manifest.
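The precedence described above (explicit kwarg, then manifest, then library default) can be sketched as follows. The sentinel object, the resolve helper and the manifest key names are assumptions made for illustration; the actual schema of site_info.json is not documented here.

```python
import json

_AUTO = object()  # local stand-in for the sentinel named above

# Library-level defaults as listed in the parameter descriptions
LIBRARY_DEFAULTS = {
    "metadata_dir": "id",
    "metadata_suffix": "_clean.csv",
    "spectrum_ext": ".txt",
}


def resolve(name, explicit, manifest):
    """Explicit value wins, then the manifest, then the library default."""
    if explicit is not _AUTO:
        return explicit
    if name in manifest:
        return manifest[name]
    return LIBRARY_DEFAULTS[name]


# Hypothetical manifest content; the key name is assumed to match the kwarg
manifest = json.loads('{"spectrum_ext": ".csv"}')

from_manifest = resolve("spectrum_ext", _AUTO, manifest)
from_kwarg = resolve("spectrum_ext", ".dat", manifest)
from_default = resolve("metadata_dir", _AUTO, manifest)
```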

Return type:

None

discover_metadata()[source]#

Load metadata CSV(s) from the metadata directory.

Return type:

DataFrame
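The auto-detection rules given for id_column and species_column can be sketched on a plain list of column names. This assumes that "case-insensitive match" means equality after lower-casing; both helper names are hypothetical.

```python
def pick_id_column(columns):
    """Prefer 'code', then 'ID', then fall back to the first column."""
    for candidate in ("code", "ID"):
        if candidate in columns:
            return candidate
    return columns[0]


def pick_species_column(columns):
    """Return the first column equal to 'species' ignoring case, else None."""
    for col in columns:
        if col.lower() == "species":
            return col
    return None


id_col = pick_id_column(["sample", "code", "ID"])
species_col = pick_species_column(["code", "SPECIES", "Ceftriaxone"])
```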

collect_spectrum_files(stage, year)[source]#

Glob spectrum files from the stage directory.

Return type:

list[Path]

detect_stage()[source]#

Auto-detect the stage, preferring binned_* over preprocessed over raw.

Return type:

str
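The preference order can be sketched over a list of folder names. Two points are assumptions: which folder wins when several binned_* folders exist (the alphabetically first one here) and what happens when no stage folder is found (an exception here). The function below is a stand-alone illustration, not the method itself.

```python
def detect_stage(folder_names):
    """Pick a stage by the documented preference: binned_* > preprocessed > raw."""
    binned = sorted(n for n in folder_names if n.startswith("binned_"))
    if binned:
        return binned[0]  # assumption: the tie-break among binned_* folders
    for stage in ("preprocessed", "raw"):
        if stage in folder_names:
            return stage
    # assumption: the behaviour with no stage folder is not documented
    raise FileNotFoundError("no known stage folder found")


best = detect_stage(["id", "raw", "preprocessed", "binned_6000"])
```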

postprocess_spectrum(spec, *, stage=None)[source]#

Rewrite binned_N/ spectra from bin_index to real m/z.

DRIAMS (and MS-UMG) binned_6000/*.txt files store bin_index and binned_intensity columns rather than (mass, intensity). Without conversion, every downstream m/z-aware API (SpectrumQuality.noise_region, MzTrimmer, plot_spectrum axes, m/z-range filters) would operate in [0, N) instead of [mz_min, mz_max].

When stage matches binned_N and the loaded spectrum’s mass column looks like contiguous integers 0..N-1, the spectrum is rewritten:

  • mass becomes mz_min + i * (mz_max - mz_min) / (N - 1),

  • the spectrum is marked as pre-binned (_binned populated), so MaldiSet does not re-bin already-binned data,

  • _bin_metadata is filled in consistently.

Idempotent: a second call on already-converted data is a no-op (mass is no longer integer 0..N-1).
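The index-to-m/z mapping and its idempotence can be checked with plain lists. The sketch below covers only the mass rewrite (not _binned or _bin_metadata), and both function names are hypothetical. With the documented defaults and N = 6000, the step (19997 - 2000) / 5999 works out to 3.0.

```python
def looks_like_bin_index(mass):
    """True when mass is exactly 0, 1, ..., N-1."""
    return all(m == i for i, m in enumerate(mass))


def bin_index_to_mz(mass, mz_min=2000.0, mz_max=19997.0):
    """Rewrite contiguous bin indices to evenly spaced m/z values."""
    n = len(mass)
    if n < 2 or not looks_like_bin_index(mass):
        return list(mass)  # already converted, or not bin indices: leave as is
    step = (mz_max - mz_min) / (n - 1)
    return [mz_min + i * step for i in range(n)]


indices = list(range(6000))
mz = bin_index_to_mz(indices)
mz_again = bin_index_to_mz(mz)  # second call changes nothing
```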

When self.normalize_tic is True, the intensities are additionally rescaled so that each spectrum sums to 1.
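The rescaling itself is a single division by the total ion current. A minimal sketch on a plain list follows; leaving an all-zero spectrum unchanged is an assumption, since the reference does not say how that case is handled.

```python
def normalize_tic(intensities):
    """Scale intensities so that they sum to 1."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)  # assumption: all-zero input is returned unchanged
    return [x / total for x in intensities]


scaled = normalize_tic([2.0, 6.0, 8.0])
```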

Return type:

MaldiSpectrum

class maldiamrkit.data.MARISMaLayout(root_dir, metadata_csv, *, id_column='Identifier', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, id_transform=None, year=None)[source]#

Navigate a dataset of raw Bruker spectra organised in a tree.

Load spectra directly from Bruker binary files without requiring a build step. The metadata CSV must contain a column with relative paths pointing to the Bruker data directories.

Parameters:
  • root_dir (str or Path) – Root directory of the dataset.

  • metadata_csv (str or Path) – Path to the metadata CSV.

  • id_column (str, default="Identifier") – Column for specimen identifier.

  • path_column (str, default="Path") – Column with relative path to the Bruker directory.

  • target_position_column (str, default="target_position") – Column for the plate target position.

  • duplicate_strategy (str or DuplicateStrategy, default="first") –

    How to handle duplicate specimen identifiers (e.g. the same sample measured at multiple MALDI target positions):

    • "first" – keep the first occurrence (default).

    • "last" – keep the last occurrence.

    • "drop" – remove all duplicates.

    • "keep_all" – keep every replicate, appending the target-position value to the ID ({identifier}_{target_position}).

    • "average" – tag replicates for downstream averaging (adds _original_id column).

  • id_transform (callable, optional) – Function mapping raw ID strings to a canonical sample identifier. When set, duplicates are detected on the transformed identifier rather than the raw one, so technical replicates that encode the underlying sample in a filename suffix / prefix pattern collapse under duplicate_strategy. The raw ID column is preserved for downstream matching; only deduplication uses the transformed key. Leave at None for the legacy behaviour (each replicate counted as a distinct row).

  • year (str, int, or None) – Restrict to a single year.

__init__(root_dir, metadata_csv, *, id_column='Identifier', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, id_transform=None, year=None)[source]#
Return type:

None

discover_metadata()[source]#

Read metadata CSV and normalise the ID column.

Return type:

DataFrame

collect_spectrum_files(stage, year)[source]#

Resolve Bruker directories from the metadata Path column.

The stage parameter is ignored (only raw Bruker data are available).

Return type:

list[Path]

detect_stage()[source]#

Return 'raw' as the only available stage.

Return type:

str

ProcessingHandler

class maldiamrkit.data.ProcessingHandler(folder_name, kind, pipeline=None, bin_width=3)[source]#

Define an additional processing output folder.

Parameters:
  • folder_name (str) – Name of the output folder (e.g. "preprocessed_sqrt").

  • kind (str) – Either "preprocessed" or "binned".

  • pipeline (PreprocessingPipeline or None) – Pipeline to apply. None uses the default.

  • bin_width (int or float) – Bin width in Daltons (only used when kind="binned").

folder_name: str#
kind: str#
pipeline: PreprocessingPipeline | None = None#
bin_width: int | float = 3#
to_dict()[source]#

Serialize to a dictionary.

Return type:

dict

classmethod from_dict(d)[source]#

Reconstruct from a dictionary.

Parameters:

d (dict)

Return type:

ProcessingHandler
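The to_dict() / from_dict() pair suggests a simple round trip. The sketch below uses a local stand-in dataclass and leaves out the pipeline field, because how a PreprocessingPipeline is serialised is not described here. HandlerSketch is a hypothetical name, not the library class.

```python
from dataclasses import asdict, dataclass


@dataclass
class HandlerSketch:
    """Stand-in with the documented fields, minus pipeline."""

    folder_name: str
    kind: str  # "preprocessed" or "binned"
    bin_width: float = 3

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, d):
        return cls(**d)


handler = HandlerSketch("binned_6", "binned", bin_width=6)
payload = handler.to_dict()
restored = HandlerSketch.from_dict(payload)
```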

__init__(folder_name, kind, pipeline=None, bin_width=3)#
Return type:

None

BuildReport

class maldiamrkit.data.BuildReport(total, succeeded, failed, failed_ids, output_dir, folders_created)[source]#

Summary of a dataset build.

Variables:
  • total (int) – Number of spectra attempted.

  • succeeded (int) – Number successfully processed.

  • failed (int) – Number that failed.

  • failed_ids (list of str) – IDs of spectra that failed.

  • output_dir (Path) – Root of the output dataset.

  • folders_created (list of str) – Names of all processing folders created.

__init__(total, succeeded, failed, failed_ids, output_dir, folders_created)#
Return type:

None

Duplicate Handling

class maldiamrkit.data.DuplicateStrategy(value)[source]#

Bases: str, Enum

Strategy for handling duplicate spectrum identifiers.

Variables:
  • first (str) – Keep the first occurrence of each duplicate ID.

  • last (str) – Keep the last occurrence of each duplicate ID.

  • drop (str) – Remove all rows whose ID appears more than once.

  • keep_all (str) – Retain every replicate, disambiguating IDs with a suffix (e.g. _rep1, _rep2 or the target-position value).

  • average (str) – Keep all replicates for downstream averaging. Adds an _original_id column so that loaders / transformers can group replicates and average their spectra.
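The five strategies can be compared side by side on a short list of IDs. This is a stand-alone sketch, not the library's code. It returns (row position, output ID) pairs, and it assumes that under keep_all every replicate of a duplicated ID gets a _repN suffix counted from 1.

```python
from collections import Counter


def apply_duplicate_strategy(ids, strategy="first"):
    """Return (row_position, output_id) pairs kept under the given strategy."""
    counts = Counter(ids)
    rows = list(enumerate(ids))
    if strategy == "first":
        seen, kept = set(), []
        for pos, i in rows:
            if i not in seen:
                seen.add(i)
                kept.append((pos, i))
        return kept
    if strategy == "last":
        last_pos = {i: pos for pos, i in rows}  # later rows overwrite earlier ones
        return [(pos, i) for pos, i in rows if last_pos[i] == pos]
    if strategy == "drop":
        return [(pos, i) for pos, i in rows if counts[i] == 1]
    if strategy == "keep_all":
        seq, out = Counter(), []
        for pos, i in rows:
            if counts[i] > 1:
                seq[i] += 1
                out.append((pos, f"{i}_rep{seq[i]}"))  # assumed suffix numbering
            else:
                out.append((pos, i))
        return out
    if strategy == "average":
        # All rows are kept; the unchanged ID plays the role of _original_id
        return rows
    raise ValueError(f"unknown strategy: {strategy!r}")


ids = ["a", "b", "a", "c"]
results = {
    s: apply_duplicate_strategy(ids, s)
    for s in ("first", "last", "drop", "keep_all", "average")
}
```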

first = 'first'#
last = 'last'#
drop = 'drop'#
keep_all = 'keep_all'#
average = 'average'#