Dataset Builder & Loader#
Build and load datasets using pluggable layout adapters.
DatasetBuilder#
- class maldiamrkit.data.DatasetBuilder(layout, output_dir, *, name=None, id_column='code', pipeline=None, bin_width=3, extra_handlers=None, metadata_dir='id', metadata_suffix='_clean.csv', n_jobs=-1, on_error='warn')[source]#
Build a standardised dataset from any supported input layout.
- Parameters:
layout (InputLayout) – Input data layout adapter.
output_dir (str or Path) – Root directory for the standardised output.
name (str or None) – Dataset name (defaults to the output_dir name).
id_column (str, default="code") – Column name for the spectrum identifier in the output metadata.
pipeline (PreprocessingPipeline or None) – Preprocessing pipeline. None uses the default.
bin_width (int or float, default=3) – Bin width in Daltons for the default binned output.
extra_handlers (list of ProcessingHandler or None) – Additional processing outputs.
metadata_dir (str, default="id") – Subdirectory name for metadata CSV output.
metadata_suffix (str, default="_clean.csv") – Filename suffix for metadata CSV output.
n_jobs (int, default=-1) – Number of parallel workers.
on_error (str, default="warn") – Error handling: "warn", "raise", or "skip".
Examples
>>> from maldiamrkit.data import DatasetBuilder, FlatLayout
>>> layout = FlatLayout("spectra/", "meta.csv")
>>> builder = DatasetBuilder(layout, "output/")
>>> report = builder.build()
- __init__(layout, output_dir, *, name=None, id_column='code', pipeline=None, bin_width=3, extra_handlers=None, metadata_dir='id', metadata_suffix='_clean.csv', n_jobs=-1, on_error='warn')[source]#
DatasetLoader#
- class maldiamrkit.data.DatasetLoader(layout, *, stage=None, n_jobs=-1, verbose=False)[source]#
Load a dataset into a MaldiSet.
- Parameters:
layout (DatasetLayout) – Dataset navigation adapter (e.g. DRIAMSLayout or MARISMaLayout).
stage (str or None) – Processing stage to load. None triggers auto-detection via the layout.
n_jobs (int, default=-1) – Number of parallel workers for spectrum loading.
verbose (bool, default=False) – If True, show tqdm progress bars during spectrum loading and pass verbose through to MaldiSet.
Examples
>>> from maldiamrkit.data import DatasetLoader, DRIAMSLayout
>>> layout = DRIAMSLayout("output/my_dataset")
>>> loader = DatasetLoader(layout)
>>> ds = loader.load(aggregate_by=dict(antibiotics="Ceftriaxone"))
- __init__(layout, *, stage=None, n_jobs=-1, verbose=False)[source]#
- Parameters:
layout (DatasetLayout)
n_jobs (int)
verbose (bool)
- Return type:
None
Input Layouts#
- class maldiamrkit.data.InputLayout[source]#
Abstract adapter for discovering spectra and metadata.
- abstractmethod discover_spectra()[source]#
Return paths to all spectrum sources (files or directories).
- abstractmethod discover_metadata()[source]#
Return metadata DataFrame with an 'ID' column.
- Return type:
DataFrame
- class maldiamrkit.data.FlatLayout(spectra_dir, metadata_csv, *, id_column='ID', year_column=None)[source]#
Flat directory of pre-exported text spectrum files + metadata CSV.
Suitable for datasets where spectra are already exported as text files.
- Parameters:
spectra_dir (str or Path) – Directory containing spectrum text files (flat or with year subfolders).
metadata_csv (str or Path) – CSV with an ID column, species, and antibiotic columns.
id_column (str, default="ID") – Column name for the spectrum identifier in the metadata.
year_column (str or None) – Column to extract year from, or None for flat layout.
- class maldiamrkit.data.BrukerTreeLayout(root_dir, metadata_csv, *, id_column='Identifier', year_column='Year', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, validate=True)[source]#
Hierarchical directory tree containing raw Bruker binary data.
Suitable for datasets where spectra are stored as Bruker fid/acqus binaries in a hierarchical directory tree. The metadata CSV must contain a column with relative paths pointing to the Bruker data directories.
- Parameters:
root_dir (str or Path) – Root directory of the dataset.
metadata_csv (str or Path) – Metadata CSV with columns for identifier, path to Bruker data, and (optionally) year and target position.
id_column (str, default="Identifier") – Column for specimen identifier.
year_column (str, default="Year") – Column for year.
path_column (str, default="Path") – Column with relative path to the Bruker directory.
target_position_column (str, default="target_position") – Column for the plate target position.
duplicate_strategy (str or DuplicateStrategy, default "first") – How to handle duplicate specimen identifiers (e.g. the same sample measured at multiple MALDI target positions):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate, appending the target-position value to the ID ({identifier}_{target_position}).
- "average" – tag replicates for downstream averaging (adds _original_id column).
validate (bool, default=True) – If True, skip empty spectra (all-zero fid) and warn on duplicate spectra (SHA256 hash matching).
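The two validate checks (empty spectra and byte-identical duplicates) can be pictured with the sketch below. It is not the library's implementation: scan_fid_payloads is a hypothetical helper, and it assumes the raw fid contents are already available as bytes keyed by spectrum ID.

```python
import hashlib


def scan_fid_payloads(payloads):
    """Split raw fid payloads into kept, empty, and duplicate entries.

    `payloads` maps a spectrum ID to the raw bytes of its fid file
    (a hypothetical input shape chosen for this sketch).
    """
    kept, empty, duplicates = [], [], []
    seen = {}  # SHA-256 hex digest -> first ID that produced it
    for spec_id, raw in payloads.items():
        # An all-zero (or zero-length) payload carries no signal: skip it.
        if not any(raw):
            empty.append(spec_id)
            continue
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:
            # Same bytes as an earlier spectrum: record the pair so the
            # caller can warn, but keep the spectrum.
            duplicates.append((spec_id, seen[digest]))
        else:
            seen[digest] = spec_id
        kept.append(spec_id)
    return kept, empty, duplicates


payloads = {
    "s1": bytes([1, 0, 0, 0, 7, 0, 0, 0]),
    "s2": bytes(8),                         # eight zero bytes -> empty
    "s3": bytes([1, 0, 0, 0, 7, 0, 0, 0]),  # byte-identical to s1
}
kept, empty, duplicates = scan_fid_payloads(payloads)
```

Here s2 ends up in empty, while s3 is kept but reported as a duplicate of s1, mirroring the "skip empty, warn on duplicate" wording.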
- __init__(root_dir, metadata_csv, *, id_column='Identifier', year_column='Year', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, validate=True)[source]#
- discover_spectra()[source]#
Resolve Bruker directories from metadata paths.
Applies duplicate_strategy to handle specimens that appear at multiple target positions. Optionally validates for empty and duplicate spectra.
Dataset Layouts#
- class maldiamrkit.data.DatasetLayout[source]#
Abstract adapter for navigating and loading from a dataset.
- abstractmethod discover_metadata()[source]#
Load metadata and return a DataFrame with an 'ID' column.
- Return type:
DataFrame
- abstractmethod collect_spectrum_files(stage, year)[source]#
Return paths to spectrum files for the given stage/year.
- postprocess_spectrum(spec, *, stage=None)[source]#
Apply dataset-specific fix-ups to a freshly-loaded spectrum.
Default is a no-op. Layouts whose on-disk format deviates from the (mass, intensity) convention assumed by read_spectrum() can override this to reshape the spectrum. Called by DatasetLoader after each file is loaded.
- Parameters:
spec (MaldiSpectrum)
- Return type:
- class maldiamrkit.data.DRIAMSLayout(dataset_dir, *, id_column=<auto>, species_column=None, year=None, metadata_dir=<auto>, metadata_suffix=<auto>, spectrum_ext=<auto>, duplicate_strategy=DuplicateStrategy.first, id_transform=None, mz_min=<auto>, mz_max=<auto>, normalize_tic=False)[source]#
Navigate a DRIAMS-like dataset structure.
Works with both the output of DatasetBuilder and the original DRIAMS-A/B/C/D datasets.
- Parameters:
dataset_dir (str or Path) – Root of the dataset.
id_column (str or None) – Metadata column for spectrum IDs. None triggers auto-detection ('code' > 'ID' > first column).
species_column (str or None) – Metadata column for species names. None triggers auto-detection (case-insensitive match for 'species'). The column is renamed to 'Species' for downstream use.
metadata_dir (str, default="id") – Subdirectory name containing metadata CSV files.
metadata_suffix (str, default="_clean.csv") – Filename suffix for metadata CSV files.
spectrum_ext (str, default=".txt") – File extension for spectrum files (including the dot).
duplicate_strategy (str or DuplicateStrategy, default "first") – How to handle duplicate spectrum IDs (e.g. the same sample appearing in multiple year subdirectories):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate with _repN suffixes.
- "average" – tag replicates for downstream averaging.
id_transform (callable, optional) – Function mapping raw ID strings to a canonical sample identifier. When set, duplicates are detected on the transformed identifier rather than the raw one, so technical-replicate files that share an underlying sample (e.g. DRIAMS UUID_MALDI1 / UUID_MALDI2) are recognized as duplicates by duplicate_strategy. The raw ID column is preserved for spectrum-file matching; only deduplication uses the transformed key. Typical DRIAMS usage:
    import re
    DRIAMSLayout(
        ...,
        # strip a trailing _MALDI<N> replicate suffix
        id_transform=lambda s: re.sub(r"_MALDI\d+$", "", s),
        duplicate_strategy="first",  # or "average"
    )
Leaving this at None preserves the legacy behaviour (each replicate counted as a distinct row). A one-time warning is emitted when _MALDI<N>-suffixed IDs are detected and id_transform is None, pointing at this kwarg; the warning can be silenced by passing id_transform=str if the per-replicate semantics are intentional.
mz_min (float, default=2000.0) – Lower m/z edge to assign to bin index 0 when a binned_N/ stage is loaded. Only consulted by postprocess_spectrum().
mz_max (float, default=19997.0) – Upper m/z edge assigned to bin index N-1.
normalize_tic (bool, default=False) – When True, re-apply a TIC normalization (intensity <- intensity / sum(intensity)) to every loaded spectrum in postprocess_spectrum(). Useful because the published DRIAMS / MS-UMG binned_6000/ files do not sum to 1.0 on disk (empirically ~1.29 and ~1.36 respectively), despite the DRIAMS preprocessing script calling calibrateIntensity(method="TIC") before trimming – the cause is somewhere in the upstream pipeline (MALDIquant version or an implicit scaling step) and has not been reproduced here. Enabling this kwarg gives sum=1.0 per spectrum, aligning DRIAMS / MS-UMG with flat-text datasets whose preprocessing pipeline already produces TIC=1.
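The rescaling that normalize_tic describes (intensity <- intensity / sum(intensity)) is small enough to write out. The following is a standalone sketch on a plain list, not the library code; the function name and the zero-sum handling are assumptions made for illustration.

```python
def normalize_tic(intensities):
    """Rescale intensities so that they sum to 1 (TIC normalization)."""
    total = sum(intensities)
    if total == 0:
        # Nothing to rescale; return an unchanged copy
        # (a choice made for this sketch only).
        return list(intensities)
    return [value / total for value in intensities]


raw = [0.50, 0.25, 0.54]  # sums to about 1.29, like an on-disk DRIAMS file
scaled = normalize_tic(raw)
```

Relative peak heights are unchanged by this step; only the overall scale moves, which is why it can be re-applied safely to spectra that already sum to 1.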
- __init__(dataset_dir, *, id_column=<auto>, species_column=None, year=None, metadata_dir=<auto>, metadata_suffix=<auto>, spectrum_ext=<auto>, duplicate_strategy=DuplicateStrategy.first, id_transform=None, mz_min=<auto>, mz_max=<auto>, normalize_tic=False)[source]#
Initialise the layout.
Several kwargs accept the sentinel _AUTO as their default. When _AUTO, the value is filled from site_info.json at the dataset root (if present) and otherwise falls back to the library-level default. Explicit kwargs always win. Fields with per-call semantics (year, species_column, id_transform, duplicate_strategy, normalize_tic) stay user-controlled and are never read from the manifest.
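The precedence described here (explicit kwarg, then site_info.json, then library default) can be sketched as follows. This is an illustration only: the resolve helper, the stand-in _AUTO object, and the assumption that manifest keys share the kwarg names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

_AUTO = object()  # stand-in sentinel meaning "not set by the caller"

# Library-level fallbacks, copied from the defaults documented above.
DEFAULTS = {
    "metadata_dir": "id",
    "metadata_suffix": "_clean.csv",
    "spectrum_ext": ".txt",
}


def resolve(name, value, dataset_dir):
    """Resolve one kwarg: explicit value > site_info.json > default."""
    if value is not _AUTO:
        return value  # explicit kwargs always win
    manifest = Path(dataset_dir) / "site_info.json"
    if manifest.is_file():
        info = json.loads(manifest.read_text())
        if name in info:  # manifest key names are an assumption
            return info[name]
    return DEFAULTS[name]


with tempfile.TemporaryDirectory() as root:
    manifest_path = Path(root) / "site_info.json"
    manifest_path.write_text(json.dumps({"spectrum_ext": ".csv"}))
    from_manifest = resolve("spectrum_ext", _AUTO, root)
    from_default = resolve("metadata_dir", _AUTO, root)
    explicit = resolve("spectrum_ext", ".txt", root)
```

A dedicated sentinel object, rather than None, keeps None free to carry its own meaning for parameters such as id_column.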
- postprocess_spectrum(spec, *, stage=None)[source]#
Rewrite binned_N/ spectra from bin_index to real m/z.
DRIAMS (and MS-UMG) binned_6000/*.txt files store bin_index binned_intensity rather than (mass, intensity). Without conversion, every downstream m/z-aware API (SpectrumQuality.noise_region, MzTrimmer, plot_spectrum axes, m/z-range filters) would operate in [0, N) instead of [mz_min, mz_max].
When stage matches binned_N and the loaded spectrum’s mass column looks like contiguous integers 0..N-1, the spectrum is rewritten:
- mass becomes mz_min + i * (mz_max - mz_min) / (N - 1),
- the spectrum is marked as pre-binned (_binned populated), so MaldiSet does not re-bin already-binned data,
- _bin_metadata is filled in consistently.
Idempotent: a second call on already-converted data is a no-op (mass is no longer integer 0..N-1).
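The index-to-m/z mapping and its idempotence can be checked with a few lines of plain Python. This sketch works on a bare list of mass values rather than a MaldiSpectrum, and bin_index_to_mz is a hypothetical name.

```python
def bin_index_to_mz(mass, mz_min=2000.0, mz_max=19997.0):
    """Map contiguous bin indices 0..N-1 onto [mz_min, mz_max].

    Input that does not look like 0..N-1 is returned unchanged, which
    is what makes a second call a no-op.
    """
    n = len(mass)
    looks_like_indices = n > 1 and all(m == i for i, m in enumerate(mass))
    if not looks_like_indices:
        return list(mass)
    step = (mz_max - mz_min) / (n - 1)
    return [mz_min + i * step for i in range(n)]


indices = list(range(6000))   # what a binned_6000 file stores as "mass"
mz = bin_index_to_mz(indices)
again = bin_index_to_mz(mz)   # already converted, so left untouched
```

With 6000 bins and the default edges the spacing is (19997 - 2000) / 5999 = 3 Da, so bin 0 maps to 2000.0 and bin 5999 to 19997.0.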
When self.normalize_tic is True, the intensities are additionally rescaled so that each spectrum sums to 1.
- Parameters:
spec (MaldiSpectrum)
- Return type:
- class maldiamrkit.data.MARISMaLayout(root_dir, metadata_csv, *, id_column='Identifier', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, id_transform=None, year=None)[source]#
Navigate a dataset of raw Bruker spectra organised in a tree.
Load spectra directly from Bruker binary files without requiring a build step. The metadata CSV must contain a column with relative paths pointing to the Bruker data directories.
- Parameters:
root_dir (str or Path) – Root directory of the dataset.
metadata_csv (str or Path) – Path to the metadata CSV.
id_column (str, default="Identifier") – Column for specimen identifier.
path_column (str, default="Path") – Column with relative path to the Bruker directory.
target_position_column (str, default="target_position") – Column for the plate target position.
duplicate_strategy (str or DuplicateStrategy, default "first") – How to handle duplicate specimen identifiers (e.g. the same sample measured at multiple MALDI target positions):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate, appending the target-position value to the ID ({identifier}_{target_position}).
- "average" – tag replicates for downstream averaging (adds _original_id column).
id_transform (callable, optional) – Function mapping raw ID strings to a canonical sample identifier. When set, duplicates are detected on the transformed identifier rather than the raw one, so technical replicates that encode the underlying sample in a filename suffix / prefix pattern collapse under duplicate_strategy. The raw ID column is preserved for downstream matching; only deduplication uses the transformed key. Leave at None for the legacy behaviour (each replicate counted as a distinct row).
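To make the split between the raw ID and the deduplication key concrete, here is a minimal sketch of "first" deduplication with and without an id_transform. The ID format ("S001-a", "S001-b") and both helper functions are hypothetical; real datasets will need their own transform.

```python
def strip_replicate_suffix(raw_id):
    """Hypothetical transform: 'S001-a' -> 'S001'."""
    return raw_id.split("-")[0]


def dedup_first(ids, id_transform=None):
    """Keep the first raw ID seen for each (optionally transformed) key."""
    seen = set()
    kept = []
    for raw_id in ids:
        key = id_transform(raw_id) if id_transform else raw_id
        if key not in seen:
            seen.add(key)
            kept.append(raw_id)  # the raw ID is preserved, not the key
    return kept


ids = ["S001-a", "S001-b", "S002-a"]
legacy = dedup_first(ids)  # no transform: every replicate is distinct
collapsed = dedup_first(ids, id_transform=strip_replicate_suffix)
```

Without a transform all three rows survive; with it, S001-b is recognized as a replicate of S001-a and dropped, while the surviving rows keep their raw IDs.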
- __init__(root_dir, metadata_csv, *, id_column='Identifier', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, id_transform=None, year=None)[source]#
ProcessingHandler#
- class maldiamrkit.data.ProcessingHandler(folder_name, kind, pipeline=None, bin_width=3)[source]#
Define an additional processing output folder.
- Parameters:
folder_name (str) – Name of the output folder (e.g. "preprocessed_sqrt").
kind (str) – Either "preprocessed" or "binned".
pipeline (PreprocessingPipeline or None) – Pipeline to apply. None uses the default.
bin_width (int or float) – Bin width in Daltons (only used when kind="binned").
- pipeline: PreprocessingPipeline | None = None#
BuildReport#
- class maldiamrkit.data.BuildReport(total, succeeded, failed, failed_ids, output_dir, folders_created)[source]#
Summary of a dataset build.
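How the on_error setting of DatasetBuilder might feed the counts in such a report is sketched below. This is a guess at the general pattern, not the library's code: run_with_policy is hypothetical, the exact difference between "warn" and "skip" is assumed to be only whether a warning is emitted, and treating failed as a count is also an assumption.

```python
import warnings


def run_with_policy(items, worker, on_error="warn"):
    """Run `worker` on each (id, payload) pair and tally the outcome."""
    if on_error not in ("warn", "raise", "skip"):
        raise ValueError(f"unknown on_error value: {on_error!r}")
    items = list(items)
    succeeded = 0
    failed_ids = []
    for spec_id, payload in items:
        try:
            worker(payload)
        except Exception as exc:
            if on_error == "raise":
                raise  # abort the whole run on the first failure
            if on_error == "warn":
                warnings.warn(f"skipping {spec_id}: {exc}")
            failed_ids.append(spec_id)  # "warn" and "skip" both continue
        else:
            succeeded += 1
    # Keys mirror the BuildReport field names; their types are assumed.
    return {
        "total": len(items),
        "succeeded": succeeded,
        "failed": len(failed_ids),
        "failed_ids": failed_ids,
    }


items = [("s1", "12"), ("s2", "not-a-number"), ("s3", "7")]
report = run_with_policy(items, int, on_error="skip")
```

With "skip" the bad item is recorded in failed_ids and the run continues; switching to "raise" would propagate the ValueError from the second item instead.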
Duplicate Handling#
- class maldiamrkit.data.DuplicateStrategy(value)[source]#
Strategy for handling duplicate spectrum identifiers.
- Variables:
first (str) – Keep the first occurrence of each duplicate ID.
last (str) – Keep the last occurrence of each duplicate ID.
drop (str) – Remove all rows whose ID appears more than once.
keep_all (str) – Retain every replicate, disambiguating IDs with a suffix (e.g. _rep1, _rep2 or the target-position value).
average (str) – Keep all replicates for downstream averaging. Adds an _original_id column so that loaders / transformers can group replicates and average their spectra.
- first = 'first'#
- last = 'last'#
- drop = 'drop'#
- keep_all = 'keep_all'#
- average = 'average'#
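The five strategies can be compared side by side on a toy ID list. The sketch below is a plain-Python approximation, not the library's implementation: apply_strategy is hypothetical, it returns (row_id, original_id) pairs in place of a DataFrame with an _original_id column, and it assumes that under keep_all only duplicated IDs receive a _repN suffix.

```python
from collections import Counter


def apply_strategy(ids, strategy="first"):
    """Return (row_id, original_id) pairs after handling duplicates."""
    counts = Counter(ids)
    if strategy == "first":
        seen, out = set(), []
        for i in ids:
            if i not in seen:
                seen.add(i)
                out.append((i, i))
        return out
    if strategy == "last":
        last_pos = {i: pos for pos, i in enumerate(ids)}
        return [(i, i) for pos, i in enumerate(ids) if last_pos[i] == pos]
    if strategy == "drop":
        # Remove every row whose ID appears more than once.
        return [(i, i) for i in ids if counts[i] == 1]
    if strategy == "keep_all":
        running, out = Counter(), []
        for i in ids:
            if counts[i] > 1:
                running[i] += 1
                out.append((f"{i}_rep{running[i]}", i))
            else:
                out.append((i, i))
        return out
    if strategy == "average":
        # Keep every row; the original ID is the grouping key that a
        # downstream step would use to average replicates.
        return [(i, i) for i in ids]
    raise ValueError(f"unknown strategy: {strategy!r}")


ids = ["a", "b", "a", "c"]
results = {
    name: apply_strategy(ids, name)
    for name in ("first", "last", "drop", "keep_all", "average")
}
```

For this input, "first" and "last" both keep three rows but differ in which copy of "a" survives, "drop" keeps only "b" and "c", and "keep_all" yields a_rep1, b, a_rep2, c.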