Dataset Builder & Loader

Build and load datasets using pluggable layout adapters.

DatasetBuilder

class maldiamrkit.data.DatasetBuilder(layout, output_dir, *, name=None, id_column='code', pipeline=None, bin_width=3, extra_handlers=None, metadata_dir='id', metadata_suffix='_clean.csv', n_jobs=-1, on_error='warn')

Build a standardised dataset from any supported input layout.

Parameters:

layout (InputLayout) – Input data layout adapter.
output_dir (str or Path) – Root directory for the standardised output.
name (str or None) – Dataset name (defaults to output_dir name).
id_column (str, default="code") – Column name for the spectrum identifier in the output metadata.
pipeline (PreprocessingPipeline or None) – Preprocessing pipeline. None uses the default.
bin_width (int or float, default=3) – Bin width in Daltons for the default binned output.
extra_handlers (list of ProcessingHandler or None) – Additional processing outputs.
metadata_dir (str, default="id") – Subdirectory name for metadata CSV output.
metadata_suffix (str, default="_clean.csv") – Filename suffix for metadata CSV output.
n_jobs (int, default=-1) – Number of parallel workers.
on_error (str, default="warn") – Error handling: "warn", "raise", or "skip".
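
The three on_error modes can be pictured with a small, library-independent sketch. It assumes that "raise" propagates the first failure, "warn" emits a warning and continues, and "skip" continues silently; this page does not spell out those semantics, and the helper name process_all and its return shape are hypothetical, not part of maldiamrkit.

```python
import warnings


def process_all(items, func, on_error="warn"):
    """Apply func to each (item_id, payload) pair under an assumed on_error policy."""
    if on_error not in ("warn", "raise", "skip"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")
    results, failed_ids = {}, []
    for item_id, payload in items:
        try:
            results[item_id] = func(payload)
        except Exception as exc:
            if on_error == "raise":
                raise  # assumed: stop at the first failure
            if on_error == "warn":
                warnings.warn(f"{item_id}: {exc}")
            # assumed: both "warn" and "skip" record the failure and continue
            failed_ids.append(item_id)
    return results, failed_ids
```

Collecting the failed identifiers in this way mirrors the kind of information that a build summary such as BuildReport.failed_ids reports.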

Examples

>>> from maldiamrkit.data import DatasetBuilder, FlatLayout
>>> layout = FlatLayout("spectra/", "meta.csv")
>>> builder = DatasetBuilder(layout, "output/")
>>> report = builder.build()

__init__(layout, output_dir, *, name=None, id_column='code', pipeline=None, bin_width=3, extra_handlers=None, metadata_dir='id', metadata_suffix='_clean.csv', n_jobs=-1, on_error='warn')

Parameters:

layout (InputLayout)
output_dir (str | Path)
name (str | None)
id_column (str)
pipeline (PreprocessingPipeline | None)
bin_width (int | float)
extra_handlers (list[ProcessingHandler] | None)
metadata_dir (str)
metadata_suffix (str)
n_jobs (int)
on_error (str)

Return type:

None

build()

Execute the build pipeline.

Returns: Summary of the build.
Return type: BuildReport

DatasetLoader

class maldiamrkit.data.DatasetLoader(layout, *, stage=None, n_jobs=-1, verbose=False)

Load a dataset into a MaldiSet.

Parameters:

layout (DatasetLayout) – Dataset navigation adapter (e.g. DRIAMSLayout or MARISMaLayout).
stage (str or None) – Processing stage to load. None triggers auto-detection via the layout.
n_jobs (int, default=-1) – Number of parallel workers for spectrum loading.
verbose (bool, default=False) – If True, show tqdm progress bars during spectrum loading and pass verbose through to MaldiSet.

Examples

>>> from maldiamrkit.data import DatasetLoader, DRIAMSLayout
>>> layout = DRIAMSLayout("output/my_dataset")
>>> loader = DatasetLoader(layout)
>>> ds = loader.load(aggregate_by=dict(antibiotics="Ceftriaxone"))

__init__(layout, *, stage=None, n_jobs=-1, verbose=False)

Parameters:

layout (DatasetLayout)
stage (str | None)
n_jobs (int)
verbose (bool)

Return type:

None

load(aggregate_by=None)

Load the dataset.

Parameters: aggregate_by (dict, optional) – Passed through to MaldiSet.
Returns: Dataset with loaded spectra and metadata.
Return type: MaldiSet

Input Layouts

class maldiamrkit.data.InputLayout

Abstract adapter for discovering spectra and metadata.

abstractmethod discover_spectra()

Return paths to all spectrum sources (files or directories).

Return type: list[Path]

abstractmethod discover_metadata()

Return metadata DataFrame with an 'ID' column.

Return type: DataFrame

abstractmethod get_id(spectrum_path)

Extract the spectrum identifier from a path.

Parameters: spectrum_path (Path)
Return type: str

abstractmethod get_year(spectrum_id)

Return the year for a spectrum, or None.

Parameters: spectrum_id (str)
Return type: str | None

class maldiamrkit.data.FlatLayout(spectra_dir, metadata_csv, *, id_column='ID', year_column=None, year_overrides=None)

Flat directory of pre-exported text spectrum files + metadata CSV.

Suitable for datasets where spectra are already exported as text files.

Parameters:

spectra_dir (str or Path) – Directory containing spectrum text files (flat or with year subfolders).
metadata_csv (str or Path) – CSV with an ID column, species, and antibiotic columns.
id_column (str, default="ID") – Column name for the spectrum identifier in the metadata.
year_column (str or None) – Column to extract year from. When None, the year is instead inferred from a four-digit input subfolder name (if any), so a year-organised input directory still produces year subfolders in the output. See get_year().
year_overrides (dict[str, str] or None) – Optional explicit {spectrum_id: year} mapping. Useful when the spectra_dir is itself flat but the years are known from another source. Takes precedence over the inferred subfolder year, but not over year_column.

__init__(spectra_dir, metadata_csv, *, id_column='ID', year_column=None, year_overrides=None)

Parameters:

spectra_dir (str | Path)
metadata_csv (str | Path)
id_column (str)
year_column (str | None)
year_overrides (dict[str, str] | None)

Return type:

None

discover_spectra()

Find .txt files, whether flat or in (possibly nested) subfolders.

Files are searched flat first, then one level down, then recursively. Whenever a spectrum sits under a four-digit-year folder, that year is recorded so it can serve as a fallback when no year_column is supplied (see get_year()).

Return type: list[Path]
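
The search order described above can be illustrated with a standalone sketch. It assumes each wider search is tried only when the narrower one finds nothing, and that the nearest four-digit ancestor folder supplies the year; both are readings of the description rather than confirmed behaviour. The function name find_spectra is hypothetical.

```python
import re
from pathlib import Path


def find_spectra(spectra_dir):
    """Return (.txt paths, {stem: year}) using a flat -> one-level -> recursive search."""
    root = Path(spectra_dir)
    files = []
    for pattern in ("*.txt", "*/*.txt", "**/*.txt"):
        files = sorted(root.glob(pattern))
        if files:
            break  # assumed: stop at the first search level that yields files
    years = {}
    for path in files:
        # Walk the parent folders (nearest first) looking for a four-digit name.
        for part in reversed(path.relative_to(root).parts[:-1]):
            if re.fullmatch(r"\d{4}", part):
                years[path.stem] = part
                break
    return files, years
```

A directory such as spectra/2019/s1.txt would yield one path and the mapping {"s1": "2019"} under these assumptions.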

discover_metadata()

Read metadata CSV and normalise the ID column.

Return type: DataFrame

get_id(spectrum_path)

Filename stem is the spectrum ID.

Parameters: spectrum_path (Path)
Return type: str

get_year(spectrum_id)

Resolve a spectrum’s year.

Resolution order: the metadata year_column (authoritative when set), then an explicit year_overrides entry, then the year inferred from a four-digit input subfolder (see discover_spectra()); failing all three, None (flat layout).

Parameters: spectrum_id (str)
Return type: str | None
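
As a rough illustration of this resolution order, the sketch below checks three plain dictionaries in priority order. The function resolve_year and its arguments are hypothetical stand-ins, and it assumes that an ID missing from a higher-priority source falls through to the next one, which the description does not state explicitly.

```python
def resolve_year(spectrum_id, *, column_years=None, year_overrides=None,
                 inferred_years=None):
    """Return the first year found in priority order, or None.

    column_years   -- {id: year} read from the metadata year_column
    year_overrides -- explicit {id: year} mapping
    inferred_years -- {id: year} inferred from four-digit subfolders
    """
    for source in (column_years, year_overrides, inferred_years):
        if source and spectrum_id in source:
            return str(source[spectrum_id])
    return None  # flat layout with no year information
```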

class maldiamrkit.data.BrukerTreeLayout(root_dir, metadata_csv, *, id_column='Identifier', year_column='Year', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, validate=True)

Hierarchical directory tree containing raw Bruker binary data.

Suitable for datasets where spectra are stored as Bruker fid/acqus binaries in a hierarchical directory tree. The metadata CSV must contain a column with relative paths pointing to the Bruker data directories.

Parameters:

root_dir (str or Path) – Root directory of the dataset.
metadata_csv (str or Path) – Metadata CSV with columns for identifier, path to Bruker data, and (optionally) year and target position.
id_column (str, default="Identifier") – Column for specimen identifier.
year_column (str, default="Year") – Column for year.
path_column (str, default="Path") – Column with relative path to the Bruker directory.
target_position_column (str, default="target_position") – Column for the plate target position.
duplicate_strategy (str or DuplicateStrategy, default="first") –
How to handle duplicate specimen identifiers (e.g. the same sample measured at multiple MALDI target positions):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate, appending the target-position value to the ID ({identifier}_{target_position}).
- "average" – tag replicates for downstream averaging (adds _original_id column).
validate (bool, default=True) – If True, skip empty spectra (all-zero fid) and warn on duplicate spectra (SHA256 hash matching).

__init__(root_dir, metadata_csv, *, id_column='Identifier', year_column='Year', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, validate=True)

Parameters:

root_dir (str | Path)
metadata_csv (str | Path)
id_column (str)
year_column (str)
path_column (str)
target_position_column (str)
duplicate_strategy (str | DuplicateStrategy)
validate (bool)

Return type:

None

discover_spectra()

Resolve Bruker directories from metadata paths.

Applies duplicate_strategy to handle specimens that appear at multiple target positions. Optionally validates for empty and duplicate spectra.

Return type: list[Path]

discover_metadata()

Read metadata CSV and normalise the ID column.

Return type: DataFrame

get_id(spectrum_path)

Look up ID from the path mapping built during discovery.

Parameters: spectrum_path (Path)
Return type: str

get_year(spectrum_id)

Year from the metadata.

Parameters: spectrum_id (str)
Return type: str | None

Dataset Layouts

class maldiamrkit.data.DatasetLayout

Abstract adapter for navigating and loading from a dataset.

replicate_pattern: Pattern[str] | None = None

isolate_column: str | None = None

abstractmethod discover_metadata()

Load metadata and return a DataFrame with an 'ID' column.

Return type: DataFrame

abstractmethod collect_spectrum_files(stage, year)

Return paths to spectrum files for the given stage/year.

Parameters:

stage (str | None)
year (str | int | None)

Return type:

list[Path]

abstractmethod detect_stage()

Auto-detect the best available processing stage.

Return type: str

postprocess_spectrum(spec, *, stage=None)

Apply dataset-specific fix-ups to a freshly-loaded spectrum.

Default is a no-op. Layouts whose on-disk format deviates from the (mass, intensity) convention assumed by read_spectrum() can override this to reshape the spectrum. Called by DatasetLoader after each file is loaded.

Parameters:

spec (MaldiSpectrum)
stage (str | None)

Return type:

MaldiSpectrum

class maldiamrkit.data.DRIAMSLayout(dataset_dir, *, id_column=<auto>, species_column=None, year=None, metadata_dir=<auto>, metadata_suffix=<auto>, spectrum_ext=<auto>, duplicate_strategy=DuplicateStrategy.first, id_transform=None, collapse_replicates=False, mz_min=<auto>, mz_max=<auto>, normalize_tic=False)

Navigate a DRIAMS-like dataset structure.

Works with both the output of DatasetBuilder and the original DRIAMS-A/B/C/D datasets.

Parameters:

dataset_dir (str or Path) – Root of the dataset.
id_column (str or None) – Metadata column for spectrum IDs. None triggers auto-detection ('code' > 'ID' > first column).
species_column (str or None) – Metadata column for species names. None triggers auto-detection (case-insensitive match for 'species'). The column is renamed to 'Species' for downstream use.
year (str, int, or None) – Restrict to a single year.
metadata_dir (str, default="id") – Subdirectory name containing metadata CSV files.
metadata_suffix (str, default="_clean.csv") – Filename suffix for metadata CSV files.
spectrum_ext (str, default=".txt") – File extension for spectrum files (including the dot).
duplicate_strategy (str or DuplicateStrategy, default="first") –
How to handle duplicate spectrum IDs (e.g. the same sample appearing in multiple year subdirectories):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate with _repN suffixes.
- "average" – tag replicates for downstream averaging.
id_transform (callable, optional) –
Function mapping raw ID strings to a canonical sample identifier. When set, duplicates are detected on the transformed identifier rather than the raw one – so technical-replicate files that share an underlying sample (e.g. DRIAMS UUID_MALDI1 / UUID_MALDI2) are recognized as duplicates by duplicate_strategy. The raw ID column is preserved for spectrum-file matching; only deduplication uses the transformed key. Typical DRIAMS usage:
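```
import re
from maldiamrkit.data import DRIAMSLayout

DRIAMSLayout(
    ...,
    # Strip the trailing _MALDI<N> tag so replicates share one dedup key.
    id_transform=lambda s: re.sub(r"_MALDI\d+$", "", s),
    duplicate_strategy="first",   # or "average"
)
```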
Leaving this at None preserves the legacy behaviour (each replicate counted as a distinct row). A one-time warning is emitted when _MALDI<N>-suffixed IDs are detected and id_transform is None, pointing at this kwarg; the warning can be silenced by passing id_transform=str if the per-replicate semantics are intentional.
collapse_replicates (bool, default=False) – Convenience shortcut for id_transform=strip_driams_replicate: collapse DRIAMS technical replicates (_MALDI<N>) to one row per underlying isolate via the active duplicate_strategy. Ignored when an explicit id_transform is given (that always takes precedence).
mz_min (float, default=2000.0) – Lower m/z edge to assign to bin index 0 when a binned_N/ stage is loaded. Only consulted by postprocess_spectrum().
mz_max (float, default=19997.0) – Upper m/z edge assigned to bin index N-1.
normalize_tic (bool, default=False) – When True, postprocess_spectrum() re-applies TIC normalization (intensity <- intensity / sum(intensity)) to every loaded spectrum. This is useful because the published DRIAMS and MS-UMG binned_6000/ files do not sum to 1.0 on disk (empirically ~1.29 and ~1.36 respectively), even though the DRIAMS preprocessing script calls calibrateIntensity(method="TIC") before trimming. The cause lies somewhere in the upstream pipeline (the MALDIquant version or an implicit scaling step) and has not been reproduced here. Enabling this kwarg makes each spectrum sum to 1.0, which aligns DRIAMS / MS-UMG with flat-text datasets whose preprocessing pipeline already produces TIC=1.

replicate_pattern: Pattern[str] | None = re.compile('_MALDI\\d+$')

__init__(dataset_dir, *, id_column=<auto>, species_column=None, year=None, metadata_dir=<auto>, metadata_suffix=<auto>, spectrum_ext=<auto>, duplicate_strategy=DuplicateStrategy.first, id_transform=None, collapse_replicates=False, mz_min=<auto>, mz_max=<auto>, normalize_tic=False)

Initialise the layout.

Several kwargs accept the sentinel _AUTO as their default. When _AUTO, the value is filled from site_info.json at the dataset root (if present) and otherwise falls back to the library-level default. Explicit kwargs always win. Fields with per-call semantics (year, species_column, id_transform, duplicate_strategy, normalize_tic) stay user-controlled and are never read from the manifest.

Parameters:

dataset_dir (str | Path)
id_column (str | None | _Sentinel)
species_column (str | None)
year (str | int | None)
metadata_dir (str | _Sentinel)
metadata_suffix (str | _Sentinel)
spectrum_ext (str | _Sentinel)
duplicate_strategy (str | DuplicateStrategy)
id_transform (Callable[[str], str] | None)
collapse_replicates (bool)
mz_min (float | _Sentinel)
mz_max (float | _Sentinel)
normalize_tic (bool)

Return type:

None

discover_metadata()

Load metadata CSV(s) from the metadata directory.

Return type: DataFrame

collect_spectrum_files(stage, year)

Glob spectrum files from the stage directory.

Parameters:

stage (str | None)
year (str | int | None)

Return type:

list[Path]

detect_stage()

Auto-detect the stage in priority order: binned_* > preprocessed > raw.

Return type: str
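
The stage priority can be expressed as a short selection function. This is a sketch only: pick_stage is a hypothetical name, the tie-break among several binned_* folders (alphabetical order) is an assumption, and so is raising FileNotFoundError when nothing matches.

```python
def pick_stage(folder_names):
    """Prefer any binned_* folder, then 'preprocessed', then 'raw'."""
    binned = sorted(name for name in folder_names if name.startswith("binned_"))
    if binned:
        return binned[0]  # assumed tie-break: first in alphabetical order
    for stage in ("preprocessed", "raw"):
        if stage in folder_names:
            return stage
    raise FileNotFoundError("no known processing stage found")
```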

postprocess_spectrum(spec, *, stage=None)

Rewrite binned_N/ spectra from bin_index to real m/z.

DRIAMS (and MS-UMG) binned_6000/*.txt files store bin_index and binned_intensity columns rather than (mass, intensity). Without conversion, every downstream m/z-aware API (SpectrumQuality.noise_region, MzTrimmer, plot_spectrum axes, m/z-range filters) would operate in [0, N) instead of [mz_min, mz_max].

When stage matches binned_N and the loaded spectrum’s mass column looks like contiguous integers 0..N-1, the spectrum is rewritten:

- mass becomes mz_min + i * (mz_max - mz_min) / (N - 1),
- the spectrum is marked as pre-binned (_binned populated), so MaldiSet does not re-bin already-binned data,
- _bin_metadata is filled in consistently.

Idempotent: a second call on already-converted data is a no-op (mass is no longer integer 0..N-1).

When self.normalize_tic is True, the intensities are additionally rescaled so that each spectrum sums to 1.
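
The index-to-m/z mapping and the optional TIC rescaling can be checked with a few lines of plain Python. The helper names below are hypothetical and operate on lists rather than on MaldiSpectrum objects; the handling of all-zero spectra is an assumption.

```python
def looks_like_bin_index(mass):
    """True when mass is exactly the contiguous integers 0..N-1."""
    return all(m == i for i, m in enumerate(mass))


def bin_index_to_mz(n_bins, mz_min=2000.0, mz_max=19997.0):
    """Map bin indices 0..N-1 linearly onto [mz_min, mz_max]."""
    step = (mz_max - mz_min) / (n_bins - 1)
    return [mz_min + i * step for i in range(n_bins)]


def normalize_tic(intensities):
    """Rescale intensities so that they sum to 1."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)  # assumed: leave all-zero spectra untouched
    return [x / total for x in intensities]


mz = bin_index_to_mz(6000)
print(mz[0], mz[1], mz[-1])  # 2000.0 2003.0 19997.0
```

With the default edges and N = 6000 the spacing works out to exactly 3 Da, since (19997 - 2000) / 5999 = 3. The converted axis no longer passes looks_like_bin_index, which is why a second conversion pass would leave the data unchanged.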

Parameters:

spec (MaldiSpectrum)
stage (str | None)

Return type:

MaldiSpectrum

class maldiamrkit.data.MARISMaLayout(root_dir, metadata_csv, *, id_column='Identifier', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, id_transform=None, year=None)

Navigate a dataset of raw Bruker spectra organised in a tree.

Load spectra directly from Bruker binary files without requiring a build step. The metadata CSV must contain a column with relative paths pointing to the Bruker data directories.

Parameters:

root_dir (str or Path) – Root directory of the dataset.
metadata_csv (str or Path) – Path to the metadata CSV.
id_column (str, default="Identifier") – Column for specimen identifier.
path_column (str, default="Path") – Column with relative path to the Bruker directory.
target_position_column (str, default="target_position") – Column for the plate target position.
duplicate_strategy (str or DuplicateStrategy, default="first") –
How to handle duplicate specimen identifiers (e.g. the same sample measured at multiple MALDI target positions):
- "first" – keep the first occurrence (default).
- "last" – keep the last occurrence.
- "drop" – remove all duplicates.
- "keep_all" – keep every replicate, appending the target-position value to the ID ({identifier}_{target_position}).
- "average" – tag replicates for downstream averaging (adds _original_id column).
id_transform (callable, optional) – Function mapping raw ID strings to a canonical sample identifier. When set, duplicates are detected on the transformed identifier rather than the raw one, so technical replicates that encode the underlying sample in a filename suffix / prefix pattern collapse under duplicate_strategy. The raw ID column is preserved for downstream matching; only deduplication uses the transformed key. Leave at None for the legacy behaviour (each replicate counted as a distinct row).
year (str, int, or None) – Restrict to a single year.

__init__(root_dir, metadata_csv, *, id_column='Identifier', path_column='Path', target_position_column='target_position', duplicate_strategy=DuplicateStrategy.first, id_transform=None, year=None)

Parameters:

root_dir (str | Path)
metadata_csv (str | Path)
id_column (str)
path_column (str)
target_position_column (str)
duplicate_strategy (str | DuplicateStrategy)
id_transform (Callable[[str], str] | None)
year (str | int | None)

Return type:

None

discover_metadata()

Read metadata CSV and normalise the ID column.

Return type: DataFrame

collect_spectrum_files(stage, year)

Resolve Bruker directories from metadata Path column.

The stage parameter is ignored (only raw Bruker available).

Parameters:

stage (str | None)
year (str | int | None)

Return type:

list[Path]

detect_stage()

Return 'raw' as the only available stage.

Return type: str

maldiamrkit.data.strip_driams_replicate(sample_id)

Return the underlying DRIAMS sample ID by removing the _MALDI<N> tag.

"abc123_MALDI2" -> "abc123". IDs without the suffix are returned unchanged. Use as a group key (one value per biological isolate) for group-aware cross-validation.

Parameters: sample_id (str)
Return type: str
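
A standalone equivalent of this behaviour, useful for seeing how the group keys come out, might look as follows. The function strip_replicate is a stand-in written for this illustration (it reuses the _MALDI\d+$ pattern shown for DRIAMSLayout.replicate_pattern) and is not the library function itself; the sample IDs are made up.

```python
import re
from collections import defaultdict

# Same suffix pattern as DRIAMSLayout.replicate_pattern.
_REPLICATE_RE = re.compile(r"_MALDI\d+$")


def strip_replicate(sample_id):
    """Remove a trailing _MALDI<N> tag; other IDs pass through unchanged."""
    return _REPLICATE_RE.sub("", sample_id)


# Group replicate files by their underlying isolate.
ids = ["abc123_MALDI1", "abc123_MALDI2", "def456"]
groups = defaultdict(list)
for sid in ids:
    groups[strip_replicate(sid)].append(sid)
```

Passing such keys as the groups argument of a group-aware splitter keeps all replicates of one isolate on the same side of a train/test split.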

ProcessingHandler

class maldiamrkit.data.ProcessingHandler(folder_name, kind, pipeline=None, bin_width=3)

Define an additional processing output folder.

Parameters:

folder_name (str) – Name of the output folder (e.g. "preprocessed_sqrt").
kind (str) – Either "preprocessed" or "binned".
pipeline (PreprocessingPipeline or None) – Pipeline to apply. None uses the default.
bin_width (int or float) – Bin width in Daltons (only used when kind="binned").

folder_name: str

kind: str

pipeline: PreprocessingPipeline | None = None

bin_width: int | float = 3

to_dict()

Serialize to a dictionary.

Return type: dict

classmethod from_dict(d)

Reconstruct from a dictionary.

Parameters: d (dict)
Return type: ProcessingHandler

__init__(folder_name, kind, pipeline=None, bin_width=3)

Parameters:

folder_name (str)
kind (str)
pipeline (PreprocessingPipeline | None)
bin_width (int | float)

Return type:

None

BuildReport

class maldiamrkit.data.BuildReport(total, succeeded, failed, failed_ids, output_dir, folders_created)

Summary of a dataset build.

Variables:

total (int) – Number of spectra attempted.
succeeded (int) – Number successfully processed.
failed (int) – Number that failed.
failed_ids (list of str) – IDs of spectra that failed.
output_dir (Path) – Root of the output dataset.
folders_created (list of str) – Names of all processing folders created.

Parameters:

total (int)
succeeded (int)
failed (int)
failed_ids (list[str])
output_dir (Path)
folders_created (list[str])

__init__(total, succeeded, failed, failed_ids, output_dir, folders_created)

Parameters:

total (int)
succeeded (int)
failed (int)
failed_ids (list[str])
output_dir (Path)
folders_created (list[str])

Return type:

None

Duplicate Handling

class maldiamrkit.data.DuplicateStrategy(value)

Bases: str, Enum

Strategy for handling duplicate spectrum identifiers.

Variables:

first (str) – Keep the first occurrence of each duplicate ID.
last (str) – Keep the last occurrence of each duplicate ID.
drop (str) – Remove all rows whose ID appears more than once.
keep_all (str) – Retain every replicate, disambiguating IDs with a suffix (e.g. _rep1, _rep2 or the target-position value).
average (str) – Keep all replicates for downstream averaging. Adds an _original_id column so that loaders / transformers can group replicates and average their spectra.

first = 'first'

last = 'last'

drop = 'drop'

keep_all = 'keep_all'

average = 'average'
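
To make the five strategies concrete, the sketch below applies each one to a list of metadata rows held as plain dictionaries. It is an illustration under stated assumptions, not the library's implementation: apply_strategy is a hypothetical helper, the _repN numbering starts at 1, "last" keeps the position of the first occurrence in the output order, and "average" only tags rows without renaming IDs.

```python
from collections import Counter


def apply_strategy(rows, strategy, id_key="ID"):
    """Return a new list of row dicts with duplicates handled per strategy."""
    counts = Counter(row[id_key] for row in rows)
    if strategy == "first":
        seen, out = set(), []
        for row in rows:
            if row[id_key] not in seen:
                seen.add(row[id_key])
                out.append(dict(row))
        return out
    if strategy == "last":
        # Later rows overwrite earlier ones that share the same ID.
        latest = {row[id_key]: dict(row) for row in rows}
        return list(latest.values())
    if strategy == "drop":
        # Remove every row whose ID appears more than once.
        return [dict(row) for row in rows if counts[row[id_key]] == 1]
    if strategy == "keep_all":
        seen, out = Counter(), []
        for row in rows:
            new = dict(row)
            if counts[row[id_key]] > 1:
                seen[row[id_key]] += 1
                new[id_key] = f"{row[id_key]}_rep{seen[row[id_key]]}"
            out.append(new)
        return out
    if strategy == "average":
        # Keep all rows and tag them so replicates can be grouped later.
        return [dict(row, _original_id=row[id_key]) for row in rows]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For rows with IDs a, a, b, the "keep_all" branch yields a_rep1, a_rep2, b, while "drop" keeps only b.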

Dataset Manifest

Every dataset produced by DatasetBuilder carries a self-describing site_info.json manifest at its root, so it can be re-opened without external knowledge. Downstream layouts (notably DRIAMSLayout) consult it at load time to pre-fill unspecified constructor kwargs.

class maldiamrkit.data.SiteInfo(id_column, metadata_dir, metadata_suffix, spectrum_ext, spectra_folders, mz_range, bin_width, build_info=None, format_version=1)

Top-level dataset manifest.

Parameters:

id_column (str) – Loader-relevant setting; pre-fills the id_column kwarg of DRIAMSLayout.
metadata_dir (str) – Loader-relevant setting; pre-fills the metadata_dir kwarg of DRIAMSLayout.
metadata_suffix (str) – Loader-relevant setting; pre-fills the metadata_suffix kwarg of DRIAMSLayout.
spectrum_ext (str) – Loader-relevant setting; pre-fills the spectrum_ext kwarg of DRIAMSLayout.
spectra_folders (list[str]) – Sub-directories under the dataset root that contain spectra (e.g. ["raw", "preprocessed", "binned_6000"]).
mz_range (tuple[float, float]) – (mz_min, mz_max) used at build time.
bin_width (float) – Bin width in Daltons used at build time.
build_info (BuildInfo, optional) – Optional provenance block.
format_version (int, optional) – Manifest schema version; defaults to the current release’s schema version.

id_column: str

metadata_dir: str

metadata_suffix: str

spectrum_ext: str

spectra_folders: list[str]

mz_range: tuple[float, float]

bin_width: float

build_info: BuildInfo | None = None

format_version: int = 1

to_dict()

Serialise as a plain dict, with format_version first.

Return type: dict[str, Any]

__init__(id_column, metadata_dir, metadata_suffix, spectrum_ext, spectra_folders, mz_range, bin_width, build_info=None, format_version=1)

Parameters:

id_column (str)
metadata_dir (str)
metadata_suffix (str)
spectrum_ext (str)
spectra_folders (list[str])
mz_range (tuple[float, float])
bin_width (float)
build_info (BuildInfo | None)
format_version (int)

Return type:

None

class maldiamrkit.data.BuildInfo(maldiamrkit_version=None, created_at=None, source_layout=None, duplicate_strategy=None, n_total_spectra=None, n_succeeded=None, n_failed=None)

Optional provenance block nested under build_info.

Informational only; readers may inspect it but are not required to interpret any field. All fields are optional.

Parameters:

maldiamrkit_version (str | None)
created_at (str | None)
source_layout (str | None)
duplicate_strategy (str | None)
n_total_spectra (int | None)
n_succeeded (int | None)
n_failed (int | None)

maldiamrkit_version: str | None = None

created_at: str | None = None

source_layout: str | None = None

duplicate_strategy: str | None = None

n_total_spectra: int | None = None

n_succeeded: int | None = None

n_failed: int | None = None

to_dict()

Serialise as a plain dict, omitting fields whose value is None.

Return type: dict[str, Any]
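
The "omit None values" behaviour is easy to get subtly wrong (a truthiness test would also drop zeros), so a minimal sketch may help. ProvenanceSketch is a hypothetical stand-in with only three of the fields; whether BuildInfo itself is implemented as a dataclass is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProvenanceSketch:
    """Stand-in for an all-optional provenance block."""

    maldiamrkit_version: Optional[str] = None
    created_at: Optional[str] = None
    n_failed: Optional[int] = None

    def to_dict(self):
        # Compare against None explicitly so that 0 and "" are kept.
        return {k: v for k, v in asdict(self).items() if v is not None}
```

With this definition, ProvenanceSketch(n_failed=0).to_dict() keeps the zero count and drops the two unset fields.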

__init__(maldiamrkit_version=None, created_at=None, source_layout=None, duplicate_strategy=None, n_total_spectra=None, n_succeeded=None, n_failed=None)

Parameters:

maldiamrkit_version (str | None)
created_at (str | None)
source_layout (str | None)
duplicate_strategy (str | None)
n_total_spectra (int | None)
n_succeeded (int | None)
n_failed (int | None)

Return type:

None

maldiamrkit.data.read_site_info(dataset_dir, *, missing_ok=True)

Read <dataset_dir>/site_info.json if present.

Parameters:

dataset_dir (str or Path) – Dataset root directory.
missing_ok (bool, default=True) – When True and the manifest does not exist, return None. When False, raise FileNotFoundError.

Returns:

Parsed manifest, or None if absent and missing_ok=True.

Return type:

SiteInfo or None

Raises:

FileNotFoundError – If the manifest is absent and missing_ok=False.
ValueError – If the manifest is malformed, missing a required field, or has a non-integer format_version.

maldiamrkit.data.write_site_info(dataset_dir, site_info)

Write a SiteInfo to <dataset_dir>/site_info.json.

Parameters:

dataset_dir (str or Path) – Dataset root directory. Must exist.
site_info (SiteInfo) – Manifest contents.

Returns:

Path to the written manifest.

Return type:

Path
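
The read/write contract documented above (return None or raise when the file is absent, raise ValueError on a malformed manifest, return the written path) can be mimicked with the standard library alone. The functions write_manifest and read_manifest below are hypothetical stand-ins that work on plain dictionaries instead of SiteInfo objects, and they validate only format_version rather than every required field.

```python
import json
from pathlib import Path

MANIFEST_NAME = "site_info.json"


def write_manifest(dataset_dir, fields):
    """Write fields as JSON with format_version first; return the file path."""
    path = Path(dataset_dir) / MANIFEST_NAME
    payload = {"format_version": 1, **fields}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def read_manifest(dataset_dir, *, missing_ok=True):
    """Return the parsed manifest dict, or None if absent and missing_ok."""
    path = Path(dataset_dir) / MANIFEST_NAME
    if not path.exists():
        if missing_ok:
            return None
        raise FileNotFoundError(path)
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed manifest: {path}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("format_version"), int):
        raise ValueError("format_version must be an integer")
    return data
```

Returning None by default lets a loader treat the manifest as optional, which matches how datasets without a site_info.json fall back to library defaults.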