Evaluation Module

AMR-specific evaluation metrics and stratified splitting utilities, following EUCAST conventions.

Note

LabelEncoder and IntermediateHandling moved to the Susceptibility module in v0.15. Importing them from maldiamrkit.evaluation still works but emits a DeprecationWarning and will be removed in v0.17.

Metrics

maldiamrkit.evaluation.very_major_error_rate(y_true, y_pred, resistant_label=1)

Very Major Error rate: resistant isolates classified as susceptible.

VME = FN / (FN + TP), i.e., the miss rate for resistant samples. This is the most dangerous error type in clinical microbiology, because a missed resistance can lead to treatment with an ineffective antibiotic.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

  • resistant_label (int, default=1) – Label value representing the resistant class.

Returns:

VME rate in [0, 1]. Returns 0.0 if no resistant samples exist.

Return type:

float

Examples

>>> very_major_error_rate([1, 1, 0, 0], [0, 1, 0, 0])
0.5
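
The formula above can be checked by hand. The following is a minimal pure-Python sketch, not the library's implementation; confusion_counts, vme_sketch and me_sketch are hypothetical helper names introduced only for illustration. It treats the resistant class as the positive class and applies the documented zero-denominator convention.

```python
def confusion_counts(y_true, y_pred, resistant_label=1):
    """Return (tp, fn, fp, tn) with the resistant class as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == resistant_label:
            if pred == resistant_label:
                tp += 1  # resistant, called resistant
            else:
                fn += 1  # resistant, called susceptible (very major error)
        else:
            if pred == resistant_label:
                fp += 1  # susceptible, called resistant (major error)
            else:
                tn += 1  # susceptible, called susceptible
    return tp, fn, fp, tn


def vme_sketch(y_true, y_pred, resistant_label=1):
    """VME = FN / (FN + TP); 0.0 when there are no resistant samples."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, resistant_label)
    return fn / (fn + tp) if (fn + tp) else 0.0


def me_sketch(y_true, y_pred, resistant_label=1):
    """ME = FP / (FP + TN); 0.0 when there are no susceptible samples."""
    _, _, fp, tn = confusion_counts(y_true, y_pred, resistant_label)
    return fp / (fp + tn) if (fp + tn) else 0.0


print(vme_sketch([1, 1, 0, 0], [0, 1, 0, 0]))  # 0.5
print(me_sketch([1, 1, 0, 0], [1, 1, 1, 0]))   # 0.5
```

Both printed values match the doctest outputs shown for very_major_error_rate and major_error_rate.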

maldiamrkit.evaluation.major_error_rate(y_true, y_pred, resistant_label=1)

Major Error rate: susceptible isolates classified as resistant.

ME = FP / (FP + TN), i.e., the false alarm rate for susceptible samples.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

  • resistant_label (int, default=1) – Label value representing the resistant class.

Returns:

ME rate in [0, 1]. Returns 0.0 if no susceptible samples exist.

Return type:

float

Examples

>>> major_error_rate([1, 1, 0, 0], [1, 1, 1, 0])
0.5

maldiamrkit.evaluation.sensitivity_score(y_true, y_pred, resistant_label=1)

Sensitivity (recall) for the resistant class.

Sensitivity = TP / (TP + FN) = 1 - VME whenever at least one resistant sample is present.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

  • resistant_label (int, default=1) – Label value representing the resistant class.

Returns:

Sensitivity in [0, 1]. Returns 0.0 if no resistant samples exist.

Return type:

float

maldiamrkit.evaluation.specificity_score(y_true, y_pred, resistant_label=1)

Specificity (true negative rate) for the susceptible class.

Specificity = TN / (TN + FP) = 1 - ME whenever at least one susceptible sample is present.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

  • resistant_label (int, default=1) – Label value representing the resistant class.

Returns:

Specificity in [0, 1]. Returns 0.0 if no susceptible samples exist.

Return type:

float
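
The relationships between the four rates can be verified numerically. This is a hedged sketch rather than library code: rates is a hypothetical helper that recomputes the documented formulas. It also shows the edge case implied by the documented return values: when a class is absent, both the error rate and its complement are reported as 0.0, so they no longer sum to 1.

```python
import math


def rates(y_true, y_pred, resistant_label=1):
    """Recompute VME, ME, sensitivity and specificity from the documented formulas."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == resistant_label and p == resistant_label for t, p in pairs)
    fn = sum(t == resistant_label and p != resistant_label for t, p in pairs)
    fp = sum(t != resistant_label and p == resistant_label for t, p in pairs)
    tn = sum(t != resistant_label and p != resistant_label for t, p in pairs)
    n_res, n_sus = tp + fn, fp + tn
    return {
        "vme": fn / n_res if n_res else 0.0,
        "sensitivity": tp / n_res if n_res else 0.0,
        "me": fp / n_sus if n_sus else 0.0,
        "specificity": tn / n_sus if n_sus else 0.0,
    }


# Both classes present: the complements hold.
r = rates([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(math.isclose(r["sensitivity"], 1 - r["vme"]))  # True
print(math.isclose(r["specificity"], 1 - r["me"]))   # True

# No resistant samples: VME and sensitivity are both 0.0 by convention.
edge = rates([0, 0], [0, 1])
print(edge["vme"], edge["sensitivity"])  # 0.0 0.0
```

A sensitivity of 0.0 on a fold with no resistant isolates therefore signals an undefined metric, not a model that missed every resistant case.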

maldiamrkit.evaluation.categorical_agreement(y_true, y_pred)

Categorical agreement (accuracy) as reported in AST studies.

CA = (TP + TN) / N.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

Returns:

Agreement rate in [0, 1].

Return type:

float

maldiamrkit.evaluation.vme_me_curve(y_true, y_score, resistant_label=1)

VME and ME rates at varying decision thresholds.

Useful for selecting an optimal threshold balancing VME against ME.

Parameters:
  • y_true (array-like) – True binary labels.

  • y_score (array-like) – Predicted scores (e.g., probabilities for the resistant class).

  • resistant_label (int, default=1) – Label value representing the resistant class.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

  • vme_rates (np.ndarray) – VME rates at each threshold.

  • me_rates (np.ndarray) – ME rates at each threshold.

  • thresholds (np.ndarray) – Decision thresholds (sorted ascending).
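
To make the threshold trade-off concrete, here is a rough pure-Python sketch of such a curve. It is not the library's implementation: vme_me_curve_sketch is a hypothetical name, the candidate thresholds are taken to be the distinct scores, and a sample is assumed to be called resistant when its score is greater than or equal to the threshold. The library may choose thresholds or handle ties differently.

```python
def vme_me_curve_sketch(y_true, y_score, resistant_label=1):
    """Return (vme_rates, me_rates, thresholds) with thresholds sorted ascending."""
    thresholds = sorted(set(y_score))
    n_res = sum(t == resistant_label for t in y_true)
    n_sus = len(y_true) - n_res
    vme_rates, me_rates = [], []
    for thr in thresholds:
        # Assumed decision rule: resistant if score >= thr.
        fn = sum(t == resistant_label and s < thr for t, s in zip(y_true, y_score))
        fp = sum(t != resistant_label and s >= thr for t, s in zip(y_true, y_score))
        vme_rates.append(fn / n_res if n_res else 0.0)
        me_rates.append(fp / n_sus if n_sus else 0.0)
    return vme_rates, me_rates, thresholds


y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]
vme, me, thr = vme_me_curve_sketch(y_true, y_score)
print(thr)  # [0.1, 0.4, 0.6, 0.9]
print(vme)  # [0.0, 0.0, 0.5, 0.5]
print(me)   # [1.0, 0.5, 0.5, 0.0]

# One possible selection rule: lowest ME among thresholds that keep VME under a cap.
max_vme = 0.0
candidates = [(m, t) for v, m, t in zip(vme, me, thr) if v <= max_vme]
best_me, best_thr = min(candidates)
print(best_thr, best_me)  # 0.4 0.5
```

Raising the threshold lowers ME and raises VME, so capping VME first and then minimizing ME reflects the clinical priority given to very major errors.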

maldiamrkit.evaluation.amr_classification_report(y_true, y_pred, resistant_label=1)

Full AMR classification report.

Returns all clinical metrics in a single dictionary.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

  • resistant_label (int, default=1) – Label value representing the resistant class.

Returns:

Dictionary with keys: vme, me, sensitivity, specificity, categorical_agreement, n_resistant, n_susceptible, n_total.

Return type:

dict

Examples

>>> report = amr_classification_report([1, 1, 0, 0], [1, 0, 0, 1])
>>> report["vme"]
0.5

maldiamrkit.evaluation.amr_multilabel_report(y_true, y_pred, *, resistant_label=1, as_dataframe=False)

AMR classification report for multiple antibiotics.

Computes per-drug VME, ME, sensitivity, specificity, and categorical agreement, plus a macro-average across all drugs.

Parameters:
  • y_true (pd.DataFrame) – True binary labels with one column per antibiotic.

  • y_pred (pd.DataFrame) – Predicted binary labels with matching columns.

  • resistant_label (int, default=1) – Label value representing the resistant class.

  • as_dataframe (bool, default=False) – If True, return a DataFrame instead of a nested dict.

Returns:

Per-drug metrics plus a "macro_avg" entry. When as_dataframe is True, rows are drugs + "macro_avg" and columns are metric names.

Return type:

dict or pd.DataFrame

Examples

>>> report = amr_multilabel_report(y_true, y_pred, as_dataframe=True)
>>> report.loc["macro_avg", "vme"]
0.15
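
As an illustration of the "macro_avg" entry, the sketch below assumes the usual meaning of a macro-average: the unweighted mean of the per-drug values, so every drug counts equally regardless of how many isolates it has. Plain dicts of lists stand in for the DataFrames, the drug names and labels are made up, and vme is a hypothetical helper, not the library function.

```python
from statistics import mean


def vme(y_true, y_pred, resistant_label=1):
    """VME = FN / (FN + TP); 0.0 when there are no resistant samples."""
    fn = sum(t == resistant_label and p != resistant_label for t, p in zip(y_true, y_pred))
    n_res = sum(t == resistant_label for t in y_true)
    return fn / n_res if n_res else 0.0


# One list of labels per antibiotic (illustrative data only).
y_true = {"ciprofloxacin": [1, 1, 0, 0], "ceftriaxone": [1, 1, 1, 1, 0]}
y_pred = {"ciprofloxacin": [1, 0, 0, 0], "ceftriaxone": [1, 1, 1, 0, 0]}

per_drug = {drug: vme(y_true[drug], y_pred[drug]) for drug in y_true}
macro_avg = mean(list(per_drug.values()))

print(per_drug)   # {'ciprofloxacin': 0.5, 'ceftriaxone': 0.25}
print(macro_avg)  # 0.375
```

Pooling all isolates instead would give 2 missed out of 6 resistant, about 0.33, which is why a macro-average should not be read as an overall miss rate.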

maldiamrkit.evaluation.mic_regression_report(y_true, y_pred, *, breakpoints=None, species=None, drug=None, sample_weight=None)

Compute MIC regression metrics on log2-MIC predictions.

Parameters:
  • y_true (array-like) – True log2(MIC) values.

  • y_pred (array-like) – Predicted log2(MIC) values.

  • breakpoints (BreakpointTable or None, default=None) – When provided, the report also includes categorical agreement after re-binning both y_true and y_pred to S/I/R. Requires species and drug.

  • species (str or array-like, optional) – Species per sample (or a single species applied to all). Required when breakpoints is provided.

  • drug (str or array-like, optional) – Drug per sample (or a single drug applied to all). Required when breakpoints is provided.

  • sample_weight (array-like, optional) – Per-sample weights for the regression metrics. Ignored for categorical agreement.

Returns:

Keys: n, rmse_log2, mae_log2, bias_log2, essential_agreement (fraction within ±1 dilution), and when breakpoints are provided also categorical_agreement, very_major_error_rate (R predicted as S), major_error_rate (S predicted as R), and per-category sample counts.

Return type:

dict

Notes

“Essential agreement” is the standard clinical benchmark for MIC prediction accuracy: a prediction is essential-agreement-correct if it is within one log2 dilution of the true value.
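
The regression keys can be illustrated with a small unweighted sketch. This is not the library's code: mic_regression_sketch is a hypothetical helper, the bias is assumed to be the mean of (predicted minus true), and the ±1 dilution bound for essential agreement is assumed to be inclusive.

```python
import math


def mic_regression_sketch(y_true, y_pred):
    """Unweighted log2-MIC error metrics; inputs are log2(MIC) values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]  # assumed sign: predicted - true
    n = len(errors)
    return {
        "n": n,
        "rmse_log2": math.sqrt(sum(e * e for e in errors) / n),
        "mae_log2": sum(abs(e) for e in errors) / n,
        "bias_log2": sum(errors) / n,
        # Fraction of predictions within one two-fold dilution of the truth.
        "essential_agreement": sum(abs(e) <= 1 for e in errors) / n,
    }


# Illustrative MICs in mg/L, converted to log2 before scoring.
mic_true = [0.5, 2, 8, 32]
mic_pred = [1, 2, 2, 32]
report = mic_regression_sketch(
    [math.log2(m) for m in mic_true],
    [math.log2(m) for m in mic_pred],
)
print(report["mae_log2"])             # 0.75
print(report["bias_log2"])            # -0.25
print(report["essential_agreement"])  # 0.75
```

Here the prediction of 2 mg/L for a true MIC of 8 mg/L is two dilutions off, so it is the only one that fails essential agreement.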

Sklearn Scorers

Pre-built scorers for use with cross_val_score or GridSearchCV:

maldiamrkit.evaluation.vme_scorer

Scorer based on the VME (Very Major Error rate); model selection with it favors a lower VME. Use with cross_val_score(pipe, X, y, scoring=vme_scorer).

maldiamrkit.evaluation.me_scorer

Scorer based on the ME (Major Error rate); model selection with it favors a lower ME. Use with cross_val_score(pipe, X, y, scoring=me_scorer).
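
scikit-learn scorers follow a higher-is-better convention, so error-rate scorers are commonly built with make_scorer(..., greater_is_better=False), which negates the metric. Whether vme_scorer and me_scorer are built this way is an assumption here, not something this page states. The sketch below constructs a hypothetical equivalent to show what the reported values would look like under that assumption.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score


def vme_rate(y_true, y_pred):
    """VME as 1 - recall of the resistant class (label 1)."""
    return 1.0 - recall_score(y_true, y_pred, pos_label=1)


# Hypothetical stand-in for a VME scorer; greater_is_better=False flips the sign.
hypothetical_vme_scorer = make_scorer(vme_rate, greater_is_better=False)

X = [[0], [1], [2], [3]]
y = [1, 1, 0, 0]

# A classifier that always predicts "susceptible" misses every resistant isolate.
clf = DummyClassifier(strategy="constant", constant=0).fit(X, y)
score = hypothetical_vme_scorer(clf, X, y)
print(score)  # -1.0, i.e. a VME of 1.0 reported with a negative sign
```

If cross_val_score returns negative numbers with these scorers, negating them recovers the error rates under this convention.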

Metrics Example

from maldiamrkit.evaluation import (
    very_major_error_rate, major_error_rate,
    amr_classification_report, vme_scorer,
)
from sklearn.model_selection import cross_val_score

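# Assumes y_true, y_pred, an sklearn estimator or pipeline `pipe`,
# and training data X, y are already defined.

# Individual metrics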
vme = very_major_error_rate(y_true, y_pred)
me = major_error_rate(y_true, y_pred)

# Full report
report = amr_classification_report(y_true, y_pred)

# Use scorer in cross-validation
scores = cross_val_score(pipe, X, y, cv=5, scoring=vme_scorer)

Splitting Utilities

maldiamrkit.evaluation.stratified_species_drug_split(X, y, species, test_size=0.2, random_state=None, min_count=2)

Stratified train/test split preserving species-drug label distributions.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Feature matrix.

  • y (array-like) – Resistance labels.

  • species (array-like) – Species labels aligned with X.

  • test_size (float, default=0.2) – Fraction of samples for the test set.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • min_count (int, default=2) – Minimum samples per species-drug stratum. Smaller groups are merged.

Returns:

X_train, X_test, y_train, y_test – Split data.

Return type:

arrays
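
The role of min_count can be sketched as follows. This is only one plausible reading, not the library's algorithm: the stratum key is assumed to combine the species with the resistance label, and strata smaller than min_count are assumed to be pooled into a single shared bucket. build_strata and the "other" bucket name are hypothetical.

```python
from collections import Counter


def build_strata(species, y, min_count=2):
    """Build one stratification key per sample, pooling rare strata."""
    keys = [f"{s}|{label}" for s, label in zip(species, y)]
    counts = Counter(keys)
    # Assumption: strata with fewer than min_count samples share one bucket,
    # because a stratified split needs at least two members per stratum.
    return [k if counts[k] >= min_count else "other" for k in keys]


species = ["E. coli", "E. coli", "E. coli", "K. pneumoniae", "K. pneumoniae", "S. aureus"]
y = [1, 1, 0, 0, 0, 1]
strata = build_strata(species, y)
print(strata)
# ['E. coli|1', 'E. coli|1', 'other', 'K. pneumoniae|0', 'K. pneumoniae|0', 'other']
```

Keys like these could then be passed as the stratify argument of a standard stratified splitter.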

maldiamrkit.evaluation.case_based_split(X, y, case_ids, test_size=0.2, random_state=None)

Train/test split keeping all samples from the same patient together.

Prevents data leakage from having the same patient in both train and test.

Parameters:
  • X (pd.DataFrame or np.ndarray) – Feature matrix.

  • y (array-like) – Resistance labels.

  • case_ids (array-like) – Patient/case identifiers aligned with X.

  • test_size (float, default=0.2) – Fraction of groups for the test set.

  • random_state (int or None, default=None) – Random seed for reproducibility.

Returns:

X_train, X_test, y_train, y_test – Split data.

Return type:

arrays
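
The grouping idea can be sketched without any dependencies. The helper below is a hypothetical illustration, not the library's implementation: it shuffles the unique case identifiers, assigns a fraction of the cases (not of the samples) to the test side, and returns index lists instead of split arrays.

```python
import random


def case_based_split_sketch(case_ids, test_size=0.2, random_state=None):
    """Return (train_idx, test_idx) with every case on exactly one side."""
    unique_cases = sorted(set(case_ids))
    rng = random.Random(random_state)
    rng.shuffle(unique_cases)
    # test_size is a fraction of cases; keep at least one case for testing.
    n_test = max(1, round(test_size * len(unique_cases)))
    test_cases = set(unique_cases[:n_test])
    train_idx = [i for i, c in enumerate(case_ids) if c not in test_cases]
    test_idx = [i for i, c in enumerate(case_ids) if c in test_cases]
    return train_idx, test_idx


case_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
train_idx, test_idx = case_based_split_sketch(case_ids, test_size=0.2, random_state=0)

train_cases = {case_ids[i] for i in train_idx}
test_cases = {case_ids[i] for i in test_idx}
print(train_cases.isdisjoint(test_cases))  # True: no patient appears on both sides
```

Because cases contribute different numbers of spectra, the share of samples in the test set can differ from test_size.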

class maldiamrkit.evaluation.SpeciesDrugStratifiedKFold(n_splits=5, shuffle=True, random_state=None, min_count=2)

Bases: object

K-fold cross-validation with species-drug stratification.

Ensures each fold preserves the distribution of species-drug combinations. Implements the sklearn splitter interface.

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • shuffle (bool, default=True) – Whether to shuffle before splitting.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • min_count (int, default=2) – Minimum samples per stratum before merging.

Examples

>>> cv = SpeciesDrugStratifiedKFold(n_splits=5)
>>> for train_idx, test_idx in cv.split(X, y, species=species):
...     X_train, X_test = X[train_idx], X[test_idx]

__init__(n_splits=5, shuffle=True, random_state=None, min_count=2)

get_n_splits(X=None, y=None, groups=None)

Return the number of splits.

Return type:

int

split(X, y, species=None, groups=None)

Generate train/test indices for each fold.

Parameters:
  • X (array-like) – Feature matrix.

  • y (array-like) – Resistance labels.

  • species (array-like) – Species labels. If None, falls back to plain stratified KFold.

  • groups (ignored) – Not used, present for API compatibility.

Yields:

train_idx, test_idx (np.ndarray) – Indices for train and test sets.

Return type:

Iterator[tuple[ndarray, ndarray]]

class maldiamrkit.evaluation.CaseGroupedKFold(n_splits=5, shuffle=True, random_state=None)

Bases: object

K-fold cross-validation keeping patient cases together and stratified by y.

All samples from the same case/patient are always in the same fold, and folds are stratified on the resistance label to preserve class balance. Wraps sklearn.model_selection.StratifiedGroupKFold.

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • shuffle (bool, default=True) – Whether to shuffle group order before splitting.

  • random_state (int or None, default=None) – Random seed (used only when shuffle=True).

Examples

>>> cv = CaseGroupedKFold(n_splits=5)
>>> for train_idx, test_idx in cv.split(X, y, groups=case_ids):
...     X_train, X_test = X[train_idx], X[test_idx]

__init__(n_splits=5, shuffle=True, random_state=None)

get_n_splits(X=None, y=None, groups=None)

Return the number of splits.

Return type:

int

split(X, y=None, groups=None)

Generate stratified, group-preserving train/test indices for each fold.

Parameters:
  • X (array-like) – Feature matrix.

  • y (array-like) – Resistance labels. Required for stratification.

  • groups (array-like) – Case/patient identifiers. Required.

Yields:

train_idx, test_idx (np.ndarray) – Indices for train and test sets.

Raises:

ValueError – If groups or y is None.

Return type:

Iterator[tuple[ndarray, ndarray]]

Splitting Example

from maldiamrkit.evaluation import (
    stratified_species_drug_split,
    case_based_split,
    SpeciesDrugStratifiedKFold,
    CaseGroupedKFold,
)

# Single split preserving species-drug distributions
X_train, X_test, y_train, y_test = stratified_species_drug_split(
    X, y, species=species_labels, test_size=0.2, random_state=42
)

# Patient-grouped split
X_train, X_test, y_train, y_test = case_based_split(
    X, y, case_ids=patient_ids, test_size=0.2
)

# Sklearn-compatible CV splitters
cv = SpeciesDrugStratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, species=species_labels):
    pass

cv = CaseGroupedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=patient_ids):
    pass

Multi-Drug Evaluation

For predicting resistance to multiple antibiotics simultaneously:

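import pandas as pd
from maldiamrkit.susceptibility import LabelEncoder
from maldiamrkit.evaluation import amr_multilabel_report
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

# Encode multi-drug labels (intermediate -> NaN)
enc = LabelEncoder(intermediate="nan")
y_encoded = enc.fit_transform(data.y)  # DataFrame with one column per drug

# scikit-learn classifiers reject NaN targets, so keep only isolates
# labelled for every drug. Assumption: X is a feature DataFrame that
# shares its index with y_encoded.
y_complete = y_encoded.dropna()
X_complete = X.loc[y_complete.index]
X_train, X_test, y_train, y_test = train_test_split(
    X_complete, y_complete, test_size=0.2, random_state=42
)

# Train multi-output model
clf = MultiOutputClassifier(RandomForestClassifier())
clf.fit(X_train, y_train)

# predict() returns an ndarray; restore the drug columns for the report
y_pred = pd.DataFrame(
    clf.predict(X_test), index=y_test.index, columns=y_test.columns
)

# Per-drug AMR report
report = amr_multilabel_report(y_test, y_pred, as_dataframe=True)
print(report)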