{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8c8ba933",
   "metadata": {},
   "source": "# MaldiAMRKit - Susceptibility & MIC regression\n\nThis notebook walks through the `maldiamrkit.susceptibility` submodule introduced in v0.15, and its regression-evaluation counterpart in `maldiamrkit.evaluation`:\n\n* `MICEncoder` - turns raw MIC strings into a tidy DataFrame with `log2_mic`, a censoring mask, and (when given a `BreakpointTable`) the clinical S/I/R category plus the ATU flag.\n* `BreakpointTable` - clinical breakpoint table loaded from bundled EUCAST YAMLs or user-supplied files.\n* `mic_regression_report` - regression-style evaluation with essential agreement (within 1 dilution) and (with breakpoints) clinical categorical agreement. Lives in `maldiamrkit.evaluation` since it complements `amr_classification_report`.\n\nEverything below runs on a small synthetic dataset and the bundled `example.yaml` to keep the notebook self-contained. For real clinical work, drop in a vendored EUCAST YAML produced by the gitignored `eucast_converter/` tooling."
  },
  {
   "cell_type": "markdown",
   "id": "f65e8123",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d381582f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:08.935830Z",
     "iopub.status.busy": "2026-05-13T17:45:08.935732Z",
     "iopub.status.idle": "2026-05-13T17:45:09.483904Z",
     "shell.execute_reply": "2026-05-13T17:45:09.483196Z"
    }
   },
   "outputs": [],
   "source": [
    "from importlib import resources\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from maldiamrkit.evaluation import mic_regression_report\n",
    "from maldiamrkit.susceptibility import BreakpointTable, MICEncoder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef98d596",
   "metadata": {},
   "source": [
    "## Loading a breakpoint table\n",
    "\n",
    "`BreakpointTable` has four constructors:\n",
    "\n",
    "* `BreakpointTable.from_yaml(path)` - load any YAML file in the canonical schema.\n",
    "* `BreakpointTable.from_version(\"16.0\")` - load a vendored EUCAST version.\n",
    "* `BreakpointTable.from_year(2026)` - look up by publication year via the bundled manifest.\n",
    "* `BreakpointTable.from_latest()` - return the highest-numbered bundled version.\n",
    "\n",
    "`BreakpointTable.list_available()` reports which EUCAST versions ship with the current install."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d80bf943",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:09.486276Z",
     "iopub.status.busy": "2026-05-13T17:45:09.485966Z",
     "iopub.status.idle": "2026-05-13T17:45:09.493030Z",
     "shell.execute_reply": "2026-05-13T17:45:09.492533Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bundled EUCAST versions on this install: ['1.0', '1.1', '1.2', '1.3', '2.0', '4.0', '5.0', '6.0', '7.1', '8.0', '8.1', '9.0', '10.0', '11.0', '12.0', '13.1', '14.0', '15.0', '16.0']\n",
      "BreakpointTable(EXAMPLE v0.0, 5 rows)\n",
      "Species: ['Escherichia coli', 'Klebsiella pneumoniae']\n",
      "Drugs: ['Ceftriaxone', 'Ciprofloxacin', 'Meropenem', 'Piperacillin-tazobactam']\n"
     ]
    }
   ],
   "source": [
    "available = BreakpointTable.list_available()\n",
    "print(f\"Bundled EUCAST versions on this install: {available or '[none yet]'}\")\n",
    "\n",
    "# Locate the synthetic example table bundled inside the installed package.\n",
    "example_path = resources.files(\"maldiamrkit\") / \"data\" / \"breakpoints\" / \"example.yaml\"\n",
    "bp = BreakpointTable.from_yaml(example_path)\n",
    "print(bp)\n",
    "print(\"Species:\", bp.species())\n",
    "print(\"Drugs:\", bp.drugs())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c7b9d8b",
   "metadata": {},
   "source": [
    "### Categorising a single MIC value\n",
    "\n",
    "`bp.apply(species, drug, mic)` returns a `BreakpointResult` with three fields:\n",
    "\n",
    "* `category` - `\"S\"`, `\"I\"`, `\"R\"`, or `None` if the lookup failed.\n",
    "* `atu` - True when the MIC sits in the species/drug ATU (Area of Technical Uncertainty) range. This is an *assay-quality flag*, not an additional clinical category.\n",
    "* `source` - provenance, e.g. `\"EUCAST v16.0\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cfb4d72b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:09.494195Z",
     "iopub.status.busy": "2026-05-13T17:45:09.494105Z",
     "iopub.status.idle": "2026-05-13T17:45:09.496300Z",
     "shell.execute_reply": "2026-05-13T17:45:09.495946Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MIC= 0.25 mg/L -> category=S     atu=False\n",
      "MIC=  1.0 mg/L -> category=S     atu=False\n",
      "MIC=  2.0 mg/L -> category=S     atu=False\n",
      "MIC=  4.0 mg/L -> category=I     atu=True\n",
      "MIC= 16.0 mg/L -> category=R     atu=False\n"
     ]
    }
   ],
   "source": [
    "for mic in (0.25, 1.0, 2.0, 4.0, 16.0):\n",
    "    r = bp.apply(\"Klebsiella pneumoniae\", \"Meropenem\", mic=mic)\n",
    "    print(f\"MIC={mic:>5} mg/L -> category={r.category!s:<5} atu={r.atu}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48509d28",
   "metadata": {},
   "source": [
    "Modern EUCAST treats `I` as *Susceptible, Increased exposure* (a real, treatable category) - not as \"uncertain\". ATU is the assay-quality flag that runs alongside S/I/R: a Meropenem MIC of 4 here is still clinically `I`, but the ATU flag tells you the call sits in a zone where assay variability can flip it. Treat ATU-flagged results as \"investigate further\" rather than discarding them."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "483f958d",
   "metadata": {},
   "source": [
    "## `MICEncoder` - parse MIC strings into ML targets\n",
    "\n",
    "`MICEncoder` is an sklearn-style transformer. Given a DataFrame with a MIC column, it produces:\n",
    "\n",
    "* `log2_mic` - regression target: log2 of the numeric MIC; for censored entries, log2 of the stated bound (e.g. `<=0.25` becomes -2.0).\n",
    "* `censored` - True where the source MIC carried a `<=`, `<`, `>=`, or `>` qualifier.\n",
    "* `category`, `atu`, `source` - populated only when a `BreakpointTable` is supplied; otherwise `<NA>`.\n",
    "\n",
    "Without breakpoints, only the regression columns are filled in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9bf09805",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:09.498072Z",
     "iopub.status.busy": "2026-05-13T17:45:09.497891Z",
     "iopub.status.idle": "2026-05-13T17:45:09.505827Z",
     "shell.execute_reply": "2026-05-13T17:45:09.505275Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>log2_mic</th>\n",
       "      <th>censored</th>\n",
       "      <th>category</th>\n",
       "      <th>atu</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2.0</td>\n",
       "      <td>True</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1.0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "      <td>True</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   log2_mic  censored category   atu source\n",
       "0      -2.0      True     <NA>  <NA>   <NA>\n",
       "1      -1.0     False     <NA>  <NA>   <NA>\n",
       "2       0.0     False     <NA>  <NA>   <NA>\n",
       "3       1.0     False     <NA>  <NA>   <NA>\n",
       "4       2.0     False     <NA>  <NA>   <NA>\n",
       "5       4.0      True     <NA>  <NA>   <NA>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"Species\": [\"Klebsiella pneumoniae\"] * 6,\n",
    "        \"Drug\": [\"Ceftriaxone\"] * 6,\n",
    "        \"MIC\": [\"<=0.25\", \"0.5\", \"1\", \"2\", \"4\", \">16\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "enc = MICEncoder(mic_col=\"MIC\")\n",
    "out = enc.fit_transform(df)\n",
    "out"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aea48231",
   "metadata": {},
   "source": [
    "Wire in a `BreakpointTable` and the same call also categorises each row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8edc5a2a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:09.507004Z",
     "iopub.status.busy": "2026-05-13T17:45:09.506895Z",
     "iopub.status.idle": "2026-05-13T17:45:09.512760Z",
     "shell.execute_reply": "2026-05-13T17:45:09.512286Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>log2_mic</th>\n",
       "      <th>censored</th>\n",
       "      <th>category</th>\n",
       "      <th>atu</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2.0</td>\n",
       "      <td>True</td>\n",
       "      <td>S</td>\n",
       "      <td>False</td>\n",
       "      <td>MaldiAMRKit synthetic example (NOT FOR CLINICA...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1.0</td>\n",
       "      <td>False</td>\n",
       "      <td>S</td>\n",
       "      <td>False</td>\n",
       "      <td>MaldiAMRKit synthetic example (NOT FOR CLINICA...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>False</td>\n",
       "      <td>S</td>\n",
       "      <td>False</td>\n",
       "      <td>MaldiAMRKit synthetic example (NOT FOR CLINICA...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>False</td>\n",
       "      <td>MaldiAMRKit synthetic example (NOT FOR CLINICA...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>False</td>\n",
       "      <td>R</td>\n",
       "      <td>False</td>\n",
       "      <td>MaldiAMRKit synthetic example (NOT FOR CLINICA...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "      <td>True</td>\n",
       "      <td>R</td>\n",
       "      <td>False</td>\n",
       "      <td>MaldiAMRKit synthetic example (NOT FOR CLINICA...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   log2_mic  censored category    atu  \\\n",
       "0      -2.0      True        S  False   \n",
       "1      -1.0     False        S  False   \n",
       "2       0.0     False        S  False   \n",
       "3       1.0     False        I  False   \n",
       "4       2.0     False        R  False   \n",
       "5       4.0      True        R  False   \n",
       "\n",
       "                                              source  \n",
       "0  MaldiAMRKit synthetic example (NOT FOR CLINICA...  \n",
       "1  MaldiAMRKit synthetic example (NOT FOR CLINICA...  \n",
       "2  MaldiAMRKit synthetic example (NOT FOR CLINICA...  \n",
       "3  MaldiAMRKit synthetic example (NOT FOR CLINICA...  \n",
       "4  MaldiAMRKit synthetic example (NOT FOR CLINICA...  \n",
       "5  MaldiAMRKit synthetic example (NOT FOR CLINICA...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "enc = MICEncoder(\n",
    "    breakpoints=bp,\n",
    "    mic_col=\"MIC\",\n",
    "    species_col=\"Species\",\n",
    "    drug_col=\"Drug\",\n",
    ")\n",
    "out = enc.fit_transform(df)\n",
    "out"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51f4b3a3",
   "metadata": {},
   "source": [
    "`MICEncoder` is sklearn-compatible: `fit`, `transform`, `fit_transform`, and `get_feature_names_out` work as expected, so it slots into `Pipeline` and `ColumnTransformer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "691a1e5a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:09.513742Z",
     "iopub.status.busy": "2026-05-13T17:45:09.513645Z",
     "iopub.status.idle": "2026-05-13T17:45:09.516416Z",
     "shell.execute_reply": "2026-05-13T17:45:09.516055Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['log2_mic', 'censored', 'category', 'atu', 'source']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "enc.get_feature_names_out().tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82f1c84e",
   "metadata": {},
   "source": [
    "## `mic_regression_report` - clinically-grounded evaluation\n",
    "\n",
    "When a model predicts continuous MIC values, standard regression metrics alone do not capture what clinicians care about, so the report pairs them with agreement metrics:\n",
    "\n",
    "* `rmse_log2` / `mae_log2` / `bias_log2` - standard regression diagnostics on the log2 scale (one dilution = 1 log2 unit).\n",
    "* `essential_agreement` - fraction of predictions within +/- 1 log2 dilution. The clinical benchmark for MIC prediction.\n",
    "* When breakpoints are supplied, the report also re-bins both `y_true` and `y_pred` to S/I/R and reports clinical categorical agreement, very-major-error rate (R predicted as S), and major-error rate (S predicted as R)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "85cc1058",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-13T17:45:09.517619Z",
     "iopub.status.busy": "2026-05-13T17:45:09.517523Z",
     "iopub.status.idle": "2026-05-13T17:45:09.521731Z",
     "shell.execute_reply": "2026-05-13T17:45:09.521220Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                     n: 7\n",
      "             rmse_log2: 0.3637316429270602\n",
      "              mae_log2: 0.2746647707298642\n",
      "             bias_log2: 0.16018918733802207\n",
      "   essential_agreement: 1.0\n",
      " categorical_agreement: 0.7142857142857143\n",
      " very_major_error_rate: 0.0\n",
      "      major_error_rate: 0.0\n",
      "         n_categorical: 7\n",
      "      n_resistant_true: 3\n",
      "    n_susceptible_true: 3\n"
     ]
    }
   ],
   "source": [
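    "# Seeded generator so the noisy predictions are reproducible.\n",
    "rng = np.random.default_rng(seed=0)\n",
    "# Synthetic ground truth: a two-fold dilution series from 0.25 to 16 mg/L.\n",
    "y_true_mic = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])\n",
    "y_true = np.log2(y_true_mic)\n",
    "# Simulated model output: truth plus Gaussian noise (sd = 0.6 log2 units).\n",
    "y_pred = y_true + rng.normal(0.0, 0.6, size=y_true.shape)\n",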
    "\n",
    "report = mic_regression_report(\n",
    "    y_true=y_true,\n",
    "    y_pred=y_pred,\n",
    "    breakpoints=bp,\n",
    "    species=\"Klebsiella pneumoniae\",\n",
    "    drug=\"Ceftriaxone\",\n",
    ")\n",
    "for k, v in report.items():\n",
    "    print(f\"{k:>22}: {v}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}