{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MaldiAMRKit - Peak Detection\n",
    "\n",
    "This notebook covers peak detection methods including local maxima and persistent homology."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import StratifiedKFold, cross_val_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nfrom maldiamrkit import MaldiSet\nfrom maldiamrkit.detection import MaldiPeakDetector\nfrom maldiamrkit.visualization import plot_peaks"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Features shape: (29, 6000)\n",
      "Labels shape: (29,)\n"
     ]
    }
   ],
   "source": [
    "data = MaldiSet.from_directory(\n",
    "    \"../data/\",\n",
    "    \"../data/metadata/metadata.csv\",\n",
    "    aggregate_by=dict(antibiotics=\"Drug\"),\n",
    ")\n",
    "X = data.X\n",
    "y = data.y[\"Drug\"].map({\"S\": 0, \"I\": 1, \"R\": 1})\n",
    "\n",
    "print(f\"Features shape: {X.shape}\")\n",
    "print(f\"Labels shape: {y.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Peak Detection with Local Maxima\n",
    "\n",
    "The `MaldiPeakDetector` uses local maxima detection by default. It's fast and works well for most cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CV ROC AUC: 0.417 +/- 0.260\n"
     ]
    }
   ],
   "source": [
    "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
    "\n",
    "pipe = Pipeline(\n",
    "    [\n",
    "        (\"peaks\", MaldiPeakDetector(binary=False, prominence=1e-8)),\n",
    "        (\"scaler\", StandardScaler()),\n",
    "        (\"clf\", LogisticRegression()),\n",
    "    ]\n",
    ")\n",
    "\n",
    "scores = cross_val_score(pipe, X, y, cv=cv, scoring=\"roc_auc\")\n",
    "print(f\"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Peak Detection with Persistent Homology\n",
    "\n",
    "Persistent homology (`method=\"ph\"`) is a topological approach that can better handle noise. It's slower but often more robust.\n",
    "\n",
    "**Note:** Requires the `gudhi` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>n_peaks</th>\n",
       "      <th>mean_intensity</th>\n",
       "      <th>max_intensity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10s</th>\n",
       "      <td>40</td>\n",
       "      <td>0.001594</td>\n",
       "      <td>0.005164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11s</th>\n",
       "      <td>43</td>\n",
       "      <td>0.001695</td>\n",
       "      <td>0.006086</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12s</th>\n",
       "      <td>48</td>\n",
       "      <td>0.001844</td>\n",
       "      <td>0.007017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13s</th>\n",
       "      <td>56</td>\n",
       "      <td>0.001946</td>\n",
       "      <td>0.007615</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14s</th>\n",
       "      <td>49</td>\n",
       "      <td>0.001784</td>\n",
       "      <td>0.005047</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     n_peaks  mean_intensity  max_intensity\n",
       "10s       40        0.001594       0.005164\n",
       "11s       43        0.001695       0.006086\n",
       "12s       48        0.001844       0.007017\n",
       "13s       56        0.001946       0.007615\n",
       "14s       49        0.001784       0.005047"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "peak_detector = MaldiPeakDetector(\n",
    "    method=\"ph\",\n",
    "    binary=False,\n",
    "    persistence_threshold=5e-4,\n",
    ")\n",
    "\n",
    "# Fit on a subset and get statistics\n",
    "peak_detector.fit(X.iloc[:5])\n",
    "peak_detector.get_peak_statistics(X.iloc[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize Detected Peaks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "_ = plot_peaks(peak_detector, X.iloc[:1])"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Binary vs Intensity Mode\n",
    "\n",
    "The `binary` parameter controls whether peak positions are returned as binary (0/1) or preserve original intensities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Binary mode unique values: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0.]\n",
      "Intensity mode - non-zero values: [0.0006844  0.00347906 0.00088132 0.00204877 0.00098946 0.00162945\n",
      " 0.00086379 0.00157396 0.00082802 0.00031478]\n"
     ]
    }
   ],
   "source": [
    "# Binary mode - peaks are marked as 1, non-peaks as 0\n",
    "detector_binary = MaldiPeakDetector(binary=True, prominence=1e-4)\n",
    "peaks_binary = detector_binary.fit_transform(X.iloc[:1])\n",
    "print(f\"Binary mode unique values: {peaks_binary.values.flatten()[:20]}\")\n",
    "\n",
    "# Intensity mode - preserves original intensities at peak positions\n",
    "detector_intensity = MaldiPeakDetector(binary=False, prominence=1e-4)\n",
    "peaks_intensity = detector_intensity.fit_transform(X.iloc[:1])\n",
    "print(\n",
    "    f\"Intensity mode - non-zero values: {peaks_intensity.values[peaks_intensity.values > 0][:10]}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parallelization\n",
    "\n",
    "Use `n_jobs` parameter to enable parallel processing for faster computation on large datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processed 29 spectra\n"
     ]
    }
   ],
   "source": [
    "# Sequential processing\n",
    "detector_seq = MaldiPeakDetector(prominence=1e-4, n_jobs=1)\n",
    "\n",
    "# Parallel processing (use all cores)\n",
    "detector_par = MaldiPeakDetector(prominence=1e-4, n_jobs=-1)\n",
    "\n",
    "# Both produce the same results\n",
    "result = detector_par.fit_transform(X)\n",
    "print(f\"Processed {len(X)} spectra\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "maldiamrkit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}