{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MaldiAMRKit - Peak Detection\n", "\n", "This notebook covers peak detection methods including local maxima and persistent homology." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import StratifiedKFold, cross_val_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nfrom maldiamrkit import MaldiSet\nfrom maldiamrkit.detection import MaldiPeakDetector\nfrom maldiamrkit.visualization import plot_peaks" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features shape: (29, 6000)\n", "Labels shape: (29,)\n" ] } ], "source": [ "data = MaldiSet.from_directory(\n", " \"../data/\",\n", " \"../data/metadata/metadata.csv\",\n", " aggregate_by=dict(antibiotics=\"Drug\"),\n", ")\n", "X = data.X\n", "y = data.y[\"Drug\"].map({\"S\": 0, \"I\": 1, \"R\": 1})\n", "\n", "print(f\"Features shape: {X.shape}\")\n", "print(f\"Labels shape: {y.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Peak Detection with Local Maxima\n", "\n", "The `MaldiPeakDetector` uses local maxima detection by default. It's fast and works well for most cases." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV ROC AUC: 0.417 +/- 0.260\n" ] } ], "source": [ "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n", "\n", "pipe = Pipeline(\n", " [\n", " (\"peaks\", MaldiPeakDetector(binary=False, prominence=1e-8)),\n", " (\"scaler\", StandardScaler()),\n", " (\"clf\", LogisticRegression()),\n", " ]\n", ")\n", "\n", "scores = cross_val_score(pipe, X, y, cv=cv, scoring=\"roc_auc\")\n", "print(f\"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Peak Detection with Persistent Homology\n", "\n", "Persistent homology (`method=\"ph\"`) is a topological approach that can better handle noise. It's slower but often more robust.\n", "\n", "**Note:** Requires the `gudhi` package." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
n_peaksmean_intensitymax_intensity
10s400.0015940.005164
11s430.0016950.006086
12s480.0018440.007017
13s560.0019460.007615
14s490.0017840.005047
\n", "
" ], "text/plain": [ " n_peaks mean_intensity max_intensity\n", "10s 40 0.001594 0.005164\n", "11s 43 0.001695 0.006086\n", "12s 48 0.001844 0.007017\n", "13s 56 0.001946 0.007615\n", "14s 49 0.001784 0.005047" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "peak_detector = MaldiPeakDetector(\n", " method=\"ph\",\n", " binary=False,\n", " persistence_threshold=5e-4,\n", ")\n", "\n", "# Fit on a subset and get statistics\n", "peak_detector.fit(X.iloc[:5])\n", "peak_detector.get_peak_statistics(X.iloc[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize Detected Peaks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "_ = plot_peaks(peak_detector, X.iloc[:1])" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary vs Intensity Mode\n", "\n", "The `binary` parameter controls whether peak positions are returned as binary (0/1) or preserve original intensities." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Binary mode unique values: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0.]\n", "Intensity mode - non-zero values: [0.0006844 0.00347906 0.00088132 0.00204877 0.00098946 0.00162945\n", " 0.00086379 0.00157396 0.00082802 0.00031478]\n" ] } ], "source": [ "# Binary mode - peaks are marked as 1, non-peaks as 0\n", "detector_binary = MaldiPeakDetector(binary=True, prominence=1e-4)\n", "peaks_binary = detector_binary.fit_transform(X.iloc[:1])\n", "print(f\"Binary mode unique values: {peaks_binary.values.flatten()[:20]}\")\n", "\n", "# Intensity mode - preserves original intensities at peak positions\n", "detector_intensity = MaldiPeakDetector(binary=False, prominence=1e-4)\n", "peaks_intensity = detector_intensity.fit_transform(X.iloc[:1])\n", "print(\n", " f\"Intensity mode - non-zero values: {peaks_intensity.values[peaks_intensity.values > 0][:10]}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallelization\n", "\n", "Use `n_jobs` parameter to enable parallel processing for faster computation on large datasets." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processed 29 spectra\n" ] } ], "source": [ "# Sequential processing\n", "detector_seq = MaldiPeakDetector(prominence=1e-4, n_jobs=1)\n", "\n", "# Parallel processing (use all cores)\n", "detector_par = MaldiPeakDetector(prominence=1e-4, n_jobs=-1)\n", "\n", "# Both produce the same results\n", "result = detector_par.fit_transform(X)\n", "print(f\"Processed {len(X)} spectra\")" ] } ], "metadata": { "kernelspec": { "display_name": "maldiamrkit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }