Similarity Module
Spectral distance metrics, pairwise distance matrix computation, clustering algorithms, and visualizations for spectral similarity analysis.
Metrics
- maldiamrkit.similarity.spectral_distance(spec_a, spec_b, metric=SpectralMetric.wasserstein)
Compute the distance between two spectra.
- Parameters:
spec_a (MaldiSpectrum, DataFrame, or ndarray) – For non-binned metrics ("wasserstein", "dtw"): MaldiSpectrum or DataFrame with mass and intensity columns. For binned metrics ("cosine", "spectral_contrast_angle", "pearson"): 1-D intensity array.
spec_b (MaldiSpectrum, DataFrame, or ndarray) – For non-binned metrics ("wasserstein", "dtw"): MaldiSpectrum or DataFrame with mass and intensity columns. For binned metrics ("cosine", "spectral_contrast_angle", "pearson"): 1-D intensity array.
metric (str or SpectralMetric, default="wasserstein") – Key in METRIC_REGISTRY.
- Returns:
Distance (or 1 - similarity for correlation-based metrics).
- Return type:
float
- Raises:
ValueError – If metric is not in METRIC_REGISTRY.
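To make the "1 - similarity" convention concrete, here is a minimal pure-Python sketch of two of the binned metrics. The helper names cosine_distance and pearson_distance are illustrative only and are not part of maldiamrkit; the library's own implementations may differ in normalisation and edge-case handling.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: refuse all-zero (or constant) input.
        raise ValueError("distance is undefined for a zero-norm vector")
    return 1.0 - dot / (norm_a * norm_b)


def pearson_distance(a, b):
    """Return 1 - Pearson correlation of two equal-length intensity vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Pearson correlation is the cosine similarity of the mean-centred vectors.
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return cosine_distance(centred_a, centred_b)
```

Under this convention, identical directions give a distance near 0, orthogonal vectors give a cosine distance of 1, and perfectly anti-correlated vectors give a Pearson distance near 2.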
- class maldiamrkit.similarity.SpectralMetric(value)
Supported spectral distance/similarity metrics.
- Variables:
wasserstein (str) – Earth mover's (Wasserstein-1) distance on raw spectra.
dtw (str) – Dynamic time warping distance on raw spectra.
cosine (str) – Cosine distance on binned intensity vectors.
spectral_contrast_angle (str) – Spectral contrast angle on binned intensity vectors.
pearson (str) – 1 - Pearson correlation on binned intensity vectors.
- wasserstein = 'wasserstein'
- dtw = 'dtw'
- cosine = 'cosine'
- spectral_contrast_angle = 'spectral_contrast_angle'
- pearson = 'pearson'
- maldiamrkit.similarity.METRIC_REGISTRY
Dictionary mapping metric names to callable distance functions. See SpectralMetric for the built-in keys.
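The enum-plus-registry pattern described here can be sketched as follows. This is a toy stand-in, not the library's code: Metric, REGISTRY, and resolve_metric are hypothetical names, and the two metrics are deliberately simple placeholders rather than the spectral metrics listed above.

```python
from enum import Enum


class Metric(str, Enum):
    """Hypothetical stand-in for an enum of metric names."""

    manhattan = "manhattan"
    chebyshev = "chebyshev"


# Registry keyed by the plain string name, mapping to a distance callable.
REGISTRY = {
    Metric.manhattan.value: lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    Metric.chebyshev.value: lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
}


def resolve_metric(metric):
    """Accept a string or an enum member and return the registered callable."""
    key = metric.value if isinstance(metric, Enum) else str(metric)
    if key not in REGISTRY:
        raise ValueError(
            f"Unknown metric {key!r}; expected one of {sorted(REGISTRY)}"
        )
    return REGISTRY[key]
```

Keying the registry by string lets callers pass either "manhattan" or Metric.manhattan, and an unregistered name fails early with a ValueError instead of deep inside a computation.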
Pairwise Distances
- maldiamrkit.similarity.pairwise_distances(spectra, metric=SpectralMetric.wasserstein, n_jobs=1)
Compute an n x n symmetric distance matrix.
- Parameters:
spectra (list[MaldiSpectrum] or DataFrame) – If a DataFrame (binned feature matrix, rows are samples), row vectors are used. If a list of MaldiSpectrum, raw/preprocessed data is used.
metric (str or SpectralMetric, default="wasserstein") – Key in METRIC_REGISTRY.
n_jobs (int, default=1) – Number of parallel jobs for pairwise computation.
- Returns:
Symmetric distance matrix of shape (n, n) with zeros on the diagonal.
- Return type:
np.ndarray
- Raises:
ValueError – If metric is not in the registry.
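A minimal sketch of how a symmetric matrix with a zero diagonal can be built from any two-argument distance function is shown below. The function pairwise_matrix is hypothetical and uses plain lists; it only illustrates the structure of the result, not how maldiamrkit computes or parallelises it.

```python
import math


def pairwise_matrix(items, distance):
    """Build an n x n symmetric distance matrix as a list of lists."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at zero
    for i in range(n):
        # Only the upper triangle is computed; each value is mirrored.
        for j in range(i + 1, n):
            d = float(distance(items[i], items[j]))
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Usage with Euclidean distance between toy 2-D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_matrix(points, math.dist)
```

Each of the n * (n - 1) / 2 upper-triangle entries is independent of the others, which is the property that makes a parallel option such as n_jobs possible.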
Clustering
- maldiamrkit.similarity.cluster_spectra(distance_matrix, method=ClusteringMethod.hierarchical, n_clusters=None, threshold=None, **kwargs)
Cluster spectra from a precomputed distance matrix.
- Parameters:
distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
method ({"hierarchical", "hdbscan", "kmedoids"}, default="hierarchical") – Clustering algorithm.
n_clusters (int or None, default=None) – Number of clusters. Required for "kmedoids"; for "hierarchical", one of n_clusters or threshold must be given.
threshold (float or None, default=None) – Distance threshold for cutting the dendrogram (hierarchical only).
**kwargs – Extra keyword arguments forwarded to the underlying function:
  hierarchical: method (linkage method, default "average") and any extra keyword arguments accepted by scipy.cluster.hierarchy.linkage().
  hdbscan: eps (cluster selection epsilon, default 0.5), min_samples (default 5).
  kmedoids: max_iter (default 300), random_state, init ("build" or "random").
- Returns:
Cluster labels.
- Return type:
ndarray of shape (n,)
- Raises:
ValueError – If method is unknown, or if required parameters are missing or conflicting.
- maldiamrkit.similarity.hierarchical_clustering(distance_matrix, method='average', **kwargs)
Agglomerative hierarchical clustering on a precomputed distance matrix.
- Parameters:
distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
method (str, default="average") – Linkage method forwarded to scipy.cluster.hierarchy.linkage().
**kwargs – Extra keyword arguments for linkage().
- Returns:
Linkage matrix.
- Return type:
ndarray of shape (n - 1, 4)
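- maldiamrkit.similarity.hdbscan_clustering(distance_matrix, eps=0.5, min_samples=5)
HDBSCAN clustering on a precomputed distance matrix.
- Parameters:
distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
eps (float, default=0.5) – Cluster selection epsilon.
min_samples (int, default=5)
- Returns:
Cluster labels (-1 for noise points).
- Return type:
ndarray of shape (n,)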
- maldiamrkit.similarity.kmedoids_clustering(distance_matrix, n_clusters=3, max_iter=300, random_state=None, init=KMedoidsInit.build)
K-medoids clustering using the PAM algorithm.
- Parameters:
distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
n_clusters (int, default=3) – Number of clusters.
max_iter (int, default=300) – Maximum SWAP iterations.
random_state (int or None, default=None) – Random seed (used only when init="random").
init (str or KMedoidsInit, default="build") – Medoid initialization strategy. "build" uses the deterministic BUILD phase of PAM; "random" selects initial medoids uniformly at random.
- Returns:
Cluster labels.
- Return type:
ndarray of shape (n,)
- Raises:
ValueError – If init is not "build" or "random".
- maldiamrkit.similarity.silhouette_scores(distance_matrix, labels)
Silhouette score for a clustering on a precomputed distance matrix.
- Parameters:
distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
labels (ndarray of shape (n,)) – Cluster assignments.
- Returns:
Mean silhouette coefficient in [-1, 1].
- Return type:
float
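For readers unfamiliar with the silhouette coefficient, the following pure-Python sketch computes the mean silhouette directly from a precomputed distance matrix. The function mean_silhouette is a hypothetical illustration of the standard definition; it assigns 0 to singleton clusters by convention and is not necessarily how silhouette_scores is implemented.

```python
def mean_silhouette(D, labels):
    """Mean silhouette coefficient from a square distance matrix and labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: four points on a line forming two tight, well-separated groups.
values = [0.0, 1.0, 10.0, 11.0]
D = [[abs(x - y) for y in values] for x in values]
good = mean_silhouette(D, [0, 0, 1, 1])
bad = mean_silhouette(D, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters; the deliberately mismatched labelling in `bad` yields a negative score.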
- maldiamrkit.similarity.cluster_metadata_concordance(labels, metadata)
Evaluate clustering agreement with known metadata labels.
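The BUILD and SWAP phases mentioned for kmedoids_clustering can be sketched in pure Python as below. This is a simplified illustration of PAM under stated assumptions (greedy BUILD, best-improvement SWAP, plain lists); the helper names are hypothetical and the library may use a different or faster variant.

```python
def total_cost(D, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))


def build_init(D, k):
    """Deterministic BUILD: greedily add the medoid that lowers cost most."""
    n = len(D)
    # First medoid: the point with the smallest total distance to all points.
    medoids = [min(range(n), key=lambda c: sum(D[c]))]
    while len(medoids) < k:
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(
            min(candidates, key=lambda c: total_cost(D, medoids + [c]))
        )
    return medoids


def kmedoids_sketch(D, k, max_iter=300):
    """Return (labels, medoids) after BUILD followed by SWAP iterations."""
    n = len(D)
    medoids = build_init(D, k)
    for _ in range(max_iter):
        best_cost = total_cost(D, medoids)
        best_swap = None
        # SWAP: try replacing each medoid with each non-medoid point.
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(D, trial)
                if cost < best_cost:
                    best_cost, best_swap = cost, trial
        if best_swap is None:
            break  # no swap improves the cost: converged
        medoids = best_swap
    labels = [min(range(k), key=lambda j: D[i][medoids[j]]) for i in range(n)]
    return labels, medoids


# Toy data: six points on a line in two groups of three.
values = [0, 1, 2, 10, 11, 12]
D = [[abs(x - y) for y in values] for x in values]
labels, medoids = kmedoids_sketch(D, k=2)
```

On this toy input the procedure settles on the middle point of each group as medoid. Because BUILD is deterministic, no random seed is involved, which matches the note that random_state matters only for random initialization.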
Visualization
- maldiamrkit.similarity.plot_distance_heatmap(distance_matrix, labels=None, *, metric=None, cmap='viridis', ax=None, title=None, figsize=None, annot=None, cluster=False, vmin=None, vmax=None, cbar_label='distance', show=True)
Plot a pairwise distance matrix as a heatmap.
- Parameters:
distance_matrix (ndarray of shape (n, n)) – Symmetric distance matrix.
labels (list of str, ndarray, or None, default=None) – Tick labels for rows and columns.
metric (str, optional) – Name of the distance metric used (e.g. "cosine", "pearson", "spectral_contrast_angle"). When given and recognised, the colourbar limits are clamped to the metric's theoretical bounds so that heatmaps computed with the same metric share a comparable colour scale. Explicit vmin/vmax always take precedence.
cmap (str, default="viridis") – Matplotlib / seaborn colormap name.
ax (Axes or None, default=None) – Pre-existing axes. If None, a new figure and axes are created.
title (str or None, default=None) – Plot title. Defaults to "Pairwise distance" (including the metric name, if provided).
figsize (tuple of float, optional) – Figure size in inches. When None, scales with n (side = min(16, 4 + 0.1 * n)). Only used when ax is None.
annot (bool, optional) – When True, annotate each cell with its distance value. When None (default), annotate only when the matrix is small (n ≤ 15).
cluster (bool, default=False) – When True, reorder rows and columns via hierarchical clustering so that similar samples group visually. Labels are reordered accordingly.
vmin (float, optional) – Explicit lower colourbar limit. Overrides any metric-derived bound.
vmax (float, optional) – Explicit upper colourbar limit. Overrides any metric-derived bound.
cbar_label (str, default="distance") – Label drawn on the colourbar.
show (bool, default=True) – Call plt.show() at the end.
- Return type:
tuple[Figure, Axes]
- Returns:
fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
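The size-dependent defaults for figsize and annot can be worked through with a small helper. The function default_heatmap_layout is hypothetical and simply restates the two documented rules; it assumes the figure is square with the documented side length.

```python
def default_heatmap_layout(n):
    """Return (figsize, annotate) following the documented default rules."""
    side = min(16.0, 4.0 + 0.1 * n)  # grows with n, capped at 16 inches
    annotate = n <= 15               # annotate cells only for small matrices
    return (side, side), annotate


small = default_heatmap_layout(10)
large = default_heatmap_layout(200)
```

With 10 samples the figure is 5 inches per side and cells are annotated; with 200 samples the side is capped at 16 inches and annotation is off.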
- maldiamrkit.similarity.plot_dendrogram(linkage_matrix, labels=None, *, ax=None, title=None, figsize=(10, 6), leaf_rotation=90.0, color_threshold=None, truncate_mode=None, p=30, show=True)
Plot a dendrogram from a hierarchical clustering linkage matrix.
- Parameters:
linkage_matrix (ndarray of shape (n - 1, 4)) – Linkage matrix from hierarchical_clustering().
ax (Axes or None, default=None) – Pre-existing axes.
title (str or None, default=None) – Plot title. Defaults to "Hierarchical clustering dendrogram".
figsize (tuple of float, default=(10, 6)) – Figure size in inches (used only when ax is None).
leaf_rotation (float, default=90.0) – Rotation (in degrees) of leaf labels along the bottom axis.
color_threshold (float, optional) – Colour threshold forwarded to scipy's dendrogram. Clusters below this threshold share a colour. When None (default), scipy chooses 0.7 * max(linkage[:, 2]).
truncate_mode ({"lastp", "level", None}, optional) – Forwarded to scipy's dendrogram to collapse deep branches. Essential for large trees; pair with p.
p (int, default=30) – Number of leaves / merges to keep when truncate_mode is set.
show (bool, default=True) – Call plt.show() at the end.
- Return type:
tuple[Figure, Axes]
- Returns:
fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
Example
from maldiamrkit.similarity import (
    pairwise_distances,
    cluster_spectra,
    plot_distance_heatmap,
    plot_dendrogram,
    hierarchical_clustering,
)

# Assumes `spectra` is a binned feature DataFrame (rows are samples), as the
# cosine metric requires, and `sample_ids` holds one label per sample.

# Compute pairwise distance matrix
D = pairwise_distances(spectra, metric="cosine", n_jobs=-1)

# Visualize distances
plot_distance_heatmap(D, labels=sample_ids)

# Cluster spectra
labels = cluster_spectra(D, method="hierarchical", n_clusters=3)

# Plot dendrogram
linkage = hierarchical_clustering(D)
plot_dendrogram(linkage, labels=sample_ids)