Similarity Module

This module provides spectral distance metrics, pairwise distance matrix computation, clustering algorithms, and visualizations for spectral similarity analysis.

Metrics

maldiamrkit.similarity.spectral_distance(spec_a, spec_b, metric=SpectralMetric.wasserstein)

Compute distance between two spectra.

Parameters:

spec_a (MaldiSpectrum, DataFrame, or ndarray) – For non-binned metrics ("wasserstein", "dtw"): MaldiSpectrum or DataFrame with mass and intensity columns. For binned metrics ("cosine", "spectral_contrast_angle", "pearson"): 1-D intensity arrays.
spec_b (MaldiSpectrum, DataFrame, or ndarray) – For non-binned metrics ("wasserstein", "dtw"): MaldiSpectrum or DataFrame with mass and intensity columns. For binned metrics ("cosine", "spectral_contrast_angle", "pearson"): 1-D intensity arrays.
metric (str or SpectralMetric, default="wasserstein") – One of the values of SpectralMetric.

Returns:

Distance (or 1 - similarity for correlation-based metrics).

Return type:

float

Raises:

ValueError – If metric is not a recognised SpectralMetric.
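The three binned metrics can be illustrated with a dependency-free sketch. The definitions below are the textbook ones (cosine distance as 1 - cosine similarity, Pearson distance as 1 - r, spectral contrast angle as the arccosine of the cosine similarity, in radians); whether the library scales or normalises them in exactly this way is an assumption, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention)."""
    return 1.0 - cosine_similarity(a, b)


def contrast_angle(a, b):
    """Angle between the vectors in radians (assumed convention)."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine_similarity(a, b))))


def pearson_distance(a, b):
    """1 - Pearson correlation, computed as the cosine of the centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    return 1.0 - cosine_similarity(centred_a, centred_b)


# Toy binned spectra: spec_b is a scaled copy of spec_a, spec_c shares no bins with it.
spec_a = [0.0, 2.0, 5.0, 1.0]
spec_b = [0.0, 4.0, 10.0, 2.0]
spec_c = [5.0, 0.0, 0.0, 0.0]

d_same = cosine_distance(spec_a, spec_b)      # close to 0: same shape
d_disjoint = cosine_distance(spec_a, spec_c)  # close to 1: no shared bins
```

All three measures ignore overall intensity scale, which is why the scaled copy sits at distance zero from the original.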

class maldiamrkit.similarity.SpectralMetric(value)

Bases: str, Enum

Supported spectral distance/similarity metrics.

Variables:

wasserstein (str) – Earth mover’s (Wasserstein-1) distance on raw spectra.
dtw (str) – Dynamic time warping distance on raw spectra.
cosine (str) – Cosine distance on binned intensity vectors.
spectral_contrast_angle (str) – Spectral contrast angle on binned intensity vectors.
pearson (str) – 1 - Pearson correlation on binned intensity vectors.
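As a rough illustration of the Wasserstein-1 entry above, the distance between two spectra can be computed as the area between their cumulative intensity distributions along the m/z axis. The sketch below is dependency-free and uses a hypothetical helper name; it assumes intensities are normalised to unit total mass, which may or may not match the library's preprocessing.

```python
def wasserstein_1d(mz_a, int_a, mz_b, int_b):
    """Wasserstein-1 distance between two intensity-weighted m/z distributions."""
    total_a = sum(int_a)
    total_b = sum(int_b)
    # Signed mass events: positive weights for spectrum A, negative for B,
    # sorted by m/z so the running sum equals CDF_A - CDF_B.
    events = sorted(
        [(mz, w / total_a) for mz, w in zip(mz_a, int_a)]
        + [(mz, -w / total_b) for mz, w in zip(mz_b, int_b)]
    )
    distance = 0.0
    cdf_gap = 0.0
    previous_mz = events[0][0]
    for mz, signed_weight in events:
        # Integrate |CDF_A - CDF_B| over the interval since the last event.
        distance += abs(cdf_gap) * (mz - previous_mz)
        cdf_gap += signed_weight
        previous_mz = mz
    return distance


# A single peak shifted by 3 m/z units costs exactly 3.
shift = wasserstein_1d([1000.0], [1.0], [1003.0], [1.0])

# Same peak positions, different relative intensities.
reweighted = wasserstein_1d([1000.0, 2000.0], [1.0, 1.0],
                            [1000.0, 2000.0], [3.0, 1.0])
```

Unlike the binned metrics, this measure grows smoothly with peak shifts, which is why it operates on raw mass and intensity pairs rather than on binned vectors.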

wasserstein = 'wasserstein'

dtw = 'dtw'

cosine = 'cosine'

spectral_contrast_angle = 'spectral_contrast_angle'

pearson = 'pearson'
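Because the enum derives from both str and Enum, a plain string and an enum member can be used interchangeably, and an unknown name fails with ValueError. The sketch below shows this standard-library pattern with a hypothetical stand-in class (Metric) and helper (resolve_metric); it is not the library's own code.

```python
from enum import Enum


class Metric(str, Enum):
    """Hypothetical stand-in with the same bases as SpectralMetric."""

    wasserstein = "wasserstein"
    cosine = "cosine"


def resolve_metric(metric):
    """Accept either a string or a Metric member; reject anything else."""
    try:
        return Metric(metric)
    except ValueError:
        valid = ", ".join(m.value for m in Metric)
        raise ValueError(
            f"Unknown metric {metric!r}; expected one of: {valid}"
        ) from None


from_string = resolve_metric("cosine")
from_member = resolve_metric(Metric.wasserstein)

# The str mixin makes members compare equal to their plain string values.
is_plain_string_equal = Metric.cosine == "cosine"
```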

Pairwise Distances

maldiamrkit.similarity.pairwise_distances(spectra, metric=SpectralMetric.wasserstein, n_jobs=1)

Compute an n x n symmetric distance matrix.

Parameters:

spectra (list[MaldiSpectrum] or DataFrame) – If a DataFrame (binned feature matrix, rows are samples), row vectors are used. If a list of MaldiSpectrum, raw/preprocessed data is used.
metric (str or SpectralMetric, default="wasserstein") – One of the values of SpectralMetric.
n_jobs (int, default=1) – Number of parallel jobs for pairwise computation.

Returns:

Symmetric distance matrix of shape (n, n) with zeros on the diagonal.

Return type:

np.ndarray

Raises:

ValueError – If metric is not a recognised SpectralMetric.
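A symmetric matrix with a zero diagonal only needs the n(n - 1)/2 pairs above the diagonal; each value is then mirrored. The sketch below shows that pattern in plain Python with hypothetical names (pairwise_matrix, manhattan). How the library schedules these pairs across n_jobs workers is not shown here and is not documented above.

```python
def pairwise_matrix(items, distance):
    """Build a symmetric n x n distance matrix from a two-argument distance."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only visit j > i: the diagonal stays 0 and the lower triangle is mirrored.
        for j in range(i + 1, n):
            d = distance(items[i], items[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


calls = []


def manhattan(a, b):
    """Toy distance that also records how often it is evaluated."""
    calls.append((tuple(a), tuple(b)))
    return float(sum(abs(x - y) for x, y in zip(a, b)))


rows = [[0, 1], [1, 1], [4, 0]]
D = pairwise_matrix(rows, manhattan)
```

Each pair is independent of the others, which is what makes the computation straightforward to parallelise.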

Clustering

maldiamrkit.similarity.cluster_spectra(distance_matrix, method=ClusteringMethod.hierarchical, n_clusters=None, threshold=None, **kwargs)

Cluster spectra from a precomputed distance matrix.

Parameters:

distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
method ({"hierarchical", "hdbscan", "kmedoids"}, default="hierarchical") – Clustering algorithm.
n_clusters (int or None, default=None) – Number of clusters. Required for "kmedoids"; for "hierarchical", supply either n_clusters or threshold.
threshold (float or None, default=None) – Distance threshold for cutting the dendrogram (hierarchical only).
**kwargs –
Extra keyword arguments forwarded to the underlying function:
- hierarchical: method (linkage method, default "average") and any extra keyword arguments accepted by scipy.cluster.hierarchy.linkage().
- hdbscan: eps (cluster selection epsilon, default 0.5), min_samples (default 5).
- kmedoids: max_iter (default 300), random_state, init ("build" or "random").

Returns:

Cluster labels.

Return type:

ndarray of shape (n,)

Raises:

ValueError – If method is unknown, or required parameters are missing / conflicting.

maldiamrkit.similarity.hierarchical_clustering(distance_matrix, method='average', **kwargs)

Agglomerative hierarchical clustering on a precomputed distance matrix.

Parameters:

distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
method (str, default="average") – Linkage method forwarded to scipy.cluster.hierarchy.linkage().
**kwargs – Extra keyword arguments for linkage().

Returns:

Linkage matrix.

Return type:

ndarray of shape (n - 1, 4)
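For orientation, the sketch below shows how a linkage matrix of this shape can be produced and cut using SciPy directly. It assumes NumPy and SciPy are installed. Converting the square matrix with squareform and cutting with fcluster is one common approach; whether cluster_spectra() maps n_clusters and threshold onto the "maxclust" and "distance" criteria in exactly this way is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy data: two well-separated groups of points on a line.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(positions[:, None] - positions[None, :])

# linkage() expects the condensed (upper-triangle) form of a distance matrix.
condensed = squareform(D, checks=True)
Z = linkage(condensed, method="average")

# Cut the tree either by a target number of clusters...
labels_by_count = fcluster(Z, t=2, criterion="maxclust")
# ...or by a distance threshold on the merge heights.
labels_by_threshold = fcluster(Z, t=5.0, criterion="distance")
```

Here both cuts recover the same two groups, because the final merge happens at an average distance of 10, well above the threshold of 5.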

maldiamrkit.similarity.hdbscan_clustering(distance_matrix, eps=0.5, min_samples=5)

HDBSCAN clustering on a precomputed distance matrix.

Parameters:

distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
eps (float, default=0.5) – Cluster selection epsilon passed to cluster_selection_epsilon.
min_samples (int, default=5) – Minimum number of samples in a neighbourhood.

Returns:

Cluster labels (-1 for noise points).

Return type:

ndarray of shape (n,)

maldiamrkit.similarity.kmedoids_clustering(distance_matrix, n_clusters=3, max_iter=300, random_state=None, init=KMedoidsInit.build)

K-medoids clustering using the PAM algorithm.

Parameters:

distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
n_clusters (int, default=3) – Number of clusters.
max_iter (int, default=300) – Maximum SWAP iterations.
random_state (int or None, default=None) – Random seed (used only when init="random").
init (str or KMedoidsInit, default="build") – Medoid initialization strategy. "build" uses the deterministic BUILD phase of PAM; "random" selects initial medoids uniformly at random.

Returns:

Cluster labels.

Return type:

ndarray of shape (n,)

Raises:

ValueError – If init is not "build" or "random".
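The BUILD and SWAP phases mentioned above can be sketched in a few lines of plain Python. This is a simplified, unoptimised illustration of PAM with hypothetical function names, not the library's implementation; it recomputes the total cost for every candidate, which is only practical for small matrices.

```python
def total_cost(D, medoids):
    """Sum over all points of the distance to their nearest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))


def pam(D, n_clusters, max_iter=300):
    n = len(D)
    # BUILD: start with the most central point, then greedily add the
    # point that lowers the total cost the most.
    medoids = [min(range(n), key=lambda j: sum(D[i][j] for i in range(n)))]
    while len(medoids) < n_clusters:
        candidates = [j for j in range(n) if j not in medoids]
        medoids.append(min(candidates, key=lambda j: total_cost(D, medoids + [j])))
    # SWAP: apply the single best medoid/non-medoid exchange until none helps.
    for _ in range(max_iter):
        best_cost = total_cost(D, medoids)
        best_swap = None
        for pos in range(n_clusters):
            for j in range(n):
                if j in medoids:
                    continue
                trial = medoids[:pos] + [j] + medoids[pos + 1:]
                cost = total_cost(D, trial)
                if cost < best_cost - 1e-12:
                    best_cost, best_swap = cost, trial
        if best_swap is None:
            break
        medoids = best_swap
    # Assign each point to the index of its nearest medoid.
    labels = [min(range(n_clusters), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return labels, medoids


positions = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in positions] for a in positions]
labels, medoids = pam(D, n_clusters=2)
```

On this toy input BUILD picks indices 2 and 4, and one SWAP step moves the first medoid to index 1, the centre of its group.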

maldiamrkit.similarity.silhouette_scores(distance_matrix, labels)

Silhouette score for a clustering on a precomputed distance matrix.

Parameters:

distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.
labels (ndarray of shape (n,)) – Cluster assignments.

Returns:

Mean silhouette coefficient in [-1, 1].

Return type:

float
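The mean silhouette coefficient can be computed from a distance matrix alone. The dependency-free sketch below follows the standard definition (a is the mean distance to the point's own cluster, b the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b)). The function name is hypothetical, and scoring singleton clusters as 0 is a common convention assumed here.

```python
def mean_silhouette(D, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


positions = [0, 1, 10, 11]
D = [[abs(a - b) for b in positions] for a in positions]

good = mean_silhouette(D, [0, 0, 1, 1])  # labels follow the two groups
bad = mean_silhouette(D, [0, 1, 0, 1])   # labels cut across the groups
```

Values near 1 indicate compact, well-separated clusters; negative values, as in the second labelling, indicate points that sit closer to another cluster than to their own.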

maldiamrkit.similarity.cluster_metadata_concordance(labels, metadata)

Evaluate clustering agreement with known metadata labels.

Parameters:

labels (ndarray of shape (n,)) – Cluster assignments.
metadata (Series of shape (n,)) – Ground-truth categorical labels.

Returns:

{"adjusted_rand_index": float, "normalized_mutual_info": float}.

Return type:

dict[str, float]
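To make the two returned scores concrete, here is a dependency-free sketch of the adjusted Rand index and normalized mutual information. The function names are hypothetical, and normalising the mutual information by the arithmetic mean of the two entropies is an assumption; the library may use a different normalisation.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels, truth):
    """Chance-corrected pair-counting agreement (needs at least two samples)."""
    n = len(labels)
    together = sum(comb(c, 2) for c in Counter(zip(labels, truth)).values())
    in_labels = sum(comb(c, 2) for c in Counter(labels).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    expected = in_labels * in_truth / comb(n, 2)
    maximum = (in_labels + in_truth) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (together - expected) / (maximum - expected)


def normalized_mutual_info(labels, truth):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    n = len(labels)
    joint = Counter(zip(labels, truth))
    count_l = Counter(labels)
    count_t = Counter(truth)
    mi = sum(
        (c / n) * log(c * n / (count_l[x] * count_t[y]))
        for (x, y), c in joint.items()
    )
    h_l = -sum((c / n) * log(c / n) for c in count_l.values())
    h_t = -sum((c / n) * log(c / n) for c in count_t.values())
    mean_entropy = (h_l + h_t) / 2
    return mi / mean_entropy if mean_entropy > 0 else 1.0


clusters = [0, 0, 1, 1]
matching = ["R", "R", "S", "S"]  # same partition, different label names
crossed = ["R", "S", "R", "S"]   # partition that cuts across the clusters

result = {
    "adjusted_rand_index": adjusted_rand_index(clusters, matching),
    "normalized_mutual_info": normalized_mutual_info(clusters, matching),
}
```

Both scores depend only on the partition, not on the label names, so a clustering that matches the metadata up to renaming scores 1.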

class maldiamrkit.similarity.ClusteringMethod(value)

Bases: str, Enum

Supported clustering algorithms for cluster_spectra().

Variables:

hierarchical (str) – Agglomerative hierarchical clustering.
hdbscan (str) – HDBSCAN density-based clustering.
kmedoids (str) – K-medoids (PAM) clustering.

hierarchical = 'hierarchical'

hdbscan = 'hdbscan'

kmedoids = 'kmedoids'

class maldiamrkit.similarity.KMedoidsInit(value)

Bases: str, Enum

Initialization strategy for kmedoids_clustering().

Variables:

build (str) – Deterministic BUILD phase of PAM.
random (str) – Random medoid selection.

build = 'build'

random = 'random'

Visualization

maldiamrkit.similarity.plot_distance_heatmap(distance_matrix, labels=None, *, metric=None, cmap='viridis', ax=None, title=None, figsize=None, annot=None, cluster=False, vmin=None, vmax=None, cbar_label='distance', show=True)

Plot a pairwise distance matrix as a heatmap.

Parameters:

distance_matrix (ndarray of shape (n, n)) – Symmetric distance matrix.
labels (list of str, ndarray, or None, default=None) – Tick labels for rows and columns.
metric (str, optional) – Name of the distance metric used (e.g. "cosine", "pearson", "spectral_contrast_angle"). When given and recognised, the colourbar limits are clamped to the metric’s theoretical bounds so heatmaps computed with the same metric share a comparable colour scale. Explicit vmin/vmax always win.
cmap (str, default="viridis") – Matplotlib / seaborn colormap name.
ax (Axes or None, default=None) – Pre-existing axes. If None, a new figure and axes are created.
title (str or None, default=None) – Plot title. Defaults to "Pairwise distance" (including the metric name, if provided).
figsize (tuple of float, optional) – Figure size in inches. When None, scales with n (side = min(16, 4 + 0.1 * n)). Only used when ax is None.
annot (bool, optional) – When True, annotate each cell with its distance value. When None (default), annotate iff the matrix is small (n ≤ 15).
cluster (bool, default=False) – When True, reorder rows and columns via hierarchical clustering so similar samples group visually. Labels reorder accordingly.
vmin (float, optional) – Explicit colourbar limits. Override any metric-derived bounds.
vmax (float, optional) – Explicit colourbar limits. Override any metric-derived bounds.
cbar_label (str, default="distance") – Label drawn on the colourbar.
show (bool, default=True) – Call plt.show() at the end.

Return type:

tuple[Figure, Axes]

Returns:

fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
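The size-dependent defaults described for figsize, annot, and the colour limits can be written out as small helpers. The names below are hypothetical. The sketch assumes the figure is square (the description only gives a side length), and the bounds passed to resolve_limits are placeholders rather than the library's actual per-metric bounds.

```python
def default_figsize(n):
    """Side length grows with n and is capped at 16 inches (square assumed)."""
    side = min(16, 4 + 0.1 * n)
    return (side, side)


def default_annot(n):
    """Annotate cells only when the matrix is small (n <= 15)."""
    return n <= 15


def resolve_limits(metric_bounds, vmin=None, vmax=None):
    """Explicit vmin/vmax win over metric-derived bounds."""
    low, high = metric_bounds if metric_bounds is not None else (None, None)
    return (
        vmin if vmin is not None else low,
        vmax if vmax is not None else high,
    )


small = default_figsize(20)   # 4 + 0.1 * 20 gives a side of 6 inches
large = default_figsize(200)  # 4 + 0.1 * 200 = 24, capped at 16

# Placeholder bounds (0.0, 1.0); only vmax is overridden explicitly.
limits = resolve_limits((0.0, 1.0), vmax=0.5)
```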

maldiamrkit.similarity.plot_dendrogram(linkage_matrix, labels=None, *, ax=None, title=None, figsize=(10, 6), leaf_rotation=90.0, color_threshold=None, truncate_mode=None, p=30, show=True)

Plot a dendrogram from a hierarchical clustering linkage matrix.

Parameters:

linkage_matrix (ndarray of shape (n - 1, 4)) – Linkage matrix from hierarchical_clustering().
labels (list of str or None, default=None) – Leaf labels.
ax (Axes or None, default=None) – Pre-existing axes.
title (str or None, default=None) – Plot title. Defaults to "Hierarchical clustering dendrogram".
figsize (tuple of float, default=(10, 6)) – Figure size in inches (used only when ax is None).
leaf_rotation (float, default=90.0) – Rotation (in degrees) of leaf labels along the bottom axis.
color_threshold (float, optional) – Colour threshold forwarded to scipy’s dendrogram. Clusters below this threshold share a colour. When None (default), scipy chooses 0.7 * max(linkage[:, 2]).
truncate_mode ({"lastp", "level", None}, optional) – Forwarded to scipy’s dendrogram to collapse deep branches. Essential for large trees; pair with p.
p (int, default=30) – Number of leaves / merges to keep when truncate_mode is set.
show (bool, default=True) – Call plt.show() at the end.

Return type:

tuple[Figure, Axes]

Returns:

fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
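The effect of truncate_mode and p, and the default colour threshold, can be inspected without drawing anything by calling SciPy's dendrogram with no_plot=True. The sketch below assumes NumPy and SciPy are installed and calls SciPy directly on toy data; it does not go through plot_dendrogram().

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Toy data: 12 points on a line with increasing gaps between neighbours.
positions = np.arange(12, dtype=float) ** 1.5
D = np.abs(positions[:, None] - positions[None, :])
Z = linkage(squareform(D), method="average")

# SciPy's default colour threshold when none is given.
default_threshold = 0.7 * Z[:, 2].max()

# Layout data only; no figure is created because no_plot=True.
full = dendrogram(Z, no_plot=True)
truncated = dendrogram(Z, truncate_mode="lastp", p=4, no_plot=True)

n_full_leaves = len(full["leaves"])
n_truncated_leaves = len(truncated["leaves"])
```

With truncate_mode="lastp", collapsed branches appear as single leaves, so the truncated tree has p leaves instead of one per sample.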

Example

from maldiamrkit.similarity import (
    pairwise_distances,
    cluster_spectra,
    plot_distance_heatmap,
    plot_dendrogram,
    hierarchical_clustering,
)

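# `spectra` is assumed to be a binned feature matrix (DataFrame, rows are samples),
# since "cosine" is a binned metric; `sample_ids` holds one label per sample.

# Compute the pairwise distance matrix
D = pairwise_distances(spectra, metric="cosine", n_jobs=-1)

# Visualize distances; passing the metric name clamps the colour scale to its bounds
plot_distance_heatmap(D, labels=sample_ids, metric="cosine")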

# Cluster spectra
labels = cluster_spectra(D, method="hierarchical", n_clusters=3)

# Plot dendrogram
linkage = hierarchical_clustering(D)
plot_dendrogram(linkage, labels=sample_ids)