Similarity Module

Spectral distance metrics, pairwise distance matrix computation, clustering algorithms, and visualizations for spectral similarity analysis.

Metrics

maldiamrkit.similarity.spectral_distance(spec_a, spec_b, metric=SpectralMetric.wasserstein)

Compute distance between two spectra.

Parameters:
  • spec_a (MaldiSpectrum, DataFrame, or ndarray) – For non-binned metrics ("wasserstein", "dtw"): MaldiSpectrum or DataFrame with mass and intensity columns. For binned metrics ("cosine", "spectral_contrast_angle", "pearson"): 1-D intensity arrays.

  • spec_b (MaldiSpectrum, DataFrame, or ndarray) – For non-binned metrics ("wasserstein", "dtw"): MaldiSpectrum or DataFrame with mass and intensity columns. For binned metrics ("cosine", "spectral_contrast_angle", "pearson"): 1-D intensity arrays.

  • metric (str or SpectralMetric, default="wasserstein") – Key in METRIC_REGISTRY.

Returns:

Distance (or 1 - similarity for correlation-based metrics).

Return type:

float

Raises:

ValueError – If metric is not in METRIC_REGISTRY.
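
As a rough illustration of the "1 - similarity" convention for the binned metrics, the sketch below computes a cosine distance and a Pearson distance for short intensity vectors in plain Python. The helper names cosine_distance and pearson_distance and the sample vectors are hypothetical; they are not the library's implementations, which may differ in normalisation and edge-case handling.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pearson_distance(a, b):
    """1 - Pearson correlation of two equal-length intensity vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    # Pearson correlation is the cosine similarity of the mean-centred vectors.
    return cosine_distance(centred_a, centred_b)


# Hypothetical binned spectra (one intensity per m/z bin).
spec_a = [0.0, 2.0, 5.0, 1.0]
spec_b = [0.0, 4.0, 10.0, 2.0]   # same peak pattern, twice the intensity
spec_c = [5.0, 1.0, 0.0, 2.0]    # different peak pattern

d_same = cosine_distance(spec_a, spec_b)
d_diff = cosine_distance(spec_a, spec_c)
p_same = pearson_distance(spec_a, spec_b)
```

Both distances are insensitive to a uniform intensity scaling, so spec_a and spec_b come out at essentially zero distance while spec_c is far from spec_a.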

class maldiamrkit.similarity.SpectralMetric(value)

Bases: str, Enum

Supported spectral distance/similarity metrics.

Variables:
  • wasserstein (str) – Earth mover’s (Wasserstein-1) distance on raw spectra.

  • dtw (str) – Dynamic time warping distance on raw spectra.

  • cosine (str) – Cosine distance on binned intensity vectors.

  • spectral_contrast_angle (str) – Spectral contrast angle on binned intensity vectors.

  • pearson (str) – 1 - Pearson correlation on binned intensity vectors.

wasserstein = 'wasserstein'
dtw = 'dtw'
cosine = 'cosine'
spectral_contrast_angle = 'spectral_contrast_angle'
pearson = 'pearson'

maldiamrkit.similarity.METRIC_REGISTRY

Dictionary mapping metric names to callable distance functions. See SpectralMetric for the built-in keys.
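
The general pattern of a name-to-callable registry with a ValueError on unknown keys can be sketched as follows. Everything here (Metric, REGISTRY, distance, and the two toy metrics) is hypothetical and only illustrates the lookup; the callables stored in METRIC_REGISTRY and their exact signatures are not described in this section.

```python
from enum import Enum


class Metric(str, Enum):
    """Hypothetical stand-in for a str-based metric enum."""
    manhattan = "manhattan"
    chebyshev = "chebyshev"


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical registry: metric name -> distance callable.
REGISTRY = {
    "manhattan": manhattan,
    "chebyshev": chebyshev,
}


def distance(a, b, metric="manhattan"):
    # Accept either the enum member or its string value.
    key = metric.value if isinstance(metric, Enum) else metric
    if key not in REGISTRY:
        raise ValueError(f"Unknown metric {key!r}; choose from {sorted(REGISTRY)}")
    return REGISTRY[key](a, b)


d_str = distance([1, 2, 3], [2, 2, 5], metric="chebyshev")
d_enum = distance([1, 2, 3], [2, 2, 5], metric=Metric.chebyshev)
```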

Pairwise Distances

maldiamrkit.similarity.pairwise_distances(spectra, metric=SpectralMetric.wasserstein, n_jobs=1)

Compute an n x n symmetric distance matrix.

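Parameters:
  • spectra – Collection of n spectra to compare, in the input form required by the chosen metric (see spectral_distance()).

  • metric (str or SpectralMetric, default="wasserstein") – Key in METRIC_REGISTRY.

  • n_jobs (int, default=1) – Number of parallel jobs.

Returns: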

Symmetric distance matrix of shape (n, n) with zeros on the diagonal.

Return type:

np.ndarray

Raises:

ValueError – If metric is not in the registry.
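
A minimal sketch of how a symmetric matrix with a zero diagonal can be assembled from one distance evaluation per unordered pair. The names pairwise_matrix and manhattan are hypothetical, the toy metric stands in for a spectral metric, and the parallelism controlled by n_jobs is not reproduced.

```python
from itertools import combinations


def pairwise_matrix(items, dist):
    """Build a symmetric n x n matrix, calling dist once per unordered pair."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]   # diagonal stays at zero
    for i, j in combinations(range(n), 2):
        d = dist(items[i], items[j])
        matrix[i][j] = d
        matrix[j][i] = d                     # mirror into the lower triangle
    return matrix


def manhattan(a, b):
    return float(sum(abs(x - y) for x, y in zip(a, b)))


vectors = [[0, 1, 2], [0, 2, 2], [3, 1, 0]]
D = pairwise_matrix(vectors, manhattan)
```

Only n(n - 1)/2 distances are computed this way, which matters when a single comparison is expensive.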

Clustering

maldiamrkit.similarity.cluster_spectra(distance_matrix, method=ClusteringMethod.hierarchical, n_clusters=None, threshold=None, **kwargs)

Cluster spectra from a precomputed distance matrix.

Parameters:
  • distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.

  • method ({"hierarchical", "hdbscan", "kmedoids"}, default="hierarchical") – Clustering algorithm.

  • n_clusters (int or None, default=None) – Number of clusters. Required for "kmedoids"; for "hierarchical", supply either n_clusters or threshold.

  • threshold (float or None, default=None) – Distance threshold for cutting the dendrogram (hierarchical only).

  • **kwargs

    Extra keyword arguments forwarded to the underlying function:

    • hierarchical: method (linkage method, default "average") and any extra keyword arguments accepted by scipy.cluster.hierarchy.linkage().

    • hdbscan: eps (cluster selection epsilon, default 0.5), min_samples (default 5).

    • kmedoids: max_iter (default 300), random_state, init ("build" or "random").

Returns:

Cluster labels.

Return type:

ndarray of shape (n,)

Raises:

ValueError – If method is unknown, or required parameters are missing / conflicting.

maldiamrkit.similarity.hierarchical_clustering(distance_matrix, method='average', **kwargs)

Agglomerative hierarchical clustering on a precomputed distance matrix.

Parameters:
  • distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.

  • method (str, default="average") – Linkage method forwarded to scipy.cluster.hierarchy.linkage().

  • **kwargs – Extra keyword arguments for linkage().

Returns:

Linkage matrix.

Return type:

ndarray of shape (n - 1, 4)

maldiamrkit.similarity.hdbscan_clustering(distance_matrix, eps=0.5, min_samples=5)

HDBSCAN clustering on a precomputed distance matrix.

Parameters:
  • distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.

  • eps (float, default=0.5) – Cluster selection epsilon passed to cluster_selection_epsilon.

  • min_samples (int, default=5) – Minimum number of samples in a neighbourhood.

Returns:

Cluster labels (-1 for noise points).

Return type:

ndarray of shape (n,)

maldiamrkit.similarity.kmedoids_clustering(distance_matrix, n_clusters=3, max_iter=300, random_state=None, init=KMedoidsInit.build)

K-medoids clustering using the PAM algorithm.

Parameters:
  • distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.

  • n_clusters (int, default=3) – Number of clusters.

  • max_iter (int, default=300) – Maximum SWAP iterations.

  • random_state (int or None, default=None) – Random seed (used only when init="random").

  • init (str or KMedoidsInit, default="build") – Medoid initialization strategy. "build" uses the deterministic BUILD phase of PAM; "random" selects initial medoids uniformly at random.

Returns:

Cluster labels.

Return type:

ndarray of shape (n,)

Raises:

ValueError – If init is not "build" or "random".
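
The BUILD and SWAP phases mentioned above can be illustrated with a deliberately simplified PAM sketch on a precomputed distance matrix. The function names pam and total_cost are hypothetical, the cost is recomputed naively on every trial, and the library's actual implementation may differ in tie-breaking, stopping rule, and efficiency.

```python
def total_cost(D, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in D)


def pam(D, n_clusters, max_iter=300):
    n = len(D)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < n_clusters:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(D, medoids + [c]))
        medoids.append(best)
    # SWAP: exchange a medoid for a non-medoid while that lowers the cost.
    cost = total_cost(D, medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, cost
        for pos in range(n_clusters):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(D, trial)
                if trial_cost < best_cost:
                    best_swap, best_cost = trial, trial_cost
        if best_swap is None:
            break                      # no swap improves the cost
        medoids, cost = best_swap, best_cost
    # Label each point with the position of its nearest medoid.
    labels = [min(range(n_clusters), key=lambda k: row[medoids[k]]) for row in D]
    return labels, medoids


# Toy data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
D = [[abs(p - q) for q in points] for p in points]
labels, medoids = pam(D, n_clusters=2)
```

On this toy input the sketch settles on the middle point of each group as medoid. Because BUILD is deterministic, no random seed is involved, which is consistent with random_state being used only when init="random".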

maldiamrkit.similarity.silhouette_scores(distance_matrix, labels)

Silhouette score for a clustering on a precomputed distance matrix.

Parameters:
  • distance_matrix (ndarray of shape (n, n)) – Symmetric pairwise distance matrix.

  • labels (ndarray of shape (n,)) – Cluster assignments.

Returns:

Mean silhouette coefficient in [-1, 1].

Return type:

float
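
For reference, the mean silhouette coefficient can be computed directly from a distance matrix as sketched below. The function name mean_silhouette is hypothetical, at least two clusters are assumed, and scoring singleton clusters as 0 is a common convention rather than something stated in this section.

```python
def mean_silhouette(D, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    n = len(labels)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)     # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


points = [0, 1, 2, 10, 11, 12]
D = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(D, [0, 0, 0, 1, 1, 1])   # matches the two groups
bad = mean_silhouette(D, [0, 1, 0, 1, 0, 1])    # mixes the two groups
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned samples.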

maldiamrkit.similarity.cluster_metadata_concordance(labels, metadata)

Evaluate clustering agreement with known metadata labels.

Parameters:
  • labels (ndarray of shape (n,)) – Cluster assignments.

  • metadata (Series of shape (n,)) – Ground-truth categorical labels.

Returns:

{"adjusted_rand_index": float, "normalized_mutual_info": float}.

Return type:

dict[str, float]
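
To make the first of the two returned scores concrete, here is a plain-Python sketch of the adjusted Rand index computed from the contingency table of two labelings. The function name adjusted_rand_index and the "R"/"S" metadata values are hypothetical; how the library computes the score internally is not stated here, and normalized mutual information is not covered by this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))          # contingency table cells
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)             # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                         # degenerate labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


clusters = [0, 0, 0, 1, 1, 1]
ari_match = adjusted_rand_index(clusters, ["R", "R", "R", "S", "S", "S"])
ari_mixed = adjusted_rand_index(clusters, ["R", "S", "R", "S", "R", "S"])
```

A score of 1 means the clustering reproduces the metadata groups exactly (up to relabeling), while scores around or below 0 mean agreement no better than chance.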

class maldiamrkit.similarity.ClusteringMethod(value)

Bases: str, Enum

Supported clustering algorithms for cluster_spectra().

Variables:
  • hierarchical (str) – Agglomerative hierarchical clustering.

  • hdbscan (str) – HDBSCAN density-based clustering.

  • kmedoids (str) – K-medoids (PAM) clustering.

hierarchical = 'hierarchical'
hdbscan = 'hdbscan'
kmedoids = 'kmedoids'

class maldiamrkit.similarity.KMedoidsInit(value)

Bases: str, Enum

Initialization strategy for kmedoids_clustering().

Variables:
  • build (str) – Deterministic BUILD phase of PAM.

  • random (str) – Random medoid selection.

build = 'build'
random = 'random'

Visualization

maldiamrkit.similarity.plot_distance_heatmap(distance_matrix, labels=None, *, metric=None, cmap='viridis', ax=None, title=None, figsize=None, annot=None, cluster=False, vmin=None, vmax=None, cbar_label='distance', show=True)

Plot a pairwise distance matrix as a heatmap.

Parameters:
  • distance_matrix (ndarray of shape (n, n)) – Symmetric distance matrix.

  • labels (list of str, ndarray, or None, default=None) – Tick labels for rows and columns.

  • metric (str, optional) – Name of the distance metric used (e.g. "cosine", "pearson", "spectral_contrast_angle"). When given and recognised, the colourbar limits are clamped to the metric’s theoretical bounds so heatmaps computed with the same metric share a comparable colour scale. Explicit vmin/vmax always win.

  • cmap (str, default="viridis") – Matplotlib / seaborn colormap name.

  • ax (Axes or None, default=None) – Pre-existing axes. If None, a new figure and axes are created.

  • title (str or None, default=None) – Plot title. Defaults to "Pairwise distance" (including the metric name, if provided).

  • figsize (tuple of float, optional) – Figure size in inches. When None, scales with n (side = min(16, 4 + 0.1 * n)). Only used when ax is None.

  • annot (bool, optional) – When True, annotate each cell with its distance value. When None (default), annotate only if the matrix is small (n ≤ 15).

  • cluster (bool, default=False) – When True, reorder rows and columns via hierarchical clustering so similar samples group visually. Labels reorder accordingly.

  • vmin (float, optional) – Explicit lower colourbar limit. Overrides any metric-derived bound.

  • vmax (float, optional) – Explicit upper colourbar limit. Overrides any metric-derived bound.

  • cbar_label (str, default="distance") – Label drawn on the colourbar.

  • show (bool, default=True) – Call plt.show() at the end.

Return type:

tuple[Figure, Axes]

Returns:

  • fig (matplotlib.figure.Figure)

  • ax (matplotlib.axes.Axes)
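
The default figure-size rule quoted for figsize can be worked through with a small helper. The function name default_figsize is hypothetical, and returning a square (side, side) tuple is an assumption based on the word "side" in the parameter description.

```python
def default_figsize(n):
    """Figure size in inches from the documented rule side = min(16, 4 + 0.1 * n)."""
    side = min(16, 4 + 0.1 * n)
    return (side, side)   # assumption: the figure is square


small = default_figsize(20)    # 4 + 0.1 * 20 = 6 inches per side
large = default_figsize(500)   # 4 + 0.1 * 500 = 54, capped at 16
```

The cap keeps very large matrices from producing unwieldy figures; for those, passing an explicit figsize or disabling annotations is likely the more practical choice.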

maldiamrkit.similarity.plot_dendrogram(linkage_matrix, labels=None, *, ax=None, title=None, figsize=(10, 6), leaf_rotation=90.0, color_threshold=None, truncate_mode=None, p=30, show=True)

Plot a dendrogram from a hierarchical clustering linkage matrix.

Parameters:
  • linkage_matrix (ndarray of shape (n - 1, 4)) – Linkage matrix from hierarchical_clustering().

  • labels (list of str or None, default=None) – Leaf labels.

  • ax (Axes or None, default=None) – Pre-existing axes.

  • title (str or None, default=None) – Plot title. Defaults to "Hierarchical clustering dendrogram".

  • figsize (tuple of float, default=(10, 6)) – Figure size in inches (used only when ax is None).

  • leaf_rotation (float, default=90.0) – Rotation (in degrees) of leaf labels along the bottom axis.

  • color_threshold (float, optional) – Colour threshold forwarded to scipy’s dendrogram. Clusters below this threshold share a colour. When None (default), scipy chooses 0.7 * max(linkage[:, 2]).

  • truncate_mode ({"lastp", "level", None}, optional) – Forwarded to scipy’s dendrogram to collapse deep branches. Essential for large trees; pair with p.

  • p (int, default=30) – Number of leaves / merges to keep when truncate_mode is set.

  • show (bool, default=True) – Call plt.show() at the end.

Return type:

tuple[Figure, Axes]

Returns:

  • fig (matplotlib.figure.Figure)

  • ax (matplotlib.axes.Axes)
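
The default colour threshold described for color_threshold can be checked by hand on a small linkage matrix. The matrix below is a made-up example for four leaves, written as plain lists rather than an ndarray so the sketch has no dependencies.

```python
# Hypothetical linkage matrix for 4 leaves; each row is
# [cluster_a, cluster_b, merge_distance, n_leaves_in_new_cluster].
linkage = [
    [0, 1, 1.0, 2],
    [2, 3, 2.0, 2],
    [4, 5, 10.0, 4],
]

# Equivalent of 0.7 * max(linkage[:, 2]) for an ndarray.
merge_heights = [row[2] for row in linkage]
default_threshold = 0.7 * max(merge_heights)

# Merges below the threshold are the ones drawn in a cluster colour.
coloured_merges = [row for row in linkage if row[2] < default_threshold]
```

Here the threshold is 7, so the two low merges are coloured as separate clusters and only the final merge at height 10 sits above the threshold.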

Example

from maldiamrkit.similarity import (
    pairwise_distances,
    cluster_spectra,
    plot_distance_heatmap,
    plot_dendrogram,
    hierarchical_clustering,
)

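# `spectra` and `sample_ids` are assumed to be loaded beforehand;
# "cosine" is a binned metric, so it expects binned intensity vectors.

# Compute pairwise distance matrix
D = pairwise_distances(spectra, metric="cosine", n_jobs=-1)

# Visualize distances; passing the metric clamps the colour scale to its bounds
plot_distance_heatmap(D, labels=sample_ids, metric="cosine")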

# Cluster spectra
labels = cluster_spectra(D, method="hierarchical", n_clusters=3)

# Plot dendrogram
linkage = hierarchical_clustering(D)
plot_dendrogram(linkage, labels=sample_ids)