pdmlabs.utils.distances

Pairwise distance calculation utilities for time-series data.

This module provides efficient functions for computing distances between multiple time-series subsequences using various metrics (Euclidean, DTW, cross-correlation, RBF kernels, etc.).

Functions:

calculate_distance_many_to_many: Compute all-pairs distances between two dataframes
calculate_distance_many_to_one: Compute distances from one point to all rows in a dataframe
cross_dists: Compute cross-correlation distance

Supported Metrics:

'euclidean': Euclidean distance
'manhattan': Manhattan/L1 distance
'cosine': Cosine distance
'rbf_kernel': RBF kernel distance (default gamma=0.5)
'rbf_kernel{gamma}': RBF with custom gamma, e.g., 'rbf_kernel0.1'
Other scipy.spatial.distance.cdist metrics
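
The 'rbf_kernel{gamma}' naming scheme embeds the gamma value in the metric string. How the module decodes that string is not shown here; the following is only a minimal sketch of one way such a name could be parsed. The helper name `parse_rbf_metric` and the constant `DEFAULT_GAMMA` are hypothetical and are not part of pdmlabs.

```python
import re

# Assumption: mirrors the default gamma stated in the metric list.
DEFAULT_GAMMA = 0.5


def parse_rbf_metric(metric):
    """Hypothetical helper: return gamma for 'rbf_kernel...' names, else None."""
    match = re.fullmatch(r"rbf_kernel(.*)", metric)
    if match is None:
        return None  # not an RBF metric name
    suffix = match.group(1)
    # An empty suffix means the plain 'rbf_kernel' name, so use the default.
    return float(suffix) if suffix else DEFAULT_GAMMA


print(parse_rbf_metric("rbf_kernel"))     # 0.5
print(parse_rbf_metric("rbf_kernel0.1"))  # 0.1
print(parse_rbf_metric("euclidean"))      # None
```

A malformed suffix such as 'rbf_kernelabc' raises ValueError in this sketch; the real module may behave differently.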

Example

>>> import pandas as pd
>>> from pdmlabs.utils.distances import calculate_distance_many_to_many
>>> reference_data = pd.DataFrame([[1,2,3], [4,5,6]])
>>> test_data = pd.DataFrame([[1,2,4], [5,6,7]])
>>> distances = calculate_distance_many_to_many(reference_data, test_data, 'euclidean')
>>> # Returns 2x2 distance matrix

Functions

`calculate_distance_many_to_many`(a, b, metric)	Compute all-pairs distances between rows of two dataframes.
`calculate_distance_many_to_one`(a, b, metric)	Compute distances from one vector to all rows in a dataframe.
`cross_dists`(s1, s2)	Compute normalized cross-correlation distance between two time-series.

pdmlabs.utils.distances.calculate_distance_many_to_many(a, b, metric)

Compute all-pairs distances between rows of two dataframes.

Calculates the distance between every row of dataframe a and every row of dataframe b, producing a distance matrix of shape (n_samples_a, n_samples_b).

Parameters:

a (pd.DataFrame) – First dataframe, each row is a time-series or feature vector.
b (pd.DataFrame) – Second dataframe, each row is a time-series or feature vector.
metric (str) – Distance metric to use:
- 'euclidean': L2 norm (default for most metrics)
- 'manhattan': L1 norm
- 'cosine': Cosine distance
- 'rbf_kernel': RBF kernel distance (1 - rbf_kernel(a, b))
- 'rbf_kernel{gamma}': RBF with custom gamma, e.g., 'rbf_kernel0.1'
- Any metric supported by scipy.spatial.distance.cdist

Returns:

Distance matrix of shape (len(a), len(b)). Element [i,j] is the distance between row i of a and row j of b.

Return type:

np.ndarray

Examples

>>> import pandas as pd
>>> a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
>>> b = pd.DataFrame([[1, 2, 4], [7, 8, 9]])
>>> dist = calculate_distance_many_to_many(a, b, 'euclidean')
>>> print(dist)
[[1.0, 10.39230...]
 [4.69041..., 5.19615...]]
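
The element [i, j] convention can be cross-checked independently of the library. The sketch below computes all-pairs Euclidean distances with plain NumPy broadcasting. The function name `pairwise_euclidean` is hypothetical, and this is not a claim about how calculate_distance_many_to_many is implemented internally.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Hypothetical reference: all-pairs Euclidean distances, shape (len(a), len(b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # diff[i, j, :] is row i of a minus row j of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 2, 4], [7, 8, 9]]
dist = pairwise_euclidean(a, b)
print(dist.shape)  # (2, 2)
print(dist[0, 0])  # 1.0, since the rows differ by 1 in a single coordinate
```

Here dist[0, 1] equals sqrt(108) because every coordinate of the two rows differs by 6.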

Notes

For RBF kernel: returns 1 - rbf_kernel, so distance increases as similarity decreases
Handles empty dataframes gracefully
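
To illustrate the "1 - rbf_kernel" note, here is a minimal NumPy sketch. It assumes the common kernel definition K(x, y) = exp(-gamma * ||x - y||^2), which is also the convention used by scikit-learn's rbf_kernel. The function name `rbf_distance` is hypothetical, and whether pdmlabs computes the kernel exactly this way is an assumption.

```python
import numpy as np


def rbf_distance(a, b, gamma=0.5):
    """Hypothetical sketch: 1 - exp(-gamma * squared Euclidean distance)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared Euclidean distance between every row of a and every row of b.
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 - np.exp(-gamma * sq_dist)


reference = [[0.0, 0.0]]
candidates = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
d = rbf_distance(reference, candidates)
print(d.shape)  # (1, 3)
```

Under this definition identical rows give a distance of 0, and the distance approaches (but never reaches) 1 as rows move apart.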

pdmlabs.utils.distances.calculate_distance_many_to_one(a, b, metric)

Compute distances from one vector to all rows in a dataframe.

Calculates distances between a single vector b and each row of dataframe a. The result corresponds to the single column of the calculate_distance_many_to_many output when b contains one sample, flattened to 1D.

Parameters:

a (pd.DataFrame) – Dataframe where each row is a time-series or feature vector.
b (np.ndarray or array-like) – Single vector (1D array) to compare against all rows of a. Will be reshaped to (1, -1) for distance computation.
metric (str) – Distance metric (same options as calculate_distance_many_to_many): 'euclidean', 'manhattan', 'cosine', 'rbf_kernel', etc.

Returns:

1D array of shape (len(a),) where distances[i] is the distance between b and row i of a.

Return type:

np.ndarray

Examples

>>> import pandas as pd
>>> import numpy as np
>>> profile_data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> test_point = np.array([1, 2, 4])
>>> distances = calculate_distance_many_to_one(profile_data, test_point, 'euclidean')
>>> print(distances)
[1.0, 4.69041..., 9.84885...]

Notes

More efficient than calling calculate_distance_many_to_many with a single sample
Useful for online/streaming scenarios where you score one sample against many
Input b doesn’t need to be in dataframe format
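
The reshape of b to (1, -1) described in the parameters can be sketched with plain NumPy for the Euclidean case. The function name `one_to_many_euclidean` is hypothetical; this is an independent illustration of the one-to-many pattern, not the library's implementation.

```python
import numpy as np


def one_to_many_euclidean(a, b):
    """Hypothetical sketch: distance from vector b to every row of a."""
    a = np.asarray(a, dtype=float)
    # Reshape the 1D vector to a single row so it broadcasts against a.
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.linalg.norm(a - b, axis=1)


profile = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
point = [1, 2, 4]
d = one_to_many_euclidean(profile, point)
print(d.shape)  # (3,)
print(d[0])     # 1.0
```

Because b is accepted as any array-like, a plain Python list works here as well as a NumPy array.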

pdmlabs.utils.distances.cross_dists(s1, s2)

Compute normalized cross-correlation distance between two time-series.

Uses the maximum normalized cross-correlation coefficient between two sequences to measure similarity. Distance is 1 - (max correlation).

Parameters:

s1 (np.ndarray) – First time-series (1D array).
s2 (np.ndarray) – Second time-series (1D array).

Returns:

Distance value between 0 (identical) and 1 (completely anticorrelated).

Return type:

float

Notes

Uses tslearn.metrics.cdist_normalized_cc for normalized cross-correlation
Effectively measures template matching or shape similarity
Tolerant of small phase shifts and local time warping
Range: [0, 1] where 0 is perfect match
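
The "1 - (max correlation)" idea can be sketched with NumPy alone. The version below takes the maximum of the cross-correlation over all lags and divides by the product of the two Euclidean norms. The function name `ncc_distance` is hypothetical, and this is a simplified stand-in: it does not call tslearn, and any preprocessing or clipping that cross_dists may apply is not reproduced.

```python
import numpy as np


def ncc_distance(s1, s2):
    """Hypothetical sketch: 1 - max normalized cross-correlation over all lags."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    if denom == 0.0:
        # Assumption: treat an all-zero series as uncorrelated with anything.
        return 1.0
    # Cross-correlation at every possible lag between the two series.
    cc = np.correlate(s1, s2, mode="full")
    return 1.0 - cc.max() / denom


pulse = [0, 1, 2, 1, 0, 0]
shifted = [0, 0, 0, 1, 2, 1]  # same pulse, delayed by two samples
flat = [1, 1, 1, 1, 1, 1]

d_shift = ncc_distance(pulse, shifted)
d_flat = ncc_distance(pulse, flat)
print(d_shift < d_flat)  # True: the shifted copy matches better than the flat series
```

In this sketch a shifted copy of a pulse scores essentially 0 because the best lag realigns the two series. The unclipped form shown here can exceed 1 when the inputs have opposite sign at every lag.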
