pdmlabs.utils.distances
Pairwise distance calculation utilities for time-series data.
This module provides efficient functions for computing distances between multiple time-series subsequences using various metrics (Euclidean, DTW, cross-correlation, RBF kernels, etc.).
- Functions:
calculate_distance_many_to_many: Compute all-pairs distances between two dataframes
calculate_distance_many_to_one: Compute distances from one point to all rows in a dataframe
cross_dists: Compute cross-correlation distance
- Supported Metrics:
‘euclidean’: Euclidean distance
‘manhattan’: Manhattan/L1 distance
‘cosine’: Cosine distance
‘rbf_kernel’: RBF kernel distance (default gamma=0.5)
‘rbf_kernel{gamma}’: RBF with custom gamma, e.g., ‘rbf_kernel0.1’
Other scipy.spatial.distance.cdist metrics
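The 'rbf_kernel{gamma}' naming convention can be illustrated with a short sketch. This is not the module's implementation: parse_rbf_metric and rbf_distance are hypothetical helpers, and the kernel is assumed to be the standard Gaussian form exp(-gamma * ||x - y||^2), with the distance taken as 1 minus the kernel value.

```python
import re
import numpy as np

def parse_rbf_metric(metric, default_gamma=0.5):
    """Hypothetical helper: return gamma for an 'rbf_kernel...' string, else None."""
    match = re.fullmatch(r"rbf_kernel(.*)", metric)
    if match is None:
        return None  # not an RBF metric string
    suffix = match.group(1)
    return float(suffix) if suffix else default_gamma

def rbf_distance(a, b, gamma):
    """Assumed form: 1 - exp(-gamma * ||x - y||^2) for every pair of rows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared Euclidean distance between each row of a and each row of b
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 - np.exp(-gamma * sq_dists)

gamma = parse_rbf_metric("rbf_kernel0.1")  # custom gamma taken from the suffix
dist = rbf_distance([[1, 2, 3], [4, 5, 6]], [[1, 2, 4], [5, 6, 7]], gamma)
```

Under these assumptions, identical rows get distance 0 and the distance approaches 1 as rows move apart; a larger gamma makes that transition sharper.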
Example
>>> import pandas as pd
>>> from pdmlabs.utils.distances import calculate_distance_many_to_many
>>> reference_data = pd.DataFrame([[1,2,3], [4,5,6]])
>>> test_data = pd.DataFrame([[1,2,4], [5,6,7]])
>>> distances = calculate_distance_many_to_many(reference_data, test_data, 'euclidean')
>>> # Returns 2x2 distance matrix
Functions
calculate_distance_many_to_many | Compute all-pairs distances between rows of two dataframes.
calculate_distance_many_to_one | Compute distances from one vector to all rows in a dataframe.
cross_dists | Compute normalized cross-correlation distance between two time-series.
- pdmlabs.utils.distances.calculate_distance_many_to_many(a, b, metric)
Compute all-pairs distances between rows of two dataframes.
Calculates pairwise distances between all rows of dataframe a and all rows of dataframe b, essentially building a distance matrix of shape (n_samples_a, n_samples_b).
- Parameters:
a (pd.DataFrame) – First dataframe, each row is a time-series or feature vector.
b (pd.DataFrame) – Second dataframe, each row is a time-series or feature vector.
metric (str) – Distance metric to use:
- ‘euclidean’: L2 norm (default for most metrics)
- ‘manhattan’: L1 norm
- ‘cosine’: Cosine distance
- ‘rbf_kernel’: RBF kernel distance (1 - rbf_kernel(a, b))
- ‘rbf_kernel{gamma}’: RBF with custom gamma, e.g., ‘rbf_kernel0.1’
- Any metric supported by scipy.spatial.distance.cdist
- Returns:
Distance matrix of shape (len(a), len(b)). Element [i,j] is the distance between row i of a and row j of b.
- Return type:
np.ndarray
Examples
>>> import pandas as pd
>>> from pdmlabs.utils.distances import calculate_distance_many_to_many
>>> a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
>>> b = pd.DataFrame([[1, 2, 4], [7, 8, 9]])
>>> dist = calculate_distance_many_to_many(a, b, 'euclidean')
>>> print(dist)
[[1.0, 10.39230...]
 [4.69041..., 5.19615...]]
Notes
For RBF kernel: returns 1 - rbf_kernel, so distance increases as similarity decreases
Handles empty dataframes gracefully
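As a rough illustration of what an all-pairs distance matrix contains, the sketch below computes Euclidean distances with NumPy broadcasting. The function all_pairs_euclidean is a hypothetical stand-in written for this example, not the library's code, and it accepts anything np.asarray can convert (including a dataframe).

```python
import numpy as np

def all_pairs_euclidean(a, b):
    """Illustrative stand-in: Euclidean distance between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # diff has shape (len(a), len(b), n_features)
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 2, 4], [7, 8, 9]]
dist = all_pairs_euclidean(a, b)
# dist[i, j] is the distance between row i of a and row j of b,
# e.g. dist[0, 0] compares [1, 2, 3] with [1, 2, 4]
```

The result has shape (len(a), len(b)), so it is not symmetric in general and need not be square.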
- pdmlabs.utils.distances.calculate_distance_many_to_one(a, b, metric)
Compute distances from one vector to all rows in a dataframe.
Calculates distances between a single vector b and each row of dataframe a. The result holds the same values as the single column that calculate_distance_many_to_many would return if b contained one sample, flattened to a 1D array.
- Parameters:
a (pd.DataFrame) – Dataframe where each row is a time-series or feature vector.
b (np.ndarray or array-like) – Single vector (1D array) to compare against all rows of a. Will be reshaped to (1, -1) for distance computation.
metric (str) – Distance metric (same options as calculate_distance_many_to_many): ‘euclidean’, ‘manhattan’, ‘cosine’, ‘rbf_kernel’, etc.
- Returns:
1D array of shape (len(a),) where distances[i] is the distance between b and row i of a.
- Return type:
np.ndarray
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pdmlabs.utils.distances import calculate_distance_many_to_one
>>> profile_data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> test_point = np.array([1, 2, 4])
>>> distances = calculate_distance_many_to_one(profile_data, test_point, 'euclidean')
>>> print(distances)
[1.0, 4.69041..., 9.84885...]
Notes
More efficient than calling calculate_distance_many_to_many with a single sample
Useful for online/streaming scenarios where you score one sample against many
Input b doesn’t need to be in dataframe format
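The streaming use case mentioned in the notes might look roughly like the following. This is a sketch under stated assumptions: distances_to_rows is a hypothetical NumPy stand-in for the Euclidean case, and scoring each sample by its nearest-row distance is one possible choice, not something this module prescribes.

```python
import numpy as np

def distances_to_rows(a, b):
    """Illustrative stand-in: Euclidean distance from vector b to each row of a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float).reshape(1, -1)  # treat b as a single sample
    return np.sqrt(((a - b) ** 2).sum(axis=1))

profile = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
stream = [np.array([1, 2, 4]), np.array([7, 8, 8])]

# Score each incoming sample by its distance to the nearest profile row
# (nearest-row scoring is an assumption made for this example)
scores = [distances_to_rows(profile, sample).min() for sample in stream]
```

Each call returns one distance per row of the profile, so the per-sample cost grows with the number of profile rows rather than with the number of samples seen so far.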
- pdmlabs.utils.distances.cross_dists(s1, s2)
Compute normalized cross-correlation distance between two time-series.
Uses the maximum normalized cross-correlation coefficient between two sequences to measure similarity. Distance is 1 - (max correlation).
- Parameters:
s1 (np.ndarray) – First time-series (1D array).
s2 (np.ndarray) – Second time-series (1D array).
- Returns:
Distance value, where 0 means the two sequences correlate perfectly at some shift and 1 means the maximum normalized cross-correlation is 0.
- Return type:
float
Notes
Uses tslearn.metrics.cdist_normalized_cc for normalized cross-correlation
Effectively measures template matching or shape similarity
Invariant to small phase shifts and time warping (local)
Range: [0, 1] where 0 is perfect match
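To make the "1 - (max correlation)" definition concrete, here is a minimal NumPy sketch. It does not call tslearn and may differ from the module's behaviour in details such as padding or edge cases; ncc_distance and the zero-norm fallback are assumptions made for illustration.

```python
import numpy as np

def ncc_distance(s1, s2):
    """Illustrative sketch: 1 - max normalized cross-correlation over all shifts."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    denom = np.linalg.norm(s1) * np.linalg.norm(s2)
    if denom == 0.0:
        return 1.0  # assumption: treat an all-zero series as uncorrelated
    cc = np.correlate(s1, s2, mode="full")  # cross-correlation at every lag
    return 1.0 - cc.max() / denom

pulse = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
shifted = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape, one step later
spike = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])    # different shape

d_same_shape = ncc_distance(pulse, shifted)
d_other_shape = ncc_distance(pulse, spike)
```

In this sketch the shifted copy of the pulse scores a distance of essentially 0 because some lag aligns the two series exactly, while the spike scores a clearly larger distance.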