pdmlabs.evaluation.vus.utils.utility
A set of utility functions to support outlier detection.
Functions
| EE | Given a list of positive values forming a histogram drawn from any information source, returns the empirical entropy of its discrete probability function. |
| EuclideanDist | |
| all_branches | |
| argmaxn | Return the indices of the top n elements in the list if order is set to 'desc'; otherwise return the indices of the n smallest ones. |
| branch2num | |
| c_factor | |
| check_detector | Checks whether the fit and decision_function methods exist for the given detector. |
| check_parameter | Check if an input is within the defined range. |
| create_tree | Creates a DiFF tree using a sample of size sample_size of the original data. |
| dist2set | |
| gen_graph | |
| generate_bagging_indices | Randomly draw feature indices. |
| generate_indices | Draw randomly sampled indices. |
| getSplit | Randomly selects a split value from the set of scalar data 'X'. |
| get_diff_elements | Get the elements in li1 but not li2, and vice versa. |
| get_intersection | Get the overlap between two lists. |
| get_label_n | Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores. |
| get_list_anomaly | |
| get_list_diff | Get the elements in li1 but not li2. |
| invert_order | Invert the order of a list of values. |
| pairwise_distances_no_broadcast | Utility function to calculate the row-wise Euclidean distance of two matrices. |
| precision_n_scores | Utility function to calculate precision @ rank n. |
| score_to_label | Turn raw outlier scores into binary labels (0 or 1). |
| similarityScore | Given a set of instances S falling into node and a value alpha >= 0, returns for every element x in S the weighted similarity score between x and the centroid M of S (node.M). |
| standardizer | Conduct Z-normalization on data so that input samples have zero mean and unit variance. |
| walk_tree | Recursive function that walks a tree from an already fitted forest to compute the path length of the new observations. |
| weightFeature | Given a list of values corresponding to a feature dimension, returns a weight (in [0,1]) that is one minus the normalized empirical entropy, a way to characterize the importance of the feature dimension. |
- pdmlabs.evaluation.vus.utils.utility.EE(hist)
Given a list of positive values forming a histogram drawn from any information source, returns the empirical entropy of its discrete probability function.
- Parameters:
hist (array) – histogram
- Returns:
empirical entropy estimated from the histogram
- Return type:
float
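The entropy computation described for EE can be illustrated with a minimal, dependency-free sketch. The function name `empirical_entropy`, the use of the natural logarithm, and the convention that empty bins contribute nothing are assumptions made for illustration; the library's own implementation may differ.

```python
import math


def empirical_entropy(hist):
    """Shannon entropy (in nats) of the distribution obtained by normalizing hist.

    Hypothetical helper for illustration; not the library's EE function.
    """
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must contain at least one positive count")
    entropy = 0.0
    for count in hist:
        if count > 0:  # empty bins are skipped (0 * log 0 is taken as 0)
            p = count / total
            entropy -= p * math.log(p)
    return entropy
```

A flat histogram over k bins gives the maximum value log(k) under these assumptions, while a histogram with all mass in one bin gives 0.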
- pdmlabs.evaluation.vus.utils.utility.EuclideanDist(x, y)
- pdmlabs.evaluation.vus.utils.utility.all_branches(node, current=[], branches=None)
- pdmlabs.evaluation.vus.utils.utility.argmaxn(value_list, n, order='desc')
Return the indices of the top n elements in the list if order is set to 'desc'; otherwise return the indices of the n smallest ones.
- Parameters:
value_list (list, array, numpy array of shape (n_samples,)) – A list containing all values.
n (int) – The number of elements to select.
order (str, optional (default='desc')) – The order to sort {'desc', 'asc'}:
'desc': descending
'asc': ascending
- Returns:
index_list – The indices of the top n elements.
- Return type:
numpy array of shape (n,)
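The selection that argmaxn describes can be sketched in plain Python as follows. The name `argmaxn_sketch`, the error handling, and the order in which the selected indices are returned are assumptions; the documented function returns a numpy array, whereas this sketch returns a list.

```python
def argmaxn_sketch(value_list, n, order="desc"):
    """Return the indices of the n largest ('desc') or n smallest ('asc') values.

    Hypothetical stand-in for illustration only.
    """
    if order not in ("desc", "asc"):
        raise ValueError("order must be 'desc' or 'asc'")
    if not 0 < n <= len(value_list):
        raise ValueError("n must be in the range (0, len(value_list)]")
    # Rank positions by their value, largest first when order is 'desc'.
    ranked = sorted(
        range(len(value_list)),
        key=lambda i: value_list[i],
        reverse=(order == "desc"),
    )
    return ranked[:n]
```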
- pdmlabs.evaluation.vus.utils.utility.branch2num(branch, init_root=0)
- pdmlabs.evaluation.vus.utils.utility.c_factor(n)
- pdmlabs.evaluation.vus.utils.utility.check_detector(detector)
Checks whether the fit and decision_function methods exist for the given detector.
- Parameters:
detector (pyod.models) – Detector instance for which the check is performed.
- pdmlabs.evaluation.vus.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)
Check if an input is within the defined range.
- Parameters:
param (int, float) – The input parameter to check.
low (int, float) – The lower bound of the range.
high (int, float) – The upper bound of the range.
param_name (str, optional (default='')) – The name of the parameter.
include_left (bool, optional (default=False)) – Whether to include the lower bound (lower bound <=).
include_right (bool, optional (default=False)) – Whether to include the upper bound (<= upper bound).
- Returns:
within_range – Whether the parameter is within the range (low, high)
- Return type:
bool or raise errors
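How include_left and include_right change the comparison can be shown with a short sketch. The names `in_range` and `check_parameter_sketch`, the choice of `ValueError`, and the error message wording are all assumptions; the source only states that the function returns a bool or raises errors.

```python
def in_range(param, low, high, include_left=False, include_right=False):
    """Pure comparison: is param inside the interval with the chosen bound inclusion?"""
    above = param >= low if include_left else param > low
    below = param <= high if include_right else param < high
    return above and below


def check_parameter_sketch(param, low=-2147483647, high=2147483647,
                           param_name="", include_left=False,
                           include_right=False):
    """Return True when param is in range, otherwise raise ValueError.

    Hypothetical stand-in; the exception type is an assumption.
    """
    if low >= high:
        raise ValueError("low must be strictly smaller than high")
    if not in_range(param, low, high, include_left, include_right):
        left = "[" if include_left else "("
        right = "]" if include_right else ")"
        raise ValueError(
            f"{param_name} is set to {param}; "
            f"expected a value in {left}{low}, {high}{right}"
        )
    return True
```

With the default flags both bounds are excluded, so a value equal to low or high is rejected unless the matching flag is set.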
- pdmlabs.evaluation.vus.utils.utility.create_tree(X, featureDistrib, sample_size, max_height)
Creates a DiFF tree using a sample of size sample_size of the original data.
- Parameters:
X (nD array) – nD array with the observations. Dimensions should be (n_obs, n_features).
sample_size (int) – Size of the sample from which a DiFF tree is built.
max_height (int) – Maximum height of the tree.
- Return type:
a DiFF tree
- pdmlabs.evaluation.vus.utils.utility.dist2set(x, X)
- pdmlabs.evaluation.vus.utils.utility.gen_graph(branches, g=None, init_root=0, pre='')
- pdmlabs.evaluation.vus.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)
Randomly draw feature indices. Internal use only. Modified from sklearn/ensemble/bagging.py
- Parameters:
random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.
bootstrap_features (bool) – Specifies whether to bootstrap index generation
n_features (int) – Specifies the population size when generating indices
min_features (int) – Lower limit for number of features to randomly sample
max_features (int) – Upper limit for number of features to randomly sample
- Returns:
feature_indices – Indices for features to bag
- Return type:
numpy array, shape (n_samples,)
- pdmlabs.evaluation.vus.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)
Draw randomly sampled indices. Internal use only. See sklearn/ensemble/bagging.py
- Parameters:
random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.
bootstrap (bool) – Specifies whether to bootstrap index generation
n_population (int) – Specifies the population size when generating indices
n_samples (int) – Specifies number of samples to draw
- Returns:
indices – randomly drawn indices
- Return type:
numpy array, shape (n_samples,)
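The difference between bootstrap and non-bootstrap index generation can be sketched as below. This version uses the standard library's `random.Random` as a stand-in for the numpy `RandomState` the documented function expects, and the name `draw_indices` is hypothetical; it is an illustration of the sampling idea, not the library's code.

```python
import random


def draw_indices(rng, bootstrap, n_population, n_samples):
    """Draw n_samples indices from range(n_population).

    bootstrap=True samples with replacement (indices may repeat);
    bootstrap=False samples without replacement (indices are unique).
    """
    if bootstrap:
        return [rng.randrange(n_population) for _ in range(n_samples)]
    return rng.sample(range(n_population), n_samples)


rng = random.Random(0)  # fixed seed so repeated runs draw the same indices
unique_indices = draw_indices(rng, False, 10, 5)
bootstrap_indices = draw_indices(rng, True, 3, 8)
```

Sampling without replacement requires n_samples <= n_population, whereas a bootstrap draw can request more samples than the population holds.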
- pdmlabs.evaluation.vus.utils.utility.getSplit(X)
Randomly selects a split value from the set of scalar data 'X'. Returns the split value.
- Parameters:
X (array) – Array of scalar values
- Returns:
split value
- Return type:
float
- pdmlabs.evaluation.vus.utils.utility.get_diff_elements(li1, li2)
Get the elements in li1 but not li2, and vice versa.
- Parameters:
li1 (list or numpy array) – Input list 1.
li2 (list or numpy array) – Input list 2.
- Returns:
difference – The difference between li1 and li2.
- Return type:
list
- pdmlabs.evaluation.vus.utils.utility.get_intersection(lst1, lst2)
Get the overlap between two lists.
- Parameters:
lst1 (list or numpy array) – Input list 1.
lst2 (list or numpy array) – Input list 2.
- Returns:
intersection – The elements common to lst1 and lst2.
- Return type:
list
- pdmlabs.evaluation.vus.utils.utility.get_label_n(y, y_pred, n=None)
Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.
- Parameters:
y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.
- Returns:
labels – binary labels 0: normal points and 1: outliers
- Return type:
numpy array of shape (n_samples,)
Examples
>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])
- pdmlabs.evaluation.vus.utils.utility.get_list_anomaly(labels)
- pdmlabs.evaluation.vus.utils.utility.get_list_diff(li1, li2)
Get the elements in li1 but not li2 (li1 - li2).
- Parameters:
li1 (list or numpy array) – Input list 1.
li2 (list or numpy array) – Input list 2.
- Returns:
difference – The difference between li1 and li2.
- Return type:
list
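The three list helpers documented above (get_diff_elements, get_intersection, get_list_diff) differ only in which elements they keep. The sketch below contrasts them; the function names are hypothetical, and the preservation of input order and of duplicates is an assumption that the source does not specify.

```python
def list_diff(li1, li2):
    """Elements of li1 that do not appear in li2 (one-sided difference)."""
    exclude = set(li2)
    return [x for x in li1 if x not in exclude]


def diff_elements(li1, li2):
    """Elements found in exactly one of the two lists (symmetric difference)."""
    return list_diff(li1, li2) + list_diff(li2, li1)


def intersection(lst1, lst2):
    """Elements of lst1 that also appear in lst2."""
    keep = set(lst2)
    return [x for x in lst1 if x in keep]
```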
- pdmlabs.evaluation.vus.utils.utility.invert_order(scores, method='multiplication')
Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score orders may differ.
- Parameters:
scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted
method (str, optional (default='multiplication')) – Method used for order inversion. Valid methods are:
'multiplication': multiply by -1
'subtraction': max(scores) - scores
- Returns:
inverted_scores – The inverted list
- Return type:
numpy array of shape (n_samples,)
Examples
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])
- pdmlabs.evaluation.vus.utils.utility.pairwise_distances_no_broadcast(X, Y)
Utility function to calculate the row-wise Euclidean distance of two matrices. Unlike a pairwise calculation, this function does not broadcast. For instance, if X and Y are both (4,3) matrices, the function returns a distance vector with shape (4,) instead of (4,4).
- Parameters:
X (array of shape (n_samples, n_features)) – First input samples
Y (array of shape (n_samples, n_features)) – Second input samples
- Returns:
distance – Row-wise Euclidean distance of X and Y
- Return type:
array of shape (n_samples,)
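The row-wise (non-broadcast) behaviour can be illustrated with a small pure-Python sketch: row i of X is compared only with row i of Y, so n rows yield n distances. The name `rowwise_euclidean` is hypothetical, and nested lists stand in for the arrays the documented function accepts.

```python
import math


def rowwise_euclidean(X, Y):
    """Distance between X[i] and Y[i] for every row i (no cross-row pairs)."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    # math.dist (Python 3.8+) raises ValueError if two rows differ in length.
    return [math.dist(row_x, row_y) for row_x, row_y in zip(X, Y)]


distances = rowwise_euclidean([[0, 0], [1, 1]], [[3, 4], [1, 1]])
```

A full pairwise computation on the same inputs would instead produce a 2 x 2 matrix of distances.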
- pdmlabs.evaluation.vus.utils.utility.precision_n_scores(y, y_pred, n=None)
Utility function to calculate precision @ rank n.
- Parameters:
y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.
- Returns:
precision_at_rank_n – Precision at rank n score.
- Return type:
float
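Precision @ rank n can be read as: label the n highest-scoring samples as outliers, then measure what fraction of those are true outliers. The sketch below follows that reading; the names `top_n_labels` and `precision_at_n` are hypothetical, and the handling of tied scores is an assumption that may differ from the library.

```python
def top_n_labels(y_pred, n):
    """Binary labels with 1 assigned to the n highest scores."""
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    labels = [0] * len(y_pred)
    for i in ranked[:n]:
        labels[i] = 1
    return labels


def precision_at_n(y, y_pred, n=None):
    """Fraction of the top-n scored samples that are true outliers."""
    if len(y) != len(y_pred):
        raise ValueError("y and y_pred must have the same length")
    if n is None:
        # Infer n from the ground truth: the number of labelled outliers.
        n = sum(1 for label in y if label == 1)
    if n <= 0:
        raise ValueError("n must be positive")
    labels = top_n_labels(y_pred, n)
    true_positives = sum(
        1 for truth, pred in zip(y, labels) if truth == 1 and pred == 1
    )
    return true_positives / sum(labels)
```

Using the data from the get_label_n example, two samples are labelled as outliers and one of them is a true outlier, so the sketch yields 0.5.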
- pdmlabs.evaluation.vus.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)
Turn raw outlier scores into binary labels (0 or 1).
- Parameters:
pred_scores (list or numpy array of shape (n_samples,)) – Raw outlier scores. Outliers are assumed to have larger values.
outliers_fraction (float in (0,1)) – Fraction of outliers.
- Returns:
outlier_labels – For each observation, tells whether or not it should be considered an outlier (1) or an inlier (0).
- Return type:
numpy array of shape (n_samples,)
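One plausible way to turn scores into labels with an outlier fraction is to threshold at the (1 - outliers_fraction) percentile and mark everything strictly above it. The sketch below assumes that approach, with a linearly interpolated percentile; the names `percentile` and `score_to_label_sketch` are hypothetical and the library's thresholding rule may differ.

```python
import math


def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(math.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def score_to_label_sketch(pred_scores, outliers_fraction=0.1):
    """Label scores strictly above the (1 - fraction) percentile as outliers."""
    if not 0 < outliers_fraction < 1:
        raise ValueError("outliers_fraction must be in (0, 1)")
    threshold = percentile(pred_scores, 100 * (1 - outliers_fraction))
    return [1 if score > threshold else 0 for score in pred_scores]
```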
- pdmlabs.evaluation.vus.utils.utility.similarityScore(S, node, alpha)
Given a set of instances S falling into node and a value alpha >= 0, returns for every element x in S the weighted similarity score between x and the centroid M of S (node.M).
- Parameters:
S (array of instances) – Array of instances that fall into a node
node (a DiFF tree node) – S is the set of instances "falling" into the node
alpha (float) – alpha is the distance scaling hyper-parameter
- Returns:
the array of similarity values between the instances in S and the mean of training instances falling in node
- Return type:
array
- pdmlabs.evaluation.vus.utils.utility.standardizer(X, X_t=None, keep_scalar=False)
Conduct Z-normalization on data so that input samples have zero mean and unit variance.
- Parameters:
X (numpy array of shape (n_samples, n_features)) – The training samples
X_t (numpy array of shape (n_samples_new, n_features), optional (default=None)) – The data to be converted
keep_scalar (bool, optional (default=False)) – The flag to indicate whether to return the scaler
- Returns:
X_norm (numpy array of shape (n_samples, n_features)) – X after the Z-score normalization
X_t_norm (numpy array of shape (n_samples_new, n_features)) – X_t after the Z-score normalization
scalar (sklearn scaler object) – The scaler used in conversion
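The key point of the X / X_t split is that the mean and standard deviation are estimated on X only and then reused to transform X_t. The pure-Python sketch below illustrates that idea; the names `fit_zscore` and `apply_zscore`, the use of the population standard deviation, and the fallback scale of 1.0 for constant features are assumptions rather than a description of the library's code.

```python
import math


def fit_zscore(X):
    """Per-feature mean and population standard deviation estimated from X."""
    n_samples = len(X)
    n_features = len(X[0])
    means = [sum(row[j] for row in X) / n_samples for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((row[j] - means[j]) ** 2 for row in X) / n_samples
        # A constant feature has zero variance; scale by 1.0 to avoid dividing by 0.
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds


def apply_zscore(X, means, stds):
    """Transform rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X = [[1.0, 10.0], [3.0, 10.0]]
X_t = [[5.0, 12.0]]
means, stds = fit_zscore(X)          # statistics come from the training data only
X_norm = apply_zscore(X, means, stds)
X_t_norm = apply_zscore(X_t, means, stds)
```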
- pdmlabs.evaluation.vus.utils.utility.walk_tree(forest, node, treeIdx, obsIdx, X, featureDistrib, depth=0, alpha=0.01)
Recursive function that walks a tree from an already fitted forest to compute the path length of the new observations.
- Parameters:
forest (DiFF_RF) – A fitted forest of DiFF trees
node (DiFF Tree node) – the current node
treeIdx (int) – index of the tree that is being walked.
obsIdx (array) – 1D array of length n_obs. 1/0 if the obs has reached / has not reached the node.
X (nD array) – array of observations/instances.
depth (int) – current depth.
- Return type:
None
- pdmlabs.evaluation.vus.utils.utility.weightFeature(s, nbins)
Given a list of values corresponding to a feature dimension, returns a weight (in [0,1]) that is one minus the normalized empirical entropy, a way to characterize the importance of the feature dimension.
- Parameters:
s (array) – list of scalar values corresponding to a feature dimension
nbins (int) – the number of bins used to discretize the feature dimension using a histogram.
- Returns:
the importance weight for feature s.
- Return type:
float
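The "one minus normalized empirical entropy" weight can be sketched as follows. The equal-width binning, the natural logarithm, and the normalization by log(nbins) (the largest entropy an nbins-bin histogram can reach) are assumptions, as are the names `histogram` and `feature_weight`; the library's binning and normalization may differ.

```python
import math


def histogram(values, nbins):
    """Counts of values in nbins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    counts = [0] * nbins
    if hi == lo:  # constant feature: everything falls into one bin
        counts[0] = len(values)
        return counts
    width = (hi - lo) / nbins
    for v in values:
        # The maximum value would land in bin nbins; clamp it to the last bin.
        index = min(int((v - lo) / width), nbins - 1)
        counts[index] += 1
    return counts


def feature_weight(values, nbins):
    """One minus the entropy of the histogram, normalized by log(nbins)."""
    if nbins < 2:
        raise ValueError("nbins must be at least 2")
    counts = histogram(values, nbins)
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1.0 - entropy / math.log(nbins)
```

Under these assumptions a feature whose values spread evenly over all bins gets a weight near 0, while a feature concentrated in a single bin gets a weight of 1.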