pdmlabs.evaluation.vus.utils.utility

A set of utility functions to support outlier detection.

Functions

EE(hist)

Given a list of positive values forming a histogram drawn from any information source, returns the empirical entropy of its discrete probability distribution.

EuclideanDist(x, y)

all_branches(node[, current, branches])

argmaxn(value_list, n[, order])

Return the indices of the top n elements of value_list if order is 'desc' (the default); if order is 'asc', return the indices of the n smallest ones.

branch2num(branch[, init_root])

c_factor(n)

check_detector(detector)

Checks if the fit and decision_function methods exist for the given detector.

check_parameter(param[, low, high, ...])

Check if an input is within the defined range.

create_tree(X, featureDistrib, sample_size, ...)

Creates a DiFF tree using a sample of size sample_size of the original data.

dist2set(x, X)

gen_graph(branches[, g, init_root, pre])

generate_bagging_indices(random_state, ...)

Randomly draw feature indices using random_state, a RandomState instance. Internal use only. Modified from sklearn/ensemble/bagging.py.

generate_indices(random_state, bootstrap, ...)

Draw randomly sampled indices using random_state, a RandomState instance. Internal use only. See sklearn/ensemble/bagging.py.

getSplit(X)

Randomly selects a split value from set of scalar data 'X'.

get_diff_elements(li1, li2)

Get the elements in li1 but not li2, and vice versa.

get_intersection(lst1, lst2)

Get the overlap between two lists.

get_label_n(y, y_pred[, n])

Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

get_list_anomaly(labels)

get_list_diff(li1, li2)

Get the elements in li1 but not li2.

invert_order(scores[, method])

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score orders could differ. The method argument (a str, default 'multiplication') selects either 'multiplication' (multiply by -1) or 'subtraction' (max(scores) - scores).

pairwise_distances_no_broadcast(X, Y)

Utility function to calculate the row-wise Euclidean distance between two matrices.

precision_n_scores(y, y_pred[, n])

Utility function to calculate precision @ rank n.

score_to_label(pred_scores[, outliers_fraction])

Turn raw outlier scores into binary labels (0 or 1).

similarityScore(S, node, alpha)

Given a set of instances S falling into node and a value alpha >= 0, returns for every element x in S the weighted similarity score between x and the centroid M of S (node.M).

standardizer(X[, X_t, keep_scalar])

Conduct Z-normalization on data so that the input samples have zero mean and unit variance.

walk_tree(forest, node, treeIdx, obsIdx, X, ...)

Recursive function that walks a tree from an already fitted forest to compute the path length of the new observations.

weightFeature(s, nbins)

Given a list of values corresponding to a feature dimension, returns a weight (in [0,1]) that is one minus the normalized empirical entropy, a way to characterize the importance of the feature dimension.

pdmlabs.evaluation.vus.utils.utility.EE(hist)

Given a list of positive values forming a histogram drawn from any information source, returns the empirical entropy of its discrete probability distribution.

Parameters:

hist (array) – histogram

Returns:

empirical entropy estimated from the histogram

Return type:

float
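As an illustration of the quantity EE describes, the following is a minimal sketch rather than the module's implementation. The function name empirical_entropy, the base-2 logarithm, and the handling of an empty histogram are assumptions made for this example.

```python
import numpy as np


def empirical_entropy(hist):
    """Empirical entropy (in bits) of the distribution implied by a histogram.

    Hypothetical helper for illustration; not the module's EE function.
    """
    counts = np.asarray(hist, dtype=float)
    total = counts.sum()
    if total <= 0:
        # Assumption: an empty histogram is treated as carrying no information.
        return 0.0
    p = counts / total      # normalize counts to probabilities
    p = p[p > 0]            # drop empty bins, treating 0 * log(0) as 0
    return float(np.sum(p * np.log2(1.0 / p)))


uniform = empirical_entropy([5, 5, 5, 5])     # four equally likely bins
peaked = empirical_entropy([10, 0, 0, 0])     # all mass in a single bin
```

A histogram with equal counts in every bin maximizes the entropy, while a histogram with all its mass in one bin gives zero.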

pdmlabs.evaluation.vus.utils.utility.EuclideanDist(x, y)
pdmlabs.evaluation.vus.utils.utility.all_branches(node, current=[], branches=None)
pdmlabs.evaluation.vus.utils.utility.argmaxn(value_list, n, order='desc')

Return the indices of the top n elements in the list if order is set to 'desc'; otherwise return the indices of the n smallest ones.

Parameters:
  • value_list (list, array, numpy array of shape (n_samples,)) – A list containing all values.

  • n (int) – The number of elements to select.

  • order (str, optional (default='desc')) – The sort order, one of {'desc', 'asc'}:

      • 'desc': descending

      • 'asc': ascending

Returns:

index_list – The index of the top n elements.

Return type:

numpy array of shape (n,)
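The selection described for argmaxn can be sketched with NumPy as follows. This is an illustrative version under stated assumptions, not the module's code: the name top_n_indices is hypothetical, and returning the indices ordered from best to worst is a choice made for this example (the library may return them in a different order).

```python
import numpy as np


def top_n_indices(values, n, order="desc"):
    """Indices of the n largest ('desc') or n smallest ('asc') values.

    Hypothetical helper for illustration; not the module's argmaxn.
    """
    values = np.asarray(values)
    if order not in ("desc", "asc"):
        raise ValueError("order must be 'desc' or 'asc'")
    if not 0 < n <= len(values):
        raise ValueError("n must be in (0, len(values)]")
    ranked = np.argsort(values, kind="stable")  # indices sorted by ascending value
    if order == "desc":
        return ranked[-n:][::-1]                # largest value first
    return ranked[:n]                           # smallest value first


scores = [0.1, 0.5, 0.3, 0.2, 0.7]
largest_two = top_n_indices(scores, 2)
smallest_two = top_n_indices(scores, 2, order="asc")
```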

pdmlabs.evaluation.vus.utils.utility.branch2num(branch, init_root=0)
pdmlabs.evaluation.vus.utils.utility.c_factor(n)
pdmlabs.evaluation.vus.utils.utility.check_detector(detector)

Checks if the fit and decision_function methods exist for the given detector.

Parameters:

detector (pyod.models) – Detector instance for which the check is performed.
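A check of this kind can be sketched with plain attribute lookups. The helper below is hypothetical and returns a boolean; whether check_detector itself returns a value or raises an error is not stated here, so that behaviour is an assumption of the example.

```python
def has_detector_interface(detector):
    """Return True if `detector` exposes callable fit and decision_function.

    Hypothetical helper for illustration; not the module's check_detector.
    """
    required = ("fit", "decision_function")
    return all(callable(getattr(detector, name, None)) for name in required)


class CompleteDetector:
    """Toy detector that provides both required methods."""

    def fit(self, X):
        return self

    def decision_function(self, X):
        return [0.0 for _ in X]


class IncompleteDetector:
    """Toy detector that lacks decision_function."""

    def fit(self, X):
        return self


ok = has_detector_interface(CompleteDetector())
missing = has_detector_interface(IncompleteDetector())
```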

pdmlabs.evaluation.vus.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)

Check if an input is within the defined range.

Parameters:
  • param (int, float) – The input parameter to check.

  • low (int, float) – The lower bound of the range.

  • high (int, float) – The upper bound of the range.

  • param_name (str, optional (default='')) – The name of the parameter.

  • include_left (bool, optional (default=False)) – Whether to include the lower bound (lower bound <=).

  • include_right (bool, optional (default=False)) – Whether to include the upper bound (<= upper bound).

Returns:

within_range – Whether the parameter is within the range of (low, high)

Return type:

bool or raise errors
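The effect of include_left and include_right on the comparison can be sketched as below. The helper in_range is hypothetical and simply returns a boolean; the conditions under which check_parameter raises instead are not described here, so only an obviously invalid pair of bounds raises in this example.

```python
def in_range(param, low, high, include_left=False, include_right=False):
    """Return True if param lies in the interval defined by low and high.

    Hypothetical helper for illustration; not the module's check_parameter.
    """
    if low >= high:
        # Assumption: an empty or inverted interval is a caller error.
        raise ValueError("low must be smaller than high")
    left_ok = param >= low if include_left else param > low
    right_ok = param <= high if include_right else param < high
    return left_ok and right_ok


open_interval = in_range(0, 0, 1)                       # 0 is not in (0, 1)
closed_left = in_range(0, 0, 1, include_left=True)      # 0 is in [0, 1)
```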

pdmlabs.evaluation.vus.utils.utility.create_tree(X, featureDistrib, sample_size, max_height)

Creates a DiFF tree using a sample of size sample_size of the original data.

Parameters:
  • X (nD array) – nD array with the observations. Dimensions should be (n_obs, n_features).

  • sample_size (int) – Size of the sample from which a DiFF tree is built.

  • max_height (int) – Maximum height of the tree.

Return type:

a DiFF tree

pdmlabs.evaluation.vus.utils.utility.dist2set(x, X)
pdmlabs.evaluation.vus.utils.utility.gen_graph(branches, g=None, init_root=0, pre='')
pdmlabs.evaluation.vus.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)

Randomly draw feature indices. Internal use only. Modified from sklearn/ensemble/bagging.py.

Parameters:
  • random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.

  • bootstrap_features (bool) – Specifies whether to bootstrap index generation

  • n_features (int) – Specifies the population size when generating indices

  • min_features (int) – Lower limit for number of features to randomly sample

  • max_features (int) – Upper limit for number of features to randomly sample

Returns:

feature_indices – Indices for features to bag

Return type:

numpy array, shape (n_samples,)

pdmlabs.evaluation.vus.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)

Draw randomly sampled indices. Internal use only. See sklearn/ensemble/bagging.py.

Parameters:
  • random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.

  • bootstrap (bool) – Specifies whether to bootstrap index generation

  • n_population (int) – Specifies the population size when generating indices

  • n_samples (int) – Specifies number of samples to draw

Returns:

indices – randomly drawn indices

Return type:

numpy array, shape (n_samples,)
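The difference between bootstrapped and non-bootstrapped index generation can be sketched with a NumPy RandomState. Both function names below are hypothetical, and treating min_features and max_features as inclusive bounds on the number of drawn features is an assumption of this example rather than documented behaviour.

```python
import numpy as np


def draw_indices(random_state, bootstrap, n_population, n_samples):
    """Draw n_samples indices from range(n_population).

    Hypothetical helper for illustration; not the module's generate_indices.
    """
    if bootstrap:
        # Sampling with replacement: an index may appear more than once.
        return random_state.randint(0, n_population, n_samples)
    # Sampling without replacement: every index appears at most once.
    return random_state.choice(n_population, size=n_samples, replace=False)


def draw_feature_indices(random_state, bootstrap_features, n_features,
                         min_features, max_features):
    """Draw a random number of feature indices (illustrative sketch)."""
    # Assumption: the number of features is uniform on [min_features, max_features].
    n_drawn = random_state.randint(min_features, max_features + 1)
    return draw_indices(random_state, bootstrap_features, n_features, n_drawn)


rng = np.random.RandomState(0)
with_replacement = draw_indices(rng, True, 10, 5)
without_replacement = draw_indices(rng, False, 10, 5)
features = draw_feature_indices(rng, False, 8, 2, 4)
```

Passing the same seeded RandomState makes the draws reproducible, which is the usual reason such helpers take the generator as an argument.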

pdmlabs.evaluation.vus.utils.utility.getSplit(X)

Randomly selects a split value from a set of scalar data 'X'. Returns the split value.

Parameters:

X (array) – Array of scalar values

Returns:

split value

Return type:

float
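The description does not say how the split value is drawn. One common choice in isolation-forest-style trees is to sample uniformly between the minimum and maximum of the data; the sketch below assumes that rule, and the name random_split and its rng argument are hypothetical.

```python
import numpy as np


def random_split(values, rng=None):
    """Draw a split value uniformly between min(values) and max(values).

    Hypothetical helper; the uniform rule is an assumption, not getSplit's
    documented behaviour.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    return float(rng.uniform(values.min(), values.max()))


split = random_split([1.0, 4.0, 2.5], rng=np.random.default_rng(0))
```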

pdmlabs.evaluation.vus.utils.utility.get_diff_elements(li1, li2)

Get the elements in li1 but not li2, and vice versa.

Parameters:
  • li1 (list or numpy array) – Input list 1.

  • li2 (list or numpy array) – Input list 2.

Returns:

difference – The difference between li1 and li2.

Return type:

list

pdmlabs.evaluation.vus.utils.utility.get_intersection(lst1, lst2)

Get the overlap between two lists.

Parameters:
  • lst1 (list or numpy array) – Input list 1.

  • lst2 (list or numpy array) – Input list 2.

Returns:

intersection – The elements common to lst1 and lst2.

Return type:

list

pdmlabs.evaluation.vus.utils.utility.get_label_n(y, y_pred, n=None)

Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

Parameters:
  • y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).

  • y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.

  • n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.

Returns:

labels – binary labels 0: normal points and 1: outliers

Return type:

numpy array of shape (n_samples,)

Examples

>>> from pdmlabs.evaluation.vus.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])

pdmlabs.evaluation.vus.utils.utility.get_list_anomaly(labels)
pdmlabs.evaluation.vus.utils.utility.get_list_diff(li1, li2)

Get the elements in li1 but not li2, that is, li1 - li2.

Parameters:
  • li1 (list or numpy array) – Input list 1.

  • li2 (list or numpy array) – Input list 2.

Returns:

difference – The difference between li1 and li2.

Return type:

list
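The three list helpers (get_list_diff, get_diff_elements and get_intersection) correspond to set difference, symmetric difference and intersection. The sketch below uses hypothetical names and makes two assumptions that may differ from the module: the order of the first list is preserved, and duplicates in the inputs are kept.

```python
def list_diff(li1, li2):
    """Elements of li1 that do not appear in li2 (order of li1 preserved)."""
    exclude = set(li2)
    return [x for x in li1 if x not in exclude]


def diff_elements(li1, li2):
    """Elements that appear in exactly one of the two lists."""
    return list_diff(li1, li2) + list_diff(li2, li1)


def intersection(lst1, lst2):
    """Elements of lst1 that also appear in lst2 (order of lst1 preserved)."""
    keep = set(lst2)
    return [x for x in lst1 if x in keep]


a = [1, 2, 3, 4]
b = [3, 4, 5]
only_in_a = list_diff(a, b)
in_one_only = diff_elements(a, b)
in_both = intersection(a, b)
```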

pdmlabs.evaluation.vus.utils.utility.invert_order(scores, method='multiplication')

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score orders could differ.

Parameters:
  • scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted.

  • method (str, optional (default='multiplication')) – Method used for order inversion. Valid methods are:

      • 'multiplication': multiply by -1

      • 'subtraction': max(scores) - scores

Returns:

inverted_scores – The inverted list

Return type:

numpy array of shape (n_samples,)

Examples

>>> from pdmlabs.evaluation.vus.utils.utility import invert_order
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])

pdmlabs.evaluation.vus.utils.utility.pairwise_distances_no_broadcast(X, Y)

Utility function to calculate the row-wise Euclidean distance between two matrices. Unlike a pairwise calculation, this function does not broadcast. For instance, if X and Y are both (4,3) matrices, the function returns a distance vector with shape (4,) instead of a (4,4) matrix.

Parameters:
  • X (array of shape (n_samples, n_features)) – First input samples

  • Y (array of shape (n_samples, n_features)) – Second input samples

Returns:

distance – Row-wise euclidean distance of X and Y

Return type:

array of shape (n_samples,)
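The row-wise (non-broadcast) distance can be written directly in NumPy. The function name rowwise_euclidean is hypothetical and the shape check is an assumption of this sketch; it is meant only to show why the result has shape (n_samples,) rather than (n_samples, n_samples).

```python
import numpy as np


def rowwise_euclidean(X, Y):
    """Euclidean distance between row i of X and row i of Y, for every i.

    Hypothetical helper for illustration; not the module's implementation.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape")
    # Sum squared differences along the feature axis only, so rows are
    # compared one-to-one instead of all-against-all.
    return np.sqrt(np.sum((X - Y) ** 2, axis=1))


X = np.zeros((4, 3))
Y = np.ones((4, 3))
distances = rowwise_euclidean(X, Y)   # one distance per row
```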

pdmlabs.evaluation.vus.utils.utility.precision_n_scores(y, y_pred, n=None)

Utility function to calculate precision @ rank n.

Parameters:
  • y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).

  • y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.

  • n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.

Returns:

precision_at_rank_n – Precision at rank n score.

Return type:

float
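Precision at rank n is the fraction of true outliers among the n highest-scoring samples. The following sketch uses a hypothetical name, precision_at_n, and ranks by sorting; the library may handle tied scores differently, so treat this as an illustration of the metric rather than a drop-in equivalent.

```python
import numpy as np


def precision_at_n(y, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples.

    Hypothetical helper for illustration; not the module's precision_n_scores.
    """
    y = np.asarray(y)
    scores = np.asarray(scores, dtype=float)
    if n is None:
        n = int(y.sum())          # infer n from the number of labelled outliers
    if n <= 0:
        raise ValueError("n must be positive")
    top = np.argsort(-scores, kind="stable")[:n]   # indices of the n largest scores
    return float(y[top].sum() / n)


y = [0, 1, 1, 0, 0]
y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
p_at_n = precision_at_n(y, y_pred)   # n is inferred as 2 here
```

In this example the two highest scores belong to samples 4 and 1, of which only sample 1 is a labelled outlier.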

pdmlabs.evaluation.vus.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)

Turn raw outlier scores into binary labels (0 or 1).

Parameters:
  • pred_scores (list or numpy array of shape (n_samples,)) – Raw outlier scores. Outliers are assumed to have larger values.

  • outliers_fraction (float in (0,1)) – Fraction of outliers.

Returns:

outlier_labels – For each observation, whether it should be considered an outlier according to the fitted model: 0 for inliers, 1 for outliers.

Return type:

numpy array of shape (n_samples,)
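One way to turn scores into labels for a given outlier fraction is to threshold at the corresponding percentile of the scores. The sketch below assumes that approach; the name scores_to_labels and the strict "greater than" comparison are choices made for this example.

```python
import numpy as np


def scores_to_labels(pred_scores, outliers_fraction=0.1):
    """Label the highest-scoring fraction of samples as outliers (1).

    Hypothetical helper for illustration; not the module's score_to_label.
    """
    scores = np.asarray(pred_scores, dtype=float)
    # Scores above the (1 - outliers_fraction) percentile are flagged.
    threshold = np.percentile(scores, 100 * (1 - outliers_fraction))
    return (scores > threshold).astype(int)


scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = scores_to_labels(scores, outliers_fraction=0.1)
```

With ten evenly spaced scores and a fraction of 0.1, only the largest score exceeds the threshold.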

pdmlabs.evaluation.vus.utils.utility.similarityScore(S, node, alpha)

Given a set of instances S falling into node and a value alpha >= 0, returns for every element x in S the weighted similarity score between x and the centroid M of S (node.M).

Parameters:
  • S (array of instances) – Array of instances that fall into a node

  • node (a DiFF tree node) – S is the set of instances "falling" into the node

  • alpha (float) – alpha is the distance scaling hyper-parameter

Returns:

the array of similarity values between the instances in S and the mean of training instances falling in node

Return type:

array

pdmlabs.evaluation.vus.utils.utility.standardizer(X, X_t=None, keep_scalar=False)

Conduct Z-normalization on data so that the input samples have zero mean and unit variance.

Parameters:
  • X (numpy array of shape (n_samples, n_features)) – The training samples.

  • X_t (numpy array of shape (n_samples_new, n_features), optional (default=None)) – The data to be converted.

  • keep_scalar (bool, optional (default=False)) – The flag to indicate whether to return the scaler.

Returns:

  • X_norm (numpy array of shape (n_samples, n_features)) – X after the Z-score normalization

  • X_t_norm (numpy array of shape (n_samples_new, n_features)) – X_t after the Z-score normalization

  • scalar (sklearn scaler object) – The scaler used in conversion
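The key point of Z-normalization with a separate X_t is that the mean and standard deviation are estimated on X only and then reused for X_t. The NumPy-only sketch below illustrates this; the name z_normalize and the handling of constant features are assumptions, and it does not return a scaler object as the documented function can.

```python
import numpy as np


def z_normalize(X, X_t=None):
    """Standardize X column-wise; optionally apply the same transform to X_t.

    Hypothetical helper for illustration; not the module's standardizer.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # assumption: leave constant features unscaled
    X_norm = (X - mean) / std
    if X_t is None:
        return X_norm
    # Reuse the statistics estimated on X so both sets share one scale.
    X_t_norm = (np.asarray(X_t, dtype=float) - mean) / std
    return X_norm, X_t_norm


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_new = [[2.0, 20.0]]
X_norm, X_new_norm = z_normalize(X_train, X_new)
```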

pdmlabs.evaluation.vus.utils.utility.walk_tree(forest, node, treeIdx, obsIdx, X, featureDistrib, depth=0, alpha=0.01)

Recursive function that walks a tree from an already fitted forest to compute the path length of the new observations.

Parameters:
  • forest (DiFF_RF) – A fitted forest of DiFF trees

  • node (DiFF Tree node) – the current node

  • treeIdx (int) – index of the tree that is being walked.

  • obsIdx (array) – 1D array of length n_obs. 1/0 if the obs has reached / has not reached the node.

  • X (nD array) – array of observations/instances.

  • depth (int) – current depth.

Return type:

None

pdmlabs.evaluation.vus.utils.utility.weightFeature(s, nbins)

Given a list of values corresponding to a feature dimension, returns a weight (in [0,1]) that is one minus the normalized empirical entropy, a way to characterize the importance of the feature dimension.

Parameters:
  • s (array) – list of scalar values corresponding to a feature dimension

  • nbins (int) – the number of bins used to discretize the feature dimension using a histogram.

Returns:

the importance weight for feature s.

Return type:

float
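The weight described above, one minus the normalized empirical entropy, can be sketched as follows. The name feature_weight is hypothetical, and normalizing by log2(nbins), the entropy of a uniform histogram, is an assumption about what "normalized" means here; the module may also apply floors or special cases that this sketch omits.

```python
import numpy as np


def feature_weight(s, nbins):
    """One minus the normalized entropy of a feature's histogram.

    Hypothetical helper for illustration; not the module's weightFeature.
    Requires nbins >= 2 so that the normalizing constant is non-zero.
    """
    counts, _ = np.histogram(np.asarray(s, dtype=float), bins=nbins)
    p = counts / counts.sum()
    p = p[p > 0]                                # ignore empty bins
    entropy = np.sum(p * np.log2(1.0 / p))      # empirical entropy in bits
    max_entropy = np.log2(nbins)                # entropy of a uniform histogram
    weight = 1.0 - entropy / max_entropy
    return float(min(1.0, max(0.0, weight)))    # guard against rounding error


spread_out = feature_weight(np.arange(8), nbins=4)                  # even spread
concentrated = feature_weight([0, 0, 0, 0, 0, 0, 0, 1], nbins=4)    # mostly one bin
```

A feature whose values spread evenly over the bins gets a weight near 0, while a feature concentrated in a few bins gets a weight closer to 1.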