pdmlabs.evaluation.vus.utils.utility

A set of utility functions to support outlier detection.

Functions

`EE`(hist)	Given a list of positive values as a histogram drawn from any information source, returns the empirical entropy of its discrete probability function.
`EuclideanDist`(x, y)
`all_branches`(node[, current, branches])
`argmaxn`(value_list, n[, order])	Return the indices of the top n elements in the list if order is set to 'desc'; otherwise return the indices of the n smallest ones.
`branch2num`(branch[, init_root])
`c_factor`(n)
`check_detector`(detector)	Checks whether the fit and decision_function methods exist for the given detector.
`check_parameter`(param[, low, high, ...])	Check if an input is within the defined range.
`create_tree`(X, featureDistrib, sample_size, ...)	Creates a DiFF tree using a sample of size sample_size of the original data.
`dist2set`(x, X)
`gen_graph`(branches[, g, init_root, pre])
`generate_bagging_indices`(random_state, ...)	Randomly draw feature indices. Internal use only.
`generate_indices`(random_state, bootstrap, ...)	Draw randomly sampled indices. Internal use only.
`getSplit`(X)	Randomly selects a split value from set of scalar data 'X'.
`get_diff_elements`(li1, li2)	Get the elements in li1 but not in li2, and vice versa.
`get_intersection`(lst1, lst2)	Get the elements common to two lists.
`get_label_n`(y, y_pred[, n])	Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.
`get_list_anomaly`(labels)
`get_list_diff`(li1, li2)	get the elements in li1 but not li2.
`invert_order`(scores[, method])	Invert the order of a list of values.
`pairwise_distances_no_broadcast`(X, Y)	Utility function to calculate the row-wise Euclidean distance of two matrices.
`precision_n_scores`(y, y_pred[, n])	Utility function to calculate precision @ rank n.
`score_to_label`(pred_scores[, outliers_fraction])	Turn raw outlier scores into binary labels (0 or 1).
`similarityScore`(S, node, alpha)	Given a set of instances S falling into node and a value alpha >= 0, returns for every element x in S the weighted similarity score between x and the centroid M of S (node.M).
`standardizer`(X[, X_t, keep_scalar])	Conduct Z-normalization on data so that the input samples have zero mean and unit variance.
`walk_tree`(forest, node, treeIdx, obsIdx, X, ...)	Recursive function that walks a tree from an already fitted forest to compute the path length of the new observations.
`weightFeature`(s, nbins)	Given a list of values corresponding to a feature dimension, returns a weight (in [0,1]) that is one minus the normalized empirical entropy, a way to characterize the importance of the feature dimension.

pdmlabs.evaluation.vus.utils.utility.EE(hist)#

Given a list of positive values as a histogram drawn from any information source, returns the empirical entropy of its discrete probability function.

Parameters: hist (array) – histogram
Returns: empirical entropy estimated from the histogram
Return type: float
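
As an illustration of what an empirical entropy of a histogram looks like, here is a minimal standard-library sketch. It is not the module's implementation: the name `empirical_entropy`, the base-2 logarithm, and the handling of an empty histogram are all assumptions made for the example.

```python
import math


def empirical_entropy(hist):
    """Shannon entropy (in bits) of the distribution obtained by normalizing `hist`.

    Hypothetical helper for illustration; not the library's EE function.
    """
    total = sum(hist)
    if total <= 0:
        return 0.0  # assumption: an empty histogram carries no information
    # Empty bins are skipped, following the convention 0 * log(0) = 0.
    probs = [count / total for count in hist if count > 0]
    return sum(-p * math.log2(p) for p in probs)


uniform = empirical_entropy([5, 5, 5, 5])   # four equally likely bins: 2.0 bits
peaked = empirical_entropy([10, 0, 0, 0])   # all mass in one bin: zero entropy
```

A flat histogram gives the largest entropy for its number of bins, while a histogram concentrated in one bin gives zero.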

pdmlabs.evaluation.vus.utils.utility.EuclideanDist(x, y)#

pdmlabs.evaluation.vus.utils.utility.all_branches(node, current=[], branches=None)#

pdmlabs.evaluation.vus.utils.utility.argmaxn(value_list, n, order='desc')#

Return the indices of the top n elements in the list if order is set to ‘desc’; otherwise return the indices of the n smallest ones.

Parameters:

value_list (list, array, numpy array of shape (n_samples,)) – A list containing all values.
n (int) – The number of elements to select.
order (str, optional (default=’desc’)) – The sort order, one of {‘desc’, ‘asc’}: ‘desc’ for descending, ‘asc’ for ascending.

Returns: index_list – The indices of the top n elements.
Return type: numpy array of shape (n,)
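
The selection described above can be sketched in plain Python as follows. The helper name `top_n_indices` is hypothetical, it returns a list rather than a numpy array, and the order of the indices inside the result may differ from what the real function returns.

```python
def top_n_indices(values, n, order="desc"):
    """Indices of the n largest ('desc') or n smallest ('asc') values.

    Hypothetical stand-in for illustration; not the library's argmaxn.
    """
    if order not in ("desc", "asc"):
        raise ValueError("order must be 'desc' or 'asc'")
    if not 0 < n <= len(values):
        raise ValueError("n must be between 1 and len(values)")
    # Rank positions by their value, largest first when order is 'desc'.
    ranked = sorted(range(len(values)), key=lambda i: values[i],
                    reverse=(order == "desc"))
    return ranked[:n]


scores = [0.1, 0.9, 0.4, 0.7]
largest_two = top_n_indices(scores, 2)                 # positions of 0.9 and 0.7
smallest_two = top_n_indices(scores, 2, order="asc")   # positions of 0.1 and 0.4
```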

pdmlabs.evaluation.vus.utils.utility.branch2num(branch, init_root=0)#

pdmlabs.evaluation.vus.utils.utility.c_factor(n)#

pdmlabs.evaluation.vus.utils.utility.check_detector(detector)#

Checks whether the fit and decision_function methods exist for the given detector.

Parameters: detector (pyod.models) – Detector instance for which the check is performed.

pdmlabs.evaluation.vus.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)#

Check if an input is within the defined range.

Parameters:

param (int, float) – The input parameter to check.
low (int, float) – The lower bound of the range.
high (int, float) – The upper bound of the range.
param_name (str, optional (default=’’)) – The name of the parameter.
include_left (bool, optional (default=False)) – Whether to include the lower bound (lower bound <=).
include_right (bool, optional (default=False)) – Whether to include the upper bound (<= upper bound).

Returns: within_range – Whether the parameter is within the range (low, high)
Return type: bool, or raises an error
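
The effect of include_left and include_right on the boundary values can be sketched as below. This is an illustrative reimplementation under assumptions: the name `in_range` is hypothetical, and raising `ValueError` for an out-of-range value is a guess at the "raise errors" behavior rather than something the documentation confirms.

```python
def in_range(param, low, high, param_name="", include_left=False,
             include_right=False):
    """Return True if param lies in the interval, otherwise raise ValueError.

    Hypothetical sketch; the exception type is an assumption.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    lower_ok = param >= low if include_left else param > low
    upper_ok = param <= high if include_right else param < high
    if not (lower_ok and upper_ok):
        left = "[" if include_left else "("
        right = "]" if include_right else ")"
        name = param_name or "parameter"
        raise ValueError(f"{name}={param} is outside {left}{low}, {high}{right}")
    return True


in_range(0.1, 0, 0.5, param_name="contamination")   # inside the open interval
in_range(0, 0, 0.5, include_left=True)               # lower bound accepted
```

With the defaults both bounds are excluded, so a value equal to low or high is rejected unless the matching flag is set.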

pdmlabs.evaluation.vus.utils.utility.create_tree(X, featureDistrib, sample_size, max_height)#

Creates a DiFF tree using a sample of size sample_size of the original data.

Parameters:

X (nD array) – nD array with the observations. Dimensions should be (n_obs, n_features).
sample_size (int) – Size of the sample from which a DiFF tree is built.
max_height (int) – Maximum height of the tree.

Return type:

a DiFF tree

pdmlabs.evaluation.vus.utils.utility.dist2set(x, X)#

pdmlabs.evaluation.vus.utils.utility.gen_graph(branches, g=None, init_root=0, pre='')#

pdmlabs.evaluation.vus.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)#

Randomly draw feature indices. Internal use only. Modified from sklearn/ensemble/bagging.py.

Parameters:

random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.
bootstrap_features (bool) – Specifies whether to bootstrap index generation
n_features (int) – Specifies the population size when generating indices
min_features (int) – Lower limit for number of features to randomly sample
max_features (int) – Upper limit for number of features to randomly sample

Returns:

feature_indices – Indices for features to bag

Return type:

numpy array, shape (n_samples,)

pdmlabs.evaluation.vus.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)#

Draw randomly sampled indices. Internal use only. See sklearn/ensemble/bagging.py.

Parameters:

random_state (RandomState) – A random number generator instance to define the state of the random permutations generator.
bootstrap (bool) – Specifies whether to bootstrap index generation
n_population (int) – Specifies the population size when generating indices
n_samples (int) – Specifies number of samples to draw

Returns:

indices – randomly drawn indices

Return type:

numpy array, shape (n_samples,)
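
To make the role of the bootstrap flag concrete, the sketch below draws indices with replacement when bootstrapping and without replacement otherwise. It is only an illustration: `draw_indices` is a hypothetical name, and Python's `random.Random` stands in for the NumPy RandomState that the real function expects.

```python
import random


def draw_indices(rng, bootstrap, n_population, n_samples):
    """Draw n_samples indices from range(n_population).

    Hypothetical sketch; `rng` is a random.Random, not a numpy RandomState.
    """
    if bootstrap:
        # Sampling with replacement: the same index may appear several times.
        return [rng.randrange(n_population) for _ in range(n_samples)]
    # Sampling without replacement: every index appears at most once.
    return rng.sample(range(n_population), n_samples)


rng = random.Random(0)  # fixed seed so the draw is reproducible
with_repeats = draw_indices(rng, True, n_population=10, n_samples=5)
without_repeats = draw_indices(rng, False, n_population=10, n_samples=5)
```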

pdmlabs.evaluation.vus.utils.utility.getSplit(X)#

Randomly selects a split value from a set of scalar data ‘X’. Returns the split value.

Parameters: X (array) – Array of scalar values
Returns: split value
Return type: float

pdmlabs.evaluation.vus.utils.utility.get_diff_elements(li1, li2)#

Get the elements in li1 but not in li2, and vice versa.

Parameters:

li1 (list or numpy array) – Input list 1.
li2 (list or numpy array) – Input list 2.

Returns: difference – The difference between li1 and li2.
Return type: list
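
"In li1 but not li2, and vice versa" is a symmetric difference. A minimal sketch, assuming hashable elements, is shown below; `symmetric_difference` is a hypothetical name, and unlike a pure set-based version it keeps the input order, which the real function may not do.

```python
def symmetric_difference(li1, li2):
    """Elements that appear in exactly one of the two lists.

    Hypothetical sketch; element order in the real function may differ.
    """
    set1, set2 = set(li1), set(li2)
    only_in_first = [x for x in li1 if x not in set2]
    only_in_second = [x for x in li2 if x not in set1]
    return only_in_first + only_in_second


diff = symmetric_difference([1, 2, 3, 4], [3, 4, 5])  # 1 and 2 from li1, 5 from li2
```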

pdmlabs.evaluation.vus.utils.utility.get_intersection(lst1, lst2)#

Get the elements common to two lists.

Parameters:

lst1 (list or numpy array) – Input list 1.
lst2 (list or numpy array) – Input list 2.

Returns: intersection – The elements common to lst1 and lst2.
Return type: list
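
The overlap of two lists can be sketched as follows, assuming hashable elements. The name `intersection` is hypothetical, and the sketch keeps the order of the first list, which may not match the real function.

```python
def intersection(lst1, lst2):
    """Elements of lst1 that also occur in lst2, in lst1's order.

    Hypothetical sketch for illustration.
    """
    members = set(lst2)  # set lookup keeps the scan linear in len(lst1)
    return [x for x in lst1 if x in members]


common = intersection([1, 2, 3, 4], [3, 4, 5])  # 3 and 4 occur in both lists
```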

pdmlabs.evaluation.vus.utils.utility.get_label_n(y, y_pred, n=None)#

Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

Parameters:

y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.

Returns: labels – Binary labels (0: normal points, 1: outliers)
Return type: numpy array of shape (n_samples,)

Examples

>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])
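
The labeling rule behind this example can be sketched in plain Python: rank the scores, then mark the n highest as outliers, with n taken from the ground truth when it is not given. The name `label_top_n` is hypothetical, the result is a list rather than a numpy array, and tied scores may be resolved differently by the real function.

```python
def label_top_n(y, y_pred, n=None):
    """Assign 1 to the n highest scores in y_pred and 0 to the rest.

    Hypothetical sketch; tie handling is an assumption.
    """
    if len(y) != len(y_pred):
        raise ValueError("y and y_pred must have the same length")
    if n is None:
        n = sum(y)  # number of true outliers in the binary ground truth
    # Positions ordered from the highest score to the lowest.
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    labels = [0] * len(y_pred)
    for i in ranked[:n]:
        labels[i] = 1
    return labels


labels = label_top_n([0, 1, 1, 0, 0], [0.1, 0.5, 0.3, 0.2, 0.7])
# the two highest scores, 0.7 and 0.5, are marked as outliers
```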

pdmlabs.evaluation.vus.utils.utility.get_list_anomaly(labels)#

pdmlabs.evaluation.vus.utils.utility.get_list_diff(li1, li2)#

Get the elements in li1 but not in li2 (li1 - li2).

Parameters:

li1 (list or numpy array) – Input list 1.
li2 (list or numpy array) – Input list 2.

Returns: difference – The difference between li1 and li2.
Return type: list

pdmlabs.evaluation.vus.utils.utility.invert_order(scores, method='multiplication')#

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score order could be different.

Parameters:

scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted
method (str, optional (default=’multiplication’)) – Method used for order inversion. Valid methods are ‘multiplication’ (multiply by -1) and ‘subtraction’ (max(scores) - scores).

Returns: inverted_scores – The inverted list
Return type: numpy array of shape (n_samples,)

Examples

>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])
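
The two inversion methods can be written out in a few lines. This is a plain-Python sketch with a hypothetical name (`invert`) that returns a list instead of a numpy array. Integer scores are used so the arithmetic is exact.

```python
def invert(scores, method="multiplication"):
    """Reverse the ranking of scores using one of the two documented methods.

    Hypothetical sketch for illustration.
    """
    if method == "multiplication":
        return [-s for s in scores]
    if method == "subtraction":
        top = max(scores)
        return [top - s for s in scores]
    raise ValueError("method must be 'multiplication' or 'subtraction'")


negated = invert([1, 3, 2])                        # signs flipped
shifted = invert([1, 3, 2], method="subtraction")  # distance from the maximum
```

Both methods reverse the ranking. The subtraction variant additionally keeps the result non-negative, which can matter when the inverted scores are combined with scores that are assumed to be non-negative.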

pdmlabs.evaluation.vus.utils.utility.pairwise_distances_no_broadcast(X, Y)#

Utility function to calculate the row-wise Euclidean distance of two matrices. Unlike a pairwise calculation, this function does not broadcast. For instance, if X and Y are both (4,3) matrices, the function returns a distance vector with shape (4,) instead of (4,4).

Parameters:

X (array of shape (n_samples, n_features)) – First input samples
Y (array of shape (n_samples, n_features)) – Second input samples

Returns: distance – Row-wise Euclidean distance of X and Y
Return type: array of shape (n_samples,)
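
The difference between a row-wise and a pairwise distance can be illustrated with the sketch below, which pairs row i of X only with row i of Y. The name `rowwise_euclidean` is hypothetical, and nested lists stand in for the arrays the real function works on (`math.dist` requires Python 3.8 or later).

```python
import math


def rowwise_euclidean(X, Y):
    """Euclidean distance between X[i] and Y[i] for every row i.

    Hypothetical sketch; returns a list of length n_samples.
    """
    if len(X) != len(Y) or any(len(x) != len(y) for x, y in zip(X, Y)):
        raise ValueError("X and Y must have the same shape")
    return [math.dist(x, y) for x, y in zip(X, Y)]


X = [[0.0, 0.0], [1.0, 1.0]]
Y = [[3.0, 4.0], [1.0, 1.0]]
distances = rowwise_euclidean(X, Y)  # one distance per row, not a 2x2 matrix
```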

pdmlabs.evaluation.vus.utils.utility.precision_n_scores(y, y_pred, n=None)#

Utility function to calculate precision @ rank n.

Parameters:

y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.

Returns: precision_at_rank_n – Precision at rank n score.
Return type: float
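
Precision at rank n is the fraction of true outliers among the n highest-scored samples. The sketch below computes it directly from that definition. `precision_at_n` is a hypothetical name, and the real function may treat tied scores differently.

```python
def precision_at_n(y, y_pred, n=None):
    """Fraction of true outliers among the n highest scores.

    Hypothetical sketch; tie handling is an assumption.
    """
    if len(y) != len(y_pred):
        raise ValueError("y and y_pred must have the same length")
    if n is None:
        n = sum(y)  # default to the number of true outliers
    if n <= 0:
        raise ValueError("n must be positive")
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    hits = sum(y[i] for i in ranked[:n])
    return hits / n


y = [0, 1, 1, 0, 0]
y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
score = precision_at_n(y, y_pred)  # top 2 scores contain 1 true outlier: 0.5
```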

pdmlabs.evaluation.vus.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)#

Turn raw outlier scores into binary labels (0 or 1).

Parameters:

pred_scores (list or numpy array of shape (n_samples,)) – Raw outlier scores. Outliers are assumed to have larger values.
outliers_fraction (float in (0,1)) – Fraction of outliers.

Returns: outlier_labels – For each observation, tells whether or not it should be considered an outlier (0: inlier, 1: outlier).
Return type: numpy array of shape (n_samples,)

pdmlabs.evaluation.vus.utils.utility.similarityScore(S, node, alpha)#

Given a set of instances S falling into node and a value alpha >= 0, returns for every element x in S the weighted similarity score between x and the centroid M of S (node.M).

Parameters:

S (array of instances) – Array of instances that fall into a node
node (a DiFF tree node) – S is the set of instances “falling” into the node
alpha (float) – alpha is the distance scaling hyper-parameter

Returns:

the array of similarity values between the instances in S and the mean of training instances falling in node

Return type:

array

pdmlabs.evaluation.vus.utils.utility.standardizer(X, X_t=None, keep_scalar=False)#

Conduct Z-normalization on data so that the input samples have zero mean and unit variance.

Parameters:

X (numpy array of shape (n_samples, n_features)) – The training samples
X_t (numpy array of shape (n_samples_new, n_features), optional (default=None)) – The data to be converted
keep_scalar (bool, optional (default=False)) – The flag to indicate whether to return the scaler

Returns:

X_norm (numpy array of shape (n_samples, n_features)) – X after the Z-score normalization
X_t_norm (numpy array of shape (n_samples_new, n_features)) – X_t after the Z-score normalization
scalar (sklearn scaler object) – The scaler used in the conversion
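
The key point of the X / X_t pair is that the statistics are estimated on X only and then reused to transform X_t. A minimal standard-library sketch of that idea follows. The helper names are hypothetical, nested lists stand in for numpy arrays, and the use of the population standard deviation with a fallback of 1.0 for constant columns is an assumption modeled on how scikit-learn's StandardScaler behaves.

```python
import statistics


def fit_column_stats(X):
    """Per-column mean and population standard deviation of X (hypothetical helper)."""
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(X, means, stds):
    """Z-score every row of X with previously fitted statistics (hypothetical helper)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X = [[1.0, 10.0], [3.0, 10.0]]   # training samples
X_t = [[5.0, 10.0]]              # new samples, scaled with the statistics of X
means, stds = fit_column_stats(X)
X_norm = apply_zscore(X, means, stds)
X_t_norm = apply_zscore(X_t, means, stds)
```

Scaling X_t with its own mean and deviation instead would put training and new data on different scales.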

pdmlabs.evaluation.vus.utils.utility.walk_tree(forest, node, treeIdx, obsIdx, X, featureDistrib, depth=0, alpha=0.01)#

Recursive function that walks a tree from an already fitted forest to compute the path length of the new observations.

Parameters:

forest (DiFF_RF) – A fitted forest of DiFF trees
node (DiFF Tree node) – the current node
treeIdx (int) – index of the tree that is being walked.
obsIdx (array) – 1D array of length n_obs. 1/0 if the obs has reached / has not reached the node.
X (nD array) – array of observations/instances.
depth (int) – current depth.

Return type:

None

pdmlabs.evaluation.vus.utils.utility.weightFeature(s, nbins)#

Given a list of values corresponding to a feature dimension, returns a weight (in [0,1]) that is one minus the normalized empirical entropy, a way to characterize the importance of the feature dimension.

Parameters:

s (array) – list of scalar values corresponding to a feature dimension
nbins (int) – the number of bins used to discretize the feature dimension using a histogram.

Returns:

the importance weight for feature s.

Return type:

float
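
"One minus the normalized empirical entropy" can be worked through with the sketch below: bin the values, compute the entropy of the bin frequencies, divide by the maximum possible entropy for that number of bins, and subtract from one. This is an illustration under assumptions, not the module's code: the name `feature_weight`, equal-width binning, the base-2 logarithm, and rejecting constant features are all choices made for the example, and the real function may clamp or special-case its result.

```python
import math


def feature_weight(values, nbins):
    """1 - H / log2(nbins), where H is the entropy of an equal-width histogram.

    Hypothetical sketch; degenerate inputs are rejected here by assumption.
    """
    lo, hi = min(values), max(values)
    if nbins < 2 or hi == lo:
        raise ValueError("need nbins >= 2 and a non-constant feature")
    width = (hi - lo) / nbins
    hist = [0] * nbins
    for v in values:
        # The maximum value would land one past the end, so clamp it to the last bin.
        idx = min(int((v - lo) / width), nbins - 1)
        hist[idx] += 1
    total = len(values)
    entropy = sum(-(c / total) * math.log2(c / total) for c in hist if c > 0)
    return 1.0 - entropy / math.log2(nbins)


flat = feature_weight([0, 1, 2, 3], nbins=4)                   # evenly spread values
skewed = feature_weight([0, 0, 0, 0, 0, 0, 0, 8], nbins=4)     # mostly one bin
```

An evenly spread feature has maximal entropy and therefore a weight near 0, while a feature whose values pile up in a few bins gets a weight closer to 1.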
