Evaluation & Metrics#

Evaluating Predictive Maintenance (PdM) models requires more than simple accuracy, because industrial data is inherently imbalanced (failures are rare events) and the modeling tasks differ (classification, regression, survival analysis). PdMLabs provides a set of metrics adapted to each modeling task, encapsulated in a modular evaluation architecture.
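
To make the imbalance argument concrete, the following minimal sketch uses made-up data (not a PdMLabs API) to show how a model that never raises an alarm still reaches 99% accuracy while detecting no failures at all. The helper functions `accuracy` and `recall` are illustrative names defined here.

```python
# Toy data: 1000 time steps, only 10 of which are failure-related (1 = failure).
labels = [1] * 10 + [0] * 990

# A useless "model" that never raises an alarm.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of time steps where prediction and label agree."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true failure steps that were flagged."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return true_pos / positives if positives else 0.0


acc = accuracy(labels, predictions)
rec = recall(labels, predictions)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")  # accuracy=0.99, recall=0.00
```

High accuracy here says nothing about failure detection, which is why the metrics below focus on recall, precision, and task-specific scores.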

Evaluation Architecture#

PdMLabs evaluates experiments using an extensible interface pattern. Internally, PdMExperiment uses an orchestrator method (_run_evaluators()) to compute and log metrics.

Depending on the chosen experiment flavor, a built-in “Default Evaluator” is automatically assigned:

TSAD / Classification: Uses DefaultADEvaluator to compute classification and anomaly detection metrics.
Remaining Useful Life (RUL): Uses DefaultRULEvaluator, which computes regression metrics and re-uses AD metrics to evaluate thresholded classifications.
Survival Analysis (SA): Uses DefaultSurvEvaluator, computing survival-specific metrics while also integrating AD/RUL metrics.

All evaluators inherit from the EvaluatorInterface located in pdmlabs/evaluation/evaluator.py.

Supported Metrics#

Time-Series Anomaly Detection (TSAD) & Classification

For anomaly detection, the critical requirement is catching early signs of a failure before it occurs.

Recall (AD1/AD2/AD3): Fraction of true anomalies detected. AD1, AD2, and AD3 levels correspond to different perspectives on detection tolerance and episode-aware splitting around failure timestamps.
Precision: Fraction of detected anomalies that actually precede a fault within a specified predictive horizon (PH).
F1-Score (AD1/AD2/AD3): Harmonic mean of precision and recall.
AUC-PR (AD1/AD2/AD3): Area Under the Precision-Recall Curve.
VUS (Volume Under the Surface): Optional volume-based metrics for time-series anomaly detection.
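
As a rough illustration of how a predictive horizon shapes precision and recall, here is a simplified event-level sketch. It is not the PdMLabs AD1/AD2/AD3 implementation; the function `horizon_metrics`, its arguments, and the exact window convention are assumptions made for this example.

```python
def horizon_metrics(alarm_times, failure_times, ph):
    """Event-level precision, recall, and F1 under a predictive horizon `ph`.

    An alarm counts as a true positive if some failure occurs within `ph`
    time units after it. A failure counts as detected if some alarm falls
    in the window [failure - ph, failure).
    """
    def precedes(alarm, failure):
        return 0 < failure - alarm <= ph

    good_alarms = sum(
        any(precedes(a, f) for f in failure_times) for a in alarm_times
    )
    caught = sum(
        any(precedes(a, f) for a in alarm_times) for f in failure_times
    )

    precision = good_alarms / len(alarm_times) if alarm_times else 0.0
    recall = caught / len(failure_times) if failure_times else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two failures at t=100 and t=250; four alarms; PH of 20 time units.
precision, recall, f1 = horizon_metrics([85, 95, 150, 300], [100, 250], ph=20)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Only the alarms at t=85 and t=95 fall inside a horizon window, and only the failure at t=100 is preceded by an alarm, so both precision and recall are 0.5. Widening or narrowing `ph` changes which alarms count, which is why the horizon should be reported alongside the scores.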

Remaining Useful Life (RUL) & Survival Analysis (SA) Cross-Evaluation

A unique feature of PdMLabs is the cross-evaluation between RUL and Survival Analysis models. Inspired by the TITEUF SYSTEM (see TITEUF on GitHub), PdMLabs calculates Survival Analysis metrics for deterministic RUL predictions, and vice versa (calculating regression metrics from survival probabilities). This provides a holistic view of a model’s performance regardless of its foundational approach.
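
One way to picture the cross-evaluation is as a pair of conversions between the two output types. The sketch below is an assumption about how such a conversion could work, not the PdMLabs or TITEUF code: a deterministic RUL becomes a step-shaped survival curve, and a survival curve becomes an expected RUL by integrating it (the restricted mean survival time over the time grid). The function names and the time grid, which is assumed to start at the current time, are hypothetical.

```python
def rul_to_survival(rul, times):
    """Deterministic RUL -> step survival curve: 1.0 before `rul`, 0.0 after."""
    return [1.0 if t < rul else 0.0 for t in times]


def survival_to_expected_rul(times, surv):
    """Survival curve -> expected RUL as the area under S(t).

    Uses the trapezoidal rule over the grid `times`, so the result is
    restricted to the span covered by that grid.
    """
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (surv[i - 1] + surv[i]) * (times[i] - times[i - 1])
    return area


times = [0, 10, 20, 30, 40]

# A point prediction of 25 time units becomes a step function on the grid.
step_curve = rul_to_survival(25, times)

# A probabilistic survival curve collapses to a single expected RUL value.
expected_rul = survival_to_expected_rul(times, [1.0, 0.8, 0.5, 0.2, 0.0])
print(step_curve, expected_rul)
```

Once both model types can be expressed in either form, the same regression and survival metrics can be applied to both.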

RUL models are evaluated as regression tasks using the metrics below, and survival metrics are computed in addition:

MAPE: Mean Absolute Percentage Error.
MDAPE: Median Absolute Percentage Error.
MSE / RMSE: Mean Squared Error and Root Mean Squared Error.
MAE: Mean Absolute Error.
R² Score: Coefficient of determination.
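
For reference, the regression metrics listed above can be computed as in the following standard-library sketch. The function `regression_metrics` is an illustrative helper, not part of PdMLabs. Skipping zero-valued targets in the percentage errors is an assumption made here, and PdMLabs may handle that case differently.

```python
import math
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Standard regression errors for RUL predictions.

    Assumes at least one non-zero target and non-constant targets, so that
    the percentage errors and R2 are well defined.
    """
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    # Percentage errors are undefined when the true RUL is 0; skip those points.
    pct_errors = [abs(e) / abs(t) for e, t in zip(errors, y_true) if t != 0]

    mse = mean(e * e for e in errors)
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)

    return {
        "MAE": mean(abs_errors),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": 100 * mean(pct_errors),
        "MDAPE": 100 * median(pct_errors),
        "R2": 1 - ss_res / ss_tot,
    }


metrics = regression_metrics(
    y_true=[100, 80, 60, 40, 20],
    y_pred=[90, 85, 60, 30, 25],
)
print(metrics)
```

In this example the same absolute error of 10 contributes 10% at a true RUL of 100 but 25% at a true RUL of 40. Percentage-based metrics therefore weight errors near the end of life more heavily than MAE or RMSE do.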

Survival models output survival probabilities over time. These are evaluated with metrics that handle right-censored data, and an expected RUL is also extracted from them to compute the standard regression errors:

IBS (Integrated Brier Score): Brier score integrated over time, reflecting both the calibration and the discrimination ability of the survival probabilities.
Max Brier Score: Maximum Brier score observed across time points.
C-Index (Concordance Index): Measures how well the model predicts the ordering of survival times.
Mean AUC-ROC: Area Under the Receiver Operating Characteristic Curve, averaged over the evaluation time points.
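
To show how right-censoring enters a survival metric, the sketch below implements Harrell's concordance index in plain Python. It is a textbook formulation for illustration, not necessarily the estimator PdMLabs uses internally, and the function name and toy data are made up.

```python
def concordance_index(event_times, event_observed, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when the unit with the shorter time actually
    failed (was not censored). It is concordant when that unit also has the
    higher predicted risk. Ties in risk count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(event_times)
    for i in range(n):
        for j in range(n):
            if event_times[i] < event_times[j] and event_observed[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


times = [5, 8, 12, 20]                 # observed time per unit
observed = [True, False, True, True]   # False = censored (no failure seen)
risk = [0.9, 0.3, 0.4, 0.6]            # higher = expected to fail sooner

c_index = concordance_index(times, observed, risk)
print(c_index)  # 0.75
```

The censored unit (time 8) never anchors a pair as the earlier failure, because its true failure time is unknown. It can still appear as the longer-lived member of a pair. Of the four comparable pairs, three are ordered correctly, giving 0.75.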

Adding Custom Metrics#

PdMLabs is designed to be easily extensible. If you want to compute custom business metrics, cost-based metrics, or domain-specific logic, you can inject your own evaluators into the pipeline without modifying the core experiment code.

Step 1: Implement the Interface

Create a new class inheriting from EvaluatorInterface and implement the evaluate() method. The method receives the experiment object as an argument, and the resulting predictions and labels through kwargs.

from pdmlabs.evaluation.evaluator import EvaluatorInterface
import mlflow

class MyCustomCostEvaluator(EvaluatorInterface):
    def evaluate(self, experiment, **kwargs) -> dict:
        # Extract necessary variables from kwargs
        result_scores = kwargs.get('result_scores')
        result_labels = kwargs.get('result_labels')
        results_isfailure = kwargs.get('results_isfailure')

        # Compute your custom metric
        # e.g., total_cost = calculate_maintenance_cost(result_scores, results_isfailure)
        total_cost = 42.0

        my_metrics = {
            "business_cost": total_cost,
            "custom_roi": 1.5
        }

        # Optionally log metrics directly to MLflow
        mlflow.log_metrics(my_metrics)

        # Return metrics dict so it gets included in the experiment summary
        return my_metrics

Step 2: Pass to the Experiment

Pass an instance of your custom evaluator to the run_experiment() function using the custom_evaluators argument.

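from pdmlabs.RunExperiment import run_experiment

# dataset_handler, methods, param_spaces, and SupervisedPdMExperiment are
# assumed to be defined or imported earlier in your script.
my_evaluator = MyCustomCostEvaluator()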

best_params = run_experiment(
    dataset=dataset_handler,
    methods=methods,
    param_space_dict_per_method=param_spaces,
    method_names=['MyMethod'],
    experiments=[SupervisedPdMExperiment],
    experiment_names=['My Custom Eval Run'],
    custom_evaluators=[my_evaluator]  # Inject your evaluator here!
)

PdMLabs first runs the built-in default evaluator to produce the standard metrics, then iterates through your custom_evaluators to augment the MLflow logs with your own metrics.
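
The order of operations described above can be sketched with stand-in classes. Everything in this block is hypothetical and self-contained: `EvaluatorSketch`, `DefaultSketch`, `CostSketch`, and `run_evaluators` are illustrative names, and the real `_run_evaluators()` in PdMLabs may merge or log results differently.

```python
from abc import ABC, abstractmethod


class EvaluatorSketch(ABC):
    """Hypothetical stand-in for EvaluatorInterface (not the real class)."""

    @abstractmethod
    def evaluate(self, experiment, **kwargs) -> dict:
        ...


class DefaultSketch(EvaluatorSketch):
    """Plays the role of a built-in default evaluator."""

    def evaluate(self, experiment, **kwargs) -> dict:
        labels, scores = kwargs["result_labels"], kwargs["result_scores"]
        hits = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= 0.5)
        return {"recall": hits / max(sum(labels), 1)}


class CostSketch(EvaluatorSketch):
    """Plays the role of a user-supplied custom evaluator."""

    def evaluate(self, experiment, **kwargs) -> dict:
        labels, scores = kwargs["result_labels"], kwargs["result_scores"]
        false_alarms = sum(
            1 for l, s in zip(labels, scores) if l == 0 and s >= 0.5
        )
        return {"false_alarm_cost": 100.0 * false_alarms}


def run_evaluators(experiment, default, custom, **kwargs):
    """Run the default evaluator first, then each custom one, merging results."""
    metrics = dict(default.evaluate(experiment, **kwargs))
    for evaluator in custom:
        metrics.update(evaluator.evaluate(experiment, **kwargs))
    return metrics


summary = run_evaluators(
    None,  # no real experiment object is needed for this sketch
    DefaultSketch(),
    [CostSketch()],
    result_labels=[0, 0, 1, 1],
    result_scores=[0.1, 0.7, 0.8, 0.4],
)
print(summary)  # {'recall': 0.5, 'false_alarm_cost': 100.0}
```

Because results are merged with `dict.update` in this sketch, a custom evaluator that reuses a default metric name would overwrite it. Using distinct metric names avoids that ambiguity regardless of how the real pipeline merges results.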
