🎯 Implementing Classification Methods
Classification methods treat anomaly detection as a binary classification problem: each sample is assigned to either the normal (healthy) or the anomalous (degraded) state. This is the standard supervised learning approach for anomaly detection.
When to Use: When you have labeled training data annotated as normal vs anomalous. Works only with SupervisedPdMExperiment.
Key Requirement: Training data must include binary labels (0=normal, 1=anomalous).
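A quick sanity check on the labels before training can catch encoding mistakes early. The sketch below is illustrative only: `validate_binary_labels` is a hypothetical helper, not part of pdmlabs, and it assumes the labels arrive as a plain sequence of values.

```python
from collections import Counter


def validate_binary_labels(labels):
    """Return per-class counts, raising if any label is not 0 or 1.

    Hypothetical helper for illustration (not part of pdmlabs).
    """
    counts = Counter(labels)
    unexpected = sorted(set(counts) - {0, 1}, key=repr)
    if unexpected:
        raise ValueError(f"Labels must be 0 or 1, found: {unexpected}")
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}


counts = validate_binary_labels([0, 0, 1, 0, 1])
print(counts)  # {0: 3, 1: 2}
```

The returned counts also give a first look at class balance, which tends to be heavily skewed toward normal samples in maintenance data.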
Interface Overview
Classification methods inherit from SupervisedMethodInterface:
from pdmlabs.method.supervised_method import SupervisedMethodInterface
from pdmlabs.pdm_evaluation_types.types import EventPreferences


class MyClassificationMethod(SupervisedMethodInterface):
    def __init__(self, event_preferences: EventPreferences, **kwargs):
        super().__init__(event_preferences=event_preferences)
        # Your initialization
Key Characteristics:
- Has fit() method (training on labeled data)
- Binary classification: normal vs anomaly
- Returns anomaly probabilities as scores
- Must predict probability of positive class (anomalous)
- Works with SupervisedPdMExperiment only
Required Methods
All classification methods must implement:
- fit(historic_data, historic_sources, event_data, anomaly_ranges): Train on labeled data
- predict(target_data, source, event_data): Score test data
- predict_one(new_sample, source, is_event): Score a single sample
- get_params(): Return configuration dictionary
- __str__(): Return method name
- get_library(): Return library name (usually 'no_save')
- get_all_models(): Return trained models (for export)
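One way to confirm that a class covers the full list before wiring it into an experiment is a small introspection check. This is a sketch under assumptions: `missing_methods` and `REQUIRED_METHODS` are hypothetical names, and the check only inspects the class itself; it does not verify signatures or behavior.

```python
REQUIRED_METHODS = (
    "fit", "predict", "predict_one", "get_params",
    "__str__", "get_library", "get_all_models",
)


def missing_methods(cls, required=REQUIRED_METHODS):
    """List required methods that `cls` does not actually implement."""
    missing = []
    for name in required:
        attr = getattr(cls, name, None)
        # __str__ always exists via `object`, so compare against object's version.
        inherited_from_object = attr is getattr(object, name, None)
        # Abstract methods exist on the class but still need an implementation.
        still_abstract = getattr(attr, "__isabstractmethod__", False)
        if attr is None or inherited_from_object or still_abstract:
            missing.append(name)
    return missing


class Incomplete:
    def fit(self, historic_data, historic_sources, event_data, anomaly_ranges):
        pass

    def predict(self, target_data, source, event_data):
        return []


print(missing_methods(Incomplete))
# ['predict_one', 'get_params', '__str__', 'get_library', 'get_all_models']
```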
Example: XGBoost Classification
The XGBoost class is a reference implementation of supervised binary classification.
File Location: pdmlabs/method/xgboost.py
What It Does:
- Trains an XGBoost classifier to distinguish normal from anomalous samples
- Maintains a separate classifier per data source
- Returns the probability of anomaly as the score
Implementation Details:
class XGBoost(SupervisedMethodInterface):
    def __init__(self, event_preferences: EventPreferences, *args, **kwargs):
        super().__init__(event_preferences=event_preferences)
        self.model_per_source = {}
        self.initial_args = args
        self.initial_kwargs = kwargs
Training Phase:
    def fit(self, historic_data: list[pd.DataFrame],
            historic_sources: list[str],
            event_data: pd.DataFrame,
            anomaly_ranges: list[list]) -> None:
        """Train XGBoost classifier on labeled data.

        Args:
            historic_data: Training features (one DataFrame per source)
            historic_sources: Source identifiers
            event_data: Event log (for reference)
            anomaly_ranges: Binary labels (0=normal, 1=anomaly)
        """
        for data, source, labels in zip(historic_data, historic_sources, anomaly_ranges):
            model = xgb.XGBClassifier(*self.initial_args, **self.initial_kwargs)
            model.fit(data, labels)
            self.model_per_source[source] = model
Key Parameters:
- learning_rate: Step-size shrinkage applied at each boosting round (default: 0.1)
- max_depth: Maximum depth of each tree (default: 5)
- n_estimators: Number of boosting rounds (default: 100)
- subsample: Fraction of training samples drawn for each tree (default: 1.0)
- colsample_bytree: Fraction of features sampled for each tree (default: 1.0)
Prediction Phase:
    def predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame) -> list[float]:
        """Score test data as probability of anomaly."""
        model = self.model_per_source[source]
        # predict_proba returns [[prob_normal, prob_anomaly], ...]
        scores = model.predict_proba(target_data)[:, 1]  # Get anomaly probability
        return scores.tolist()
Creating Your Own Classification Method
Follow this template:
Step 1: Create File
Create pdmlabs/method/my_classifier.py:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # Your chosen algorithm

from pdmlabs.method.supervised_method import SupervisedMethodInterface
from pdmlabs.pdm_evaluation_types.types import EventPreferences


class MyClassifier(SupervisedMethodInterface):
    """Binary anomaly classifier for predictive maintenance.

    This method learns to distinguish normal operation from degraded/anomalous
    states using supervised binary classification.
    """

    def __init__(self,
                 event_preferences: EventPreferences,
                 n_estimators: int = 100,
                 max_depth: int = 10,
                 *args,
                 **kwargs):
        super().__init__(event_preferences=event_preferences)
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.initial_args = args
        self.initial_kwargs = kwargs
        self.model_per_source = {}
Step 2: Implement fit()
Train separate classifiers per source:
    def fit(self, historic_data: list[pd.DataFrame],
            historic_sources: list[str],
            event_data: pd.DataFrame,
            anomaly_ranges: list[list]) -> None:
        """
        Train a binary classifier for each data source.

        Args:
            historic_data: Training features (one per source)
            historic_sources: Source names (e.g., ['bearing_1', 'bearing_2'])
            event_data: Event log (optional reference)
            anomaly_ranges: Binary labels (list per source, values 0 or 1)

        The labels should be:
        - 0 for samples representing normal/healthy operation
        - 1 for samples showing degradation or failure conditions
        """
        for data, source, labels in zip(historic_data, historic_sources, anomaly_ranges):
            # Create classifier
            classifier = RandomForestClassifier(
                *self.initial_args,
                n_estimators=self.n_estimators,
                max_depth=self.max_depth,
                **self.initial_kwargs
            )
            # Train on labeled data
            classifier.fit(data, labels)
            # Store model for this source
            self.model_per_source[source] = classifier
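Because `zip` stops at the shortest input, a mismatch between the number of sources, data frames, and label lists would pass through the loop above silently. A defensive check at the top of `fit()` could look like the sketch below. `check_training_inputs` is a hypothetical helper, not part of pdmlabs; it relies only on `len()`, which returns the row count for a pandas DataFrame.

```python
def check_training_inputs(historic_data, historic_sources, anomaly_ranges):
    """Raise ValueError if the per-source training inputs are misaligned.

    Hypothetical helper: works with anything that supports len(),
    including pandas DataFrames and plain lists.
    """
    n = len(historic_sources)
    if len(historic_data) != n or len(anomaly_ranges) != n:
        raise ValueError(
            f"Expected {n} entries per argument, got "
            f"{len(historic_data)} data sets and {len(anomaly_ranges)} label lists"
        )
    for data, source, labels in zip(historic_data, historic_sources, anomaly_ranges):
        if len(data) != len(labels):
            raise ValueError(
                f"Source '{source}': {len(data)} rows but {len(labels)} labels"
            )


rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # stand-in for a DataFrame
check_training_inputs([rows], ["bearing_1"], [[0, 0, 1]])  # passes silently

try:
    check_training_inputs([rows], ["bearing_1"], [[0, 1]])
except ValueError as err:
    print(err)  # Source 'bearing_1': 3 rows but 2 labels
```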
Step 3: Implement predict()
Score using trained classifiers:
    def predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame) -> list[float]:
        """
        Score test data using trained classifier for source.

        Scores represent probability of anomaly (higher = more anomalous).

        Args:
            target_data: Test features (rows=samples, cols=features)
            source: Source identifier to look up trained model
            event_data: Event log (optional reference)

        Returns:
            List of anomaly probabilities [0.0, 1.0]
        """
        if source not in self.model_per_source:
            raise ValueError(f"No model trained for source '{source}'")
        classifier = self.model_per_source[source]
        # Get probability estimates
        # Most sklearn classifiers: predict_proba returns [[prob_class_0, prob_class_1], ...]
        probabilities = classifier.predict_proba(target_data)
        # Extract probability of class 1 (anomaly)
        anomaly_scores = probabilities[:, 1]
        return anomaly_scores.tolist()
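The `[:, 1]` index assumes the classifier saw both classes during training. In scikit-learn, the columns of `predict_proba` follow the order of the fitted `classes_` attribute, so a source whose training labels were all 0 yields a single column and the index fails. The sketch below looks the column up by label instead. It uses plain lists to stay self-contained, and `anomaly_scores_from_proba` is a hypothetical helper name.

```python
def anomaly_scores_from_proba(proba_rows, classes, positive_label=1):
    """Pick the positive-class column from predict_proba-style output.

    proba_rows: one row of class probabilities per sample.
    classes: class labels in column order, e.g. list(classifier.classes_).
    """
    classes = list(classes)
    if positive_label not in classes:
        # The model never saw an anomalous sample, so report zero probability.
        return [0.0 for _ in proba_rows]
    column = classes.index(positive_label)
    return [float(row[column]) for row in proba_rows]


print(anomaly_scores_from_proba([[0.9, 0.1], [0.2, 0.8]], classes=[0, 1]))
# [0.1, 0.8]
print(anomaly_scores_from_proba([[1.0], [1.0]], classes=[0]))
# [0.0, 0.0]
```

Returning zeros for a single-class model is one possible policy; raising an error at training time is an equally reasonable choice if every source is expected to contain anomalies.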
Step 4: Implement predict_one()
Score individual samples:
    def predict_one(self, new_sample: pd.Series, source: str, is_event: bool) -> float:
        """
        Score a single new sample.

        Args:
            new_sample: Single sample as pandas Series
            source: Source identifier
            is_event: Whether marked as event (context only)

        Returns:
            Anomaly probability for this sample
        """
        if source not in self.model_per_source:
            raise ValueError(f"No model trained for source '{source}'")
        classifier = self.model_per_source[source]
        # Convert to a single-row DataFrame (1, num_features) so the feature
        # names match those seen in fit(); a bare NumPy array triggers a
        # feature-name warning in recent scikit-learn versions
        sample_frame = new_sample.to_frame().T
        # Get probabilities
        probabilities = classifier.predict_proba(sample_frame)
        # Return probability of class 1 (anomaly)
        return float(probabilities[0, 1])
Step 5: Implement get_params()
Return hyperparameters:
    def get_params(self) -> dict:
        """Return all hyperparameters for reproducibility."""
        params = {
            'n_estimators': self.n_estimators,
            'max_depth': self.max_depth,
        }
        if not self.model_per_source:
            # No model trained yet: report only the constructor arguments
            return params
        # Get the remaining params from the first trained model
        first_model = next(iter(self.model_per_source.values()))
        return {**first_model.get_params(), **params}
Step 6: Implement remaining methods
    def __str__(self) -> str:
        """Human-readable method name."""
        return 'MyClassifier'

    def get_library(self) -> str:
        """Library for serialization."""
        return 'no_save'

    def get_all_models(self):
        """Export trained models."""
        return self.model_per_source
Data Labeling Guidelines
Classification requires accurate labeling of training data:
Normal (0) Samples:
- Healthy sensor readings
- Within expected parameter ranges
- Stable baseline measurements
- Pre-failure or early-stage operation

Anomalous (1) Samples:
- Degraded behavior indicators
- Fault condition signatures
- Early warning signs of failure
- Anomalous sensor patterns

Labeling Strategies:

Time-window based:
- Mark samples 0-N days before failure as anomaly (1)
- Mark remaining as normal (0)
- Adjustable lookback window affects class balance

Threshold-based:
- Mark samples exceeding thresholds as anomalies
- Use domain expertise to set thresholds

Human annotation:
- Domain experts manually review and label
- Most accurate but time-consuming

Event-driven:
- Use maintenance/failure events as boundaries
- Samples after event = anomalous, before = normal
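The time-window strategy can be sketched in a few lines. This is an illustrative example rather than pdmlabs code: `label_by_time_window` is a hypothetical helper, the timestamps and failure date are made up, and the window is treated as inclusive on both ends.

```python
from datetime import datetime, timedelta


def label_by_time_window(timestamps, failure_times, lookback_days):
    """Label a sample 1 if it falls within `lookback_days` before any failure."""
    window = timedelta(days=lookback_days)
    labels = []
    for ts in timestamps:
        in_window = any(f - window <= ts <= f for f in failure_times)
        labels.append(1 if in_window else 0)
    return labels


# Ten daily samples, with a single failure on the last day.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(days=i) for i in range(10)]
failures = [datetime(2024, 1, 10)]

for lookback in (2, 5):
    labels = label_by_time_window(timestamps, failures, lookback)
    print(lookback, labels, f"anomaly fraction = {sum(labels) / len(labels):.1f}")
# 2 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1] anomaly fraction = 0.3
# 5 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] anomaly fraction = 0.6
```

Widening the lookback from 2 to 5 days doubles the share of anomalous samples here, which is the class-balance effect noted above.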
Testing Your Implementation
With your labeled dataset prepared, test your custom classifier using run_experiment:
import pandas as pd

from pdmlabs.utils.dataset import Dataset
from pdmlabs.experiment.batch.supervised_experiment import SupervisedPdMExperiment
from pdmlabs.RunExperiment import run_experiment
from pdmlabs.method.my_classifier import MyClassifier  # file created in Step 1
from pdmlabs.pdm_evaluation_types.types import EventPreferences

# 1. Load data (must have binary labels: 0=normal, 1=anomaly)
df = pd.read_csv('your_labeled_data.csv')
dataset_handler = Dataset(
    data=df,
    datetime_column="timestamp",
    source_column="source",
    train_sources=0.6,
    val_sources=0.2,
    test_sources=0.2
)
ds_class, _ = dataset_handler.get_Classification_dataset()

# 2. Define hyperparameters for your classifier
method_param_space = {
    'n_estimators': [50, 100, 150],
    'max_depth': [8, 10, 12],
}

# 3. Run experiment with run_experiment
best_params = run_experiment(
    dataset=ds_class,
    methods=[MyClassifier],
    param_space_dict_per_method=[method_param_space],
    method_names=['MyClassifier'],
    experiments=[SupervisedPdMExperiment],
    experiment_names=['Classification'],
    MAX_RUNS=15,
    MAX_JOBS=2,
    INITIAL_RANDOM=2,
    profile_size=10,
    optimization_param='AD1_AUC'
)

# 4. Check results
print(f"Best parameters: {best_params[0]}")
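When sizing MAX_RUNS, it can help to know how many distinct configurations the parameter space contains. The sketch below enumerates them with the standard library. It makes no assumption about how run_experiment actually samples the space, which is not shown here; it only lists the candidates.

```python
from itertools import product

method_param_space = {
    "n_estimators": [50, 100, 150],
    "max_depth": [8, 10, 12],
}

# Cartesian product of all value lists, one dict per configuration.
names = list(method_param_space)
configs = [
    dict(zip(names, values))
    for values in product(*method_param_space.values())
]

print(len(configs))  # 9
print(configs[0])    # {'n_estimators': 50, 'max_depth': 8}
```

With 9 distinct combinations, the space above is smaller than the MAX_RUNS=15 budget used in the experiment call.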
Next Steps
- Review the XGBoost implementation in pdmlabs/method/xgboost.py
- Check classification metrics in pdmlabs/evaluation/vus/
- Explore SupervisedPdMExperiment in pdmlabs/experiment/batch/supervised_experiment.py