📊 Implementing Semi-Supervised Methods
Semi-supervised methods learn from clean profile/baseline data to establish normal behavior patterns, then detect deviations as anomalies. This is the most common scenario in predictive maintenance.
When to Use: When you have clean operational data (profiles) and want to detect degradation (anomalies). Works with AutoProfile, Incremental, and Historical semi-supervised experiments.
Key Difference from Unsupervised: Your method receives labeled training data representing normal operation.
Interface Overview
Semi-supervised methods inherit from SemiSupervisedMethodInterface:
from pdmlabs.method.semi_supervised_method import SemiSupervisedMethodInterface
from pdmlabs.pdm_evaluation_types.types import EventPreferences

class MySemiSupervisedMethod(SemiSupervisedMethodInterface):
    def __init__(self, event_preferences: EventPreferences, **kwargs):
        super().__init__(event_preferences=event_preferences)
        # Your initialization
Key Characteristics:
- Has fit() method (training on clean profiles)
- Trains separate models per data source
- Can run with AutoProfile, Incremental, or Historical experiments
- Scores test data against learned normal patterns
Required Methods
All semi-supervised methods must implement:
- fit(historic_data, historic_sources, event_data): Train on clean profiles
- predict(target_data, source, event_data): Score test data
- predict_one(new_sample, source, is_event): Score a single sample (online)
- get_params(): Return configuration dictionary
- __str__(): Return method name
- get_library(): Return library name for serialization (usually 'no_save')
- get_all_models(): Return trained models (for export)
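As a quick sanity check before wiring a method into an experiment, the list above can be turned into a small test. The sketch below is illustrative only: `REQUIRED_METHODS` and `missing_methods` are hypothetical helpers, not part of pdmlabs, and the check only confirms that each name is defined on the class, not that its signature matches the interface.

```python
# Hypothetical helper (not part of pdmlabs); names are taken from the list above.
REQUIRED_METHODS = (
    "fit", "predict", "predict_one", "get_params",
    "__str__", "get_library", "get_all_models",
)

def missing_methods(cls: type) -> list:
    """Return the required method names that `cls` does not define itself."""
    missing = []
    for name in REQUIRED_METHODS:
        attr = getattr(cls, name, None)
        # Every class inherits __str__ from object, so require an explicit override.
        if not callable(attr) or attr is getattr(object, name, None):
            missing.append(name)
    return missing

class Incomplete:
    """Toy class that implements only part of the interface."""
    def fit(self, historic_data, historic_sources, event_data): ...
    def predict(self, target_data, source, event_data): ...

print(missing_methods(Incomplete))
```

Running the same check against your own class should print an empty list once every method is in place.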
Example: Local Outlier Factor Semi-Supervised
The LocalOutlierFactor is a reference implementation of semi-supervised anomaly detection.
File Location: pdmlabs/method/lof_semi.py
What It Does:
- Trains LOF models on historical clean profiles
- Detects anomalies as deviations from learned local density
- Maintains separate models per data source (bearing, pump, etc.)
Implementation Details:
class LocalOutlierFactor(SemiSupervisedMethodInterface):
    def __init__(self, event_preferences: EventPreferences, *args, **kwargs):
        super().__init__(event_preferences=event_preferences)
        self.initial_args = args
        self.initial_kwargs = kwargs
        # Remove profile_size if present (framework-specific param)
        if 'profile_size' in kwargs:
            del self.initial_kwargs['profile_size']
        self.clf_class = local_outlier_factor  # sklearn's LocalOutlierFactor
        self.model_per_source = {}  # Stores trained models per source
Training Phase:
def fit(self, historic_data: list[pd.DataFrame],
        historic_sources: list[str],
        event_data: pd.DataFrame) -> None:
    """Train LOF model on each data source."""
    for data, source in zip(historic_data, historic_sources):
        # Create model with novelty=True for out-of-distribution detection.
        # Positional arguments are unpacked before keyword arguments for readability.
        model = self.clf_class(*self.initial_args, novelty=True, **self.initial_kwargs)
        model.fit(data)
        self.model_per_source[source] = model
Key Parameters:
- novelty=True — Switches to novelty detection (one-class classification)
- n_neighbors — Number of neighbors for local density (default: 20)
- contamination — Expected fraction of anomalies (default: auto)
Prediction Phase:
def predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame) -> list[float]:
    """Score test data against trained LOF model."""
    model = self.model_per_source[source]
    # score_samples returns the opposite of the LOF (higher = more normal),
    # so negate it to make higher = more anomalous
    scores = -model.score_samples(target_data)
    return scores.tolist()
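The training and prediction phases above can be tried outside the framework with scikit-learn alone. The following is a minimal sketch on synthetic data, not the pdmlabs implementation: the Gaussian "clean profile" and the two test points are made up for illustration, and `LocalOutlierFactor` here is scikit-learn's estimator rather than the pdmlabs wrapper of the same name.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for a clean profile: 200 samples, 2 features.
rng = np.random.default_rng(0)
clean_profile = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# novelty=True is required to call score_samples() on data not seen during fit().
model = LocalOutlierFactor(n_neighbors=20, novelty=True)
model.fit(clean_profile)

test_points = np.array([
    [0.0, 0.0],  # close to the center of the clean profile
    [8.0, 8.0],  # far from anything seen during training
])

# score_samples() is higher for normal points, so negate: higher = more anomalous.
anomaly_scores = -model.score_samples(test_points)
typical_score, outlier_score = anomaly_scores.tolist()
print(typical_score, outlier_score)
```

The second point receives the larger score, which is the orientation the framework expects from predict().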
Creating Your Own Semi-Supervised Method
Follow this template:
Step 1: Create File
Create pdmlabs/method/my_semi_supervised_method.py:
import pandas as pd
from sklearn.base import BaseEstimator  # Placeholder: import your chosen algorithm (called SomeAlgorithm below)
from pdmlabs.method.semi_supervised_method import SemiSupervisedMethodInterface
from pdmlabs.pdm_evaluation_types.types import EventPreferences

class MySemiSupervisedMethod(SemiSupervisedMethodInterface):
    """Describe your semi-supervised anomaly detection method.

    This method learns normal behavior from clean training data,
    then detects anomalies as deviations from the learned model.
    """

    def __init__(self,
                 event_preferences: EventPreferences,
                 n_neighbors: int = 20,
                 contamination: str = 'auto',
                 *args,
                 **kwargs):
        super().__init__(event_preferences=event_preferences)
        self.n_neighbors = n_neighbors
        self.contamination = contamination
        self.initial_args = args
        self.initial_kwargs = kwargs
        # Important: Clean up framework-specific parameters
        if 'profile_size' in self.initial_kwargs:
            del self.initial_kwargs['profile_size']
        self.model_per_source = {}  # Will store trained models
Step 2: Implement fit()
Train on clean profiles (one model per source):
def fit(self, historic_data: list[pd.DataFrame],
        historic_sources: list[str],
        event_data: pd.DataFrame) -> None:
    """Train anomaly detector on clean historical data.

    Args:
        historic_data: List of training DataFrames (one per source).
            Each represents clean operation periods.
        historic_sources: List of source names (e.g., ['bearing_1', 'bearing_2'])
        event_data: Event log for reference (optional)
    """
    for data, source in zip(historic_data, historic_sources):
        # Initialize model for this source
        # (positional arguments are unpacked before keyword arguments for readability)
        model = SomeAlgorithm(
            *self.initial_args,
            n_neighbors=self.n_neighbors,
            contamination=self.contamination,
            **self.initial_kwargs
        )
        # Train on clean profile data
        model.fit(data)
        # Store for later prediction
        self.model_per_source[source] = model
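`SomeAlgorithm` in the template is a placeholder. To make the per-source pattern concrete, here is a self-contained sketch with a hypothetical stand-in detector, `MeanDistanceDetector`, which is not part of pdmlabs or scikit-learn. It assumes plain lists of rows instead of DataFrames to stay dependency-free, and it follows scikit-learn's convention that `score_samples` returns higher values for more normal samples.

```python
import math
from statistics import fmean

class MeanDistanceDetector:
    """Hypothetical stand-in for SomeAlgorithm: scores by distance to the training mean."""

    def fit(self, rows):
        # Column-wise mean of the clean profile.
        self.center_ = [fmean(column) for column in zip(*rows)]
        return self

    def score_samples(self, rows):
        # Negative Euclidean distance to the center: higher = more normal.
        return [-math.dist(row, self.center_) for row in rows]

# Made-up clean profiles, one per source.
historic_data = [
    [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]],    # bearing_1
    [[10.0, 5.0], [10.5, 5.5], [9.5, 4.5]],  # bearing_2
]
historic_sources = ["bearing_1", "bearing_2"]

# One independent model per source, keyed by source name.
model_per_source = {}
for rows, source in zip(historic_data, historic_sources):
    model_per_source[source] = MeanDistanceDetector().fit(rows)

# The same sample is normal for bearing_1 but anomalous for bearing_2.
sample = [[1.0, 1.0]]
score_b1 = -model_per_source["bearing_1"].score_samples(sample)[0]
score_b2 = -model_per_source["bearing_2"].score_samples(sample)[0]
print(score_b1, score_b2)
```

Keeping one model per source is what lets the same reading be scored differently depending on which machine produced it.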
Step 3: Implement predict()
Score test data using trained models:
def predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame) -> list[float]:
    """
    Score all test samples against trained model for source.

    Args:
        target_data: Test features (rows=samples, cols=features)
        source: Source identifier to look up trained model
        event_data: Event log (optional reference)

    Returns:
        List of anomaly scores (higher = more anomalous)
    """
    if source not in self.model_per_source:
        raise ValueError(f"No model trained for source '{source}'")
    model = self.model_per_source[source]
    # Get anomaly scores (method-specific).
    # scikit-learn detectors return higher = more normal from score_samples(),
    # so negate to get higher = more anomalous. Drop the negation if your
    # algorithm already returns higher = more anomalous.
    scores = -model.score_samples(target_data)
    return scores.tolist()
Step 4: Implement predict_one()
Score single samples (for online/incremental experiments):
def predict_one(self, new_sample: pd.Series, source: str, is_event: bool) -> float:
    """
    Score a single new sample.

    Args:
        new_sample: Single sample as pandas Series
        source: Source identifier
        is_event: Whether marked as event (context only)

    Returns:
        Anomaly score for the sample (higher = more anomalous)
    """
    if source not in self.model_per_source:
        raise ValueError(f"No model trained for source '{source}'")
    model = self.model_per_source[source]
    # Convert to expected shape: one row, n_features columns
    sample_array = new_sample.to_numpy().reshape(1, -1)
    # Score using same logic as batch predict (negate so higher = more anomalous)
    score = -model.score_samples(sample_array)[0]
    return float(score)
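The `reshape(1, -1)` step exists because scikit-learn style estimators expect a 2D input of shape (n_samples, n_features), while a Series converts to a 1D array. The sketch below uses made-up feature names to show the shapes involved. It also shows `to_frame().T` as an alternative that keeps the column names, which may be useful because scikit-learn estimators fitted on a DataFrame can warn when later given an array without feature names.

```python
import pandas as pd

# Made-up single reading; the feature names are illustrative only.
new_sample = pd.Series({"temperature": 71.5, "vibration": 0.32, "pressure": 4.1})

flat = new_sample.to_numpy()            # 1D array: one value per feature
sample_array = flat.reshape(1, -1)      # 2D array: one row, three columns
sample_frame = new_sample.to_frame().T  # one-row DataFrame that keeps column names

print(flat.shape)                  # (3,)
print(sample_array.shape)          # (1, 3)
print(list(sample_frame.columns))  # ['temperature', 'vibration', 'pressure']
```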
Step 5: Implement get_params()
Return configuration for reproducibility:
def get_params(self) -> dict:
    """Return all hyperparameters."""
    params = {
        'n_neighbors': self.n_neighbors,
        'contamination': self.contamination,
    }
    # Before fit() there is no trained model, so return only the wrapper's own settings
    if not self.model_per_source:
        return params
    # Use the first trained model as a reference for the underlying algorithm's params
    model = next(iter(self.model_per_source.values()))
    return {**model.get_params(), **params}
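get_params() merges the underlying model's parameters with the wrapper's own, and the order of the merge decides which value wins when a key appears in both. A small sketch with made-up dictionaries (the `model_params` dict stands in for whatever `model.get_params()` would return):

```python
# Stand-in for model.get_params(); the values are illustrative.
model_params = {"n_neighbors": 20, "contamination": "auto", "novelty": True}
# The wrapper's own hyperparameters.
wrapper_params = {"n_neighbors": 15, "contamination": "auto"}

# Later entries override earlier ones, so the wrapper's values take precedence.
merged = {**model_params, **wrapper_params}
print(merged)  # {'n_neighbors': 15, 'contamination': 'auto', 'novelty': True}
```

Listing the wrapper's values last keeps the reported configuration aligned with what was passed to the method's constructor.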
Step 6: Implement __str__(), get_library(), get_all_models()
def __str__(self) -> str:
    """Human-readable method name."""
    return 'MySemiSupervisedMethod'

def get_library(self) -> str:
    """Return library for serialization."""
    return 'no_save'  # or your library name

def get_all_models(self):
    """Return trained models for export (optional)."""
    return self.model_per_source
Testing Your Implementation
With your dataset prepared, test your custom semi-supervised method using run_experiment:
import pandas as pd

from pdmlabs.utils.dataset import Dataset
from pdmlabs.experiment.batch.auto_profile_semi_supervised_experiment import AutoProfileSemiSupervisedPdMExperiment
from pdmlabs.RunExperiment import run_experiment
from pdmlabs.pdm_evaluation_types.types import EventPreferences
# Import path matches the file created in Step 1
from pdmlabs.method.my_semi_supervised_method import MySemiSupervisedMethod

# 1. Load data (clean profiles for training)
df = pd.read_csv('your_data.csv')
dataset_handler = Dataset(
    data=df,
    datetime_column="timestamp",
    source_column="source",
    train_sources=0.6,
    val_sources=0.2,
    test_sources=0.2
)
ds_semi, _ = dataset_handler.get_semi_dataset()

# 2. Define hyperparameters for your method
method_param_space = {
    'n_neighbors': [15, 20, 25],
    'contamination': ['auto'],
}

# 3. Run experiment with run_experiment
best_params = run_experiment(
    dataset=ds_semi,
    methods=[MySemiSupervisedMethod],
    param_space_dict_per_method=[method_param_space],
    method_names=['MySemiSupervisedMethod'],
    experiments=[AutoProfileSemiSupervisedPdMExperiment],
    experiment_names=['AutoProfile'],
    MAX_RUNS=10,
    MAX_JOBS=2,
    INITIAL_RANDOM=2,
    profile_size=[5, 10],
    fit_size=5,
    optimization_param='AD1_AUC'
)

# 4. Check results
print(f"Best parameters: {best_params[0]}")
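To get a feel for the size of the search space defined above, the candidate combinations can be enumerated directly. This sketch assumes that `profile_size` is searched jointly with the method's own parameters. How `run_experiment` actually explores the space (for example, how it uses `MAX_RUNS` and `INITIAL_RANDOM`) is not shown here, so treat the enumeration as an upper bound on distinct configurations rather than a description of the search.

```python
from itertools import product

method_param_space = {
    "n_neighbors": [15, 20, 25],
    "contamination": ["auto"],
}
profile_sizes = [5, 10]

# Cartesian product of all method parameters, crossed with each profile size.
names = list(method_param_space)
combinations = [
    {**dict(zip(names, values)), "profile_size": size}
    for values in product(*method_param_space.values())
    for size in profile_sizes
]

print(len(combinations))  # 6
print(combinations[0])    # {'n_neighbors': 15, 'contamination': 'auto', 'profile_size': 5}
```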
Next Steps
- Review LocalOutlierFactor source code in pdmlabs/method/lof_semi.py
- Explore parameter generation in pdmlabs/utils/automatic_parameter_generation.py::online_technique()
- Check dataset structure in pdmlabs/utils/dataset.py::get_semi_dataset()