Implementing Unsupervised Methods
Unsupervised methods detect anomalies without any labeled training data. They're ideal for discovering novel or unexpected patterns in sensor streams.
When to Use: Early exploratory analysis, systems where labeling is expensive, or when normal behavior is hard to define.
Framework: User Guide → unsupervised-overview (if exists)
Interface Overview
Unsupervised methods inherit from UnsupervisedMethodInterface:
from pdmlabs.method.unsupervised_method import UnsupervisedMethodInterface
from pdmlabs.pdm_evaluation_types.types import EventPreferences


class MyUnsupervisedMethod(UnsupervisedMethodInterface):
    def __init__(self, event_preferences: EventPreferences, **kwargs):
        super().__init__(event_preferences=event_preferences)
        # Your initialization
Key Characteristics:
- No fit() method (no separate training phase)
- Models train online on a sliding window of data
- No labeled training data required
- Works with UnsupervisedPdMExperiment only
Required Methods
All unsupervised methods must implement:
- predict(target_data, source, event_data): Score entire dataset
- predict_one(new_sample, source, is_event): Score single sample (online)
- get_params(): Return configuration dictionary
- __str__(): Return method name
- get_library(): Return library name for serialization (usually 'no_save')
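Before launching an experiment, a small duck-typing check can catch a forgotten method. The sketch below is illustrative only: `REQUIRED_METHODS` and `missing_methods` are hypothetical names, not part of pdmlabs, and the check looks at method names rather than signatures. It also treats methods inherited from a base class as present, so it is most informative on a class that does not yet inherit placeholder implementations.

```python
REQUIRED_METHODS = ("predict", "predict_one", "get_params", "get_library")


def missing_methods(cls) -> list[str]:
    """Return the required method names that `cls` does not provide."""
    missing = [name for name in REQUIRED_METHODS
               if not callable(getattr(cls, name, None))]
    # Every class inherits __str__ from object, so check that it is overridden.
    if cls.__str__ is object.__str__:
        missing.append("__str__")
    return missing


class Incomplete:
    """Toy class that implements only part of the interface."""

    def predict(self, target_data, source, event_data):
        return []

    def __str__(self):
        return "Incomplete"


print(missing_methods(Incomplete))
```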
Example: Isolation Forest Unsupervised
The IsolationForestUnsupervised class is a reference implementation of unsupervised anomaly detection.
File Location: pdmlabs/method/isolation_forest_uns.py
What It Does:
- Trains Isolation Forest models on sliding windows of data
- Each window produces scores based on isolation depth
- Combines overlapping window scores using configurable policies (or/and/first/last)
Implementation Details:
class IsolationForestUnsupervised(UnsupervisedMethodInterface):
    def __init__(self,
                 event_preferences: EventPreferences,
                 window=40,      # Sliding window size
                 slide=0.5,      # Step size as fraction of window
                 policy="or",    # Score aggregation: or/and/first/last
                 *args,
                 **kwargs):
        super().__init__(event_preferences=event_preferences)
        self.window = window
        self.slide = int(slide * window)
        self.policy = policy
        self.clf_class = isolation_forest  # sklearn's IsolationForest
        self.model_per_source = {}  # Not used for unsupervised, but kept for consistency
Key Components:
- window: Size of sliding window for training (default: 40 samples)
- slide: Step size between windows as fraction (default: 0.5 = 50% overlap)
- policy: How to combine scores from overlapping windows:
  - "or": Take maximum score (most sensitive)
  - "and": Take minimum score (most conservative)
  - "first": Keep first score seen
  - "last": Keep last score seen
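The four policies can be sketched as a single merge function. This is a minimal illustration of the behavior described above, not the code in isolation_forest_uns.py; the name `combine_scores` and the use of `None` for "no score yet" are assumptions made for the example.

```python
def combine_scores(existing, new, policy="or"):
    """Merge a sample's previous score with the score from a new window."""
    if existing is None:       # first window that covers this sample
        return new
    if policy == "or":         # most sensitive: keep the larger score
        return max(existing, new)
    if policy == "and":        # most conservative: keep the smaller score
        return min(existing, new)
    if policy == "first":      # keep the score from the earliest window
        return existing
    if policy == "last":       # keep the score from the latest window
        return new
    raise ValueError(f"unknown policy: {policy!r}")


# A sample scored 0.2 by one window and 0.7 by the next overlapping window:
for policy in ("or", "and", "first", "last"):
    print(policy, combine_scores(0.2, 0.7, policy))
```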
Prediction Flow:
1. Initialize empty score dictionary
2. Slide window over data in steps of slide samples
3. For each window position:
   - Extract window data
   - Train new Isolation Forest on window
   - Score all samples in window
   - Combine with existing scores using policy
4. Return final scores for all samples
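The flow above can be sketched in plain Python. To keep the example dependency-free, a stand-in scorer (absolute distance from the window median) takes the place of training an Isolation Forest on each window, and `max` stands in for the "or" policy. All function names here are hypothetical, and samples that fall after the last full window are left as `None`; how the reference implementation treats that tail is not covered here.

```python
from statistics import median


def window_starts(n_samples, window, step):
    """Start indices of each full window when sliding by `step` samples."""
    return list(range(0, max(n_samples - window, 0) + 1, step))


def deviation_scorer(values):
    """Stand-in scorer: absolute distance from the window median."""
    center = median(values)
    return [abs(v - center) for v in values]


def sliding_window_scores(values, window=40, slide=0.5,
                          score_window=deviation_scorer, combine=max):
    step = max(1, int(slide * window))        # fraction of window -> samples
    scores = {}                               # 1. empty score dictionary
    for start in window_starts(len(values), window, step):    # 2. slide
        chunk = values[start:start + window]                  # 3. extract window
        for offset, score in enumerate(score_window(chunk)):  # 3. fit + score
            idx = start + offset
            # 3. combine with a score from an earlier overlapping window
            scores[idx] = score if idx not in scores else combine(scores[idx], score)
    return [scores.get(i) for i in range(len(values))]        # 4. final scores


data = [1.0] * 100
data[57] = 9.0                                # one injected anomaly
result = sliding_window_scores(data)
print(window_starts(len(data), 40, 20), result[57])
```

Passing `min` as `combine` would give the "and" behavior; swapping `score_window` for a function that fits a real detector on each chunk gives the per-window training described above.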
Scoring:
- -tempmodel.score_samples(window_data): negated because sklearn's score_samples returns lower values for more anomalous samples
- Higher score = more anomalous
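The sign convention can be checked directly with scikit-learn. The snippet below is a small sketch that assumes numpy and scikit-learn are installed; the variable names and the injected outlier are illustrative and need not match isolation_forest_uns.py.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
window_data = rng.normal(0.0, 1.0, size=(40, 2))   # one window of 40 samples
window_data[-1] = [8.0, 8.0]                       # inject an obvious outlier

model = IsolationForest(n_estimators=100, random_state=0).fit(window_data)

raw = model.score_samples(window_data)   # lower = more anomalous
scores = -raw                            # flipped: higher = more anomalous

print(int(scores.argmax()))              # index of the most anomalous sample
```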
Creating Your Own Unsupervised Method
Follow this template:
Step 1: Create File
Create pdmlabs/method/my_unsupervised_method.py:
import pandas as pd
from sklearn.ensemble import IsolationForest  # Replace with your chosen algorithm

from pdmlabs.method.unsupervised_method import UnsupervisedMethodInterface
from pdmlabs.pdm_evaluation_types.types import EventPreferences


class MyUnsupervisedMethod(UnsupervisedMethodInterface):
    """Describe your unsupervised anomaly detection method."""

    def __init__(self,
                 event_preferences: EventPreferences,
                 window: int = 50,
                 threshold: float = 0.5,
                 *args,
                 **kwargs):
        super().__init__(event_preferences=event_preferences)
        self.window = window
        self.threshold = threshold
        self.initial_args = args
        self.initial_kwargs = kwargs
        self.model_per_source = {}  # Keep for consistency even if not used
Step 2: Implement predict()
Score entire dataset:
    def predict(self, target_data: pd.DataFrame, source: str, event_data: pd.DataFrame) -> list[float]:
        """
        Score all samples in target_data.

        Args:
            target_data: Features (rows=samples, cols=features)
            source: Source identifier (not needed for pure unsupervised)
            event_data: Event log (not needed for unsupervised)

        Returns:
            List of anomaly scores (one per sample)
        """
        # _compute_scores is a helper you write; it must return one score per row
        scores = self._compute_scores(target_data)
        return scores
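The template leaves the scoring helper for you to write. As one possible shape for it, the sketch below scores each row by its largest absolute z-score across features. It is a deliberately simple stand-in, not a method shipped with pdmlabs, and it works on plain lists of rows (a DataFrame could be converted with `target_data.to_numpy().tolist()`); the function name `compute_scores` is hypothetical.

```python
from statistics import mean, pstdev


def compute_scores(rows):
    """Score each row by its largest absolute z-score across features."""
    columns = list(zip(*rows))                          # per-feature tuples
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]   # avoid division by zero
    return [
        max(abs(value - c) / s for value, c, s in zip(row, centers, spreads))
        for row in rows
    ]


rows = [[1.0, 10.0], [1.1, 10.2], [0.9, 9.8], [1.0, 25.0]]
scores = compute_scores(rows)
print(scores.index(max(scores)))   # position of the most anomalous row
```

Whatever algorithm you choose, the contract is the same: one score per input row, in row order, with higher values meaning more anomalous.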
Step 3: Implement predict_one()
Score single sample (for online settings):
    def predict_one(self, new_sample: pd.Series, source: str, is_event: bool) -> float:
        """
        Score a single new sample.

        For fully streaming methods, you might:
        - Update a model buffer
        - Recompute on-demand
        - Use approximate scores

        Args:
            new_sample: Single sample as Series
            source: Source identifier
            is_event: Whether this sample is marked as event

        Returns:
            Anomaly score for the sample
        """
        # For algorithms that need recent context:
        # - Keep a sliding buffer
        # - Update statistics incrementally
        # - Approximate scores efficiently
        score = self._compute_score_single(new_sample)
        return score
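The "sliding buffer" idea can be sketched with a bounded deque. The class below is a hypothetical, univariate illustration (a rolling z-score), not part of pdmlabs: it scores each incoming value against recent history and then appends it, so old values drop out automatically once the buffer is full.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScore:
    """Score each new value against a fixed-size buffer of recent values."""

    def __init__(self, window=50, min_history=5):
        self.buffer = deque(maxlen=window)   # oldest values drop off automatically
        self.min_history = min_history

    def score_one(self, value):
        if len(self.buffer) < self.min_history:
            score = 0.0                      # not enough context yet
        else:
            spread = pstdev(self.buffer) or 1.0
            score = abs(value - mean(self.buffer)) / spread
        self.buffer.append(value)            # update the buffer after scoring
        return score


scorer = RollingZScore(window=20)
stream = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 30.0, 10.0]
scores = [scorer.score_one(v) for v in stream]
print(scores.index(max(scores)))             # position of the spike in the stream
```

Because predict_one receives a source identifier, one such scorer per source could be kept in a dictionary keyed by source, in the spirit of model_per_source; whether that fits your method is a design choice.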
Step 4: Implement get_params()
Return configuration:
    def get_params(self) -> dict:
        """Return all hyperparameters of this method instance."""
        return {
            'window': self.window,
            'threshold': self.threshold,
            # Include model-specific params if applicable
        }
Step 5: Implement __str__() and get_library()
    def __str__(self) -> str:
        """Human-readable method name."""
        return 'MyUnsupervisedMethod'

    def get_library(self) -> str:
        """Return 'no_save' for non-serializable methods."""
        return 'no_save'
Step 6: (Optional) Implement get_all_models()
Some methods need to export models:
    def get_all_models(self):
        """Return any trained models (for persistence)."""
        # Return None if not applicable
        return None
Testing Your Implementation
With your dataset prepared, test your custom unsupervised method using run_experiment:
import pandas as pd

from pdmlabs.utils.dataset import Dataset
from pdmlabs.experiment.batch.unsupervised_experiment import UnsupervisedPdMExperiment
from pdmlabs.RunExperiment import run_experiment
# Matches the file created in Step 1 (pdmlabs/method/my_unsupervised_method.py)
from pdmlabs.method.my_unsupervised_method import MyUnsupervisedMethod

# 1. Load data
df = pd.read_csv('your_data.csv')
dataset_handler = Dataset(
    data=df,
    datetime_column="timestamp",
    source_column="source",
    train_sources=0.6,
    val_sources=0.2,
    test_sources=0.2
)
ds_unsup, _ = dataset_handler.get_unsupervised_dataset()

# 2. Define hyperparameters for your method (keys match the __init__ arguments)
method_param_space = {
    'window': [30, 40, 50],
    'threshold': [0.5],
}

# 3. Run experiment with run_experiment
best_params = run_experiment(
    dataset=ds_unsup,
    methods=[MyUnsupervisedMethod],
    param_space_dict_per_method=[method_param_space],
    method_names=['MyUnsupervisedMethod'],
    experiments=[UnsupervisedPdMExperiment],
    experiment_names=['Unsupervised Detection'],
    MAX_RUNS=10,
    MAX_JOBS=2,
    INITIAL_RANDOM=2,
    profile_size=10,
    optimization_param='AD1_AUC'
)

# 4. Check results
print(f"Best parameters: {best_params[0]}")
Next Steps
- Review IsolationForestUnsupervised source code for a working reference
- Explore other unsupervised methods in pdmlabs/method/ (e.g., PCA, autoencoder-based)
- Implement parameter generation function in pdmlabs/utils/automatic_parameter_generation.py if needed