pdmlabs.preprocessing.record_level.record_level_pre_processor

Interface definition for record-level (feature-level) preprocessing components.

A preprocessor transforms raw features before anomaly detection. Common operations:

- Normalization/scaling (MinMaxScaler, StandardScaler)
- Feature engineering/aggregation
- Windowing (sliding window transformations)
- Feature selection
- Imputation/handling missing values

Preprocessing can be:

- Stateless: transform() produces the same output regardless of fit() (e.g., FeatureSelector)
- Stateful: fit() learns statistics from training data, and transform() applies them
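
The stateless/stateful distinction can be sketched with two toy classes. This is a minimal illustration using pandas, not pdmlabs code: ToyColumnSelector and ToyMinMaxScaler are hypothetical stand-ins that do not inherit from the real interface, and their method signatures are simplified.

```python
import pandas as pd


class ToyColumnSelector:
    """Stateless: the output depends only on the constructor arguments."""

    def __init__(self, columns):
        self.columns = list(columns)

    def fit(self, frames):
        # Nothing to learn, so fit() is a no-op.
        return None

    def transform(self, frame):
        return frame[self.columns]


class ToyMinMaxScaler:
    """Stateful: fit() learns per-column min/max, transform() applies them."""

    def __init__(self):
        self.min_ = None
        self.max_ = None

    def fit(self, frames):
        combined = pd.concat(frames)
        self.min_ = combined.min()
        self.max_ = combined.max()

    def transform(self, frame):
        if self.min_ is None:
            raise RuntimeError("call fit() before transform()")
        # Assumes no column is constant in the training data (max > min).
        return (frame - self.min_) / (self.max_ - self.min_)


train = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "rpm": [100.0, 200.0, 300.0]})
test = pd.DataFrame({"temp": [20.0, 40.0], "rpm": [150.0, 300.0]})

selector = ToyColumnSelector(["temp"])
selected = selector.transform(test)  # works without fit()

scaler = ToyMinMaxScaler()
scaler.fit([train])
scaled = scaler.transform(test)  # uses the min/max learned from train
```

Because the scaler reuses training statistics, test values outside the training range map outside [0, 1] (here, temp 40.0 becomes 1.5).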

All preprocessors implement RecordLevelPreProcessorInterface to ensure consistency.

Classes

RecordLevelPreProcessorInterface(...)

Abstract base class for record-level (feature) preprocessing.

class pdmlabs.preprocessing.record_level.record_level_pre_processor.RecordLevelPreProcessorInterface(event_preferences: EventPreferences)

Bases: ABC

Abstract base class for record-level (feature) preprocessing.

Record-level preprocessing operates on features/columns of individual records (as opposed to raw_level preprocessing which might handle multi-file combinations).

Subclasses must implement:

- fit(): Learn statistics from training data
- transform(): Apply learned transformation to test data
- transform_one(): Transform a single sample (for online scenarios)
- get_params(): Return hyperparameters as dict
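
Since the class derives from ABC, Python refuses to instantiate a subclass that leaves any abstract method unimplemented. The sketch below demonstrates this with ToyInterface, a hypothetical stand-in that mirrors the four method names and signatures shown on this page. It is not the pdmlabs class itself.

```python
from abc import ABC, abstractmethod


class ToyInterface(ABC):
    """Hypothetical stand-in mirroring the four abstract methods."""

    def __init__(self, event_preferences):
        self.event_preferences = event_preferences

    @abstractmethod
    def fit(self, historic_data, historic_sources, event_data, anomaly_ranges=None): ...

    @abstractmethod
    def transform(self, target_data, source, event_data): ...

    @abstractmethod
    def transform_one(self, new_sample, source, is_event): ...

    @abstractmethod
    def get_params(self): ...


class Incomplete(ToyInterface):
    # transform_one() and get_params() are missing on purpose.
    def fit(self, historic_data, historic_sources, event_data, anomaly_ranges=None):
        pass

    def transform(self, target_data, source, event_data):
        return target_data


class Identity(Incomplete):
    # Supplying the two remaining methods makes the class instantiable.
    def transform_one(self, new_sample, source, is_event):
        return new_sample

    def get_params(self):
        return {}


try:
    Incomplete(event_preferences={})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

identity = Identity(event_preferences={})
```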

event_preferences

Event configuration dict with ‘failure’, ‘reset’, etc. for context-aware preprocessing if needed.

Type: EventPreferences

Examples

>>> class MyScaler(RecordLevelPreProcessorInterface):
...     def fit(self, historic_data, historic_sources, event_data, anomaly_ranges=None):
...         # Learn scaling factors from historic_data
...         pass
...     def transform(self, target_data, source, event_data):
...         # Apply scaling to target_data
...         return target_data
...     def transform_one(self, new_sample, source, is_event):
...         # Transform single row
...         return new_sample
...     def get_params(self):
...         # Return {param_name: value}
...         return {}
...     def __str__(self):
...         return 'MyScaler'

abstract fit(historic_data: list, historic_sources: list[str], event_data: DataFrame, anomaly_ranges=None) → None

Learn preprocessing statistics from training data.

Called once per experiment fold on all historic (training) data. Subclasses use this to compute statistics (e.g., min/max, mean/std) that are later applied in transform().

Parameters:

historic_data (list[pd.DataFrame]) – List of training DataFrames, one per source.
historic_sources (list[str]) – List of source identifiers corresponding to historic_data (e.g., [‘bearing_1’, ‘bearing_2’]).
event_data (pd.DataFrame) – Complete event log with columns [‘date’, ‘type’, ‘description’, ‘source’]. May be used to segment preprocessing (e.g., per-episode statistics).
anomaly_ranges (optional) – Pre-computed anomaly ranges (rarely used).
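
As one possible shape for a stateful fit(), the sketch below pairs each training DataFrame with its source identifier and stores per-source mean and standard deviation. fit_per_source is a hypothetical helper, not part of pdmlabs, and it ignores event_data and anomaly_ranges for brevity.

```python
import pandas as pd


def fit_per_source(historic_data, historic_sources):
    """Return {source: (mean, std)} learned from the matching DataFrame."""
    if len(historic_data) != len(historic_sources):
        raise ValueError("historic_data and historic_sources must have equal length")
    stats = {}
    for frame, source in zip(historic_data, historic_sources):
        # ddof=0 gives the population standard deviation.
        stats[source] = (frame.mean(), frame.std(ddof=0))
    return stats


bearing_1 = pd.DataFrame({"vibration": [1.0, 2.0, 3.0]})
bearing_2 = pd.DataFrame({"vibration": [10.0, 10.0, 40.0]})

stats = fit_per_source([bearing_1, bearing_2], ["bearing_1", "bearing_2"])
mean_1, std_1 = stats["bearing_1"]
mean_2, std_2 = stats["bearing_2"]
```

A transform() built on top of this would look up stats[source] to pick the statistics for the source it receives.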

abstract get_params()

Return hyperparameters and configuration as a dictionary.

Used for logging to MLflow and for reproducibility. Should return a flat dict with string keys and JSON-serializable values.

Returns:

Hyperparameters, e.g., {‘scale’: ‘minmax’, ‘feature_count’: 10}.

Return type:

dict

Examples

>>> scaler = MinMaxScaler(event_preferences={...})
>>> scaler.fit(...)
>>> print(scaler.get_params())
{}
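
The "flat dict with string keys and JSON-serializable values" requirement can be checked before logging. The sketch below uses only the standard library. check_params is a hypothetical helper, and it interprets "flat" as "no nested dicts", which is an assumption.

```python
import json


def check_params(params):
    """Raise if params is not a flat, JSON-serializable dict with string keys."""
    if not isinstance(params, dict):
        raise TypeError("get_params() should return a dict")
    for key, value in params.items():
        if not isinstance(key, str):
            raise TypeError(f"non-string key: {key!r}")
        if isinstance(value, dict):
            raise ValueError(f"value for {key!r} is a nested dict; flatten it first")
    json.dumps(params)  # raises TypeError for values JSON cannot encode
    return params


good = check_params({"scale": "minmax", "feature_count": 10})

try:
    check_params({"window": {"size": 5}})
    nested_error = None
except ValueError as exc:
    nested_error = str(exc)

try:
    check_params({"callback": object()})
    unserializable_rejected = False
except TypeError:
    unserializable_rejected = True
```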

abstract transform(target_data: DataFrame, source: str, event_data: DataFrame) → DataFrame

Apply learned preprocessing to test data.

Called after fit() to transform test/target data using statistics learned during fit. Must preserve index and alignment with original data.

Parameters:

target_data (pd.DataFrame) – Test data to transform.
source (str) – Source identifier (e.g., ‘bearing_1’). Use to select per-source statistics if learned separately.
event_data (pd.DataFrame) – Complete event log (same as in fit()).

Returns:

Transformed data with same shape and index as target_data.

Return type:

pd.DataFrame

Examples

>>> preprocessor = MinMaxScaler(event_preferences={...})
>>> preprocessor.fit([train_df], ['bearing_1'], events_df)
>>> test_df_scaled = preprocessor.transform(test_df, 'bearing_1', events_df)
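
The requirement that the output keep the shape and index of target_data can be verified directly. The sketch below is an assumption-laden illustration: scale_with is a hypothetical helper, and the min/max values are supplied by hand where a real preprocessor would have learned them in fit().

```python
import pandas as pd


def scale_with(target_data, col_min, col_max):
    """Min-max scale target_data column-wise without touching its index."""
    # Replace a zero span with 1 so constant columns do not divide by zero.
    span = (col_max - col_min).replace(0, 1)
    return (target_data - col_min) / span


index = pd.date_range("2024-01-01", periods=3, freq="D")
test_df = pd.DataFrame(
    {"temp": [10.0, 15.0, 20.0], "rpm": [500.0, 500.0, 500.0]},
    index=index,
)

col_min = pd.Series({"temp": 10.0, "rpm": 500.0})
col_max = pd.Series({"temp": 20.0, "rpm": 500.0})

scaled = scale_with(test_df, col_min, col_max)

# Same shape and same index as the input.
assert scaled.shape == test_df.shape
assert scaled.index.equals(test_df.index)
```

Element-wise arithmetic between a DataFrame and a Series aligns the Series labels with the DataFrame columns, so the row index passes through unchanged.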

abstract transform_one(new_sample: Series, source: str, is_event: bool) → Series

Transform a single record (for online/streaming scenarios).

Applied to individual rows as they arrive (not in batch). Used by streaming and online experiments to preprocess samples incrementally.

Parameters:

new_sample (pd.Series) – Single row with column names as index.
source (str) – Source identifier for this sample.
is_event (bool) – Whether this sample corresponds to an event timestamp (may trigger special handling, e.g., reset normalization).

Returns:

Transformed sample with same index as input.

Return type:

pd.Series

Note

For stateful preprocessors, the result should be consistent with what batch transform() would produce for the same row. In streaming use, this typically means applying statistics already learned in fit().
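
One way to check this consistency is to feed the rows of a batch through the single-sample path and compare the results. ToyStandardizer below is a hypothetical class with simplified signatures (no source argument, and is_event is accepted but ignored). Its mean and std are set by hand where a real preprocessor would have learned them in fit().

```python
import pandas as pd


class ToyStandardizer:
    """Hypothetical stateful preprocessor with batch and single-row paths."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def transform(self, target_data):
        return (target_data - self.mean) / self.std

    def transform_one(self, new_sample, is_event=False):
        # Same formula as transform(), applied to one pd.Series.
        return (new_sample - self.mean) / self.std


mean = pd.Series({"temp": 20.0, "rpm": 200.0})
std = pd.Series({"temp": 10.0, "rpm": 100.0})
prep = ToyStandardizer(mean, std)

batch = pd.DataFrame({"temp": [10.0, 30.0], "rpm": [100.0, 400.0]})
batch_out = prep.transform(batch)

# Feed the same rows one at a time, as a streaming experiment would.
rows_out = pd.DataFrame([prep.transform_one(row) for _, row in batch.iterrows()])
first_row_out = prep.transform_one(batch.iloc[0])
```

Sharing one formula between both paths, as this sketch does, is a simple way to keep the two from drifting apart.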
