pdmlabs.preprocessing.record_level.min_max_scaler_cheat

Min-Max scaling preprocessor that fits on test data (data leakage variant).

WARNING: MinMaxScalerCheat is for TESTING ONLY. It intentionally violates the train/test separation principle by fitting the scaler on the test data during transform(). This is a form of data leakage that provides unrealistically optimistic results.

This implementation is primarily for:

- Baseline/upper-bound performance (best possible scenario)
- Debugging and understanding preprocessing impact
- Sanity checking (what performance could we achieve if we knew the test data?)

DO NOT USE IN PRODUCTION or for realistic performance estimates.

Compare against MinMaxScaler (correct) to see preprocessing impact.
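To make the leakage concrete, here is a minimal sketch that uses plain Python lists rather than any pdmlabs class. The helper names `fit_min_max` and `apply_min_max` and the sample readings are hypothetical. It contrasts scaling test values with bounds learned from training data against scaling them with bounds learned from the test values themselves.

```python
def fit_min_max(values):
    """Return the (min, max) bounds learned from a list of numbers."""
    return min(values), max(values)


def apply_min_max(values, bounds):
    """Scale values with x' = (x - min) / (max - min); constant data maps to 0.0."""
    lo, hi = bounds
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical sensor readings, for illustration only.
train = [10.0, 12.0, 14.0, 20.0]
test = [15.0, 25.0, 30.0]

# Correct: bounds come from the training data, so unseen test values may leave [0, 1].
proper = apply_min_max(test, fit_min_max(train))

# Leaky: bounds come from the test data itself, so the output always spans exactly [0, 1].
leaky = apply_min_max(test, fit_min_max(test))

print(proper)  # [0.5, 1.5, 2.0]
print(leaky)
```

In the leaky variant the test values always land exactly in [0, 1], which is information a deployed model would not have when new data arrives.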

Classes

MinMaxScalerCheat(event_preferences)

Min-Max scaling that fits on test data (DATA LEAKAGE - FOR TESTING ONLY).

class pdmlabs.preprocessing.record_level.min_max_scaler_cheat.MinMaxScalerCheat(event_preferences: EventPreferences)

Bases: RecordLevelPreProcessorInterface

Min-Max scaling that fits on test data (DATA LEAKAGE - FOR TESTING ONLY).

This preprocessor fits the scaler on test data during transform(), which is cheating and provides unrealistically good results. It is useful for:

- Measuring best-case performance with perfect normalization
- Debugging the preprocessing pipeline
- Academic/research comparison to show preprocessing limits

WARNING: This violates proper machine learning practice. Use only for experimental analysis, not for model evaluation.

scaler_per_source

Maps source identifier to MinMaxScaler fitted on TEST data (cheating).

Type: dict

Examples

>>> # Bad: This is how NOT to do preprocessing
>>> scaler_cheat = MinMaxScalerCheat(event_preferences={'failure': [], 'reset': []})
>>> scaler_cheat.fit([df_train], ['bearing_1'], events_df)  # Does nothing
>>> df_test_cheated = scaler_cheat.transform(df_test, 'bearing_1', events_df)
>>> # Results will be unrealistically good because scaler was fit on df_test!

fit(historic_data: list, historic_sources: list[str], event_data: DataFrame, anomaly_ranges=None) → None

No-op fit (does nothing).

The scaler is fitted on test data during transform() instead, which is why this is cheating.

Parameters:

historic_data (list[pd.DataFrame]) – Ignored.
historic_sources (list[str]) – Ignored.
event_data (pd.DataFrame) – Ignored.
anomaly_ranges – Ignored.

get_params()

Return hyperparameters (none for this preprocessor).

Returns: Empty dict {} (no hyperparameters).
Return type: dict

transform(target_data: DataFrame, source: str, event_data: DataFrame) → DataFrame

Fit scaler on target data, then scale it (DATA LEAKAGE).

WARNING: This method violates train/test separation by fitting on the test data. Results are unrealistically optimistic.

Parameters:

target_data (pd.DataFrame) – Test data (used to fit AND transform).
source (str) – Source identifier.
event_data (pd.DataFrame) – Event log (unused).

Returns:

Scaled test data, using a scaler fitted on that same test data (cheating).

Return type:

pd.DataFrame

Examples

>>> df_test_cheated = scaler_cheat.transform(df_test, 'bearing_1', events_df)
>>> # Results will have suspiciously perfect scaling
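The fit-inside-transform pattern described above can be sketched as follows. This is a hypothetical stand-in, not the pdmlabs implementation: the class name `FitOnTransformScaler` is invented, and plain pandas arithmetic is assumed in place of whatever scaler object the library actually stores in `scaler_per_source`.

```python
import pandas as pd


class FitOnTransformScaler:
    """Hypothetical stand-in that mimics the documented behaviour."""

    def __init__(self):
        # Maps source identifier -> (column minima, column maxima).
        self.scaler_per_source = {}

    def fit(self, historic_data, historic_sources, event_data, anomaly_ranges=None):
        # Intentionally a no-op: nothing is learned from historic data.
        return None

    def transform(self, target_data, source, event_data):
        # Leakage: the bounds are computed from the very data being transformed.
        col_min = target_data.min()
        col_max = target_data.max()
        self.scaler_per_source[source] = (col_min, col_max)
        # Replace zero ranges with 1 so constant columns map to 0 instead of NaN.
        span = (col_max - col_min).replace(0, 1)
        return (target_data - col_min) / span


# Hypothetical test data, for illustration only.
df_test = pd.DataFrame({"vibration": [2.0, 4.0, 6.0], "temp": [50.0, 50.0, 50.0]})

scaler = FitOnTransformScaler()
scaler.fit([], [], pd.DataFrame())  # does nothing
scaled = scaler.transform(df_test, "bearing_1", pd.DataFrame())
print(scaled)
```

Because the bounds are recomputed on every call, each non-constant column of the output always has minimum 0 and maximum 1, regardless of what the historic data looked like.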

transform_one(new_sample: Series, source: str, is_event: bool) → Series

Scale a single sample using the fitted scaler.

Parameters:

new_sample (pd.Series) – Single row to scale.
source (str) – Source identifier.
is_event (bool) – Event flag (unused).

Returns:

Scaled sample.

Return type:

pd.Series

Note

transform_one() implementation has potential issues with the deprecated append() method and may not work as intended.
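Assuming the note refers to pandas' append(), which was deprecated in pandas 1.4 and removed in 2.0, a single row can be scaled without it. The sketch below is not the library's code: the stored bounds, column names, and sample values are hypothetical. It shows two alternatives, scaling the Series directly and building a one-row frame with `to_frame().T` and `pd.concat`.

```python
import pandas as pd

# Hypothetical per-column bounds, as an earlier transform() call might have stored them.
col_min = pd.Series({"vibration": 2.0, "temp": 40.0})
col_max = pd.Series({"vibration": 6.0, "temp": 60.0})

new_sample = pd.Series({"vibration": 5.0, "temp": 45.0}, name="t0")

# Option 1: scale the Series directly; arithmetic aligns on the index labels.
scaled_sample = (new_sample - col_min) / (col_max - col_min)

# Option 2: when a 2D input is needed, build a one-row frame and use pd.concat
# where older code would have called DataFrame.append().
row_frame = new_sample.to_frame().T  # one row, columns "vibration" and "temp"
history = pd.DataFrame({"vibration": [2.0, 6.0], "temp": [40.0, 60.0]})
history = pd.concat([history, row_frame], ignore_index=True)

print(scaled_sample.to_dict())  # {'vibration': 0.75, 'temp': 0.25}
print(len(history))             # 3
```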
