pdmlabs.utils.dataset


Dataset preparation and management for predictive maintenance tasks.

This module provides the Dataset class for handling time-series data preparation, episode management, train/validation/test splitting, and generation of labeled datasets for various learning paradigms (supervised, unsupervised, semi-supervised).

Key Features:

Automatic episode extraction from time-series data
Intelligent train/val/test splitting with failure-aware strategy
Support for multiple dataset formats (RUL, survival analysis, classification)
Event data integration and wildcard-based event preference handling
Configurable predictive horizon and sliding window parameters

Example

>>> import pandas as pd
>>> from pdmlabs.utils.dataset import Dataset
>>> data = pd.read_csv('sensor_data.csv')
>>> dataset = Dataset(
...     data=data,
...     datetime_column='timestamp',
...     failure_column='is_failure',
...     source_column='equipment_id'
... )
>>> train_data, test_data = dataset.get_rul_dataset()

Functions

`data_split_by_event`(df_source, event_source, ...)
`episodes_formulation`(data, datetime_column)

Classes

Dataset(data, datetime_column[, ...])

A class to handle dataset preparation and processing for predictive maintenance tasks.

class pdmlabs.utils.dataset.Dataset(data, datetime_column, event_indicator=None, maintenance_column=None, failure_column=None, event_df=None, source_column='source', beta=1, slide=None, lead='0 seconds', predictive_horizon=None, train_sources=0.6, val_sources=0.2, test_sources=0.2, max_wait_time=None, in_source_split=False, DIVIDER=3600)

Bases: object

A class to handle dataset preparation and processing for predictive maintenance tasks. This includes splitting data into episodes, calculating sliding windows, and preparing training, validation, and testing datasets.

Parameters:

data (pd.DataFrame) – The input data containing time-series information. If event_df is not provided but maintenance_column and failure_column are, the class expects both columns to be present in data and uses them to derive the episodes. If none of maintenance_column, failure_column, and event_df is provided, every source is treated as a single run-to-failure episode.
datetime_column (str) – The name of the column representing datetime values.
event_indicator (str, default=None) – The name of the column indicating event occurrence (binary). If provided, it is used to derive episode ending (0: maintenance/reset or 1: failure).
maintenance_column (str, default=None) – The name of the column indicating maintenance events.
failure_column (str, default=None) – The name of the column indicating failure events.
event_df (pd.DataFrame, optional) – A DataFrame containing event data. If provided, data, datetime_column, and failure_column must also be given. The event_df must contain the columns datetime_column, source_column, maintenance_column, and failure_column. Events are generated as follows: a failure event where failure_column=1, and a maintenance (resetting) event where maintenance_column=1.
source_column (str, default='source') – The name of the column representing the source of the data.
beta (int, default=1) – A parameter used for objective calculations.
slide (int, optional) – The sliding window size. If None, it is calculated automatically.
lead (str, default="0 seconds") – The lead time for predictions.
predictive_horizon (str, optional) – The predictive horizon for the dataset. If None, it is calculated automatically.
train_sources (float or list, default=0.6) – The ratio (float) or list of source names used for training. If a float, it represents the proportion of sources used for training.
val_sources (float or list, default=0.2) – The ratio (float) or list of source names used for validation. If a float, it represents the proportion of sources used for validation.
test_sources (float or list, default=0.2) – The ratio (float) or list of source names used for testing. If a float, it represents the proportion of sources used for testing.
max_wait_time (int, optional) – Controls the maximum length of the profile parameter in the OnlineFlavor and Sliding Window flavors (i.e. the maximum length of the data used to fit anomaly detectors). This is the time that the user is willing to wait before detectors produce alarms. If None, it is set to 2/3 of the minimum episode length.
in_source_split (bool, default=False) – Whether to select train/val/test sources from within each source (True) or from the overall sources (False).
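
As an illustration of the event_df layout described above, the following sketch builds a small events table with pandas. The column names ('timestamp', 'equipment_id', 'is_maintenance', 'is_failure') and the values are hypothetical; only the requirement that event_df carries the datetime, source, maintenance, and failure columns comes from the parameter description.

```python
import pandas as pd

# Hypothetical column names; they would have to match the names
# passed to Dataset via datetime_column, source_column, etc.
event_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-05 12:00", "2024-01-20 08:30", "2024-01-11 16:45"]
        ),
        "equipment_id": ["pump_a", "pump_a", "pump_b"],
        "is_maintenance": [1, 0, 0],
        "is_failure": [0, 1, 1],
    }
)

# Per the parameter description: failure_column=1 marks a failure event,
# maintenance_column=1 marks a maintenance (resetting) event.
failures = event_df[event_df["is_failure"] == 1]
maintenances = event_df[event_df["is_maintenance"] == 1]
```

A frame like this would presumably be passed as event_df=event_df, together with maintenance_column='is_maintenance', failure_column='is_failure', and source_column='equipment_id'.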

df_to_x_y_surv(df, indicator=None)

Convert dataframe to survival analysis format (time, event) tuples.

Parameters:

df (pd.DataFrame) – Dataframe containing RUL and event columns.
indicator (int, optional) – Event indicator value (0 or 1). If None, uses ‘event’ column from df.

Returns:

List of (rul, event) tuples for survival analysis models.

Return type:

list[tuple]
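
The conversion described above can be sketched roughly as follows. This is not the library's implementation: the helper name to_surv_tuples and the column name 'RUL' are assumptions made for illustration; only the 'event' column name and the (rul, event) tuple shape come from the description.

```python
import pandas as pd

def to_surv_tuples(df, indicator=None):
    """Return a list of (rul, event) tuples.

    Assumes (hypothetically) that the RUL values live in a column named 'RUL'.
    """
    if indicator is None:
        # No fixed indicator given: take the per-row 'event' column.
        events = df["event"].tolist()
    else:
        # Fixed indicator: every row gets the same event value.
        events = [indicator] * len(df)
    return list(zip(df["RUL"].tolist(), events))

df = pd.DataFrame({"RUL": [3.0, 2.0, 1.0], "event": [1, 1, 1]})
pairs = to_surv_tuples(df)                  # uses the 'event' column
censored = to_surv_tuples(df, indicator=0)  # forces event=0 for every row
```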

generate_binary_labels(sources, list_dfs)

Generate binary anomaly labels based on predictive horizon and lead time.

Creates binary labels (0=normal, 1=anomaly) by identifying samples within the predictive horizon before failure events and considering lead time.

Parameters:

sources (list[str]) – Source identifiers corresponding to dataframes.
list_dfs (list[pd.DataFrame]) – List of episode dataframes to label.

Returns:

(final_ranges, leadranges) - Lists of binary label arrays and lead time flags.

Return type:

tuple[list, list]

Notes

Uses predictive_horizon and lead time from Dataset initialization
Any sample within lead range (before failure) is marked as 1
Helper uses _data_formulation and extract_anomaly_ranges from evaluation module
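
To make the labeling rule concrete, here is a minimal sketch for a single episode that ends in a failure. It does not call _data_formulation or extract_anomaly_ranges; the helper label_episode and the exact boundary conventions (inclusive horizon, lead window ending at the failure) are assumptions, not the library's behavior.

```python
import pandas as pd

def label_episode(timestamps, failure_time, predictive_horizon, lead):
    """Return (binary labels, lead flags) for one failure episode."""
    ph = pd.Timedelta(predictive_horizon)
    ld = pd.Timedelta(lead)
    ts = pd.Series(pd.to_datetime(timestamps))
    # 1 for samples inside the predictive horizon before the failure.
    in_horizon = (ts >= failure_time - ph) & (ts <= failure_time)
    # Flag samples that fall inside the lead window just before the failure.
    in_lead = (ts > failure_time - ld) & (ts <= failure_time)
    return in_horizon.astype(int).tolist(), in_lead.tolist()

# Six hourly samples; the last one coincides with the failure.
timestamps = pd.Timestamp("2024-01-01") + pd.to_timedelta(list(range(6)), unit="h")
failure_time = timestamps[-1]
labels, lead_flags = label_episode(timestamps, failure_time, "2 hours", "1 hour")
```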

get_Classification_dataset(keep_sources=None)

From train episodes without failures, the last predictive_horizon period is ignored to ensure healthy operation, based on the objective the user wants to optimize. Binary labels are then generated for all training data: every record is labeled 0, except those that lie within the predictive horizon before a failure event, which are labeled 1.

get_SA_dataset(keep_sources=None)

Generate Survival Analysis dataset with reliability labels.

Creates datasets for survival regression tasks where the goal is to predict survival probabilities or remaining time until events. Combines all training episodes and marks event indicators (failure/maintenance).

Parameters:

keep_sources (str, optional) – If provided, preserves this column for source tracking.

Returns:

(dataset, test_dataset) - Dictionaries containing:
‘target_labels’: Tuples of (RUL, event_indicator) for each sample
‘anomaly_labels’: Tuples of (RUL, event_flag) for training
Other fields same as get_rul_dataset()

Return type:

tuple[dict, dict]

Notes

Survival analysis labels are tuples (time, event) used by survival methods
Event indicator: 1 for failure, 0 for maintenance/reset
Combines event information from the rtf_dict (run-to-failure mapping)

get_events_from_df(df_list)

get_rul_dataset(keep_sources=None)

Generate RUL (Remaining Useful Life) prediction dataset.

Creates training, validation, and testing datasets optimized for RUL regression tasks. Uses only run-to-failure episodes for training and generates RUL labels indicating time remaining until failure.

Parameters:

keep_sources (str, optional) – If provided, preserves this column (e.g., ‘source’) in the dataset for source tracking. Otherwise, removes source and RUL columns.

Returns:

(dataset, test_dataset) - Two dictionaries containing:
‘match_sources’: Source mapping for transfer learning
‘target_sources’: Sources used for validation/testing
‘target_data’: Feature data for val/test
‘target_labels’: RUL values (time to failure) for val/test
‘is_failure’: Whether each source had failures
‘historic_data’: Training data (run-to-failure episodes only)
‘historic_sources’: Source names for training data
‘anomaly_labels’: RUL labels for training data
‘predictive_horizon’: Time window before failure
‘slide’: Sliding window step size
‘lead’: Lead time for predictions
‘beta’: Objective weighting parameter

Return type:

tuple[dict, dict]

Examples

>>> dataset_obj = Dataset(data, 'timestamp', failure_column='is_failure')
>>> train_set, test_set = dataset_obj.get_rul_dataset()
>>> # Access training RUL data
>>> rul_labels = train_set['anomaly_labels'][0]

get_semi_dataset()

From the train episodes, only those without failures are kept, and the last predictive_horizon period of each is ignored to ensure healthy operation, based on the objective the user wants to optimize. These episodes are used as unlabeled historical data to train a semi-supervised anomaly detector.

get_unsupervised_dataset()

From the train episodes, only those without failures are kept, and the last predictive_horizon period of each is ignored to ensure healthy operation, based on the objective the user wants to optimize. These episodes are used as unlabeled historical data to train a semi-supervised anomaly detector.
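
The trimming step shared by the two methods above might look like the following sketch. The helper trim_healthy_tail, the column names, and the strict cutoff comparison are assumptions for illustration; the library may handle the boundary differently.

```python
import pandas as pd

def trim_healthy_tail(episode, datetime_column, predictive_horizon):
    """Drop the final predictive_horizon period of a failure-free episode."""
    cutoff = episode[datetime_column].max() - pd.Timedelta(predictive_horizon)
    # Keep only samples strictly before the cutoff (assumed convention).
    return episode[episode[datetime_column] < cutoff]

# Ten hourly samples of a hypothetical failure-free episode.
episode = pd.DataFrame(
    {
        "timestamp": pd.Timestamp("2024-01-01")
        + pd.to_timedelta(list(range(10)), unit="h"),
        "vibration": [0.1 * i for i in range(10)],
    }
)
healthy = trim_healthy_tail(episode, "timestamp", "3 hours")
```

With a 3-hour horizon, the samples from 06:00 onward are discarded and the remaining rows serve as unlabeled training data.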

safe_splitting(source_episodes, at_least_one_failure_in_train=False)

slide_calculation(episodes, run_to_failure)

Calculate optimal sliding window step size for dataset generation.

Ensures that slide + predictive_horizon equals approximately 1/3 of the smallest failure episode. This balances training data size with prediction lead time.

Parameters:

episodes (list[pd.DataFrame]) – List of episode dataframes, each representing one episode (either a run-to-failure sequence or a healthy run).
run_to_failure (list[int]) – List indicating which episodes contain failures (1) or are healthy runs (0).

Returns:

Optimal sliding window step size. Minimum value is 1.

Return type:

int

Notes

The sliding window step determines how many samples between consecutive windows. Larger steps = fewer training samples but faster processing. Smaller steps = more training samples but more computation.

Formula: slide = (episode_length / 3) - predictive_horizon_length, where episode_length is the length of the smallest failure episode.
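
A worked version of the formula, under the assumption that both lengths are expressed in number of samples and that the division is rounded down. The function name slide_from_formula is hypothetical and the sketch does not handle the case where no failure episode exists.

```python
def slide_from_formula(failure_episode_lengths, predictive_horizon_length):
    """Apply slide = (episode_length / 3) - predictive_horizon_length.

    episode_length is the smallest failure episode; the result is
    clamped to the documented minimum of 1.
    """
    smallest = min(failure_episode_lengths)
    slide = smallest // 3 - predictive_horizon_length
    return max(1, slide)

# Smallest failure episode has 300 samples: 300 // 3 - 40 = 60.
slide_a = slide_from_formula([300, 450, 900], predictive_horizon_length=40)

# Short episode: 90 // 3 - 40 is negative, so the minimum of 1 applies.
slide_b = slide_from_formula([90], predictive_horizon_length=40)
```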

split_sources_to_train_test_val(episodes, ran_to_failure)

Splits the sources into training, validation, and testing datasets.

Parameters:

episodes (list) – A list of dataframes, where each dataframe corresponds to an episode.
ran_to_failure (list) – A list of integers indicating whether each episode is a run-to-failure (1) or not (0).

Returns:

The method updates the following attributes of the class:
self.train_dfs: Dataframes for training.
self.val_dfs: Dataframes for validation.
self.test_dfs: Dataframes for testing.
self.sources_for_train: Sources used for training.
self.sources_for_val: Sources used for validation.
self.sources_for_test: Sources used for testing.
self.matches: A dictionary mapping training sources to validation and testing sources.

Return type:

None
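
For intuition about ratio-based source splitting (the float form of train_sources, val_sources, and test_sources), here is a deliberately simplified sketch. It does not reproduce the library's failure-aware strategy or the self.matches mapping; the function split_sources, its seed argument, and the rounding rule are all assumptions.

```python
import random

def split_sources(sources, train=0.6, val=0.2, seed=0):
    """Split unique source names into train/val/test lists by ratio.

    Whatever remains after train and val goes to test.
    """
    shuffled = sorted(set(sources))
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n_train = int(round(train * len(shuffled)))
    n_val = int(round(val * len(shuffled)))
    train_s = shuffled[:n_train]
    val_s = shuffled[n_train:n_train + n_val]
    test_s = shuffled[n_train + n_val:]
    return train_s, val_s, test_s

sources = [f"machine_{i}" for i in range(10)]
train_s, val_s, test_s = split_sources(sources)
```

Splitting by source rather than by row keeps all episodes of one machine in the same partition, which avoids leakage between training and evaluation data.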

pdmlabs.utils.dataset.data_split_by_event(df_source, event_source, datetime_column, failure_column, maintenance_column, source_column='source', DIVIDER=3600)

pdmlabs.utils.dataset.episodes_formulation(data, datetime_column, event_indicator=None, maintenance_list=None, failure_list=None, event_df=None, source_column='source', DIVIDER=3600)
