Skip to content

Dataset Organization and Requirements

To ensure compatibility with DatasetSequence, datasets should be preprocessed and saved in a structured format before being used for training or inference. Below are the requirements and best practices for organizing datasets.


1. General Requirements

  • Preprocessing: Data should be saved in a preprocessed state, meaning all necessary filtering or transformations (except min-max normalization, which can optionally be performed when passing the signals to the model) should be done before loading into the dataset.
  • File Format: Data should be stored as .npy files for efficient loading.
  • Consistency: Each input sample (X) must have a corresponding output (Y) with the same filename to ensure proper mapping (read next section).

2. Directory Structure and File Naming Correspondence

Datasets should follow a structured directory format with separate folders for training, validation, and testing data.

│── x/                   # Input signals ("path_x")
│   ├── train/           # Training set
│   │   ├── sample_001.npy
│   │   ├── sample_002.npy
│   │   ├── ...
│   ├── val/             # Validation set
│   │   ├── sample_101.npy
│   │   ├── sample_102.npy
│   ├── test/            # Test set
│       ├── sample_201.npy
│       ├── sample_202.npy

│── y/                   # Output ("path_y")
│   ├── train/           # Training set
│   │   ├── sample_001.npy   # Must match the filenames in x/train/
│   │   ├── sample_002.npy
│   ├── val/             # Validation 
│   │   ├── sample_101.npy
│   │   ├── sample_102.npy
│   ├── test/            # Test 
│       ├── sample_201.npy   # Must match filenames in x/test/
│       ├── sample_202.npy
  • Each file in x/train/, x/val/, and x/test/ must have a corresponding file in y/train/, y/val/, and y/test/, respectively. For example:
x/train/sample_001.npy  ↔  y/train/sample_001.npy

3. Data Formatting Requirements

Input Data (X) Format

  • Each input sample (item_x) must have shape [seq_len, num_features].
  • The number of features must match n_features specified when initializing the model.
  • Automatic Formatting by DatasetSequence:
    • If input is a 1D signal ([seq_len]), it is reshaped to [seq_len, 1].
    • If input is incorrectly stored as [num_features, seq_len], it is transposed before returning.

Output Data (Y) Format

  • For sequence-to-one tasks (e.g., classification, regression):

    • Each item_y should be a single value or vector, with shape:
      • Multiclass classification (single-label): [1] (a single integer representing the class index).
      • Multilabel classification: [1, num_classes] (a binary vector where each position indicates whether a class is active).
      • Regression: [1, 1] (a single scalar value for continuous outputs).
  • For sequence-to-sequence tasks:

    • Regression: Each item_y must have shape [seq_len, num_features], matching item_x.
    • Multiclass classification (single-label): Each item_y should have shape [seq_len], where each timestep contains an integer representing the class.
    • Multilabel classification: Each item_y should have shape [seq_len, num_classes], where each timestep has a binary vector for active classes.
  • Automatic Formatting by DatasetSequence:

    • If output is a 1D signal, it is reshaped to [seq_len, 1] if seq2seq, or [1, num_classes] if seq2one for multilabel classification.
    • If output is incorrectly stored as [num_features, seq_len], it is transposed before returning (for seq2seq).
    • If output is a single value, it is reshaped to [1, 1] (for seq2one regression).

Sequence lengths

Sequence lengths can differ from signal to signal, as the signals are automatically padded if that is the case.

Pre-processing

It is assumed that the datasets have been pre-processed. However, it is possible to perform minmax normalization of the signals (each signal individually) if min_max_norm_sig is set to True.