Dataset Organization and Requirements
To ensure compatibility with DatasetSequence, datasets should be preprocessed and saved in a structured format before being used for training or inference. Below are the requirements and best practices for organizing datasets.
1. General Requirements
- Preprocessing: Data should be saved in a preprocessed state, meaning all necessary filtering or transformations (except min-max normalization, which can optionally be performed when passing the signals to the model) should be done before loading into the dataset.
- File Format: Data should be stored as
.npyfiles for efficient loading. - Consistency: Each input sample (
X) must have a corresponding output (Y) with the same filename to ensure proper mapping (read next section).
2. Directory Structure and File Naming Correspondence
Datasets should follow a structured directory format with separate folders for training, validation, and testing data.
│── x/ # Input signals ("path_x")
│ ├── train/ # Training set
│ │ ├── sample_001.npy
│ │ ├── sample_002.npy
│ │ ├── ...
│ ├── val/ # Validation set
│ │ ├── sample_101.npy
│ │ ├── sample_102.npy
│ ├── test/ # Test set
│ ├── sample_201.npy
│ ├── sample_202.npy
│── y/ # Output ("path_y")
│ ├── train/ # Training set
│ │ ├── sample_001.npy # Must match the filenames in x/train/
│ │ ├── sample_002.npy
│ ├── val/ # Validation
│ │ ├── sample_101.npy
│ │ ├── sample_102.npy
│ ├── test/ # Test
│ ├── sample_201.npy # Must match filenames in x/test/
│ ├── sample_202.npy
- Each file in
x/train/,x/val/, andx/test/must have a corresponding file iny/train/,y/val/, andy/test/, respectively. For example:
x/train/sample_001.npy ↔ y/train/sample_001.npy
3. Data Formatting Requirements
Input Data (X) Format
- Each input sample (
item_x) must have shape[seq_len, num_features]. - The number of features must match
n_featuresspecified when initializing the model. - Automatic Formatting by
DatasetSequence:- If input is a 1D signal (
[seq_len]), it is reshaped to[seq_len, 1]. - If input is incorrectly stored as
[num_features, seq_len], it is transposed before returning.
- If input is a 1D signal (
Output Data (Y) Format
-
For sequence-to-one tasks (e.g., classification, regression):
- Each
item_yshould be a single value or vector, with shape:- Multiclass classification (single-label):
[1](a single integer representing the class index). - Multilabel classification:
[1, num_classes](a binary vector where each position indicates whether a class is active). - Regression:
[1, 1](a single scalar value for continuous outputs).
- Multiclass classification (single-label):
- Each
-
For sequence-to-sequence tasks:
- Regression: Each
item_ymust have shape[seq_len, num_features], matchingitem_x. - Multiclass classification (single-label): Each
item_yshould have shape[seq_len], where each timestep contains an integer representing the class. - Multilabel classification: Each
item_yshould have shape[seq_len, num_classes], where each timestep has a binary vector for active classes.
- Regression: Each
-
Automatic Formatting by
DatasetSequence:- If output is a 1D signal, it is reshaped to
[seq_len, 1]ifseq2seq, or[1, num_classes]ifseq2onefor multilabel classification. - If output is incorrectly stored as
[num_features, seq_len], it is transposed before returning (forseq2seq). - If output is a single value, it is reshaped to
[1, 1](forseq2oneregression).
- If output is a 1D signal, it is reshaped to
Sequence lengths
Sequence lengths can differ from signal to signal, as the signals are automatically padded if that is the case.
Pre-processing
It is assumed that the datasets have been pre-processed. However, it is possible to perform minmax normalization of the signals (each signal individually) if min_max_norm_sig is set to True.