
Cross validation makes duplicate samples #790

@ov3rfit

Description


Problem Description

My dataset has 59 samples (confirmed in data_info.json with "rows": 59), but I've observed the following issues:

  1. fold_i_validation_indices.npy files in the folds directory contain indices 59-63, which exceed the dataset size (valid indices run 0-58) — see the check below
  2. predictions_out_of_folds.csv files for all models contain 64 true labels and predictions (5 extra samples)
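
This is roughly how I confirmed both points. The file names match what I see on disk; the results directory name and the model folder name (`AutoML_1`, `1_Baseline`) are placeholders for my actual paths:

```python
import glob
import json
import os

import numpy as np
import pandas as pd

RESULTS_DIR = "AutoML_1"  # placeholder: substitute your AutoML results directory

# data_info.json reports "rows": 59 for my dataset
with open(os.path.join(RESULTS_DIR, "data_info.json")) as f:
    n_rows = json.load(f)["rows"]

# Every validation index should lie in [0, n_rows - 1]
total = 0
for path in sorted(glob.glob(os.path.join(RESULTS_DIR, "folds", "fold_*_validation_indices.npy"))):
    idx = np.load(path)
    total += idx.size
    bad = idx[idx >= n_rows]
    if bad.size:
        print(f"{os.path.basename(path)}: out-of-range indices {bad}")

print("total validation samples:", total)  # 64 in my run, not 59

# Row count of one model's out-of-fold predictions (folder name is a guess)
oof = pd.read_csv(os.path.join(RESULTS_DIR, "1_Baseline", "predictions_out_of_folds.csv"))
print("oof rows:", len(oof))  # also 64
```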

Environment

Version: 1.1.15

Models in my run:

  • Baseline
  • Neural Network
  • Random Forest
  • XGBoost
  • Decision Tree
  • Logistic Regression

Investigation

  • Tested with sklearn's StratifiedKFold, which works correctly on the same dataset (see the sketch after this list)
  • The total number of validation samples is 64 (2^6), which might suggest oversampling or padding for computational efficiency
  • Spent over 2 hours investigating the mljar-supervised source code, but was unable to identify where or why the additional indices are generated
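
For comparison, this is roughly the sklearn check referred to above. `X` and `y` are random placeholders with the same shape as my dataset, and the fold count is a guess; the point is that the concatenated validation indices never exceed n_rows - 1 and contain no duplicates:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: 59 rows, binary target, like my dataset
rng = np.random.RandomState(0)
X = rng.rand(59, 5)
y = rng.randint(0, 2, size=59)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # n_splits is a guess
all_val = np.concatenate([val for _, val in skf.split(X, y)])

print(all_val.max())            # 58 -- never exceeds n_rows - 1
print(len(all_val))             # 59 -- every sample appears exactly once
print(len(np.unique(all_val)))  # 59 -- no duplicates
```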

Impact

  1. Cannot properly evaluate individual fold models due to unknown origin of extra indices
  2. Validation set effectiveness is compromised due to duplicate samples

Questions

  1. Is this intended behavior for specific models?
  2. If oversampling/padding is required (e.g., for neural network batch size), how can we identify and remove those extra samples?
  3. How can we obtain the correct mapping between predictions and original data indices?
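
For question 3, this is the mapping I would expect to work if the out-of-fold predictions are simply the fold validation sets concatenated in fold order. That ordering is an assumption I have not been able to confirm from the source, and the extra indices 59-63 still have to be dropped afterwards:

```python
import glob
import os

import numpy as np
import pandas as pd

RESULTS_DIR = "AutoML_1"  # placeholder path
N_ROWS = 59               # dataset size from data_info.json

# Assumption: predictions_out_of_folds.csv rows follow the concatenated
# fold validation indices in fold order. Unverified.
fold_files = sorted(
    glob.glob(os.path.join(RESULTS_DIR, "folds", "fold_*_validation_indices.npy")),
    key=lambda p: int(os.path.basename(p).split("_")[1]),  # numeric fold order
)
original_index = np.concatenate([np.load(p) for p in fold_files])

oof = pd.read_csv(os.path.join(RESULTS_DIR, "1_Baseline", "predictions_out_of_folds.csv"))
oof["original_index"] = original_index  # only valid if the ordering assumption holds

# Drop rows whose index falls outside the real dataset (the 5 extras in my case)
oof_clean = oof[oof["original_index"] < N_ROWS]
```

Is this the intended mapping, or is there an official way to recover it?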
