Problem Description
My dataset has 59 samples (confirmed in `data_info.json` with `"rows": 59`), but I've observed the following issues:
- `fold_i_validation_indices.npy` files in the `folds` directory contain indices 59-63, which exceed the dataset size
- `predictions_out_of_folds.csv` files for all models contain 64 true labels and predictions (5 extra samples)
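A minimal sketch of the check that surfaced this, assuming the default results layout; `AutoML_1` is a placeholder for my results directory:

```python
import glob
import json

import numpy as np

# "AutoML_1" is a placeholder for the results directory; the file layout
# matches what is described above.
with open("AutoML_1/data_info.json") as f:
    n_rows = json.load(f)["rows"]  # 59 in my case

for path in sorted(glob.glob("AutoML_1/folds/fold_*_validation_indices.npy")):
    indices = np.load(path)
    out_of_range = indices[indices >= n_rows]
    print(path, "max index:", indices.max(), "out-of-range:", out_of_range)
```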
Environment
Version: 1.1.15
Models in my run:
- Baseline
- Neural Network
- Random Forest
- XGBoost
- Decision Tree
- Logistic Regression
Investigation
- Tested sklearn's StratifiedKFold, which works correctly on the same dataset (see the sketch after this list)
- The total number of validation samples is 64 (2^6), which might suggest oversampling or padding for computational efficiency
- Spent over 2 hours investigating the `mljar-supervised` source code but was unable to identify where or why the additional indices are generated
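For reference, this is roughly how I verified StratifiedKFold; the random `X` and `y` here stand in for my 59-sample dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-ins for my 59-sample binary classification dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(59, 4))
y = rng.integers(0, 2, size=59)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
all_val = np.concatenate([val for _, val in skf.split(X, y)])

assert all_val.max() < len(X)             # no out-of-range indices
assert len(all_val) == len(X)             # exactly one validation slot per sample
assert len(np.unique(all_val)) == len(X)  # no duplicates
print("StratifiedKFold: 59 unique, in-range validation indices")
```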
Impact
- Cannot properly evaluate individual fold models due to unknown origin of extra indices
- Validation set effectiveness is compromised due to duplicate samples
Questions
- Is this intended behavior for specific models?
- If oversampling/padding is required (e.g., for the neural network batch size), how can we identify and remove those extra samples?
- How can we obtain the correct mapping between predictions and original data indices? (The sketch below shows the alignment I expected to hold.)
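For concreteness, here is the mapping I expected, written as a sketch rather than a confirmed recipe: the `Random_Forest` subdirectory name and the assumption that out-of-fold rows are stored fold by fold, in the order of the concatenated validation indices, are my guesses, not documented behavior.

```python
import glob

import numpy as np
import pandas as pd

# Assumed layout: per-fold index files plus per-model out-of-fold predictions.
indices = np.concatenate([
    np.load(p)
    for p in sorted(glob.glob("AutoML_1/folds/fold_*_validation_indices.npy"))
])
oof = pd.read_csv("AutoML_1/Random_Forest/predictions_out_of_folds.csv")
print(len(indices), len(oof))  # both 64 here, although the dataset has 59 rows

# Current guess at a workaround, assuming the 5 extra rows are padding
# rather than reordered real samples:
mask = indices < 59
oof_clean = oof[mask].assign(original_index=indices[mask])
```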