
Cross validation makes duplicate samples #790

@ov3rfit

Description


Problem Description

My dataset has 59 samples (confirmed in data_info.json with "rows": 59), but I've observed the following issues:

  1. fold_i_validation_indices.npy files in the folds directory contain indices 59-63, which exceed the dataset size (valid indices run 0-58) — see the check below
  2. predictions_out_of_folds.csv files for all models contain 64 true labels and predictions (5 extra samples)
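
This is roughly how I confirmed both points. The file names match what I see on disk; the results directory name and the model folder name (`AutoML_1`, `1_Baseline`) are placeholders for my actual paths:

```python
import glob
import json
import os

import numpy as np
import pandas as pd

RESULTS_DIR = "AutoML_1"  # placeholder: substitute your AutoML results directory

# data_info.json reports "rows": 59 for my dataset
with open(os.path.join(RESULTS_DIR, "data_info.json")) as f:
    n_rows = json.load(f)["rows"]

# Every validation index should lie in [0, n_rows - 1]
total = 0
for path in sorted(glob.glob(os.path.join(RESULTS_DIR, "folds", "fold_*_validation_indices.npy"))):
    idx = np.load(path)
    total += idx.size
    bad = idx[idx >= n_rows]
    if bad.size:
        print(f"{os.path.basename(path)}: out-of-range indices {bad}")

print("total validation samples:", total)  # 64 in my run, not 59

# Row count of one model's out-of-fold predictions (folder name is a guess)
oof = pd.read_csv(os.path.join(RESULTS_DIR, "1_Baseline", "predictions_out_of_folds.csv"))
print("oof rows:", len(oof))  # also 64
```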

Environment

Version: 1.1.15

Models in my run:

  • Baseline
  • Neural Network
  • Random Forest
  • XGBoost
  • Decision Tree
  • Logistic Regression

Investigation

  • Tested with sklearn's StratifiedKFold, which works correctly on the same dataset (see the sketch after this list)
  • The total number of validation samples is 64 (2^6), which might suggest oversampling or padding for computational efficiency
  • Spent over 2 hours investigating the mljar-supervised source code, but was unable to identify where or why the additional indices are generated
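
For comparison, this is roughly the sklearn check referred to above. `X` and `y` are random placeholders with the same shape as my dataset, and the fold count is a guess; the point is that the concatenated validation indices never exceed n_rows - 1 and contain no duplicates:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: 59 rows, binary target, like my dataset
rng = np.random.RandomState(0)
X = rng.rand(59, 5)
y = rng.randint(0, 2, size=59)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # n_splits is a guess
all_val = np.concatenate([val for _, val in skf.split(X, y)])

print(all_val.max())            # 58 -- never exceeds n_rows - 1
print(len(all_val))             # 59 -- every sample appears exactly once
print(len(np.unique(all_val)))  # 59 -- no duplicates
```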

Impact

  1. Cannot properly evaluate individual fold models due to unknown origin of extra indices
  2. Validation set effectiveness is compromised due to duplicate samples

Questions

  1. Is this intended behavior for specific models?
  2. If oversampling/padding is required (e.g., for neural network batch size), how can we identify and remove those extra samples?
  3. How can we obtain the correct mapping between predictions and original data indices?
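
For question 3, this is the mapping I would expect to work if the out-of-fold predictions are simply the fold validation sets concatenated in fold order. That ordering is an assumption I have not been able to confirm from the source, and the extra indices 59-63 still have to be dropped afterwards:

```python
import glob
import os

import numpy as np
import pandas as pd

RESULTS_DIR = "AutoML_1"  # placeholder path
N_ROWS = 59               # dataset size from data_info.json

# Assumption: predictions_out_of_folds.csv rows follow the concatenated
# fold validation indices in fold order. Unverified.
fold_files = sorted(
    glob.glob(os.path.join(RESULTS_DIR, "folds", "fold_*_validation_indices.npy")),
    key=lambda p: int(os.path.basename(p).split("_")[1]),  # numeric fold order
)
original_index = np.concatenate([np.load(p) for p in fold_files])

oof = pd.read_csv(os.path.join(RESULTS_DIR, "1_Baseline", "predictions_out_of_folds.csv"))
oof["original_index"] = original_index  # only valid if the ordering assumption holds

# Drop rows whose index falls outside the real dataset (the 5 extras in my case)
oof_clean = oof[oof["original_index"] < N_ROWS]
```

Is this the intended mapping, or is there an official way to recover it?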
