
PermutationImportance fails when data has too few rows #324

@DWgit

Description


PermutationImportance was enhanced in #208 to limit excessive computation when the number of columns is large:

    rows, cols = X_validation.shape
    if cols > 5000:
        X_vald, _, y_vald, _ = subsample(
            X_validation, y_validation, train_size=100, ml_task=ml_task
        )
    elif cols > 50 and rows * cols > 200000:
        X_vald, _, y_vald, _ = subsample(
            X_validation, y_validation, train_size=1000, ml_task=ml_task
        )
    else:
        X_vald = X_validation
        y_vald = y_validation

Originally posted by @pplonski in #208 (comment)

If a dataset has fewer rows than these hard-coded train_size values, subsample raises an exception and PermutationImportance fails.

An obvious fix is to replace these with train_size=min(nRows, constant).
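A minimal sketch of that guard, with the clamped train_size factored out; the `subsample` stand-in below is hypothetical (it only draws rows without replacement, whereas the real mljar helper also handles stratification via `ml_task`):

```python
import numpy as np


def subsample(X, y, train_size, ml_task=None):
    # Hypothetical stand-in for mljar's internal subsample helper:
    # draws train_size rows without replacement and returns the rest.
    idx = np.random.default_rng(0).permutation(X.shape[0])
    take, rest = idx[:train_size], idx[train_size:]
    return X[take], X[rest], y[take], y[rest]


def maybe_subsample(X_validation, y_validation, ml_task=None):
    rows, cols = X_validation.shape
    if cols > 5000:
        # Clamp so that short, wide datasets no longer raise.
        train_size = min(rows, 100)
    elif cols > 50 and rows * cols > 200000:
        train_size = min(rows, 1000)
    else:
        return X_validation, y_validation
    X_vald, _, y_vald, _ = subsample(
        X_validation, y_validation, train_size=train_size, ml_task=ml_task
    )
    return X_vald, y_vald
```

With the clamp in place, a 20-row, 6000-column dataset simply keeps all 20 rows instead of failing.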

Wide and short datasets are quite common in biological applications, and feature importance is one of the most valuable outcomes of an analysis.

Thanks very much!

Labels: bug (Something isn't working)
