
PermutationImportance fails when data has too few rows #324

@DWgit

Description


PermutationImportance was enhanced in #208 to limit excessive computation when the number of columns is large:

    rows, cols = X_validation.shape
    if cols > 5000:
        X_vald, _, y_vald, _ = subsample(
            X_validation, y_validation, train_size=100, ml_task=ml_task
        )
    elif cols > 50 and rows * cols > 200000:
        X_vald, _, y_vald, _ = subsample(
            X_validation, y_validation, train_size=1000, ml_task=ml_task
        )
    else:
        X_vald = X_validation
        y_vald = y_validation

Originally posted by @pplonski in #208 (comment)

If a dataset has fewer rows than these hard-coded train_size values, subsample raises an exception and PermutationImportance fails.

An obvious fix is to replace these with train_size=min(nRows, constant).
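A minimal sketch of that guard, with the clamped train_size factored out; the `subsample` stand-in below is hypothetical (it only draws rows without replacement, whereas the real mljar helper also handles stratification via `ml_task`):

```python
import numpy as np


def subsample(X, y, train_size, ml_task=None):
    # Hypothetical stand-in for mljar's internal subsample helper:
    # draws train_size rows without replacement and returns the rest.
    idx = np.random.default_rng(0).permutation(X.shape[0])
    take, rest = idx[:train_size], idx[train_size:]
    return X[take], X[rest], y[take], y[rest]


def maybe_subsample(X_validation, y_validation, ml_task=None):
    rows, cols = X_validation.shape
    if cols > 5000:
        # Clamp so that short, wide datasets no longer raise.
        train_size = min(rows, 100)
    elif cols > 50 and rows * cols > 200000:
        train_size = min(rows, 1000)
    else:
        return X_validation, y_validation
    X_vald, _, y_vald, _ = subsample(
        X_validation, y_validation, train_size=train_size, ml_task=ml_task
    )
    return X_vald, y_vald
```

With the clamp in place, a 20-row, 6000-column dataset simply keeps all 20 rows instead of failing.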

Wide and short datasets are quite common in biological applications, and feature importance is one of the most valuable outcomes of an analysis.

Thanks very much!

Labels: bug (Something isn't working)
