Skip to content

Allow pickle when loading numpy array file #836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 4, 2023
Merged

Conversation

tomglk
Copy link
Contributor

@tomglk tomglk commented Jul 12, 2023

Hi, we had had problems loading a detector which we stored using save_detector.

When we want to load it using load_detector, we get the error value error: 'Object arrays cannot be loaded when allow_pickle=False'

Currently we apply a hack of overwriting np.load with the parameter allow_pickle=True in order to be able to use the loading mechanism. (https://stackoverflow.com/a/56243777)

However, we think that it makes sense to set this param in the loading function itself.

If you have any questions, feel free to ask.
Any tips on how to solve this issue with other ways are also more than welcome. :)

Thanks, Tobi & Tom

Co-authored-by: tokaessm

@jklaise jklaise requested a review from ascillitoe July 12, 2023 17:13
@codecov
Copy link

codecov bot commented Jul 12, 2023

Codecov Report

Merging #836 (ac43508) into master (4a1b4f7) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #836      +/-   ##
==========================================
- Coverage   81.98%   81.91%   -0.07%     
==========================================
  Files         159      159              
  Lines       10375    10375              
==========================================
- Hits         8506     8499       -7     
- Misses       1869     1876       +7     
Files Coverage Δ
alibi_detect/saving/loading.py 93.92% <100.00%> (ø)

... and 1 file with indirect coverage changes

@ascillitoe
Copy link
Contributor

Hi @tomglk, thank you for contributing this.

Are you able to share why you need to serialise/deserialise dtype='object' ndarray's here? We'd be interested to know since this isn't a use case we had envisioned (although that doesn't mean it isn't valid!), and want to be sure there is strong motivation for allow_pickle=True due to the associated CVE-2019-644.

@tomglk
Copy link
Contributor Author

tomglk commented Jul 21, 2023

Hi @ascillitoe ,
I know that allow_pickle = True has bad security implications. I would welcome any other solution.

We store strings, that seems to be the problem.

This is a minimal example that breaks:

data = pd.DataFrame(
    columns=["c1", "c2", "c3"],
    data=[
        ["42", 1, datetime.now().timestamp()],
        ["42", 2, datetime.now().timestamp()],
        ["42", 3, datetime.now().timestamp()],
    ]
)

detector = TabularDrift(
    x_ref=data.to_numpy(),
    p_val=0.05,
    x_ref_preprocessed=True
)

save_detector(
    detector,
    "detector"
)

loaded_detector = load_detector(
    "detector"
)

Changing "42" to 42 fixes it.

It also breaks when you remove the .timestamp(), but storing a timestamp instead of a datetime object is not a problem imo.

Edit:
Setting the c1 column as category-type & adding the param categories_per_feature={0: None} also does not change the behaviour.

@tomglk
Copy link
Contributor Author

tomglk commented Jul 22, 2023

New commit does not set the value to true all the time (because that is unnecessary, if you do not have the problem).
I see that the first commit was overly aggressive in that regard.

Instead, the value can now be controlled via parameter.

Intended usage:

loaded_detector = load_detector(
    "detector",
    enable_unsafe_loading=True
)

This way users who actually have the problem do not have to hack the np.load()-function themselves. But they are forced to make a conscious and informed decision (CVE you mentioned is linked in docstrings).

Tests:
I did not touch the tests, because I cannot get the project running locally.
I can try to make it work next week. Or look into test cases without running anything, but that would be mildly horrible.

tomglk added 2 commits September 8, 2023 20:51
Merge branch 'master' of github.com:SeldonIO/alibi-detect
@tomglk
Copy link
Contributor Author

tomglk commented