[Data] ArrowInvalid during ray.data.from_huggingface: Parquet magic bytes not found in footer #54101

@lk-chen

What happened + What you expected to happen

ray.data.from_huggingface fails to convert a Hugging Face dataset that was (seemingly properly) loaded with datasets.load_dataset.

Expected: from_huggingface should work with any datasets.dataset_dict.DatasetDict object

Exception stack
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/read_api.py", line 3231, in from_huggingface
    return read_parquet(
           ^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/read_api.py", line 946, in read_parquet
    datasource = ParquetDatasource(
                 ^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 241, in __init__
    pq_ds = get_parquet_dataset(paths, filesystem, dataset_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 629, in get_parquet_dataset
    dataset = pq.ParquetDataset(
              ^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1361, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/dataset.py", line 797, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3198, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'https://huggingface.co/api/datasets/abisee/cnn_dailymail/parquet/3.0.0/train/0.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'https://huggingface.co/api/datasets/abisee/cnn_dailymail/parquet/3.0.0/train/0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
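
The final error says the Parquet magic bytes were not found in the footer. As a minimal diagnostic sketch (not part of the original report): a plain Parquet file starts and ends with the 4-byte marker `PAR1`, so checking both ends of a downloaded copy shows whether the URL returned a real Parquet file or something else, such as an HTML or JSON error page. The helper names `has_parquet_magic` and `file_has_parquet_magic` are hypothetical, and the sketch assumes the file at the URL above has first been saved locally.

```python
import os

# 4-byte marker found at both the start and the end of a plain Parquet file.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(head: bytes, tail: bytes) -> bool:
    """Return True if the first and last 4 bytes are the Parquet marker."""
    return head[:4] == PARQUET_MAGIC and tail[-4:] == PARQUET_MAGIC


def file_has_parquet_magic(path: str) -> bool:
    """Check a local file without reading its whole contents."""
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(4)
    return has_parquet_magic(head, tail)
```

If this returns False for a downloaded copy of `0.parquet`, the server response is not a Parquet file, which would point at the download step rather than at pyarrow's reader.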

This happened while I was trying the Anyscale template "LLM offline batch inference with Ray Data LLM APIs".

Versions / Dependencies

datasets==3.6.0
ray==2.47.1

Reproduction script

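import ray
import datasets

# load_dataset returns a DatasetDict keyed by split name
df = datasets.load_dataset("cnn_dailymail", "3.0.0")
print(type(df))
# Raises pyarrow.lib.ArrowInvalid (see the exception stack above)
ds = ray.data.from_huggingface(df["train"])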
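
One possible workaround, sketched under assumptions and not confirmed as the fix for this issue: since the stack trace shows from_huggingface going through read_parquet on a hosted Parquet URL, the already-loaded split can instead be handed to Ray as an in-memory Arrow table. The helper name `hf_split_to_ray_via_arrow` is hypothetical. The sketch assumes the split fits in driver memory and relies on `Dataset.with_format("arrow")` and `ray.data.from_arrow`.

```python
def hf_split_to_ray_via_arrow(hf_split):
    """Convert one Hugging Face split to a Ray Dataset via an Arrow table.

    Assumption: the split is small enough to materialize in driver memory.
    """
    # Imported lazily so the helper can be defined without Ray installed.
    import ray.data

    # With the "arrow" format set, slicing [:] returns the rows as a pyarrow.Table.
    arrow_table = hf_split.with_format("arrow")[:]
    return ray.data.from_arrow(arrow_table)
```

With the reproduction script above, this would be called as `ds = hf_split_to_ray_via_arrow(df["train"])`. It skips the hosted Parquet files entirely, at the cost of loading the whole split into memory first.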

Issue Severity

High: It blocks me from completing my task. Specifically, I cannot finish the template notebook mentioned above.

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
