feat: Make preprocess rigorous with IOFactory and pydantic dataclasses #1398
Conversation
@lgray @nsmith- @ikrommyd This is ready for a fresh set of eyes, particularly for anything I may have missed/overlooked so far. Keep in mind that supporting `runner` and `preprocess_parquet` needs to come in follow-up PRs (the latter will need to be converted for pydantic; the former I have not considered yet - do we want to (semi-)unify the `runner` and `preprocess`+`apply_to_fileset` interfaces first?)
Pull Request Overview
This PR introduces Pydantic-based dataclasses to replace dictionary-based dataset specifications, providing type safety, validation, and serialization capabilities for Coffea's dataset processing workflow. This is a foundational change to support upcoming parquet reading and column-joining features.
Key changes:
- Introduces comprehensive Pydantic models for file specifications (`UprootFileSpec`, `ParquetFileSpec`, and their Coffea variants)
- Adds `DatasetSpec` and `FilesetSpec` classes with validation and serialization
- Updates preprocessing and manipulation functions to support both legacy dictionaries and new Pydantic models
- Provides dual-path compatibility for gradual migration
Reviewed Changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
| --- | --- |
| `src/coffea/dataset_tools/filespec.py` | Core Pydantic models defining file and dataset specifications with validation |
| `tests/test_dataset_tools_filespec.py` | Comprehensive test suite for all new Pydantic models and validation logic |
| `src/coffea/dataset_tools/preprocess.py` | Updated preprocessing to handle both dict and Pydantic model inputs |
| `src/coffea/dataset_tools/manipulations.py` | Modified manipulation functions for dual-path compatibility |
| `tests/test_dataset_tools.py` | Extended existing tests to validate both dict and Pydantic model workflows |
| `src/coffea/dataset_tools/apply_processor.py` | Updated processor application to handle new model types |
| `src/coffea/dataset_tools/__init__.py` | Added exports for new Pydantic classes |
| `pyproject.toml` | Added Pydantic dependency |
| `docs/source/examples.rst` | Added reference to new filespec notebook |
Ping @lgray
@NJManganelli looking at the volume of code that's been produced, I assume this is AI-generated for the most part? Not that it matters, just understanding how to approach it.
The core code (non-…). For the tests, it was a mix; let's call it 2/3rds Copilot-initiated and 1/3rd by-hand (edits + consolidating via parametrization for the Copilot tests, and updating older tests to explicitly parametrize the Pydantic variation). The notebook is almost entirely Copilot, with some edits for style, presentation, and corrections. Hopefully that helps with what/how to review them respectively.
An initial comment looking through the user-facing parts
src/coffea/dataset_tools/filespec.py (outdated)

```python
def identify_file_format(name_or_directory: str) -> str:
    root_expression = re.compile(r"\.root")
    parquet_expression = re.compile(r"\.parq(?:uet)?$")
```
I've also seen `.pq` running around as well for parquet files, may want to add this.
The right way to do it for both would be to check the magic bytes in the files but that's very slow for this purpose.
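For reference, a minimal sketch of what such a magic-byte check could look like; the helper name and the plain local `open()` are assumptions (remote paths would presumably need to go through fsspec), not code from this PR. ROOT files start with the ASCII bytes `root`, and Parquet files start (and end) with `PAR1`:

```python
def identify_file_format_by_magic(path: str) -> str | None:
    """Hypothetical helper: guess the file format from its leading magic bytes."""
    with open(path, "rb") as f:
        header = f.read(4)
    if header == b"root":
        return "root"  # ROOT files begin with the ASCII bytes "root"
    if header == b"PAR1":
        return "parquet"  # Parquet files begin (and end) with "PAR1"
    return None  # unrecognized format
```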
Though, riffing on that, when we go to process the files in preprocess we could also validate the determined file type against the magic bytes. However, this is largely already accomplished just by assuming the extension is correct and opening with the appropriate package.
Maybe we could give a more helpful error message? Then again I expect this kind of error (file with wrong extension) is very rare.
Moreover the actual filename is more a bookkeeping tool at the end of the day, so perhaps it's better to fall back to trying to open whatever ways we have available and notify the user (possibly by failing) if we exhausted all those options.
Aside from that one additional comment and some pondering, I think this looks reasonable. I appreciate the automation of writing core functionality unit tests. @ikrommyd anything that catches your attention?
As for unifying the actual processing and running interfaces: my idea is that Dask vs. non-Dask can be driven by a boolean flag passed to … Then we introduce one new function …
Yeah, I deleted Copilot's suggestion for the IOFactory, partially because it was lying, and partially because while that class was the user-facing gateway to using the pure … In short, I was thinking of not advertising the IOFactory much anymore for users, in case it makes more sense to break it into standalone functions. But I'll write an example for the few use-cases it still has. Let me also add the … For the magic-byte checking, lemme have a look into that in tandem with the preprocess_parquet PR. If I get that, I'll add it for both preprocess and preprocess_parquet (and maybe I can upstream some of the functionality from …)
Ah - OK, if it's not user-facing then we shouldn't talk about it unless we have to. :-) It just sorta stuck out to me! For the magic-byte checking I started thinking too much about the right way to do it. The end of that thinking on my side is:
i.e. add in some logic, if a file doesn't open, to check whether a root file is actually parquet (and vice versa); if it's not either valid input format then we mark the file as bad, otherwise open the file as what it really is (rather than what the filename guess says) and then make a note that it has the wrong postfix. This is more a creature comfort than anything else, consider it low priority.
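A rough illustration of that fallback, assuming the hypothetical magic-byte helper sketched earlier; `uproot.open` and `pyarrow.parquet.ParquetFile` are stand-ins here for whatever readers preprocess actually uses, and the function name is illustrative:

```python
import uproot
import pyarrow.parquet as pq


def open_with_fallback(path: str, guessed_format: str):
    """Illustrative only: trust the extension-based guess, then fall back on magic bytes."""
    try:
        if guessed_format == "root":
            return uproot.open(path)
        return pq.ParquetFile(path)
    except Exception:
        # the guess failed: consult the magic bytes and retry with the other reader,
        # remembering to flag that the file carries the wrong extension
        actual = identify_file_format_by_magic(path)
        if actual == "root":
            return uproot.open(path)
        if actual == "parquet":
            return pq.ParquetFile(path)
        # neither valid format: mark the file as bad
        raise ValueError(f"{path} is neither a valid ROOT nor Parquet file")
```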
Sounds appropriate to me; do you think the format checking should be exclusively in e.g. `preprocess`, or both `preprocess` and `apply_to_*set`?
That's a good question. My first instinct is that once it's in … This is just the beginning of an idea, though.
Okay, `.pq` formats accounted for, more tests for formats, factorized code for `joinable` and the datasetspec validation function (renamed), and some docs in the notebook for IOFactory showing off the still-relevant bits. Branch was out of date, rebased, and the tests are off to the races... Ciao
Giving it another look through, I think it looks pretty good to me. I think it sets us up for a much more robust future when it comes to dataset communication, manipulation, and tracking. @nsmith- @ikrommyd if y'all could have a look as well, at whatever depth of attention you can spare, I'd appreciate it.
I've never used pydantic, so I don't know its good and bad patterns. I've looked at this mainly through the tests and the example notebook, and I like the features; I think it sets up a good path for good dataset specifications and also for unifying executor/dask preprocessing.
@nsmith- could you give this a look, since you've got the most pydantic experience among people I know in HEP?
@nsmith- do you have any preliminary thoughts that might dramatically change things? Otherwise I will start to work on the next couple of PRs in the second half of this month, hoping only minor changes will be requested.
I left a few comments now just to better understand the idea. Will look more.
```
@@ -62,6 +60,8 @@ def apply_to_dataset(
    report : dask_awkward.Array, optional
        The file access report for running the analysis on the input dataset. Needs to be computed in simultaneously with the analysis to be accurate.
    """
    if isinstance(dataset, DatasetSpec):
```
This seems a bit backwards from how I would expect: the interface prefers a narrow type (`DatasetSpec`) but allows a very generic type (any dict), and then the first line of the implementation weakens the type, losing any type information for the remainder of the function. How come?
I would expect
```python
if isinstance(dataset, dict):
    dataset = DatasetSpec.model_validate(dataset)
```
or if the dict is not the same structure then to do some conversion
I'll admit this was a bit of exhaustion-guided coding, and is just the minimal shim for users to try inserting the pydantic filesets. What I had been thinking was to start "opting" users in by, in one of the follow-up PRs, trying to coerce dictionaries to the pydantic classes in preprocessing/downstream, and then either converting back at the end (upon success) to not surprise users, or (upon a pydantic-path failure) falling back to pure-dictionary processing and raising a warning to users with a link to an issue where they'd be encouraged to upload their failing fileset, to start building a set of failure cases and address them. Then, sometime down the line, fully deprecate the pure-dictionary inputs, with sufficient testing-in-the-wild behind us.
So we want to:
- for backwards compatibility, return dictionaries to users
- validate user input
- ensure our pydantic model exhaustively describes the currently supported input types
I would think all these can be accomplished by immediately running `DatasetSpec.model_validate` on the input, making sure that the body of the function passes the type checker, and then casting back to dict on output.
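A minimal sketch of that pattern, assuming `DatasetSpec` is exported from `coffea.dataset_tools` as this PR's `__init__.py` changes indicate; the function name is illustrative, not the real `apply_to_dataset` signature:

```python
from coffea.dataset_tools import DatasetSpec


def normalize_dataset(dataset: DatasetSpec | dict) -> dict:
    # validate immediately on entry; model_validate accepts dicts and model instances alike
    spec = DatasetSpec.model_validate(dataset)
    # ... the real function body would work only with the typed `spec` here ...
    # cast back to a plain dict on output for backwards compatibility
    return spec.model_dump()
```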
```python
class UprootFileSpec(BaseModel):
```
A better name might be `ROOTFileSpec`, since `ParquetFileSpec` is for Parquet files and this is for ROOT files.
```python
class UprootFileSpec(BaseModel):
    object_path: str
    steps: Annotated[list[StepPair], Field(min_length=1)] | StepPair | None = None
    format: Literal["root"] = "root"
```
You could inherit a `FileSpec` for the pieces that are common between the root and parquet files.
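A sketch of that refactor; the field names follow the diff above, but the shared `FileSpec` base class (and the `StepPair` alias used here) are assumptions rather than code from this PR:

```python
from typing import Annotated, Literal

from pydantic import BaseModel, Field

StepPair = tuple[int, int]  # stand-in for the PR's StepPair type


class FileSpec(BaseModel):
    # pieces common to ROOT and Parquet files live in the base class
    steps: Annotated[list[StepPair], Field(min_length=1)] | StepPair | None = None


class UprootFileSpec(FileSpec):
    object_path: str
    format: Literal["root"] = "root"


class ParquetFileSpec(FileSpec):
    format: Literal["parquet"] = "parquet"
```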
```python
Union[
    CoffeaUprootFileSpec,
    CoffeaParquetFileSpec,
    CoffeaUprootFileSpecOptional,
    CoffeaParquetFileSpecOptional,
],
```
This might be an opportunity to use a discriminated union, with `format` as the discriminator. Though, if we want to keep separate types for the preprocessed and input file sets, then we need a function discriminator. Personally I would separate the preprocessed and input at this level: `CoffeaFileDict` -> `InputFiles`, `PreprocessedFiles`.
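A sketch of what a `format`-discriminated union could look like; the trimmed class bodies here are stand-ins for the PR's `Coffea*FileSpec` models, not the real definitions:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, RootModel


class CoffeaUprootFileSpec(BaseModel):
    format: Literal["root"] = "root"
    object_path: str


class CoffeaParquetFileSpec(BaseModel):
    format: Literal["parquet"] = "parquet"


# pydantic dispatches on the `format` field instead of trying each union member in turn
AnyFileSpec = Annotated[
    Union[CoffeaUprootFileSpec, CoffeaParquetFileSpec],
    Field(discriminator="format"),
]


class CoffeaFileDict(RootModel[dict[str, AnyFileSpec]]):
    pass
```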
What do you mean by separating the preprocessed and input levels? I suppose I've been thinking of this as "maybe-maybenot" ready-to-use and "definitively ready-to-use"; the latter may be accomplished via preprocessing, but may be manually set (if e.g. one doesn't need/care about the saved_form). As this all coalesces, maybe it makes sense to be even stricter, so that's not an objection, but where do the boundaries get defined?
```python
    steps: Annotated[list[StepPair], Field(min_length=1)] | StepPair
    num_entries: Annotated[int, Field(ge=0)]
    uuid: str
    # directory: Literal[True, False]  # identify whether it's a directory of parquet files or a single parquet file, may be useful or necessary to distinguish
```
`is_directory: bool`
Should we consider support for directories with e.g. xrootd/root files in general, too? Similar to the format validation/magic-bytes discussion @lgray brought up, that's only possible to check rigorously by opening the paths/files, and I'm unsure what/how much is supported via fsspec and what alternative paths would be needed. But if this is the right moment to crystallize some handling of directories, let's do that?
```python
class CoffeaUprootFileSpecOptional(UprootFileSpec):
    num_entries: Annotated[int, Field(ge=0)] | None = None
    uuid: str | None = None


class CoffeaUprootFileSpec(CoffeaUprootFileSpecOptional):
    steps: Annotated[list[StepPair], Field(min_length=1)] | StepPair
    num_entries: Annotated[int, Field(ge=0)]
    uuid: str
```
If I recall, we basically have two cases, something like `FileSpec` and `FileSpecPreprocessed`, that dictate whether the extra fields exist, right?
Preprocessing (successfully) is sufficient, though not always necessary? I suppose "for sure" nobody is manually finding/calculating uuids though (under normal circumstances), but the "Optional" / "Concrete" binary has been floating in my head, naming-wise. Nevertheless, yes: minimal info vs fully-sufficient-for-scaled-up-analysis.
```python
        self.root.update(other, **kwargs)


class CoffeaFileDict(
```
What motivates using `RootModel`? To make it act like a dictionary so that existing code doesn't have to be touched?
I'd say both to ensure current patterns continue working, and because it makes sense to me that it remains iterable like a dict. The other thing that came to mind was completely eliminating the "filename" as a level, such that the FileSpec itself contained the path information. But I'm unsure that's in any way an improvement (rather than an orthogonal change), and more wary of re-architecting things unnecessarily.
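For illustration, this is the dict-like behaviour a `RootModel` subclass can keep; the toy `FileDict` here maps filenames to tree names, whereas the real `CoffeaFileDict` in this PR holds full file specs:

```python
from pydantic import RootModel


class FileDict(RootModel[dict[str, str]]):
    # delegate the familiar mapping protocol to the underlying dict
    def __getitem__(self, key: str) -> str:
        return self.root[key]

    def __iter__(self):
        return iter(self.root)

    def items(self):
        return self.root.items()


files = FileDict({"file1.root": "Events", "file2.root": "Events"})
assert files["file1.root"] == "Events"
assert list(files) == ["file1.root", "file2.root"]
```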
```python
        stored_formats_by_name = {
            k: v.format for k, v in self.root.items() if hasattr(v, "format")
        }
        assert all(
```
Prefer `if not ...: raise` here rather than `assert`, but really this check should happen in a model validator function.
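A sketch of moving the check into a validator; the minimal `File` model is a stand-in for the PR's file-spec classes, so treat the names as assumptions:

```python
from typing import Literal

from pydantic import BaseModel, RootModel, model_validator


class File(BaseModel):
    format: Literal["root", "parquet"]


class CoffeaFileDict(RootModel[dict[str, File]]):
    @model_validator(mode="after")
    def check_consistent_format(self):
        # surface a ValidationError at construction time instead of asserting later
        formats = {v.format for v in self.root.values()}
        if len(formats) > 1:
            raise ValueError(f"mixed file formats in one dataset: {formats}")
        return self
```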
I think I left out format validation here because it should be enforced at the Dataset level just above this in the pydantic hierarchy, so it's a matter of whether we want to be very strict up-front and check at both levels? From the testing so far, I suppose that's more likely to give a clear validation/instantiation error to users.
This PR is a precursor to both parquet reading being re-enabled (pre-processing parquet data, most importantly) and column-joining (where dataset specifications become very complicated and hard to construct/validate by hand). This exposes and expands the explicit dataclasses using Pydantic models.
`preprocess` is updated to handle these dataclasses explicitly. Column-joining imposes some requirements on the status of a dataset for automatic column determination, so a `joinable()` method is introduced to check that status after preprocessing, though some follow-up work may be generated by propagating the pydantic models through all the follow-up PRs.
Sidenote: these dataclasses are especially crucial for column-joining, as creating a joinable dataset spec becomes complicated when one must combine two disparate datasets (with different forms, if one wishes to understand their structure and determine necessary columns automatically) AND a non-trivial join specification (which tells column-joining how to handle the inputs to deliver a joined dataset to the user), and that spec needs to be preprocessed (separately for the root and parquet sub-datasets to be joined) then recomposed into a joinable (i.e. non-Optional FileSpec) version of the spec.
For now, most of the simpler functions interacting with datasets have been updated with dual-path code to simultaneously support legacy pure-dictionary inputs and the pydantic models. This includes e.g. `apply_to_(file|data)set`, `slice_(chunks|files)`, `preprocess`, etc. An example notebook including usage is introduced. In the future, once these are thoroughly tested in the wild, the legacy dictionaries can be removed from the code paths, since the Pydantic models already handle conversions trivially.
A follow-up PR will introduce the `preprocess_parquet` function. Another will be needed to handle threading the classes through the runner (where there's an open question of whether to harmonize the runner and `preprocess`+`apply_to_fileset` interfaces first). Another is expected to finish the changes for processing parquet again.
Tests should cover the overwhelming majority of cases; they were originally generated by Copilot, with heavy editing, expansion, and corrections. The test code is not concise, unfortunately.
Pydantic automatically buys us serialization, as the models can be saved to JSON via each model's `.model_dump()` function.
This PR is an alternative to #1395
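For instance, a round-trip might look like the sketch below, assuming `FilesetSpec` is exported from `coffea.dataset_tools`; the fileset contents and nesting follow coffea's existing dict fileset layout, so treat the exact structure as an assumption:

```python
import json

from coffea.dataset_tools import FilesetSpec

fileset = FilesetSpec.model_validate(
    {"ZJets": {"files": {"zjets.root": {"object_path": "Events"}}}}
)

as_dict = fileset.model_dump()  # plain python structure, json-serializable
with open("fileset.json", "w") as f:
    json.dump(as_dict, f)

restored = FilesetSpec.model_validate(as_dict)  # round-trip back into the model
```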
NOTE: considering the huge number of commits, squash and merge may be desired instead...