adrianlyjak
Contributor

pydantic was parsing nested dicts into an empty ExtractedFieldMetadata, because that was the first member of the union.

@adrianlyjak adrianlyjak force-pushed the adrian/fix-lost-dimension branch from 0f33e62 to 1dd0eba Compare August 13, 2025 17:27
@@ -203,7 +203,7 @@ class ExtractedFieldMetadata(BaseModel):


ExtractedFieldMetaDataDict = Dict[
-    str, Union[ExtractedFieldMetadata, Dict[str, Any], list[Any]]
+    str, Union[Dict[str, Any], ExtractedFieldMetadata, list[Any]]
Contributor Author

this is the fix 🤦

interesting, why? I'm just curious.

Contributor Author

I'm not sure. I think it iterates through the union and parses with the first type that matches, and since every field on ExtractedFieldMetadata is optional, it will match any dict. However, this can't be the full explanation; otherwise the ExtractedFieldMetadata values would be parsed to Dict after the reorder, which isn't happening.
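
For illustration, here is a minimal standard-library sketch of the hypothesis in this comment: a union resolved left to right, where a model whose fields are all optional accepts any dict and silently drops unknown keys. This is a simplified model of the idea, not pydantic's actual resolution algorithm (which, as noted above, is evidently more involved). `FieldMeta`, its field names, and the helper functions are hypothetical stand-ins.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, List, Optional


@dataclass
class FieldMeta:
    """Hypothetical stand-in for ExtractedFieldMetadata: every field is optional."""
    confidence: Optional[float] = None
    page_number: Optional[int] = None


def lenient_parse(value: Any) -> FieldMeta:
    """Build a FieldMeta from a dict, silently ignoring unknown keys."""
    if not isinstance(value, dict):
        raise TypeError("expected a dict")
    known = {f.name for f in fields(FieldMeta)}
    return FieldMeta(**{k: v for k, v in value.items() if k in known})


def passthrough_dict(value: Any) -> dict:
    """Accept any dict unchanged."""
    if not isinstance(value, dict):
        raise TypeError("expected a dict")
    return value


def first_match(value: Any, parsers: List[Callable[[Any], Any]]) -> Any:
    """Try each union member in order and return the first successful parse."""
    for parse in parsers:
        try:
            return parse(value)
        except TypeError:
            continue
    raise TypeError("no union member matched")


nested = {"length": {"confidence": 0.9}, "width": {"confidence": 0.8}}

# Model first: the nested dict collapses into an empty FieldMeta.
model_first = first_match(nested, [lenient_parse, passthrough_dict])

# Dict first: the nested dict survives, but the model is never reached.
dict_first = first_match(nested, [passthrough_dict, lenient_parse])
```

Under this simplified model, putting the dict member first preserves the nested data, but every dict then stops at the dict member, so the model member is unreachable for dict input.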

I suppose after this change, ExtractedFieldMetadata will never be hit, no?

Contributor Author

> I suppose after this change, ExtractedFieldMetadata will never be hit, no?

@zhaotai you make a good point. It looks like when parsing from JSON, ExtractedFieldMetadata would never be parsed (whereas whatever normalization happens in the ExtractedData constructor kept the classes).

Modified this so that ExtractedFieldMetadata instead only parses if there are no extra fields, which seems more robust.
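
A rough standard-library sketch of the "only parse if there are no extra fields" rule, for illustration. The thread does not show how the PR implements it; in pydantic v2 one way to reject unknown keys is `model_config = ConfigDict(extra="forbid")`, but whether the PR uses that is an assumption. `FieldMeta`, its field names, and `parse_field` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Union


@dataclass
class FieldMeta:
    """Hypothetical stand-in for ExtractedFieldMetadata: every field is optional."""
    confidence: Optional[float] = None
    page_number: Optional[int] = None


def parse_field(value: Dict[str, Any]) -> Union[FieldMeta, Dict[str, Any]]:
    """Return a FieldMeta only when every key is a known field; otherwise keep the dict."""
    known = {f.name for f in fields(FieldMeta)}
    if set(value) <= known:
        return FieldMeta(**value)
    # Unknown keys mean this is nested data, not metadata for a single field.
    return value


leaf = parse_field({"confidence": 0.9})
nested = parse_field({"length": {"confidence": 0.9}})
```

With this rule the order of the union members matters less: a dict with unknown keys can no longer be swallowed by the all-optional model. One edge case in this sketch is that an empty dict still becomes an empty `FieldMeta`.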

data = json.load(f)
result = ExtractedData.from_extraction_result(ExtractRun.parse_obj(data), Capacitor)
assert result.field_metadata == {
"dimensions": {
Contributor Author

pydantic was converting this to "dimensions": ExtractedFieldMetadata(None, None, None, None)

@adrianlyjak adrianlyjak marked this pull request as ready for review August 13, 2025 17:28

@adrianlyjak adrianlyjak force-pushed the adrian/fix-lost-dimension branch from 08b6315 to d63d610 Compare August 13, 2025 18:30
@adrianlyjak adrianlyjak force-pushed the adrian/fix-lost-dimension branch from 0912614 to 625cc82 Compare August 13, 2025 19:22
@adrianlyjak adrianlyjak merged commit 79fe193 into main Aug 13, 2025
12 of 14 checks passed
2 participants