# Fix datetime rounding logic and ensure round=False is respected #483
## Conversation
All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
Force-pushed from 7b9008e to a39afa3
Force-pushed from a39afa3 to 1c2c2d9
…n min/max adjustment
…s_remove_outliers_no_outlier_columns` test
…s_with_rounding_enabled` test
Pull Request Overview
This PR fixes datetime profiling logic by addressing two key issues: ensuring the round=False option is respected when processing datetime columns, and preventing OverflowError when attempting to round datetime.max values. The changes add proper timezone handling to timestamp conversions and include comprehensive test coverage for both rounding scenarios.
- Fixes `opts` parameter propagation to the `_adjust_min_max_limits` method
- Adds overflow protection for datetime rounding operations
- Includes timezone-aware timestamp handling using UTC
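For reference (not the PR's actual code), this is what timezone-aware epoch-to-datetime conversion looks like in standard Python; without `tz=`, `fromtimestamp` returns a naive datetime interpreted in the local timezone:

```python
import datetime

# Convert an epoch timestamp to a timezone-aware UTC datetime;
# omitting tz= would yield a naive, local-time datetime instead.
ts = 1700000000.0
utc_dt = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
print(utc_dt)  # 2023-11-14 22:13:20+00:00
```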
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/profiler/profiler.py | Core fix for opts parameter passing and datetime overflow handling with timezone support |
| tests/integration/test_profiler.py | Integration tests verifying round=False behavior and datetime.max handling scenarios |
LGTM
* Added new row-level freshness check ([#495](#495)). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
* Added new dataset-level freshness check ([#499](#499)). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate whether at least a specified minimum number of records arrive within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period.
* Improved the performance of aggregation check functions and updated the check message format for better readability.
* Created an LLM util function to get check function details ([#469](#469)). A new utility function provides definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
* Added equality-safe row and column matching in the compare datasets check ([#473](#473)). The compare datasets check now handles null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, control how null values are handled, both defaulting to `True`. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check can also exclude specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching.
* Fixed datetime rounding logic in the profiler ([#483](#483)). The datetime rounding logic now respects the `round=False` option, which was previously ignored. The code also handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
* Added loading and saving checks from a file in a Unity Catalog Volume ([#512](#512)). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in a local path, workspace, installation folder, or Unity Catalog Volume, providing users with more flexibility and clarity when managing their quality checks.
* Refactored methods for loading and saving checks ([#487](#487)). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods, which take a `config` parameter to determine the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig`.
* Storing checks using dqx classes ([#474](#474)). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
* Added an LLM utility function to extract check examples in YAML from docs ([#506](#506)). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script uses regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.

BREAKING CHANGES!
* The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
* The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.

DEPRECATION CHANGES! If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. The methods below are deprecated; they remain in the `DQEngine` for now, but you should update your code as they will be removed in future versions.
* Loading checks from storage has been unified under the `load_checks` method. The following methods are deprecated: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
* Saving checks to storage has been unified under the `save_checks` method. The following methods are deprecated: `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.

The `save_checks` and `load_checks` methods take a `config` parameter, which determines the storage type used (see the sketch after the list of supported configs below).
The following storage configs are currently supported:
* `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
* `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
* `TableChecksStorageConfig`: a table
* `InstallationChecksStorageConfig`: storage defined in the installation context, using either the `checks_table` or `checks_file` field from the run configuration
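For illustration, a minimal, hedged sketch of the unified API described above. The class and method names come from these release notes; the module paths, constructor arguments (e.g. `location`), the `DQEngine` constructor usage, the metadata layout of the check dict, and the `is_data_fresh` argument names are assumptions rather than confirmed signatures:

```python
# Hedged sketch: module paths, constructor arguments, and check argument
# names below are assumptions and may differ from the actual dqx API.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine  # assumed module path
from databricks.labs.dqx.config import WorkspaceFileChecksStorageConfig  # assumed module path

# A check defined in metadata (dict) form; the `is_data_fresh` argument
# names ("column", "max_age_minutes") are illustrative guesses.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh",
            "arguments": {"column": "event_ts", "max_age_minutes": 60},
        },
    }
]

dq_engine = DQEngine(WorkspaceClient())  # assumed constructor usage

# Save and load via a storage config; "location" is an assumed parameter name.
config = WorkspaceFileChecksStorageConfig(location="/Workspace/dqx/checks.yml")
dq_engine.save_checks(checks, config=config)
loaded = dq_engine.load_checks(config=config)

# Convert between metadata and DQRule objects (method names from the release
# notes; whether they live on DQEngine or elsewhere is an assumption).
rules = dq_engine.deserialize_checks(loaded)
as_dicts = dq_engine.serialize_checks(rules)
```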
## TL;DR
Fixes two issues in datetime profiling:
- `round=False` option was not respected due to missing `opts`
propagation.
- `datetime.max` caused an `OverflowError` when rounded up — now safely
handled.
## Changes
- Fixes a bug where `opts={"round": False}` was ignored during min/max
adjustment due to `opts` not being passed to `_adjust_min_max_limits`.
- Prevents `OverflowError` when attempting to round up
`datetime.datetime.max` by capping the result to
`datetime.datetime.max`.
- Adds a warning log when overflow is caught during datetime rounding.
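Below is a minimal sketch of the capping behaviour described above, assuming day-level rounding on Python `datetime` objects; the PR's actual helper (`_round_datetime`) may use a different signature and granularity:

```python
import datetime
import logging

logger = logging.getLogger(__name__)


def round_up_to_day(value: datetime.datetime) -> datetime.datetime:
    """Round a datetime up to the next midnight, capping at datetime.max on overflow."""
    if value.time() == datetime.time.min:
        return value  # already at midnight, nothing to round
    try:
        next_day = value + datetime.timedelta(days=1)
    except OverflowError:
        # Rounding datetime.max (or values close to it) up would overflow;
        # cap the result instead of failing, and log a warning.
        logger.warning("Cannot round %s up; capping at datetime.max", value)
        return datetime.datetime.max.replace(tzinfo=value.tzinfo)
    return next_day.replace(hour=0, minute=0, second=0, microsecond=0)
```

When `round=False` is propagated correctly via `opts`, this rounding path is skipped entirely and the profiled min/max values are returned unmodified.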
### Linked issues
Resolves [#475](#475)
### Tests
- [x] manually tested
- [ ] added unit tests
- [x] added integration tests
> A unit test for `_round_datetime` was considered but intentionally
omitted to avoid accessing a protected method directly. The logic is
indirectly covered by existing profiling workflows.
---------
Co-authored-by: Sandor R. Bakos <[email protected]>
Co-authored-by: Marcin Wojtyczka <[email protected]>
Co-authored-by: Copilot <[email protected]>