
Conversation


@bsr-the-mngrm bsr-the-mngrm commented Jul 25, 2025

TL;DR

Fixes two issues in datetime profiling:

  • round=False option was not respected due to missing opts propagation.
  • datetime.max caused an OverflowError when rounded up — now safely handled.

Changes

  • Fixes a bug where opts={"round": False} was ignored during min/max adjustment due to opts not being passed to _adjust_min_max_limits.
  • Prevents OverflowError when attempting to round up datetime.datetime.max by capping the result to datetime.datetime.max (see the sketch below).
  • Adds a warning log when overflow is caught during datetime rounding.
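
For illustration, a minimal sketch of the capping behaviour described in the bullets above, assuming day-granularity rounding; the helper name is hypothetical, and the actual fix lives in the profiler's `_round_datetime`.

```python
import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)


def round_up_to_next_day(value: datetime) -> datetime:
    """Hypothetical helper: round a datetime up to the next midnight, capping at datetime.max."""
    floor = value.replace(hour=0, minute=0, second=0, microsecond=0)
    if value == floor:
        return value  # already on a day boundary, nothing to round
    try:
        return floor + timedelta(days=1)
    except OverflowError:
        # Rounding datetime.max upward would leave the representable range.
        logger.warning("Rounding %s up overflowed; capping to datetime.max", value)
        return datetime.max
```

With this guard, rounding datetime.max returns datetime.max and logs a warning instead of raising.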

Linked issues

Resolves #475

Tests

  • manually tested
  • unit tests intentionally not added (see the note below)
  • added integration tests

A unit test for _round_datetime was considered but intentionally omitted to avoid accessing a protected method directly. The logic is indirectly covered by existing profiling workflows.

@bsr-the-mngrm bsr-the-mngrm requested a review from a team as a code owner July 25, 2025 19:43
@bsr-the-mngrm bsr-the-mngrm requested review from gergo-databricks and removed request for a team July 25, 2025 19:43
@github-actions

All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

@bsr-the-mngrm bsr-the-mngrm force-pushed the fix/datetime-rounding-overflow branch 2 times, most recently from 7b9008e to a39afa3 on July 25, 2025 20:00

This comment was marked as outdated.

@mwojtyczka mwojtyczka requested a review from Copilot July 28, 2025 17:13

Copilot AI left a comment


Pull Request Overview

This PR fixes datetime profiling logic by addressing two key issues: ensuring the round=False option is respected when processing datetime columns, and preventing OverflowError when attempting to round datetime.max values. The changes add proper timezone handling to timestamp conversions and include comprehensive test coverage for both rounding scenarios.

  • Fixes opts parameter propagation to _adjust_min_max_limits method
  • Adds overflow protection for datetime rounding operations
  • Includes timezone-aware timestamp handling with UTC timezone
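
A hedged sketch of the pattern these bullets describe; the function signature, parameter types, and helper below are assumptions for illustration, not the library's actual code.

```python
from datetime import datetime, timedelta, timezone


def _round_up_to_day(dt: datetime) -> datetime:
    # Day-granularity round-up; the real code also caps the result at datetime.max.
    floor = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return dt if dt == floor else floor + timedelta(days=1)


def _adjust_min_max_limits(min_ts: float, max_ts: float, opts: dict) -> tuple[datetime, datetime]:
    # Convert epoch seconds using an explicit UTC timezone, then widen the range
    # only when rounding is enabled.
    min_dt = datetime.fromtimestamp(min_ts, tz=timezone.utc)
    max_dt = datetime.fromtimestamp(max_ts, tz=timezone.utc)
    if not opts.get("round", True):
        return min_dt, max_dt  # round=False: keep the observed limits unchanged
    return min_dt.replace(hour=0, minute=0, second=0, microsecond=0), _round_up_to_day(max_dt)
```

The reported bug amounts to a caller building `opts` but never forwarding it to this helper, so `round=False` was silently lost; the fix threads the caller's `opts` through.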

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • src/databricks/labs/dqx/profiler/profiler.py: Core fix for opts parameter passing and datetime overflow handling with timezone support
  • tests/integration/test_profiler.py: Integration tests verifying round=False behavior and datetime.max handling scenarios


@mwojtyczka mwojtyczka left a comment


LGTM

@mwojtyczka mwojtyczka merged commit 0fb0767 into databrickslabs:main Jul 28, 2025
11 checks passed
@bsr-the-mngrm bsr-the-mngrm deleted the fix/datetime-rounding-overflow branch July 28, 2025 18:30
mwojtyczka added a commit that referenced this pull request Aug 6, 2025
* Added new row-level freshness check ([#495](#495)). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
* Added new dataset-level freshness check ([#499](#499)). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate whether at least a specified minimum number of records arrive within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period. Both freshness checks are sketched after this list.
* Improvements have been made to the performance of aggregation check functions, and the check message format has been updated for better readability.
* Created LLM util function to get check function details ([#469](#469)). A new utility function has been introduced to provide definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
* Added equality safe row and column matching in compare datasets check ([#473](#473)). The compare datasets check functionality has been enhanced to handle null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, have been introduced to control how null values are handled, both defaulting to True. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check now excludes specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching.
* Fixed datetime rounding logic in profiler ([#483](#483)). The datetime rounding logic in the profiler has been improved to respect the `round=False` option, which was previously ignored. The code now handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
* Added loading and saving checks from file in Unity Catalog Volume ([#512](#512)). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration.  The `checks_location` field can point to a file in the local path, workspace, installation folder, or Unity Catalog Volume, providing users with more flexibility and clarity when managing their quality checks.
* Refactored methods for loading and saving checks ([#487](#487)). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods, which take a `config` parameter to determine the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig`.
* Storing checks using dqx classes ([#474](#474)). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
* Added LLM utility function to extract check examples in YAML from docs ([#506](#506)). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script uses regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.
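
For context, a hedged sketch of how the two freshness checks above might be declared in metadata (dict) form. The function names come from the notes above; the argument names and the exact dict layout are assumptions inferred from the prose, not copied from the library documentation.

```python
# Argument names (max_age_minutes, base_column, window_minutes, ...) are assumptions
# inferred from the release notes, not verified against the dqx documentation.
freshness_checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh",
            "arguments": {
                "column": "last_updated",      # timestamp column to validate
                "max_age_minutes": 60,         # older than this counts as stale
                "base_column": "ingested_at",  # optional; defaults to the current timestamp
            },
        },
    },
    {
        "criticality": "warn",
        "check": {
            "function": "is_data_fresh_per_time_window",
            "arguments": {
                "column": "event_time",
                "window_minutes": 15,           # size of each time window
                "min_records_per_window": 100,  # minimum rows expected per window
                "lookback_windows": 4,          # how far back to evaluate
            },
        },
    },
]
```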

BREAKING CHANGES!

* The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
* The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.

DEPRECATION CHANGES!

If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. The methods below are deprecated: they remain in the `DQEngine` for now, but you should update your code as they will be removed in future versions.
* Loading checks from storage has been unified under the `load_checks` method. The following `DQEngine` methods are deprecated:
`load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
* Saving checks to storage has been unified under the `save_checks` method. The following `DQEngine` methods are deprecated:
`save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.

The `save_checks` and `load_checks` methods take a `config` parameter, which determines the storage type used. The following storage configs are currently supported (see the usage sketch after this list):
* `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
* `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
* `TableChecksStorageConfig`: a table
* `InstallationChecksStorageConfig`: storage defined in the installation context, using either the `checks_table` or `checks_file` field from the run configuration.
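
A hedged usage sketch of the unified `load_checks`/`save_checks` API described above; the import path of the storage config class, its field name, and the exact parameter names are assumptions, not verified against the package.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import FileChecksStorageConfig  # module path assumed

engine = DQEngine(WorkspaceClient())

checks = [
    {"criticality": "error",
     "check": {"function": "is_not_null", "arguments": {"column": "id"}}},
]

# Save checks to a local YAML file, then load them back via the same config type.
storage = FileChecksStorageConfig(location="checks.yml")  # "location" field name assumed
engine.save_checks(checks, config=storage)
loaded_checks = engine.load_checks(config=storage)
```
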
@mwojtyczka mwojtyczka mentioned this pull request Aug 6, 2025
mwojtyczka added a commit that referenced this pull request Aug 6, 2025
AdityaMandiwal pushed a commit that referenced this pull request Aug 21, 2025
AdityaMandiwal pushed a commit that referenced this pull request Aug 21, 2025

Development

Successfully merging this pull request may close these issues.

[BUG]: Datetime profiling fails on datetime.max and ignores round=False option
