# Fix datetime rounding logic and ensure round=False is respected #483
## Conversation
All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
Force-pushed from 7b9008e to a39afa3
Force-pushed from a39afa3 to 1c2c2d9
…n min/max adjustment
…s_remove_outliers_no_outlier_columns` test
…s_with_rounding_enabled` test
Pull Request Overview
This PR fixes datetime profiling logic by addressing two key issues: ensuring the round=False option is respected when processing datetime columns, and preventing OverflowError when attempting to round datetime.max values. The changes add proper timezone handling to timestamp conversions and include comprehensive test coverage for both rounding scenarios.
- Fixes `opts` parameter propagation to the `_adjust_min_max_limits` method
- Adds overflow protection for datetime rounding operations
- Includes timezone-aware timestamp handling using UTC
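For reference (not the PR's actual code), this is what timezone-aware epoch-to-datetime conversion looks like in standard Python; without `tz=`, `fromtimestamp` returns a naive datetime interpreted in the local timezone:

```python
import datetime

# Convert an epoch timestamp to a timezone-aware UTC datetime;
# omitting tz= would yield a naive, local-time datetime instead.
ts = 1700000000.0
utc_dt = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
print(utc_dt)  # 2023-11-14 22:13:20+00:00
```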
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/profiler/profiler.py | Core fix for opts parameter passing and datetime overflow handling with timezone support |
| tests/integration/test_profiler.py | Integration tests verifying round=False behavior and datetime.max handling scenarios |
LGTM
* Added new row-level freshness check ([#495](#495)). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
* Added new dataset-level freshness check ([#499](#499)). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate whether at least a specified minimum number of records arrive within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period.
* Improved the performance of aggregation check functions and updated the check message format for better readability.
* Created an LLM util function to get check function details ([#469](#469)). A new utility function provides definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
* Added equality-safe row and column matching in the compare datasets check ([#473](#473)). The compare datasets check now handles null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, control how null values are handled, both defaulting to `True`. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check can also exclude specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching.
* Fixed datetime rounding logic in the profiler ([#483](#483)). The datetime rounding logic now respects the `round=False` option, which was previously ignored. The code also handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
* Added loading and saving checks from a file in a Unity Catalog Volume ([#512](#512)). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in a local path, workspace, installation folder, or Unity Catalog Volume, providing users with more flexibility and clarity when managing their quality checks.
* Refactored methods for loading and saving checks ([#487](#487)). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods, which take a `config` parameter to determine the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig`.
* Storing checks using dqx classes ([#474](#474)). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
* Added an LLM utility function to extract check examples in YAML from docs ([#506](#506)). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script uses regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.

BREAKING CHANGES!
* The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
* The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.

DEPRECATION CHANGES! If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. The methods below are deprecated; they remain in the `DQEngine` for now, but you should update your code as they will be removed in future versions.
* Loading checks from storage has been unified under the `load_checks` method. The following methods are deprecated: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
* Saving checks to storage has been unified under the `save_checks` method. The following methods are deprecated: `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.

The `save_checks` and `load_checks` methods take a `config` parameter, which determines the storage type used (see the sketch after the list of supported configs below).
The following storage configs are currently supported:
* `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
* `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
* `TableChecksStorageConfig`: a table
* `InstallationChecksStorageConfig`: storage defined in the installation context, using either the `checks_table` or `checks_file` field from the run configuration
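For illustration, a minimal, hedged sketch of the unified API described above. The class and method names come from these release notes; the module paths, constructor arguments (e.g. `location`), the `DQEngine` constructor usage, the metadata layout of the check dict, and the `is_data_fresh` argument names are assumptions rather than confirmed signatures:

```python
# Hedged sketch: module paths, constructor arguments, and check argument
# names below are assumptions and may differ from the actual dqx API.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine  # assumed module path
from databricks.labs.dqx.config import WorkspaceFileChecksStorageConfig  # assumed module path

# A check defined in metadata (dict) form; the `is_data_fresh` argument
# names ("column", "max_age_minutes") are illustrative guesses.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh",
            "arguments": {"column": "event_ts", "max_age_minutes": 60},
        },
    }
]

dq_engine = DQEngine(WorkspaceClient())  # assumed constructor usage

# Save and load via a storage config; "location" is an assumed parameter name.
config = WorkspaceFileChecksStorageConfig(location="/Workspace/dqx/checks.yml")
dq_engine.save_checks(checks, config=config)
loaded = dq_engine.load_checks(config=config)

# Convert between metadata and DQRule objects (method names from the release
# notes; whether they live on DQEngine or elsewhere is an assumption).
rules = dq_engine.deserialize_checks(loaded)
as_dicts = dq_engine.serialize_checks(rules)
```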
## TL;DR
Fixes two issues in datetime profiling:
- `round=False` option was not respected due to missing `opts`
propagation.
- `datetime.max` caused an `OverflowError` when rounded up — now safely
handled.
## Changes
- Fixes a bug where `opts={"round": False}` was ignored during min/max
adjustment due to `opts` not being passed to `_adjust_min_max_limits`.
- Prevents `OverflowError` when attempting to round up
`datetime.datetime.max` by capping the result to
`datetime.datetime.max`.
- Adds a warning log when overflow is caught during datetime rounding.
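Below is a minimal sketch of the capping behaviour described above, assuming day-level rounding on Python `datetime` objects; the PR's actual helper (`_round_datetime`) may use a different signature and granularity:

```python
import datetime
import logging

logger = logging.getLogger(__name__)


def round_up_to_day(value: datetime.datetime) -> datetime.datetime:
    """Round a datetime up to the next midnight, capping at datetime.max on overflow."""
    if value.time() == datetime.time.min:
        return value  # already at midnight, nothing to round
    try:
        next_day = value + datetime.timedelta(days=1)
    except OverflowError:
        # Rounding datetime.max (or values close to it) up would overflow;
        # cap the result instead of failing, and log a warning.
        logger.warning("Cannot round %s up; capping at datetime.max", value)
        return datetime.datetime.max.replace(tzinfo=value.tzinfo)
    return next_day.replace(hour=0, minute=0, second=0, microsecond=0)
```

When `round=False` is propagated correctly via `opts`, this rounding path is skipped entirely and the profiled min/max values are returned unmodified.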
### Linked issues
Resolves [#475](#475)
### Tests
- [x] manually tested
- [ ] added unit tests
- [x] added integration tests
> A unit test for `_round_datetime` was considered but intentionally
omitted to avoid accessing a protected method directly. The logic is
indirectly covered by existing profiling workflows.
---------
Co-authored-by: Sandor R. Bakos <[email protected]>
Co-authored-by: Marcin Wojtyczka <[email protected]>
Co-authored-by: Copilot <[email protected]>