# Version changelog

## 0.8.0

* Added new row-level freshness check ([#495](https://github.com/databrickslabs/dqx/issues/495)). A new data quality check function, `is_data_fresh`, identifies stale data caused by delayed pipelines, enabling early detection of upstream issues. It verifies that the values in a given timestamp column are within a specified number of minutes of a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column that defaults to the current timestamp (a sketch of both freshness checks follows this list).
* Added new dataset-level freshness check ([#499](https://github.com/databrickslabs/dqx/issues/499)). A new dataset-level check function, `is_data_fresh_per_time_window`, validates that at least a specified minimum number of records arrives within every time window, ensuring data freshness. The time window, minimum records per window, and lookback period are all configurable (see the sketch after this list).
* Improved the performance of aggregation check functions and updated the check message format for better readability.
* Created an LLM utility function to get check function details ([#469](https://github.com/databrickslabs/dqx/issues/469)). The new utility function returns the definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
* Added null-safe row and column matching to the compare datasets check ([#473](https://github.com/databrickslabs/dqx/issues/473)). The compare datasets check now handles null values during row matching and column value comparisons, making it more robust and flexible. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, control how null values are handled; both default to True and enable null-safe primary key matching and column value matching, ensuring accurate comparisons even when null values are present. The check can also exclude specific columns from value comparison via the `exclude_columns` parameter while still considering them for row matching (see the sketch after this list).
* Fixed datetime rounding logic in the profiler ([#483](https://github.com/databrickslabs/dqx/issues/483)). The profiler's datetime rounding logic now respects the `round=False` option, which was previously ignored. The code also handles the `OverflowError` raised when rounding up the maximum datetime value by capping the result and logging a warning.
* Added loading and saving checks from a file in a Unity Catalog Volume ([#512](https://github.com/databrickslabs/dqx/issues/512)). Quality checks can now be stored in a Unity Catalog Volume, in addition to the existing storage types (tables, files, and workspace files). The storage location of quality checks has been unified into a single configuration field, `checks_location`, replacing the previous `checks_file` and `checks_table` fields. This simplifies the configuration and removes ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in a local path, the workspace, the installation folder, or a Unity Catalog Volume, giving users more flexibility and clarity when managing their quality checks.
* Refactored methods for loading and saving checks ([#487](https://github.com/databrickslabs/dqx/issues/487)). The `DQEngine` class has been reworked for modularity and maintainability: loading and saving checks are now unified under the `load_checks` and `save_checks` methods, which take a `config` parameter that determines the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig` (see the sketch after the storage config list below).
* Storing checks using dqx classes ([#474](https://github.com/databrickslabs/dqx/issues/474)). The data quality engine now provides methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing flexibility in how checks are defined and used. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while `deserialize_checks` performs the reverse operation. Additionally, the `DQRule` class now includes a `to_dict` method that converts a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes let users work with checks in both formats and store and retrieve them easily. The conversion supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format (a round-trip sketch follows this list).
* Added LLM utility function to extract YAML check examples from docs ([#506](https://github.com/databrickslabs/dqx/issues/506)). A new Python script extracts YAML examples from MDX documentation files and combines them into a single YAML file. The script uses regular expressions to find YAML code blocks in MDX content, validates each block, and combines all valid blocks into a single list. The combined YAML file is written to the LLM resources directory for use in language model processing.

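A minimal sketch of both freshness checks in the metadata (dict) format. The argument names (`column`, `max_age_minutes`, `window_minutes`, `min_records_per_window`, `lookback_windows`) are illustrative assumptions based on the descriptions above, not confirmed signatures; consult the check function reference for the exact parameters.

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

# Argument names below are assumptions -- consult the check function reference.
checks = [
    {
        "criticality": "warn",
        "check": {
            "function": "is_data_fresh",
            "arguments": {
                "column": "last_updated",  # timestamp column to validate
                "max_age_minutes": 60,     # older than this => flagged as stale
                # base timestamp column is optional; defaults to current timestamp
            },
        },
    },
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh_per_time_window",
            "arguments": {
                "column": "ingested_at",
                "window_minutes": 30,           # size of each time window
                "min_records_per_window": 100,  # minimum rows expected per window
                "lookback_windows": 48,         # how far back to validate
            },
        },
    },
]

dq_engine = DQEngine(WorkspaceClient())
# input_df is any Spark DataFrame you want to validate
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```
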
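A sketch of the compare datasets check with the new null-safe flags. The `columns` and `ref_table` argument names are assumptions for illustration; only `exclude_columns` and the two null-safe parameters are named in this release.

```python
compare_checks = [
    {
        "criticality": "error",
        "check": {
            "function": "compare_datasets",
            "arguments": {
                "columns": ["id"],                        # row-matching columns (assumed name)
                "ref_table": "catalog.schema.reference",  # reference dataset (assumed name)
                "exclude_columns": ["updated_at"],        # used for row matching, skipped in value comparison
                "null_safe_row_matching": True,           # nulls compare equal during row matching
                "null_safe_column_value_matching": True,  # nulls compare equal during value comparison
            },
        },
    },
]
```
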
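And a sketch of round-tripping checks between `DQRule` objects and metadata. Whether `serialize_checks` and `deserialize_checks` are engine methods or module-level helpers is assumed here; verify their exact location in the API docs.

```python
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule
from databricks.sdk import WorkspaceClient

rule = DQRowRule(
    name="customer_id_is_not_null",
    criticality="error",
    check_func=check_funcs.is_not_null,
    column="customer_id",
)

dq_engine = DQEngine(WorkspaceClient())
as_dict = rule.to_dict()                        # structured dict for a single rule
metadata = dq_engine.serialize_checks([rule])   # class -> metadata (method location assumed)
rules = dq_engine.deserialize_checks(metadata)  # metadata -> class
```
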
BREAKING CHANGES!

* The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
* The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.

DEPRECATION CHANGES!

If you load or save checks from storage (file, workspace file, table, or installation), you are affected. The methods below are deprecated. They remain in `DQEngine` for now, but you should update your code, as they will be removed in a future version.
* Loading checks from storage has been unified under the `load_checks` method. The following `DQEngine` methods are deprecated:
`load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
* Saving checks to storage has been unified under the `save_checks` method. The following `DQEngine` methods are deprecated:
`save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.

The `save_checks` and `load_checks` methods take a `config` parameter, which determines the storage type used. The following storage configs are currently supported (see the sketch after this list):
* `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
* `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
* `TableChecksStorageConfig`: a table
* `InstallationChecksStorageConfig`: storage defined in the installation context, using the `checks_location` field from the run configuration.

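A minimal sketch of the unified API under stated assumptions: the storage configs are importable from `databricks.labs.dqx.config`, they take a `location` field, and `save_checks` accepts checks in metadata (dict) format; verify all three against the docs.

```python
from databricks.labs.dqx.config import FileChecksStorageConfig, TableChecksStorageConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "customer_id"}},
    },
]

# The config type alone selects the storage backend.
dq_engine.save_checks(checks, config=FileChecksStorageConfig(location="checks.yml"))
loaded = dq_engine.load_checks(config=TableChecksStorageConfig(location="main.dqx.checks"))
```
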
## 0.7.1

* Added type validation for apply checks method ([#465](https://github.com/databrickslabs/dqx/issues/465)). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message suggesting alternative methods for passing checks as dictionaries. Input attribute validation has also been enhanced to verify the criticality value, which must be either `warn` or `error`; invalid values raise a `ValueError`. A minimal sketch follows.
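
A minimal sketch of the stricter validation (error messages paraphrased; the exact point where criticality is validated is an assumption):

```python
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Plain dicts are rejected by apply_checks; the TypeError message suggests
# the metadata-based methods (e.g. apply_checks_by_metadata) for dict checks.
try:
    dq_engine.apply_checks(input_df, [{"criticality": "error"}])  # input_df: your DataFrame
except TypeError as err:
    print(err)

# criticality must be "warn" or "error"; other values raise ValueError.
try:
    DQRowRule(criticality="fatal", check_func=check_funcs.is_not_null, column="id")
except ValueError as err:
    print(err)
```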