
Commit ed5a36e

Authored by dinbab1984 and mwojtyczka, with Copilot
Load and Save checks from file in Unity Catalog Volume (#512)
## Changes

* Added support for a new storage type for checks: Unity Catalog Volume.
* Unified checks location into a single configuration field: `checks_location`. This replaces the previous `checks_file` and `checks_table` fields, removing ambiguity by ensuring only one storage location can be defined per run configuration.

BREAKING CHANGES!

* The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.

### Linked issues

This PR addresses [FEATURE]: Load and Save quality checks from/in UC Volume (#386)

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests

---------

Co-authored-by: Marcin Wojtyczka <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 962990b commit ed5a36e

25 files changed: +563 −112 lines
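Under the new scheme, a single run-config field carries the checks storage location. A minimal sketch of the consolidated field (the values are illustrative, drawn from the `installation.mdx` hunk in this commit; the table, workspace-file, and volume-file variants are mutually exclusive within one run config):

```yaml
run_configs:
  - name: default
    # one field now covers all storage types:
    checks_location: main.iot.checks                       # a fully qualified UC table name
    # checks_location: iot_checks.yml                      # or a workspace-relative YAML/JSON file
    # checks_location: /Volumes/main/iot/config/checks.yml # or a Unity Catalog volume file path
```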

README.md (1 addition, 1 deletion)

```diff
@@ -13,7 +13,7 @@ Simplified Data Quality checking at Scale for PySpark Workloads on streaming and
 [![codecov](https://codecov.io/github/databrickslabs/dqx/graph/badge.svg)](https://codecov.io/github/databrickslabs/dqx)
 ![linesofcode](https://aschey.tech/tokei/github/databrickslabs/dqx?category=code)
 [![PyPI](https://img.shields.io/pypi/v/databricks-labs-dqx?label=pypi%20package&cacheSeconds=3600)](https://pypi.org/project/databricks-labs-dqx/)
-![PyPI - Downloads](https://img.shields.io/pypi/dm/databricks-labs-dqx?cacheSeconds=3600)
+![PyPI Downloads](https://static.pepy.tech/personalized-badge/databricks-labs-dqx?period=month&units=international_system&left_color=grey&right_color=orange&left_text=PyPI%20downloads&cacheSeconds=3600)
 
 # Documentation
```

demos/dqx_demo_tool.py (1 addition, 1 deletion)

```diff
@@ -34,7 +34,7 @@
 # MAGIC quarantine_config:
 # MAGIC   location: main.nytaxi.quarantine
 # MAGIC   mode: overwrite
-# MAGIC checks_file: checks.yml
+# MAGIC checks_location: checks.yml
 # MAGIC profiler_config:
 # MAGIC   summary_stats_file: profile_summary_stats.yml
 # MAGIC warehouse_id: your-warehouse-id
```

docs/dqx/docs/guide/data_profiling.mdx (10 additions, 11 deletions)

````diff
@@ -219,9 +219,9 @@ You can use `options` parameter to pass a dictionary with custom options when pr
 
 The following DQX configuration from 'config.yml' is used by the profiler workflow:
 - 'input_config': configuration for the input data.
-- 'checks_file': relative location of the generated quality rule candidates as `yaml` or `json` file inside the installation folder (default: `checks.yml`).
+- 'checks_location': relative location of the generated quality rule candidates as `yaml` or `json` file inside the workspace installation folder (default: `checks.yml`).
 - 'profiler_config': configuration for the profiler containing:
-  - 'summary_stats_file': relative location of the summary statistics (default: `profile_summary.yml`) inside the installation folder
+  - 'summary_stats_file': relative location of the summary statistics (default: `profile_summary.yml`) inside the workspace installation folder
   - 'sample_fraction': fraction of data to sample for profiling.
   - 'sample_seed': seed for reproducible sampling.
   - 'limit': maximum number of records to analyze.
@@ -348,7 +348,7 @@ The DLT generator creates Lakeflow Pipelines expectation statements from profile
 
 ## Storing Quality Checks
 
-You can save checks defined in code or generated by the profiler to a table or file as `yaml` or `json` in the local path, workspace or installation folder.
+You can save checks defined in code or generated by the profiler to a table or file as `yaml` or `json` in the local path, workspace, installation folder or Unity Catalog Volume file.
 
 <Tabs>
 <TabItem value="Python" label="Python" default>
@@ -358,7 +358,8 @@ You can save checks defined in code or generated by the profiler to a table or f
     FileChecksStorageConfig,
     WorkspaceFileChecksStorageConfig,
     InstallationChecksStorageConfig,
-    TableChecksStorageConfig
+    TableChecksStorageConfig,
+    VolumeFileChecksStorageConfig
 )
 from databricks.sdk import WorkspaceClient
 
@@ -383,11 +384,6 @@ You can save checks defined in code or generated by the profiler to a table or f
 # always overwrite the file
 dq_engine.save_checks(checks, config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))
 
-# save checks in file defined in 'checks_file' in the run config
-# always overwrite the file
-# only works if DQX is installed in the workspace
-dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(assume_user=True, run_config_name="default"))
-
 # save checks in a Delta table with default run config for filtering
 # append checks in the table for the default run config
 dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="dq.config.checks_table", mode="append"))
@@ -396,8 +392,11 @@ You can save checks defined in code or generated by the profiler to a table or f
 # overwrite checks in the table for the given run config
 dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="dq.config.checks_table", run_config_name="workflow_001", mode="overwrite"))
 
-# save checks in table defined in 'checks_table' in the run config
-# always overwrite checks in the table for the given run config
+# save checks in a Unity Catalog volume location
+# always overwrite the file
+dq_engine.save_checks(checks, config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))
+
+# save checks in file or table defined in 'checks_location' in the run config
 # only works if DQX is installed in the workspace
 dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(assume_user=True, run_config_name="default"))
 ```
````
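The file-backed stores above (workspace, local, volume) all boil down to serializing check metadata to a file at the configured location. A minimal stand-alone sketch of that round trip, using plain JSON and a local path rather than the real DQX storage classes (the function names here are hypothetical, not the DQX API):

```python
import json
from pathlib import Path

# Hypothetical stand-ins for DQX's file-backed checks storage; the real
# save_checks/load_checks live in databricks-labs-dqx. This only illustrates
# the save/load round trip of check metadata.
def save_checks_to_file(checks: list[dict], location: str) -> None:
    """Serialize checks to a JSON file, overwriting any existing file."""
    Path(location).write_text(json.dumps(checks, indent=2))

def load_checks_from_file(location: str) -> list[dict]:
    """Load and deserialize checks from a JSON file."""
    return json.loads(Path(location).read_text())

checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "vendor_id"}},
    }
]
save_checks_to_file(checks, "/tmp/checks.json")
assert load_checks_from_file("/tmp/checks.json") == checks
```

The same shape applies whether the location is a local path or a `/Volumes/...` path; only the filesystem underneath differs.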

docs/dqx/docs/guide/quality_checks.mdx (44 additions, 4 deletions)

````diff
@@ -21,7 +21,7 @@ There are several ways to define and apply quality checks in DQX:
 
 ## Quality Rules defined in a File
 
-Quality rules can be defined declaratively as part of `yaml` or `json` file and stored in the installation folder, workspace, or local file system.
+Quality rules can be defined declaratively as part of a `yaml` or `json` file and stored in the installation folder, workspace, local file system or Unity Catalog Volume file.
 
 Below is an example `yaml` file ('checks.yml') defining several checks:
 ```yaml
@@ -175,7 +175,7 @@ In addition, you can also perform a standalone syntax validation of the checks a
 
 dq_engine = DQEngine(WorkspaceClient())
 
-# Load check from the installation (from file defined in 'checks_file' in the run config)
+# Load checks from the installation (from file or table defined in 'checks_location' in the run config)
 # Only works if DQX is installed in the workspace
 default_checks = dq_engine.load_checks(config=InstallationChecksStorageConfig(assume_user=True, run_config_name="default"))
 workflow_checks = dq_engine.load_checks(config=InstallationChecksStorageConfig(assume_user=True, run_config_name="workflow_001"))
@@ -271,6 +271,46 @@ In addition, you can also perform a standalone syntax validation of the checks a
 </TabItem>
 </Tabs>
 
+#### Method 4: Loading and Applying Checks Defined in a Unity Catalog Volume File
+
+<Tabs>
+<TabItem value="Python" label="Python" default>
+```python
+from databricks.labs.dqx.config import OutputConfig, VolumeFileChecksStorageConfig
+from databricks.labs.dqx.engine import DQEngine
+from databricks.sdk import WorkspaceClient
+
+dq_engine = DQEngine(WorkspaceClient())
+
+# Load checks from multiple files and merge
+default_checks = dq_engine.load_checks(config=VolumeFileChecksStorageConfig(location="/Volumes/catalog/schema/App1/default_checks.yml"))
+workflow_checks = dq_engine.load_checks(config=VolumeFileChecksStorageConfig(location="/Volumes/catalog/schema/App1/workflow_checks.yml"))
+checks = default_checks + workflow_checks
+
+input_df = spark.read.table("catalog1.schema1.table1")
+
+# Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
+valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
+dq_engine.save_results_in_table(
+    output_df=valid_df,
+    quarantine_df=quarantine_df,
+    output_config=OutputConfig("catalog.schema.valid_data"),
+    quarantine_config=OutputConfig("catalog.schema.quarantine_data")
+)
+
+# Option 2: apply quality rules on the dataframe and report issues as additional columns (`_warning` and `_error`)
+valid_and_quarantine_df = dq_engine.apply_checks_by_metadata(input_df, checks)
+dq_engine.save_results_in_table(
+    output_df=valid_and_quarantine_df,
+    output_config=OutputConfig("catalog.schema.valid_and_quarantine_data")
+)
+```
+</TabItem>
+</Tabs>
+
 ## Quality Rules defined in a Delta Table
 
 Quality rules can be stored in a Delta table in Unity Catalog. Each row represents a check with column values for the `name`, `check`, `criticality`, `filter`, and `run_config_name`.
@@ -452,7 +492,7 @@ In addition, you can also perform a standalone syntax validation of the checks a
 # Load checks from the "workflow_001" run config
 workflow_checks = dq_engine.load_checks(config=TableChecksStorageConfig(location="dq.config.checks_table", run_config_name="workflow_001"))
 
-# Load checks from the installation (from a table defined in 'checks_table' in the run config)
+# Load checks from the installation (from a table defined in 'checks_location' in the run config)
 # Only works if DQX is installed in the workspace
 workflow_checks = dq_engine.load_checks(config=InstallationChecksStorageConfig(assume_user=True, run_config_name="workflow_001"))
 
@@ -1025,7 +1065,7 @@ The validation cannot be used for checks defined programmatically using [DQX cla
 ```
 
 The following DQX configuration from 'config.yml' will be used by default:
-- 'checks_file': relative location of the quality rules defined declaratively as `yaml` or `json` inside the installation folder (default: `checks.yml`).
+- 'checks_location': relative location of the quality rules defined declaratively as `yaml` or `json` inside the workspace installation folder (default: `checks.yml`).
 </TabItem>
 </Tabs>
````
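The Method 4 example merges check lists with plain concatenation (`default_checks + workflow_checks`), which keeps duplicates if the same check appears in both files. A small hypothetical helper (not part of DQX) that instead lets later locations override earlier ones by check name:

```python
# Hypothetical merge helper (not part of the DQX API): later lists win by
# check name, so a workflow-specific file can redefine a default check
# instead of duplicating it.
def merge_checks(*check_lists: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for checks in check_lists:
        for check in checks:
            # fall back to the function name when no explicit 'name' is set
            key = check.get("name") or check["check"]["function"]
            merged[key] = check
    return list(merged.values())

default_checks = [
    {"name": "vendor_not_null", "criticality": "warn",
     "check": {"function": "is_not_null", "arguments": {"column": "vendor_id"}}},
]
workflow_checks = [
    {"name": "vendor_not_null", "criticality": "error",  # override: escalate severity
     "check": {"function": "is_not_null", "arguments": {"column": "vendor_id"}}},
]
checks = merge_checks(default_checks, workflow_checks)
assert len(checks) == 1 and checks[0]["criticality"] == "error"
```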
docs/dqx/docs/installation.mdx (5 additions, 6 deletions)

```diff
@@ -140,18 +140,17 @@ run_configs:
   trigger: # <- streaming trigger, only applicable if input_config.is_streaming is enabled
     availableNow: true
 
-  quarantine_config: # <- quarantine data configuration, if specified, bad data is written to quarantine table
+  quarantine_config: # <- optional quarantine data configuration, if specified, bad data is written to quarantine table
     location: main.iot.silver_quarantine # <- quarantine location (table), used as input for quality dashboard
-    format: delta # <- format of the quarantine table
-    mode: append # <- write mode for the quarantine table (append or overwrite)
+    format: delta # <- format of the quarantine table (default: delta)
+    mode: append # <- write mode for the quarantine table (append or overwrite, default: append)
     options: # <- additional options for writing to the quarantine table (optional)
       mergeSchema: 'true'
     #checkpointLocation: /Volumes/catalog1/schema1/checkpoint # <- only applicable if input_config.is_streaming is enabled
-    trigger: # <- streaming trigger, only applicable if input_config.is_streaming is enabled
+    trigger: # <- optional streaming trigger, only applicable if input_config.is_streaming is enabled
       availableNow: true
 
-  checks_file: iot_checks.yml # <- relative location of the quality rules (checks) defined in json or yaml file
-  checks_table: main.iot.checks # <- table storing the quality rules (checks)
+  checks_location: iot_checks.yml # <- quality rules (checks) can be stored in a table or defined in JSON or YAML files, located at a relative workspace file path or volume file path
 
   profiler_config: # <- profiler configuration
     summary_stats_file: iot_summary_stats.yml # <- relative location of profiling summary stats
```