# docs: eval usage on site #1321

New file: `docs/source/running_on_zipline_hub/Eval.md` (`@@ -0,0 +1,218 @@`)

# Eval - Configuration Validation

Eval provides fast configuration validation without running expensive production jobs. Use it to catch errors early in your development workflow.

## What Eval Checks

- All source tables exist and are accessible
- Column names and types match your configuration
- Query syntax is valid (for StagingQueries)
- Derivations compile and type-check correctly
- Dependencies between configurations resolve correctly

## Quick Schema Validation

The most common use case is to validate your configuration without running any computations:

```bash
zipline hub eval --conf compiled/joins/{team}/{your_conf}
```

This will show you the output schema and lineage, and catch configuration errors early. Example output:

```text
🟢 Eval job finished successfully
Join Configuration: gcp.demo.user_features__1
- Left table: data.user_activity_7d__0
- Join parts: 2
- Conf dependencies: 3
- External tables: 2
- Output Schema:
  [left] user_id: string
  [left] event_timestamp: long
  [left] ds: string
  [joinPart: gcp.user_demographics__0] user_id_age: integer
  [derivation] is_adult: boolean

Lineage:
[Join] gcp.demo.user_features__1
├── ✅ [GroupBy] gcp.user_activity_7d__0
│   └── External: project.events.user_clicks
└── ✅ [GroupBy] gcp.user_demographics__0
    └── ✅ [StagingQuery] gcp.raw_demographics__0
```

## Testing with Sample Data

For deeper validation, provide sample data to see actual computation output:

```bash
# 1. Generate a test data skeleton
zipline hub eval --conf compiled/joins/{team}/{your_conf} --generate-skeleton

# 2. Fill in test-data.yaml with sample data (use !epoch for timestamps)

# 3. Run eval with test data
zipline hub eval --conf compiled/joins/{team}/{your_conf} --test-data-path test-data.yaml
```
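
For reference, a filled-in file might look like the sketch below. This is a hypothetical illustration: the skeleton generated by `--generate-skeleton` determines the real structure and lists the tables and columns your configuration actually reads. The table and column names here are borrowed from the example output above, and the `!epoch` tag marks ISO timestamps for conversion.

```yaml
# Hypothetical test-data.yaml -- structure and names are illustrative;
# start from the file that --generate-skeleton emits for your config.
data.user_activity_7d__0:
  - user_id: "user_1"
    event_timestamp: !epoch "2025-01-01T00:01:00Z"  # !epoch converts ISO timestamps
    ds: "2025-01-01"
  - user_id: "user_2"
    event_timestamp: !epoch "2025-01-01T00:02:00Z"
    ds: "2025-01-01"
```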
|
|
This will show you the actual computed results with your sample data, helping you validate:

- Complex aggregations and window functions
- Derivation logic with concrete examples
- Join key matching behavior
- Null value handling

## Local Eval for CI/CD

For automated testing in CI/CD pipelines without requiring metastore access, you can use the `ziplineai/local-eval` Docker image with a preloaded Iceberg warehouse.

### Key Benefits

- **No metastore required** - Completely offline testing
- **No cloud access needed** - Works in any CI environment
- **Fast feedback** - Validate configurations in pull requests
- **Full control** - Define your own test schemas and data
- **Ideal for CI systems** - GitHub Actions, GitLab CI, Jenkins, etc.

### Setup Steps

#### 1. Create Test Warehouse with PySpark

Create a Python script to build your Iceberg warehouse with test data. The example below is based on `platform/docker/eval/examples/build_warehouse.py`:

```python
#!/usr/bin/env python3
"""Build a local Iceberg warehouse with test data for Chronon Eval testing."""

import os
from datetime import datetime

from pyspark.sql import SparkSession


def epoch_millis(iso_timestamp):
    """Convert an ISO timestamp to epoch milliseconds."""
    dt = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)


def build_warehouse(warehouse_path, catalog_name="ci_catalog"):
    """Create an Iceberg warehouse with test data."""
    print(f"Creating test warehouse at: {warehouse_path}")
    os.makedirs(warehouse_path, exist_ok=True)

    # Initialize Spark with Iceberg support
    spark = (
        SparkSession.builder
        .appName("chronon-test-warehouse-builder")
        .master("local[*]")
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config(f"spark.sql.catalog.{catalog_name}",
                "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{catalog_name}.type", "hadoop")
        .config(f"spark.sql.catalog.{catalog_name}.warehouse", warehouse_path)
        .getOrCreate()
    )

    # Switch to the Iceberg catalog before creating the namespace, so the
    # namespace lands in ci_catalog rather than in the default catalog
    print(f"Creating namespace 'data' in catalog '{catalog_name}'...")
    spark.sql(f"USE {catalog_name}")
    spark.sql("CREATE NAMESPACE IF NOT EXISTS data")

    # Create table with schema
    print("Creating user_activities table...")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS data.user_activities__0 (
            user_id STRING,
            event_time_ms BIGINT,
            session_id STRING,
            event_type STRING,
            ds STRING
        ) USING iceberg
        PARTITIONED BY (ds)
    """)

    # Insert test data
    user_activities_data = [
        ("user_1", epoch_millis("2025-01-01T00:01:00Z"), "session_1", "view", "2025-01-01"),
        ("user_2", epoch_millis("2025-01-01T00:02:00Z"), "session_2", "click", "2025-01-01"),
    ]

    df = spark.createDataFrame(
        user_activities_data,
        ["user_id", "event_time_ms", "session_id", "event_type", "ds"],
    )

    df.writeTo(f"{catalog_name}.data.user_activities__0").createOrReplace()
    print(f"✓ Inserted {df.count()} rows into user_activities__0")

    spark.stop()
    print(f"\n✓ Warehouse created successfully at: {warehouse_path}")


if __name__ == "__main__":
    build_warehouse("/tmp/chronon-test-warehouse")
```
|
Note: Iceberg 1.4.3 pairs with Spark 3.5.x, but the runtime JAR must match your Spark build's Scala binary version. The `iceberg-spark-runtime-3.5_2.12` artifact used above assumes Scala 2.12; if your Spark distribution is built against Scala 2.13, use the `_2.13` artifact instead.

Run this script to create your test warehouse:

```bash
python scripts/build_warehouse.py
```
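
If the script succeeds, its print statements produce output along these lines (interleaved with Spark's own logging; the row count reflects the two sample rows defined above):

```text
Creating test warehouse at: /tmp/chronon-test-warehouse
Creating namespace 'data' in catalog 'ci_catalog'...
Creating user_activities table...
✓ Inserted 2 rows into user_activities__0

✓ Warehouse created successfully at: /tmp/chronon-test-warehouse
```
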
#### 2. Start Local Eval Service

Run the local-eval Docker container with the warehouse mounted:

```bash
docker run -d \
  --name chronon-eval-service \
  -e CHRONON_ROOT=/configs \
  -e CHRONON_EVAL_WAREHOUSE_PATH=/warehouse \
  -e CHRONON_EVAL_WAREHOUSE_CATALOG=ci_catalog \
  -v /tmp/chronon-test-warehouse:/warehouse:ro \
  -v $(pwd):/configs:ro \
  -p 3904:8080 \
  ziplineai/local-eval:latest
```

**Environment variables:**

- `CHRONON_ROOT` - Path to directory containing `compiled/` configs (required)
- `CHRONON_EVAL_WAREHOUSE_PATH` - Path to Iceberg warehouse directory (required)
- `CHRONON_EVAL_WAREHOUSE_CATALOG` - Catalog name (must match the catalog used when building the warehouse)
- `SERVER_PORT` - HTTP port (default: `8080`)
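
If you prefer Compose for local runs, the same container can be described declaratively. The sketch below simply mirrors the `docker run` command above and assumes your configs live in the current directory:

```yaml
# Hypothetical docker-compose.yml mirroring the docker run command above.
services:
  chronon-eval-service:
    image: ziplineai/local-eval:latest
    environment:
      CHRONON_ROOT: /configs
      CHRONON_EVAL_WAREHOUSE_PATH: /warehouse
      CHRONON_EVAL_WAREHOUSE_CATALOG: ci_catalog
    volumes:
      - /tmp/chronon-test-warehouse:/warehouse:ro
      - .:/configs:ro
    ports:
      - "3904:8080"
```
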
#### 3. Run Eval Commands

```bash
# Check service is running
curl http://localhost:3904/ping

# Run eval against local service
zipline hub eval --conf compiled/joins/{team}/{your_conf} --eval-url http://localhost:3904
```
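
Putting the three steps together in CI, a GitHub Actions job might look like the sketch below. It is a minimal illustration under stated assumptions: the warehouse script lives at `scripts/build_warehouse.py`, PySpark and the `zipline` CLI are installable on the runner, and compiled configs sit under `compiled/joins/`.

```yaml
# Hypothetical GitHub Actions workflow -- paths, versions, and install
# steps are assumptions; adapt them to your repository.
name: eval-configs
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Build the Iceberg test warehouse (script from step 1 above)
      - name: Build test warehouse
        run: |
          pip install pyspark==3.5.1
          python scripts/build_warehouse.py

      # Start the local eval service against the warehouse (step 2)
      - name: Start local eval service
        run: |
          docker run -d \
            --name chronon-eval-service \
            -e CHRONON_ROOT=/configs \
            -e CHRONON_EVAL_WAREHOUSE_PATH=/warehouse \
            -e CHRONON_EVAL_WAREHOUSE_CATALOG=ci_catalog \
            -v /tmp/chronon-test-warehouse:/warehouse:ro \
            -v "$(pwd)":/configs:ro \
            -p 3904:8080 \
            ziplineai/local-eval:latest
          # Wait until the service answers /ping
          timeout 60 sh -c 'until curl -sf http://localhost:3904/ping; do sleep 2; done'

      # Validate every compiled join config (step 3); assumes the zipline
      # CLI is already installed on the runner
      - name: Run eval
        run: |
          for conf in compiled/joins/*/*; do
            zipline hub eval --conf "$conf" --eval-url http://localhost:3904
          done
```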
|
|
## Recommended Workflow

1. **During development**: Use quick schema validation

   ```bash
   zipline compile
   zipline hub eval --conf compiled/joins/{team}/{your_conf}
   ```

2. **Before submitting a PR**: Test with sample data

   ```bash
   zipline hub eval --conf compiled/joins/{team}/{your_conf} --test-data-path test-data.yaml
   ```

3. **In CI/CD**: Use local-eval with a preloaded Iceberg warehouse
   - Automated validation on every PR
   - No cloud dependencies
   - Fast feedback loop

4. **After PR approval**: Run backfill

   ```bash
   zipline hub backfill --conf compiled/joins/{team}/{your_conf} --start-ds 2024-01-01 --end-ds 2024-01-02
   ```

---

Modified file: `@@ -16,6 +16,69 @@ This will show you errors if there are any in your definitions, and show you cha…`

Note, if you are making a change to any existing entities **without** changing the version, it will prompt you with a warning.

## Eval

Before running expensive backfill jobs, use eval to quickly validate your configuration. Eval checks that:

- All source tables exist and are accessible
- Column names and types match your configuration
- Query syntax is valid (for StagingQueries)
- Derivations compile and type-check correctly
- Dependencies between configurations resolve correctly

### Quick Schema Validation

The most common use case is to validate your configuration without running any computations:

```bash
zipline hub eval --conf compiled/joins/{team}/{your_conf}
```

This will show you the output schema and lineage, and catch configuration errors early. Example output:

```text
🟢 Eval job finished successfully
Join Configuration: gcp.demo.user_features__1
- Left table: data.user_activity_7d__0
- Join parts: 2
- Conf dependencies: 3
- External tables: 2
- Output Schema:
  [left] user_id: string
  [left] event_timestamp: long
  [left] ds: string
  [joinPart: gcp.user_demographics__0] user_id_age: integer
  [derivation] is_adult: boolean

Lineage:
[Join] gcp.demo.user_features__1
├── ✅ [GroupBy] gcp.user_activity_7d__0
│   └── External: project.events.user_clicks
└── ✅ [GroupBy] gcp.user_demographics__0
    └── ✅ [StagingQuery] gcp.raw_demographics__0
```
|
### Testing with Sample Data

For deeper validation, provide sample data to see actual computation output:

```bash
# 1. Generate a test data skeleton
zipline hub eval --conf compiled/joins/{team}/{your_conf} --generate-skeleton

# 2. Fill in test-data.yaml with sample data (use !epoch for timestamps)

# 3. Run eval with test data
zipline hub eval --conf compiled/joins/{team}/{your_conf} --test-data-path test-data.yaml
```
|
|
This will show you the actual computed results with your sample data, helping you validate:

- Complex aggregations and window functions
- Derivation logic with concrete examples
- Join key matching behavior
- Null value handling

## Backfill

```sh