Merged
Binary file added docs/images/eval_sample.gif
218 changes: 218 additions & 0 deletions docs/source/running_on_zipline_hub/Eval.md
@@ -0,0 +1,218 @@
# Eval - Configuration Validation

Eval provides fast configuration validation without running expensive production jobs. Use it to catch errors early in your development workflow.

## What Eval Checks

- All source tables exist and are accessible
- Column names and types match your configuration
- Query syntax is valid (for StagingQueries)
- Derivations compile and type-check correctly
- Dependencies between configurations resolve correctly

## Quick Schema Validation

The most common use case is to validate your configuration without running any computations:

```bash
zipline hub eval --conf compiled/joins/{team}/{your_conf}
```

This will show you the output schema and lineage, and catch configuration errors early. Example output:

```
🟢 Eval job finished successfully
Join Configuration: gcp.demo.user_features__1
- Left table: data.user_activity_7d__0
- Join parts: 2
- Conf dependencies: 3
- External tables: 2
- Output Schema:
[left] user_id: string
[left] event_timestamp: long
[left] ds: string
[joinPart: gcp.user_demographics__0] user_id_age: integer
[derivation] is_adult: boolean
Lineage:
[Join] gcp.demo.user_features__1
├── ✅ [GroupBy] gcp.user_activity_7d__0
│ └── External: project.events.user_clicks
└── ✅ [GroupBy] gcp.user_demographics__0
└── ✅ [StagingQuery] gcp.raw_demographics__0
```
Comment on lines +23 to +43

⚠️ Potential issue | 🟡 Minor

Specify language for code fence showing example output.

The example output block is missing a language identifier, which triggers a linter warning. Use text or plaintext:

-```
+```text
 🟢 Eval job finished successfully
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

23-23: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/source/running_on_zipline_hub/Eval.md around lines 23 to 43, the fenced
code block showing example output lacks a language identifier which triggers a
linter warning; update the opening triple backticks to include a language such
as text or plaintext (e.g. ```text) so the block is explicitly marked as plain
text and the linter warning is resolved.


![Eval command demonstration](../../images/eval_sample.gif)

## Testing with Sample Data

For deeper validation, provide sample data to see actual computation output:

```bash
# 1. Generate a test data skeleton
zipline hub eval --conf compiled/joins/{team}/{your_conf} --generate-skeleton

# 2. Fill in test-data.yaml with sample data (use !epoch for timestamps)

# 3. Run eval with test data
zipline hub eval --conf compiled/joins/{team}/{your_conf} --test-data-path test-data.yaml
```

This will show you the actual computed results with your sample data, helping you validate:
- Complex aggregations and window functions
- Derivation logic with concrete examples
- Join key matching behavior
- Null value handling
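
The layout of `test-data.yaml` is defined by the skeleton that `--generate-skeleton` emits, so treat the snippet below as a rough illustration only: the table name, column names, and nesting are taken from the example output above but the exact structure for your configuration may differ. It mainly shows the idea of listing sample rows per source table and tagging timestamps with `!epoch`:

```yaml
# Illustrative sketch only -- the real layout comes from --generate-skeleton.
# Table and column names below mirror the example join and are hypothetical.
data.user_activity_7d__0:
  - user_id: "user_1"
    event_timestamp: !epoch "2025-01-01T00:01:00Z"
    ds: "2025-01-01"
  - user_id: "user_2"
    event_timestamp: !epoch "2025-01-01T00:02:00Z"
    ds: "2025-01-01"
```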

## Local Eval for CI/CD

For automated testing in CI/CD pipelines without requiring metastore access, you can use the `ziplineai/local-eval` Docker image with a preloaded Iceberg warehouse.

### Key Benefits

- **No metastore required** - Completely offline testing
- **No cloud access needed** - Works in any CI environment
- **Fast feedback** - Validate configurations in pull requests
- **Full control** - Define your own test schemas and data
- **Ideal for CI systems** - GitHub Actions, GitLab CI, Jenkins, etc.

### Setup Steps

#### 1. Create Test Warehouse with PySpark

Create a Python script to build your Iceberg warehouse with test data. The example below is based on `platform/docker/eval/examples/build_warehouse.py`:

```python
#!/usr/bin/env python3
"""Build a local Iceberg warehouse with test data for Chronon Eval testing."""

import os
from datetime import datetime
from pyspark.sql import SparkSession

def epoch_millis(iso_timestamp):
    """Convert ISO timestamp to epoch milliseconds"""
    dt = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

def build_warehouse(warehouse_path, catalog_name="ci_catalog"):
    """Create Iceberg warehouse with test data"""

    print(f"Creating test warehouse at: {warehouse_path}")
    os.makedirs(warehouse_path, exist_ok=True)

    # Initialize Spark with Iceberg support
    spark = (
        SparkSession.builder
        .appName("chronon-test-warehouse-builder")
        .master("local[*]")
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config(f"spark.sql.catalog.{catalog_name}",
                "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{catalog_name}.type", "hadoop")
        .config(f"spark.sql.catalog.{catalog_name}.warehouse", warehouse_path)
        .getOrCreate()
    )

    # Create namespace
    print("Creating namespace 'data'...")
    spark.sql("CREATE NAMESPACE IF NOT EXISTS data")
    spark.sql(f"USE {catalog_name}")

    # Create table with schema
    print("Creating user_activities table...")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS data.user_activities__0 (
            user_id STRING,
            event_time_ms BIGINT,
            session_id STRING,
            event_type STRING,
            ds STRING
        ) USING iceberg
        PARTITIONED BY (ds)
    """)
```

Comment on lines +120 to +137

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Spark Iceberg CREATE NAMESPACE current catalog behavior

💡 Result:

Short answer

  • Spark tracks a current catalog + namespace; unqualified CREATE NAMESPACE uses the current catalog and namespace. [1]
  • Iceberg’s SparkCatalog / SparkSessionCatalog implement namespace methods, and Spark passes the currentCatalog/currentNamespace to those methods, so CREATE NAMESPACE on an Iceberg catalog creates the namespace inside that catalog. (Iceberg’s SparkCatalog delegates namespace operations to the underlying Iceberg Catalog.) [2][3]

Notes / gotchas

  • SparkSessionCatalog wraps/creates a SparkCatalog for Iceberg tables, so behavior is the same but the catalog is backed by the configured Iceberg implementation. [3]
  • There are edge reports where people expected a new Spark catalog entry to appear in the Hive metastore but it didn’t — catalog registration vs. namespace creation are different concerns. Check your catalog config (`spark.sql.catalog.<catalog>.*`) and whether you’re using a Hive vs. session catalog. [4][1]

Sources

  • Apache Iceberg Spark configuration / using catalogs. [1]
  • Iceberg SparkCatalog javadoc (namespace/table/view API; currentCatalog/currentNamespace parameters). [2]
  • Iceberg SparkSessionCatalog javadoc (wraps/creates SparkCatalog). [3]
  • GitHub issue discussing Hive metastore vs. Spark catalog expectations. [4]

🏁 Script executed:

# Search the documentation file to understand the full context
cd docs/source/running_on_zipline_hub || true
if [ -f "Eval.md" ]; then
  head -150 Eval.md | tail -40
else
  find . -name "Eval.md" -type f 2>/dev/null | head -5
fi

Repository: zipline-ai/chronon

Length of output: 1523


Ensure the namespace is created in the Iceberg catalog, not the default catalog.

The `CREATE NAMESPACE IF NOT EXISTS data` statement runs before `USE {catalog_name}`, so the namespace is created in the default `spark_catalog` instead of the target Iceberg catalog. This causes a mismatch: the namespace `data` exists in `spark_catalog`, but the table and write operations target `{catalog_name}.data`. Reorder to switch catalogs first:

-    print("Creating namespace 'data'...")
-    spark.sql("CREATE NAMESPACE IF NOT EXISTS data")
-    spark.sql(f"USE {catalog_name}")
+    print(f"Creating namespace 'data' in catalog '{catalog_name}'...")
+    spark.sql(f"USE {catalog_name}")
+    spark.sql("CREATE NAMESPACE IF NOT EXISTS data")
🤖 Prompt for AI Agents
In docs/source/running_on_zipline_hub/Eval.md around lines 120 to 137, the
CREATE NAMESPACE runs before switching to the Iceberg catalog so the namespace
is created in the default spark_catalog; move the catalog switch so
spark.sql(f"USE {catalog_name}") is executed before creating the namespace (or
alternatively create the namespace fully qualified as {catalog_name}.data), then
proceed to create the table—this ensures the namespace is created in the
intended Iceberg catalog.

```python
    # Insert test data
    user_activities_data = [
        ("user_1", epoch_millis("2025-01-01T00:01:00Z"), "session_1", "view", "2025-01-01"),
        ("user_2", epoch_millis("2025-01-01T00:02:00Z"), "session_2", "click", "2025-01-01"),
    ]

    df = spark.createDataFrame(
        user_activities_data,
        ["user_id", "event_time_ms", "session_id", "event_type", "ds"]
    )

    df.writeTo(f"{catalog_name}.data.user_activities__0").createOrReplace()
    print(f"✓ Inserted {df.count()} rows into user_activities__0")

    spark.stop()
    print(f"\n✓ Warehouse created successfully at: {warehouse_path}")

if __name__ == "__main__":
    build_warehouse("/tmp/chronon-test-warehouse")
```
Comment on lines +85 to +157

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

PySpark 3.5.0 Iceberg 1.4.3 compatibility production ready

💡 Result:

Short answer: Yes — Iceberg 1.4.3 provides a Spark 3.5 runtime and is intended to be used with Spark/PySpark 3.5.x in production, provided you use the matching runtime JAR (iceberg-spark-runtime-3.5_* with the correct Scala binary) and validate your Spark build/Scala version and any storage/SDK bundles you need. [1][2]

Practical checklist before production deploy:

  • Use the iceberg-spark-runtime-3.5_2.12 or _2.13 JAR that matches your Spark/PySpark Scala build. [2]
  • Don’t include other Iceberg modules (iceberg-core, etc.) on the Spark classpath — only the runtime and required storage bundles (aws/gcp/azure) to avoid dependency conflicts. [1][2]
  • Test your workload (reads, writes, partitioning, concurrency/transactions, compaction/manifest operations) end-to-end in a staging environment with your Spark distribution and JVM/Scala version. [1][2]
  • Keep an eye on vendor compatibility (if using a distribution like Cloudera, Databricks, etc.), since some distros bundle specific Iceberg versions. [3]

Sources:
[1] Apache Iceberg releases (1.4.3 includes Spark 3.5 runtime).
[2] Apache Iceberg multi-engine support / runtime guidance (Spark 3.5 integration and runtime-jar guidance).
[3] Example vendor compatibility notes (Cloudera) — verify distro-specific pinned Iceberg/Spark combinations.


Verify Scala binary version and add documentation for Iceberg JAR dependency management.

Iceberg 1.4.3 with Spark 3.5.x is production-ready, but the example should clarify that iceberg-spark-runtime-3.5_2.12 assumes Scala 2.12. Verify your Spark build uses matching Scala 2.12 (or use _2.13 variant if needed). Additionally, document that only the runtime JAR should be on the classpath—avoid including other Iceberg modules to prevent dependency conflicts. Recommend end-to-end testing in a staging environment with your actual Spark distribution and storage backend (S3/GCS/Azure) before production use.


Run this script to create your test warehouse:

```bash
python scripts/build_warehouse.py
```

#### 2. Start Local Eval Service

Run the local-eval Docker container with the warehouse mounted:

```bash
docker run -d \
--name chronon-eval-service \
-e CHRONON_ROOT=/configs \
-e CHRONON_EVAL_WAREHOUSE_PATH=/warehouse \
-e CHRONON_EVAL_WAREHOUSE_CATALOG=ci_catalog \
-v /tmp/chronon-test-warehouse:/warehouse:ro \
-v $(pwd):/configs:ro \
-p 3904:8080 \
ziplineai/local-eval:latest
```

**Environment variables:**
- `CHRONON_ROOT` - Path to directory containing `compiled/` configs (required)
- `CHRONON_EVAL_WAREHOUSE_PATH` - Path to Iceberg warehouse directory (required)
- `CHRONON_EVAL_WAREHOUSE_CATALOG` - Catalog name (must match catalog used when building warehouse)
- `SERVER_PORT` - HTTP port (default: `8080`)
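
If you prefer Compose over a raw `docker run`, the same container can be described declaratively. The sketch below simply restates the command above as a `docker-compose.yml`; the image, environment variables, mounts, and port mapping are taken from the example, and everything else should be adapted to your setup:

```yaml
# Sketch of the docker run command above as a Compose service.
services:
  chronon-eval-service:
    image: ziplineai/local-eval:latest
    environment:
      CHRONON_ROOT: /configs
      CHRONON_EVAL_WAREHOUSE_PATH: /warehouse
      CHRONON_EVAL_WAREHOUSE_CATALOG: ci_catalog
    volumes:
      - /tmp/chronon-test-warehouse:/warehouse:ro
      - .:/configs:ro
    ports:
      - "3904:8080"
```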

#### 3. Run Eval Commands

```bash
# Check service is running
curl http://localhost:3904/ping

# Run eval against local service
zipline hub eval --conf compiled/joins/{team}/{your_conf} --eval-url http://localhost:3904
```
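
Putting the three steps together, a CI job could look roughly like the GitHub Actions sketch below. It assumes the `zipline` CLI and PySpark are already installed on the runner, that your warehouse-builder script lives at `scripts/build_warehouse.py`, and that `my_team/my_join__0` stands in for a real compiled config; adjust names, paths, and versions to your repository.

```yaml
# Hedged sketch of a CI job; paths, wait times, and config names are placeholders.
name: eval-configs
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build test warehouse
        run: python scripts/build_warehouse.py
      - name: Start local eval service
        run: |
          docker run -d --name chronon-eval-service \
            -e CHRONON_ROOT=/configs \
            -e CHRONON_EVAL_WAREHOUSE_PATH=/warehouse \
            -e CHRONON_EVAL_WAREHOUSE_CATALOG=ci_catalog \
            -v /tmp/chronon-test-warehouse:/warehouse:ro \
            -v "$PWD":/configs:ro \
            -p 3904:8080 \
            ziplineai/local-eval:latest
          sleep 15
          curl --fail http://localhost:3904/ping
      - name: Compile and eval configs
        run: |
          zipline compile
          zipline hub eval --conf compiled/joins/my_team/my_join__0 --eval-url http://localhost:3904
```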

## Recommended Workflow

1. **During development**: Use quick schema validation
```bash
zipline compile
zipline hub eval --conf compiled/joins/{team}/{your_conf}
```

2. **Before submitting PR**: Test with sample data
```bash
zipline hub eval --conf compiled/joins/{team}/{your_conf} --test-data-path test-data.yaml
```

3. **In CI/CD**: Use local-eval with preloaded Iceberg warehouse
- Automated validation on every PR
- No cloud dependencies
- Fast feedback loop

4. **After PR approval**: Run backfill
```bash
zipline hub backfill --conf compiled/joins/{team}/{your_conf} --start-ds 2024-01-01 --end-ds 2024-01-02
```
63 changes: 63 additions & 0 deletions docs/source/running_on_zipline_hub/Test.md
@@ -16,6 +16,69 @@ This will show you errors if there are any in your definitions, and show you cha

Note, if you are making a change to any existing entities **without** changing the version it will prompt you with a warning.

## Eval

Before running expensive backfill jobs, use eval to quickly validate your configuration. Eval checks that:
- All source tables exist and are accessible
- Column names and types match your configuration
- Query syntax is valid (for StagingQueries)
- Derivations compile and type-check correctly
- Dependencies between configurations resolve correctly

### Quick Schema Validation

The most common use case is to validate your configuration without running any computations:

```bash
zipline hub eval --conf compiled/joins/{team}/{your_conf}
```

This will show you the output schema and lineage, and catch configuration errors early. Example output:

```
🟢 Eval job finished successfully
Join Configuration: gcp.demo.user_features__1
- Left table: data.user_activity_7d__0
- Join parts: 2
- Conf dependencies: 3
- External tables: 2
- Output Schema:
[left] user_id: string
[left] event_timestamp: long
[left] ds: string
[joinPart: gcp.user_demographics__0] user_id_age: integer
[derivation] is_adult: boolean

Lineage:
[Join] gcp.demo.user_features__1
├── ✅ [GroupBy] gcp.user_activity_7d__0
│ └── External: project.events.user_clicks
└── ✅ [GroupBy] gcp.user_demographics__0
└── ✅ [StagingQuery] gcp.raw_demographics__0
```
Comment on lines +38 to +58

⚠️ Potential issue | 🟡 Minor

Add a language to the example-output code fence (MD040).

The example output block is missing a language spec; consider marking it as plain text to satisfy markdownlint:

-```
+```text
🤖 Prompt for AI Agents
In docs/source/running_on_zipline_hub/Test.md around lines 38 to 58 the
example-output code fence is missing a language spec which triggers MD040;
update the opening triple-backtick to include a language (e.g., change ``` to
```text) so the block is explicitly marked as plain text and save the file.


![Eval command demonstration](../../images/eval_sample.gif)

### Testing with Sample Data

For deeper validation, provide sample data to see actual computation output:

```bash
# 1. Generate a test data skeleton
zipline hub eval --conf compiled/joins/{team}/{your_conf} --generate-skeleton

# 2. Fill in test-data.yaml with sample data (use !epoch for timestamps)

# 3. Run eval with test data
zipline hub eval --conf compiled/joins/{team}/{your_conf} --test-data-path test-data.yaml
```

This will show you the actual computed results with your sample data, helping you validate:
- Complex aggregations and window functions
- Derivation logic with concrete examples
- Join key matching behavior
- Null value handling

## Backfill

```sh
...
```
2 changes: 1 addition & 1 deletion python/test/canary/compiled/models/gcp/listing.v1__2
