What is it?
The timeseries metrics have proven too large for dbt and BigQuery to execute the way we intend them to be executed. Several approaches were attempted to get them working properly, which has unfortunately led to a larger investigation.
Things tried:
- Split the CTEs that compose the `int_timeseries_` models into smaller models. This led to the discovery that generating 0 events for all artifacts created too large a model (55TB for developers * artifacts * days); a rough sizing sketch follows this list.
- SQLMesh against ClickHouse
  - Sadly, SQLMesh doesn't have a ClickHouse "engine" yet. I attempted to use the MySQL engine instead, but because ClickHouse isn't a complete MySQL implementation, that failed.
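
For a sense of scale, here's a back-of-envelope sketch of how the densified cross join balloons. Every count and the bytes-per-row figure below are hypothetical placeholders chosen to land near the 55TB figure above, not measured values:

```python
# Rough sizing of a dense developers * artifacts * days event table.
# All counts here are illustrative placeholders, not actual OSO numbers.
developers = 100_000   # hypothetical distinct developers
artifacts = 10_000     # hypothetical distinct artifacts (repos, packages, ...)
days = 1_095           # hypothetical three-year window

rows = developers * artifacts * days
bytes_per_row = 50     # hypothetical: a few keys, a date, a metric value

total_tb = rows * bytes_per_row / 1e12
print(f"{rows:,} rows ~= {total_tb:,.0f} TB")  # 1,095,000,000,000 rows ~= 55 TB
```

The takeaway is that densifying events over the full cross product grows multiplicatively with each dimension, so no amount of splitting the CTEs makes the materialized result smaller.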
The remaining methods to try (in order) are:
- Use SQLMesh against BigQuery
- Use Dagster against ClickHouse, with our own SQL execution (a sketch follows this list)
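
As a rough illustration of that second option, here's a minimal sketch of a Dagster asset issuing our own SQL against ClickHouse via `clickhouse-connect`. The host, table names, and query are hypothetical placeholders, not our actual pipeline:

```python
# Minimal sketch: a Dagster asset that executes our own SQL on ClickHouse.
import clickhouse_connect
from dagster import asset


@asset
def int_timeseries_events() -> None:
    # clickhouse-connect talks to ClickHouse over its HTTP interface;
    # the host and credentials here are placeholders.
    client = clickhouse_connect.get_client(
        host="clickhouse.internal", username="default"
    )
    # Hypothetical aggregation; in practice this would be one of the
    # timeseries metric statements dbt + BigQuery could not handle.
    client.command(
        """
        INSERT INTO metrics.int_timeseries_events
        SELECT artifact_id, event_date, sum(amount)
        FROM metrics.raw_events
        GROUP BY artifact_id, event_date
        """
    )
```

Because we issue the SQL ourselves, nothing depends on a SQLMesh/dbt adapter existing for ClickHouse; Dagster only provides orchestration and dependency tracking around the statements.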
Other things that have been considered but not tried:
- Spark
  - For the queries we'd like to run, having a Spark cluster could make them possible against our BigQuery dataset (see the sketch below).
- Trino
  - Similar to Spark, this would give us the leverage to run those queries.
- Apache Iceberg
- Delta Lake
The main reason we didn't pursue most of these is that we want to be able to expose exploring the dimensions of our metrics to users. These options could still be useful in that world.
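
For completeness, here's a minimal sketch of what the Spark route might look like, reading our BigQuery dataset through the spark-bigquery connector so the heavy joins run on a Spark cluster. The table name and connector version are assumptions, not something we've run:

```python
# Minimal sketch: querying the BigQuery dataset from a Spark cluster
# via the spark-bigquery connector.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("timeseries-metrics")
    # Connector coordinates are an assumption; pin whatever version
    # matches the cluster's Spark/Scala build.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
    )
    .getOrCreate()
)

# Hypothetical source table; any of our BigQuery event tables would slot in here.
events = (
    spark.read.format("bigquery")
    .option("table", "oso.events.timeseries")
    .load()
)

daily = events.groupBy("artifact_id", "event_date").sum("amount")
daily.show()
```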