
Investigation: New timeseries metrics models are too large for current dbt configuration #1970

@ravenac95

Description

What is it?

The timeseries metrics models have proven too large for dbt and BigQuery to execute the way we need them to be executed.

Several approaches were attempted to get these models working properly, which has unfortunately led to a larger investigation.

Things tried:

  • Split the CTEs that compose the int_timeseries_ models into smaller models. This led to the discovery that generating 0 events for all artifacts creates far too large a model (55TB for developers * artifacts * days); see the back-of-envelope sketch after this list.
  • SQLMesh against ClickHouse
    • Sadly, SQLMesh doesn't have a ClickHouse "engine" yet, so I attempted to use the MySQL engine instead. Because ClickHouse isn't a complete MySQL implementation, that failed.

The remaining methods to try (in order) now are:

  • Use SQLMesh against BigQuery
  • Use Dagster against ClickHouse, executing our own SQL against ClickHouse (a rough sketch of what this could look like follows this list)
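
As a rough sketch of the Dagster-against-ClickHouse option, assuming clickhouse-connect as the client library: a Dagster asset that pushes the aggregation down into ClickHouse by executing our own SQL. The table names, SQL, and connection settings here are hypothetical placeholders, not a proposed implementation.

```python
# Minimal sketch of a Dagster asset that runs our own SQL directly against
# ClickHouse. Table names, SQL, and connection settings are hypothetical
# placeholders; assumes the clickhouse-connect client library.
import os

import clickhouse_connect
from dagster import asset


@asset
def int_timeseries_events_daily() -> None:
    """Materialize a daily timeseries rollup by executing SQL in ClickHouse."""
    client = clickhouse_connect.get_client(
        host=os.environ["CLICKHOUSE_HOST"],
        username=os.environ.get("CLICKHOUSE_USER", "default"),
        password=os.environ.get("CLICKHOUSE_PASSWORD", ""),
        secure=True,
    )
    # Let ClickHouse do the heavy aggregation instead of pulling rows out;
    # the INSERT ... SELECT below stands in for the real metrics SQL.
    client.command(
        """
        INSERT INTO metrics.int_timeseries_events_daily
        SELECT
            toDate(time) AS bucket_day,
            artifact_id,
            event_type,
            sum(amount) AS amount
        FROM source.events
        GROUP BY bucket_day, artifact_id, event_type
        """
    )
```

Partitioning, scheduling, and retries would layer on top through Dagster's usual asset configuration.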

Other things that have been considered and not tried:

  • Spark
    • For the queries we'd like to run, having a Spark cluster could make them feasible against our BigQuery dataset
  • Trino
    • Similar to Spark. This gives us the leverage needed to run those queries.
  • Apache Iceberg
  • Delta Lake

The main reason we haven't pursued most of these is that we want to be able to expose exploration of our metrics' dimensions to users. These options could still be useful in that world.
