[CT-1105] [Feature] Enable Snowflake imports (stages) in addition to packages for model config #245
Description
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
Community slack discussion: https://getdbt.slack.com/archives/C03QUA7DWCW/p1661529180748549
Currently, Python models in dbt on Snowflake let you specify packages in the config as a list of Python packages to include. Through this method, Snowflake is limited to the packages in its Anaconda channel: https://repo.anaconda.com/pkgs/snowflake/
Custom packages, along with other scenarios (storing ML model artifacts, reading arbitrary config files), are enabled via IMPORTS in a stored procedure. There is a corresponding add_import method in the Python connector: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#module-snowflake.snowpark
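For reference, a minimal sketch of using add_import directly via the Snowpark connector, outside of dbt (the stage name and connection parameters are placeholders):

# Minimal sketch, assuming a stage @mystage that already contains myfile.py.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
}).create()

# Makes the staged file available to UDFs/stored procedures created in this session.
session.add_import("@mystage/myfile.py")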
We should enable this for Snowflake in the config, so:
dbt.config(
    materialized = "table",
    packages = ["holidays"],
    imports = ["@mystage/myfile.py"]
)
or in YAML:
- name: mymodel
  config:
    packages:
      - numpy
      - pandas
      - scikit-learn
    imports:
      - "@mystage/file1"
      - "@myotherstage/file2"

The expectation is that these files are available for use from the working directory of the Python process.
Describe alternatives you've considered
Allowing arbitrary object storage for use in the dbt DAG raises tracking/lineage and governance questions. Say you've used an ML model binary, uploaded manually to a stage, to make predictions that power a key metric or dashboard, and something goes wrong -- but the model was deleted. Is it traceable?
If you're using Python files and importing them (effectively package management), are they version-controlled with git? How do you manage the code?
I think allowing this raises broader questions, but enabling it as-is seems relatively low-effort and high-impact, so it's worth doing sooner rather than later. We can then think about standardizing across backends in a later iteration.
Who will this benefit?
Snowflake Snowpark users of Python models. This is a critical feature for using Python packages not available in Snowflake's Anaconda channel, and for other scenarios like machine learning.
Are you interested in contributing this feature?
maybe ;) no, requires writing SQL and not just Python ;)
Anything else?
No response