[Data-561] Reduce data_tranformation docker image #881
base: develop
Zhimin-arya commented on Sep 15, 2025
- Use the CPU-only torch package.
- Mount the linker folder as a Prefect volume instead of downloading the linkers during the image build.
Pull Request Overview
This PR optimizes the data_transformation Docker image by reducing its size through two main changes: switching to a CPU-only PyTorch package and removing the download of 2GB linker files during the build process.
- Replaces standard PyTorch with CPU-only version to reduce image size
- Removes downloading of large linker files (~2GB) from the Docker build process
- Provides documentation for mounting linker files via Prefect volumes instead
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| requirements_data_transformation.txt | Adds CPU-only PyTorch wheel package |
| ner_extract_plugin/README.md | Adds instructions for downloading linkers locally and mounting via Prefect |
| Dockerfile | Removes linker download commands and unused COPY instruction |
## Instruction for using ner_extract_plugin:
### 1. Download model linkers (~2GB) to user local folder
``` | ||
output_dir="path/to/linker/foler" && \
Typo in path: 'foler' should be 'folder'.
## Instruction for using ner_extract_plugin:
### 1. Download model linkers (~2GB) to user local folder
``` | ||
output_dir="path/to/linker/folder" && \
Should the user update "output_dir" and then run the command or just run as is?
What is a linker folder?
The user needs to update `output_dir` before running the command.
A linker is needed so that each entity extraction/recognition model can use a different knowledge base.
``` | ||
### 2. Mount the folder to Prefect Volume
In the `.env` file add `"path/to/linker/foler:/root/.scispacy/datasets"` to `PREFECT_DOCKER_VOLUMES_CUSTOM` variable
Should it be "output_dir" instead of "path/to/linker/folder"?
Also, it may be easier to document the command like this: https://github.com/data2evidence/data2evidence.github.io/blob/develop/docs/2-admin_guide/5-setup/data-load/6-load-synpuf1k.md#:~:text=Run%20the%20following%20commands%3A
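One way to keep the two README steps consistent is to derive the volume mapping from the same `output_dir` variable used in the download step. The following is only a sketch: the host path is a placeholder, the container path is the one quoted in the diff, and the exact syntax that `PREFECT_DOCKER_VOLUMES_CUSTOM` expects around the mapping is not shown in this PR, so only the `host:container` string is built here.

```shell
# Placeholder host folder (an assumption); replace with the real download location.
output_dir="path/to/linker/folder"

# Container path quoted in the README diff.
container_dir="/root/.scispacy/datasets"

# host:container mapping to add to PREFECT_DOCKER_VOLUMES_CUSTOM in the .env file.
volume_mapping="${output_dir}:${container_dir}"
echo "${volume_mapping}"
```

Reusing one variable means the folder is typed once, so the download step and the mount step cannot drift apart.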
scipy==1.16.0
# For CPU-only PyTorch
https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl
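A direct wheel URL pins more than the torch version: the filename also encodes the Python, ABI, and platform tags the image must match. The sketch below splits the filename from the diff into those fields, following the standard wheel naming scheme (`distribution-version-python-abi-platform.whl`). The variable names are illustrative, and `%2B` is the URL-encoded `+` of the `+cpu` local version.

```shell
# Wheel URL copied from the requirements diff.
wheel_url="https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl"

# Strip the URL path and the .whl suffix.
wheel_file="${wheel_url##*/}"
wheel_name="${wheel_file%.whl}"

# Split on hyphens into the five wheel filename fields.
IFS='-' read -r dist version py_tag abi_tag platform_tag <<EOF
${wheel_name}
EOF

echo "distribution: ${dist}"
echo "version:      ${version}"
echo "python tag:   ${py_tag}"
echo "abi tag:      ${abi_tag}"
echo "platform tag: ${platform_tag}"
```

The tags indicate that this wheel targets CPython 3.12 on x86_64 Linux, so the base image's Python version and architecture need to match for the install to succeed.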
What is the image size now after removing model linkers?
After removing the model linkers and switching to the CPU-only torch Python package, the image size is now 8.3 GB.
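For anyone who wants to reproduce the size check locally, a small helper along these lines could convert a byte count into decimal gigabytes. This is a sketch under assumptions: the image name in the comment is hypothetical, and the conversion uses 1 GB = 10^9 bytes, which may differ slightly from how a given tool rounds its output.

```shell
# Convert a size in bytes to decimal gigabytes with one decimal place.
bytes_to_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1000000000 }'
}

# Example with a literal byte count.
size_gb="$(bytes_to_gb 8300000000)"
echo "${size_gb} GB"

# With Docker available, the byte count could come from the image itself, e.g.:
#   bytes_to_gb "$(docker image inspect --format '{{.Size}}' my-data-transformation-image)"
# (my-data-transformation-image is a hypothetical image name.)
```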