-
Notifications
You must be signed in to change notification settings - Fork 23
Tweaks for waiting for jobs #1262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
DEFAULT_SKIP = 0 | ||
DEFAULT_LIMIT = 100 | ||
DEFAULT_POST_INFER_JOB_TIMEOUT_SEC = 10 * 60 | ||
DEFAULT_POST_INFER_JOB_TIMEOUT_SEC = 5 * 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes it's hard to pin the right time. I'd suggest using the extra param to set the waiting time in tests and avoid modifying the default, but let's wait until we get feedback from the people running their own jobs.
async def wait_for_job_complete( | ||
self, | ||
job_id: UUID, | ||
timeout_seconds: int = 300, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No defaults at this place. Defaults only at top level, please. Also, I'm not sure we need a complex backoff scheme, but I don't have firm arguments against it at the moment :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean just the timeout_seconds
or all the defaults there?
I understand why you might want constant defaults declared separately to help with centralisation/management, but to me it makes sense having them inline at the moment. Here's my summary of 'why'...
- Function signature clearly provides the values
- Reduces cognitive load in having to track/lookup the defaults
- Encapsulates the logic within the required scope, changing them doesn't impact other code not related to waiting for a job
- Not being used anywhere else at the moment
If we need to re-use them later, we can just extract them to private class constants rather than at the top of the file?
Not that it means "it's fine" but we also already have inline defaults all through the code, in function calls and schemas.
def wait_for_workflow_complete( | ||
local_client: TestClient, | ||
workflow_id: UUID, | ||
timeout_seconds: int = 300, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No defaults at this level, again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous (test) code already had magic numbers further down the method (300
iterations and 1
second sleeps), I don't think this change actually makes things worse as it makes the values and purpose clear in the method signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that doesn't mean it was ok :)
accelerate==1.5.2 | ||
datasets==2.19.1 | ||
langcodes==3.5.0 | ||
litellm==1.60.6 | ||
loguru==0.7.2 | ||
pydantic>=2.10.0 | ||
python-box==7.2.0 | ||
requests-mock==1.12.1 | ||
s3fs==2024.5.0 | ||
litellm==1.63.12 | ||
loguru==0.7.3 | ||
numpy==1.26.3 | ||
pandas==2.2.3 | ||
pydantic==2.10.6 | ||
python-box==7.3.2 | ||
s3fs==2024.2.0 | ||
sentencepiece==0.2.0 | ||
torch==2.5.1 | ||
transformers==4.46.3 | ||
torch==2.6.0 | ||
transformers==4.49.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ividal can the ML team please check these versions just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only comment: please leave defaults at the top level, most probably at the interface with external settings.
wait_for_job_complete
) and in tests (wait_for_workflow_complete
) to see if it improved anything (this was more of a hail mary but I don't think it hurts)How to test it
Steps to test the changes:
Additional notes for reviewers
Locally I see jobs complete in <30secs .. so I'm interested in figuring out why the integration tests in GH are taking so long they're timing out. 😭
The dependency and notebook changes should stop this happening in all the other PRs
I already...
/docs
)