@peteski22 peteski22 commented Mar 20, 2025

  • Adjust the default timeout for jobs (5 minutes per job)
  • Adjust some test timeouts for jobs (3 minutes per job)
  • Update the code used in the job service (wait_for_job_complete) and in tests (wait_for_workflow_complete) to see whether it improves anything (this was more of a Hail Mary, but I don't think it hurts)
  • Add a temporary 'fix' to the Notebooks: sleep after a job completes, but before we try to get the dataset that should have been uploaded for it (see: [BUG]: SDK - Trying to get dataset for job immediately after a successful completion response can error #1263)
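
The temporary notebook workaround in the last bullet could look roughly like the sketch below. It is an illustration only, not the notebook code: `fetch_after_settle`, `settle_seconds`, and the `fetch` callable are hypothetical names, and the 5-second pause is an assumed value.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_after_settle(
    fetch: Callable[[], T],
    settle_seconds: float = 5.0,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Pause after a job reports success, then fetch its dataset.

    `fetch` is a hypothetical zero-argument callable standing in for the
    SDK call that retrieves the dataset uploaded by the job.
    """
    sleep(settle_seconds)  # give the upload time to become visible
    return fetch()
```

The `sleep` parameter is injectable so the pause can be replaced in tests. A fixed pause only hides the race described in #1263; it does not remove it.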

How to test it

Steps to test the changes:

Additional notes for reviewers

Locally I see jobs complete in under 30 seconds, so I'm interested in figuring out why the integration tests in GH take so long that they time out. 😭

image

The dependency and notebook changes should stop this from happening in all the other PRs.

image

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required

@peteski22 peteski22 added the do-not-merge label ("PRs that should NOT be merged while this label is present") Mar 20, 2025
@github-actions github-actions bot added the backend and schemas labels (schemas: "Changes to schemas (which may be public facing)") Mar 20, 2025
@github-actions github-actions bot added the documentation label ("Improvements or additions to documentation") Mar 21, 2025
@peteski22 peteski22 removed the do-not-merge label ("PRs that should NOT be merged while this label is present") Mar 21, 2025
@peteski22 peteski22 marked this pull request as ready for review March 21, 2025 07:58
@peteski22 peteski22 requested review from agpituk and macaab26 March 21, 2025 07:59
@peteski22 peteski22 changed the title from "Testing: Attempting to review wait_for_x code to see if we can see any issues" to "Update dependencies for Ray jobs + fix notebook walkthrough + tweaks for waiting for jobs" Mar 21, 2025
@peteski22 peteski22 enabled auto-merge (squash) March 21, 2025 09:15
  DEFAULT_SKIP = 0
  DEFAULT_LIMIT = 100
- DEFAULT_POST_INFER_JOB_TIMEOUT_SEC = 10 * 60
+ DEFAULT_POST_INFER_JOB_TIMEOUT_SEC = 5 * 60
Contributor

Sometimes it's hard to pin down the right timeout. I'd suggest using the extra param to set the waiting time in tests and leaving the default unmodified, but let's wait until we get feedback from the people running their own jobs.

async def wait_for_job_complete(
    self,
    job_id: UUID,
    timeout_seconds: int = 300,
Contributor

No defaults at this place. Defaults only at top level, please. Also, I'm not sure we need a complex backoff scheme, but I don't have firm arguments against it at the moment :-/

Contributor Author

Do you mean just the timeout_seconds or all the defaults there?

I understand why you might want constant defaults declared separately to help with centralisation/management, but to me it makes sense having them inline at the moment. Here's my summary of 'why'...

  • Function signature clearly provides the values
  • Reduces cognitive load in having to track/lookup the defaults
  • Encapsulates the logic within the required scope; changing them doesn't impact other code not related to waiting for a job
  • Not being used anywhere else at the moment

If we need to re-use them later, we can just extract them to private class constants rather than at the top of the file?

Not that it means "it's fine" but we also already have inline defaults all through the code, in function calls and schemas.
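
For reference, one way to keep the values visible in the signature while still declaring them once is sketched below. This is not the code in this PR: `wait_for_status`, the constant names, the backoff values, and the status strings are all assumptions made for illustration.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical module-level defaults (the placement the reviewer asks for).
DEFAULT_TIMEOUT_SECONDS = 300.0
DEFAULT_INITIAL_POLL_SECONDS = 1.0
DEFAULT_MAX_POLL_SECONDS = 10.0
DEFAULT_BACKOFF_FACTOR = 1.5

TERMINAL_STATUSES = {"succeeded", "failed", "stopped"}  # assumed status names


async def wait_for_status(
    get_status: Callable[[], Awaitable[str]],
    timeout_seconds: float = DEFAULT_TIMEOUT_SECONDS,
    initial_poll_seconds: float = DEFAULT_INITIAL_POLL_SECONDS,
    max_poll_seconds: float = DEFAULT_MAX_POLL_SECONDS,
    backoff_factor: float = DEFAULT_BACKOFF_FACTOR,
) -> str:
    """Poll until a terminal status is seen or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    delay = initial_poll_seconds
    while True:
        status = await get_status()
        if status in TERMINAL_STATUSES:
            return status
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds}s")
        # Never sleep past the deadline; grow the delay up to the cap.
        await asyncio.sleep(min(delay, remaining))
        delay = min(delay * backoff_factor, max_poll_seconds)
```

The signature still names every knob, and each value lives in one place. Whether the backoff is worth its extra parameters over a fixed poll interval is the open question raised above.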

def wait_for_workflow_complete(
    local_client: TestClient,
    workflow_id: UUID,
    timeout_seconds: int = 300,
Contributor

No defaults at this level, again.

Contributor Author

The previous (test) code already had magic numbers further down the method (300 iterations and 1-second sleeps). I don't think this change makes things worse, as it makes the values and their purpose clear in the method signature.

Contributor

Yes, that doesn't mean it was ok :)
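
To make the trade-off in this thread concrete, here is a minimal sketch of a deadline-based test helper that replaces the "300 iterations and 1-second sleeps" pattern. The names `wait_until`, `WORKFLOW_TIMEOUT_SECONDS`, and `WORKFLOW_POLL_SECONDS` are hypothetical, and this is not the helper from the PR.

```python
import time
from typing import Callable

# Hypothetical named constants replacing the inline magic numbers.
WORKFLOW_TIMEOUT_SECONDS = 300.0
WORKFLOW_POLL_SECONDS = 1.0


def wait_until(
    is_done: Callable[[], bool],
    timeout_seconds: float = WORKFLOW_TIMEOUT_SECONDS,
    poll_seconds: float = WORKFLOW_POLL_SECONDS,
) -> bool:
    """Return True once is_done() is true, or False when time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Counting iterations understates the real wait, because each status request also takes time. A monotonic deadline bounds the wall-clock time instead, with at most one extra poll interval of overshoot.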

Comment on lines 1 to 13
  accelerate==1.5.2
  datasets==2.19.1
  langcodes==3.5.0
- litellm==1.60.6
- loguru==0.7.2
- pydantic>=2.10.0
- python-box==7.2.0
- requests-mock==1.12.1
- s3fs==2024.5.0
+ litellm==1.63.12
+ loguru==0.7.3
+ numpy==1.26.3
+ pandas==2.2.3
+ pydantic==2.10.6
+ python-box==7.3.2
+ s3fs==2024.2.0
  sentencepiece==0.2.0
- torch==2.5.1
- transformers==4.46.3
+ torch==2.6.0
+ transformers==4.49.0
Contributor

@ividal can the ML team please check these versions just in case?

Contributor

@javiermtorres javiermtorres left a comment

Only comment: please leave defaults at the top level, most probably at the interface with external settings.
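
A minimal sketch of what "defaults at the interface with external settings" might look like follows. The `JobWaitSettings` class, the `JOB_WAIT_TIMEOUT_SEC` variable name, and `wait_for_job` are hypothetical and are not taken from this codebase.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JobWaitSettings:
    """Single place where the default lives (hypothetical class)."""

    timeout_seconds: int = 5 * 60  # the 5-minute default proposed in this PR

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "JobWaitSettings":
        raw = env.get("JOB_WAIT_TIMEOUT_SEC")  # hypothetical variable name
        return cls(timeout_seconds=int(raw)) if raw else cls()


def wait_for_job(job_id: str, timeout_seconds: int) -> str:
    """No default here: the caller must pass the value explicitly."""
    return f"waiting up to {timeout_seconds}s for job {job_id}"
```

With this layout a test could pass its own value, for example `wait_for_job(job_id, timeout_seconds=3 * 60)`, without touching the shared default.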

@peteski22 peteski22 changed the title from "Update dependencies for Ray jobs + fix notebook walkthrough + tweaks for waiting for jobs" to "Tweaks for waiting for jobs" Mar 21, 2025
@peteski22 peteski22 disabled auto-merge March 21, 2025 12:40
@peteski22 peteski22 marked this pull request as draft March 26, 2025 12:22
@macaab26 macaab26 removed their request for review July 18, 2025 12:21
Labels

  • backend
  • documentation: Improvements or additions to documentation
  • schemas: Changes to schemas (which may be public facing)

2 participants