
Add exclude parameter to S3 pull functionality #205


Merged Jul 7, 2025 (1 commit)

Conversation

corbt
Contributor

@corbt corbt commented Jul 6, 2025

Summary

  • Added type-safe exclude parameter to S3 pull functionality to filter out directories during sync
  • Updated pull_model_trajectories helper to automatically exclude checkpoints and logs
  • Enables pulling only trajectories for local analysis without downloading large model weights

Motivation

Often when analyzing trajectory data, we only need the trajectory files themselves, not the model checkpoints or logs. Model checkpoints can be very large (gigabytes), and downloading them consumes significant bandwidth and storage space. This is especially important when working on mobile connections or with limited bandwidth.

Changes

  1. Added ExcludableOption type with valid options: "checkpoints", "logs", "trajectories"
  2. Modified s3_sync() to accept and use exclude patterns with AWS CLI's --exclude flag
  3. Updated pull_model_from_s3() and _experimental_pull_from_s3() to accept and pass through exclude parameter
  4. Modified pull_model_trajectories() helper to automatically exclude checkpoints and logs
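
As a rough illustration of changes 1 and 2, the sketch below shows how a Literal-typed exclude list could be turned into `--exclude` flags for `aws s3 sync`. It is not the code from this PR: the `build_s3_sync_command` helper is hypothetical, and the `<option>/*` pattern assumes each excludable option is a top-level directory under the model's S3 prefix.

```python
from typing import List, Literal, Optional, Sequence, get_args

# Mirrors the option names listed in the PR; the real type definition may differ.
ExcludableOption = Literal["checkpoints", "logs", "trajectories"]


def build_s3_sync_command(
    source: str,
    destination: str,
    exclude: Optional[Sequence[ExcludableOption]] = None,
) -> List[str]:
    """Build the argv for `aws s3 sync`, adding one --exclude flag per option.

    Hypothetical helper for illustration only.
    """
    valid_options = set(get_args(ExcludableOption))
    command = ["aws", "s3", "sync", source, destination]
    for option in exclude or []:
        # Literal types are only checked statically, so validate at runtime too.
        if option not in valid_options:
            raise ValueError(f"Unknown exclude option: {option!r}")
        # Assumed layout: each option names a directory directly under the prefix.
        command += ["--exclude", f"{option}/*"]
    return command
```

The resulting list can be passed to `subprocess.run` without a shell, so the `*` in each pattern reaches the AWS CLI unexpanded.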

Test plan

Created and ran test_s3_exclude.py to verify:

  • Helper function correctly excludes checkpoints and logs
  • Direct backend call with exclude parameter works
  • Only trajectory files are downloaded; excluded directories remain empty
  • Tested with model email-agent-216-4 in project email_agent
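
The "excluded directories remain empty" check could be expressed along the following lines. This is a sketch only, not the contents of test_s3_exclude.py: the `non_empty_excluded_dirs` helper is hypothetical, and it assumes the pulled model directory contains one subdirectory per excludable option.

```python
from pathlib import Path
from typing import Iterable, List


def non_empty_excluded_dirs(model_dir: Path, excluded: Iterable[str]) -> List[str]:
    """Return the excluded directory names that still contain at least one file.

    Hypothetical helper; a missing or empty directory counts as correctly excluded.
    """
    offenders = []
    for name in excluded:
        directory = model_dir / name
        # rglob("*") walks nested subdirectories, so deeply nested files are caught.
        if directory.is_dir() and any(p.is_file() for p in directory.rglob("*")):
            offenders.append(name)
    return offenders
```

After a pull with checkpoints and logs excluded, an empty return value for `["checkpoints", "logs"]` would indicate that nothing from those directories was downloaded.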

🤖 Generated with Claude Code

@corbt corbt force-pushed the s3-exclude-directories branch 3 times, most recently from 9e4d54a to 04d5768 Compare July 6, 2025 07:20
- Add typesafe exclude parameter to filter out directories during S3 sync
- Valid exclude options: "checkpoints", "logs", "trajectories"
- Update pull_model_trajectories helper to exclude checkpoints and logs by default
- Enables pulling only trajectories for analysis without downloading large model weights

This is useful when analyzing trajectory data locally without needing the
model checkpoints, which can be very large and consume significant bandwidth.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@corbt corbt force-pushed the s3-exclude-directories branch from 04d5768 to 98f38e2 Compare July 6, 2025 07:22
@corbt corbt requested a review from bradhilton July 6, 2025 07:23
@corbt corbt merged commit 92ba8b1 into main Jul 7, 2025
2 checks passed
2 participants