Switch trajectory storage from YAML to JSONL format #204

Merged
corbt merged 1 commit into main from yaml-to-jsonl-trajectories
Jul 10, 2025

Conversation

corbt
Contributor

@corbt corbt commented Jul 6, 2025

Summary

  • Migrate trajectory storage from YAML to JSONL format to make loading much faster
  • Roughly 160x faster loading: 1.9 seconds vs. 300 seconds (benchmarked by @kyle)
  • ~6% smaller file size

Why this change?

Kyle originally suggested using YAML for human readability, but the performance cost turned out to be too high. This PR switches to JSONL (newline-delimited JSON) format while maintaining full backward compatibility.

Changes

  • Update trajectory_logging.py to serialize as JSONL (one JSON object per line)
  • Update backend to save new files with .jsonl extension
  • Add backward compatibility to read both YAML and JSONL formats
  • Consolidate duplicate load_trajectories implementations (removed art-e specific version)
  • Add new aggregate_trajectories module for step-level metric aggregation
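
To illustrate the "one JSON object per line" layout described above, here is a minimal sketch that uses only the standard library. The helper names `write_jsonl` and `read_jsonl` and the plain-dict record shape are assumptions made for illustration; they are not taken from trajectory_logging.py.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    """Write each record as one compact JSON object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps without `indent` escapes newlines inside strings,
            # so every record occupies exactly one physical line.
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    """Parse a JSONL file line by line, skipping blank lines (hypothetical helper)."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is parsed independently, a reader can stream the file instead of building one large document in memory.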

Backward Compatibility

  • All loading functions can read both .yaml and .jsonl files
  • Existing YAML trajectories don't need to be converted
  • New trajectories will be saved as JSONL
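
A loader that accepts both formats could dispatch on the file extension, roughly as sketched below. This is only an illustration: the function name `load_trajectory_file` is hypothetical, the legacy YAML file is assumed to hold a top-level list of records, and PyYAML is assumed to be installed for the legacy branch.

```python
import json
from pathlib import Path
from typing import Any


def load_trajectory_file(path: Path) -> list[dict[str, Any]]:
    """Load records from a .jsonl or legacy .yaml file (hypothetical helper)."""
    if path.suffix == ".jsonl":
        with path.open("r", encoding="utf-8") as f:
            # One JSON object per non-blank line.
            return [json.loads(line) for line in f if line.strip()]
    if path.suffix == ".yaml":
        # Assumption: PyYAML is available. It is imported lazily so that
        # JSONL-only callers do not need it.
        import yaml

        with path.open("r", encoding="utf-8") as f:
            # Assumption: the legacy file is a top-level YAML list;
            # an empty file yields an empty list.
            return yaml.safe_load(f) or []
    raise ValueError(f"Unsupported trajectory file extension: {path.suffix!r}")
```

With this shape, existing `.yaml` files keep loading unchanged while new `.jsonl` files take the faster path.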

Trade-offs

  • Files are less human-readable (JSON instead of YAML)
  • We'll add better observability tools to compensate for this

Test plan

  • Test that new trajectories are saved as JSONL
  • Test that old YAML trajectories can still be loaded
  • Test that all trajectory loading functions work with both formats
  • Verify performance improvement

🤖 Generated with Claude Code

@corbt corbt force-pushed the yaml-to-jsonl-trajectories branch from 45e03c8 to 4a1e99b Compare July 6, 2025 06:57
@corbt corbt requested a review from bradhilton July 6, 2025 06:58
@corbt corbt force-pushed the yaml-to-jsonl-trajectories branch from a6fe2a3 to 4a1e99b Compare July 6, 2025 07:24
@corbt corbt changed the title Switch trajectory storage from YAML to JSONL format [WIP] Switch trajectory storage from YAML to JSONL format Jul 7, 2025
@corbt corbt changed the title [WIP] Switch trajectory storage from YAML to JSONL format Switch trajectory storage from YAML to JSONL format Jul 8, 2025
This change migrates our trajectory logging from YAML to JSONL format,
achieving a 160x speedup in loading (1.9s vs 300s) and ~6% smaller files.

Changes:
- Update trajectory_logging.py to serialize as JSONL (one JSON object per line)
- Update backend to save files with .jsonl extension
- Add backward compatibility to read both YAML and JSONL formats
- Consolidate duplicate load_trajectories implementations
- Add new aggregate_trajectories module for step-level aggregation

My bad for originally suggesting YAML - the human readability wasn't worth
the performance cost. We'll add better observability tools to compensate
for the reduced readability of JSONL files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@corbt corbt force-pushed the yaml-to-jsonl-trajectories branch from 4a1e99b to 46e9aea Compare July 10, 2025 00:45
@corbt corbt merged commit 5522bde into main Jul 10, 2025
2 checks passed