medley56
Member

Add File-like Object Support to create_dataset

Summary

Enhanced the create_dataset function to accept file-like objects (such as those created with open(filepath, "rb") or io.BytesIO)
in addition to file paths for both packet files and XTCE definitions.

Changes Made

Core Implementation (space_packet_parser/xarr.py)

  • Enhanced type annotations: Updated packet_files parameter to accept BinaryIO and bytes types alongside existing str and
    Path types
  • Added file-like object detection: Implemented logic to differentiate between file paths and file-like objects using type checking
  • Restructured packet processing: Refactored the main loop to handle both file paths (opened with context manager) and file-like
    objects (used directly)
  • Improved code organization: Extracted packet processing logic into a helper function _process_generator() to avoid code
    duplication
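
As a minimal sketch of the detection logic described above (not the code from this PR), one way to treat paths, raw bytes, and file-like objects uniformly is a small context manager. The names `PacketSource`, `open_packet_source`, and `read_all` are hypothetical and do not come from space_packet_parser.

```python
import io
from contextlib import contextmanager
from pathlib import Path
from typing import BinaryIO, Iterator, Union

# Hypothetical alias for illustration; not the annotation used in the PR
PacketSource = Union[str, Path, bytes, BinaryIO]


@contextmanager
def open_packet_source(source: PacketSource) -> Iterator[BinaryIO]:
    """Yield a readable binary stream for a path, raw bytes, or a file-like object."""
    if isinstance(source, (str, Path)):
        # Paths are opened here and closed again when the context exits
        with open(source, "rb") as f:
            yield f
    elif isinstance(source, (bytes, bytearray)):
        # Raw bytes are wrapped in an in-memory stream
        yield io.BytesIO(source)
    else:
        # File-like objects are used directly; the caller keeps ownership
        yield source


def read_all(sources) -> bytes:
    """Concatenate the bytes of every source, whatever its type."""
    chunks = []
    for source in sources:
        with open_packet_source(source) as stream:
            chunks.append(stream.read())
    return b"".join(chunks)
```

With this shape, a handle passed in by the caller is left open afterwards, while a file opened from a path is closed by the context manager.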

Comprehensive Test Coverage (tests/unit/test_xarr.py)

  • File-like objects test: Tests passing file handles opened with open(filepath, "rb") and io.BytesIO objects
  • Mixed file types test: Tests combining file paths and file-like objects in the same call
  • XTCE file-like support: Tests using io.StringIO for XTCE definitions alongside binary file-like objects for packet data

Checklist

  • Changes are fully implemented without dangling issues or TODO items
  • Deprecated/superseded code is removed or marked with deprecation warning
  • Current dependencies have been properly specified and old dependencies removed
  • New code/functionality has accompanying tests and any old tests have been updated to match any new assumptions
  • [NA] The changelog.md has been updated

@medley56 medley56 requested a review from greglucas August 19, 2025 22:11

codecov bot commented Aug 19, 2025

Codecov Report

❌ Patch coverage is 96.20991% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.25%. Comparing base (f647e8c) to head (67caef2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines                          Patch %   Lines
space_packet_parser/generators/fixed_length.py    83.87%    5 Missing ⚠️
space_packet_parser/generators/utils.py           95.18%    4 Missing ⚠️
space_packet_parser/xarr.py                       89.18%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #183      +/-   ##
==========================================
+ Coverage   93.98%   94.25%   +0.26%     
==========================================
  Files          42       46       +4     
  Lines        3375     3568     +193     
==========================================
+ Hits         3172     3363     +191     
- Misses        203      205       +2     


@medley56
Member Author

@greglucas Just making sure you saw this. I realized last night that for supporting S3 files, we actually don't have to make any changes because an S3Path behaves exactly like a Path object and create_dataset at least appears to work with it, but mypy throws a fit if you pass a cloudpathlib.S3Path object into create_dataset.

That said, I think this is still a good thing to support.

Collaborator

@greglucas greglucas left a comment

I don't love the double nesting going on with an internal helper function. I'm curious about your thoughts of restructuring things in two potentially different ways.

  • Do the handling within the generator itself. Can we do a yield from generator(f) after doing the transformation within the generator? i.e. if we get a file that is unopened, let's open that within a context block and then call the generator again, yielding from itself.
  • Perhaps this is a good use case for singledispatch, where you could choose what to do based upon the input type and register how to open the file-likes, sockets, raw bytes, ... https://docs.python.org/3/library/functools.html#functools.singledispatch
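
The first option might look roughly like the sketch below: a generator that, when handed an unopened path, opens it in a context block and then yields from itself with the open handle. The function name `packet_chunks` and the fixed `chunk_size` are assumptions for illustration only; the real packet generator parses headers rather than reading fixed-size chunks.

```python
import io
from pathlib import Path


def packet_chunks(source, chunk_size=4):
    """Yield fixed-size chunks from a path or an open binary stream (illustrative only)."""
    if isinstance(source, (str, Path)):
        # Unopened file: open it in a context block, then recurse on the handle
        with open(source, "rb") as f:
            yield from packet_chunks(f, chunk_size)
        return
    # Already a file-like object: read from it directly
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

Because the `with` block wraps the `yield from`, the file stays open for as long as the consumer is iterating and is closed once the inner generator is exhausted.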

@medley56 medley56 force-pushed the support-filelike-objects-in-create-dataset branch 6 times, most recently from a561ce7 to 8e0685c Compare August 26, 2025 19:14
Collaborator

@greglucas greglucas left a comment

Some minor comments on trying to consolidate a few more things.

I'm not sure how I feel about singledispatch. I think it seems OK, but I'm not sure it helped as much as I thought it would in my mind :)
The one concern I have with it is that I think the read_packet_file dispatch will read the entire file at once before entering the generator whereas your implementation before would have passed the open file handle in and then called read() (potentially a full read by default, but controllable by a user).
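
To make the concern concrete, here is a hedged sketch of a singledispatch reader that loads everything eagerly. The name `read_packet_bytes` is hypothetical and this is not the `read_packet_file` implementation from the PR; it only shows the general pattern being discussed.

```python
import io
from functools import singledispatch
from pathlib import Path


@singledispatch
def read_packet_bytes(source) -> bytes:
    """Fallback: treat the object as a readable binary stream."""
    if not hasattr(source, "read"):
        raise TypeError(f"Unsupported packet source: {type(source).__name__}")
    # Eager: read() with no size argument consumes the rest of the stream
    return source.read()


@read_packet_bytes.register(str)
@read_packet_bytes.register(Path)
def _(source) -> bytes:
    # Eager: the whole file is in memory before any parsing starts
    return Path(source).read_bytes()


@read_packet_bytes.register(bytes)
def _(source) -> bytes:
    # Raw bytes are already in memory
    return source
```

A dispatch that returned an open handle instead of bytes would leave the read size under the caller's control, at the cost of having to manage when the handle gets closed.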

current_pos = 0 # Keep track of where we are in the buffer
start_time = time.time_ns()
while True:
    if total_length_bytes and n_bytes_parsed == total_length_bytes:
Collaborator

Do you need the first check? I think you could just keep the equality and 0 == None would also be False if that is what you're looking to catch. (I'm guessing this was just copied over, so just noting that here because I saw it)

Suggested change
- if total_length_bytes and n_bytes_parsed == total_length_bytes:
+ if n_bytes_parsed == total_length_bytes:

Member Author

Huh, yeah that seems like it's unnecessary. I'll see if it breaks anything in some way I'm not thinking of.
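
One quick way to check where the two conditions could diverge is to compare them over a few inputs. The helper names below are made up for this sketch, and whether `total_length_bytes` can ever be 0 in the real generator is an assumption that would need to be confirmed against the code.

```python
def should_stop_original(n_bytes_parsed, total_length_bytes):
    """Condition as written: truthiness guard plus equality."""
    return bool(total_length_bytes and n_bytes_parsed == total_length_bytes)


def should_stop_suggested(n_bytes_parsed, total_length_bytes):
    """Condition from the suggested change: equality only."""
    return n_bytes_parsed == total_length_bytes


# (n_bytes_parsed, total_length_bytes) pairs, including None and 0 totals
cases = [(0, None), (10, None), (10, 10), (4, 10), (0, 0)]
differing = [
    case for case in cases
    if should_stop_original(*case) != should_stop_suggested(*case)
]
```

For these inputs the two conditions agree whenever `total_length_bytes` is None or a positive number. They differ only for the `(0, 0)` case, where the equality-only form is True and the guarded form is False.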

- Refactor generator definitions into generators subpackage
- Use singledispatch to read packet files in xarr.py
- Add singledispatch for setting up generator binary reader
- Cleanup typehinting in definitions.py
- Add tests for generators module
@medley56 medley56 force-pushed the support-filelike-objects-in-create-dataset branch from 19abd78 to 67caef2 Compare August 27, 2025 21:09
@medley56 medley56 merged commit 0cf3ce7 into main Aug 27, 2025
19 checks passed
@medley56 medley56 deleted the support-filelike-objects-in-create-dataset branch August 27, 2025 21:18