ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFactory (#13760) #13811
Conversation
Thanks for the PR! @davisusanibar @lwhite1 would one of you mind taking a look? Is "osm_nodes.arrow" from OpenStreetMap? Are there licensing concerns around the data? Arrow already has test data files for use, and/or files can be generated in-process.
@lidavidm yes, it is 10 records from the OpenStreetMap planet dump. Could you please provide more information on how to generate test data in the Arrow file format to test the dataset API, or point to where the existing test data is located?
It'd be something like:

```java
// Assumes TMP is a JUnit TemporaryFolder rule and allocator is a BufferAllocator
// available in the test setup.
File out = TMP.newFile();
Schema schema = new Schema(Collections.singletonList(Field.nullable("ints", new ArrowType.Int(32, true))));
try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
     FileOutputStream fileOutputStream = new FileOutputStream(out);
     ArrowFileWriter writer = new ArrowFileWriter(root, /*dictionaryProvider=*/null, fileOutputStream.getChannel())) {
  // Fill root with data
  IntVector ints = (IntVector) root.getVector(0);
  ints.setSafe(0, 0);
  root.setRowCount(1);
  // ...
  writer.start();
  writer.writeBatch();
  writer.end();
}
// Use out.getPath()...
```
@lidavidm thank you for the advice. The OSM data was removed from the PR. Please check the updated test TestFileSystemDataset#testBaseArrowIpcRead.
Hi @igor-suhorukov, this looks good to me except I wish the tests were more robust. (The same is true for the Parquet test that you're emulating, but I guess that's out of scope here.) This kind of test - relying on checking sizes and names - doesn't provide much assurance that we won't see bug reports when people import complex data types or otherwise tap into some of the more advanced functionality.
@lwhite1 we could file another JIRA for that?
Also a general note re: Larry's comment: we currently have a mix of JUnit 4/5, ad-hoc test helpers like the one here, and a mix of assertion libraries; it might be good to start incrementally cleaning that up (e.g. it would be much easier to test complex types if there were an easy setup to parameterize a test and have the data generated for you). ARROW-6931 is sort of related, as is ARROW-4740 (we added JUnit 5 but didn't port the existing tests).
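As a sketch of what such a parameterized setup might look like (a hypothetical test, not existing Arrow code; the round-trip body is elided), a JUnit 5 test could generate one case per Arrow type:

```java
import java.util.Collections;
import java.util.stream.Stream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;

import static org.junit.jupiter.api.Assertions.assertEquals;

class DatasetTypeRoundTripTest {
  // One test case per Arrow type; a shared helper could also generate sample data.
  static Stream<ArrowType> types() {
    return Stream.of(
        new ArrowType.Int(32, true),
        new ArrowType.Utf8(),
        new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE));
  }

  @ParameterizedTest
  @MethodSource("types")
  void schemaSurvivesRoundTrip(ArrowType type) throws Exception {
    Schema schema = new Schema(Collections.singletonList(Field.nullable("f", type)));
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      // A real test would write `root` to an IPC file here, read it back through
      // FileSystemDatasetFactory, and compare schemas and values.
      assertEquals(schema, root.getSchema());
    }
  }
}
```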
I think it's fine to open a Jira for better Parquet testing. It would be preferable, IMO, to get better testing for the new functionality here, rather than file a ticket for it.
I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase, and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface).
Ok. Works for me.
lwhite1 left a comment:
LGTM
FWIW, looking at the JIRA/GH issue, this will only handle "IPC" files, not Arrow stream files - there's work needed on the C++ side if that is something we want to cover.
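For anyone skimming, the file-vs-stream distinction in code terms (a minimal sketch; both writers are part of arrow-vector, and the data here is made up):

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class IpcFormats {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Collections.singletonList(
        Field.nullable("ints", new ArrowType.Int(32, true))));
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      ((IntVector) root.getVector(0)).setSafe(0, 42);
      root.setRowCount(1);

      // Random-access "file" format (what this PR handles): ends with a footer,
      // so readers can seek to any record batch.
      ByteArrayOutputStream fileSink = new ByteArrayOutputStream();
      try (ArrowFileWriter writer =
               new ArrowFileWriter(root, null, Channels.newChannel(fileSink))) {
        writer.start();
        writer.writeBatch();
        writer.end();
      }

      // Streaming format (not covered here): batches written in sequence,
      // no footer, no random access.
      ByteArrayOutputStream streamSink = new ByteArrayOutputStream();
      try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, streamSink)) {
        writer.start();
        writer.writeBatch();
        writer.end();
      }
    }
  }
}
```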
Thanks a lot for the clarification @lidavidm @lwhite1, and for your time. Don't worry about the refactoring - I have experience with refactoring, tech-debt fixes, and cleanup from Spring/Elasticsearch projects. That kind of work can be a crowd contribution once the Arrow project is more mature - a separate activity for new joiners, and a good start for someone.
Benchmark runs are scheduled for baseline = a2f3666 and contender = 78351ce. 78351ce is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR bumps the Apache Arrow version from 9.0.0 to 10.0.0. Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL ([13800](apache/arrow#13800))
- Initial implementation of immutable Table API ([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL ([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](apache/arrow#13811), [13973](apache/arrow#13973), [14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters ([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning Datasets are exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the pa.array(..) constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or pandas works by falling back to the storage array ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the target schema ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFactory (apache#13760) (apache#13811)

This PR allows developers to create a Dataset from Arrow IPC files in JVM code like:

```java
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(
    rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetURL);
```

It is the foundation for an Apache Spark Arrow data source that can process huge existing partitioned datasets in the Arrow file format without additional data format conversion.

Lead-authored-by: Igor Suhorukov <[email protected]>
Co-authored-by: igor.suhorukov <[email protected]>
Signed-off-by: David Li <[email protected]>
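For readers arriving from the changelog, a rough end-to-end sketch of the read path this enables (assuming the Java Dataset API around Arrow 10.x; the URI is a made-up example, and Scanner#scanBatches may not exist in older releases):

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

public class ReadIpcDataset {
  public static void main(String[] args) throws Exception {
    String uri = "file:///tmp/example.arrow"; // hypothetical path
    try (BufferAllocator allocator = new RootAllocator();
         DatasetFactory factory = new FileSystemDatasetFactory(
             allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
         Dataset dataset = factory.finish();
         Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize=*/ 32768));
         ArrowReader reader = scanner.scanBatches()) {
      // Iterate over record batches produced by the native scanner.
      while (reader.loadNextBatch()) {
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        System.out.println("Read batch with " + root.getRowCount() + " rows");
      }
    }
  }
}
```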