
Conversation

@igor-suhorukov
Contributor

@igor-suhorukov igor-suhorukov commented Aug 7, 2022

This PR allows developers to create a Dataset from Arrow IPC files in JVM code like:
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetURL);

It is the foundation for an Apache Spark Arrow data source that can process huge existing partitioned datasets in the Arrow file format without additional data format conversion.
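For context, a minimal sketch of scanning such a dataset end to end. It assumes allocator is a BufferAllocator in scope, arrowDatasetUri is a file: URI pointing at the IPC data, and the Scanner#scanBatches() API available in recent Arrow Java releases:

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

try (FileSystemDatasetFactory factory = new FileSystemDatasetFactory(
         allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetUri);
     Dataset dataset = factory.finish();
     Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize=*/32768));
     ArrowReader reader = scanner.scanBatches()) {
    while (reader.loadNextBatch()) {
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        // process the current batch held by root ...
    }
}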


@github-actions

github-actions bot commented Aug 7, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Member

lidavidm commented Aug 8, 2022

Thanks for the PR!

@davisusanibar @lwhite1 would one of you mind taking a look?

Is "osm_nodes.arrow" from OpenStreetMap? Are there licensing concerns around the data? Arrow already has test data files for use and/or files can be generated in-process.

@igor-suhorukov
Contributor Author

igor-suhorukov commented Aug 8, 2022

@lidavidm yes, it is 10 records from the OpenStreetMap planet dump. Could you please provide more information on how to generate test data in the Arrow file format to test the Dataset API, or point to where the existing test data is located?

@lidavidm
Member

lidavidm commented Aug 8, 2022

It'd be something like

// Assumes TMP is a JUnit TemporaryFolder rule and allocator is a BufferAllocator in scope.
import java.io.File;
import java.io.FileOutputStream;
import java.util.Collections;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

File out = TMP.newFile();
Schema schema = new Schema(Collections.singletonList(Field.nullable("ints", new ArrowType.Int(32, true))));
try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
     FileOutputStream fileOutputStream = new FileOutputStream(out);
     ArrowFileWriter writer = new ArrowFileWriter(root, /*dictionaryProvider=*/null, fileOutputStream.getChannel())) {
    // Fill root with data
    IntVector ints = (IntVector) root.getVector(0);
    ints.setSafe(0, 0);
    root.setRowCount(1);
    // ...
    writer.start();      // writes the schema
    writer.writeBatch(); // writes the current contents of root as a record batch
    writer.end();        // writes the file footer
}
// Use out.getPath()...

@igor-suhorukov
Contributor Author

@lidavidm thank you for the advice. The OSM data has been removed from the PR. Please check the updated test TestFileSystemDataset#testBaseArrowIpcRead. Does it fit the project's testing approach?

@lwhite1
Contributor

lwhite1 commented Aug 8, 2022

Hi @igor-suhorukov This looks good to me except I wish the tests were more robust. (The same is true for the Parquet test that you're emulating, but I guess that's out of scope here.)

This kind of test - relying on checking sizes and names - doesn't provide much assurance that we won't see bug reports when people import complex data types or otherwise tap into some of the more advanced functionality.

@lidavidm
Member

lidavidm commented Aug 8, 2022

@lwhite1 we could file another JIRA for that?

@lidavidm
Member

lidavidm commented Aug 8, 2022

Also a general note re: Larry's comment: we currently have a mix of JUnit 4/5, ad-hoc test helpers like the one here, and a mix of assertion libraries; it might be good to start incrementally cleaning that up (e.g. it would be much easier to test complex types if there were an easy setup to parameterize a test and have the data generated for you).

ARROW-6931 is sort of related, as is ARROW-4740 (we added JUnit 5 but didn't port the existing tests).
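As a concrete illustration of that parameterized setup, a purely hypothetical JUnit 5 sketch (the test class, field list, and round-trip flow here are invented for illustration, not existing Arrow utilities):

import java.util.Collections;
import java.util.stream.Stream;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

class DatasetRoundTripTest {
    // Each Field yields one test invocation, covering complex types as well as primitives.
    static Stream<Field> testFields() {
        return Stream.of(
            Field.nullable("ints", new ArrowType.Int(32, true)),
            Field.nullable("strings", new ArrowType.Utf8()),
            new Field("ints_list", FieldType.nullable(new ArrowType.List()),
                Collections.singletonList(Field.nullable("item", new ArrowType.Int(32, true)))));
    }

    @ParameterizedTest
    @MethodSource("testFields")
    void roundTrip(Field field) {
        // Hypothetical flow: generate data for the field, write it to an IPC file,
        // read it back through the Dataset API, and assert the round-tripped data matches.
    }
}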

@lwhite1
Contributor

lwhite1 commented Aug 8, 2022 via email

@lidavidm
Member

lidavidm commented Aug 8, 2022

I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface)

@lwhite1
Contributor

lwhite1 commented Aug 8, 2022

> I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface)

Ok. Works for me.

Contributor

@lwhite1 lwhite1 left a comment


LGTM

@lidavidm
Member

lidavidm commented Aug 8, 2022

I filed https://issues.apache.org/jira/browse/ARROW-17342

@lidavidm
Member

lidavidm commented Aug 8, 2022

FWIW, looking at the JIRA/GH issue, this will only handle "IPC" files, not Arrow stream files - there's work needed on the C++ side if that is something we want to cover

@lidavidm lidavidm merged commit 78351ce into apache:master Aug 8, 2022
@igor-suhorukov
Contributor Author

Thanks a lot for the clarification @lidavidm @lwhite1, and for your time. Don't worry about the refactoring - I have experience with that kind of refactoring, tech-debt fixing, and cleanup from the Spring/Elasticsearch projects. It can be a crowd contribution once the Arrow project is more mature - a separate activity for newcomers, and a good starting point for someone.

@ursabot

ursabot commented Aug 9, 2022

Benchmark runs are scheduled for baseline = a2f3666 and contender = 78351ce. 78351ce is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.34% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 78351cec ec2-t3-xlarge-us-east-2
[Finished] 78351cec test-mac-arm
[Finished] 78351cec ursa-i9-9960x
[Finished] 78351cec ursa-thinkcentre-m75q
[Finished] a2f3666d ec2-t3-xlarge-us-east-2
[Finished] a2f3666d test-mac-arm
[Finished] a2f3666d ursa-i9-9960x
[Finished] a2f3666d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Yicong-Huang added a commit to apache/texera that referenced this pull request Dec 13, 2022
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0.

Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL
([13800](apache/arrow#13800))
- Initial implementation of immutable Table API
([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL
([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory
([13811](apache/arrow#13811),
[13973](apache/arrow#13973),
[14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters
([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning
Datasets are exposed in Python
([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the
pa.array(..) constructor
([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or
pandas works by falling back to the storage array
([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the
target schema
([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…tory (apache#13760) (apache#13811)

This PR allows developers to create a Dataset from Arrow IPC files in JVM code like:
`FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
            FileFormat.ARROW_IPC, arrowDatasetURL);`

It is the foundation for an Apache Spark Arrow data source that can process huge existing partitioned datasets in the Arrow file format without additional data format conversion.

Lead-authored-by: Igor Suhorukov <[email protected]>
Co-authored-by: igor.suhorukov <[email protected]>
Signed-off-by: David Li <[email protected]>


Development

Successfully merging this pull request may close these issues.

Read "arrow" (IPC and streaming) files usning org.apache.arrow.dataset.jni.NativeDatasetFactory in Java API
