Skip to content
This repository was archived by the owner on May 10, 2024. It is now read-only.

Conversation

@xhochy
Copy link
Member

@xhochy xhochy commented Sep 13, 2016

At the moment this is just move of the existing code into the state where it compiles. Outstanding work includes:

  • Understand the issues with ParquetException typeid matching in the tests on macOS. @wesm We already had this problem and you fixed it somewhere. Do you remember the solution?
  • Understand why BoolenType tests break with a Thrift exception
  • Add functions that read directly into Arrow memory and not intermediate structures.

@xhochy
Copy link
Member Author

xhochy commented Sep 13, 2016

The failing BooleanType test is not because of arrow but there seems to be a general problem inside parquet-cpp. Will make a PR with a separate failing test.

@wesm
Copy link
Member

wesm commented Sep 14, 2016

@xhochy I thought I fixed the ParquetException issue in 616305c, but it is recurring?

@wesm
Copy link
Member

wesm commented Sep 14, 2016

I think we can leave item 3 for a follow up JIRA.

The output for OS X (all the Thrift linker warnings) looks worrisome, not sure how to fix that off hand

PROPERTIES
BUILD_WITH_INSTALL_RPATH ON
INSTALL_NAME_DIR "@rpath")
endif()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: building static libparquet_arrow.a

@wesm
Copy link
Member

wesm commented Sep 14, 2016

@xhochy my understanding from google searches is that throwing a shared library's exception types outside its boundaries can be unreliable. So we may need to come up with a different way for Parquet library implementers to signal failure of an interface extension type's methods to parquet-cpp

@wesm
Copy link
Member

wesm commented Sep 15, 2016

@xhochy as a workaround you can add a protected API to the base classes that allow subclasses in other shared libraries to throw ParquetException without using it directly. e.g.

in subclass

if (... /* error */) {
  RaiseException("it failed");
}

in ParquetAllocator (etc.)

void RaiseException(const std::string& msg) {
  throw ParquetException(msg);
}

@wesm
Copy link
Member

wesm commented Sep 15, 2016

Or maybe add the exception thrower function to the public API

@wesm
Copy link
Member

wesm commented Sep 16, 2016

Why is the output of the unit tests not visible? We should fix the Travis CI here to be like Arrow so we can see which test failed.

Does my solution to the exception issue work? I think this is the only offending line that needs to be replaced with a function call from libparquet.so that actually throws the exception:

https://github.com/apache/parquet-cpp/pull/158/files#diff-92da47ba6e6cd44f28603464bbbe66d2R47

@wesm
Copy link
Member

wesm commented Sep 16, 2016

@xhochy
Copy link
Member Author

xhochy commented Sep 17, 2016

Locally I encounter #161 now with a static build for parquet_arrow enabled, thus that one should go in first.

@wesm
Copy link
Member

wesm commented Sep 18, 2016

I can merge this whenever the build is green

@xhochy
Copy link
Member Author

xhochy commented Sep 18, 2016

There are still open issues I have to fix, so please wait until I remove the WIP prefix.

@xhochy xhochy changed the title [WIP] PARQUET-712: Add library to read into Arrow memory PARQUET-712: Add library to read into Arrow memory Sep 18, 2016
@xhochy
Copy link
Member Author

xhochy commented Sep 18, 2016

With correct static linkage, we don't have the ParquetException problems. I'll open a JIRA for faster reads and a benchmarks for Arrow-Parquet I/O. This PR is so far done.

@wesm conda builds are broken. Not sure if we can fix them / we want to fix them as they probably involve a cyclic Arrow<->Parquet dependency.

@wesm
Copy link
Member

wesm commented Sep 18, 2016

@xhochy I'm going to set up conda builds in conda-forge with pinned git revisions -- the builds here were a stop gap so I'll remove them once conda-forge is all working

@wesm
Copy link
Member

wesm commented Sep 18, 2016

+1, thank you! let's wait to remove arrow_parquet from Arrow until ARROW-280 is merged

@asfgit asfgit closed this in 4a7bf11 Sep 18, 2016
break;
case ArrowType::UINT32:
if (properties.version() == ::parquet::ParquetVersion::PARQUET_1_0) {
type = ParquetType::INT64;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy Can you explain why UINT_32 is not emitted for Parquet version 1.0? There's zero comment and I can't find any explanation otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I saw the comment elsewhere:

        // Parquet 1.0 reader cannot read the UINT_32 logical type. Thus we need
        // to use the larger Int64Type to store them lossless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's still not very informative. Which "Parquet 1.0 reader" is it? parquet-mr? parquet-cpp? A particular version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it is unclear about when it's acceptable to use the unsigned integer types (whether the "1.0" format version versus "2.0" format version -- here 1.0 refers to the Parquet format version).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants