[fix](orc) ignore null values when the literals of in_predicate contains #45104
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
b48f4f7 to
d317351
Compare
|
run buildall |
b4bb463 to
52d3b60
Compare
|
clang-tidy review says "All clean, LGTM! 👍" |
52d3b60 to
ec9aded
Compare
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
ec9aded to
02754ce
Compare
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
TeamCity be ut coverage result: |
|
PR approved by anyone and no changes requested. |
### What changes were proposed in this pull request?

close: #2079
relate pr: #2055

Introduce fallback logic in the C++ reader to set hasNull to true when the field is missing, similar to the Java implementation. The Java implementation includes the following logic:

```java
if (stats.hasHasNull()) {
  hasNull = stats.getHasNull();
} else {
  hasNull = true;
}
```

In contrast, the C++ implementation directly uses the has_null value without any fallback logic:

```c++
ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
  stats_.setNumberOfValues(pb.number_of_values());
  stats_.setHasNull(pb.has_null());
}
```

### Why are the changes needed?

We encountered an issue with the C++ implementation of the ORC reader when handling ORC files written with version 0.12. Specifically, files written in this version do not include the hasNull field in the column statistics metadata. While the Java implementation of the ORC reader handles this gracefully by defaulting hasNull to true when the field is absent, the C++ implementation does not handle this scenario correctly.

**This issue prevents predicates like IS NULL from being pushed down to the ORC reader!!! As a result, all rows in the file are filtered out, leading to incorrect query results :(**

### How was this patch tested?

I have tested this using [Doris](https://github.com/apache/doris) external pipeline:
apache/doris#45104
apache/doris-thirdparty#259

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #2082 from suxiaogang223/fix_has_null.

Authored-by: Socrates <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
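The fallback described above can be sketched in isolation. This is a minimal illustration, not the code of the actual patch: `FakeColumnStatistics` is a hypothetical stand-in for `proto::ColumnStatistics`, and its `has_has_null()` accessor is assumed to follow the usual protobuf convention of a presence check for an optional field. `resolve_has_null` is likewise a made-up helper name.

```cpp
#include <optional>

// Hypothetical stand-in for proto::ColumnStatistics (an assumption for this
// sketch). An empty optional models a file whose statistics omit the field.
struct FakeColumnStatistics {
    std::optional<bool> has_null_field;

    // Presence check, modeled on protobuf's accessor for optional fields.
    bool has_has_null() const { return has_null_field.has_value(); }

    // Value accessor; like protobuf, it yields false when the field is absent.
    bool has_null() const { return has_null_field.value_or(false); }
};

// Mirror the Java reader: trust the stored value when it is present, and
// conservatively assume the column may contain NULLs when it is missing.
inline bool resolve_has_null(const FakeColumnStatistics& pb) {
    return pb.has_has_null() ? pb.has_null() : true;
}
```

Without the presence check, an absent field reads as `false`, which tells the predicate evaluator that the column has no NULLs, so an `IS NULL` predicate can rule out every row group.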
kaka11chen
left a comment
LGTM
…ins (#45104)

### What problem does this PR solve?

Related PR: #43255

Problem Summary:

Should ignore null values when the literals of in_predicate contains null value, like `in (1, null)`

For example, init table in hive:

```sql
CREATE TABLE sample_orc_table (
    id INT,
    name STRING,
    age INT
)
STORED AS ORC;

INSERT INTO TABLE sample_orc_table VALUES
    (1, 'Alice', 25),
    (2, NULL, NULL);
```

select result in Doris should be:

```sql
mysql> select * from sample_orc_table where age in (null,25);
+------+-------+------+
| id   | name  | age  |
+------+-------+------+
|    1 | Alice |   25 |
+------+-------+------+
1 row in set (0.30 sec)

mysql> select * from sample_orc_table where age in (25);
+------+-------+------+
| id   | name  | age  |
+------+-------+------+
|    1 | Alice |   25 |
+------+-------+------+
1 row in set (0.27 sec)

mysql> select * from sample_orc_table where age in (null);
Empty set (0.01 sec)

mysql> select * from sample_orc_table where age is null;
+------+------+------+
| id   | name | age  |
+------+------+------+
|    2 | NULL | NULL |
+------+------+------+
1 row in set (0.11 sec)
```
…dicate contains #45104 (#45586) Cherry-picked from #45104 Co-authored-by: Socrates <[email protected]>
close: #2079
relate pr: #2055

Introduce fallback logic in the C++ reader to set hasNull to true when the field is missing, similar to the Java implementation. The Java implementation includes the following logic:

```java
if (stats.hasHasNull()) {
  hasNull = stats.getHasNull();
} else {
  hasNull = true;
}
```

In contrast, the C++ implementation directly uses the has_null value without any fallback logic:

```c++
ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
  stats_.setNumberOfValues(pb.number_of_values());
  stats_.setHasNull(pb.has_null());
}
```

We encountered an issue with the C++ implementation of the ORC reader when handling ORC files written with version 0.12. Specifically, files written in this version do not include the hasNull field in the column statistics metadata. While the Java implementation of the ORC reader handles this gracefully by defaulting hasNull to true when the field is absent, the C++ implementation does not handle this scenario correctly.

**This issue prevents predicates like IS NULL from being pushed down to the ORC reader!!! As a result, all rows in the file are filtered out, leading to incorrect query results :(**

I have tested this using [Doris](https://github.com/apache/doris) external pipeline:
apache/doris#45104
apache/doris-thirdparty#259

No

Closes #2082 from suxiaogang223/fix_has_null.

Authored-by: Socrates <suxiaogang223icloud.com>

### What changes were proposed in this pull request?

### Why are the changes needed?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

Closes #2086 from suxiaogang223/cherry_pick_fix_has_null.

Authored-by: Socrates <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…f in_predicate contains apache#45104 (apache#45586)" This reverts commit 55a51dc.
revert:
branch-3.0: [fix](orc) ignore null values when the literals of in_predicate contains #45104 (#45586)
[fix](orc) check all the cases before build_search_argument (#44615) (#44802)
branch-3.0: [enhance](orc) Optimize ORC Predicate Pushdown for OR-connected Predicate #43255 (#44436)

re-pick:
branch-3.0: [Fix](ORC) Not push down fixed char type in orc reader #45484 (#45525)

---------

Co-authored-by: Socrates <[email protected]>
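What problem does this PR solve?
Related PR: #43255
Problem Summary:
NULL values should be ignored when the literal list of an in_predicate contains a NULL, as in `in (1, null)`.
For example, take a Hive table `sample_orc_table` stored as ORC with the rows `(1, 'Alice', 25)` and `(2, NULL, NULL)`.
Queried from Doris, `age in (null,25)` and `age in (25)` should both return only the row with `age = 25`, `age in (null)` should return an empty set, and `age is null` should return the row with `id = 2`.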
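The expected behavior can be illustrated with a small standalone sketch. It is not the Doris implementation: `Literal`, `strip_null_literals`, and `where_in_keeps_row` are hypothetical names introduced here, and the literals are modeled as optional 64-bit integers for simplicity. The idea shown is that a NULL literal can never make `IN` evaluate to TRUE, so it can be dropped from the list before the predicate is used for filtering.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical literal type for this sketch: an empty optional is SQL NULL.
using Literal = std::optional<int64_t>;

// Drop NULL literals from an IN list. Comparing any value with NULL yields
// UNKNOWN rather than TRUE, so a NULL literal never admits a row.
std::vector<int64_t> strip_null_literals(const std::vector<Literal>& literals) {
    std::vector<int64_t> non_null;
    non_null.reserve(literals.size());
    for (const Literal& lit : literals) {
        if (lit.has_value()) {
            non_null.push_back(*lit);
        }
    }
    return non_null;
}

// WHERE semantics for `value IN (literals)`: a row is kept only when the
// predicate is TRUE, so both FALSE and UNKNOWN results discard the row.
bool where_in_keeps_row(const Literal& value, const std::vector<Literal>& literals) {
    if (!value.has_value()) {
        return false;  // NULL IN (...) is never TRUE
    }
    const std::vector<int64_t> non_null = strip_null_literals(literals);
    return std::find(non_null.begin(), non_null.end(), *value) != non_null.end();
}
```

With the two sample rows, `where_in_keeps_row(25, {std::nullopt, 25})` keeps the `age = 25` row, while the row with a NULL `age` is kept only by `IS NULL`, never by `IN`. When every literal is NULL the stripped list is empty and no row matches, which agrees with the `Empty set` result for `age in (null)`.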
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)