@suxiaogang223
Contributor

What problem does this PR solve?

Related PR: #43255

Problem Summary:
NULL literals should be ignored when the literal list of an in_predicate contains a NULL value, for example in (1, null).
For example, create this table in Hive:

CREATE TABLE sample_orc_table (
    id INT,
    name STRING,
    age INT
)
STORED AS ORC;
INSERT INTO TABLE sample_orc_table VALUES
    (1, 'Alice', 25),
    (2, NULL, NULL); 

The select results in Doris should be:

mysql> select * from sample_orc_table where age in (null,25);
+------+-------+------+
| id   | name  | age  |
+------+-------+------+
|    1 | Alice |   25 |
+------+-------+------+
1 row in set (0.30 sec)

mysql> select * from sample_orc_table where age in (25);
+------+-------+------+
| id   | name  | age  |
+------+-------+------+
|    1 | Alice |   25 |
+------+-------+------+
1 row in set (0.27 sec)

mysql> select * from sample_orc_table where age in (null);
Empty set (0.01 sec)

mysql> select * from sample_orc_table where age is null;
+------+------+------+
| id   | name | age  |
+------+------+------+
|    2 | NULL | NULL |
+------+------+------+
1 row in set (0.11 sec)
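
The results above follow from SQL three-valued logic: a comparison with NULL is never TRUE, so a NULL literal in the IN list cannot make a row match and can be dropped before the predicate is pushed down. Below is a minimal sketch of that reasoning. The names `Tri`, `eval_in`, `row_selected`, and `strip_null_literals` are hypothetical helpers for illustration only, not Doris or ORC functions.

```c++
#include <optional>
#include <vector>

// SQL three-valued logic result of a predicate.
enum class Tri { False, True, Unknown };

using Value = std::optional<int>;  // std::nullopt models SQL NULL

// Evaluates `value IN (literals...)` with SQL semantics.
Tri eval_in(const Value& value, const std::vector<Value>& literals) {
    if (!value.has_value()) {
        // NULL IN (...) is UNKNOWN for a non-empty list.
        return literals.empty() ? Tri::False : Tri::Unknown;
    }
    bool saw_null = false;
    for (const Value& lit : literals) {
        if (!lit.has_value()) {
            saw_null = true;  // value = NULL is UNKNOWN, never TRUE
        } else if (*lit == *value) {
            return Tri::True;
        }
    }
    return saw_null ? Tri::Unknown : Tri::False;
}

// A WHERE clause keeps a row only when the predicate is TRUE.
bool row_selected(const Value& value, const std::vector<Value>& literals) {
    return eval_in(value, literals) == Tri::True;
}

// Drops NULL literals; row_selected gives the same answer before and after.
std::vector<Value> strip_null_literals(const std::vector<Value>& literals) {
    std::vector<Value> kept;
    for (const Value& lit : literals) {
        if (lit.has_value()) kept.push_back(lit);
    }
    return kept;
}
```

With `in (null)` the stripped list is empty, so no row can match, which agrees with the empty result above. Rows where `age` is NULL are only returned by `is null`, which is a separate predicate.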

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 force-pushed the fix_orc_push_dwon branch 3 times, most recently from b48f4f7 to d317351 on December 6, 2024 16:46
@suxiaogang223
Contributor Author

run buildall

@suxiaogang223 marked this pull request as draft December 9, 2024 06:22
@github-actions
Contributor

github-actions bot commented Dec 9, 2024

clang-tidy review says "All clean, LGTM! 👍"

@suxiaogang223 marked this pull request as ready for review December 10, 2024 03:33
@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.76% (10104/26069)
Line Coverage: 29.70% (84765/285368)
Region Coverage: 28.77% (43515/151228)
Branch Coverage: 25.33% (22107/87270)
Coverage Report: http://coverage.selectdb-in.cc/coverage/a20e66db478b3b0ddcfbeb2ac4ad7e1dee45765a_a20e66db478b3b0ddcfbeb2ac4ad7e1dee45765a/report/index.html

@github-actions
Contributor

PR approved by anyone and no changes requested.

dongjoon-hyun pushed a commit to apache/orc that referenced this pull request Dec 12, 2024
### What changes were proposed in this pull request?
Closes: #2079
Related PR: #2055
Introduce fallback logic in the C++ reader to set hasNull to true when the field is missing, similar to the Java implementation.
The Java implementation includes the following logic:
```java
if (stats.hasHasNull()) {
    hasNull = stats.getHasNull();
} else {
    hasNull = true;
}
```
In contrast, the C++ implementation directly uses the has_null value without any fallback logic:
```c++
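ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
    stats_.setNumberOfValues(pb.number_of_values());
    // No presence check: when the field is absent, has_null() yields the
    // protobuf default (false) instead of falling back to true.
    stats_.setHasNull(pb.has_null());
}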
```
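A minimal sketch of the proposed fallback, for illustration only. `PbColumnStatistics` is a hypothetical stand-in that models the optional protobuf field with `std::optional<bool>`, and `resolve_has_null` is a hypothetical helper; the real generated accessors and the actual patch may look different.

```c++
#include <optional>

// Hypothetical stand-in for the protobuf column statistics message.
// An empty optional models a file whose metadata omits the hasNull field.
struct PbColumnStatistics {
    std::optional<bool> has_null;
};

// Mirrors the Java fallback: use the stored value when the field is
// present, otherwise conservatively assume the column may contain nulls.
bool resolve_has_null(const PbColumnStatistics& pb) {
    return pb.has_null.value_or(true);
}
```

Defaulting to true is the conservative choice: it can only cause extra data to be read, never cause rows to be skipped.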
### Why are the changes needed?
We encountered an issue with the C++ implementation of the ORC reader when handling ORC files written with version 0.12. Specifically, files written in this version do not include the hasNull field in the column statistics metadata. While the Java implementation of the ORC reader handles this gracefully by defaulting hasNull to true when the field is absent, the C++ implementation does not handle this scenario correctly.
**This issue prevents predicates like IS NULL from being pushed down to the ORC reader correctly: with the field absent, hasNull is read as false, so all rows in the file are filtered out, leading to incorrect query results.**
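To see why a missing hasNull field can drop every row, consider how a reader might decide whether a row group can be skipped for an IS NULL predicate. The sketch below is a simplified illustration with hypothetical names (`ColumnStats`, `can_skip_for_is_null`); it is not the ORC library's actual evaluation code.

```c++
#include <cstdint>

// Simplified per-row-group statistics (hypothetical, for illustration).
struct ColumnStats {
    std::uint64_t number_of_values;  // count of non-null values
    bool has_null;                   // true if any value may be null
};

// A row group can be skipped for `col IS NULL` only when the statistics
// guarantee that it holds no nulls.
bool can_skip_for_is_null(const ColumnStats& stats) {
    return !stats.has_null;
}
```

If an absent field is read as false, `can_skip_for_is_null` returns true for every row group, including those that do contain nulls. Falling back to true keeps those row groups and leaves the final filtering to row-level evaluation.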
### How was this patch tested?
I tested this using the [Doris](https://github.com/apache/doris) external pipeline:
apache/doris#45104
apache/doris-thirdparty#259

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #2082 from suxiaogang223/fix_has_null.

Authored-by: Socrates <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
suxiaogang223 added a commit to suxiaogang223/orc that referenced this pull request Dec 16, 2024
Contributor

@kaka11chen left a comment

LGTM

@morningman merged commit 58f1df2 into apache:master Dec 18, 2024
25 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 18, 2024
…ins (#45104)

yiguolei pushed a commit that referenced this pull request Dec 19, 2024
dongjoon-hyun pushed a commit to apache/orc that referenced this pull request Dec 20, 2024
Closes #2086 from suxiaogang223/cherry_pick_fix_has_null.

Authored-by: Socrates <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
morningman added a commit to morningman/doris that referenced this pull request Feb 8, 2025
morningman added a commit that referenced this pull request Feb 9, 2025
revert:
branch-3.0: [fix](orc) ignore null values when the literals of in_predicate contains #45104 (#45586)
[fix](orc) check all the cases before build_search_argument (#44615) (#44802)
branch-3.0: [enhance](orc) Optimize ORC Predicate Pushdown for OR-connected Predicate #43255 (#44436)

re-pick:
branch-3.0: [Fix](ORC) Not push down fixed char type in orc reader #45484 (#45525)

---------

Co-authored-by: Socrates <[email protected]>
morningman pushed a commit that referenced this pull request Feb 17, 2025
…ins (#45104)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 24, 2025
…ins (apache#45104)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 25, 2025
…ins (apache#45104)

morningman pushed a commit to suxiaogang223/doris that referenced this pull request Jun 25, 2025
…ins (apache#45104)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 26, 2025
…ins (apache#45104)

GoGoWen pushed a commit to GoGoWen/incubator-doris that referenced this pull request Sep 26, 2025