@dejangvozdenac dejangvozdenac commented Aug 13, 2025

The producesNull check is used when binding isNull and notNull predicates. Currently, the logic states that a field can produce null if and only if that field is optional. This does not hold for required fields nested within optional structs: such a field is required, yet it still produces null whenever its parent struct is null.
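
As a rough illustration of the ancestor check described above (a minimal sketch, not the actual patch), the following models a field path as a list of hypothetical `Field` objects, each carrying only a name and an optional flag. The class and method names are assumptions made for this example and are not part of the Iceberg API. The sketch also assumes every ancestor on the path is a struct.

```java
import java.util.Arrays;
import java.util.List;

final class ProducesNullSketch {
  /** Hypothetical stand-in for a schema field: a name plus an optional/required flag. */
  static final class Field {
    final String name;
    final boolean optional;

    Field(String name, boolean optional) {
      this.name = name;
      this.optional = optional;
    }
  }

  /** Leaf-only rule as described above: only the referenced field's own flag is consulted. */
  static boolean producesNullLeafOnly(List<Field> path) {
    return path.get(path.size() - 1).optional;
  }

  /** Ancestor-aware rule: null is possible if the leaf or any enclosing struct is optional. */
  static boolean producesNullWithAncestors(List<Field> path) {
    for (Field field : path) {
      if (field.optional) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // address is an optional struct; address.street is required inside it.
    List<Field> street = Arrays.asList(new Field("address", true), new Field("street", false));
    // id is a required top-level column.
    List<Field> id = Arrays.asList(new Field("id", false));

    System.out.println(producesNullLeafOnly(street));      // false
    System.out.println(producesNullWithAncestors(street)); // true
    System.out.println(producesNullWithAncestors(id));     // false
  }
}
```

For `address.street` the two rules disagree, which is the case this PR targets. For a required top-level column such as `id` they agree.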

I'm able to reproduce this case in Trino and Spark by creating the following schema and adding rows to it:

spark-sql>  CREATE TABLE default.dejan_test (
  id INT NOT NULL,
  name STRING NOT NULL,
  age INT NOT NULL,
  address STRUCT<street: STRING NOT NULL, address_info: STRUCT<city: STRING NOT NULL, county: STRING NOT NULL, state: STRING NOT NULL>>)
USING iceberg;
spark-sql> INSERT INTO default.dejan_test (id, name, age, address)
VALUES (
  0, 
  'Jane Doe', 
  27, 
  NULL
);
spark-sql> INSERT INTO default.dejan_test (id, name, age, address)
VALUES (
  1, 
  'John Doe', 
  30, 
  STRUCT(
    '123 Main St',
    STRUCT('San Francisco', 'San Francisco County', 'California')
  )
);

address.street is null for the row with id 0, but Trino and Spark, both using the Iceberg API, disagree:

trino> 
set session iceberg.projection_pushdown_enabled=true;
SET SESSION
trino> 
select
  id
from
  iceberg.default.dejan_test
where
  address.street is null;
 id 
----
(0 rows)

Query 20250613_034027_00008_xn59q, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.36 [0 rows, 0B] [0 rows/s, 0B/s]

With projection pushdown disabled, Trino reads the entire file and correctly finds the row:

trino> 
set session iceberg.projection_pushdown_enabled=false;
SET SESSION
trino> 
select
  id
from
  iceberg.default.dejan_test
where
  address.street is null;
 id 
----
  0 
(1 row)

Query 20250613_033713_00001_xn59q, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
2.85 [2 rows, 4.43KiB] [0 rows/s, 1.56KiB/s]

This also leads to unexpected behavior where the is null and is not null checks behave differently depending on the binding order:

spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test;
2

spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test
                   > where
                   >   address.street is null or address.street is not null;
2


spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test
                   > where
                   >   address.street is not null;
1

spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test
                   > where
                   >   address.street is null;
0
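
The zero-row result for `is null` is consistent with the predicate being folded to a constant at bind time. The sketch below assumes, as the description implies, that binding consults producesNull and short-circuits when it returns false. The `Bound` enum and the method names are hypothetical and only model the outcome of binding, not the real expression classes.

```java
final class NullPredicateBindingSketch {
  /** Hypothetical outcomes of binding a null-check predicate. */
  enum Bound { ALWAYS_TRUE, ALWAYS_FALSE, IS_NULL, NOT_NULL }

  /** isNull on a term that cannot produce null can never match, so it folds to a constant. */
  static Bound bindIsNull(boolean producesNull) {
    return producesNull ? Bound.IS_NULL : Bound.ALWAYS_FALSE;
  }

  /** notNull on a term that cannot produce null always matches, so it folds to a constant. */
  static Bound bindNotNull(boolean producesNull) {
    return producesNull ? Bound.NOT_NULL : Bound.ALWAYS_TRUE;
  }

  public static void main(String[] args) {
    // Leaf-only rule: address.street is required, so producesNull is false.
    System.out.println(bindIsNull(false));  // ALWAYS_FALSE
    System.out.println(bindNotNull(false)); // ALWAYS_TRUE

    // Ancestor-aware rule: address is optional, so producesNull is true
    // and the real predicates are kept for evaluation against the data.
    System.out.println(bindIsNull(true));   // IS_NULL
    System.out.println(bindNotNull(true));  // NOT_NULL
  }
}
```

Under the leaf-only rule, `address.street is null` becomes a constant false before any data is read, so no row can match regardless of what the files contain.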

After this change, Iceberg finds the null row:

trino> 
set session iceberg.projection_pushdown_enabled=true;
SET SESSION
trino> 
select
  id
from
  iceberg.default.dejan_test
where
  address.street is null;
 id 
----
  0 
(1 row)

Query 20250813_413537_00001_qn1a9, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.15 [1 rows, 2.03KiB] [0 rows/s, 969B/s]

Closes #13328 (and relatedly trinodb/trino#20511)

@github-actions github-actions bot added the API label Aug 13, 2025
@dejangvozdenac dejangvozdenac marked this pull request as draft August 13, 2025 15:23
@dejangvozdenac dejangvozdenac changed the title add ancestor check to field optional is null check add ancestor check to field optional producesNull Aug 13, 2025
@dejangvozdenac dejangvozdenac changed the title add ancestor check to field optional producesNull API: required nested fields within optional structs can produce null Aug 14, 2025
@dejangvozdenac dejangvozdenac marked this pull request as ready for review August 14, 2025 19:47
@stevenzwu
Contributor

stevenzwu commented Aug 14, 2025

@dejangvozdenac thanks for reporting and fixing the issue. I agree that my_struct.nested_field should evaluate to null if my_struct is null.

Can you add a unit test in the Spark module to cover the scenario you described?

@dejangvozdenac
Author

thanks for the review @stevenzwu, I appreciate it! I addressed all the comments, let me know if anything further is needed.

Contributor

@stevenzwu stevenzwu left a comment

LGTM. We will need 2 or 3 more committers' approval, since this modifies critical api code.

@dejangvozdenac
Author

> LGTM. We will need 2 or 3 more committers' approval, since this modifies critical api code.

Awesome, thanks for the quick review @stevenzwu. What's the usual process here? Should I tag 2-3 people who have reviewed code under api, or do you have suggestions?

@dejangvozdenac
Author

@singhpk234 @pvary @nastra do you mind taking a look? I see you're familiar with this part of the code and @stevenzwu mentioned we need more reviews.

@dejangvozdenac
Author

Thanks for the reviews @nastra / @pvary , the comments should all be addressed.

Development

Successfully merging this pull request may close these issues.

Required fields within optional fields cause incorrect results in Trino