[SPARK-53124][SQL] Prune unnecessary fields from JsonTuple #51843

wangyum · 2025-08-05T07:08:06Z

What changes were proposed in this pull request?

This PR enhances the GenerateOptimization rule in Spark SQL Catalyst to improve the pruning of unnecessary fields from JsonTuple generators.

The rule now handles these cases explicitly:

No generator outputs used: remove the Generate node.
Some generator outputs used: prune the JsonTuple to keep only necessary fields.
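
The two cases above can be sketched with a simplified model. This is not the actual Catalyst code: `Relation`, `JsonTupleGenerate`, `Project`, and `JsonTuplePruning.prune` are hypothetical stand-ins that track column names only, and the sketch assumes the generator emits exactly one row per input row, so dropping it does not change the row count.

```scala
// Simplified stand-ins for Catalyst plan nodes; these are NOT Spark's real classes.
sealed trait Plan
final case class Relation(columns: Seq[String]) extends Plan
final case class JsonTupleGenerate(
    json: String,          // name of the JSON input column
    fields: Seq[String],   // field names passed to json_tuple
    output: Seq[String],   // one output column name per field
    child: Plan) extends Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan

object JsonTuplePruning {
  def prune(plan: Plan): Plan = plan match {
    case p @ Project(columns, g: JsonTupleGenerate) =>
      val used = columns.toSet
      // Keep only (field, output column) pairs whose output column is referenced.
      val kept = g.fields.zip(g.output).filter { case (_, out) => used.contains(out) }
      if (kept.isEmpty) {
        // No generator output is used: drop the generate node entirely
        // (assumes one output row per input row, as noted above).
        p.copy(child = g.child)
      } else if (kept.size == g.output.size) {
        // All outputs are used: nothing to prune.
        plan
      } else {
        // Some outputs are used: rebuild the generator with only those fields.
        val (fields, output) = kept.unzip
        p.copy(child = g.copy(fields = fields, output = output))
      }
    case other => other
  }
}
```

In this sketch the `Project` node is left in place after pruning; in the optimized plan shown in the example below, the `Project` is also gone, which this model does not attempt to reproduce.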

Example:

SELECT f2
FROM (SELECT '{"f1": "Spark", "f2": 2025, "f3": 8}' AS json) test
LATERAL VIEW json_tuple(json, 'f1', 'f2', 'f2') jt AS f1, f2, f3

Before this PR:

== Optimized Logical Plan ==
Project [f2#2]
+- Generate json_tuple({"f1": "Spark", "f2": 2025, "f3": 8}, f1, f2, f2), false, jt, [f1#1, f2#2, f3#3]
   +- OneRowRelation

After this PR:

== Optimized Logical Plan ==
Generate json_tuple({"f1": "Spark", "f2": 2025, "f3": 8}, f2), false, jt, [f2#2]
+- OneRowRelation

Why are the changes needed?

Pruning unused fields from JsonTuple avoids extracting JSON values that the query never reads, which reduces processing overhead and improves query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests covering all scenarios: no outputs used, some outputs used, and all outputs used.

Was this patch authored or co-authored using generative AI tooling?

No.

wangyum · 2025-08-05T07:09:33Z

cc @cloud-fan

cloud-fan · 2025-08-07T03:47:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+              }.unzip
+            p.withNewChildren(Seq(g.copy(
+              generator = JsonTuple(originJsonTuple.children.head +: newJsonExpressions),
+              unrequiredChildIndex = Nil,


We didn't do this for the ExplodeBase case. Why is it needed here?

wangyum · 2025-08-08

After reading SPARK-21657, we can remove unrequiredChildIndex = Nil here.
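
For context on this exchange, here is a minimal sketch of what `unrequiredChildIndex` controls, as I understand it: the indices of child output columns that the Generate node does not need to pass through. `GenerateModel` and `UnrequiredChildIndexDemo` are hypothetical names for illustration, not Spark classes, and the column names are made up. The sketch suggests why resetting the field to `Nil` while pruning the generator would bring back child columns that an earlier pruning step had already dropped.

```scala
// Simplified model of a Generate node's output columns; NOT Spark's real class.
final case class GenerateModel(
    childOutput: Seq[String],        // columns produced by the child plan
    unrequiredChildIndex: Seq[Int],  // positions in childOutput that need not be passed through
    generatorOutput: Seq[String]) {  // columns produced by the generator

  // Child columns that survive this node: everything not marked unrequired.
  def requiredChildOutput: Seq[String] = {
    val unrequired = unrequiredChildIndex.toSet
    childOutput.zipWithIndex.collect { case (col, i) if !unrequired.contains(i) => col }
  }

  def output: Seq[String] = requiredChildOutput ++ generatorOutput
}

object UnrequiredChildIndexDemo {
  // The raw "json" column (index 1) is already marked as not required downstream.
  val original: GenerateModel =
    GenerateModel(Seq("id", "json"), Seq(1), Seq("f1", "f2", "f3"))

  // Pruning the generator while leaving unrequiredChildIndex untouched.
  val pruned: GenerateModel = original.copy(generatorOutput = Seq("f2"))

  // Pruning the generator and also resetting unrequiredChildIndex to Nil.
  val reset: GenerateModel =
    original.copy(generatorOutput = Seq("f2"), unrequiredChildIndex = Nil)
}
```

Under these assumptions, `pruned.output` contains only `id` and `f2`, while `reset.output` also carries the `json` column again.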

github-actions bot added the SQL label Aug 5, 2025

LuciferYang reviewed Aug 6, 2025

cloud-fan reviewed Aug 7, 2025

wangyum added 2 commits August 7, 2025 22:34

prune unnecessary fields from JsonTuple

b4bbf79

Remove unrequiredChildIndex = Nil

e4d76d0

wangyum force-pushed the SPARK-53124 branch from 33aaf5c to e4d76d0 Compare August 8, 2025 02:30

cloud-fan approved these changes Aug 8, 2025
