Conversation

hvanhovell
Contributor

@hvanhovell hvanhovell commented Aug 16, 2024

What changes were proposed in this pull request?

This makes Column API implementation agnostic. We do this by:

  • Removing Column.expr. This has been replaced in the source by either the use of Column itself, the use of an Expression that wraps a ColumnNode, or by (implicit) conversions.
  • Removing Column.apply(e: Expression). This has been replaced in the source by the ExpressionUtils.column (implicit) method, or by the use of Column.
  • Removing TypedColumn.withTypedColumn(..). This has been replaced by direct calls to TypedAggUtils.withInputType(...).
  • Removing Column.named and Column.generateAlias. These have been moved to ExpressionUtils.
  • Making a bunch of pandas and arrow operators use a Column instead of an Expression.
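For context, a rough sketch of the replacement pattern described above (the `ExpressionUtils` package path is assumed, and this compiles only inside a Spark build; it is not a verified excerpt of the PR):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression
// Assumed location of the helper referenced in the bullet list above.
import org.apache.spark.sql.internal.ExpressionUtils.column

// Before this PR, internal code wrapped Catalyst expressions directly:
//   val c: Column = Column(myExpr)   // Column.apply(Expression), removed
//   val e: Expression = c.expr       // Column.expr, removed

// After this PR, the (implicit) ExpressionUtils.column conversion does the
// wrapping, keeping the public Column API free of Catalyst types.
def wrap(e: Expression): Column = column(e)
```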

Why are the changes needed?

This is one of the last steps in our effort to unify the Scala Column API for Classic and Connect.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

// Before: explicit parsing against the session's SQL parser.
select(exprs.map { expr =>
  Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
}: _*)

// After: functions.expr performs the parsing.
select(exprs.map(functions.expr): _*)
Contributor Author


Expression parsing will use this SparkSession's parser, so there is no need to parse explicitly anymore.

@hvanhovell hvanhovell changed the title [WIP][SPARK-49025][CONNECT] Remove Column.expr and Column.apply(Expression) from Column [SPARK-49025][CONNECT] Make Column implementation agnostic Aug 17, 2024
.elementType.asInstanceOf[StructType].fieldNames
assert(fieldNames4.toSeq === Seq("0", "1"))
}
val df = Seq((Seq(9001, 9002, 9003), Seq(4, 5, 6))).toDF("val1", "val2")
Contributor Author


I mostly rewrote this because it was doing some crazy stuff... If you want to validate a schema post-optimization, check the schema directly instead of writing a file and making all kinds of assumptions about the expression tree structure.
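A hedged sketch of the validation style described here, using the test's `val1`/`val2` DataFrame (the `arrays_zip` usage and field names are assumed from the surrounding test, not copied from the PR):

```scala
import org.apache.spark.sql.functions.arrays_zip
import org.apache.spark.sql.types.{ArrayType, StructType}
import spark.implicits._

// Inspect the DataFrame's schema directly, rather than writing a file and
// pattern-matching on the optimized expression tree.
val df = Seq((Seq(9001, 9002, 9003), Seq(4, 5, 6))).toDF("val1", "val2")
val zipped = df.select(arrays_zip($"val1", $"val2").as("zipped"))
val fieldNames = zipped.schema("zipped").dataType
  .asInstanceOf[ArrayType]
  .elementType.asInstanceOf[StructType]
  .fieldNames
// arrays_zip preserves the names of plain column references.
assert(fieldNames.toSeq === Seq("val1", "val2"))
```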

@HyukjinKwon
Member

Took a quick look, and LGTM at a high level.

@hvanhovell
Contributor Author

Merging this.

@EnricoMi
Contributor

Worked around removal of access to Column.expr in G-Research/spark-extension#257.

github-merge-queue bot pushed a commit to G-Research/spark-extension that referenced this pull request Aug 22, 2024
vkorukanti added a commit to delta-io/delta that referenced this pull request Sep 11, 2024
Spark-master based build broken by change
apache/spark#47785

---------

Co-authored-by: Thang Long VU <[email protected]>
Co-authored-by: Thang Long Vu <[email protected]>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
### What changes were proposed in this pull request?
This makes Column API implementation agnostic. We do this by:
- Removing `Column.expr`. This has been replaced in the source by either the use of `Column` itself, the use of an Expression that wraps a ColumnNode, or by (implicit) conversions.
- Removing `Column.apply(e: Expression)`. This has been replaced in the source by the `ExpressionUtils.column` (implicit) method, or by the use of `Column`.
- Removing `TypedColumn.withTypedColumn(..)`. This has been replaced by direct calls to `TypedAggUtils.withInputType(...)`.
- Removing `Column.named` and `Column.generateAlias`. These have been moved to `ExpressionUtils`.
- Making a bunch of pandas and arrow operators use a Column instead of an Expression.

### Why are the changes needed?
This is one of the last steps in our effort to unify the Scala Column API for Classic and Connect.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47785 from hvanhovell/SPARK-49025.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
asfgit pushed a commit that referenced this pull request Mar 1, 2025
…ncoder fixes

### What changes were proposed in this pull request?

4.0.0-preview2 introduced, as part of SPARK-49025 PR #47785, changes that drive ExpressionEncoder derivation purely from AgnosticEncoders. This PR adds a trait:

```scala
DeveloperApi
trait AgnosticExpressionPathEncoder[T]
  extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}
```

and hooks in the De/SerializationBuildHelper matches to allow seamless extension of non-connect custom encoders (such as [frameless](https://github.com/typelevel/frameless) or [sparksql-scalapb](https://github.com/scalapb/sparksql-scalapb)).

SPARK-49960 provides the same information.

Additionally, this PR provides the fixes necessary to use TransformingEncoder as a root encoder with an OptionalEncoder, and to use it as an ArrayType element or a MapType key/value.
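As a hedged sketch, a library encoder plugging into this hook might look like the following (the package path and the `AgnosticEncoder` member names beyond the trait shown above are assumptions, not a verified implementation):

```scala
import scala.reflect.ClassTag
import org.apache.spark.sql.catalyst.encoders.{AgnosticEncoder, AgnosticExpressionPathEncoder}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical custom encoder: delegates type metadata to an existing
// AgnosticEncoder and supplies library-specific (de)serializer expressions.
class CustomPathEncoder[T](
    underlying: AgnosticEncoder[T],
    serializer: Expression => Expression,
    deserializer: Expression => Expression)
  extends AgnosticExpressionPathEncoder[T] {

  override def isPrimitive: Boolean = underlying.isPrimitive
  override def dataType: DataType = underlying.dataType
  override def clsTag: ClassTag[T] = underlying.clsTag

  // Invoked by SerializationBuildHelper with the input expression.
  override def toCatalyst(input: Expression): Expression = serializer(input)

  // Invoked by DeserializationBuildHelper with the path into the row.
  override def fromCatalyst(inputPath: Expression): Expression = deserializer(inputPath)
}
```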

### Why are the changes needed?

Without this change (or something similar) there is no way for custom encoders to integrate with 4.0.0-preview2's derived encoders, something that has worked, and that developers have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a working tuple encoder (the provided ExpressionEncoder is now discarded under preview2). Supplying a custom AgnosticEncoder under preview2 also fails, because only the built-in preview2 AgnosticEncoders are handled in De/SerializationBuildHelper, triggering a MatchError.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the case statements in either BuildHelper will trigger the MatchError.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50023 from chris-twiner/temp/expressionEncoder_compat_TransformingEncoder_fixes.

Authored-by: Chris Twiner <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 1, 2025
(cherry picked from commit 50a328b)
Signed-off-by: Herman van Hovell <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025