[SPARK-49025][CONNECT] Make Column implementation agnostic #47785
Conversation
- select(exprs.map { expr =>
-   Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
- }: _*)
+ select(exprs.map(functions.expr): _*)
Expression parsing will use this SparkSession's parser, so there is no need to explicitly parse anymore.
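A minimal runnable sketch of the simplification; the `df` and `exprs` values are illustrative assumptions, not the test's code. `functions.expr` wraps the raw SQL string in a Column, and whichever session evaluates the plan parses it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
val exprs = Seq("id + 1", "upper(name)")

// No explicit sparkSession.sessionState.sqlParser.parseExpression call:
// functions.expr defers parsing to the session that evaluates the plan.
df.select(exprs.map(functions.expr): _*).show()
```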
-   .elementType.asInstanceOf[StructType].fieldNames
- assert(fieldNames4.toSeq === Seq("0", "1"))
- }
+ val df = Seq((Seq(9001, 9002, 9003), Seq(4, 5, 6))).toDF("val1", "val2")
I mostly rewrote this because it was doing some crazy stuff... If you want to validate a schema post-optimization, then check the optimized plan itself instead of writing a file and making all kinds of assumptions about the expression tree structure.
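A minimal sketch of the suggested approach, assuming the test only cares about the post-optimization schema (the expected field names here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// Assert on the optimized plan's schema directly, rather than writing a
// file and pattern-matching on the resulting expression tree.
val df = Seq((Seq(9001, 9002, 9003), Seq(4, 5, 6))).toDF("val1", "val2")
val optimizedSchema = df.queryExecution.optimizedPlan.schema
assert(optimizedSchema.fieldNames.toSeq == Seq("val1", "val2"))
```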
Took a quick look, and LGTM at a high level.
Merging this.
Worked around removal of access to |
Required by apache/spark#47785.
Spark-master-based build broken by change apache/spark#47785
---------
Co-authored-by: Thang Long VU <[email protected]>
Co-authored-by: Thang Long Vu <[email protected]>
### What changes were proposed in this pull request?
This makes Column API implementation agnostic. We do this by:
- Removing `Column.expr`. This has been replaced in the source by either the use of `Column` itself, the use of an Expression that wraps a ColumnNode, or by (implicit) conversions.
- Removing `Column.apply(e: Expression)`. This has been replaced in the source by the `ExpressionUtils.column` (implicit) method, or by the use of `Column`.
- Removing `TypedColumn.withTypedColumn(..)`. This has been replaced by direct calls to `TypedAggUtils.withInputType(...)`.
- Removing `Column.named` and `Column.generateAlias`. These have been moved to `ExpressionUtils`.
- Making a bunch of pandas and arrow operators use a Column instead of an Expression.

### Why are the changes needed?
This is one of the last steps in our effort to unify the Scala Column API for Classic and Connect.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47785 from hvanhovell/SPARK-49025.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
…ncoder fixes

### What changes were proposed in this pull request?
4.0.0-preview2 introduced, as part of SPARK-49025 PR #47785, changes that drive ExpressionEncoder derivation purely from AgnosticEncoders. This PR adds a trait:

```scala
@DeveloperApi
trait AgnosticExpressionPathEncoder[T] extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}
```

and hooks into the De/SerializationBuildHelper matches to allow seamless extension by non-Connect custom encoders (such as [frameless](https://github.com/typelevel/frameless) or [sparksql-scalapb](https://github.com/scalapb/sparksql-scalapb)). SPARK-49960 provides the same information. Additionally, this PR provides the fixes necessary to use TransformingEncoder as a root encoder, with an OptionalEncoder, and as an ArrayType or MapType entry/key.

### Why are the changes needed?
Without this change (or something similar) there is no way for custom encoders to integrate with 4.0.0-preview2 derived encoders, something that has worked, and that devs have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a tuple encoder that works (as the provided ExpressionEncoder is now discarded under preview2). Supplying a custom AgnosticEncoder under preview2 also fails, as only the preview2 AgnosticEncoders are supported in De/SerializationBuildHelper, triggering a MatchError.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the case statements in either BuildHelper will trigger the MatchError.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50023 from chris-twiner/temp/expressionEncoder_compat_TransformingEncoder_fixes.

Authored-by: Chris Twiner <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
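A hedged sketch of what a library-side encoder built on this trait might look like. The import path for `AgnosticExpressionPathEncoder`, the set of abstract members filled in, and the encoder itself are assumptions based on the description above, not code from the PR; the conversions mirror Spark's standard String/UTF8String handling:

```scala
import scala.reflect.ClassTag

import org.apache.spark.sql.catalyst.encoders.AgnosticExpressionPathEncoder
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{DataType, ObjectType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical custom string encoder plugging into the new extension point.
case class CustomStringEncoder() extends AgnosticExpressionPathEncoder[String] {
  override def isPrimitive: Boolean = false
  override def dataType: DataType = StringType
  override def clsTag: ClassTag[String] = implicitly[ClassTag[String]]

  // Serializer path: external java.lang.String -> Catalyst UTF8String.
  override def toCatalyst(input: Expression): Expression =
    StaticInvoke(classOf[UTF8String], StringType, "fromString", input :: Nil)

  // Deserializer path: Catalyst UTF8String -> external java.lang.String.
  override def fromCatalyst(inputPath: Expression): Expression =
    Invoke(inputPath, "toString", ObjectType(classOf[String]))
}
```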
What changes were proposed in this pull request?
This makes Column API implementation agnostic. We do this by:
- Removing `Column.expr`. This has been replaced in the source by either the use of `Column` itself, the use of an Expression that wraps a ColumnNode, or by (implicit) conversions.
- Removing `Column.apply(e: Expression)`. This has been replaced in the source by the `ExpressionUtils.column` (implicit) method, or by the use of `Column` (a sketch of this migration follows the list).
- Removing `TypedColumn.withTypedColumn(..)`. This has been replaced by direct calls to `TypedAggUtils.withInputType(...)`.
- Removing `Column.named` and `Column.generateAlias`. These have been moved to `ExpressionUtils`.
- Making a bunch of pandas and arrow operators use a Column instead of an Expression.
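A minimal before/after sketch of the `Column.apply` migration. The package path of `ExpressionUtils` and the `Literal` used here are illustrative assumptions:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.internal.ExpressionUtils

val e = Literal(42)

// Before this PR (now removed):
//   val c = Column(e)               // Column.apply(e: Expression)
// After this PR:
val c: Column = ExpressionUtils.column(e)
```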
Why are the changes needed?
This is one of the last steps in our effort to unify the Scala Column API for Classic and Connect.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.