Conversation

hvanhovell
Contributor

@hvanhovell hvanhovell commented Aug 16, 2024

What changes were proposed in this pull request?

This makes Column API implementation agnostic. We do this by:

  • Removing Column.expr. This has been replaced in the source by either the use of Column itself, the use of an Expression that wraps a ColumnNode, or by (implicit) conversions.
  • Removing Column.apply(e: Expression). This has been replaced in the source by the ExpressionUtils.column (implicit) method, or by the use of Column.
  • Removing TypedColumn.withTypedColumn(..). This has been replaced by direct calls to TypedAggUtils.withInputType(...).
  • Removing Column.named and Column.generateAlias. These have been moved to ExpressionUtils.
  • Making a bunch of pandas and arrow operators use a Column instead of an Expression.
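For context, a rough sketch of the replacement pattern described above (the `ExpressionUtils` package path is assumed, and this compiles only inside a Spark build; it is not a verified excerpt of the PR):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression
// Assumed location of the helper referenced in the bullet list above.
import org.apache.spark.sql.internal.ExpressionUtils.column

// Before this PR, internal code wrapped Catalyst expressions directly:
//   val c: Column = Column(myExpr)   // Column.apply(Expression), removed
//   val e: Expression = c.expr       // Column.expr, removed

// After this PR, the (implicit) ExpressionUtils.column conversion does the
// wrapping, keeping the public Column API free of Catalyst types.
def wrap(e: Expression): Column = column(e)
```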

Why are the changes needed?

This is one of the last steps in our effort to unify the Scala Column API for Classic and Connect.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

// Before: explicit parsing against the session's SQL parser.
select(exprs.map { expr =>
  Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
}: _*)

// After: functions.expr performs the parsing.
select(exprs.map(functions.expr): _*)
Contributor Author


Expression parsing will use this SparkSession's parser, so there is no need to parse explicitly anymore.

@hvanhovell hvanhovell changed the title [WIP][SPARK-49025][CONNECT] Remove Column.expr and Column.apply(Expression) from Column [SPARK-49025][CONNECT] Make Column implementation agnostic Aug 17, 2024
.elementType.asInstanceOf[StructType].fieldNames
assert(fieldNames4.toSeq === Seq("0", "1"))
}
val df = Seq((Seq(9001, 9002, 9003), Seq(4, 5, 6))).toDF("val1", "val2")
Contributor Author


I mostly rewrote this because it was doing some crazy stuff... If you want to validate a schema post-optimization, check the schema directly instead of writing a file and making all kinds of assumptions about the expression tree structure.
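A hedged sketch of the validation style described here, using the test's `val1`/`val2` DataFrame (the `arrays_zip` usage and field names are assumed from the surrounding test, not copied from the PR):

```scala
import org.apache.spark.sql.functions.arrays_zip
import org.apache.spark.sql.types.{ArrayType, StructType}
import spark.implicits._

// Inspect the DataFrame's schema directly, rather than writing a file and
// pattern-matching on the optimized expression tree.
val df = Seq((Seq(9001, 9002, 9003), Seq(4, 5, 6))).toDF("val1", "val2")
val zipped = df.select(arrays_zip($"val1", $"val2").as("zipped"))
val fieldNames = zipped.schema("zipped").dataType
  .asInstanceOf[ArrayType]
  .elementType.asInstanceOf[StructType]
  .fieldNames
// arrays_zip preserves the names of plain column references.
assert(fieldNames.toSeq === Seq("val1", "val2"))
```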

@HyukjinKwon
Member

Took a quick look, and LGTM at a high level.

@hvanhovell
Contributor Author

Merging this.

@EnricoMi
Contributor

Worked around removal of access to Column.expr in G-Research/spark-extension#257.

github-merge-queue bot pushed a commit to G-Research/spark-extension that referenced this pull request Aug 22, 2024
vkorukanti added a commit to delta-io/delta that referenced this pull request Sep 11, 2024
Spark-master based build broken by change
apache/spark#47785

---------

Co-authored-by: Thang Long VU <[email protected]>
Co-authored-by: Thang Long Vu <[email protected]>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
### What changes were proposed in this pull request?
This makes Column API implementation agnostic. We do this by:
- Removing `Column.expr`. This has been replaced in the source by either the use of `Column` itself, the use of an Expression that wraps a ColumnNode, or by (implicit) conversions.
- Removing `Column.apply(e: Expression)`. This has been replaced in the source by the `ExpressionUtils.column` (implicit) method, or by the use of `Column`.
- Removing `TypedColumn.withTypedColumn(..)`. This has been replaced by direct calls to `TypedAggUtils.withInputType(...)`.
- Removing `Column.named` and `Column.generateAlias`. These have been moved to `ExpressionUtils`.
- Making a bunch of pandas and arrow operators use a Column instead of an Expression.

### Why are the changes needed?
This is one of the last steps in our effort to unify the Scala Column API for Classic and Connect.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47785 from hvanhovell/SPARK-49025.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
asfgit pushed a commit that referenced this pull request Mar 1, 2025
…ncoder fixes

### What changes were proposed in this pull request?

4.0.0-preview2 introduced, as part of SPARK-49025 PR #47785, changes that drive ExpressionEncoder derivation purely from AgnosticEncoders. This PR adds a trait:

```scala
DeveloperApi
trait AgnosticExpressionPathEncoder[T]
  extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}
```

and hooks in the De/SerializationBuildHelper matches to allow seamless extension of non-connect custom encoders (such as [frameless](https://github.com/typelevel/frameless) or [sparksql-scalapb](https://github.com/scalapb/sparksql-scalapb)).

SPARK-49960 provides the same information.

Additionally, this PR provides the fixes necessary to use TransformingEncoder as a root encoder with an OptionalEncoder, and to use it as an ArrayType element or a MapType key/value.
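As a hedged sketch, a library encoder plugging into this hook might look like the following (the package path and the `AgnosticEncoder` member names beyond the trait shown above are assumptions, not a verified implementation):

```scala
import scala.reflect.ClassTag
import org.apache.spark.sql.catalyst.encoders.{AgnosticEncoder, AgnosticExpressionPathEncoder}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical custom encoder: delegates type metadata to an existing
// AgnosticEncoder and supplies library-specific (de)serializer expressions.
class CustomPathEncoder[T](
    underlying: AgnosticEncoder[T],
    serializer: Expression => Expression,
    deserializer: Expression => Expression)
  extends AgnosticExpressionPathEncoder[T] {

  override def isPrimitive: Boolean = underlying.isPrimitive
  override def dataType: DataType = underlying.dataType
  override def clsTag: ClassTag[T] = underlying.clsTag

  // Invoked by SerializationBuildHelper with the input expression.
  override def toCatalyst(input: Expression): Expression = serializer(input)

  // Invoked by DeserializationBuildHelper with the path into the row.
  override def fromCatalyst(inputPath: Expression): Expression = deserializer(inputPath)
}
```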

### Why are the changes needed?

Without this change (or something similar) there is no way for custom encoders to integrate with 4.0.0-preview2's derived encoders, something that has worked, and that developers have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a working tuple encoder (the provided ExpressionEncoder is now discarded under preview2). Supplying a custom AgnosticEncoder under preview2 also fails, because only the built-in preview2 AgnosticEncoders are handled in De/SerializationBuildHelper, triggering a MatchError.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A test was added using a "custom" string encoder and joinWith, based on an existing joinWith test. Removing the case statements in either BuildHelper will trigger the MatchError.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50023 from chris-twiner/temp/expressionEncoder_compat_TransformingEncoder_fixes.

Authored-by: Chris Twiner <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 1, 2025
(cherry picked from commit 50a328b)
Signed-off-by: Herman van Hovell <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025