df.show() would cause multi runs of operations #4966

xy-xin · 2025-08-13T02:39:30Z

Suppose we have two UDFs: udf1 and udf2, and the following code:

df = xxx

df = df.with_column("output1", udf1(col("input1")))
**df.show()**

df = df.with_column("output2", udf2(col("input2")))
**df.show()**

In this case, udf1 runs twice, once for each df.show() call, which is not what the user expects.
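The repeated execution can be reproduced with a toy model of lazy evaluation. This is a pure-Python sketch, not Daft's implementation: `ToyLazyFrame`, its methods, and the `calls` counter are hypothetical names made up for illustration. The only assumption is that `show()` re-runs the whole stored plan each time it is called.

```python
from collections import Counter

calls = Counter()  # how many rows each toy UDF has processed


def udf1(x):
    calls["udf1"] += 1
    return x + 1


def udf2(x):
    calls["udf2"] += 1
    return x * 2


class ToyLazyFrame:
    """Hypothetical stand-in for a lazy DataFrame: it stores a plan, not results."""

    def __init__(self, rows, steps=()):
        self.rows = rows            # source data (list of dicts)
        self.steps = tuple(steps)   # deferred (output, fn, input) projections

    def with_column(self, name, fn, source):
        # Nothing executes here; the step is only appended to the plan.
        return ToyLazyFrame(self.rows, self.steps + ((name, fn, source),))

    def show(self):
        # Every call re-executes the full plan from the source rows.
        out = []
        for row in self.rows:
            row = dict(row)
            for name, fn, source in self.steps:
                row[name] = fn(row[source])
            out.append(row)
        return out


df = ToyLazyFrame([{"input1": 1, "input2": 10}])

df = df.with_column("output1", udf1, "input1")
df.show()  # runs udf1

df = df.with_column("output2", udf2, "input2")
df.show()  # runs udf1 again, then udf2

print(dict(calls))  # {'udf1': 2, 'udf2': 1}
```

With a single input row, the counter shows udf1 invoked twice even though its column was defined only once.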

We can work around this by using collect():

df = xxx

df = df.with_column("output1", udf1(col("input1")))
**df.collect()**
**df.show()**

df = df.with_column("output2", udf2(col("input2")))
**df.show()**

However, this requires the user to have experience with how Daft works, which can be challenging for algorithm developers.
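Why materializing first avoids the rerun can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not Daft's code: `ToyLazyFrame` and its methods are hypothetical, and the model assumes that `collect()` caches the computed rows on the frame so that later steps build on the cached result instead of the original plan.

```python
calls = {"udf1": 0, "udf2": 0}  # how many rows each toy UDF has processed


def udf1(x):
    calls["udf1"] += 1
    return x + 1


def udf2(x):
    calls["udf2"] += 1
    return x * 2


class ToyLazyFrame:
    """Hypothetical lazy frame whose collect() caches results in place."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def with_column(self, name, fn, source):
        # Deferred: only records the step.
        return ToyLazyFrame(self.rows, self.steps + ((name, fn, source),))

    def _run(self):
        out = []
        for row in self.rows:
            row = dict(row)
            for name, fn, source in self.steps:
                row[name] = fn(row[source])
            out.append(row)
        return out

    def collect(self):
        # Materialize once: the computed rows become the new source
        # and the pending plan is emptied (assumed caching behavior).
        self.rows = self._run()
        self.steps = ()
        return self

    def show(self):
        return self._run()


df = ToyLazyFrame([{"input1": 1, "input2": 10}])

df = df.with_column("output1", udf1, "input1")
df.collect()  # udf1 runs here, once
df.show()     # served from the cached rows

df = df.with_column("output2", udf2, "input2")
df.show()     # only udf2 runs

print(calls)  # {'udf1': 1, 'udf2': 1}
```

The trade-off in this model is memory: the materialized rows are held by the frame for as long as it is alive.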

In short, users need to understand how Daft works and what happens when show() is called; otherwise, this behavior is likely to confuse them.

srilman (Maintainer) · 2025-08-13T20:18:39Z

I think the misunderstanding here is that df.show() does more than just run df. It also applies a limit to the end of the query plan.

So if df represents a plan like

ParquetScan -> Projection -> UDF

then df.show() will actually run

ParquetScan -> Limit -> Projection -> UDF

The limit is near the front because of limit pushdown. So it is not always possible to modify df, run df.show() again, and expect it to do the same thing. For example, if we added a groupby, we couldn't push the limit all the way through, and the plan would be too different.
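The pushdown described above can be sketched on a plan written as a list of operator names. This is an illustrative toy, not Daft's optimizer: `add_limit_with_pushdown` and the `ROW_PRESERVING` set are hypothetical, and the sketch assumes that a limit may move in front of any operator that maps each input row to exactly one output row.

```python
# Operators assumed to produce exactly one output row per input row,
# so a Limit placed before them yields the same rows as one placed after.
ROW_PRESERVING = {"Projection", "UDF"}


def add_limit_with_pushdown(plan):
    """Append a Limit to `plan` (source-to-sink order) and push it toward the source."""
    pos = len(plan)
    # Move the Limit past row-preserving operators; stop at the scan (index 0)
    # or at the first operator that changes the number of rows.
    while pos > 1 and plan[pos - 1] in ROW_PRESERVING:
        pos -= 1
    return plan[:pos] + ["Limit"] + plan[pos:]


simple = add_limit_with_pushdown(["ParquetScan", "Projection", "UDF"])
print(" -> ".join(simple))
# ParquetScan -> Limit -> Projection -> UDF

grouped = add_limit_with_pushdown(["ParquetScan", "Projection", "GroupBy", "UDF"])
print(" -> ".join(grouped))
# ParquetScan -> Projection -> GroupBy -> Limit -> UDF
```

In the second plan the limit is stuck behind the groupby, so everything upstream of it has to process the full input.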

In your example, it is probably fine, but we would need to somehow save the plan and the shown output together, which we don't do, in order to save memory. Wdyt?

xy-xin (Author) · 2025-08-18T13:44:15Z

Thanks for your reply, @srilman

I understand and agree with your point. What I wanted to highlight is that, in some cases, calling df.show() twice may trigger two executions of the UDF, which is especially problematic if the UDF has side effects. Algorithm developers often do not have much experience with how Daft (or other SQL-like engines) works internally, and this behavior can be confusing for them.

If we don’t have a good solution to this issue and decide to address it through better documentation and user guidance instead, I’m fine with that approach as well.
