@tshauck commented Dec 14, 2023

Which issue does this PR close?

Closes #8545

Rationale for this change

Updating the docs to match the exciting UDTF addition.

What changes are included in this PR?

Updates the UDF library doc and makes a minor style update to the example.

Are these changes tested?

[Two images attached]

Are there any user-facing changes?

Yes, public docs update.

@alamb left a comment

Thank you @tshauck -- this is great. I think it might help to add a little more motivation about why UDTFs are so cool and what types of things you can do with them, but we can also do that as a follow-on PR. This PR is a great step forward.

🚀


A User-Defined Table Function (UDTF) is a function that takes parameters and returns a `TableProvider`.

Because we're returning a `TableProvider`, in this example we'll use the `MemTable` data source to represent a table. This is a simple struct that holds a set of RecordBatches in memory and treats them as a table. In your case, this would be replaced with your own struct that implements `TableProvider`. See the [example][4] for a working example that reads from a CSV file.
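
The shape described above (parameters in, table out) can be sketched without any dependencies. This is a simplified model and not DataFusion's API: the `Table` and `TableFunction` traits, `InMemoryTable`, `OneToN`, and `even_values` are hypothetical stand-ins for `TableProvider`, the table function trait, `MemTable`, and a user-written UDTF. The real `TableProvider` is asynchronous and Arrow-based, and its signatures differ.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a table: anything that exposes columns and rows.
trait Table {
    fn column_names(&self) -> Vec<String>;
    fn rows(&self) -> Vec<Vec<i64>>;
}

/// In-memory table, playing the role that `MemTable` plays in the docs.
struct InMemoryTable {
    columns: Vec<String>,
    data: Vec<Vec<i64>>,
}

impl Table for InMemoryTable {
    fn column_names(&self) -> Vec<String> {
        self.columns.clone()
    }

    fn rows(&self) -> Vec<Vec<i64>> {
        self.data.clone()
    }
}

/// Hypothetical stand-in for a table function: takes parameters, returns a table.
trait TableFunction {
    fn call(&self, args: &[i64]) -> Result<Arc<dyn Table>, String>;
}

/// `one_to_n(n)` returns a single-column table holding the values 1..=n.
struct OneToN;

impl TableFunction for OneToN {
    fn call(&self, args: &[i64]) -> Result<Arc<dyn Table>, String> {
        // Validate the parameters before building the table.
        let n = match args {
            [n] if *n >= 0 => *n,
            _ => return Err("one_to_n expects one non-negative integer".to_string()),
        };
        let data = (1..=n).map(|v| vec![v]).collect();
        Ok(Arc::new(InMemoryTable {
            columns: vec!["value".to_string()],
            data,
        }))
    }
}

/// The result is an ordinary table, so callers can filter it like any other.
fn even_values(table: &dyn Table) -> Vec<i64> {
    table
        .rows()
        .into_iter()
        .map(|row| row[0])
        .filter(|v| v % 2 == 0)
        .collect()
}
```

In this model, `OneToN.call(&[5])` yields a five-row table with one column named `value`, and passing that table to `even_values` keeps only 2 and 4. The point of the design is that the function's caller never needs to know how the table was produced.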

Maybe we can add some other examples of things one could do, for example

parse_url('http://foo.com')

Or point at the parquet_metadata function in datafusion-cli and note that the output of the table function can be processed like the output of any other table.

For example

❯ select filename, row_group_id, row_group_num_rows, row_group_bytes, stats_min, stats_max from parquet_metadata('./benchmarks/data/hits.parquet') where  column_id = 17 limit 10;
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
| filename                       | row_group_id | row_group_num_rows | row_group_bytes | stats_min | stats_max |
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
| ./benchmarks/data/hits.parquet | 0            | 450560             | 188921521       | 0         | 73256     |
| ./benchmarks/data/hits.parquet | 1            | 612174             | 210338885       | 0         | 109827    |
| ./benchmarks/data/hits.parquet | 2            | 344064             | 161242466       | 0         | 122484    |
| ./benchmarks/data/hits.parquet | 3            | 606208             | 235549898       | 0         | 121073    |
| ./benchmarks/data/hits.parquet | 4            | 335872             | 137103898       | 0         | 108996    |
| ./benchmarks/data/hits.parquet | 5            | 311296             | 145453612       | 0         | 108996    |
| ./benchmarks/data/hits.parquet | 6            | 303104             | 138833963       | 0         | 108996    |
| ./benchmarks/data/hits.parquet | 7            | 303104             | 191140113       | 0         | 73256     |
| ./benchmarks/data/hits.parquet | 8            | 573440             | 208038598       | 0         | 95823     |
| ./benchmarks/data/hits.parquet | 9            | 344064             | 147838157       | 0         | 73256     |
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+

Thanks for the feedback! I just pushed a090783, which expands a bit on why UDTFs are useful and adds the parquet metadata use case, since it shows their value for interactive analysis.

@alamb added the documentation and devrel labels Dec 15, 2023
@github-actions bot removed the documentation label Dec 15, 2023
@alamb merged commit b7fde3c into apache:main Dec 15, 2023

alamb commented Dec 15, 2023

Thanks again @tshauck
