Lots of code duplication in array_functions (array_replace_all, array_replace_n, array_replace, etc) #7988

@alamb

Describe the bug

There is a significant amount of code generated for array functions.

This both bloats binaries built with DataFusion and slows down compile times.

To Reproduce

cd datafusion/datafusion-cli
cargo bloat
 File  .text     Size                    Crate Name
 0.1%   0.2% 151.2KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_replace_all
 0.1%   0.2% 151.2KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_replace_n
 0.1%   0.2% 151.2KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_replace
 0.1%   0.2% 150.3KiB                  parquet brotli::enc::prior_eval::PriorEval<Alloc>::update_cost_base
 0.1%   0.2% 124.6KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_repeat
 0.1%   0.2% 121.5KiB                   blake2 blake2::Blake2bVarCore::compress
 0.0%   0.1%  81.5KiB                   blake2 blake2::Blake2sVarCore::compress
 0.0%   0.1%  73.2KiB                   blake3 blake3::portable::compress_in_place
 0.0%   0.1%  65.2KiB                chrono_tz <chrono_tz::timezones::Tz as chrono_tz::timezone_impl::TimeSpans>::timespans
 0.0%   0.1%  61.1KiB                sqlparser <sqlparser::ast::Statement as core::fmt::Display>::fmt
 0.0%   0.1%  61.0KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_append
 0.0%   0.1%  61.0KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_prepend
 0.0%   0.1%  60.5KiB                       h2 h2::codec::framed_read::decode_frame
 0.0%   0.1%  59.3KiB               datafusion datafusion::physical_planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}
 0.0%   0.1%  56.4KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_remove_all
 0.0%   0.1%  56.4KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_remove_n
 0.0%   0.1%  56.4KiB datafusion_physical_expr datafusion_physical_expr::array_expressions::array_remove
 0.0%   0.1%  52.4KiB                       h2 h2::frame::headers::HeaderBlock::load::{{closure}}
 0.0%   0.1%  51.4KiB     datafusion_optimizer <datafusion_optimizer::simplify_expressions::expr_simplifier::Simplifier<S> as datafusion_common::tree_node::...
 0.0%   0.1%  48.9KiB datafusion_physical_expr datafusion_physical_expr::datetime_expressions::date_part
35.4%  97.6%  67.1MiB                          And 290367 smaller methods. Use -n N to show more.
36.3% 100.0%  68.7MiB                          .text section size, the file size is 189.3MiB

Expected behavior

I would like the array_replace_all, array_replace_n, and array_replace functions to be implemented in terms of arrow kernels (such as eq and take) and manipulation of offset buffers, rather than by directly building new lists.

For example, the large macro expansion here:
https://github.com/apache/arrow-datafusion/blob/bb1d7f9343532d5fa8df871ff42000fbe836d7d7/datafusion/physical-expr/src/array_expressions.rs#L1431-L1437

I believe this generates a bunch of specialized code for each different list element data type 😢
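
To make the suggestion concrete, here is a minimal std-only sketch of the eq + take idea. It does not use the arrow crate: a list column is modeled as a flat `values` slice plus an `offsets` slice, and `eq_mask`, `take`, `replace_indices`, and this `array_replace_n` signature are all hypothetical stand-ins, not DataFusion or arrow APIs. The generic helpers play the role of the arrow kernels, which in arrow work on type-erased arrays. The point is that the replacement logic in `replace_indices` only sees booleans and indices, so it is written once regardless of element type.

```rust
/// Hypothetical stand-in for an element-wise `eq` kernel:
/// one boolean per flat value.
fn eq_mask<T: PartialEq>(values: &[T], needle: &T) -> Vec<bool> {
    values.iter().map(|v| v == needle).collect()
}

/// Hypothetical stand-in for a `take` kernel: gather `values[i]` per index.
fn take<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}

/// Type-independent part. Row `r` spans `offsets[r]..offsets[r + 1]`
/// (assumes offsets start at 0 and cover all of `mask`).
/// The first `max_per_row` matches in each row are redirected to
/// `replacement_idx`; every other position keeps its own index.
fn replace_indices(
    offsets: &[usize],
    mask: &[bool],
    max_per_row: usize,
    replacement_idx: usize,
) -> Vec<usize> {
    let mut indices = Vec::with_capacity(mask.len());
    for row in offsets.windows(2) {
        let mut replaced = 0;
        for i in row[0]..row[1] {
            if mask[i] && replaced < max_per_row {
                indices.push(replacement_idx);
                replaced += 1;
            } else {
                indices.push(i);
            }
        }
    }
    indices
}

/// Replace the first `n` occurrences of `from` with `to` in every row.
/// Row lengths do not change, so the original offsets stay valid.
fn array_replace_n<T: PartialEq + Clone>(
    offsets: &[usize],
    values: &[T],
    from: &T,
    to: &T,
    n: usize,
) -> Vec<T> {
    let mask = eq_mask(values, from);
    // Append the replacement so `take` can select it by index.
    let mut pool = values.to_vec();
    pool.push(to.clone());
    let indices = replace_indices(offsets, &mask, n, values.len());
    take(&pool, &indices)
}
```

Under this model, array_replace would correspond to `n = 1` and array_replace_all to `n = usize::MAX`, so all three variants could share one code path. For the rows `[1, 2, 1, 1]` and `[1, 3]` (offsets `[0, 4, 6]`), replacing `1` with `9` with `n = 2` gives the flat values `[9, 2, 9, 1, 9, 3]`.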

Additional context

No response
