ARRAY_AGG of column of type list ORDER BY column of type non-list #8512

@giacomorebecchi

Describe the bug

In version 33.0.0, I encountered the following bug (not present in version 32.0.0).
Running an aggregation that applies ARRAY_AGG() to a list-typed column, with an ORDER BY on a non-list column, fails with the following error:
Execution error: Expects values arguments and/or ordering_values arguments to have same size

To Reproduce

I have an MRE in Python:
pip install "pyarrow==14.0.0" "datafusion==33.0.0"

import datetime
import random

import datafusion
import pyarrow as pa
import pyarrow.dataset as pda

N_ROWS = 10_000
N_CARDS = 1_000
N_PRODUCTS = 50

ta = pa.Table.from_pydict(
    {
        "Card.Id": random.choices([str(i) for i in range(N_CARDS)], k=N_ROWS),
        "Date": (datetime.date(2023, (i % 12) + 1, (i % 28) + 1) for i in range(N_ROWS)),
        "Product.Ids": [random.choices([i for i in range(N_PRODUCTS)], k=2) for i in range(N_ROWS)]
    }
)

query = """
SELECT
    "Card.Id"
    , FIRST_VALUE("Product.Ids" ORDER BY "Date")
    , LAST_VALUE("Product.Ids" ORDER BY "Date")
    , ARRAY_AGG("Product.Ids" ORDER BY "Date")
FROM "table"
GROUP BY "Card.Id"
"""

ctx = datafusion.SessionContext()
ctx.register_dataset(name="table",
                     dataset=pda.dataset(ta))
df = ctx.sql(query)
compute_ta = pa.Table.from_batches(df.collect())
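
For reference, here is a minimal plain-Python sketch of the result the query above presumably asks for: rows grouped by "Card.Id", with each group's "Product.Ids" lists collected in ascending "Date" order. It does not use DataFusion at all; the helper name `array_agg_ordered` and the three sample rows are hypothetical and only illustrate the intended semantics.

```python
import datetime
from collections import defaultdict


def array_agg_ordered(rows):
    """Hypothetical helper: group (card_id, date, product_ids) rows by
    card_id and collect product_ids in ascending date order, mimicking
    ARRAY_AGG("Product.Ids" ORDER BY "Date") ... GROUP BY "Card.Id"."""
    groups = defaultdict(list)
    for card_id, date, product_ids in rows:
        groups[card_id].append((date, product_ids))
    # Python's sort is stable, so rows with equal dates keep input order here;
    # SQL itself makes no such promise for ties.
    return {
        card_id: [ids for _, ids in sorted(pairs, key=lambda pair: pair[0])]
        for card_id, pairs in groups.items()
    }


# Illustrative sample data, not taken from the MRE above.
rows = [
    ("a", datetime.date(2023, 3, 1), [7, 8]),
    ("b", datetime.date(2023, 1, 5), [1, 2]),
    ("a", datetime.date(2023, 1, 1), [3, 4]),
]
result = array_agg_ordered(rows)
print(result)  # {'a': [[3, 4], [7, 8]], 'b': [[1, 2]]}
```

Under the same reading, FIRST_VALUE and LAST_VALUE with the same ORDER BY would correspond to the first and last element of each collected list.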

Expected behavior

No response

Additional context

No response

Labels

bug (Something isn't working), regression (Something that used to work no longer does)
