
Conversation

@aitap
Contributor

@aitap aitap commented Dec 6, 2025

The new resizable vectors API wraps SETLENGTH, TRUELENGTH, and the GROWABLE_BIT. Since all the vectors we want to resize are now growable, this frees us from:

  • having to adjust the allocated memory counts in the finalizer
  • manual adjustment of TRUELENGTH for duplicated vectors (R resets TRUELENGTH to 0 when duplicating a growable vector)

On the other hand,

  • resizable vectors have to be specially allocated
  • resizing drops names, so we have to reinstall them afterwards
  • data.table::truelength must be emulated using R_maxLength.
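The trade-offs listed above can be illustrated with a toy model in plain C. This is only a sketch and deliberately does not use R's headers: `toy_vec`, `toy_alloc_resizable`, `toy_resize`, `toy_duplicate` and `toy_free` are hypothetical names, and the assumption that a fresh resizable vector starts at its full length is mine. The model mimics the behaviour described here: a growable flag, a visible length, and a maximum length that is forgotten on duplication.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R vector; NOT R's real SEXP layout. */
typedef struct {
    size_t length;      /* visible length, like LENGTH() */
    size_t truelength;  /* allocated capacity, like TRUELENGTH() */
    bool growable;      /* stands in for the GROWABLE_BIT */
    int *data;
} toy_vec;

/* Resizable vectors have to be specially allocated: the capacity is
   fixed at allocation time (assumption: length starts at maxlen). */
toy_vec *toy_alloc_resizable(size_t maxlen) {
    toy_vec *v = malloc(sizeof *v);
    if (!v) return NULL;
    v->data = calloc(maxlen ? maxlen : 1, sizeof *v->data);
    if (!v->data) { free(v); return NULL; }
    v->length = maxlen;
    v->truelength = maxlen;
    v->growable = true;
    return v;
}

/* Resize within the capacity; refuse anything else. */
bool toy_resize(toy_vec *v, size_t newlen) {
    assert(v != NULL);
    if (!v->growable || newlen > v->truelength) return false;
    v->length = newlen;
    return true;
}

/* Duplication copies only the visible part and forgets the spare
   capacity, mirroring "TRUELENGTH is reset to 0 when duplicating". */
toy_vec *toy_duplicate(const toy_vec *src) {
    assert(src != NULL);
    toy_vec *v = malloc(sizeof *v);
    if (!v) return NULL;
    v->data = malloc((src->length ? src->length : 1) * sizeof *v->data);
    if (!v->data) { free(v); return NULL; }
    memcpy(v->data, src->data, src->length * sizeof *v->data);
    v->length = src->length;
    v->truelength = 0;
    v->growable = false;
    return v;
}

void toy_free(toy_vec *v) {
    if (v) { free(v->data); free(v); }
}
```

In this model a duplicate can no longer be resized, which is why code that previously patched TRUELENGTH by hand after duplication has nothing left to adjust.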

@jangorecki, I had to introduce copyAsGrowable so that adaptive frollapply would work. Would that be fine, or should we find a way to make that copy earlier?

@TysonStanley, if merged, this will be needed for cherry-picking into the patch branch.

There will be one more batch of fixes in src/dogroups.c that depends on both this PR (for resizable API) and #6694 (for the other use of TRUELENGTH).

Fixes: #990
Closes: #6697

Many thanks to Benjamin Schwendinger for proposing the API and Luke Tierney for making it a reality.

aitap added 7 commits December 6, 2025 19:55
Thanks to Luke Tierney for introducing the API and helping with the migration.
Make sure to set the GROWABLE_BIT on the resizable vectors to avoid
problems when they are duplicated or garbage-collected.
Now that data.table objects have the GROWABLE_BIT set, R will reset TRUELENGTH
when duplicating them, causing our code to take a different branch.
Now that (1) we depend on R >= 3.4 and (2) data.table objects have the
GROWABLE_BIT set, there is no need to adjust allocated memory counts by
hand.
Since adaptive application of rolling functions requires us to resize
the argument to match the window size, make sure to allocate it as such.
- Don't SET_TRUELENGTH by hand. All of our resizable vectors now have
  the GROWABLE_BIT set, so when they are duplicated, TRUELENGTH is reset
  to 0.
- Use a combination of R_isResizable and R_maxLength to replace other
  uses of TRUELENGTH.
@github-actions

github-actions bot commented Dec 6, 2025

No obvious timing issues in HEAD=resizable-API
Comparison Plot

Generated via commit 5caf9cc

Download link for the artifact containing the test results: ↓ atime-results.zip

Task                                       Duration
R setup and installing dependencies        2 minutes and 46 seconds
Installing different package versions      44 seconds
Running and plotting the test cases        3 minutes and 0 seconds

@aitap aitap marked this pull request as draft December 6, 2025 17:29
@codecov

codecov bot commented Dec 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.06%. Comparing base (a014e38) to head (5caf9cc).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7451      +/-   ##
==========================================
- Coverage   99.07%   99.06%   -0.01%     
==========================================
  Files          85       86       +1     
  Lines       16610    16600      -10     
==========================================
- Hits        16456    16445      -11     
- Misses        154      155       +1     

@aitap aitap marked this pull request as ready for review December 6, 2025 18:42
Member

@ben-schwen ben-schwen left a comment

LGTM, TY!

@jangorecki
Member

jangorecki commented Dec 7, 2025

@aitap the place where you allocate as resizable is perfect.

I am not sure about the by.column usage there: for by.column=TRUE it seems that the columns themselves will not be resizable, only the list of pointers to columns. Am I wrong?

src/frollapply.c Outdated
}
return x;
SEXP copyAsGrowable(SEXP x, SEXP by_column) {
if (LOGICAL_RO(by_column)[0])
Member

Guard length(by_column) == 0 or use a different macro. (I can see the R logic requires TRUE or FALSE but it's best not to trust anything about a SEXP coming from R.)
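A defensive check along these lines could look like the sketch below. It is a self-contained toy in plain C rather than code against R's headers: `toy_arg`, `toy_type`, `TOY_NA_LOGICAL` and `toy_as_flag` are hypothetical names, and representing the missing value as `INT_MIN` is an assumption of this model.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Toy stand-in for an argument arriving from R; NOT a real SEXP. */
typedef enum { TOY_LGL, TOY_INT, TOY_LIST } toy_type;

typedef struct {
    toy_type type;
    size_t length;
    const int *lgl;   /* payload, only meaningful when type == TOY_LGL */
} toy_arg;

/* Assumption of this model: the missing logical value is INT_MIN. */
#define TOY_NA_LOGICAL INT_MIN

/* Returns 1 for TRUE, 0 for FALSE and -1 for anything that is not a
   non-missing logical scalar (wrong type, length 0 or > 1, NA). */
int toy_as_flag(const toy_arg *x) {
    if (x == NULL || x->type != TOY_LGL) return -1;
    if (x->length != 1 || x->lgl == NULL) return -1;  /* guards x->lgl[0] */
    if (x->lgl[0] == TOY_NA_LOGICAL) return -1;
    return x->lgl[0] != 0;
}
```

The point is only that the element access happens after the type and length checks, so a zero-length argument never reaches `x->lgl[0]`.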

  for (int i = 0; i < ncol; i++) {
    SEXP d = VECTOR_ELT(dest, i);
-   SETLENGTH(d, nrow);
+   R_resizeVector(d, nrow);
Member

Probably overly defensive but assume R_resizeVector strips all but values. So names might be lost in corner cases.

Member

Lost naming is fine in this use case


int overAlloc = checkOverAlloc(GetOption1(install("datatable.alloccol")));
- SEXP ans = PROTECT(allocVector(VECSXP, LENGTH(cols)+overAlloc)); nprotect++; // doing alloc.col directly here; eventually alloc.col can be deprecated.
+ SEXP ans = PROTECT(R_allocResizableVector(VECSXP, LENGTH(cols)+overAlloc)); nprotect++; // doing alloc.col directly here; eventually alloc.col can be deprecated.
Member

Is the comment still applicable?

Contributor Author

It's been there since 8e272c8. I guess not. Should I remove it altogether?

# define R_duplicateAsResizable(x) R_duplicateAsResizable_(x)
# define R_resizeVector(x, newlen) SETLENGTH(x, newlen)
# define R_maxLength(x) TRUELENGTH(x)
# define R_isResizable(x) R_isResizable_(x)
Member

Does this need

if (ALTREP(x)) return false;

Member

Good catch. We also checked for ALTREP(x) in the proposed version

Contributor Author

We could also teach copyAsPlain to allocate resizable vectors and use it instead of R_duplicateAsResizable, but for now this works.

src/data.table.h Outdated
}
# define R_duplicateAsResizable(x) R_duplicateAsResizable_(x)
# define R_resizeVector(x, newlen) SETLENGTH(x, newlen)
# define R_maxLength(x) TRUELENGTH(x)
Member

return R_isResizable_(x) ? (R_xlen_t)TRUELENGTH(x) : 0;

might be nicer

Contributor Author

@aitap aitap Dec 8, 2025

WRE doesn't document what R_maxLength() returns for a non-resizable vector, but in R-devel this function returns xlength(), not 0 (which necessitates R_isResizable and the LEVELS trick below). So far we're just lucky/careful that the TRUELENGTH() calls replaced with R_maxLength() don't contradict each other too much.

I can make a more precise backport of R_maxLength(), of course.
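For illustration, the semantics described in this comment (maximum length for a resizable vector, plain length otherwise) can be modelled as below. This is a toy sketch in plain C, not the actual backport: `toy_hdr`, `toy_is_resizable` and `toy_max_length` are hypothetical names, and the `growable` flag stands in for whatever test the real R_isResizable performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy vector header; NOT R's real layout. */
typedef struct {
    size_t length;      /* like xlength() */
    size_t truelength;  /* like TRUELENGTH(); may hold stale or unrelated data */
    bool growable;      /* stands in for the resizable check */
} toy_hdr;

bool toy_is_resizable(const toy_hdr *x) {
    return x->growable;
}

/* Models the behaviour attributed to R-devel in this thread:
   a non-resizable vector reports its own length, not 0 and not
   whatever happens to be stored in truelength. */
size_t toy_max_length(const toy_hdr *x) {
    return toy_is_resizable(x) ? x->truelength : x->length;
}
```

Under this model a caller that wants "0 for non-resizable" has to ask `toy_is_resizable` first, which matches the combination of R_isResizable and R_maxLength mentioned in the commit message.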

return ret;
}
# define R_duplicateAsResizable(x) R_duplicateAsResizable_(x)
# define R_resizeVector(x, newlen) SETLENGTH(x, newlen)
Member

@HughParsonage HughParsonage Dec 7, 2025

Should this be guarded by R_isResizable_(x) and newlen <= TRUELENGTH(x), and no attributes?

Member

Yes. It would be cleaner to guard against them, so that we don't rely solely on the R-devel runner to catch them:

if (!R_isResizable(x))
  error(_("not a resizable vector"));
if (newlen > XTRUELENGTH(x))
  error(_("'newlen' is too large"));

Member

I don't think we should introduce extra checks here, this is called in tight loops, and it is better to check before those loops.
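The "check before the loop" approach could be sketched as follows. This is a toy model in plain C under stated assumptions, not code from src/frollapply.c: `toy_vec`, `toy_windows_fit`, `toy_resize_unchecked` and `toy_apply_adaptive` are hypothetical names, and summing the lengths merely stands in for the real per-window work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy resizable vector; NOT R's real layout. */
typedef struct {
    size_t length;
    size_t truelength;  /* capacity */
    bool growable;
    double *data;
} toy_vec;

/* All validation happens here, once, before the hot loop. */
bool toy_windows_fit(const toy_vec *v, const size_t *windows, size_t n) {
    if (!v->growable) return false;
    for (size_t i = 0; i < n; i++)
        if (windows[i] > v->truelength) return false;
    return true;
}

/* Unchecked resize: only safe after toy_windows_fit() succeeded. */
static inline void toy_resize_unchecked(toy_vec *v, size_t newlen) {
    v->length = newlen;
}

/* Resizes v to each window in turn; *total accumulates the lengths as a
   placeholder for the real work. Restores the original length afterwards. */
bool toy_apply_adaptive(toy_vec *v, const size_t *windows, size_t n,
                        size_t *total) {
    if (!toy_windows_fit(v, windows, n)) return false;
    size_t saved = v->length;
    *total = 0;
    for (size_t i = 0; i < n; i++) {
        toy_resize_unchecked(v, windows[i]);  /* no per-iteration checks */
        *total += v->length;
    }
    toy_resize_unchecked(v, saved);
    return true;
}
```

The cost of the guard becomes one extra pass over the window sizes instead of two branches per resize, at the price of having to keep the unchecked helper private to callers that validated first.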

Contributor Author

What's a better benchmark for this, frollapply(adaptive = TRUE), or foo[bar, by = baz]?

Member

Adaptive frollapply will stress it more, unless the grouping is by a unique column, in which case the two are probably similar.

@aitap
Contributor Author

aitap commented Dec 8, 2025

I am not sure about the by.column usage there: for by.column=TRUE it seems that the columns themselves will not be resizable, only the list of pointers to columns. Am I wrong?

by_column here follows the by.column argument of frollapply. When by_column is TRUE, the x argument is a single column, not a list of column pointers. Otherwise x is a list and its elements are columns that need to be made resizable.

If this sounds confusing, let's rename the argument.

Development

Successfully merging this pull request may close these issues.

Reproducible Segfault When Loading/Unloading data.table

4 participants