Skip to content

Conversation

@aitap
Copy link
Contributor

@aitap aitap commented Dec 29, 2024

Ongoing work towards #6180. Depends on #6694. Detailed motivation in a pending blog post.

Good news:

* checking compiled code ... NOTE
File ‘data.table/libs/data_table.so’:
  Found non-API call to R: ‘LEVELS’ # <-- and that's it!

Bad news:

Test 2289.2 ran without errors but failed check that x equals y:
> x = dt[, b, a] 
   a b  [Key= Types=int,lis Classes=int,lis]
1: 1 1                                      
2: 2 2                                      
> y = dt 
   a b  [Key= Types=int,exp Classes=int,exp]
1: 1 1                                      
2: 2 2                                      
Datasets have different column modes. First 3: b(list!=expression)

Unfortunately, since there is no altexpression ALTREP class, best I was able to support EXPRSXP columns was by creating VECSXP lists for their contents.

How long do we have and does anyone have a plan B?

  • Make sure to coordinate with SebKrantz to avoid breaking collapse

The resulting data.tables now have GROWABLE_BIT set, therefore:
 - the finalizer is not needed on R >= 3.4
 - duplicates of data.tables (which are not over-allocated) now have
   TRUELENGTH of 0 instead of whatever it was before, which is detected
   earlier in selfrefok()

As a result, assign.c only uses TRUELENGTH on R < 3.4.
Since dogroups() relies on being able to shrink vectors it hasn't
allocated, introduce make_growable() and is_growable() to adapt. Since
dogroups() relies on SD and SDall having shared columns, use the
setgrowable() wrapper on the R side at the time when SD and SDall are
being created.

(In the ALTREP case, setgrowable() will re-create the columns.)
This helps avoid false positive warnings about unreachable places in the code.
Currently broken:
- 1 errors out of 11637
 - won't work with expression vectors at all
@ben-schwen
Copy link
Member

We did not support expressions prior to #5596 and still not fully it support. My guess is that nobody really uses this anyway

// out-of-memory. In that case the user will receive hard halt and know to rerun.
if (length(newcolnames)) {
oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.
oldtncol = is_growable(dt) ? growable_max_size(dt) : 0; // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
oldtncol = is_growable(dt) ? growable_max_size(dt) : 0; // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.
oldtncol = truelength(x); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.

// Can growVector at this point easily enough, but it shouldn't happen in first place so leave it as
// strong error message for now.
else if (TRUELENGTH(names) != oldtncol)
else if (growable_max_size(names) != oldtncol)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also use truelength(names) here?

Comment on lines -19306 to -19452
# segfault when selfref is not ok before set #6410
df = data.frame(a=1:3)
setDT(df)
attr(df, "att") = 1
test(2291.1, set(df, NULL, "new", "new"), error="attributes .* have been reassigned")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this will only bite us on R < 4.3.0 and will be detected earlier on the ALTREP version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By setting GROWABLE_BIT on R ≥ 3.4.0 for data.table lists, we prevent the situation where a shallow duplicate of a data.table has a misleading TRUELENGTH. When duplicating a growable list, R will notice the bit and reset TRUELENGTH to 0. This causes set to fail an earlier check for non-zero TRUELENGTH instead of failing the .internal.selfref check, so the error is different ("This data.table has either been loaded from disk...").

Perhaps the test should account for that, especially if we're moving from R-3.3 and thus will no longer hit the old behaviour (where "attributes .* have been reassigned" is still reachable).

Co-Authored-By: Benjamin Schwendinger <[email protected]>
@aitap
Copy link
Contributor Author

aitap commented Mar 22, 2025

Thank you very much for sparing the time to review!

The biggest problem here are the crashes from performing ALTREP operations on objects that the current code wrongly assumes to be resizeable (exposed by the atime check). The second biggest problem is the dependency on #6694 to make data.table not allergic to ALTREP, which has performance problems.

(void)call; \
indx = PROTECT(coerceVector(indx, REALSXP)); \
double * ii = REAL(indx); \
R_xlen_t rlen = XLENGTH(indx), mylen = XLENGTH(x); \
Copy link
Member

@ben-schwen ben-schwen Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I'm misunderstanding extract_subset but shouldn't this only be valid for the actual size and not the full overallocated length?

Hence, mylen = altall_Length(x) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XLENGTH(x) goes through the ALTREP dispatch and lands in altall_Length, but we could definitely save a function call and make the code clearer.

@ben-schwen ben-schwen force-pushed the growable_refactor branch 2 times, most recently from 5927188 to c16fd97 Compare September 25, 2025 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants