Further optimisation of .SD in j #735

@arunsrinivasan

Description

In #370 .SD was optimised internally for cases like:

require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)
DT[, c(sum(x), lapply(.SD, mean)), by=id]
#    id V1 x  y  z
#1:  1  6 2  8 14
#2:  2 15 5 11 17

You can see that it's optimised by turning verbose on:

options(datatable.verbose=TRUE)
DT[, c(sum(x), lapply(.SD, mean)), by=id]
# Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
# lapply optimization changed j from 'c(sum(x), lapply(.SD, mean))' to 'list(sum(x), mean(x), mean(y), mean(z))'
# GForce optimized j to 'list(gsum(x), gmean(x), gmean(y), gmean(z))'
options(datatable.verbose=FALSE)

However, such expressions are not always optimised. For example:

options(datatable.verbose=TRUE)
DT[, c(.SD[1], lapply(.SD, mean)), by=id]
# Finding groups (bysameorder=FALSE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'c(.SD[1], lapply(.SD, mean))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# ...
#    id x  y  z x  y  z
#1:  1 1  7 13 2  8 14
#2:  2 4 10 16 5 11 17
options(datatable.verbose=FALSE)

This is because .SD cases are a little trickier to optimise. To begin with, if the .SD subset supplies a j expression as well, then it can't be optimised:

DT[, c(xx=.SD[1, x], lapply(.SD, mean)), by=id]
#    id xx x  y  z
#1:  1  1 2  8 14
#2:  2  4 5 11 17

The above expression cannot be changed to list(..) (as far as I understand).

And even when there's no j, .SD can take i arguments of type integer, numeric, or logical, as well as expressions and even data.tables. For example:

DT[, c(.SD[x > 1 & y > 9][1], lapply(.SD, mean)), by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

If we optimised this naively, it would turn into:

DT[, list(x=x[x>1 & y > 9][1], y=y[x>1 & y>9][1], z=z[x>1 & y>9][1], x=mean(x), y=mean(y), z=mean(z)), by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

which is not really efficient, as it evaluates the subset expression (a vector scan) once per column; that gets quite slow as the number of columns grows. A better way to do it would be:

DT[, {tmp = x > 1 & y > 9; list(x=x[tmp][1], y=y[tmp][1], z=z[tmp][1], x=mean(x), y=mean(y), z=mean(z))}, by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

which is a little tricky to implement.

If i is a join, then it must not be optimised either, etc.
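To make the join case concrete, here is a sketch using a hypothetical lookup table ix (not from the original issue). The inner .SD[ix, on="x"] performs a join against each group's subset, which the column-by-column list(..) rewrite can't express:

```r
require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)
ix = data.table(x = c(1L, 4L))  # hypothetical lookup table
# the i here is a join inside each group, so it must not be optimised
DT[, c(.SD[ix, on="x"][1], lapply(.SD, mean)), by=id]
```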

Basically, .SD and .SD[...] should be optimised case by case, handling each scenario separately:

Optimise (for possible cases):

  • .SD
  • DT[, c(.SD, lapply(.SD, ...)), by=.]
  • DT[, c(.SD[1], lapply(.SD, ...)), by=.]
  • .SD[1L] # no j
  • .SD[1]
  • .SD[logical]
  • .SD[a] # where a is integer
  • .SD[a] # where a is numeric
  • all of the above, but with a trailing comma, e.g. .SD[1,]
  • .SD[x > 1 & y > 9]
  • .SD[data.table] # shouldn't / can't be optimised, IMO
  • .SD[character] # shouldn't / can't be optimised, IMO
  • .SD[eval(.)] # might be possible in some cases
  • .SD[i, j] # shouldn't / can't be optimised, IMO
  • DT[, c(list(.), lapply(.SD, ...)), by=.]
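For instance, the last pattern covers common idioms such as adding a group count next to the per-column aggregates (a sketch; list(n=.N) stands in for the list(.) placeholder):

```r
require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)
# an extra list() element alongside the .SD aggregate
DT[, c(list(n=.N), lapply(.SD, mean)), by=id]
```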

All of these throw an error at the moment:

  • DT[, c(data.table(.), lapply(.SD, ...)), by=.]
  • DT[, c(as.data.table(.), lapply(.SD, ...)), by=.]
  • DT[, c(data.frame(.), lapply(.SD, ...)), by=.]
  • DT[, c(as.data.frame(.), lapply(.SD, ...)), by=.]

Note that all these can occur on the right side of lapply(.SD, ...) as well.
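For example, flipping an earlier example puts the .SD subset on the right of lapply(.SD, ...):

```r
require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)
# same expression as before, but with .SD[1] after the lapply(.SD, ...) part
DT[, c(lapply(.SD, mean), .SD[1]), by=id]
```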

Labels: GForce (issues relating to optimized grouping calculations), performance