Skip to content

Conversation

haberdashPI
Copy link
Member

@haberdashPI haberdashPI commented Jun 6, 2022

This defines two functions that are handy for computing joins over time spans:

  • interval_join and
  • groupby_interval_join

Rows match in this join if their time spans overlap.

There is also a simple utility function (quantile_windows) for joining over regularly spaces intervals.

Remaining actions:

Will be releasing this as a 0.0.1 release, and add a 0.1.0 release once invenia/Intervals.jl#193 merges.

@haberdashPI haberdashPI marked this pull request as draft June 6, 2022 15:39
@@ -0,0 +1,308 @@
# This file is machine-generated - editing it directly is not advised

[[ArgTools]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be checked in

Copy link
Member Author

@haberdashPI haberdashPI Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this for now. In its draft form this PR will not work without the manifest, since invenia/Intervals.jl#193 has not yet merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, this code depends on pointing to a specific git commit that will be garbage collected (assuming that Invenia deletes branches upon merging / closing a PR) at some point. When we do this with internal repositories, we can do things like tag a commit to guarantee that it survives for future reference, but we can't do that for repositories we don't control. That's a very unstable dependency structure and is almost definitely blocking for this package having any type of release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I see now that there are TODOs in the top-level comment, but it might be worth opening a full issue. ❤️ )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#193 from Intervals.jl should merge in the next day or two. My plan is to not merge this until #193 has merged.

@haberdashPI haberdashPI marked this pull request as ready for review June 17, 2022 21:52
@haberdashPI haberdashPI requested a review from kolia June 17, 2022 21:54
@haberdashPI haberdashPI self-assigned this Jun 21, 2022
Comment on lines 169 to 172
right_used = filter(x -> x isa String, right_groups)
right_unused = filter(x -> x isa Unused, right_groups)
left_used = filter(x -> x isa String, left_groups)
left_unused = filter(x -> x isa Unused, left_groups)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name unused is confusing, rename to invalid? maybe rename valid_columns.

also comment remains unclear: maybe spell out that we get the valid columns for right and the valid columns for left (as well as invalid).

Copy link

@kolia kolia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

end

"""
split_into(left, right; spancol=:span)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a join, why not match the DataFrames naming conventions for joins? So use the on keyword instead of spancol, and support on=:leftcol => :rightcol pairs.

function split_into(left, right; spancol=:span)
regions = find_intersections_(view(right, :, spancol), view(left, :, spancol))
left_side, right_side = split(regions, left, right)
joined = hcat(view(right_side, :, Not(spancol)),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When left and right have the same column, this will throw an error suggesting passing in makeunique.

Might be worth taking in and passing hcat's makeunique and copycols kwargs through to the hcat call, and maybe even throwing a split_into specific error suggesting passing makeunique to split_into instead.

Also for consistency this could be called intersectjoin or splitjoin or something like that to match the DataFrames naming convention.

end

function spans_for_split!(df, left_span, right_span)
df[!, :left_span] = left_span
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it happen that df already has a column called left_span or right_span, what error gets thrown then?

end
end
toval(x::TimePeriod) = float(Dates.value(convert(Nanosecond, x)))
toperiod(x::Real) = Nanosecond(round(Int, x, RoundDown))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good for this name toperiod to convey that the x::Real arg is interpreted as a number of Nanoseconds.

`combine(groupby(split_into(left, right), groups), pairs...)`. The one caveat is that
the only column from `right` that `pairs` can reference is `:right_span`.
"""
function split_into_combine(left, right, groups, pairs...; spancol=:span, kwds...)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in huddle, a better interface for this would be for the splitjoin / intersectjoin to return some kind of lazy object on which groupby and combine methods get defined, so that the interface mirrors the DataFrames API more closely. But that's a lot more work.

Minimal modifications here that I would find useful:

  • if the join operation is called intersectjoin, maybe this could be called combine_intersectjoined
  • having an example or two in the docstring would be very useful

@ararslan
Copy link
Member

Note that one of Phillip's comments was resolved but is still applicable. Also, it would be good to fix the formatting issues and set up reviewdog + JuliaFormatter.

@haberdashPI
Copy link
Member Author

Note that one of Phillip's comments was resolved but is still applicable.

Is this the comment about the authorship that I missed above?

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Comment on lines 288 to 289
return map(steps[1:end-1], steps[2:end]) do start, stop
return backto(el, Interval{eltype(steps), Closed, Open}(start, stop))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
return map(steps[1:end-1], steps[2:end]) do start, stop
return backto(el, Interval{eltype(steps), Closed, Open}(start, stop))
return map(steps[1:(end - 1)], steps[2:end]) do start, stop
return backto(el, Interval{eltype(steps),Closed,Open}(start, stop))

Comment on lines 315 to 317
splits = intervals(range_(first(span), last(span); length=n+1), span_)
min_duration = if isnothing(min_duration)
asnanoseconds(0.75*toval(Intervals.span(interval(first(splits)))))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
splits = intervals(range_(first(span), last(span); length=n+1), span_)
min_duration = if isnothing(min_duration)
asnanoseconds(0.75*toval(Intervals.span(interval(first(splits)))))
splits = intervals(range_(first(span), last(span); length=n + 1), span_)
min_duration = if isnothing(min_duration)
asnanoseconds(0.75 * toval(Intervals.span(interval(first(splits)))))

else
min_duration
end
df = DataFrame(;(spancol => splits, label_helper(label) => value_helper(label, n))...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
df = DataFrame(;(spancol => splits, label_helper(label) => value_helper(label, n))...)
df = DataFrame(; (spancol => splits, label_helper(label) => value_helper(label, n))...)

test/runtests.jl Outdated
Comment on lines 59 to 61
df2 = DataFrame(label = rand(('a':'d'), n), sublabel = rand(('k':'n'), n), x = rand(n), span = spans)
df2_split = combine(groupby_interval_join(df2, quarters, on=:span, Cols(Between(:label, :sublabel), :quarter)), :x => mean)
df2_manual = combine(groupby(interval_join(df2, quarters, on=:span), Cols(Between(:label, :sublabel), :quarter)), :x => mean)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
df2 = DataFrame(label = rand(('a':'d'), n), sublabel = rand(('k':'n'), n), x = rand(n), span = spans)
df2_split = combine(groupby_interval_join(df2, quarters, on=:span, Cols(Between(:label, :sublabel), :quarter)), :x => mean)
df2_manual = combine(groupby(interval_join(df2, quarters, on=:span), Cols(Between(:label, :sublabel), :quarter)), :x => mean)
df2 = DataFrame(; label=rand(('a':'d'), n), sublabel=rand(('k':'n'), n), x=rand(n),
span=spans)
df2_split = combine(groupby_interval_join(df2, quarters; on=:span,
Cols(Between(:label, :sublabel), :quarter)),
:x => mean)
df2_manual = combine(groupby(interval_join(df2, quarters; on=:span),
Cols(Between(:label, :sublabel), :quarter)), :x => mean)

test/runtests.jl Outdated
Comment on lines 67 to 72
Aqua.test_all(DataFrameIntervals;
project_extras=true,
stale_deps=true,
deps_compat=true,
project_toml_formatting=true,
ambiguities=false)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Aqua.test_all(DataFrameIntervals;
project_extras=true,
stale_deps=true,
deps_compat=true,
project_toml_formatting=true,
ambiguities=false)
Aqua.test_all(DataFrameIntervals;
project_extras=true,
stale_deps=true,
deps_compat=true,
project_toml_formatting=true,
ambiguities=false)

haberdashPI and others added 2 commits June 27, 2022 15:54
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Comment on lines 31 to 32
const IntervalTuple = Union{NamedTuple{(:start, :stop)}, NamedTuple{(:stop, :start)}}
interval_type(x::Type{<:T}) where T<:IntervalTuple = Union{T.parameters[2].parameters...}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
const IntervalTuple = Union{NamedTuple{(:start, :stop)}, NamedTuple{(:stop, :start)}}
interval_type(x::Type{<:T}) where T<:IntervalTuple = Union{T.parameters[2].parameters...}
const IntervalTuple = Union{NamedTuple{(:start, :stop)},NamedTuple{(:stop, :start)}}
interval_type(x::Type{<:T}) where {T<:IntervalTuple} = Union{T.parameters[2].parameters...}

Comment on lines 34 to 35
function IntervalArray(x::AbstractVector{<:IntervalTuple})
return IntervalArray{typeof(x), Interval{interval_type(eltype(x)), Closed, Open}}(x)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
function IntervalArray(x::AbstractVector{<:IntervalTuple})
return IntervalArray{typeof(x), Interval{interval_type(eltype(x)), Closed, Open}}(x)
function IntervalArray(x::AbstractVector{<:IntervalTuple})
return IntervalArray{typeof(x),Interval{interval_type(eltype(x)),Closed,Open}}(x)

Comment on lines 37 to 39
interval(x::IntervalTuple) = Interval{interval_type(x), Closed, Open}(x.start, x.stop)
backto(::NamedTuple{(:start, :stop)}, x::Interval) = (;start=first(x), stop=last(x))
backto(::NamedTuple{(:stop, :start)}, x::Interval) = (;stop=last(x), start=first(x))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
interval(x::IntervalTuple) = Interval{interval_type(x), Closed, Open}(x.start, x.stop)
backto(::NamedTuple{(:start, :stop)}, x::Interval) = (;start=first(x), stop=last(x))
backto(::NamedTuple{(:stop, :start)}, x::Interval) = (;stop=last(x), start=first(x))
interval(x::IntervalTuple) = Interval{interval_type(x),Closed,Open}(x.start, x.stop)
backto(::NamedTuple{(:start, :stop)}, x::Interval) = (; start=first(x), stop=last(x))
backto(::NamedTuple{(:stop, :start)}, x::Interval) = (; stop=last(x), start=first(x))

test/runtests.jl Outdated
@test isapprox(duration(quarters.span[2]), duration(quarters.span[3]),
atol=Nanosecond(1))
@test isapprox(duration(quarters.span[2]), duration(quarters.span[3]);
atol=Nanosecond(1)) ||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
atol=Nanosecond(1)) ||
atol=Nanosecond(1)) ||

test/runtests.jl Outdated
Comment on lines 45 to 46
nt_spans = [(;start=start(x), stop=stop(x)) for x in spans]
df1_nt = hcat(df1[!, Not(:span)], DataFrame(;span = nt_spans))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
nt_spans = [(;start=start(x), stop=stop(x)) for x in spans]
df1_nt = hcat(df1[!, Not(:span)], DataFrame(;span = nt_spans))
nt_spans = [(; start=start(x), stop=stop(x)) for x in spans]
df1_nt = hcat(df1[!, Not(:span)], DataFrame(; span=nt_spans))

test/runtests.jl Outdated
span=spans)
df2_split = combine(groupby_interval_join(df2, quarters,
Cols(Between(:label, :sublabel), :quarter);
on=:span,),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
on=:span,),
on=:span),

@haberdashPI haberdashPI merged commit 39fcd6a into main Jun 28, 2022
@haberdashPI haberdashPI deleted the dfl/initial-setup branch June 28, 2022 21:04
@haberdashPI haberdashPI mentioned this pull request Jun 30, 2022
haberdashPI added a commit that referenced this pull request Jun 30, 2022
- Updated readme to use the actual names I ended up going with in #1 
- Fix bug when passing `Pair` object with `on`
- Fix bug in method definition for `quantile_windows`
- Test various keyword arguments for `interval_join`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants