Initial setup #1
@@ -0,0 +1,308 @@
# This file is machine-generated - editing it directly is not advised

[[ArgTools]]
should not be checked in
Leaving this for now. In its draft form this PR will not work without the manifest, since invenia/Intervals.jl#193 has not yet merged.
Right now, this code depends on pointing to a specific git commit that will be garbage collected (assuming that Invenia deletes branches upon merging / closing a PR) at some point. When we do this with internal repositories, we can do things like tag a commit to guarantee that it survives for future reference, but we can't do that for repositories we don't control. That's a very unstable dependency structure and is almost definitely blocking for this package having any type of release.
(I see now that there are TODOs in the top-level comment, but it might be worth opening a full issue. ❤️ )
#193 from Intervals.jl should merge in the next day or two. My plan is to not merge this until #193 has merged.
Co-authored-by: Phillip Alday <[email protected]>
src/DataFrameIntervals.jl
Outdated
right_used = filter(x -> x isa String, right_groups)
right_unused = filter(x -> x isa Unused, right_groups)
left_used = filter(x -> x isa String, left_groups)
left_unused = filter(x -> x isa Unused, left_groups)
The name `unused` is confusing; rename to `invalid`? Maybe rename to `valid_columns`.
Also, the comment remains unclear: maybe spell out that we get the valid columns for right and the valid columns for left (as well as the invalid ones).
see comments
src/DataFrameIntervals.jl
Outdated
end

"""
split_into(left, right; spancol=:span)
This is a join, so why not match the `DataFrames` naming conventions for joins? That is, use the `on` keyword instead of `spancol`, and support `on=:leftcol => :rightcol` pairs.
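A minimal sketch of how the `on` suggestion could be handled: a small helper normalizes `on` into a left/right pair of column names, so that both `on=:span` and `on=:leftcol => :rightcol` are accepted. The helper name `normalize_on` is hypothetical; it is not part of this PR or of `DataFrames`.

```julia
# Hypothetical helper (not from this PR): turn the `on` keyword into a
# `left => right` pair of column names.
normalize_on(on::Symbol) = on => on
normalize_on(on::AbstractString) = normalize_on(Symbol(on))
normalize_on(on::Pair) = Symbol(first(on)) => Symbol(last(on))

# A single name is used for both sides; a pair names each side separately.
left_col, right_col = normalize_on(:span)
other_left, other_right = normalize_on(:leftcol => :rightcol)
```

Dispatching on the argument type keeps the join function itself free of `isa` checks.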
src/DataFrameIntervals.jl
Outdated
function split_into(left, right; spancol=:span)
    regions = find_intersections_(view(right, :, spancol), view(left, :, spancol))
    left_side, right_side = split(regions, left, right)
    joined = hcat(view(right_side, :, Not(spancol)),
When left and right have the same column, this will throw an error suggesting passing in `makeunique`.
Might be worth taking in and passing `hcat`'s `makeunique` and `copycols` kwargs through to the `hcat` call, and maybe even throwing a `split_into`-specific error suggesting passing `makeunique` to `split_into` instead.
Also, for consistency this could be called `intersectjoin` or `splitjoin` or something like that to match the `DataFrames` naming convention.
src/DataFrameIntervals.jl
Outdated
end

function spans_for_split!(df, left_span, right_span)
    df[!, :left_span] = left_span
Can it happen that `df` already has a column called `left_span` or `right_span`? What error gets thrown then?
src/DataFrameIntervals.jl
Outdated
end
end
toval(x::TimePeriod) = float(Dates.value(convert(Nanosecond, x)))
toperiod(x::Real) = Nanosecond(round(Int, x, RoundDown))
It would be good for the name `toperiod` to convey that the `x::Real` arg is interpreted as a number of `Nanosecond`s.
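A sketch of what such a rename might look like. The name `asnanoseconds` appears in a later revision of this file (in the `min_duration` code quoted further down), but the definition shown here is an assumption carried over from the `toperiod` one-liner above.

```julia
using Dates

# Convert any fixed-length time period to a floating-point count of nanoseconds.
toval(x::TimePeriod) = float(Dates.value(convert(Nanosecond, x)))

# Interpret a real number as a count of nanoseconds, rounding down to a whole
# nanosecond. The name now states the unit explicitly.
asnanoseconds(x::Real) = Nanosecond(round(Int, x, RoundDown))

# Three quarters of a 2 ms period, expressed as a `Nanosecond` period.
three_quarters = asnanoseconds(0.75 * toval(Millisecond(2)))
```

Naming the unit in the function makes the round trip through `toval` easier to read at call sites.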
src/DataFrameIntervals.jl
Outdated
`combine(groupby(split_into(left, right), groups), pairs...)`. The one caveat is that
the only column from `right` that `pairs` can reference is `:right_span`.
"""
function split_into_combine(left, right, groups, pairs...; spancol=:span, kwds...)
As discussed in huddle, a better interface for this would be for the `splitjoin` / `intersectjoin` to return some kind of lazy object on which `groupby` and `combine` methods get defined, so that the interface mirrors the `DataFrames` API more closely. But that's a lot more work.
Minimal modifications here that I would find useful:
- if the join operation is called `intersectjoin`, maybe this could be called `combine_intersectjoined`
- having an example or two in the docstring would be very useful
Note that one of Phillip's comments was resolved but is still applicable. Also, it would be good to fix the formatting issues and set up reviewdog + JuliaFormatter.
Co-authored-by: Phillip Alday <[email protected]>
Is this the comment about the authorship that I missed above?
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
src/DataFrameIntervals.jl|279|
src/DataFrameIntervals.jl|281|
src/DataFrameIntervals.jl|287|
src/DataFrameIntervals.jl|308|
src/DataFrameIntervals.jl|314|
src/DataFrameIntervals.jl|317|
src/DataFrameIntervals.jl|321|
test/runtests.jl|17|
test/runtests.jl|22|
test/runtests.jl|32|
test/runtests.jl|36|
test/runtests.jl|39|
test/runtests.jl|41|
test/runtests.jl|45|
test/runtests.jl|52|
test/runtests.jl|56|
test/runtests.jl|59|
test/runtests.jl|63|
test/runtests.jl|67|
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
src/DataFrameIntervals.jl
Outdated
return map(steps[1:end-1], steps[2:end]) do start, stop
return backto(el, Interval{eltype(steps), Closed, Open}(start, stop))
[JuliaFormatter] reported by reviewdog 🐶
- return map(steps[1:end-1], steps[2:end]) do start, stop
- return backto(el, Interval{eltype(steps), Closed, Open}(start, stop))
+ return map(steps[1:(end - 1)], steps[2:end]) do start, stop
+ return backto(el, Interval{eltype(steps),Closed,Open}(start, stop))
src/DataFrameIntervals.jl
Outdated
splits = intervals(range_(first(span), last(span); length=n+1), span_)
min_duration = if isnothing(min_duration)
asnanoseconds(0.75*toval(Intervals.span(interval(first(splits)))))
[JuliaFormatter] reported by reviewdog 🐶
- splits = intervals(range_(first(span), last(span); length=n+1), span_)
- min_duration = if isnothing(min_duration)
- asnanoseconds(0.75*toval(Intervals.span(interval(first(splits)))))
+ splits = intervals(range_(first(span), last(span); length=n + 1), span_)
+ min_duration = if isnothing(min_duration)
+ asnanoseconds(0.75 * toval(Intervals.span(interval(first(splits)))))
src/DataFrameIntervals.jl
Outdated
else
min_duration
end
df = DataFrame(;(spancol => splits, label_helper(label) => value_helper(label, n))...)
[JuliaFormatter] reported by reviewdog 🐶
- df = DataFrame(;(spancol => splits, label_helper(label) => value_helper(label, n))...)
+ df = DataFrame(; (spancol => splits, label_helper(label) => value_helper(label, n))...)
test/runtests.jl
Outdated
df2 = DataFrame(label = rand(('a':'d'), n), sublabel = rand(('k':'n'), n), x = rand(n), span = spans)
df2_split = combine(groupby_interval_join(df2, quarters, on=:span, Cols(Between(:label, :sublabel), :quarter)), :x => mean)
df2_manual = combine(groupby(interval_join(df2, quarters, on=:span), Cols(Between(:label, :sublabel), :quarter)), :x => mean)
[JuliaFormatter] reported by reviewdog 🐶
- df2 = DataFrame(label = rand(('a':'d'), n), sublabel = rand(('k':'n'), n), x = rand(n), span = spans)
- df2_split = combine(groupby_interval_join(df2, quarters, on=:span, Cols(Between(:label, :sublabel), :quarter)), :x => mean)
- df2_manual = combine(groupby(interval_join(df2, quarters, on=:span), Cols(Between(:label, :sublabel), :quarter)), :x => mean)
+ df2 = DataFrame(; label=rand(('a':'d'), n), sublabel=rand(('k':'n'), n), x=rand(n),
+ span=spans)
+ df2_split = combine(groupby_interval_join(df2, quarters; on=:span,
+ Cols(Between(:label, :sublabel), :quarter)),
+ :x => mean)
+ df2_manual = combine(groupby(interval_join(df2, quarters; on=:span),
+ Cols(Between(:label, :sublabel), :quarter)), :x => mean)
test/runtests.jl
Outdated
Aqua.test_all(DataFrameIntervals;
project_extras=true,
stale_deps=true,
deps_compat=true,
project_toml_formatting=true,
ambiguities=false)
[JuliaFormatter] reported by reviewdog 🐶
- Aqua.test_all(DataFrameIntervals;
- project_extras=true,
- stale_deps=true,
- deps_compat=true,
- project_toml_formatting=true,
- ambiguities=false)
+ Aqua.test_all(DataFrameIntervals;
+ project_extras=true,
+ stale_deps=true,
+ deps_compat=true,
+ project_toml_formatting=true,
+ ambiguities=false)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
src/DataFrameIntervals.jl
Outdated
const IntervalTuple = Union{NamedTuple{(:start, :stop)}, NamedTuple{(:stop, :start)}}
interval_type(x::Type{<:T}) where T<:IntervalTuple = Union{T.parameters[2].parameters...}
[JuliaFormatter] reported by reviewdog 🐶
- const IntervalTuple = Union{NamedTuple{(:start, :stop)}, NamedTuple{(:stop, :start)}}
- interval_type(x::Type{<:T}) where T<:IntervalTuple = Union{T.parameters[2].parameters...}
+ const IntervalTuple = Union{NamedTuple{(:start, :stop)},NamedTuple{(:stop, :start)}}
+ interval_type(x::Type{<:T}) where {T<:IntervalTuple} = Union{T.parameters[2].parameters...}
src/DataFrameIntervals.jl
Outdated
function IntervalArray(x::AbstractVector{<:IntervalTuple})
return IntervalArray{typeof(x), Interval{interval_type(eltype(x)), Closed, Open}}(x)
[JuliaFormatter] reported by reviewdog 🐶
- function IntervalArray(x::AbstractVector{<:IntervalTuple})
- return IntervalArray{typeof(x), Interval{interval_type(eltype(x)), Closed, Open}}(x)
+ function IntervalArray(x::AbstractVector{<:IntervalTuple})
+ return IntervalArray{typeof(x),Interval{interval_type(eltype(x)),Closed,Open}}(x)
src/DataFrameIntervals.jl
Outdated
interval(x::IntervalTuple) = Interval{interval_type(x), Closed, Open}(x.start, x.stop)
backto(::NamedTuple{(:start, :stop)}, x::Interval) = (;start=first(x), stop=last(x))
backto(::NamedTuple{(:stop, :start)}, x::Interval) = (;stop=last(x), start=first(x))
[JuliaFormatter] reported by reviewdog 🐶
- interval(x::IntervalTuple) = Interval{interval_type(x), Closed, Open}(x.start, x.stop)
- backto(::NamedTuple{(:start, :stop)}, x::Interval) = (;start=first(x), stop=last(x))
- backto(::NamedTuple{(:stop, :start)}, x::Interval) = (;stop=last(x), start=first(x))
+ interval(x::IntervalTuple) = Interval{interval_type(x),Closed,Open}(x.start, x.stop)
+ backto(::NamedTuple{(:start, :stop)}, x::Interval) = (; start=first(x), stop=last(x))
+ backto(::NamedTuple{(:stop, :start)}, x::Interval) = (; stop=last(x), start=first(x))
test/runtests.jl
Outdated
@test isapprox(duration(quarters.span[2]), duration(quarters.span[3]),
atol=Nanosecond(1))
@test isapprox(duration(quarters.span[2]), duration(quarters.span[3]);
atol=Nanosecond(1)) ||
[JuliaFormatter] reported by reviewdog 🐶
- atol=Nanosecond(1)) ||
+ atol=Nanosecond(1)) ||
test/runtests.jl
Outdated
nt_spans = [(;start=start(x), stop=stop(x)) for x in spans]
df1_nt = hcat(df1[!, Not(:span)], DataFrame(;span = nt_spans))
[JuliaFormatter] reported by reviewdog 🐶
- nt_spans = [(;start=start(x), stop=stop(x)) for x in spans]
- df1_nt = hcat(df1[!, Not(:span)], DataFrame(;span = nt_spans))
+ nt_spans = [(; start=start(x), stop=stop(x)) for x in spans]
+ df1_nt = hcat(df1[!, Not(:span)], DataFrame(; span=nt_spans))
test/runtests.jl
Outdated
span=spans)
df2_split = combine(groupby_interval_join(df2, quarters,
Cols(Between(:label, :sublabel), :quarter);
on=:span,),
[JuliaFormatter] reported by reviewdog 🐶
- on=:span,),
+ on=:span),
- Updated readme to use the actual names I ended up going with in #1
- Fix bug when passing `Pair` object with `on`
- Fix bug in method definition for `quantile_windows`
- Test various keyword arguments for `interval_join`
This defines two functions that are handy for computing joins over time spans: `interval_join` and `groupby_interval_join`. Rows match in this join if their time spans overlap.
There is also a simple utility function (`quantile_windows`) for joining over regularly spaced intervals.
Remaining actions:
- Close invenia/Intervals.jl#193 (Minimal IntervalSet type), so we can use my newly created `find_intersections` function from that repo.
- Remove Manifest.toml entries from this repo and verify that CI now works.
Will be releasing this as a 0.0.1 release, and add a 0.1.0 release once invenia/Intervals.jl#193 merges.
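The overlap rule in the description can be illustrated with a naive, Base-only sketch. It is not the implementation in this PR, which relies on `find_intersections` from Intervals.jl. The names `overlaps`, `intersection`, and `naive_interval_join` are hypothetical, spans are modeled as half-open `[start, stop)` named tuples, and reporting the intersection as the joined span is an assumption.

```julia
# Half-open spans [start, stop) overlap when each one starts before the other stops.
overlaps(a, b) = a.start < b.stop && b.start < a.stop

# The part shared by two overlapping spans.
intersection(a, b) = (; start=max(a.start, b.start), stop=min(a.stop, b.stop))

# Naive O(n*m) join: one output row per overlapping pair of spans, recording
# the row index on each side and the shared span.
function naive_interval_join(left, right)
    rows = []
    for (i, l) in enumerate(left), (j, r) in enumerate(right)
        overlaps(l, r) && push!(rows, (; left=i, right=j, span=intersection(l, r)))
    end
    return rows
end

left = [(; start=0, stop=5), (; start=5, stop=10)]
right = [(; start=3, stop=7), (; start=20, stop=30)]
rows = naive_interval_join(left, right)
```

Here both `left` spans overlap the first `right` span, so the join yields two rows, while the second `right` span matches nothing.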