Beam backend: use TypedPipe descriptions as names for PTransforms #1983
base: develop
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
@@ -1,21 +1,21 @@ | ||||
package com.twitter.scalding.beam_backend | ||||
|
||||
import com.twitter.scalding.dagon.{FunctionK, Memoize, Rule} | ||||
import com.twitter.chill.KryoInstantiator | ||||
import com.twitter.chill.config.ScalaMapConfig | ||||
import com.twitter.scalding.Config | ||||
import com.twitter.scalding.beam_backend.BeamOp.{CoGroupedOp, MergedBeamOp} | ||||
import com.twitter.scalding.dagon.{FunctionK, Memoize, Rule} | ||||
import com.twitter.scalding.serialization.KryoHadoop | ||||
import com.twitter.scalding.typed.OptimizationRules._ | ||||
import com.twitter.scalding.typed._ | ||||
import com.twitter.scalding.typed.cascading_backend.CascadingExtensions.ConfigCascadingExtensions | ||||
import com.twitter.scalding.typed.functions.{ | ||||
FilterKeysToFilter, | ||||
FlatMapValuesToFlatMap, | ||||
MapValuesToMap, | ||||
ScaldingPriorityQueueMonoid | ||||
} | ||||
|
||||
import com.twitter.scalding.typed.cascading_backend.CascadingExtensions.ConfigCascadingExtensions | ||||
|
||||
object BeamPlanner { | ||||
def plan( | ||||
config: Config, | ||||
|
@@ -61,8 +61,15 @@ object BeamPlanner { | |||
BeamOp.Source(config, src, srcs(src)) | ||||
case (IterablePipe(iterable), _) => | ||||
BeamOp.FromIterable(iterable, kryoCoder) | ||||
case (wd: WithDescriptionTypedPipe[a], rec) => | ||||
rec[a](wd.input) | ||||
case (wd: WithDescriptionTypedPipe[_], rec) => { | ||||
val op = rec(wd.input) | ||||
wd.descriptions match { | ||||
Comment: Actually, as commented, I think this probably runs the risk of dropping some descriptions. I would do: `op.withName(wd.descriptions.map(_._1).mkString(", "))`. Otherwise I think you will wind up with cases where you lose line numbers.
Reply: Thanks, I'll try this out.
||||
case head :: _ => | ||||
op.withName(head._1) | ||||
case Nil => | ||||
op | ||||
} | ||||
} | ||||
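The review suggestion for the `WithDescriptionTypedPipe` case (join every description into the name instead of keeping only the head) can be sketched in isolation as follows. `NamedOp` is a hypothetical stand-in for the planner's op type, and the `(String, Boolean)` element type of `descriptions` is an assumption inferred from the `_._1` accessor in the diff; neither is taken from the Scalding sources.

```scala
object DescriptionNames {
  // Hypothetical stand-in for the planner's op type; only withName matters here.
  final case class NamedOp(name: Option[String]) {
    def withName(n: String): NamedOp = copy(name = Some(n))
  }

  // Join every description into a single name so none of them is dropped.
  // With no descriptions the op is returned unchanged, like the Nil case in the diff.
  def nameFrom(op: NamedOp, descriptions: List[(String, Boolean)]): NamedOp =
    if (descriptions.isEmpty) op
    else op.withName(descriptions.map(_._1).mkString(", "))
}
```

Unlike the `head :: _` match in the diff, this keeps every description, at the cost of longer names.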
case (SumByLocalKeys(pipe, sg), rec) => | ||||
val op = rec(pipe) | ||||
config.getMapSideAggregationThreshold match { | ||||
|
@@ -97,7 +104,10 @@ object BeamPlanner { | |||
uir.evidence.subst[BeamOpT](sortedOp) | ||||
} | ||||
go(ivsr) | ||||
case (ReduceStepPipe(ValueSortedReduce(keyOrdering, pipe, valueSort, reduceFn, _, _)), rec) => | ||||
case ( | ||||
ReduceStepPipe(ValueSortedReduce(keyOrdering, pipe, valueSort, reduceFn, _, _)), | ||||
rec | ||||
) => | ||||
val op = rec(pipe) | ||||
op.sortedMapGroup(reduceFn)(keyOrdering, valueSort, kryoCoder) | ||||
case (ReduceStepPipe(IteratorMappedReduce(keyOrdering, pipe, reduceFn, _, _)), rec) => | ||||
|
@@ -116,7 +126,7 @@ object BeamPlanner { | |||
val ops: Seq[BeamOp[(K, Any)]] = cg.inputs.map(tp => rec(tp)) | ||||
CoGroupedOp(cg, ops) | ||||
} | ||||
go(cg) | ||||
if (cg.descriptions.isEmpty) go(cg) else go(cg).withName(cg.descriptions.last) | ||||
Comment: why using
Reply: I meant to ask you about this myself. If you look at the test case in this PR, the cogrouped expression has two descriptions, "Count words" and "Join with t1", both of which appear in the descriptions of
Comment: Why not combine all of them? Why not just
Reply: Yes, let me try this out on a job first. Thanks!
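One way to act on the "combine all of them" suggestion for the cogrouped case is sketched below. It assumes `cg.descriptions` is a `Seq[String]`, as the `.withName(cg.descriptions.last)` call in the diff implies. The `distinct` step is an addition of this sketch, not something proposed in the thread; it is there only because the thread mentions the same descriptions showing up in more than one place. `combinedName` is a hypothetical helper name.

```scala
object CoGroupNames {
  // Combine all descriptions into one name, keeping first-occurrence order
  // and dropping exact repeats. Returns None when there is nothing to name.
  def combinedName(descriptions: Seq[String]): Option[String] =
    if (descriptions.isEmpty) None
    else Some(descriptions.distinct.mkString(", "))
}
```

In the planner this would replace the `isEmpty` check on `cg.descriptions`, for example by folding the `Option` over the result of `go(cg)`.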
||||
case (Fork(input), rec) => | ||||
rec(input) | ||||
case (m @ MergedTypedPipe(_, _), rec) => | ||||
|
@@ -137,7 +147,21 @@ object BeamPlanner { | |||
|
||||
def defaultOptimizationRules(config: Config): Seq[Rule[TypedPipe]] = { | ||||
def std(forceHash: Rule[TypedPipe]) = | ||||
OptimizationRules.standardMapReduceRules ::: | ||||
List( | ||||
Comment: Why the change here? Is this copying the same cascading optimizations?
Reply: It really is the
Comment: Sorry, I missed that. Note: the line numbers are already captured by the time that rule runs. The line numbers are collected on the
If I were you, I would look at all the descriptions and add the full list. The problem with removing that rule is that it will block merging nodes together. It may be fine, maybe Beam will follow up with optimizations, but I would be careful: scalding may do some optimizations that beam doesn't.
Reply: The line numbers are in the descriptions though, if users don't explicitly add them with a call to
Comment: I don't really know the answer. So, line numbers are added to descriptions when the user constructs the TypedPipe, see: scalding/scalding-base/src/main/scala/com/twitter/scalding/typed/TypedPipe.scala, line 533 in 6434348.
The idea of
As to what to do, I don't know. If beam's optimizer is very good, it maybe doesn't matter. Maybe you should try two separate runs with a somewhat complex job and compare? Also, I can imagine a
I would probably bias to just combining all the descriptions into a single beam description unless you actually see problems. You can always come back and add that setting. I would personally bias to maximizing the utility of the optimizer.
Reply: Thanks. Let me try out some jobs with the rule enabled and showing all the descriptions.
||||
// phase 0, add explicit forks to not duplicate pipes on fanout below | ||||
AddExplicitForks, | ||||
RemoveUselessFork, | ||||
// phase 1, compose flatMap/map, move descriptions down, defer merge, filter pushup etc... | ||||
IgnoreNoOpGroup.orElse(composeSame).orElse(FilterKeysEarly).orElse(DeferMerge), | ||||
// phase 2, combine different kinds of mapping operations into flatMaps, including redundant merges | ||||
composeIntoFlatMap | ||||
.orElse(simplifyEmpty) | ||||
.orElse(DiamondToFlatMap) | ||||
.orElse(ComposeDescriptions) | ||||
.orElse(MapValuesInReducers), | ||||
// phase 3, remove duplicates forces/forks (e.g. .fork.fork or .forceToDisk.fork, ....) | ||||
RemoveDuplicateForceFork | ||||
) ::: | ||||
List( | ||||
OptimizationRules.FilterLocally, // after filtering, we may have filtered to nothing, lets see | ||||
OptimizationRules.simplifyEmpty, | ||||
|