Add catalog templates to power cascading compaction #18402
base: master
Conversation
CodeQL alerts marked as fixed in:

- `indexing-service/src/main/java/org/apache/druid/indexing/compact/CompactionJobQueue.java`
- `...ice/src/main/java/org/apache/druid/indexing/overlord/supervisor/BatchIndexingSupervisor.java`
```java
config.getTaskPriority(),
ClientCompactionTaskQueryTuningConfig.from(
    config.getTuningConfig(),
    config.getMaxRowsPerSegment(),
```

Code scanning / CodeQL notice: deprecated method invocation of `DataSourceCompactionConfig.getMaxRowsPerSegment`.
```java
static PartitionsSpec findPartitionsSpecFromConfig(ClientCompactionTaskQueryTuningConfig tuningConfig)
{
  final PartitionsSpec partitionsSpecFromTuningConfig = tuningConfig.getPartitionsSpec();
  if (partitionsSpecFromTuningConfig == null) {
    final long maxTotalRows = Configs.valueOrDefault(tuningConfig.getMaxTotalRows(), Long.MAX_VALUE);
    return new DynamicPartitionsSpec(tuningConfig.getMaxRowsPerSegment(), maxTotalRows);
```

Code scanning / CodeQL notices on this hunk: deprecated method invocations of `ClientCompactionTaskQueryTuningConfig.getMaxTotalRows` (on the line `final Long maxTotalRows = tuningConfig.getMaxTotalRows();`) and `ClientCompactionTaskQueryTuningConfig.getMaxRowsPerSegment` (on the line `final Integer maxRowsPerSegment = tuningConfig.getMaxRowsPerSegment();`).
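For context, the `Configs.valueOrDefault` call in the hunk above resolves a nullable tuning-config value against a default. A minimal self-contained sketch of that pattern (the `ConfigDefaults` helper below is a stand-in for illustration, not Druid's actual `Configs` class):

```java
// Stand-in for Druid's Configs.valueOrDefault: prefer the configured value,
// fall back to the supplied default when the config field was left null.
final class ConfigDefaults {
    static <T> T valueOrDefault(T value, T defaultValue) {
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        Long maxTotalRows = null; // not set in the tuning config
        long resolved = valueOrDefault(maxTotalRows, Long.MAX_VALUE);
        System.out.println(resolved); // prints 9223372036854775807
    }
}
```

This keeps the nullable getter in one place and lets the caller work with a primitive `long` afterwards.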
CodeQL alert marked as fixed in `...-tests/src/test/java/org/apache/druid/testing/embedded/compact/CompactionSupervisorTest.java`.
Had a quick question -
```java
DateTime previousRuleStartTime = DateTimes.MAX;
for (int i = 0; i < rules.size() - 1; ++i) {
  final CompactionRule rule = rules.get(i);
  final DateTime ruleStartTime = rule.computeStartTime(currentTime, rules.get(i + 1));
```
Q: Considering the signature of the method `computeStartTime(DateTime referenceTime, CompactionRule beforeRule)`, passing `rules.get(i + 1)` as `beforeRule` seems a bit counter-intuitive. Does it mean that the list of rules being passed here is meant to be evaluated in reverse?
Yes, the `beforeRule` name is explained in the javadoc of the `computeStartTime` method: it is the rule that comes before the current rule in the segment timeline. But the rules can really be evaluated in any order, since the time chunks of two rules are mutually exclusive. The rule with the latest time period comes first in the list since that is how we typically think of compaction, and also how we define load rules in Druid. It keeps the rule definitions intuitive and straightforward.
💥 New features

- `CompactionStateMatcher`
❓ Open questions:

[A] Where should we store each indexing template in the Druid catalog?

a. As a table inside a new schema `index_template` (currently used in this PR)
b. OR as a table inside the `druid` schema: currently used for datasources only
c. OR as a single row inside `sys.templates`: probably not preferable, since the catalog models everything as tables and their properties, but this would be neither.

Note: In all of the above cases, the template is always physically stored as a single row in `druid_tableDefs` in the metadata store.

[B] How should we specify parameters in the MSQ SQL template?

a. Extend `SqlParameter` for this use case. This currently uses positional params like `SELECT * FROM ? WHERE __time > ?`
b. OR named params (currently used in this PR) like `SELECT * FROM "${dataSource}" WHERE __time > '${startTimestamp}'`.
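To make the named-params option concrete, here is a minimal sketch of `${name}`-style placeholder substitution. The class and method names below are illustrative only, not the ones used in this PR:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of named-parameter substitution for templatized SQL,
// assuming parameters appear as ${name} in the template text.
public class NamedParamSubstitution {
    private static final Pattern PARAM = Pattern.compile("\\$\\{([a-zA-Z0-9_]+)\\}");

    public static String substitute(String template, Map<String, String> params) {
        Matcher m = PARAM.matcher(template);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Missing value for parameter: " + m.group(1));
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String sql = "SELECT * FROM \"${dataSource}\" WHERE __time > '${startTimestamp}'";
        System.out.println(substitute(sql, Map.of(
            "dataSource", "wiki",
            "startTimestamp", "2024-01-01T00:00:00Z"
        )));
    }
}
```

Unlike positional `?` params, each placeholder is resolved by name, so the same parameter can appear multiple times and missing values fail fast.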
Changes
Catalog templates

- New schema `index_template` (Should we just use the `druid` schema itself?)
- `IndexingTemplateDefn` which can currently contain only one property, `payload`
- `IndexingTemplateSchema` for SQL support, to run queries like `SELECT * FROM index_template.<template_id>`
Compaction supervisor classes

- `BatchIndexingJob` which may contain either a `ClientTaskQuery` or a `ClientSqlQuery` (for MSQ jobs)
- `BatchIndexingJobTemplate` that can create jobs for a given source and destination
- `CompactionSupervisor` to create jobs using templates
- `CompactionJobQueue` to create and submit compaction jobs to the Overlord

Refactor for reuse
- `CompactSegments` to `CompactionSlotManager`, `CompactionSnapshotBuilder`
- `CompactionStatus`, `CompactionStatusTracker` and `DataSourceCompactibleSegmentIterator`
MSQ Compaction

- `MSQCompactionJobTemplate` to submit MSQ SQL jobs to the Broker
- Named params `${dataSource}` and `${startTimestamp}`
- Does not use `SqlParameter` since it represents positional params rather than named; we could extend `SqlParameter` for this purpose instead.

✏️ Sample templatized MSQ SQL
Cascading compaction

- range = [now - p1, +inf)
- range = [now - p2, now - p1)
- range = [now - p3, now - p2)
- ...
- range = (-inf, now - p(n-1))

Each rule may use a `compactInline`, `compactMsq` or `compactCatalog` template.
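The interval layout above (newest rule first, mutually exclusive chunks) can be sketched as follows. `CascadingRanges` and its `Range` record are hypothetical names, using `java.time` in place of Druid's Joda-based `DateTimes`:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: with rules listed newest-first, rule i owns the chunk
// [now - p_i, now - p_(i-1)), with Instant.MAX standing in for +inf on the
// first rule. The open-ended (-inf, now - p_(n-1)) tail of the last rule in
// the PR description is left implicit here.
public class CascadingRanges {
    public record Range(Instant start, Instant end) { } // half-open [start, end)

    public static List<Range> ranges(Instant now, List<Duration> periods) {
        Instant previousStart = Instant.MAX; // first rule is open-ended on the right
        List<Range> out = new ArrayList<>();
        for (Duration p : periods) {
            Instant start = now.minus(p);
            out.add(new Range(start, previousStart));
            previousStart = start; // next (older) rule ends where this one starts
        }
        return out;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-10T00:00:00Z");
        ranges(now, List.of(Duration.ofDays(1), Duration.ofDays(7)))
            .forEach(System.out::println);
    }
}
```

Because each range ends exactly where the previous (newer) one starts, no two rules ever cover the same time chunk, which is why evaluation order does not matter.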
📊 Example: Compact to MONTH-DAY-WEEK-DAY

Cascading rules:

- `mmnedplf` (DAY granularity)
- `hdapacml` (WEEK granularity)

✏️ Full supervisor spec
List of template types

| Type | Template class | Description |
|---|---|---|
| `compactInline` | `InlineCompactionJobTemplate` | Contains `segmentGranularity`, `partitionsSpec` etc. which can be used in building a `CompactionTask`. Can be used directly inside a cascading template or stored in the Druid catalog. |
| `compactMsq` | `MSQCompactionJobTemplate` | Uses named params like `${dataSource}`, `${startDate}`. Can be used directly inside a cascading template or stored in the Druid catalog. |
| `compactCatalog` | `CatalogCompactionJobTemplate` | Refers to a `compactInline` or `compactMsq` template stored in the Druid catalog. |
| `compactCascade` | `CascadingCompactionJobTemplate` | Each rule uses a `compactInline`, `compactMsq` or `compactCatalog` template. |

Pending changes
Future work
We can have a common `BatchIndexingSupervisor` which uses templates to create jobs. It could be implemented by `ScheduledBatchSupervisor` and `CompactionSupervisor`. This change was originally included in this patch but has been left out to keep the changes small.
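As a rough illustration of that future shape, the common supervisor could delegate job creation to a template. All names and signatures below are assumptions based on this PR description, not the actual Druid API:

```java
import java.util.List;

// Hypothetical sketch of a common BatchIndexingSupervisor: shared logic turns
// a template into jobs, while subclasses decide how jobs are submitted.
public class BatchIndexingSketch {
    interface BatchIndexingJob { String getId(); }

    interface BatchIndexingJobTemplate {
        List<BatchIndexingJob> createJobs(String source, String destination);
    }

    static abstract class BatchIndexingSupervisor {
        private final BatchIndexingJobTemplate template;

        BatchIndexingSupervisor(BatchIndexingJobTemplate template) {
            this.template = template;
        }

        // Shared behavior: create jobs from the template, then submit each one.
        final void run(String source, String destination) {
            for (BatchIndexingJob job : template.createJobs(source, destination)) {
                submit(job);
            }
        }

        // A compaction or scheduled-batch supervisor would override this.
        abstract void submit(BatchIndexingJob job);
    }

    public static void main(String[] args) {
        BatchIndexingJobTemplate oneJob = (source, dest) -> {
            BatchIndexingJob job = () -> "compact_" + source;
            return List.of(job);
        };
        BatchIndexingSupervisor printer = new BatchIndexingSupervisor(oneJob) {
            @Override
            void submit(BatchIndexingJob job) {
                System.out.println(job.getId());
            }
        };
        printer.run("wiki", "wiki"); // prints compact_wiki
    }
}
```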
Release note
TODO
This PR has: