cache: preserve resumable guarantee with empty history watchers #20491


Merged
1 commit merged into etcd-io:main on Aug 15, 2025

Conversation

apullo777
Contributor

@apullo777 apullo777 commented Aug 14, 2025

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

I opened this PR to address issue #20488.

The root issue is that a cache can be ready but still have no history, so we need to ensure proper watcher and gap handling when history is empty. That includes:

  • When history is empty, a watcher with a non-zero startRev should be registered as lagging.
  • resyncLaggingWatchers will not promote a watcher to active while history is empty.
  • Broadcast should defensively detect a gap (firstRev > nextRev) and move such watchers back to lagging instead of streaming past the gap.
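The defensive gap check in the last bullet can be sketched as follows. This is a simplified illustration, not the actual cache code: `watcher`, `nextRev`, and `demoteOnGap` are hypothetical stand-ins for the real types.

```go
package main

import "fmt"

// watcher is a hypothetical, simplified stand-in for the cache's watcher type.
type watcher struct {
	nextRev int64 // next revision this watcher expects to receive
	lagging bool
}

// demoteOnGap sketches the defensive check: if the oldest revision still held
// in history (firstRev) is newer than the revision the watcher expects next,
// streaming would silently skip events across the gap, so the watcher is
// moved back to the lagging set instead.
func demoteOnGap(w *watcher, firstRev int64) bool {
	if firstRev > w.nextRev {
		w.lagging = true
		return true
	}
	return false
}

func main() {
	fmt.Println(demoteOnGap(&watcher{nextRev: 5}, 8))  // gap: history starts at 8, watcher expects 5
	fmt.Println(demoteOnGap(&watcher{nextRev: 10}, 8)) // no gap: watcher is ahead of firstRev
}
```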

/cc @serathius @MadhavJivrajani

@k8s-ci-robot

Hi @apullo777. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

/ok-to-test

cache/cache.go Outdated
if oldestHistoryRev != 0 {
tooOld = startRev < oldestHistoryRev
} else {
storeFloor := c.store.LatestRev() + 1
Member

I don't think watch should care about storeFloor.

Contributor Author

@apullo777 apullo777 Aug 14, 2025


Yes. storeFloor is unrelated here.
If we want to decide compaction when history is empty, maybe I should move the compaction check into demux.Register and return (compacted, compactRev) so the decision and watcher registration can happen under the same lock:

func (d *demux) Register(w *watcher, startingRev int64) (compacted bool, compactRev int64) {
	d.mu.Lock()
	defer d.mu.Unlock()

	oldestRev := d.history.PeekOldest()
	if startingRev != 0 && oldestRev != 0 && startingRev < oldestRev {
		return true, oldestRev - 1
	}

	latestRev := d.history.PeekLatest()
	if latestRev == 0 {
		// watcher registration
		...

and in Cache.Watch we use that result to return a canceled response:

	op := clientv3.OpWatch(key, opts...)
	startRev := op.Rev()

	pred, err := c.validateWatch(key, op)
	...

	w := newWatcher(c.cfg.PerWatcherBufferSize, pred)
	compacted, compactRev := c.demux.Register(w, startRev)
	if compacted {
		ch := make(chan clientv3.WatchResponse, 1)
		ch <- clientv3.WatchResponse{
			Canceled:        true,
			CompactRevision: compactRev,
		}
		close(ch)
		return ch
	}

Member

I think we should remove all this logic and just register the watch, depending on resync to kick it out. Demux state read under the lock can become very stale before the watch is started. This is a premature optimization that adds a number of problematic edge cases.
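The alternative suggested here could look roughly like this: Register always parks the watcher as lagging, and a later resync pass is the single place that decides whether it fell behind compaction. This is a minimal sketch with hypothetical, simplified types (`watcher`, `demux`, `resync`), not the real etcd cache implementation.

```go
package main

import "fmt"

type watcher struct {
	startRev int64
}

type demux struct {
	oldestRev int64 // oldest revision retained in history; 0 means history is empty
	lagging   []*watcher
}

// Register unconditionally parks the watcher as lagging; no compaction check
// is done here, so registration cannot act on stale demux state.
func (d *demux) Register(w *watcher) {
	d.lagging = append(d.lagging, w)
}

// resync keeps watchers whose startRev is still covered by history and
// reports the ones that fell behind the compaction point; those would
// receive a canceled response.
func (d *demux) resync() (kicked []*watcher) {
	kept := d.lagging[:0]
	for _, w := range d.lagging {
		if d.oldestRev != 0 && w.startRev != 0 && w.startRev < d.oldestRev {
			kicked = append(kicked, w)
			continue
		}
		kept = append(kept, w)
	}
	d.lagging = kept
	return kicked
}

func main() {
	d := &demux{oldestRev: 10}
	d.Register(&watcher{startRev: 5})  // behind compaction point
	d.Register(&watcher{startRev: 12}) // still covered by history
	fmt.Println(len(d.resync()), len(d.lagging))
}
```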

Contributor Author

Removing the early check in Cache.Watch makes sense. For now resyncLaggingWatchers just silently stops the watcher on compaction, but I think it should also return a cancel response. However, that will require some work to give the watcher a way to carry a cancel/termination response so that the goroutine in Cache.Watch can forward it before exiting. Maybe I should just revert this part and do the removal in a follow-up PR?

Member

I think it should also return cancel response

+1

Maybe I should just revert this part and do the removal in a follow-up PR?

Sounds good


codecov bot commented Aug 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.99%. Comparing base (05b8563) to head (662e2fc).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

see 32 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20491      +/-   ##
==========================================
- Coverage   69.24%   68.99%   -0.25%     
==========================================
  Files         418      415       -3     
  Lines       34699    34564     -135     
==========================================
- Hits        24028    23849     -179     
- Misses       9275     9299      +24     
- Partials     1396     1416      +20     


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05b8563...662e2fc.


@serathius
Member

/retest

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: apullo777, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius serathius merged commit 1cae3b3 into etcd-io:main Aug 15, 2025
30 of 33 checks passed
3 participants