-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
fix(server): frequent embedding #13475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Warning Rate limit exceeded@darkskygit has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 0 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughAdjusts re-embedding criteria to require both a newer doc update and embedding older than 10 minutes, adds content-equality skip in the embedding job, moves/centralizes embedding sanitization and dimensions into models, adds workspace/file embedding insert/get methods, and expands tests with time-based snapshot scenarios. Changes
Sequence Diagram(s)sequenceDiagram
participant Job as EmbeddingJob
participant Ctx as CopilotContext
participant DB as DB
Job->>Ctx: getWorkspaceContent(workspaceId, docId[, chunk])
Ctx->>DB: SELECT content ORDER BY chunk ASC
DB-->>Ctx: content
Ctx-->>Job: sanitized content
alt normalized(summary) == normalized(stored content)
Job->>Job: log skip and return
else
Job->>Job: build fragments, generate embeddings
Job->>Ctx: insertFileEmbedding(contextId, fileId, embeddings)
Ctx->>DB: UPSERT ai_context_embeddings (content, embedding, updated_at)
DB-->>Ctx: ok
Ctx-->>Job: done
end
sequenceDiagram
participant Caller
participant WS as CopilotWorkspaceModel
participant DB as DB
Caller->>WS: checkDocNeedEmbedded(workspaceId, docId)
WS->>DB: Query docs + last embedding e
alt no existing embedding (e.updated_at IS NULL)
WS-->>Caller: needsEmbedding = true
else
alt docs.updated_at > e.updated_at AND e.updated_at < NOW() - 10m
WS-->>Caller: needsEmbedding = true
else
WS-->>Caller: needsEmbedding = false
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## canary #13475 +/- ##
==========================================
+ Coverage 56.79% 57.34% +0.55%
==========================================
Files 2718 2718
Lines 134945 135032 +87
Branches 20730 20834 +104
==========================================
+ Hits 76637 77430 +793
+ Misses 56640 55937 -703
+ Partials 1668 1665 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
packages/backend/server/src/models/copilot-workspace.ts (1)
241-247
: Use the latest embedding timestamp per doc to avoid false positives and improve performanceJoining directly to ai_workspace_embeddings without aggregation may cause:
- False positives when multiple chunks exist: EXISTS can succeed against an older chunk even if a newer chunk’s updated_at is up-to-date.
- Unnecessary row scans and repeated comparisons.
Aggregate to the latest embedding time per (workspace_id, doc_id) and compare against that. This also reduces row cardinality in the join.
Apply this diff to aggregate embedding timestamps:
- LEFT JOIN ai_workspace_embeddings e - ON e.workspace_id = docs.workspace_id - AND e.doc_id = docs.doc_id + LEFT JOIN ( + SELECT + workspace_id, + doc_id, + MAX(updated_at) AS updated_at + FROM ai_workspace_embeddings + GROUP BY workspace_id, doc_id + ) e + ON e.workspace_id = docs.workspace_id + AND e.doc_id = docs.doc_id WHERE e.updated_at IS NULL - OR (docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes') + OR (docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes')Optional follow-up (further correctness/perf): also collapse docs to the latest doc update via a second CTE (SELECT MAX(updated_at) per doc) before the join, so the comparison is strictly between “latest doc update” and “latest embedding”.
🧹 Nitpick comments (2)
packages/backend/server/src/models/copilot-context.ts (1)
267-279
: Align return type of getWorkspaceContent with behavior (avoid returning empty string on no rows)findMany returns an empty array when no rows match; the current implementation returns '' (empty string), despite the function being typed as Promise<string | undefined>. This can lead to subtle logic bugs downstream.
Return undefined when no rows exist.
async getWorkspaceContent( workspaceId: string, docId: string, chunk?: number ): Promise<string | undefined> { - const file = await this.db.aiWorkspaceEmbedding.findMany({ + const file = await this.db.aiWorkspaceEmbedding.findMany({ where: { workspaceId, docId, chunk }, select: { content: true }, orderBy: { chunk: 'asc' }, }); - return file?.map(f => f.content).join('\n'); + if (!file.length) return undefined; + return file.map(f => f.content).join('\n'); }packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (1)
351-411
: Well-designed scenarios for “only time elapsed” vs “doc actually updated”
- 30m-old snapshot + 25m-old embedding → no re-embed (correct).
- Then a doc update 5m ago with the same 25m-old embedding → re-embed (correct).
Minor note: these tests depend on relative Date.now() calculations. They’re deterministic here, but if future changes add delays, consider capturing a fixed baseNow once (which you already do) and ensuring all timestamps derive from it to avoid flakiness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.snap
is excluded by!**/*.snap
📒 Files selected for processing (5)
packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md
(1 hunks)packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts
(3 hunks)packages/backend/server/src/models/copilot-context.ts
(3 hunks)packages/backend/server/src/models/copilot-workspace.ts
(2 hunks)packages/backend/server/src/plugins/copilot/embedding/job.ts
(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
packages/backend/server/src/plugins/copilot/embedding/job.ts (1)
packages/backend/server/src/plugins/copilot/context/session.ts (1)
workspaceId
(29-31)
packages/backend/server/src/models/copilot-context.ts (1)
packages/backend/server/src/plugins/copilot/context/session.ts (1)
workspaceId
(29-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (49)
- GitHub Check: y-octo binding test on aarch64-apple-darwin
- GitHub Check: y-octo binding test on x86_64-apple-darwin
- GitHub Check: y-octo binding test on x86_64-pc-windows-msvc
- GitHub Check: y-octo binding test on aarch64-pc-windows-msvc
- GitHub Check: E2E BlockSuite Test (7)
- GitHub Check: Run native tests
- GitHub Check: E2E BlockSuite Test (10)
- GitHub Check: E2E BlockSuite Test (5)
- GitHub Check: E2E BlockSuite Test (3)
- GitHub Check: E2E BlockSuite Test (6)
- GitHub Check: E2E BlockSuite Test (9)
- GitHub Check: E2E BlockSuite Test (4)
- GitHub Check: E2E BlockSuite Test (8)
- GitHub Check: E2E Mobile Test (5)
- GitHub Check: E2E BlockSuite Test (2)
- GitHub Check: E2E BlockSuite Test (1)
- GitHub Check: E2E Mobile Test (3)
- GitHub Check: E2E Mobile Test (4)
- GitHub Check: fuzzing
- GitHub Check: Build Server native
- GitHub Check: E2E Mobile Test (1)
- GitHub Check: Build @affine/electron renderer
- GitHub Check: E2E Mobile Test (2)
- GitHub Check: E2E BlockSuite Cross Browser Test (2, webkit)
- GitHub Check: E2E BlockSuite Cross Browser Test (2, firefox)
- GitHub Check: Build AFFiNE native (x86_64-apple-darwin)
- GitHub Check: E2E BlockSuite Cross Browser Test (1, webkit)
- GitHub Check: E2E BlockSuite Cross Browser Test (2, chromium)
- GitHub Check: E2E BlockSuite Cross Browser Test (1, chromium)
- GitHub Check: E2E BlockSuite Cross Browser Test (1, firefox)
- GitHub Check: E2E Test (4)
- GitHub Check: E2E Test (1)
- GitHub Check: E2E Test (7)
- GitHub Check: E2E Test (5)
- GitHub Check: E2E Test (10)
- GitHub Check: E2E Test (3)
- GitHub Check: E2E Test (9)
- GitHub Check: E2E Test (8)
- GitHub Check: E2E Test (2)
- GitHub Check: E2E Test (6)
- GitHub Check: Analyze (typescript, blocksuite)
- GitHub Check: Build AFFiNE native (x86_64-pc-windows-msvc)
- GitHub Check: loom thread test
- GitHub Check: Build AFFiNE native (aarch64-pc-windows-msvc)
- GitHub Check: Analyze (typescript, affine)
- GitHub Check: Analyze (javascript, affine)
- GitHub Check: Analyze (javascript, blocksuite)
- GitHub Check: Lint
- GitHub Check: Typecheck
🔇 Additional comments (6)
packages/backend/server/src/models/copilot-workspace.ts (1)
209-212
: Comment update accurately reflects new gating ruleThe docstring now matches the intended behavior (first-time embed OR re-embed only if newer doc AND last embed older than 10 minutes). Good alignment with the new SQL predicate.
packages/backend/server/src/models/copilot-context.ts (2)
221-241
: File embeddings upsert now refreshes content — good guard and behavior
- Early-return on empty embeddings avoids needless writes and noise.
- Upsert on (context_id, file_id, chunk) updating content/embedding/updated_at is correct and prevents drift.
304-307
: Upsert includes content refresh — correct and consistent with context embeddingsUpdating content alongside embedding and updated_at ensures DB reflects the latest material. Good.
packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md (1)
104-125
: Snapshot cases cover the updated gating matrix wellThe five scenarios map precisely to the new AND-based re-embedding rule and first-embed behavior. Looks good.
packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (2)
296-299
: Good: first-embed case captured via snapshotThis verifies the “no embedding exists” path returns true as intended.
334-340
: Simulating stale embedding timestamp is a solid way to trigger the 10-minute ruleDirectly setting updatedAt to an older timestamp ensures you’re exercising the time window without relying on wall-clock delays.
Summary by CodeRabbit