fix(server): frequent embedding #13475

darkskygit · 2025-08-11T23:55:29Z

Summary by CodeRabbit

New Features
- Smarter embedding pipeline skips re-embedding when content hasn’t changed; added content sanitization for embeddings and workspace content retrieval.
Bug Fixes
- Re-embedding now requires both a document update and the last embedding being older than 10 minutes, reducing unnecessary work.
Refactor
- Consolidated embedding preprocessing and moved sanitization utilities into shared models; upserts now refresh stored content.
Tests
- Expanded snapshot-based tests covering multiple time/age scenarios for embedding decision logic.

coderabbitai · 2025-08-11T23:55:37Z

Warning

Rate limit exceeded

@darkskygit has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 0 seconds before requesting another review.

📥 Commits

Reviewing files that changed from the base of the PR and between f98eee8 and 2546951.

📒 Files selected for processing (1)

packages/backend/server/src/plugins/copilot/embedding/job.ts (2 hunks)

Walkthrough

Adjusts the re-embedding criteria to require both a newer doc update and an embedding older than 10 minutes. Adds a content-equality skip in the embedding job, centralizes embedding sanitization and dimensions in models, adds workspace/file embedding insert/get methods, and expands tests with time-based snapshot scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Tests: embedding snapshots & scenarios `packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md`, `packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts` | Replaces direct boolean assertions with snapshot assertions; adds multi-scenario time-based cases (no embedding, recent embedding, updated-after-old-embedding, only-window-passed, updated-after-window) and simulates embedding timestamps. |
| Workspace re-embed logic `packages/backend/server/src/models/copilot-workspace.ts` | Changes re-embedding trigger from OR to AND: requires docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes'; preserves first-time-embedding when none exists. |
| Embedding sanitization & constants `packages/backend/server/src/models/common/copilot.ts` | Adds EMBEDDING_DIMENSIONS (1024), clearEmbeddingContent, and clearEmbeddingChunk; implements FILTER_PREFIX trimming of leading metadata lines for embedding content/chunks. |
| Context model methods & upserts `packages/backend/server/src/models/copilot-context.ts` | Imports EMBEDDING_DIMENSIONS and clearEmbeddingContent; removes local EMBEDDING_DIMENSIONS export; sanitizes getFileContent/getWorkspaceContent; adds insertFileEmbedding(contextId,fileId,embeddings) with empty-array guard; upserts now update content, embedding, and updated_at. |
| Embedding job: content-stability guard `packages/backend/server/src/plugins/copilot/embedding/job.ts` | Adds normalize helper and uses getWorkspaceContent + normalized comparison to skip embedding generation when fragment.summary equals stored content (early return on match). |
| Plugin imports updated `packages/backend/server/src/plugins/copilot/mcp/provider.ts`, `packages/backend/server/src/plugins/copilot/tools/doc-semantic-search.ts` | Redirects clearEmbeddingChunk import from utils to models (no runtime logic change). |
| Removed util `packages/backend/server/src/plugins/copilot/utils.ts` | Deletes local clearEmbeddingChunk and FILTER_PREFIX from utils (logic moved to models). |
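
For readers unfamiliar with the sanitization step, here is a minimal sketch of what "trimming leading metadata lines" could look like. The actual FILTER_PREFIX values and the body of clearEmbeddingContent are not shown in this thread, so the prefix list and the function name below are assumptions for illustration only.

```typescript
// Hypothetical prefix list: the real FILTER_PREFIX values are not shown in this PR thread.
const HYPOTHETICAL_FILTER_PREFIX: string[] = ['Title: ', 'Created at: ', 'Updated at: '];

// Hypothetical stand-in for clearEmbeddingContent: drops metadata lines
// from the top of the content and stops at the first non-metadata line.
function stripLeadingMetadata(
  content: string,
  prefixes: string[] = HYPOTHETICAL_FILTER_PREFIX
): string {
  const lines = content.split('\n');
  let start = 0;
  while (
    start < lines.length &&
    prefixes.some(prefix => lines[start].startsWith(prefix))
  ) {
    start++;
  }
  // Lines after the first body line are kept even if they match a prefix.
  return lines.slice(start).join('\n');
}
```

Only the leading run of metadata lines is removed in this sketch, so a body line that happens to start with one of the prefixes survives.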

Sequence Diagram(s)

sequenceDiagram
  participant Job as EmbeddingJob
  participant Ctx as CopilotContext
  participant DB as DB

  Job->>Ctx: getWorkspaceContent(workspaceId, docId[, chunk])
  Ctx->>DB: SELECT content ORDER BY chunk ASC
  DB-->>Ctx: content
  Ctx-->>Job: sanitized content

  alt normalized(summary) == normalized(stored content)
    Job->>Job: log skip and return
  else
    Job->>Job: build fragments, generate embeddings
    Job->>Ctx: insertFileEmbedding(contextId, fileId, embeddings)
    Ctx->>DB: UPSERT ai_context_embeddings (content, embedding, updated_at)
    DB-->>Ctx: ok
    Ctx-->>Job: done
  end
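
The first diagram's skip branch can be sketched roughly as follows. The PR adds a normalize helper, but its exact rules are not shown here, so the whitespace-collapsing behaviour and the shouldSkipEmbedding name are assumptions.

```typescript
// Assumed normalization: collapse whitespace runs and trim the ends.
// The real normalize helper in job.ts may apply different rules.
function normalize(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical guard: skip embedding when the new summary matches the stored content.
function shouldSkipEmbedding(summary: string, stored: string | undefined): boolean {
  // Nothing stored yet means there is nothing to compare against, so do not skip.
  if (stored === undefined) return false;
  return normalize(summary) === normalize(stored);
}
```

Comparing normalized strings rather than raw ones keeps incidental whitespace differences from triggering a new embedding run.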

sequenceDiagram
  participant Caller
  participant WS as CopilotWorkspaceModel
  participant DB as DB

  Caller->>WS: checkDocNeedEmbedded(workspaceId, docId)
  WS->>DB: Query docs + last embedding e
  alt no existing embedding (e.updated_at IS NULL)
    WS-->>Caller: needsEmbedding = true
  else
    alt docs.updated_at > e.updated_at AND e.updated_at < NOW() - 10m
      WS-->>Caller: needsEmbedding = true
    else
      WS-->>Caller: needsEmbedding = false
    end
  end
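
The decision in the second diagram can be mirrored as a pure function, which may help when reasoning about the snapshot scenarios. This is a sketch of the predicate only; the real check runs in SQL, and the needsEmbedding name and its parameters are assumptions.

```typescript
// Ten-minute window, matching INTERVAL '10 minutes' in the SQL predicate.
const REEMBED_WINDOW_MS = 10 * 60 * 1000;

// Hypothetical mirror of the SQL predicate; not the project's code.
function needsEmbedding(
  docUpdatedAt: Date,
  embeddingUpdatedAt: Date | null,
  now: Date
): boolean {
  // First-time embedding: no embedding row exists yet.
  if (embeddingUpdatedAt === null) return true;
  const docIsNewer = docUpdatedAt.getTime() > embeddingUpdatedAt.getTime();
  const embeddingIsStale =
    embeddingUpdatedAt.getTime() < now.getTime() - REEMBED_WINDOW_MS;
  // Both conditions must hold (previously either one was enough).
  return docIsNewer && embeddingIsStale;
}
```

With the AND rule, a doc edited one minute after a five-minute-old embedding does not re-embed, and neither does an untouched doc whose embedding is merely old.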

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

feat(server): skip embedding for deprecated doc ids & empty docs #13211: Modifies embedding flow and workspace checks (overlaps with checkDocNeedEmbedded and embedding-job changes).
fix(server): fulfill empty embedding for trashed docs #13461: Refactors embedding plumbing and CopilotContext methods (overlaps with insert/get workspace/file embedding additions).
feat(server): clear semantic search metadata #13197: Moves/changes clearEmbeddingChunk and sanitization logic (overlaps with relocation of sanitization helpers to models).

Suggested reviewers

fengmk2
pengx17

Poem

I nibble lines and clear the fluff,
If nothing's changed, then skip the stuff.
Ten minutes tick, the timestamps say,
Embed when new — else let it stay.
Snapshots hum; the rabbit hops away. 🥕🐇

codecov · 2025-08-12T00:00:45Z

Codecov Report

❌ Patch coverage is 62.50000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.34%. Comparing base (aa20e7b) to head (2546951).
⚠️ Report is 2 commits behind head on canary.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ckages/backend/server/src/models/common/copilot.ts | 42.42% | 19 Missing ⚠️ |
| ...ackend/server/src/plugins/copilot/embedding/job.ts | 47.36% | 10 Missing ⚠️ |
| ...kages/backend/server/src/models/copilot-context.ts | 94.44% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           canary   #13475      +/-   ##
==========================================
+ Coverage   56.79%   57.34%   +0.55%     
==========================================
  Files        2718     2718              
  Lines      134945   135032      +87     
  Branches    20730    20834     +104     
==========================================
+ Hits        76637    77430     +793     
+ Misses      56640    55937     -703     
+ Partials     1668     1665       -3

| Flag | Coverage Δ | |
| --- | --- | --- |
| server-test | `79.13% <62.50%> (+1.19%)` | ⬆️ |
| unittest | `32.11% <ø> (+<0.01%)` | ⬆️ |

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

packages/backend/server/src/models/copilot-workspace.ts (1)
241-247: Use the latest embedding timestamp per doc to avoid false positives and improve performance

Joining directly to ai_workspace_embeddings without aggregation may cause:

- False positives when multiple chunks exist: EXISTS can succeed against an older chunk even if a newer chunk’s updated_at is up-to-date.
- Unnecessary row scans and repeated comparisons.

Aggregate to the latest embedding time per (workspace_id, doc_id) and compare against that. This also reduces row cardinality in the join.

Apply this diff to aggregate embedding timestamps:
-            LEFT JOIN ai_workspace_embeddings e
-              ON e.workspace_id = docs.workspace_id
-              AND e.doc_id = docs.doc_id
+            LEFT JOIN (
+              SELECT
+                workspace_id,
+                doc_id,
+                MAX(updated_at) AS updated_at
+              FROM ai_workspace_embeddings
+              GROUP BY workspace_id, doc_id
+            ) e
+              ON e.workspace_id = docs.workspace_id
+              AND e.doc_id = docs.doc_id
           WHERE
             e.updated_at IS NULL
-            OR (docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes')
+            OR (docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes')
Optional follow-up (further correctness/perf): also collapse docs to the latest doc update via a second CTE (SELECT MAX(updated_at) per doc) before the join, so the comparison is strictly between “latest doc update” and “latest embedding”.
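
To illustrate the false positive outside SQL, the sketch below contrasts a per-chunk comparison with a comparison against the latest embedding time per (workspace_id, doc_id). The row shape and function names are assumptions, not the project's types.

```typescript
// Hypothetical in-memory shape of an ai_workspace_embeddings row.
interface EmbeddingChunkRow {
  workspaceId: string;
  docId: string;
  chunk: number;
  updatedAt: Date;
}

// Equivalent of MAX(updated_at) ... GROUP BY workspace_id, doc_id.
function latestEmbeddingPerDoc(rows: EmbeddingChunkRow[]): Map<string, Date> {
  const latest = new Map<string, Date>();
  for (const row of rows) {
    const key = `${row.workspaceId}/${row.docId}`;
    const seen = latest.get(key);
    if (seen === undefined || row.updatedAt.getTime() > seen.getTime()) {
      latest.set(key, row.updatedAt);
    }
  }
  return latest;
}

// Un-aggregated check: true if ANY chunk row is older than the doc update.
function anyChunkOlderThan(rows: EmbeddingChunkRow[], docUpdatedAt: Date): boolean {
  return rows.some(row => row.updatedAt.getTime() < docUpdatedAt.getTime());
}
```

If one chunk was written before the doc update and another after it, anyChunkOlderThan reports true while the aggregated timestamp is already newer than the doc, which is the mismatch described above.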

🧹 Nitpick comments (2)

packages/backend/server/src/models/copilot-context.ts (1)
267-279: Align return type of getWorkspaceContent with behavior (avoid returning empty string on no rows)

findMany returns an empty array when no rows match; the current implementation returns '' (empty string), despite the function being typed as Promise<string | undefined>. This can lead to subtle logic bugs downstream.

Return undefined when no rows exist.
   async getWorkspaceContent(
     workspaceId: string,
     docId: string,
     chunk?: number
   ): Promise<string | undefined> {
-    const file = await this.db.aiWorkspaceEmbedding.findMany({
+    const file = await this.db.aiWorkspaceEmbedding.findMany({
       where: { workspaceId, docId, chunk },
       select: { content: true },
       orderBy: { chunk: 'asc' },
     });
-    return file?.map(f => f.content).join('\n');
+    if (!file.length) return undefined;
+    return file.map(f => f.content).join('\n');
   }
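
A standalone sketch of why the empty case matters: joining an empty array yields an empty string, not undefined. The joinContent helper is hypothetical and only isolates the two lines changed in the suggestion.

```typescript
// Hypothetical helper isolating the suggested behaviour:
// return undefined when there are no rows, otherwise join the chunk contents.
function joinContent(rows: { content: string }[]): string | undefined {
  if (rows.length === 0) return undefined;
  return rows.map(row => row.content).join('\n');
}

// Behaviour before the suggestion: an empty result set becomes ''.
function joinContentWithoutGuard(rows: { content: string }[]): string {
  return rows.map(row => row.content).join('\n');
}
```

A caller that checks `content === undefined` to detect "no embedding stored" would miss the empty-string result from the unguarded version.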
packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (1)

351-411: Well-designed scenarios for “only time elapsed” vs “doc actually updated”

30m-old snapshot + 25m-old embedding → no re-embed (correct).

Then a doc update 5m ago with the same 25m-old embedding → re-embed (correct).

Minor note: these tests depend on relative Date.now() calculations. They’re deterministic here, but if future changes add delays, consider capturing a fixed baseNow once (which you already do) and ensuring all timestamps derive from it to avoid flakiness.
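
One way to keep such tests deterministic is to derive every timestamp from a single captured base value. The helper names below are assumptions and do not come from the spec file.

```typescript
const MINUTE_MS = 60 * 1000;

// Hypothetical helper: a timestamp `minutes` before a fixed base instant.
function minutesBefore(baseNow: number, minutes: number): Date {
  return new Date(baseNow - minutes * MINUTE_MS);
}

// Builds the timestamps used in the scenarios above from one base value,
// so the gaps between them never depend on how long the test takes to run.
function buildScenario(baseNow: number) {
  return {
    docSnapshotAt: minutesBefore(baseNow, 30),
    embeddingAt: minutesBefore(baseNow, 25),
    docUpdateAt: minutesBefore(baseNow, 5),
  };
}
```

In a test, baseNow would be captured once (for example with Date.now()) and passed to buildScenario, rather than calling Date.now() separately for each timestamp.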

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa20e7b and 640c87c.

⛔ Files ignored due to path filters (1)

packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.snap is excluded by !**/*.snap

📒 Files selected for processing (5)

packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md (1 hunks)
packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (3 hunks)
packages/backend/server/src/models/copilot-context.ts (3 hunks)
packages/backend/server/src/models/copilot-workspace.ts (2 hunks)
packages/backend/server/src/plugins/copilot/embedding/job.ts (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (2)

packages/backend/server/src/plugins/copilot/embedding/job.ts (1)

packages/backend/server/src/plugins/copilot/context/session.ts (1)

workspaceId (29-31)

packages/backend/server/src/models/copilot-context.ts (1)

packages/backend/server/src/plugins/copilot/context/session.ts (1)

workspaceId (29-31)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (49)

GitHub Check: y-octo binding test on aarch64-apple-darwin
GitHub Check: y-octo binding test on x86_64-apple-darwin
GitHub Check: y-octo binding test on x86_64-pc-windows-msvc
GitHub Check: y-octo binding test on aarch64-pc-windows-msvc
GitHub Check: E2E BlockSuite Test (7)
GitHub Check: Run native tests
GitHub Check: E2E BlockSuite Test (10)
GitHub Check: E2E BlockSuite Test (5)
GitHub Check: E2E BlockSuite Test (3)
GitHub Check: E2E BlockSuite Test (6)
GitHub Check: E2E BlockSuite Test (9)
GitHub Check: E2E BlockSuite Test (4)
GitHub Check: E2E BlockSuite Test (8)
GitHub Check: E2E Mobile Test (5)
GitHub Check: E2E BlockSuite Test (2)
GitHub Check: E2E BlockSuite Test (1)
GitHub Check: E2E Mobile Test (3)
GitHub Check: E2E Mobile Test (4)
GitHub Check: fuzzing
GitHub Check: Build Server native
GitHub Check: E2E Mobile Test (1)
GitHub Check: Build @affine/electron renderer
GitHub Check: E2E Mobile Test (2)
GitHub Check: E2E BlockSuite Cross Browser Test (2, webkit)
GitHub Check: E2E BlockSuite Cross Browser Test (2, firefox)
GitHub Check: Build AFFiNE native (x86_64-apple-darwin)
GitHub Check: E2E BlockSuite Cross Browser Test (1, webkit)
GitHub Check: E2E BlockSuite Cross Browser Test (2, chromium)
GitHub Check: E2E BlockSuite Cross Browser Test (1, chromium)
GitHub Check: E2E BlockSuite Cross Browser Test (1, firefox)
GitHub Check: E2E Test (4)
GitHub Check: E2E Test (1)
GitHub Check: E2E Test (7)
GitHub Check: E2E Test (5)
GitHub Check: E2E Test (10)
GitHub Check: E2E Test (3)
GitHub Check: E2E Test (9)
GitHub Check: E2E Test (8)
GitHub Check: E2E Test (2)
GitHub Check: E2E Test (6)
GitHub Check: Analyze (typescript, blocksuite)
GitHub Check: Build AFFiNE native (x86_64-pc-windows-msvc)
GitHub Check: loom thread test
GitHub Check: Build AFFiNE native (aarch64-pc-windows-msvc)
GitHub Check: Analyze (typescript, affine)
GitHub Check: Analyze (javascript, affine)
GitHub Check: Analyze (javascript, blocksuite)
GitHub Check: Lint
GitHub Check: Typecheck

🔇 Additional comments (6)

packages/backend/server/src/models/copilot-workspace.ts (1)

209-212: Comment update accurately reflects new gating rule

The docstring now matches the intended behavior (first-time embed OR re-embed only if newer doc AND last embed older than 10 minutes). Good alignment with the new SQL predicate.

packages/backend/server/src/models/copilot-context.ts (2)

221-241: File embeddings upsert now refreshes content — good guard and behavior

Early-return on empty embeddings avoids needless writes and noise.

Upsert on (context_id, file_id, chunk) updating content/embedding/updated_at is correct and prevents drift.

304-307: Upsert includes content refresh — correct and consistent with context embeddings

Updating content alongside embedding and updated_at ensures DB reflects the latest material. Good.
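
The guard-plus-upsert behaviour described in these two comments can be modelled with an in-memory store. This is only an illustration keyed on (context_id, file_id, chunk); the class, its field names, and the explicit `now` parameter are assumptions, and the real code goes through the database.

```typescript
interface StoredEmbedding {
  content: string;
  embedding: number[];
  updatedAt: Date;
}

interface ChunkEmbedding {
  index: number;
  content: string;
  embedding: number[];
}

// Hypothetical in-memory stand-in for the embeddings table.
class InMemoryFileEmbeddings {
  private rows = new Map<string, StoredEmbedding>();

  // Returns the number of rows written; an empty input writes nothing.
  insertFileEmbedding(
    contextId: string,
    fileId: string,
    embeddings: ChunkEmbedding[],
    now: Date
  ): number {
    if (embeddings.length === 0) return 0;
    for (const item of embeddings) {
      // Upsert: an existing row has content, embedding, and updatedAt replaced.
      this.rows.set(`${contextId}/${fileId}/${item.index}`, {
        content: item.content,
        embedding: item.embedding,
        updatedAt: now,
      });
    }
    return embeddings.length;
  }

  get(contextId: string, fileId: string, chunk: number): StoredEmbedding | undefined {
    return this.rows.get(`${contextId}/${fileId}/${chunk}`);
  }

  get size(): number {
    return this.rows.size;
  }
}
```

Refreshing content together with the vector keeps the stored text and its embedding in step, which is what a content-equality skip would rely on.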

packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md (1)

104-125: Snapshot cases cover the updated gating matrix well

The five scenarios map precisely to the new AND-based re-embedding rule and first-embed behavior. Looks good.

packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (2)

296-299: Good: first-embed case captured via snapshot

This verifies the “no embedding exists” path returns true as intended.

334-340: Simulating stale embedding timestamp is a solid way to trigger the 10-minute rule

Directly setting updatedAt to an older timestamp ensures you’re exercising the time window without relying on wall-clock delays.

packages/backend/server/src/plugins/copilot/embedding/job.ts

fix: frequent embedding

640c87c

darkskygit requested a review from a team August 11, 2025 23:55

github-project-automation bot added this to AFFiNE Project Aug 11, 2025

github-actions bot added app:server test Related to test cases labels Aug 11, 2025

coderabbitai bot reviewed Aug 12, 2025

feat: clear meta in chunk

e2a9059

fengmk2 approved these changes Aug 12, 2025

chore: cleanup codes

f98eee8

darkskygit enabled auto-merge August 12, 2025 01:21

chore: cleanup space

0d5564f

darkskygit added this pull request to the merge queue Aug 12, 2025

chore: cleanup space

2546951

darkskygit removed this pull request from the merge queue due to a manual request Aug 12, 2025

darkskygit enabled auto-merge August 12, 2025 01:34

darkskygit added this pull request to the merge queue Aug 12, 2025

Merged via the queue into canary with commit 65f679c Aug 12, 2025
109 checks passed

darkskygit deleted the darksky/fix-frequent-embedding branch August 12, 2025 02:03

coderabbitai bot mentioned this pull request Aug 14, 2025

feat(server): integrate blob to context #13491

Merged
