fix(server): frequent embedding #13475

Merged: 5 commits, Aug 12, 2025

Conversation

darkskygit
Member

@darkskygit darkskygit commented Aug 11, 2025

Summary by CodeRabbit

  • New Features
    • Smarter embedding pipeline skips re-embedding when content hasn’t changed; added content sanitization for embeddings and workspace content retrieval.
  • Bug Fixes
    • Re-embedding now requires both a document update and the last embedding being older than 10 minutes, reducing unnecessary work.
  • Refactor
    • Consolidated embedding preprocessing and moved sanitization utilities into shared models; upserts now refresh stored content.
  • Tests
    • Expanded snapshot-based tests covering multiple time/age scenarios for embedding decision logic.

@darkskygit darkskygit requested a review from a team August 11, 2025 23:55
Contributor

coderabbitai bot commented Aug 11, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between f98eee8 and 2546951.

📒 Files selected for processing (1)
  • packages/backend/server/src/plugins/copilot/embedding/job.ts (2 hunks)

Walkthrough

Adjusts re-embedding criteria to require both a newer doc update and embedding older than 10 minutes, adds content-equality skip in the embedding job, moves/centralizes embedding sanitization and dimensions into models, adds workspace/file embedding insert/get methods, and expands tests with time-based snapshot scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Tests: embedding snapshots & scenarios (packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md, packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts) | Replaces direct boolean assertions with snapshot assertions; adds multi-scenario time-based cases (no embedding, recent embedding, updated-after-old-embedding, only-window-passed, updated-after-window) and simulates embedding timestamps. |
| Workspace re-embed logic (packages/backend/server/src/models/copilot-workspace.ts) | Changes re-embedding trigger from OR to AND: requires docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes'; preserves first-time-embedding when none exists. |
| Embedding sanitization & constants (packages/backend/server/src/models/common/copilot.ts) | Adds EMBEDDING_DIMENSIONS (1024), clearEmbeddingContent, and clearEmbeddingChunk; implements FILTER_PREFIX trimming of leading metadata lines for embedding content/chunks. |
| Context model methods & upserts (packages/backend/server/src/models/copilot-context.ts) | Imports EMBEDDING_DIMENSIONS and clearEmbeddingContent; removes local EMBEDDING_DIMENSIONS export; sanitizes getFileContent/getWorkspaceContent; adds insertFileEmbedding(contextId, fileId, embeddings) with empty-array guard; upserts now update content, embedding, and updated_at. |
| Embedding job: content-stability guard (packages/backend/server/src/plugins/copilot/embedding/job.ts) | Adds normalize helper and uses getWorkspaceContent + normalized comparison to skip embedding generation when fragment.summary equals stored content (early return on match). |
| Plugin imports updated (packages/backend/server/src/plugins/copilot/mcp/provider.ts, packages/backend/server/src/plugins/copilot/tools/doc-semantic-search.ts) | Redirects clearEmbeddingChunk import from utils to models (no runtime logic change). |
| Removed util (packages/backend/server/src/plugins/copilot/utils.ts) | Deletes local clearEmbeddingChunk and FILTER_PREFIX from utils (logic moved to models). |
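
As a rough illustration of the "FILTER_PREFIX trimming of leading metadata lines" described above, here is a minimal sketch. The prefix values and the function name stripLeadingMetadata are hypothetical; the PR summary does not show the real FILTER_PREFIX entries or the body of clearEmbeddingContent, so this only mirrors the described behavior.

```typescript
// Hypothetical prefixes: the real FILTER_PREFIX values are not shown in this PR summary.
const ASSUMED_FILTER_PREFIX: readonly string[] = [
  'Title: ',
  'Created at: ',
  'Updated at: ',
];

// Drop leading lines that start with a known metadata prefix.
// Trimming stops at the first line that is not metadata, so a later
// line that happens to start with a prefix is kept.
function stripLeadingMetadata(content: string): string {
  const lines = content.split('\n');
  const firstContentLine = lines.findIndex(
    line => !ASSUMED_FILTER_PREFIX.some(prefix => line.startsWith(prefix))
  );
  // Every line was metadata: nothing is left to embed.
  if (firstContentLine === -1) return '';
  return lines.slice(firstContentLine).join('\n');
}
```

Only the leading run of metadata is removed, which matches the "leading metadata lines" wording in the summary; whether the real helper also filters interior lines is not stated.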

Sequence Diagram(s)

sequenceDiagram
  participant Job as EmbeddingJob
  participant Ctx as CopilotContext
  participant DB as DB

  Job->>Ctx: getWorkspaceContent(workspaceId, docId[, chunk])
  Ctx->>DB: SELECT content ORDER BY chunk ASC
  DB-->>Ctx: content
  Ctx-->>Job: sanitized content

  alt normalized(summary) == normalized(stored content)
    Job->>Job: log skip and return
  else
    Job->>Job: build fragments, generate embeddings
    Job->>Ctx: insertFileEmbedding(contextId, fileId, embeddings)
    Ctx->>DB: UPSERT ai_context_embeddings (content, embedding, updated_at)
    DB-->>Ctx: ok
    Ctx-->>Job: done
  end
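
The skip branch in the diagram above could look roughly like the sketch below. The normalization rules (unify line endings, collapse whitespace, trim) are an assumption, since the summary only says a normalize helper exists; isContentUnchanged is a hypothetical name.

```typescript
// Assumed normalization: unify line endings, collapse whitespace runs, trim.
// The actual normalize helper in job.ts may apply different rules.
function normalize(text: string): string {
  return text.replace(/\r\n?/g, '\n').replace(/\s+/g, ' ').trim();
}

// Returns true when the new summary matches what is already stored,
// in which case the job can return early instead of re-embedding.
function isContentUnchanged(
  summary: string,
  stored: string | undefined
): boolean {
  // Nothing stored yet: the content must be embedded.
  if (stored === undefined) return false;
  return normalize(summary) === normalize(stored);
}
```

A caller would check isContentUnchanged(fragment.summary, storedContent) before building fragments and log the skip when it returns true.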
sequenceDiagram
  participant Caller
  participant WS as CopilotWorkspaceModel
  participant DB as DB

  Caller->>WS: checkDocNeedEmbedded(workspaceId, docId)
  WS->>DB: Query docs + last embedding e
  alt no existing embedding (e.updated_at IS NULL)
    WS-->>Caller: needsEmbedding = true
  else
    alt docs.updated_at > e.updated_at AND e.updated_at < NOW() - 10m
      WS-->>Caller: needsEmbedding = true
    else
      WS-->>Caller: needsEmbedding = false
    end
  end
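
The decision in the diagram above can be restated as a small pure function. This is a sketch of the predicate only; the real check runs as SQL inside checkDocNeedEmbedded, and the names needsEmbedding and REEMBED_WINDOW_MS are illustrative.

```typescript
// 10-minute window taken from the SQL predicate quoted in the walkthrough.
const REEMBED_WINDOW_MS = 10 * 60 * 1000;

function needsEmbedding(
  docUpdatedAt: Date,
  embeddingUpdatedAt: Date | null,
  now: Date
): boolean {
  // First-time embedding: no row exists yet.
  if (embeddingUpdatedAt === null) return true;
  const docIsNewer = docUpdatedAt.getTime() > embeddingUpdatedAt.getTime();
  const embeddingIsStale =
    embeddingUpdatedAt.getTime() < now.getTime() - REEMBED_WINDOW_MS;
  // Both conditions must hold (previously either one was enough).
  return docIsNewer && embeddingIsStale;
}
```

With the AND, a doc edited seconds after it was embedded waits until the embedding is at least 10 minutes old, and an old embedding alone no longer triggers work.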

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • fengmk2
  • pengx17

Poem

I nibble lines and clear the fluff,
If nothing's changed, then skip the stuff.
Ten minutes tick, the timestamps say,
Embed when new — else let it stay.
Snapshots hum; the rabbit hops away. 🥕🐇


@github-actions github-actions bot added the app:server and test (Related to test cases) labels Aug 11, 2025

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 62.50000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.34%. Comparing base (aa20e7b) to head (2546951).
⚠️ Report is 2 commits behind head on canary.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ckages/backend/server/src/models/common/copilot.ts | 42.42% | 19 Missing ⚠️ |
| ...ackend/server/src/plugins/copilot/embedding/job.ts | 47.36% | 10 Missing ⚠️ |
| ...kages/backend/server/src/models/copilot-context.ts | 94.44% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #13475      +/-   ##
==========================================
+ Coverage   56.79%   57.34%   +0.55%     
==========================================
  Files        2718     2718              
  Lines      134945   135032      +87     
  Branches    20730    20834     +104     
==========================================
+ Hits        76637    77430     +793     
+ Misses      56640    55937     -703     
+ Partials     1668     1665       -3     

| Flag | Coverage Δ |
|---|---|
| server-test | 79.13% <62.50%> (+1.19%) ⬆️ |
| unittest | 32.11% <ø> (+<0.01%) ⬆️ |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
packages/backend/server/src/models/copilot-workspace.ts (1)

241-247: Use the latest embedding timestamp per doc to avoid false positives and improve performance

Joining directly to ai_workspace_embeddings without aggregation may cause:

  • False positives when multiple chunks exist: EXISTS can succeed against an older chunk even if a newer chunk’s updated_at is up-to-date.
  • Unnecessary row scans and repeated comparisons.

Aggregate to the latest embedding time per (workspace_id, doc_id) and compare against that. This also reduces row cardinality in the join.

Apply this diff to aggregate embedding timestamps:

-            LEFT JOIN ai_workspace_embeddings e
-              ON e.workspace_id = docs.workspace_id
-              AND e.doc_id = docs.doc_id
+            LEFT JOIN (
+              SELECT
+                workspace_id,
+                doc_id,
+                MAX(updated_at) AS updated_at
+              FROM ai_workspace_embeddings
+              GROUP BY workspace_id, doc_id
+            ) e
+              ON e.workspace_id = docs.workspace_id
+              AND e.doc_id = docs.doc_id
           WHERE
             e.updated_at IS NULL
-            OR (docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes')
+            OR (docs.updated_at > e.updated_at AND e.updated_at < NOW() - INTERVAL '10 minutes')

Optional follow-up (further correctness/perf): also collapse docs to the latest doc update via a second CTE (SELECT MAX(updated_at) per doc) before the join, so the comparison is strictly between “latest doc update” and “latest embedding”.
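
To make the multi-chunk concern concrete, here is an in-memory analogue of the suggested MAX(updated_at) ... GROUP BY workspace_id, doc_id subquery. It is a sketch only; the row shape and the name latestEmbeddingByDoc are assumptions, not code from this repository.

```typescript
// Assumed row shape, loosely following the ai_workspace_embeddings columns named above.
interface EmbeddingRow {
  workspaceId: string;
  docId: string;
  chunk: number;
  updatedAt: Date;
}

// Collapse all chunks of a doc to the newest embedding timestamp,
// so the staleness check compares against one value per doc.
function latestEmbeddingByDoc(rows: EmbeddingRow[]): Map<string, Date> {
  const latest = new Map<string, Date>();
  for (const row of rows) {
    // JSON-encode the pair so IDs containing separators cannot collide.
    const key = JSON.stringify([row.workspaceId, row.docId]);
    const seen = latest.get(key);
    if (seen === undefined || row.updatedAt.getTime() > seen.getTime()) {
      latest.set(key, row.updatedAt);
    }
  }
  return latest;
}
```

Without the aggregation, a doc with one stale chunk and one fresh chunk can match on the stale row; with it, only the freshest timestamp is compared.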

🧹 Nitpick comments (2)
packages/backend/server/src/models/copilot-context.ts (1)

267-279: Align return type of getWorkspaceContent with behavior (avoid returning empty string on no rows)

findMany returns an empty array when no rows match; the current implementation returns '' (empty string), despite the function being typed as Promise<string | undefined>. This can lead to subtle logic bugs downstream.

Return undefined when no rows exist.

   async getWorkspaceContent(
     workspaceId: string,
     docId: string,
     chunk?: number
   ): Promise<string | undefined> {
-    const file = await this.db.aiWorkspaceEmbedding.findMany({
+    const file = await this.db.aiWorkspaceEmbedding.findMany({
       where: { workspaceId, docId, chunk },
       select: { content: true },
       orderBy: { chunk: 'asc' },
     });
-    return file?.map(f => f.content).join('\n');
+    if (!file.length) return undefined;
+    return file.map(f => f.content).join('\n');
   }
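
The empty-string behavior follows from how Array.prototype.join treats an empty array. A standalone sketch of the suggested guard, with joinChunkContent as a hypothetical helper name:

```typescript
// Joins chunk contents in order, or returns undefined when there are no rows,
// so callers can tell "no stored content" apart from "stored empty content".
function joinChunkContent(
  rows: ReadonlyArray<{ content: string }>
): string | undefined {
  if (rows.length === 0) return undefined;
  return rows.map(row => row.content).join('\n');
}

// For comparison: joining an empty array yields '' rather than undefined.
const joinedEmpty: string = ([] as string[]).join('\n');
```

This matters for the content-equality skip in the embedding job: '' could compare equal to an empty summary, whereas undefined cannot.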
packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (1)

351-411: Well-designed scenarios for “only time elapsed” vs “doc actually updated”

  • 30m-old snapshot + 25m-old embedding → no re-embed (correct).
  • Then a doc update 5m ago with the same 25m-old embedding → re-embed (correct).

Minor note: these tests depend on relative Date.now() calculations. They’re deterministic here, but if future changes add delays, consider capturing a fixed baseNow once (which you already do) and ensuring all timestamps derive from it to avoid flakiness.
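
A sketch of the fixed-base pattern mentioned in the note, using the 30/25/5-minute offsets from the scenarios above. The helper name minutesBefore and the variable names are illustrative and are not taken from the spec file.

```typescript
const MINUTE_MS = 60 * 1000;

// Derive every timestamp from one captured base so the offsets between
// them stay exact no matter how long the test takes to run.
function minutesBefore(base: number, minutes: number): Date {
  return new Date(base - minutes * MINUTE_MS);
}

const baseNow = Date.now();
const snapshotUpdatedAt = minutesBefore(baseNow, 30);
const embeddingUpdatedAt = minutesBefore(baseNow, 25);
const docUpdatedAt = minutesBefore(baseNow, 5);
```

Calling Date.now() separately for each timestamp would let the gaps drift by however long the intervening awaits take.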

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa20e7b and 640c87c.

⛔ Files ignored due to path filters (1)
  • packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.snap is excluded by !**/*.snap
📒 Files selected for processing (5)
  • packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md (1 hunks)
  • packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (3 hunks)
  • packages/backend/server/src/models/copilot-context.ts (3 hunks)
  • packages/backend/server/src/models/copilot-workspace.ts (2 hunks)
  • packages/backend/server/src/plugins/copilot/embedding/job.ts (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
packages/backend/server/src/plugins/copilot/embedding/job.ts (1)
packages/backend/server/src/plugins/copilot/context/session.ts (1)
  • workspaceId (29-31)
packages/backend/server/src/models/copilot-context.ts (1)
packages/backend/server/src/plugins/copilot/context/session.ts (1)
  • workspaceId (29-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (49)
  • GitHub Check: y-octo binding test on aarch64-apple-darwin
  • GitHub Check: y-octo binding test on x86_64-apple-darwin
  • GitHub Check: y-octo binding test on x86_64-pc-windows-msvc
  • GitHub Check: y-octo binding test on aarch64-pc-windows-msvc
  • GitHub Check: E2E BlockSuite Test (7)
  • GitHub Check: Run native tests
  • GitHub Check: E2E BlockSuite Test (10)
  • GitHub Check: E2E BlockSuite Test (5)
  • GitHub Check: E2E BlockSuite Test (3)
  • GitHub Check: E2E BlockSuite Test (6)
  • GitHub Check: E2E BlockSuite Test (9)
  • GitHub Check: E2E BlockSuite Test (4)
  • GitHub Check: E2E BlockSuite Test (8)
  • GitHub Check: E2E Mobile Test (5)
  • GitHub Check: E2E BlockSuite Test (2)
  • GitHub Check: E2E BlockSuite Test (1)
  • GitHub Check: E2E Mobile Test (3)
  • GitHub Check: E2E Mobile Test (4)
  • GitHub Check: fuzzing
  • GitHub Check: Build Server native
  • GitHub Check: E2E Mobile Test (1)
  • GitHub Check: Build @affine/electron renderer
  • GitHub Check: E2E Mobile Test (2)
  • GitHub Check: E2E BlockSuite Cross Browser Test (2, webkit)
  • GitHub Check: E2E BlockSuite Cross Browser Test (2, firefox)
  • GitHub Check: Build AFFiNE native (x86_64-apple-darwin)
  • GitHub Check: E2E BlockSuite Cross Browser Test (1, webkit)
  • GitHub Check: E2E BlockSuite Cross Browser Test (2, chromium)
  • GitHub Check: E2E BlockSuite Cross Browser Test (1, chromium)
  • GitHub Check: E2E BlockSuite Cross Browser Test (1, firefox)
  • GitHub Check: E2E Test (4)
  • GitHub Check: E2E Test (1)
  • GitHub Check: E2E Test (7)
  • GitHub Check: E2E Test (5)
  • GitHub Check: E2E Test (10)
  • GitHub Check: E2E Test (3)
  • GitHub Check: E2E Test (9)
  • GitHub Check: E2E Test (8)
  • GitHub Check: E2E Test (2)
  • GitHub Check: E2E Test (6)
  • GitHub Check: Analyze (typescript, blocksuite)
  • GitHub Check: Build AFFiNE native (x86_64-pc-windows-msvc)
  • GitHub Check: loom thread test
  • GitHub Check: Build AFFiNE native (aarch64-pc-windows-msvc)
  • GitHub Check: Analyze (typescript, affine)
  • GitHub Check: Analyze (javascript, affine)
  • GitHub Check: Analyze (javascript, blocksuite)
  • GitHub Check: Lint
  • GitHub Check: Typecheck
🔇 Additional comments (6)
packages/backend/server/src/models/copilot-workspace.ts (1)

209-212: Comment update accurately reflects new gating rule

The docstring now matches the intended behavior (first-time embed OR re-embed only if newer doc AND last embed older than 10 minutes). Good alignment with the new SQL predicate.

packages/backend/server/src/models/copilot-context.ts (2)

221-241: File embeddings upsert now refreshes content — good guard and behavior

  • Early-return on empty embeddings avoids needless writes and noise.
  • Upsert on (context_id, file_id, chunk) updating content/embedding/updated_at is correct and prevents drift.
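
The two points above can be modeled with an in-memory store. This is a sketch under assumptions: the real insertFileEmbedding writes to the database, and the types and class name here are hypothetical.

```typescript
// Assumed shapes; the real embedding type in the repository may differ.
interface ChunkEmbedding {
  index: number;
  content: string;
  embedding: number[];
}

interface StoredRow {
  content: string;
  embedding: number[];
  updatedAt: Date;
}

class InMemoryEmbeddingStore {
  private readonly rows = new Map<string, StoredRow>();

  // Returns the number of rows written.
  upsert(
    contextId: string,
    fileId: string,
    embeddings: ChunkEmbedding[],
    now: Date = new Date()
  ): number {
    // Empty-array guard: nothing to write, so skip the store entirely.
    if (embeddings.length === 0) return 0;
    for (const e of embeddings) {
      // Key on (contextId, fileId, chunk); an existing row is overwritten,
      // refreshing content, embedding, and updatedAt together.
      const key = JSON.stringify([contextId, fileId, e.index]);
      this.rows.set(key, {
        content: e.content,
        embedding: e.embedding,
        updatedAt: now,
      });
    }
    return embeddings.length;
  }

  get(contextId: string, fileId: string, chunk: number): StoredRow | undefined {
    return this.rows.get(JSON.stringify([contextId, fileId, chunk]));
  }

  count(): number {
    return this.rows.size;
  }
}
```

Refreshing content in the same write as the embedding keeps the stored text in step with the vector, which the content-equality skip relies on.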

304-307: Upsert includes content refresh — correct and consistent with context embeddings

Updating content alongside embedding and updated_at ensures DB reflects the latest material. Good.

packages/backend/server/src/__tests__/models/__snapshots__/copilot-workspace.spec.ts.md (1)

104-125: Snapshot cases cover the updated gating matrix well

The five scenarios map precisely to the new AND-based re-embedding rule and first-embed behavior. Looks good.

packages/backend/server/src/__tests__/models/copilot-workspace.spec.ts (2)

296-299: Good: first-embed case captured via snapshot

This verifies the “no embedding exists” path returns true as intended.


334-340: Simulating stale embedding timestamp is a solid way to trigger the 10-minute rule

Directly setting updatedAt to an older timestamp ensures you’re exercising the time window without relying on wall-clock delays.

@darkskygit darkskygit enabled auto-merge August 12, 2025 01:21
@darkskygit darkskygit added this pull request to the merge queue Aug 12, 2025
@darkskygit darkskygit removed this pull request from the merge queue due to a manual request Aug 12, 2025
@darkskygit darkskygit enabled auto-merge August 12, 2025 01:34
@darkskygit darkskygit added this pull request to the merge queue Aug 12, 2025
Merged via the queue into canary with commit 65f679c Aug 12, 2025
109 checks passed
@darkskygit darkskygit deleted the darksky/fix-frequent-embedding branch August 12, 2025 02:03
Labels
app:server test Related to test cases
2 participants