feat(server): clear semantic search metadata #13197
Walkthrough
The update introduces a utility to remove leading metadata lines from document chunks before returning them in search results. It also separates and merges author and metadata retrieval for documents, enriching chunk metadata with author details. The overall function signatures remain unchanged, and all modifications are internal.
Sequence Diagram(s)
sequenceDiagram
participant User
participant DocSearchGetter
participant Models
User->>DocSearchGetter: search(query)
DocSearchGetter->>Models: findAuthors(docIds)
Models-->>DocSearchGetter: authors[]
DocSearchGetter->>Models: findMetas(docIds)
Models-->>DocSearchGetter: metas[]
DocSearchGetter->>DocSearchGetter: merge authors into metas
DocSearchGetter->>DocSearchGetter: for each chunk, clearEmbeddingChunk(chunk)
DocSearchGetter-->>User: cleaned, enriched chunks[]
Assessment against linked issues: Out-of-scope changes
No out-of-scope changes found.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## canary #13197 +/- ##
==========================================
- Coverage 57.52% 57.51% -0.02%
==========================================
Files 2709 2709
Lines 132506 132552 +46
Branches 20678 20678
==========================================
+ Hits 76222 76231 +9
- Misses 54075 54655 +580
+ Partials 2209 1666 -543
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/backend/server/src/plugins/copilot/tools/doc-semantic-search.ts (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
packages/backend/server/src/plugins/copilot/tools/doc-semantic-search.ts (1)
packages/backend/server/src/models/common/copilot.ts (1): ChunkSimilarity (96-100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (48)
- GitHub Check: y-octo binding test on x86_64-apple-darwin
- GitHub Check: E2E BlockSuite Test (9)
- GitHub Check: y-octo binding test on aarch64-pc-windows-msvc
- GitHub Check: y-octo binding test on x86_64-pc-windows-msvc
- GitHub Check: E2E BlockSuite Test (8)
- GitHub Check: Run native tests
- GitHub Check: E2E BlockSuite Test (10)
- GitHub Check: E2E BlockSuite Test (7)
- GitHub Check: E2E Mobile Test (4)
- GitHub Check: E2E BlockSuite Test (2)
- GitHub Check: E2E Mobile Test (5)
- GitHub Check: E2E BlockSuite Test (6)
- GitHub Check: E2E BlockSuite Test (4)
- GitHub Check: E2E BlockSuite Test (3)
- GitHub Check: E2E BlockSuite Test (5)
- GitHub Check: E2E BlockSuite Test (1)
- GitHub Check: E2E Mobile Test (2)
- GitHub Check: E2E Mobile Test (1)
- GitHub Check: E2E Mobile Test (3)
- GitHub Check: Build AFFiNE native (x86_64-apple-darwin)
- GitHub Check: E2E BlockSuite Cross Browser Test (2, webkit)
- GitHub Check: E2E BlockSuite Cross Browser Test (1, chromium)
- GitHub Check: E2E BlockSuite Cross Browser Test (2, firefox)
- GitHub Check: E2E BlockSuite Cross Browser Test (1, webkit)
- GitHub Check: E2E BlockSuite Cross Browser Test (2, chromium)
- GitHub Check: E2E BlockSuite Cross Browser Test (1, firefox)
- GitHub Check: E2E Test (3)
- GitHub Check: E2E Test (4)
- GitHub Check: E2E Test (9)
- GitHub Check: E2E Test (10)
- GitHub Check: E2E Test (8)
- GitHub Check: E2E Test (6)
- GitHub Check: E2E Test (7)
- GitHub Check: E2E Test (5)
- GitHub Check: E2E Test (2)
- GitHub Check: E2E Test (1)
- GitHub Check: loom thread test
- GitHub Check: Build Server native
- GitHub Check: Build AFFiNE native (aarch64-pc-windows-msvc)
- GitHub Check: Build AFFiNE native (x86_64-pc-windows-msvc)
- GitHub Check: Build @affine/electron renderer
- GitHub Check: fuzzing
- GitHub Check: Analyze (typescript, blocksuite)
- GitHub Check: Analyze (javascript, affine)
- GitHub Check: Analyze (javascript, blocksuite)
- GitHub Check: Analyze (typescript, affine)
- GitHub Check: Typecheck
- GitHub Check: Lint
🔇 Additional comments (4)
packages/backend/server/src/plugins/copilot/tools/doc-semantic-search.ts (4)
2-2: LGTM - Import addition looks good.
The omit import from lodash-es is properly used later in the code for filtering document author fields.
12-18: LGTM - Well-defined metadata filter prefixes.
The FILTER_PREFIX array clearly defines the metadata prefixes that should be removed from chunks. The prefixes are comprehensive and cover the common metadata fields.
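To make the prefix-stripping idea concrete, here is a minimal sketch of removing leading metadata lines from a chunk. It is not the PR's code: the function name stripLeadingMetadata and the list ASSUMED_PREFIXES are hypothetical, and only the "Title:" and "Created at:" prefixes are taken from this review; the real contents of FILTER_PREFIX are not shown here.

```typescript
// Hypothetical prefix list. Only "Title:" and "Created at:" are mentioned in
// this review; the actual FILTER_PREFIX array may contain more entries.
const ASSUMED_PREFIXES: readonly string[] = ["Title:", "Created at:"];

// Drop metadata lines from the start of a chunk, stopping at the first line
// that does not begin with one of the given prefixes. Later lines are kept
// even if they happen to start with a prefix.
function stripLeadingMetadata(
  content: string,
  prefixes: readonly string[] = ASSUMED_PREFIXES
): string {
  const lines = content.split("\n");
  const firstBodyLine = lines.findIndex(
    line => !prefixes.some(prefix => line.startsWith(prefix))
  );
  // Every line was metadata: nothing is left to return.
  if (firstBodyLine === -1) return "";
  return lines.slice(firstBodyLine).join("\n");
}

const cleaned = stripLeadingMetadata(
  "Title: Roadmap\nCreated at: 2025-01-01\nQ1 goals\nTitle: kept, not leading"
);
console.log(cleaned);
```

Stopping at the first non-metadata line is a deliberate choice in this sketch: it avoids deleting body text that merely happens to start with a word like "Title:".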
74-101: LGTM - Improved document metadata retrieval with author information.
The refactored logic properly separates author and metadata retrieval, then merges them effectively. The defensive filtering with docs.filter(d => !!d) and the use of omit to clean up author fields are good practices.
The key mapping strategy (using doc.id for authors and doc.docId for metadata) appears to align with the respective API responses.
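The merge described above could look roughly like the following sketch. It is an illustration under assumptions, not the PR's implementation: the interfaces, the createdByUser/updatedByUser fields, and the function name mergeAuthorsIntoMetas are hypothetical. Only the id and docId keys and the filtering of missing entries come from this review. Rest destructuring stands in for lodash-es omit so the example has no dependencies.

```typescript
// Hypothetical shapes: every field other than `id` and `docId` is an assumption.
interface DocAuthor {
  id: string;
  createdByUser?: { name: string };
  updatedByUser?: { name: string };
}
interface DocMeta {
  docId: string;
  title: string;
}
type AuthorFields = Omit<DocAuthor, "id">;
type EnrichedMeta = DocMeta & AuthorFields;

function mergeAuthorsIntoMetas(
  authors: (DocAuthor | null | undefined)[],
  metas: (DocMeta | null | undefined)[]
): Map<string, EnrichedMeta> {
  // Index authors by `id`, skipping missing entries and dropping the key
  // field itself (the role `omit` plays in the reviewed code).
  const authorById = new Map<string, AuthorFields>();
  for (const author of authors) {
    if (!author) continue;
    const { id, ...rest } = author;
    authorById.set(id, rest);
  }
  // Metadata is keyed by `docId`; attach author fields when they exist.
  const result = new Map<string, EnrichedMeta>();
  for (const meta of metas) {
    if (!meta) continue;
    const authorFields: AuthorFields = authorById.get(meta.docId) ?? {};
    result.set(meta.docId, { ...meta, ...authorFields });
  }
  return result;
}

const enriched = mergeAuthorsIntoMetas(
  [{ id: "doc-1", createdByUser: { name: "Ada" } }, null],
  [{ docId: "doc-1", title: "Roadmap" }, { docId: "doc-2", title: "Notes" }]
);
console.log(enriched.get("doc-1"), enriched.get("doc-2"));
```

Documents without a matching author entry keep their metadata unchanged, so a missing author record does not drop the document from the results.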
104-104: Intentional: clearEmbeddingChunk only applies to fileChunks
docChunks come from the docs API and are returned without the injected “Title:…”, “Created at:…”, etc. prefixes, so there’s nothing to strip. fileChunks (and any contextChunks) may include those metadata lines and must be cleaned before downstream use. No change required.
fb47c53 to 6b05a0a
fix AI-360
fix AI-348