CodeWeaver

Extensible context platform and MCP server for hybrid semantic code search and targeted context delivery to AI coding agents.

Architectural Goals

Provide semantically-rich, ranked and prioritized search results for developers and their coding agents.

How

CodeWeaver combines ast-grep with support for dozens of embedding and reranking models, both local and remote, to provide weighted responses to searches.
CodeWeaver is fully pluggable. You can add embedding providers, reranking providers, agent providers, services and middleware, and new data sources beyond your codebase.

Eliminate 'cognitive load' on coding agents trying to get context on the codebase.

How

Reduces all operations to a single simple tool -- find_code -- allowing your coding agent to request what it needs, explain what it's trying to do, and get exactly the information it needs in response.
Uses MCP sampling to search and curate context for your coding agent, using your coding agent! This also works outside of an MCP context, where sampling isn't enabled or MCP isn't available. CodeWeaver uses a separate instance of your agent to evaluate your agent's needs and curate a response, so your agent stays unburdened by the context accumulated while searching.
CodeWeaver will also have heuristic fallback strategies for when an agent is not available to deliver search.

Significantly cut context bloat and costs. This also helps keep agents razor-focused on their tasks.

How

CodeWeaver aims to restrict the context sent to your coding agent to only the information it needs. That's not easy to do, but we hope to get close.
By reducing the context that's returned to your agent, your agent no longer has to "carry" all of that extra, unused context with it. That reduces token use on every turn and slows the compounding growth of context over a session.
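A rough back-of-the-envelope sketch of why this matters. It assumes (hypothetically) that every search result stays in the conversation and is re-sent as input on each later turn. The function name and the token counts are illustrative, not measurements from CodeWeaver.

```python
def total_input_tokens(tokens_per_result: int, turns: int, base_prompt: int = 0) -> int:
    """Total input tokens sent over a session when each turn adds one
    search result that stays in context for every later turn."""
    total = 0
    context = base_prompt
    for _ in range(turns):
        context += tokens_per_result  # the new result joins the context
        total += context              # the whole context is sent on this turn
    return total


# Hypothetical numbers: 20 turns, 4,000-token raw results vs. 500-token curated ones.
bloated = total_input_tokens(4_000, turns=20)
curated = total_input_tokens(500, turns=20)
print(bloated, curated)  # 840000 105000
```

Under these assumptions the session total grows roughly with the square of the number of turns, so trimming each result lowers the cost of the whole session, not just of one turn.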

Overview

CodeWeaver is more than an MCP server—it’s a context platform. It distills and delivers targeted, token-efficient context to your AI coding agent. Under the hood, it functions like RAG + hybrid search over semantically indexed code, and it can integrate arbitrary data sources (e.g., external API docs) through a provider architecture.

Status: Pre-release. Core architecture is in place; several integrations are still being tied together (see “Project status” and “Roadmap” below).

Contents

Features at a glance
Quickstart
CLI overview
Concepts and architecture
Providers and optional extras
Current tool surface (MCP + CLI)
Project status (what works vs. WIP)
Roadmap
Development
Telemetry and auth middleware
Licensing
Links

Features at a glance

Precise span-based code intelligence
- Span and SpanGroup types for exact line/column tracking
- Immutable set operations (union, intersection, difference) over spans
- Rich semantic metadata (AST-aware) for better chunking and assembly
Hybrid search foundation
- Text and semantic search with unified ranking (architecture complete; semantic pipeline integration WIP)
- Advanced, production-ready filtering (vendored search filters with pydantic-based validation)
Provider architecture
- Pluggable embedding, rerank, agent, vector store, and data source providers
- Multi-provider capability matrix (dynamic selection and inference)
pydantic-ai deep integration
- Agent-capable providers for intent analysis, query rewriting, and context planning
- Type-safe configuration and structured results
CLI and MCP server
- codeweaver server to run as an MCP server
- codeweaver search for local interactive search with multiple output formats (json, table, markdown)
- codeweaver config for configuration management and validation
Designed for background indexing and live updates (file watching implemented with watchfiles)
Strong foundation for performance and observability (robust statistics implementation, telemetry scaffolded)
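
One common way to combine text and semantic result lists into a single ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only: this README does not say which fusion method CodeWeaver uses, and the function name, the constant k = 60, and the sample file names are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of result ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id for deterministic output.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from two search passes.
text_hits = ["config.py", "providers.py", "cli.py"]
semantic_hits = ["providers.py", "registry.py", "config.py"]
fused = reciprocal_rank_fusion([text_hits, semantic_hits])
print(fused)  # ['providers.py', 'config.py', 'registry.py', 'cli.py']
```

Items that rank well in both lists rise to the top, while items found by only one pass still appear further down.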

Quickstart

Requirements

Python 3.12+ (tested primarily on 3.12)
Optional: Qdrant for vector storage (in progress), API keys for cloud providers if you choose them

Install

Pick one of the extras that fits your environment:

Recommended (cloud-capable; includes telemetry): uv pip install "codeweaver-mcp[recommended]"
Recommended without telemetry: uv pip install "codeweaver-mcp[recommended-no-telemetry]"
Local-only (no cloud calls; CPU embeddings via fastembed): uv pip install "codeweaver-mcp[recommended-local-only]"

A-la-carte (advanced)

You need:

required-core
at least one agent-capable provider (e.g., provider-openai or provider-anthropic)
at least one embedding-capable provider (e.g., provider-voyageai or provider-fastembed)
at least one vector store (provider-qdrant or provider-in-memory)

Example: uv pip install "codeweaver-mcp[required-core,cli,provider-openai,provider-qdrant,source-filesystem]"

Run the server: codeweaver server

Starts the MCP server with FastMCP.
Use --help to see options.

Run a search locally: codeweaver search "how do we configure providers?"

Use --format json|table|markdown
Use --help to explore options.

Configure: codeweaver config --help

Centralized config powered by pydantic-settings (multi-source).
Supports selecting providers and setting provider-specific options (e.g., API keys when applicable).

CLI overview

The CLI includes:

Server: codeweaver server
- Runs the MCP server with proper lifespan and application state management (integration in progress where noted).
Search: codeweaver search "query" [--format {json|table|markdown}]
- Runs a local search pipeline using the span-based assembly and available providers.
Config: codeweaver config …
- Manages settings, validates configuration, and helps you choose providers.

Use --help on any subcommand for full options.

Concepts and architecture

Span-based core

Spans precisely represent code locations (line/column), and SpanGroups allow composition.
Immutable, set-like operations enable accurate merging of results across passes (text, semantic, AST).
CodeChunk and CodeMatch carry spans and metadata, enabling token-aware, context-safe assembly.
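
A minimal sketch of the idea, assuming a simplified line-only span. The names LineSpan and union are hypothetical; CodeWeaver's actual Span and SpanGroup types also track columns and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSpan:
    """Hypothetical, simplified span: an inclusive line range within one file."""
    path: str
    start: int
    end: int

    def overlaps(self, other: "LineSpan") -> bool:
        return self.path == other.path and self.start <= other.end and other.start <= self.end

    def intersection(self, other: "LineSpan") -> "LineSpan | None":
        if not self.overlaps(other):
            return None
        return LineSpan(self.path, max(self.start, other.start), min(self.end, other.end))


def union(spans: list[LineSpan]) -> list[LineSpan]:
    """Merge overlapping or adjacent spans per file, returning new objects."""
    merged: list[LineSpan] = []
    for span in sorted(spans, key=lambda s: (s.path, s.start, s.end)):
        last = merged[-1] if merged else None
        if last is not None and last.path == span.path and span.start <= last.end + 1:
            # Frozen instances are never mutated; a new span replaces the old one.
            merged[-1] = LineSpan(last.path, last.start, max(last.end, span.end))
        else:
            merged.append(span)
    return merged


text_hit = LineSpan("app.py", 10, 20)
semantic_hit = LineSpan("app.py", 18, 30)
unrelated = LineSpan("app.py", 40, 45)
print(union([semantic_hit, unrelated, text_hit]))
print(text_hit.intersection(semantic_hit))
```

Because the spans are frozen, results from different passes can be merged without any pass seeing its own spans change underneath it.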

Semantic metadata

ExtKind enumerates language and chunk types.
SemanticMetadata tracks AST nodes and classifications to improve chunk boundaries and ranking.
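
To illustrate how AST nodes can drive chunk boundaries, here is a small stand-in that uses Python's standard-library ast module rather than ast-grep, so it only handles Python source. The function name and the tuple layout are assumptions made for this sketch.

```python
import ast


def top_level_chunks(source: str) -> list[tuple[str, str, int, int]]:
    """Return (kind, name, start_line, end_line) for each top-level def or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append((kind, node.name, node.lineno, node.end_lineno))
    return chunks


sample = '''\
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, item):
        self.items.append(item)
'''
print(top_level_chunks(sample))
# [('function', 'load', 3, 4), ('class', 'Index', 6, 8)]
```

Chunking on node boundaries keeps a function or class whole instead of cutting it at an arbitrary line count.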

Provider ecosystem

Embedding providers: VoyageAI, fastembed, sentence-transformers, etc.
Rerank providers: Cohere, Bedrock (via pydantic-ai-slim integrations).
Agent providers: major cloud LLMs via pydantic-ai-slim (OpenAI, Anthropic, Google, Mistral, Groq, Hugging Face, Bedrock, etc.).
Vector stores: in-memory (basic), Qdrant (in progress for span-aware indexing).
Data sources: filesystem (with planned file watching), Tavily, DuckDuckGo.

Advanced filtering (vendored)

Rich, validated filters for keyword, numeric, range, boolean, etc.
Dynamic tool signature generation via decorator-based wrappers.
Designed to be vendor-agnostic.

Configuration and settings

Multi-source configuration via pydantic-settings.
Capability-based provider selection and dynamic instantiation (the registry is still being completed).
Token budgeting and caching strategies planned.
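
As a rough picture of what multi-source configuration means, the sketch below layers environment variables over file values over defaults using only the standard library. It is a stand-in for pydantic-settings, not CodeWeaver's code: the CODEWEAVER_ prefix, the setting names, and the precedence order are all assumptions.

```python
from collections import ChainMap

# Hypothetical defaults; the real setting names may differ.
DEFAULTS = {"embedding_provider": "fastembed", "vector_store": "in-memory"}


def load_settings(
    file_values: dict[str, str],
    env: dict[str, str],
    prefix: str = "CODEWEAVER_",  # assumed prefix, for illustration only
) -> dict[str, str]:
    """Resolve settings with precedence: environment > config file > defaults."""
    env_values = {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }
    # ChainMap looks keys up in the first mapping that contains them.
    return dict(ChainMap(env_values, file_values, DEFAULTS))


settings = load_settings(
    file_values={"vector_store": "qdrant"},
    env={"CODEWEAVER_EMBEDDING_PROVIDER": "voyageai", "HOME": "/home/dev"},
)
print(settings["embedding_provider"], settings["vector_store"])  # voyageai qdrant
```

In real use the env argument would typically be os.environ.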

Providers and optional extras

Top-level extras (convenience)

recommended: end-to-end features with cloud support
recommended-no-telemetry: same as above without telemetry
recommended-local-only: local embeddings (fastembed), no cloud calls

A-la-carte extras (compose what you need)

required-core: the minimal core of CodeWeaver
cli: CLI niceties (rich, cyclopts)
pre-context: components used for pre-context and watching
Agent-only providers: provider-anthropic, provider-groq
Agent + embeddings providers: provider-openai, provider-google, provider-huggingface, provider-mistral
Embedding + rerank + agent: provider-bedrock, provider-cohere
Embedding-only: provider-fastembed, provider-sentence-transformers (CPU/GPU variants), provider-voyageai
Vector stores: provider-in-memory, provider-qdrant
Data sources: source-filesystem, source-tavily, source-duckduckgo

See pyproject.toml for exact versions and groups.

Current tool surface (MCP + CLI)

MCP server

Primary tool: find_code (being integrated)
- Query: natural language
- Filters (planned defaults):
  - language: keyword (any)
  - file_type: keyword (code, docs, config)
  - created_after: numeric timestamp (>=)
- Output: span-based code matches with execution metadata
Additional tool surfaces will evolve as pipelines and strategies are implemented.
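
The planned default filters can be read as the following predicate logic. This is a sketch under assumptions: Match and apply_filters are hypothetical names, and the real find_code output carries spans and execution metadata that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for a result's filterable metadata."""
    path: str
    language: str
    file_type: str  # "code", "docs", or "config"
    created: float  # Unix timestamp


def apply_filters(matches, language=None, file_type=None, created_after=None):
    """Keep only matches that pass every filter that was supplied.

    language: collection of accepted languages (match any of them)
    file_type: a single keyword
    created_after: numeric timestamp, compared with >=
    """
    kept = []
    for match in matches:
        if language is not None and match.language not in language:
            continue
        if file_type is not None and match.file_type != file_type:
            continue
        if created_after is not None and match.created < created_after:
            continue
        kept.append(match)
    return kept


matches = [
    Match("src/app.py", "python", "code", 1_700_000_000),
    Match("docs/intro.md", "markdown", "docs", 1_710_000_000),
    Match("web/main.ts", "typescript", "code", 1_720_000_000),
]
recent_code = apply_filters(
    matches, language={"python", "typescript"}, file_type="code", created_after=1_710_000_000
)
print([m.path for m in recent_code])  # ['web/main.ts']
```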

CLI search

codeweaver search "your query" --format table
Uses the same underlying discovery and assembly model, outputting structured results.
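
For scripting, the CLI can be driven from Python along these lines. The sketch uses only the search subcommand and the --format flag shown in this README. It assumes that --format json writes a single JSON document to stdout, which is not confirmed here, and the helper names are hypothetical.

```python
import json
import shutil
import subprocess


def search_command(query: str, fmt: str = "json") -> list[str]:
    """Build the argv for `codeweaver search` with a supported output format."""
    if fmt not in {"json", "table", "markdown"}:
        raise ValueError(f"unsupported format: {fmt}")
    return ["codeweaver", "search", query, "--format", fmt]


def run_search(query: str):
    """Run the search and parse stdout as JSON.

    Returns None when the codeweaver CLI is not on PATH.
    Assumes the JSON format prints one JSON document to stdout.
    """
    if shutil.which("codeweaver") is None:
        return None
    done = subprocess.run(search_command(query), capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


print(search_command("how do we configure providers?"))
```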

Project status (what works vs. WIP)

What’s built

Strong span-based type system and semantic metadata
Sophisticated data models: DiscoveredFile, CodeChunk, CodeMatch
Deep pydantic-ai provider integration
Capability-based provider selection scaffolding
CLI (server, search, config) with rich output and robust error handling
Vendored filtering system that’s production-ready and provider-agnostic

What’s ~97% complete

Embedding, reranking, and agentic capabilities (provider integrations and orchestration)
Agent handling is implemented but needs to be tied fully into the pipelines

What’s in progress / planned

Provider registry (_registry.py) and final glue code
FastMCP middleware and application state management
Vector stores: Qdrant implementation (span-aware); in-memory baseline
File discovery + indexing: rignore-based discovery; background indexing with watchfiles
Pipelines and strategies: orchestration via pydantic-graph
Hybrid search with unified ranking across signals
Query intent analysis via agents
Performance, caching, and comprehensive test coverage
Authentication and authorization middleware

Roadmap

Phase 1: Core integration

Complete provider registry and statistics integration
Finalize FastMCP application state and context handling
Deliver working find_code over text search with filter integration
Basic tests for core workflows

Phase 2: Semantic search

Integrate embeddings (VoyageAI, fastembed) and Qdrant vector store
AST-aware chunking and span-aware indexing
Background indexing (watchfiles) and incremental updates
Hybrid search with unified ranking and intent analysis

Phase 3: Advanced capabilities

pydantic-graph pipelines and multi-stage workflows
Multi-signal ranking (semantic, syntactic, keyword)
Performance optimization and caching
Enhanced metadata leverage, comprehensive testing, telemetry/monitoring

Development

Clone and install (full dev environment)

uv pip install "codeweaver-mcp[all-dev]"

Linters and type checking

Ruff and Pyright are configured (strict mode for src/codeweaver/**)

Tests

Pytest config is included with markers for unit, integration, e2e, and provider-specific tests
Coverage thresholds are configured (with cov-fail-under)

Local run

codeweaver server
codeweaver search "query"

Contribution notes

Dual-licensed repository (MIT OR Apache-2.0)
A contributors agreement is included; please review CONTRIBUTORS_LICENSE_AGREEMENT.md
Issues and PRs welcome—especially for providers, vector stores, pipelines, and indexing (or anything else)

Telemetry and auth middleware

Telemetry

PostHog integration is included in the recommended extras. We take great care to avoid capturing identifying or proprietary data, and we use this telemetry only to improve CodeWeaver. If you're unsure, please review our implementation to see exactly what we collect.
Use recommended-no-telemetry to exclude telemetry from install

Auth middleware (optional)

Permit.io (permit-fastmcp), Eunomia (eunomia-mcp), and AuthKit integrations are scaffolded through FastMCP
Enablement is controlled via environment variables when using those middlewares (see comments in pyproject.toml)

Licensing

All original Knitli code is licensed under MIT OR Apache-2.0. See LICENSE, LICENSE-MIT, and LICENSE-Apache-2.0. Some vendored code is Apache-2.0 only or MIT only.

This project follows the REUSE specification. Every file contains exact license information or has an accompanying .license file.

Links

Repository: https://github.com/knitli/codeweaver-mcp
Issues: https://github.com/knitli/codeweaver-mcp/issues
Documentation (in progress): https://dev.knitli.com/codeweaver
Changelog: https://github.com/knitli/codeweaver-mcp/blob/main/CHANGELOG.md
Knitli: https://knitli.com
X: https://x.com/knitli_inc
LinkedIn: https://linkedin.com/company/knitli
GitHub: https://github.com/knitli

Notes

Python package name: codeweaver-mcp
CLI entry point: codeweaver (module: codeweaver.cli.app:main)
Requires Python >= 3.11 (classifiers include 3.12–3.14)

