Welcome to a state-of-the-art, event-driven pipeline engineered to perform deep semantic analysis of software codebases. This system discovers, analyzes, and understands the intricate relationships within code, building a queryable, high-fidelity knowledge graph using Large Language Models (LLMs) and graph database technology.
This is not just a static analysis tool; it's a dynamic, scalable, and resilient platform for building a "living" model of your code's cognitive architecture.
- Overview
- Core Concepts
- Architectural Blueprint
- The Data Pipeline: A Step-by-Step Journey
- Key Features
- Comparison to Existing Technologies
- Getting Started
- API Reference
- Directory Structure
- Database Schema
- Contributing
- License
This system is architected to deconstruct a software repository and rebuild it as a knowledge graph. It achieves this by:
- Discovering all relevant files and directories, intelligently respecting `.gitignore` rules.
- Analyzing file contents using LLMs to extract key Points of Interest (POIs), such as classes, functions, variables, and imports.
- Triangulating Relationships through a sophisticated multi-pass analysis, combining deterministic logic with contextual LLM queries to identify how POIs connect.
- Building a Knowledge Graph in Neo4j, representing the codebase as a network of nodes (POIs) and edges (relationships).
- Providing Real-time Insight into the analysis process through a robust REST and WebSocket API.
- Maintaining Itself by automatically cleaning up graph elements corresponding to deleted files.
The result is a powerful, queryable graph that enables advanced code navigation, dependency analysis, architectural validation, and impact analysis for refactoring.
The power of this system comes from the fusion of several key architectural patterns and concepts.
This is the core philosophy of the pipeline. Instead of relying on a single method to find relationships, the system uses multiple, independent "witnesses" to build confidence.
- Deterministic Pass: Identifies obvious, rule-based relationships (e.g., a file `CONTAINS` a class). This pass is fast and highly accurate.
- Intra-File LLM Pass: An LLM analyzes the POIs within a single file to find local relationships (e.g., function A `CALLS` function B in the same file).
- Intra-Directory LLM Pass: An LLM analyzes all POIs within a directory to find connections between files (e.g., one file `IMPORTS` an exported class from another).
- Global LLM Pass: After summarizing entire directories, an LLM analyzes these summaries to find high-level architectural relationships between different modules or services.
The `ConfidenceScoringService` then aggregates the "evidence" from these passes. A relationship reported by multiple passes receives a higher confidence score, filtering out noise and LLM hallucinations.
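The scoring itself is encapsulated in `ConfidenceScoringService`. As a rough, hedged illustration of the idea only (the pass names, weights, and evidence shape below are assumptions, not the service's actual logic), combining independent witnesses might look like this:

```javascript
// Illustrative only -- not the actual ConfidenceScoringService implementation.
// Each evidence record is assumed to look like { pass: '<pass name>', found: true }.
const PASS_WEIGHTS = {
  deterministic: 0.45,      // rule-based hits are the strongest signal
  'intra-file': 0.25,
  'intra-directory': 0.2,
  global: 0.1,
};

function scoreRelationship(evidence) {
  // Sum the weights of every independent pass that reported the relationship.
  const score = evidence
    .filter((e) => e.found)
    .reduce((sum, e) => sum + (PASS_WEIGHTS[e.pass] ?? 0), 0);
  return Math.min(score, 1);
}

// Seen by two passes: 0.45 + 0.25 = 0.70, comfortably above a 0.5 threshold.
// A lone global-pass sighting would score only 0.10 and be discarded as noise.
console.log(scoreRelationship([
  { pass: 'deterministic', found: true },
  { pass: 'intra-file', found: true },
]));
```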
The system is built on BullMQ and Redis, allowing for a highly scalable and decoupled architecture. Each stage of the pipeline is a queue, and each processing unit is a worker. This design means:
- Scalability: You can run hundreds or thousands of workers across multiple machines to process massive codebases in parallel.
- Resilience: If a worker fails while processing a job, BullMQ ensures the job is automatically retried, preventing data loss.
- Decoupling: Each worker has a single responsibility. Any worker (e.g., `FileAnalysisWorker`) can be updated, improved, or replaced without affecting the rest of the system. (A minimal worker sketch follows this list.)
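To make the shape of a processing unit concrete, here is a minimal BullMQ worker sketch. The job payload (`filePath`) and the concurrency setting are illustrative assumptions, not the pipeline's actual contract.

```javascript
// Minimal queue-driven worker sketch -- illustrative, not the real FileAnalysisWorker.
const { Worker } = require('bullmq');
const IORedis = require('ioredis');

// BullMQ workers need maxRetriesPerRequest: null on their Redis connection.
const connection = new IORedis(process.env.REDIS_URL, { maxRetriesPerRequest: null });

const worker = new Worker(
  'file-analysis-queue',
  async (job) => {
    const { filePath } = job.data; // assumed job payload shape
    console.log(`Analyzing ${filePath}...`);
    // ...read the file, call the LLM, write findings to the outbox...
  },
  { connection, concurrency: 4 }
);

// BullMQ retries failed jobs according to each job's attempts/backoff options.
worker.on('failed', (job, err) => console.error(`Job ${job?.id} failed:`, err.message));
```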
To ensure maximum reliability, the system uses the transactional outbox pattern via the `TransactionalOutboxPublisher`.

When a worker (like `FileAnalysisWorker`) processes a file, it doesn't immediately publish an event to the next queue. Instead, it writes its findings and an "event to be published" into the SQLite database within a single transaction.

A separate, dedicated `TransactionalOutboxPublisher` service polls this outbox table and reliably publishes the events to BullMQ. This guarantees at-least-once delivery and prevents the scenario where a worker crashes after performing its work but before enqueueing the next job, which would otherwise halt the pipeline for that file.
The system leverages two different databases for what they do best:
- SQLite (`better-sqlite3`): Acts as the fast, transactional, and transient data store. It's perfect for the outbox pattern, storing intermediate POIs, and tracking job states. Its file-based nature makes it easy to manage per-run.
- Neo4j (graph database): Serves as the final, persistent knowledge graph. It is optimized for storing and querying complex network relationships, making it the ideal destination for our analysis.
+------------------+ +--------------------+ +-----------------------+ +-------------------+
| |----->| File Analysis |----->| Relationship |----->| |
| EntityScout | | Queue & Workers | | Resolution Queues | | GraphBuilder |
| (Discover Files) | | (Extract POIs) | | & Workers (Triangulate) | | (Persist to Neo4j)|
| |----->| (Chunking, LLM) |----->| (Confidence Score) |----->| |
+------------------+ +----------+---------+ +-----------+-----------+ +-------------------+
| | |
v v v
+-------------------------------------------------------------------------+
| |
| SQLite (Intermediate Storage: POIs, Relationships, Outbox, Evidence) |
| |
+-------------------------------------------------------------------------+
^ ^ ^
| | |
+-------+--------------------------+------------------------+-------------+
| |
| Transactional Outbox Publisher (Reliable Eventing) |
| |
+-------------------------------------------------------------------------+
|
| Publishes to...
v
+-------------------------------------------------------------------------+
| |
| BullMQ / Redis (Job Queues & Caching) |
| |
+-------------------------------------------------------------------------+
The pipeline is a collection of specialized agents and workers, each with a distinct role.
| Component | File Path | Purpose |
|---|---|---|
| `PipelineApiService` | `src/utils/pipelineApi.js` | Provides REST and WebSocket APIs for starting, stopping, and monitoring pipeline runs in real-time. The primary user control plane. |
| `EntityScout` | `src/agents/EntityScout.js` | The starting point. Scans the target directory, respects `.gitignore`, and enqueues files into `file-analysis-queue`. |
| `FileDiscoveryBatcher` | `src/workers/fileDiscoveryBatcher.js` | An intelligent batching worker that groups files based on token count to optimize LLM API calls, improving efficiency and reducing cost. |
| `FileAnalysisWorker` | `src/workers/fileAnalysisWorker.js` | Consumes jobs from `file-analysis-queue`. Reads file content, chunks large files, and uses an LLM to extract POIs. Publishes findings to the transactional outbox. |
| `DirectoryAggregationWorker` | `src/workers/directoryAggregationWorker.js` | A coordination worker that uses Redis to track when all files within a single directory have been analyzed, triggering the `DirectoryResolutionWorker`. |
| `DirectoryResolutionWorker` | `src/workers/directoryResolutionWorker.js` | Uses an LLM to generate a high-level summary of a directory's purpose based on the contents of its files. |
| `TransactionalOutboxPublisher` | `src/services/TransactionalOutboxPublisher.js` | Polls the SQLite `outbox` table and reliably publishes events to the appropriate BullMQ queues, ensuring data consistency and at-least-once delivery. |
| `RelationshipResolver` | `src/agents/RelationshipResolver.js` | Orchestrates the multi-pass relationship extraction process (deterministic, intra-file, intra-directory, global). This is the "brain" of the analysis. |
| `RelationshipResolutionWorker` | `src/workers/relationshipResolutionWorker.js` | A dedicated worker that takes a single POI and its context, queries an LLM to find relationships, and writes findings to the outbox. |
| `ValidationWorker` | `src/workers/ValidationWorker.js` | Consumes analysis findings, stores them as "evidence" in SQLite, and uses Redis to determine when a relationship has enough evidence to be reconciled. |
| `ReconciliationWorker` | `src/workers/ReconciliationWorker.js` | Takes a relationship with all its evidence, uses `ConfidenceScoringService` to calculate a final score, and persists validated relationships to the main `relationships` table. |
| `GraphBuilder` | `src/agents/GraphBuilder.js` | The final step. Reads validated, high-confidence relationships from SQLite and executes efficient Cypher queries to build the final knowledge graph in Neo4j. |
| `SelfCleaningAgent` | `src/agents/SelfCleaningAgent.js` | A maintenance agent that finds files that have been deleted from the filesystem and removes their corresponding nodes and relationships from the databases. |
| `DeepSeekClient` | `src/utils/deepseekClient.js` | A native, dependency-free client for the DeepSeek LLM API, featuring concurrency management, automatic retries, and backoff logic. |
| `LLMResponseSanitizer` | `src/utils/LLMResponseSanitizer.js` | A defensive utility that cleans and repairs common formatting issues in LLM JSON output, making parsing more resilient. |
| `QueueManager` | `src/utils/queueManager.js` | A singleton wrapper around BullMQ that manages all queue and worker connections to Redis. |
| `DatabaseManager` | `src/utils/sqliteDb.js` | Manages the connection and schema for the SQLite database. |
| `neo4jDriver` | `src/utils/neo4jDriver.js` | Manages the singleton connection to the Neo4j graph database. |
Here is how a single file travels through the system:
- Initiation: A user submits a new pipeline run via the `POST /api/pipeline/start` endpoint, specifying a target directory.
- Discovery: `EntityScout` awakens, scans the directory, and creates a "file-analysis" job for every discovered file, placing it in the `file-analysis-queue`.
- POI Extraction: A `FileAnalysisWorker` picks up a job. It reads the file, asks the LLM to identify all POIs (functions, classes, etc.), and writes these POIs along with an event into the SQLite `outbox` table in a single transaction.
- Reliable Eventing: The `TransactionalOutboxPublisher` polls the `outbox`, sees the new event, and publishes it to the `relationship-resolution-queue`.
- Relationship Triangulation: A `RelationshipResolutionWorker` receives the POI. It orchestrates the multi-pass analysis, querying the LLM with different contexts to find potential relationships. Each piece of evidence is written to the `outbox`.
- Evidence Validation: The `ValidationWorker` consumes these evidence events from the outbox. It stores them in the `relationship_evidence` table and uses Redis to track how much evidence has been gathered for a specific potential relationship.
- Reconciliation & Scoring: Once all evidence for a relationship is collected, the `ValidationWorker` enqueues a job for the `ReconciliationWorker`. This worker fetches all evidence, calculates a final `confidence_score` using `ConfidenceScoringService`, and, if the score exceeds a threshold, writes the final, validated relationship to the main `relationships` table in SQLite.
- Graph Construction: After all analysis jobs are complete, the `GraphBuilder` agent runs. It reads all validated relationships from SQLite and bulk-inserts them into Neo4j, creating the final, queryable knowledge graph. (A bulk-insert sketch follows this list.)
- Monitoring: Throughout this entire process, status updates are broadcast via WebSocket, allowing a user to monitor the progress in real-time.
- Cleanup: If a file is ever deleted from the source repository, the `SelfCleaningAgent` can be run to find and remove its corresponding data from both SQLite and Neo4j, keeping the graph synchronized with the codebase.
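A hedged sketch of what that bulk insert can look like with the official `neo4j-driver` follows; the exact Cypher, batching, and row shape used by `GraphBuilder` may differ.

```javascript
// Illustrative bulk upsert of validated relationships -- not GraphBuilder's actual queries.
const neo4j = require('neo4j-driver');

const driver = neo4j.driver(
  process.env.NEO4J_URI,
  neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD)
);

async function persistRelationships(rows) {
  // rows are assumed to come from the SQLite `relationships` table,
  // e.g. { sourceId, targetId, type, confidence }.
  const session = driver.session({ database: process.env.NEO4J_DATABASE });
  try {
    // UNWIND upserts a whole batch of rows in a single query.
    await session.run(
      `UNWIND $rows AS row
       MERGE (a:POI {id: row.sourceId})
       MERGE (b:POI {id: row.targetId})
       MERGE (a)-[r:RELATIONSHIP {type: row.type}]->(b)
       SET r.confidence = row.confidence`,
      { rows }
    );
  } finally {
    await session.close();
  }
}
```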
- Deep Semantic Analysis: Goes far beyond simple pattern matching by using LLMs to understand code intent and context.
- Highly Scalable: The distributed worker/queue architecture allows for massive parallel processing of large and complex codebases.
- Extremely Reliable: The transactional outbox pattern guarantees that no analysis work is lost, even if parts of the system crash.
- Confidence-Scored Relationships: The "Cognitive Triangulation" approach filters out LLM noise and produces a high-fidelity graph.
- Real-time Monitoring: A built-in API and WebSocket server provide immediate insight into the pipeline's status, progress, and logs.
- Automated Graph Construction: Fully automates the process of turning a directory of code into a queryable Neo4j knowledge graph.
- Language Agnostic: Because it relies on LLMs, the system can be adapted to analyze code in any programming language.
| Feature | This System | Static Analysis Tools (SonarQube, ESLint) | Code Search Tools (Sourcegraph) | AI Code Assistants (GitHub Copilot) |
|---|---|---|---|---|
| Primary Goal | Build a queryable graph of semantic relationships. | Find bugs, enforce code style, measure quality. | Search, navigate, and understand code. | Suggest code, answer questions. |
| Relationship Extraction | Advanced, multi-pass, confidence-scored. | Limited, rule-based (e.g., unused imports). | No, provides "find references" only. | Limited, contextual but not exhaustive. |
| LLM-Based Analysis | Core component for deep understanding. | No. | No (traditionally). | Core component for suggestions. |
| Persistent Knowledge Graph | Yes, creates a Neo4j graph. | No. | No (builds a search index). | No. |
| Real-time Monitoring | Yes, via REST and WebSocket API. | No. | No. | No. |
| Scalability | High, designed with distributed queues. | Varies, often monolithic scans. | High, designed for large-scale indexing. | N/A (cloud service). |
| Reliability | High, via transactional outbox. | N/A. | N/A. | N/A. |
Follow these steps to get the pipeline running on your local machine.
- Node.js: v18.x or later recommended.
- Redis: An instance of Redis for BullMQ. Can be run locally or via Docker.
- Neo4j: A Neo4j instance (v5.x recommended). Can be run locally or via Docker.
- DeepSeek API Key: You need an API key from DeepSeek or another compatible LLM provider.
- Clone the Repository:

  ```bash
  git clone https://github.com/yourusername/cognitive-triangulation-pipeline.git
  cd cognitive-triangulation-pipeline
  ```

- Install Dependencies:

  ```bash
  npm install
  ```

- Create an Environment File: Copy the example environment file:

  ```bash
  cp .env.example .env
  ```

- Edit `.env`: Open the `.env` file and fill in the details for your local setup:

  ```
  # --- LLM API Configuration ---
  # Get your key from https://platform.deepseek.com/
  DEEPSEEK_API_KEY=sk-your_secret_api_key

  # --- Redis Configuration ---
  # URL for your Redis instance
  REDIS_URL=redis://127.0.0.1:6379

  # --- Neo4j Database Configuration ---
  # URI for your Neo4j instance
  NEO4J_URI=bolt://localhost:7687
  NEO4J_USER=neo4j
  # Use a secure password, especially in production
  NEO4J_PASSWORD=your_secure_password
  NEO4J_DATABASE=neo4j

  # --- SQLite Database Configuration ---
  # Path where the intermediate SQLite database will be stored
  SQLITE_DB_PATH=./database.db
  ```
The system includes a user-friendly API server for managing runs.
- Initialize the Database Schema: This command will create the `database.db` file and set up the necessary tables.

  ```bash
  npm run init-db
  ```

- Start the API Server: This will start the web server and the WebSocket server for real-time updates. You should see output indicating the server is running on `http://localhost:3002`.

  ```bash
  npm run start-api
  ```

- Submit a Pipeline Run: Use `curl` or any API client to send a `POST` request to the `/api/pipeline/start` endpoint. Replace `/path/to/your/codebase` with the absolute path to the directory you want to analyze.

  ```bash
  curl -X POST http://localhost:3002/api/pipeline/start \
    -H "Content-Type: application/json" \
    -d '{"targetDirectory": "/path/to/your/codebase"}'
  ```

- Monitor the Progress:
  - Terminal: Watch the console output from the `npm run start-api` command.
  - REST API: Get the status of a specific run: `GET http://localhost:3002/api/pipeline/status/:pipelineId`
  - WebSocket: Connect a WebSocket client to `ws://localhost:3002` to receive real-time JSON patches of the pipeline status.
Once the pipeline completes, you can explore the generated knowledge graph in your Neo4j Browser.
The `PipelineApiService` provides the following endpoints (a scripted usage example follows the table):

| Endpoint | Method | Description |
|---|---|---|
| `/health` | `GET` | Health check to confirm the server is running. |
| `/api/pipeline/start` | `POST` | Starts a new analysis pipeline run. Body: `{ "targetDirectory": "/path/to/code" }`. |
| `/api/pipeline/status/:pipelineId` | `GET` | Retrieves the detailed current status and logs for a specific pipeline run. |
| `/api/pipeline/active` | `GET` | Lists all currently active or recently completed pipeline runs managed by the server. |
| `/api/pipeline/stop/:pipelineId` | `POST` | Requests a graceful stop for a running pipeline. |
| `/api/pipeline/clear/:pipelineId` | `DELETE` | Clears the history of a completed or failed pipeline run from the server's memory. |
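If you prefer scripting over `curl`, the same flow can be driven from Node (v18+ ships a global `fetch`). The response field names used below (`pipelineId`, `status`) are assumptions about the API's JSON shape:

```javascript
// Start a run and poll it until it finishes -- response field names are assumptions.
async function runPipeline(targetDirectory) {
  const res = await fetch('http://localhost:3002/api/pipeline/start', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ targetDirectory }),
  });
  const { pipelineId } = await res.json();

  let status;
  do {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const statusRes = await fetch(`http://localhost:3002/api/pipeline/status/${pipelineId}`);
    ({ status } = await statusRes.json());
    console.log(`Pipeline ${pipelineId}: ${status}`);
  } while (status !== 'completed' && status !== 'failed');
}

runPipeline('/path/to/your/codebase');
```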
Connect to `ws://localhost:3002` to receive real-time updates (a minimal client sketch follows the table).

| Event Type | Payload Description |
|---|---|
| `initial_state` | Sent on connection. Provides a full list of all active pipelines. |
| `pipeline_update` | Sent whenever a pipeline's status, progress, or logs are updated. Contains the full data object for the specific pipeline that changed. |
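A minimal monitoring client using the `ws` package (`npm install ws`) might look like the sketch below; the exact payload fields are assumptions based on the event types above.

```javascript
// Minimal WebSocket monitoring client -- payload field names are assumptions.
const WebSocket = require('ws');

const socket = new WebSocket('ws://localhost:3002');

socket.on('open', () => console.log('Connected to pipeline status stream'));

socket.on('message', (raw) => {
  const message = JSON.parse(raw.toString());
  if (message.type === 'initial_state') {
    console.log('Active pipelines:', message.payload);
  } else if (message.type === 'pipeline_update') {
    console.log('Pipeline update:', message.payload);
  }
});
```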
```
src/
├── agents/      # High-level orchestrators (EntityScout, GraphBuilder)
├── workers/     # Single-responsibility, queue-driven workers
├── services/    # Core services (LLM Client, Outbox Publisher)
├── utils/       # Shared utilities (DB drivers, queue manager, logger)
├── config/      # Application configuration
├── main.js      # Main CLI entry point for the pipeline
└── ...
```
Defined in `src/utils/schema.sql`, the SQLite database serves as the pipeline's operational datastore. Its main tables are listed below, followed by a small inspection sketch.

- `files`: Tracks discovered files and their processing status.
- `pois`: Stores Points of Interest extracted from files.
- `relationships`: Stores validated relationships with their final confidence scores.
- `relationship_evidence`: Stores all raw "evidence" for potential relationships from different analysis passes before reconciliation.
- `directory_summaries`: Stores the LLM-generated summaries for each directory.
- `outbox`: The transactional outbox table for reliable event publishing.
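For ad-hoc inspection after a run, you can open the intermediate database directly. The sketch below assumes column names (`confidence_score`, `status`) that may differ from the real schema in `src/utils/schema.sql`:

```javascript
// Quick look at the intermediate store -- column names are assumptions.
const Database = require('better-sqlite3');

const db = new Database('./database.db', { readonly: true });

// How many validated relationships cleared a 0.5 confidence threshold?
const validated = db
  .prepare('SELECT COUNT(*) AS count FROM relationships WHERE confidence_score >= ?')
  .get(0.5);
console.log(`Validated relationships: ${validated.count}`);

// How many outbox events are still waiting to be published?
const pending = db
  .prepare("SELECT COUNT(*) AS count FROM outbox WHERE status = 'PENDING'")
  .get();
console.log(`Pending outbox events: ${pending.count}`);
```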
The final graph in Neo4j is simple and powerful (an example query follows the schema):

- Nodes:
  - `:POI`: All Points of Interest are stored as nodes with this label.
    - Properties: `id`, `type`, `name`, `filePath`, `startLine`, `endLine`.
- Relationships:
  - `:RELATIONSHIP`: A generic relationship type between two `:POI` nodes.
    - Properties: `type` (e.g., 'CALLS', 'IMPORTS', 'CONTAINS'), `confidence`.
Contributions are welcome! This project is ambitious, and there are many avenues for improvement, from performance tuning to enhancing the LLM prompts for even more nuanced analysis. Please open an issue to discuss your ideas or submit a pull request.
This project is licensed under the MIT License.