# Architecture
MegaBrain follows a modular, event-driven architecture built on a modern Java stack with reactive programming principles.
## High-Level Architecture
```mermaid
flowchart TD
    subgraph sources ["Source Code Repositories"]
        GitHub
        GitLab
        Bitbucket
        LocalGit["Local Git"]
    end
    subgraph ingestion ["Ingestion Layer (EPIC-01)"]
        SourceControl["Source Control\nClient Factory"] --> ParserRegistry["Parser Registry\n(JavaParser / Tree-sitter)"]
        ParserRegistry --> DependencyExtractor["Dependency\nExtractor"]
    end
    subgraph storage ["Storage Layer"]
        LuceneIndex["Lucene Index\n(Primary)"]
        GraphDB["Graph DB\n(Neo4j)"]
        VectorStore["Vector Store\n(pgvector, Optional)"]
    end
    subgraph search ["Search & Retrieval Layer (EPIC-02)"]
        HybridSearch["Hybrid Search Orchestrator\n(Keyword + Semantic + Graph + Vector)"]
    end
    subgraph rag ["RAG Layer (EPIC-03)"]
        ContextAssembler["Context\nAssembler"] --> LLMProvider["LLM Provider\n(Ollama / OpenAI / Anthropic)"]
        LLMProvider --> AnswerStreamer["Answer\nStreamer (SSE)"]
    end
    subgraph interfaces ["Interface Layer (EPIC-04, EPIC-05, EPIC-08)"]
        RestAPI["REST API"]
        WebUI["Web UI\n(Angular)"]
        CLI["CLI\n(Picocli)"]
        MCPServer["MCP Server"]
    end
    sources --> ingestion
    ingestion --> storage
    storage --> search
    search --> rag
    rag --> interfaces
```
## Backend Package Structure
```text
io.megabrain/
├── api/ # REST endpoints (JAX-RS)
│ ├── SearchResource # GET /api/v1/search
│ ├── IngestionResource # POST /api/v1/ingestion
│ ├── HealthResource # GET /q/health
│ ├── GrammarHealthCheck # Readiness probe for grammars
│ ├── SearchRequest # Search request DTO
│ ├── SearchResponse # Search response with facets
│ ├── SearchResult # Individual search result DTO
│ ├── IngestionRequest # Ingestion request DTO
│ ├── IngestionResult # Ingestion result DTO
│ ├── LineRange # Line range metadata
│ └── FieldMatchInfo # Per-field match scores
│
├── core/ # Core services and utilities
│ ├── LuceneIndexService # Lucene index management and search
│ ├── HybridIndexService # Combines Lucene + vector search
│ ├── SearchOrchestrator # Coordinates hybrid + graph search
│ ├── ResultMerger # Merges and deduplicates results
│ ├── QueryParserService # Lucene query parsing
│ ├── CodeAwareAnalyzer # camelCase/snake_case tokenizer
│ ├── DocumentMapper # TextChunk to Lucene Document
│ ├── LuceneSchema # Index field definitions
│ ├── LuceneFilterQueryBuilder # Metadata filter construction
│ ├── BoostConfiguration # Configurable field boosts
│ ├── HybridScorer # Weighted score combination
│ ├── VectorScoreNormalizer # Score normalization (0-1)
│ ├── ABTestHarness # Relevance comparison framework
│ ├── VectorStore # Vector storage interface
│ ├── PgVectorStore # pgvector implementation
│ ├── EmbeddingService # Embedding generation
│ ├── EmbeddingModelService # Model management
│ ├── GraphQueryService # Graph query interface
│ ├── GraphQueryServiceStub # Graph query implementation
│ ├── ImplementsClosureQuery # Transitive implements traversal
│ ├── ExtendsClosureQuery # Transitive extends traversal
│ ├── LLMClient # LLM provider interface
│ ├── OllamaLLMClient # Ollama LLM implementation
│ └── OllamaConfiguration # Ollama config mapping
│
├── ingestion/ # Code ingestion services
│ ├── IngestionService # Ingestion orchestration interface
│ ├── IngestionServiceImpl # Ingestion implementation
│ ├── IncrementalIndexingService # Git-diff based indexing
│ ├── RepositoryIndexStateService # Index state tracking
│ ├── GitDiffService # Git diff interface
│ ├── JGitDiffService # JGit diff implementation
│ ├── github/ # GitHub integration
│ │ ├── GitHubSourceControlClient
│ │ ├── GitHubApiClient
│ │ └── GitHubTokenProvider
│ ├── gitlab/ # GitLab integration
│ │ ├── GitLabSourceControlClient
│ │ ├── GitLabApiClient
│ │ └── GitLabTokenProvider
│ ├── bitbucket/ # Bitbucket integration
│ │ ├── BitbucketSourceControlClient # Cloud & Server
│ │ └── BitbucketTokenProvider
│ ├── CompositeSourceControlClient # Multi-provider routing
│ └── parser/ # Code parsers
│ ├── CodeParser # Parser interface
│ ├── ParserRegistry # Extension-to-parser mapping
│ ├── ParserFactory # Parser creation
│ ├── GrammarManager # Grammar lifecycle management
│ ├── GrammarConfig # Grammar configuration
│ ├── GrammarSpec # Grammar specification
│ ├── java/ # Java parsing
│ │ ├── JavaParserService
│ │ └── JavaAstVisitor
│ └── treesitter/ # Tree-sitter parsing
│ └── [14 language parsers: C, C++, Python, JS, TS,
│ Go, Rust, Kotlin, Ruby, Scala, Swift, PHP, C#, Java]
│
└── repository/ # Repository state persistence
├── RepositoryIndexStateRepository
└── FileBasedRepositoryIndexStateRepository
```
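The `CodeAwareAnalyzer` listed above tokenizes identifiers on camelCase and snake_case boundaries so that a query for `index` can match `LuceneIndexService`. The splitting it performs can be sketched in plain Java; the class and method names below are illustrative, not MegaBrain's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch of code-aware tokenization: split identifiers on
// snake_case/kebab-case separators and camelCase boundaries, then lowercase
// the parts, so "index" can match "LuceneIndexService" or "lucene_index".
public class CodeTokenizerSketch {

    public static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        // First split on explicit separators (snake_case, kebab-case).
        for (String part : identifier.split("[_\\-]+")) {
            // Then split on lower-to-upper transitions (camelCase) and on
            // acronym boundaries, e.g. "HTTPServer" -> "HTTP", "Server".
            for (String sub : part.split(
                    "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")) {
                if (!sub.isEmpty()) {
                    tokens.add(sub.toLowerCase(Locale.ROOT));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("LuceneIndexService")); // [lucene, index, service]
        System.out.println(split("parse_tree_sitter"));  // [parse, tree, sitter]
    }
}
```

In the real analyzer this logic would live inside a Lucene `TokenFilter` so both indexed documents and queries pass through the same splitting.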
## Frontend (Angular 20)
```text
frontend/src/app/
├── app.ts # Root component
├── app.routes.ts # Routing configuration
├── app.config.ts # Application configuration
└── app.spec.ts # Root component tests
```
The frontend is currently scaffolded with Angular 20 standalone components. Feature modules (dashboard, search, chat) are planned for future sprints.
## Data Flow

### Ingestion Flow
```mermaid
flowchart LR
    Trigger["REST API / CLI"] --> Clone["Clone Repo\n(JGit)"]
    Clone --> Extract["Extract Files\n(filter by ext / .gitignore)"]
    Extract --> Route["ParserRegistry\n(language routing)"]
    Route --> Parse["JavaParser /\nTreeSitterParser"]
    Parse --> Grammar["GrammarManager\n(download & cache)"]
    Parse --> Index["LuceneIndexService +\nEmbeddingService"]
    Index --> Progress["SSE Progress\n(Mutiny Multi)"]
```
- **Trigger** - Repository ingestion is triggered via the REST API or CLI
- **Source Control** - `GitHubSourceControlClient` (or the GitLab/Bitbucket equivalent) clones the repository via JGit
- **File Extraction** - Files are extracted from the clone, filtered by extension and `.gitignore`
- **Language Routing** - `ParserRegistry` maps file extensions to the appropriate parser
- **Parsing** - `JavaParserService` (for `.java`) or `TreeSitterParser` (for 13+ other languages) creates structured `TextChunk` objects with metadata
- **Grammar Management** - `GrammarManager` downloads and caches Tree-sitter grammars from GitHub releases
- **Indexing** - `LuceneIndexService` indexes chunks in Lucene; `EmbeddingService` generates vectors stored in `PgVectorStore`
- **Progress** - Events are emitted via a Mutiny `Multi` at each stage for real-time SSE streaming
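The language-routing step above amounts to an extension-to-parser lookup. A minimal sketch in plain Java, using strings in place of real parser instances to stay self-contained (the names here are hypothetical, not MegaBrain's actual API):

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of extension-to-parser routing: a registry keyed by
// file extension, returning empty for files no parser handles.
public class ParserRegistrySketch {

    // In the real system each value would be a CodeParser implementation;
    // strings keep this sketch runnable without the rest of the codebase.
    private static final Map<String, String> PARSERS = Map.of(
            "java", "JavaParserService",
            "py", "TreeSitterParser(python)",
            "rs", "TreeSitterParser(rust)",
            "go", "TreeSitterParser(go)");

    public static Optional<String> parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension -> file is skipped
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(PARSERS.get(ext));
    }

    public static void main(String[] args) {
        System.out.println(parserFor("src/Main.java")); // Optional[JavaParserService]
        System.out.println(parserFor("README"));        // Optional.empty
    }
}
```

Keeping the mapping in one registry means adding a language is a registration change, not a change to the ingestion pipeline itself.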
### Query Flow
```mermaid
flowchart TD
    Request["GET /api/v1/search"] --> Orchestrator["SearchOrchestrator"]
    Orchestrator --> KeywordSearch["Keyword Search\n(LuceneIndexService)"]
    Orchestrator --> VectorSearch["Vector Search\n(PgVectorStore)"]
    KeywordSearch --> Normalize["VectorScoreNormalizer\n(0-1 range)"]
    VectorSearch --> Normalize
    Normalize --> Combine["HybridScorer\n(0.6 keyword + 0.4 vector)"]
    Combine --> Merge["ResultMerger\n(deduplicate)"]
    Merge --> Filter["LuceneFilterQueryBuilder\n(metadata filters)"]
    Filter --> Facets["Facet Aggregation"]
    Filter --> Transitive["GraphQueryService\n(optional transitive search)"]
    Facets --> Boost["BoostConfiguration\n(field relevance)"]
    Transitive --> Boost
    Boost --> Response["Search Response\n(scores, facets, matches)"]
```
- **Request** - User submits a query via the REST API (`GET /api/v1/search`)
- **Orchestration** - `SearchOrchestrator` coordinates the search pipeline
- **Hybrid Search** - `HybridIndexService` runs:
    - Keyword search via `LuceneIndexService` with `CodeAwareAnalyzer`
    - Vector search via `PgVectorStore` with cosine similarity (if enabled)
- **Score Normalization** - `VectorScoreNormalizer` normalizes both score sets to the 0-1 range
- **Score Combination** - `HybridScorer` combines scores with configurable weights (default: 0.6 keyword, 0.4 vector)
- **Result Merging** - `ResultMerger` deduplicates results appearing in both search modes
- **Metadata Filtering** - `LuceneFilterQueryBuilder` applies language, repository, entity_type, and file_path filters
- **Facet Aggregation** - `SortedSetDocValuesFacetCounts` computes available filter values
- **Transitive Search (optional)** - `GraphQueryService` executes Neo4j graph traversals for `implements:`, `extends:`, and `usages:` queries
- **Field Boost** - `BoostConfiguration` applies relevance boosts (entity_name=3.0x, doc_summary=2.0x)
- **Response** - Results are returned with scores, facets, field matches, and transitive markers
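The normalization, weighting, and merging steps above can be sketched in plain Java. The 0.6/0.4 defaults come from the flow description; min-max normalization and the names below are assumptions for illustration, not MegaBrain's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of hybrid scoring: min-max normalize each score set
// to the 0-1 range, then merge by chunk id with configurable weights
// (default 0.6 keyword / 0.4 vector), deduplicating chunks found by both modes.
public class HybridScorerSketch {

    // Normalize raw scores to 0-1 (min-max). A set where all scores are
    // equal collapses to 1.0 to avoid division by zero.
    public static Map<String, Double> normalize(Map<String, Double> raw) {
        double min = raw.values().stream().mapToDouble(Double::doubleValue).min().orElse(0);
        double max = raw.values().stream().mapToDouble(Double::doubleValue).max().orElse(1);
        Map<String, Double> out = new HashMap<>();
        for (var e : raw.entrySet()) {
            out.put(e.getKey(), max == min ? 1.0 : (e.getValue() - min) / (max - min));
        }
        return out;
    }

    // Weighted combination keyed by chunk id; a chunk found by only one
    // mode simply contributes nothing from the other.
    public static Map<String, Double> combine(Map<String, Double> keyword,
                                              Map<String, Double> vector,
                                              double keywordWeight,
                                              double vectorWeight) {
        Map<String, Double> merged = new HashMap<>();
        keyword.forEach((id, s) -> merged.merge(id, keywordWeight * s, Double::sum));
        vector.forEach((id, s) -> merged.merge(id, vectorWeight * s, Double::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> kw = normalize(Map.of("chunk-1", 12.0, "chunk-2", 4.0));
        Map<String, Double> vec = normalize(Map.of("chunk-1", 0.91, "chunk-3", 0.55));
        // chunk-1 appears in both modes, so it receives both weighted contributions.
        System.out.println(combine(kw, vec, 0.6, 0.4));
    }
}
```

Normalizing before combining matters because Lucene BM25 scores and cosine similarities live on different scales; without it the weights would be meaningless.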