
Implemented Features

This document describes all features that have been implemented, organized by epic. Each section covers the key classes, configuration, and testing status.

Status: 12 of 52 user stories fully completed (98 of 211 story points).


EPIC-00: Project Infrastructure Setup

US-00-01: Maven Project Setup (Done)

Establishes the project foundation with all core dependencies and build configuration.

  • Backend: Single-module Maven project with Quarkus 3.30.2, Java 22
  • Frontend: Angular 20.3.0 with standalone components, Jest for testing
  • Dependencies: Quarkus BOM, LangChain4j BOM (1.9.1), Lucene 10.3.2, Neo4j 6.0.2, JGit 7.4.0, JavaParser 3.27.1, java-tree-sitter 0.25.6
  • Build Profiles: Standard JAR, native (GraalVM), JMH benchmarks
  • CI/CD: JaCoCo for coverage, Codecov integration
  • Tasks completed: 16 of 16

EPIC-01: Code Ingestion & Indexing

US-01-01: GitHub Repository Ingestion (Done)

Fetches and indexes source code from GitHub repositories.

Key Classes:

  • GitHubSourceControlClient – Implements the SourceControlClient interface for GitHub
  • GitHubApiClient – REST client for GitHub API v3 (metadata, rate limiting)
  • GitHubTokenProvider – Secure token management for PAT and GitHub App tokens
  • CompositeSourceControlClient – Routes to the correct provider based on URL

Capabilities:

  • Clones repositories via JGit with branch specification
  • Supports public and private repositories via token auth
  • Extracts files with metadata (path, size, modification time)
  • Filters binary files and respects .gitignore patterns
  • Emits progress events via Mutiny Multi at key milestones
  • Handles rate limiting with exponential backoff
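The binary-file filtering step can be sketched as a simple content probe. This is a minimal illustration under assumptions, not the actual client code: the class name BinaryFileFilter is hypothetical, and the heuristic (scan a leading window for a NUL byte, roughly what Git itself does) is one plausible implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of binary-file filtering during ingestion.
// Heuristic: treat a file as binary if a NUL byte appears in the
// first 8000 bytes (similar to Git's own check).
public class BinaryFileFilter {

    private static final int PROBE_WINDOW = 8000;

    public static boolean isBinary(byte[] content) {
        int limit = Math.min(content.length, PROBE_WINDOW);
        for (int i = 0; i < limit; i++) {
            if (content[i] == 0) {
                return true; // NUL byte: almost certainly not source text
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = "public class Hello {}".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {0x50, 0x4B, 0x03, 0x04, 0x00, 0x01}; // contains a NUL
        System.out.println(isBinary(text));   // false
        System.out.println(isBinary(binary)); // true
    }
}
```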

Configuration:

github-api/mp-rest/url=https://api.github.com
megabrain.github.token=${GITHUB_TOKEN}

Tests: 10 unit tests + 1 integration test (real GitHub API), >80% coverage


US-01-04: Java Parsing with JavaParser (Done)

Parses Java source files into structured code entities using AST analysis.

Key Classes:

  • JavaParserService – Implements the CodeParser interface using JavaParser 3.27.1
  • JavaAstVisitor – Custom AST visitor extracting classes, methods, and fields

Capabilities:

  • Extracts classes with fully qualified names, modifiers, and package declarations
  • Extracts methods with signatures, parameters (with types), and return types
  • Extracts fields with types, modifiers, and initialization expressions
  • Handles inner classes, static nested classes, and anonymous classes with parent-child relationships
  • Creates TextChunk objects with metadata: language, entity_type, entity_name, source_file, line_range
  • Robust error handling for malformed Java files with partial parsing recovery
  • Performance: >10,000 LOC per minute
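The TextChunk metadata listed above can be illustrated with a small record. This is a sketch only: the field layout follows the metadata keys named in this section, but the real TextChunk class in the codebase may be shaped differently.

```java
import java.util.Map;

// Illustrative sketch of the chunk + metadata pair produced by parsing;
// the actual TextChunk class may differ from this shape.
public record TextChunk(String content, Map<String, String> metadata) {

    static TextChunk forJavaMethod(String content, String className,
                                   String methodName, String file,
                                   int startLine, int endLine) {
        return new TextChunk(content, Map.of(
                "language", "java",
                "entity_type", "method",
                "entity_name", className + "#" + methodName,
                "source_file", file,
                "line_range", startLine + "-" + endLine));
    }

    public static void main(String[] args) {
        TextChunk chunk = forJavaMethod("public int size() { return n; }",
                "MyList", "size", "src/main/java/MyList.java", 42, 44);
        System.out.println(chunk.metadata().get("entity_name")); // MyList#size
    }
}
```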

Tests: Unit tests covering all entity types, inner classes, interfaces, enums, edge cases. >80% coverage. JMH benchmark.


US-01-05: Tree-sitter Multi-Language Parsing (Done)

Provides code parsing for 14+ programming languages via Tree-sitter.

Key Classes:

  • TreeSitterParser – Abstract base class implementing CodeParser
  • Language-specific parsers for C, C++, Python, JavaScript, TypeScript, Go, Rust, Kotlin, Ruby, Scala, Swift, PHP, C#, and Java

Capabilities:

  • Extracts functions, classes, and methods from each language's AST
  • Handles language-specific constructs (decorators, async functions, type hints, generics)
  • File extension routing via ParserRegistry
  • Dynamic grammar loading from cached binaries
  • Consistent TextChunk metadata across all languages
  • Performance benchmarked per language
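The extension-to-parser routing can be sketched as a map-based lookup. A minimal sketch under assumptions: the class name ParserRegistrySketch and string-valued parser names are illustrative; the real ParserRegistry also handles grammar loading and health checks.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of extension-to-parser routing; the real ParserRegistry
// returns parser instances and supports dynamic registration at runtime.
public class ParserRegistrySketch {

    private final Map<String, String> parsersByExtension = new ConcurrentHashMap<>();

    public void register(String parserName, String... extensions) {
        for (String ext : extensions) {
            parsersByExtension.put(ext.toLowerCase(Locale.ROOT), parserName);
        }
    }

    public Optional<String> parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) return Optional.empty(); // no extension, no parser
        String ext = fileName.substring(dot).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(parsersByExtension.get(ext));
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register("python", ".py");
        registry.register("typescript", ".ts", ".tsx");
        System.out.println(registry.parserFor("app/Main.TSX")); // Optional[typescript]
    }
}
```

A ConcurrentHashMap keeps lookups cheap and thread-safe, which is consistent with the fast-lookup and runtime-registration capabilities listed above.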

Supported Languages (14 total):

Language     Extensions           Key Constructs
Java         .java                classes, methods, fields, interfaces, enums
Python       .py                  functions, classes, async functions, decorators
JavaScript   .js, .jsx            functions, classes, arrow functions, exports
TypeScript   .ts, .tsx            interfaces, type aliases, decorators, generics
C            .c, .h               functions, structs, typedefs, macros
C++          .cpp, .hpp, .cc      classes, templates, namespaces
Go           .go                  functions, structs, interfaces, methods
Rust         .rs                  functions, structs, traits, impl blocks
Kotlin       .kt                  classes, functions, data classes, objects
Ruby         .rb                  classes, modules, methods
Scala        .scala               classes, traits, objects, case classes
Swift        .swift               classes, structs, protocols, functions
PHP          .php                 classes, functions, interfaces, traits
C#           .cs                  classes, interfaces, structs, methods

Tests: Unit tests per language, performance benchmark. >80% coverage.


US-01-08: Dynamic Grammar Management (Done)

Manages Tree-sitter grammar lifecycle: downloading, caching, versioning, and health checks.

Key Classes:

  • ParserRegistry – Central registry mapping file extensions to parser instances
  • GrammarManager – Grammar download, caching, and version management
  • GrammarConfig – Grammar configuration (cache directory, version pins)
  • GrammarSpec – Grammar specification (name, version, platform binary)
  • GrammarHealthCheck – Quarkus readiness probe for grammar status

Capabilities:

  • Downloads grammar binaries from GitHub releases on demand
  • Caches grammars locally with version tracking
  • Supports version pinning via configuration
  • Rollback/downgrade to previous grammar versions
  • Health check verifies all configured grammars are loaded
  • Fast parser lookup (<10ms) via extension-to-parser mapping
  • Supports dynamic registration of new parsers at runtime
  • Cold start <500ms requirement met
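The version-pinning behavior can be sketched as a small resolution rule: use the pinned version when one is configured, otherwise fall back to the newest available release. The class and method names here are hypothetical; the real GrammarManager also deals with platform binaries and caching.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of grammar version resolution with pinning support.
public class GrammarVersionResolver {

    public static String resolve(Optional<String> pinnedVersion,
                                 List<String> availableVersions) {
        if (pinnedVersion.isPresent()) {
            String pin = pinnedVersion.get();
            if (!availableVersions.contains(pin)) {
                throw new IllegalStateException(
                        "Pinned grammar version not available: " + pin);
            }
            return pin; // configuration wins
        }
        // Versions are assumed sorted oldest-to-newest by the caller.
        return availableVersions.get(availableVersions.size() - 1);
    }

    public static void main(String[] args) {
        List<String> releases = List.of("0.23.0", "0.24.0", "0.25.6");
        System.out.println(resolve(Optional.empty(), releases));      // 0.25.6
        System.out.println(resolve(Optional.of("0.24.0"), releases)); // 0.24.0
    }
}
```

Failing loudly on an unavailable pin (rather than silently falling back) matches the health-check behavior described above: a misconfigured grammar should surface in the readiness probe.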

Configuration:

megabrain.grammar.cache.directory=~/.megabrain/grammars

Tests: 240+ tests across all grammar management components, >80% coverage.


EPIC-02: Hybrid Search & Retrieval (ALL 6 stories complete)

US-02-01: Lucene Keyword Search (Done)

Full-text code search powered by Apache Lucene 10.3.2.

Key Classes:

  • LuceneIndexService – Index management (create, write, search, close)
  • CodeAwareAnalyzer – Custom analyzer: StandardTokenizer + WordDelimiterGraphFilter + CodeSplittingFilter + LowerCaseFilter + StopFilter
  • QueryParserService – Multi-field query parser with full Lucene syntax
  • DocumentMapper – Converts TextChunk to Lucene Document with facet fields
  • LuceneSchema – Field definitions (entity_name, content, doc_summary, language, repository, entity_type, file_path)

Capabilities:

  • Code-aware tokenization: splits both getUserName and get_user_name into [get, user, name]
  • Full Lucene query syntax: AND/OR/NOT, "phrase queries", wildcards (*, ?), field:value
  • Batch indexing with configurable batch size (default 1000)
  • Document updates via updateDocument() and deletions by ID
  • Graceful error handling for malformed queries with multiple fallback strategies
  • Configurable field boosts applied at query time
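The code-aware token splitting above can be illustrated in plain Java. This is a sketch of the behavior, not the production analyzer (which composes Lucene's WordDelimiterGraphFilter and the custom CodeSplittingFilter): it breaks camelCase and snake_case identifiers into lowercase word tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of the camelCase/snake_case splitting performed by the
// code-aware analyzer; the real pipeline is a Lucene TokenFilter chain.
public class CodeTokenSplitter {

    public static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : identifier.toCharArray()) {
            if (c == '_' || c == '-') {            // snake/kebab boundary
                flush(tokens, current);
            } else if (Character.isUpperCase(c)) { // camelCase boundary
                flush(tokens, current);
                current.append(c);
            } else {
                current.append(c);
            }
        }
        flush(tokens, current);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString().toLowerCase(Locale.ROOT));
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("getUserName"));   // [get, user, name]
        System.out.println(split("get_user_name")); // [get, user, name]
    }
}
```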

Configuration:

megabrain.index.directory=./data/index
megabrain.index.batch.size=1000

Performance: <500ms search latency at 95th percentile with 100K chunks (verified via JMH benchmark).

Tests: 25+ unit tests, JMH benchmark. >80% coverage.


US-02-02: Vector Similarity Search (Done)

Semantic search using vector embeddings stored in PostgreSQL with pgvector.

Key Classes:

  • VectorStore – Backend-agnostic vector storage interface
  • PgVectorStore – pgvector implementation with cosine similarity
  • EmbeddingService – Embedding generation for code chunks
  • EmbeddingModelService – Model loading and management

Capabilities:

  • pgvector extension setup via Flyway migrations (V1-V3)
  • Embedding generation for code chunks (single and batch)
  • Cosine similarity search using the pgvector <=> operator
  • HNSW index for fast approximate nearest neighbor search (M=16, ef_construction=64)
  • Batch embedding during indexing with graceful degradation
  • Vector dimension validation
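A cosine-similarity query against pgvector typically looks like the sketch below. The table and column names (code_chunks, embedding) are assumptions for illustration, not necessarily the schema created by the Flyway migrations; the <=> operator itself is pgvector's cosine-distance operator as noted above.

```java
// Illustrative pgvector query construction; table/column names are assumed.
public class PgVectorQuerySketch {

    // pgvector's <=> operator returns cosine distance;
    // 1 - distance gives cosine similarity.
    static String similarityQuery(int limit) {
        return """
                SELECT id, content, 1 - (embedding <=> ?::vector) AS similarity
                FROM code_chunks
                ORDER BY embedding <=> ?::vector
                LIMIT %d
                """.formatted(limit);
    }

    public static void main(String[] args) {
        System.out.println(similarityQuery(10));
    }
}
```

Ordering by the raw distance (ascending) rather than the derived similarity lets PostgreSQL use the HNSW index for the nearest-neighbor scan.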

Database Migrations:

  • V1.0.0__enable_pgvector_extension.sql
  • V2.0.0__create_vector_storage_schema.sql
  • V3.0.0__add_vector_indexes.sql

Configuration:

quarkus.datasource.db-kind=postgresql
quarkus.datasource.jdbc.url=jdbc:postgresql://localhost:5432/megabrain_db
megabrain.vector.ef-search=40

Performance: <500ms search latency at 95th percentile with 100K vectors.

Tests: Unit tests for embedding, storage, search. JMH benchmark. >80% coverage.


US-02-03: Hybrid Ranking Algorithm (Done)

Combines keyword and vector search results with configurable weights.

Key Classes:

  • HybridScorer – Weighted score combination (default: 0.6 keyword, 0.4 vector)
  • VectorScoreNormalizer – Min-max normalization to the 0-1 range
  • ResultMerger – Merges and deduplicates results from both search modes
  • ABTestHarness – Framework for comparing ranking approaches (precision@k, recall)
  • SearchMode – Enum: HYBRID, KEYWORD, VECTOR
  • HybridIndexService – Coordinates Lucene and vector search execution

Capabilities:

  • Min-max score normalization for fair cross-system comparison
  • Weighted combination: final_score = (kw_weight * lucene) + (vec_weight * vector)
  • Deduplication by chunk ID (document_id or file_path + entity_name)
  • Per-request weight override supported
  • Three search modes: HYBRID (both), KEYWORD (Lucene only), VECTOR (pgvector only)
  • A/B test harness computes precision@5, precision@10, and recall across modes
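The normalization and weighted-combination steps above can be sketched directly. This is a simplified stand-in for HybridScorer and VectorScoreNormalizer, showing only the scoring arithmetic, not result merging or deduplication.

```java
// Sketch of hybrid scoring: min-max normalize each score set to [0, 1],
// then combine with configurable weights (default 0.6 keyword, 0.4 vector).
public class HybridScoringSketch {

    static double[] minMaxNormalize(double[] scores) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Degenerate case: all scores equal -> everything maps to 1.0
            out[i] = range == 0 ? 1.0 : (scores[i] - min) / range;
        }
        return out;
    }

    static double combine(double keywordScore, double vectorScore,
                          double keywordWeight, double vectorWeight) {
        return keywordWeight * keywordScore + vectorWeight * vectorScore;
    }

    public static void main(String[] args) {
        double[] lucene = minMaxNormalize(new double[]{12.4, 3.1, 7.0});
        double[] vector = minMaxNormalize(new double[]{0.91, 0.42, 0.77});
        // Top document scored highest in both systems.
        System.out.println(combine(lucene[0], vector[0], 0.6, 0.4));
    }
}
```

Normalizing first is what makes the weighted sum meaningful: raw Lucene BM25 scores and cosine similarities live on different scales and cannot be combined directly.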

Configuration:

megabrain.search.hybrid.keyword-weight=0.6
megabrain.search.hybrid.vector-weight=0.4

Tests: 23+ tests for HybridScorer, 14 for ResultMerger, 12 for VectorScoreNormalizer, 22 for ABTestHarness, 11 for search modes. >80% coverage.


US-02-04: Metadata Facet Filtering (Done)

Filter search results by language, repository, file path, and entity type with facet aggregation.

Key Classes:

  • SearchFilters – Filter data record (language, repository, file_path, entity_type)
  • LuceneFilterQueryBuilder – Builds Lucene filter queries (TermQuery for exact matches, PrefixQuery for paths; OR within a dimension, AND across dimensions)
  • FacetValue – DTO for facet value and count
  • SortedSetDocValuesFacetCounts – Lucene facet aggregation

Capabilities:

  • Four filter dimensions: language, repository, file_path (prefix match), entity_type
  • Multiple values per filter (OR logic within a dimension)
  • Combined filters (AND logic across dimensions)
  • Filters applied as BooleanClause.Occur.FILTER (before scoring, for efficiency)
  • Filter query caching via ConcurrentHashMap (<1ms for cached filters vs ~150ms first build)
  • Facet aggregation returns available values with counts for matching documents
  • Always returns a consistent map structure with all facet keys
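The filter semantics (OR within a dimension, AND across dimensions) can be modeled as a plain predicate. This is a behavioral sketch only: the real LuceneFilterQueryBuilder expresses the same logic as Lucene TermQuery/PrefixQuery clauses under BooleanClause.Occur.FILTER, and the single-valued docFields map here is a simplification.

```java
import java.util.Map;
import java.util.Set;

// Sketch of facet-filter matching semantics: AND across dimensions,
// OR within each dimension's allowed value set.
public class FacetFilterSketch {

    /**
     * docFields: dimension -> value for one document (simplified to one value).
     * filters:   dimension -> set of allowed values.
     */
    static boolean matches(Map<String, String> docFields,
                           Map<String, Set<String>> filters) {
        for (var filter : filters.entrySet()) {          // AND across dimensions
            String value = docFields.get(filter.getKey());
            if (value == null || !filter.getValue().contains(value)) {
                return false;                            // OR within a dimension
            }
        }
        return true; // no filters, or all dimensions satisfied
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of("language", "java", "entity_type", "method");
        Map<String, Set<String>> filters = Map.of(
                "language", Set.of("java", "kotlin"),
                "entity_type", Set.of("method"));
        System.out.println(matches(doc, filters)); // true
    }
}
```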

Performance: <50ms filter overhead (verified via unit tests).

Tests: 17 tests for filter building, multiple integration tests, API-level tests. >80% coverage.


US-02-05: Relevance Tuning Configuration (Done)

Configurable field boosts and field match explanation.

Key Classes:

  • BoostConfiguration – Holds boost values loaded from application.properties
  • FieldMatchInfo – Per-field match scores from the Lucene Explanation API

Capabilities:

  • Configurable boost values: entity_name (3.0x), doc_summary (2.0x), content (1.0x)
  • Boosts applied at query time via BoostQuery wrapping (no reindexing needed)
  • Recursive boost application to BooleanQuery, TermQuery, PhraseQuery, WildcardQuery
  • Optional field match explanation: shows which fields matched, with per-field scores
  • Enabled via the include_field_match=true query parameter

Configuration:

megabrain.search.boost.entity-name=3.0
megabrain.search.boost.doc-summary=2.0
megabrain.search.boost.content=1.0

Tests: Configuration loading, boost application, ranking verification with inverted boosts. >80% coverage.


US-02-06: Transitive Search Integration (Done)

Graph-powered transitive search for inheritance and usage relationships.

Key Classes:

  • SearchOrchestrator – Coordinates hybrid search and graph queries in parallel
  • GraphQueryService – Graph query interface
  • GraphQueryServiceStub – Implementation delegating to closure queries
  • ImplementsClosureQuery / Neo4jImplementsClosureQuery – Transitive implements traversal via Neo4j Cypher
  • ExtendsClosureQuery / Neo4jExtendsClosureQuery – Transitive extends traversal
  • StructuralQueryParser – Parses implements:X, extends:X, usages:X syntax

Capabilities:

  • Structural query syntax: implements:InterfaceName, extends:ClassName, usages:TypeName
  • Transitive closure via Neo4j Cypher: MATCH (i)<-[:IMPLEMENTS|EXTENDS*1..depth]-(c) RETURN DISTINCT c
  • Configurable depth limit (default 5, max 10, per-request override)
  • Parallel execution: hybrid search runs alongside graph queries
  • Graph results merged with hybrid results (deduplicated, sorted by score)
  • Results annotated with an is_transitive flag and a relationship_path array
  • usages:TypeName combines both implements and extends closures for polymorphic call site coverage
  • Graceful degradation: returns empty results when megabrain.neo4j.uri is unset
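Building the closure query with a clamped depth can be sketched as follows. The Cypher pattern and the default/max depths come from this section; everything else (class name, parameter binding via $name, the node property used for matching) is an assumption for illustration.

```java
// Hypothetical sketch of transitive-closure query construction with
// depth clamping (default 5, max 10, per-request override).
public class ClosureQuerySketch {

    static final int DEFAULT_DEPTH = 5;
    static final int MAX_DEPTH = 10;

    static String implementsClosure(Integer requestedDepth) {
        int depth = requestedDepth == null
                ? DEFAULT_DEPTH
                : Math.min(Math.max(requestedDepth, 1), MAX_DEPTH);
        // Variable-length pattern bounds cannot be Cypher parameters,
        // so the validated depth is interpolated; the type name is
        // bound as the $name parameter instead.
        return "MATCH (i {name: $name})<-[:IMPLEMENTS|EXTENDS*1.." + depth
                + "]-(c) RETURN DISTINCT c";
    }

    public static void main(String[] args) {
        System.out.println(implementsClosure(null)); // uses default depth 5
        System.out.println(implementsClosure(99));   // clamped to max depth 10
    }
}
```

Clamping before interpolation is what keeps the per-request override safe: the depth never exceeds the configured maximum, and only validated integers reach the query string.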

Configuration:

megabrain.search.transitive.default-depth=5
megabrain.search.transitive.max-depth=10
megabrain.neo4j.uri=bolt://localhost:7687

Tests: Comprehensive tests for structural parsing, closure queries, orchestration, depth validation, result marking. >80% coverage.


EPIC-03: RAG Answer Generation (partial)

US-03-01: Ollama Local LLM Integration (Partial: T1-T3 of 6 tasks complete)

Foundation for local LLM integration via Ollama.

Key Classes:

  • LLMClient – Unified interface for LLM providers (generate, isAvailable, generate with model override)
  • OllamaLLMClient – Ollama implementation wrapping LangChain4j OllamaChatModel
  • OllamaConfiguration – Config mapping for base URL, model, timeout, and model availability cache
  • OllamaModelAvailabilityService – Model availability check via the Ollama /api/tags endpoint, with caching

Completed:

  • LangChain4j Ollama dependency integrated (BOM 1.9.1)
  • LLMClient interface defined for provider abstraction
  • OllamaLLMClient implementation with configuration loading
  • Model selection with per-request override (T3): generate(msg, modelOverride), model validation, cached availability
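The provider abstraction can be sketched as an interface. The method shapes below are inferred from the description (generate, isAvailable, per-request model override) and may differ from the actual LLMClient; the echo() factory is purely illustrative.

```java
import java.util.Optional;

// Sketch of the LLMClient provider abstraction; method signatures are
// inferred from the documentation, not copied from the codebase.
public interface LLMClient {

    String generate(String prompt);

    // Per-request model override (T3); empty means "use the configured model".
    String generate(String prompt, Optional<String> modelOverride);

    boolean isAvailable();

    // Trivial in-memory implementation, handy for tests.
    static LLMClient echo() {
        return new LLMClient() {
            public String generate(String prompt) {
                return "echo: " + prompt;
            }
            public String generate(String prompt, Optional<String> modelOverride) {
                return generate(prompt)
                        + modelOverride.map(m -> " [" + m + "]").orElse("");
            }
            public boolean isAvailable() {
                return true;
            }
        };
    }

    static void main(String[] args) {
        LLMClient client = echo();
        System.out.println(client.generate("hi", Optional.of("codellama")));
        // echo: hi [codellama]
    }
}
```

Keeping the interface free of Ollama types is what allows additional providers (and the T4/T5 retry and health-check work) to be added behind the same abstraction.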

Not Yet Implemented:

  • Endpoint configuration with retry logic (T4)
  • Health check for Ollama availability (T5)
  • Integration tests with real Ollama (T6)

Configuration:

megabrain.llm.ollama.base-url=http://localhost:11434
megabrain.llm.ollama.model=codellama
megabrain.llm.ollama.timeout-seconds=60
megabrain.llm.ollama.model-availability-cache-seconds=60

Tests: Unit tests for OllamaLLMClient, LLMClient interface, OllamaConfiguration.

RAG REST (US-04-03): AC6 (first token within 2s) is validated by an integration test (RagStreamingIntegrationTestIT.rag_streamFirstToken_within2Seconds, tagged performance) that uses a mocked RAG service; production compliance is verified by demo or APM.


EPIC-04: REST API & CLI

US-04-04: CLI Ingest Command (Done)

CLI command structure and options for ingesting repositories from the command line.

Key Classes:

  • MegaBrainCommand – Top-level CLI entry point (Quarkus Picocli @TopCommand) with subcommands
  • IngestCommand – ingest subcommand with --source, --repo, --branch, --token, --incremental options; a CDI bean with constructor-injected IngestionService

Completed (T1):

  • Picocli integration in package io.megabrain.cli
  • Command name ingest; megabrain ingest --help shows usage
  • Unit tests for command name, help output, and minimal parse (no mocks)

Completed (T2):

  • All five options added: --source (required), --repo (required), --branch (default: main), --token (optional), --incremental (default: false)
  • Validation via IngestionResource.SourceType.fromString(); invalid source or blank repo throws ParameterException with clear messages; the token is never logged
  • Unit tests for help output, defaults, valid sources, invalid source, token parsing, and run() after valid parse

Completed (T3):

  • Progress display: subscribes to IngestionService.ingestRepository(repo) / ingestRepositoryIncrementally(repo); single-line updates on a TTY, line-by-line otherwise; message length capped; on failure logs a short message (no token) and exits non-zero
  • Unit tests for full ingest progress output, incremental progress output, full vs incremental method calls, and stream failure

Completed (T4):

  • Exit codes: 0 = success, 1 = execution/ingestion failure, 2 = invalid arguments; documented in the CLI reference and Javadoc; no System.exit(); Picocli exitCodeOnInvalidInput / exitCodeOnExecutionException set on IngestCommand; tests assert codes via CommandLine.execute() on IngestCommand

Completed (T5):

  • --verbose option: when set, enables DEBUG for the io.megabrain logger (via JBoss LogManager), fuller progress output (no message truncation), and, on ingestion failure, a full stack trace via LOG.error("Ingestion failed", err); otherwise message-only; the single source of truth is the verbose field

Completed (T6):

  • Unit tests for option parsing, validation, progress display, exit codes, and help text using Picocli CommandLine.execute() and a mocked IngestionService
  • Coverage includes: token never in output, repo trim, Picocli exit-code contract (invalid 2, execution 1), branch default in help, non-verbose truncation, null progress message, missing --repo exit 2, and MegaBrainCommand help
  • Package io.megabrain.cli line and branch coverage >80% (JaCoCo)

US-04-05: CLI Search Command (Done)

CLI command to search the MegaBrain index from the command line.

Key Classes:

  • SearchCommand – Picocli search subcommand with a required query parameter; integrated in MegaBrainCommand.subcommands

Completed (T1):

  • SearchCommand class in package io.megabrain.cli with @Command(name = "search"), @Parameters(index = "0") for the query, mixinStandardHelpOptions = true, and exit codes 2 (invalid input) and 1 (execution exception)
  • Validation: non-blank query required; throws ParameterException otherwise
  • Minimal run() behavior: writes to stdout that the query was received (stub for T1; no SearchOrchestrator injection yet)
  • Help text: megabrain search --help shows description and usage
  • Unit tests: SearchCommandTest (plain JUnit 5) for command name, --help output, execute with one query arg, and blank query exit 2

Completed (T2):

  • Filter and output options: --language, --repo, --type (entity_type), --limit (default 10), --json (default false), --quiet (default false); all options appear in help with clear descriptions
  • Validation in run(): after the query check, each --language is validated against the supported set (java, python, javascript, typescript, go, rust, kotlin, ruby, scala, swift, php, c, cpp) and each --type against (class, method, function, field, interface, enum, module); --limit must be 1-100; invalid values throw ParameterException(spec.commandLine(), "message") with the allowed values in the message
  • SearchRequest built in run() from validated options: setQuery, addLanguage/addRepository/addEntityType for each list, setLimit; --json and --quiet kept as fields for T3/T5; no new DTOs
  • Unit tests: option parsing, defaults when only the query is given, multi-value --language/--repo/--type, valid language/type exit 0, invalid --language/--type exit 2 with stderr message, --help contains all option names, --limit 1 and 100 valid, --limit 0 or out of range exit 2, missing query exit 2; aim >80% on SearchCommand

Completed (T3):

  • Terminal formatting: SearchResultFormatter interface in io.megabrain.cli with format(SearchResponse) and format(SearchResponse, boolean quiet); HumanReadableSearchResultFormatter shows File, Entity, Score, snippet, and a --- separator per result; truncation by line count (max 15 lines) and line length (max 120 chars); null-safe placeholders; empty results → "No results."; optional header (query, total, tookMs); quiet mode prints one line per result (path + entity)
  • SearchCommand integration: injects SearchOrchestrator, SearchResultFormatter, and config (facetLimit, transitiveDefaultDepth, transitiveMaxDepth); run() builds a SearchRequest, calls orchestrate(..., SearchMode.HYBRID, ...).await().indefinitely(), and converts OrchestratorResult to SearchResponse via SearchResultMapper.toSearchResult() (a shared helper in io.megabrain.api, used by REST and CLI); if not --json, prints the formatter output and flushes; handles Uni failure with a user-facing ExecutionException
  • SearchResultMapper: in io.megabrain.api; maps MergedResult to the SearchResult DTO; used by SearchResource and the CLI
  • Tests: SearchResultFormatterTest (empty → "No results.", single/multiple layout, long snippet truncated, null/blank without NPE, quiet format); SearchCommandTest (mocked orchestrator, stdout contains the formatted result when not --json, empty results print "No results.")

Completed (T4):

  • Syntax highlighting: SyntaxHighlighter interface and CliSyntaxHighlighter implementation (keyword/pattern-based) using Jansi for ANSI codes; supports Java, Python, JavaScript, and TypeScript, with other languages falling back to plain text; HumanReadableSearchResultFormatter injects the highlighter and uses it in format(response, quiet, useColor); on highlighter failure it logs at debug and appends the plain snippet
  • Color control: --no-color option (default false); useColor resolves to false if --no-color is set, if the NO_COLOR env variable is set, or if output is not a TTY (System.console() == null), and true otherwise; the formatter highlights snippets only when useColor is true
  • Tests: CliSyntaxHighlighterTest (color on → ANSI, color off → no ANSI, multiple languages, unknown/null/blank language without exception, empty snippet); SearchResultFormatterTest (useColor true → snippet contains ANSI, useColor false → no ANSI); SearchCommandTest (--no-color parsed and useColor false passed to the formatter; output with --no-color has no ANSI)

Completed (T5):

  • JSON output: when --json is set, output is written as JSON (no formatter); an injected ObjectMapper (Quarkus-provided) serializes SearchResponse; the full JSON includes results, total, page, size, query, took_ms, and facets; with --quiet only response.getResults() is serialized (the results array); pretty-printing uses writerWithDefaultPrettyPrinter() when on a TTY and neither --quiet nor --no-color is set, compact otherwise; output is written to spec.commandLine().getOut() and flushed; no new DTOs (SearchResponse/SearchResult are Jackson-friendly)
  • Tests: SearchCommandTest parses stdout with --json as a JSON object and asserts the root has results, total, page, size, query, took_ms, and facets, and that a result has source_file, entity_name, and score; with --json --quiet it parses a JSON array and asserts length and element fields; empty results yield results=[] and total=0 (full) or [] (quiet); uses ObjectMapper.readValue(stdout.trim(), ...) with no exact string assertions

Completed (T6):

  • Unit tests for SearchCommand covering option parsing, validation, output formatting, JSON mode, and help text using Picocli CommandLine.execute() and a mocked SearchOrchestrator
  • Council-recommended tests: orchestrator failure (exit 1, stderr "Search failed"), JSON with a null ObjectMapper (exit 1), JSON serialization failure (mocked IOException), NO_COLOR env (useColor false, via SystemStubs), blank/uppercase filter normalization (--language " " skipped, JAVA → java), --quiet human-readable output (formatter quiet true, one line per result), JSON with non-empty facets
  • Additional: SearchResultFormatterTest, CliSyntaxHighlighterTest
  • Package io.megabrain.cli line and instruction coverage >80% (JaCoCo); SearchCommand 94% instructions, 75% branches