Implemented Features
This document describes all features that have been implemented, organized by epic. Each section covers the key classes, configuration, and testing status.
Status: 12 of 52 user stories fully completed (98 of 211 story points).
EPIC-00: Project Infrastructure Setup
US-00-01: Maven Project Setup (Done)
The project foundation with all core dependencies and build configuration.
- Backend: Single-module Maven project with Quarkus 3.30.2, Java 22
- Frontend: Angular 20.3.0 with standalone components, Jest for testing
- Dependencies: Quarkus BOM, LangChain4j BOM (1.9.1), Lucene 10.3.2, Neo4j 6.0.2, JGit 7.4.0, JavaParser 3.27.1, java-tree-sitter 0.25.6
- Build Profiles: Standard JAR, native (GraalVM), JMH benchmarks
- CI/CD: JaCoCo for coverage, Codecov integration
- Tasks completed: 16 of 16
EPIC-01: Code Ingestion & Indexing
US-01-01: GitHub Repository Ingestion (Done)
Fetches and indexes source code from GitHub repositories.
Key Classes:
- GitHubSourceControlClient - Implements SourceControlClient interface for GitHub
- GitHubApiClient - REST client for GitHub API v3 (metadata, rate limiting)
- GitHubTokenProvider - Secure token management for PAT and GitHub App tokens
- CompositeSourceControlClient - Routes to correct provider based on URL
Capabilities:
- Clones repositories via JGit with branch specification
- Supports public and private repositories via token auth
- Extracts files with metadata (path, size, modification time)
- Filters binary files and respects .gitignore patterns
- Emits progress events via Mutiny Multi at key milestones
- Handles rate limiting with exponential backoff
Configuration:
Tests: 10 unit tests + 1 integration test (real GitHub API), >80% coverage
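The exponential backoff mentioned under Capabilities can be sketched as follows. This is a minimal illustration, not the project's retry code: the class name `BackoffSketch`, the doubling factor, and the cap parameter are assumptions made for the example.

```java
import java.time.Duration;

/** Hypothetical helper illustrating capped exponential backoff. */
final class BackoffSketch {

    /** Delay before retry number {@code attempt} (0-based): base * 2^attempt, capped at max. */
    static Duration delayFor(int attempt, Duration base, Duration max) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Limit the shift so the factor itself cannot overflow a long.
        long factor = 1L << Math.min(attempt, 30);
        long millis;
        try {
            millis = Math.multiplyExact(base.toMillis(), factor);
        } catch (ArithmeticException overflow) {
            millis = Long.MAX_VALUE;
        }
        return Duration.ofMillis(Math.min(millis, max.toMillis()));
    }
}
```

With a 1 s base and a 60 s cap, the delays grow 1 s, 2 s, 4 s, 8 s and then stay at 60 s once the doubled value exceeds the cap.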
US-01-04: Java Parsing with JavaParser (Done)
Parses Java source files into structured code entities using AST analysis.
Key Classes:
- JavaParserService - Implements CodeParser interface using JavaParser 3.27.1
- JavaAstVisitor - Custom AST visitor extracting classes, methods, and fields
Capabilities:
- Extracts classes with fully qualified names, modifiers, and package declarations
- Extracts methods with signatures, parameters (with types), and return types
- Extracts fields with types, modifiers, and initialization expressions
- Handles inner classes, static nested classes, and anonymous classes with parent-child relationships
- Creates TextChunk objects with metadata: language, entity_type, entity_name, source_file, line_range
- Robust error handling for malformed Java files with partial parsing recovery
- Performance: >10,000 LOC per minute
Tests: Unit tests covering all entity types, inner classes, interfaces, enums, edge cases. >80% coverage. JMH benchmark.
US-01-05: Tree-sitter Multi-Language Parsing (Done)
Provides code parsing for 14+ programming languages via Tree-sitter.
Key Classes:
- TreeSitterParser - Abstract base class implementing CodeParser
- Language-specific parsers for: C, C++, Python, JavaScript, TypeScript, Go, Rust, Kotlin, Ruby, Scala, Swift, PHP, C#, Java
Capabilities:
- Extracts functions, classes, methods from each language's AST
- Handles language-specific constructs (decorators, async functions, type hints, generics)
- File extension routing via ParserRegistry
- Dynamic grammar loading from cached binaries
- Consistent TextChunk metadata across all languages
- Performance benchmarked per language
Supported Languages (14 listed):
| Language | Extensions | Key Constructs |
|---|---|---|
| Java | .java | classes, methods, fields, interfaces, enums |
| Python | .py | functions, classes, async functions, decorators |
| JavaScript | .js, .jsx | functions, classes, arrow functions, exports |
| TypeScript | .ts, .tsx | interfaces, type aliases, decorators, generics |
| C | .c, .h | functions, structs, typedefs, macros |
| C++ | .cpp, .hpp, .cc | classes, templates, namespaces |
| Go | .go | functions, structs, interfaces, methods |
| Rust | .rs | functions, structs, traits, impl blocks |
| Kotlin | .kt | classes, functions, data classes, objects |
| Ruby | .rb | classes, modules, methods |
| Scala | .scala | classes, traits, objects, case classes |
| Swift | .swift | classes, structs, protocols, functions |
| PHP | .php | classes, functions, interfaces, traits |
| C# | .cs | classes, interfaces, structs, methods |
Tests: Unit tests per language, performance benchmark. >80% coverage.
US-01-08: Dynamic Grammar Management (Done)
Manages Tree-sitter grammar lifecycle: downloading, caching, versioning, and health checks.
Key Classes:
- ParserRegistry - Central registry mapping file extensions to parser instances
- GrammarManager - Grammar download, caching, and version management
- GrammarConfig - Grammar configuration (cache directory, version pins)
- GrammarSpec - Grammar specification (name, version, platform binary)
- GrammarHealthCheck - Quarkus readiness probe for grammar status
Capabilities:
- Downloads grammar binaries from GitHub releases on demand
- Caches grammars locally with version tracking
- Supports version pinning via configuration
- Rollback/downgrade to previous grammar versions
- Health check verifies all configured grammars are loaded
- Fast parser lookup (<10ms) via extension-to-parser mapping
- Supports dynamic registration of new parsers at runtime
- Cold start <500ms requirement met
Configuration:
Tests: 240+ tests across all grammar management components, >80% coverage.
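The extension-to-parser mapping with runtime registration can be pictured roughly as below. This is a sketch under stated assumptions, not the actual ParserRegistry: the class name `ExtensionRegistrySketch` is hypothetical, and parser names (plain strings) stand in for parser instances.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry: maps file extensions to a parser name. */
final class ExtensionRegistrySketch {

    // A concurrent map allows parsers to be registered while lookups are running.
    private final Map<String, String> parsersByExtension = new ConcurrentHashMap<>();

    /** Registers (or replaces) the parser for an extension such as ".py" or "py". */
    void register(String extension, String parserName) {
        parsersByExtension.put(normalize(extension), parserName);
    }

    /** Looks up the parser for a file name by its last extension, if any. */
    Optional<String> parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension, e.g. "Makefile"
        }
        return Optional.ofNullable(parsersByExtension.get(normalize(fileName.substring(dot))));
    }

    private static String normalize(String extension) {
        String ext = extension.toLowerCase(Locale.ROOT);
        return ext.startsWith(".") ? ext : "." + ext;
    }
}
```

A hash lookup keeps routing cost independent of the number of registered languages, which is consistent with the fast-lookup goal stated above.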
EPIC-02: Hybrid Search & Retrieval (ALL 6 stories complete)
US-02-01: Lucene Keyword Search (Done)
Full-text code search powered by Apache Lucene 10.3.2.
Key Classes:
- LuceneIndexService - Index management (create, write, search, close)
- CodeAwareAnalyzer - Custom analyzer: StandardTokenizer + WordDelimiterGraphFilter + CodeSplittingFilter + LowerCaseFilter + StopFilter
- QueryParserService - Multi-field query parser with full Lucene syntax
- DocumentMapper - Converts TextChunk to Lucene Document with facet fields
- LuceneSchema - Field definitions (entity_name, content, doc_summary, language, repository, entity_type, file_path)
Capabilities:
- Code-aware tokenization: splits getUserName into [get, user, name] and get_user_name into [get, user, name]
- Full Lucene query syntax: AND/OR/NOT, "phrase queries", wildcards (*, ?), field:value
- Batch indexing with configurable batch size (default 1000)
- Document updates via updateDocument() and deletions by ID
- Graceful error handling for malformed queries with multiple fallback strategies
- Configurable field boosts applied at query time
Configuration:
Performance: <500ms search latency at 95th percentile with 100K chunks (verified via JMH benchmark).
Tests: 25+ unit tests, JMH benchmark. >80% coverage.
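The code-aware splitting described above (getUserName and get_user_name both becoming [get, user, name]) can be illustrated with plain Java. This is only a sketch of the idea: the real CodeAwareAnalyzer is a Lucene filter chain, and the class name `IdentifierSplitSketch` and the regular expression are assumptions. Acronym runs such as HTTPServer are deliberately not handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of camelCase / snake_case identifier splitting. */
final class IdentifierSplitSketch {

    static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        // Split on underscores and at each lower-case-or-digit to upper-case boundary.
        for (String part : identifier.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                parts.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }
}
```

Indexing the sub-tokens means a query for "user" can match both naming styles without wildcards.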
US-02-02: Vector Similarity Search (Done)
Semantic search using vector embeddings stored in PostgreSQL with pgvector.
Key Classes:
- VectorStore - Backend-agnostic vector storage interface
- PgVectorStore - pgvector implementation with cosine similarity
- EmbeddingService - Embedding generation for code chunks
- EmbeddingModelService - Model loading and management
Capabilities:
- pgvector extension setup via Flyway migrations (V1-V3)
- Embedding generation for code chunks (single and batch)
- Cosine similarity search using pgvector <=> operator
- HNSW index for fast approximate nearest neighbor search (M=16, ef_construction=64)
- Batch embedding during indexing with graceful degradation
- Vector dimension validation
Database Migrations:
- V1.0.0__enable_pgvector_extension.sql
- V2.0.0__create_vector_storage_schema.sql
- V3.0.0__add_vector_indexes.sql
Configuration:
quarkus.datasource.db-kind=postgresql
quarkus.datasource.jdbc.url=jdbc:postgresql://localhost:5432/megabrain_db
megabrain.vector.ef-search=40
Performance: <500ms search latency at 95th percentile with 100K vectors.
Tests: Unit tests for embedding, storage, search. JMH benchmark. >80% coverage.
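For reference, pgvector's <=> operator returns cosine distance, which is 1 minus cosine similarity. The calculation can be sketched in plain Java as follows; the class name `CosineSketch` is hypothetical, and rejecting zero vectors is a choice made for this example rather than documented project behavior.

```java
/** Hypothetical illustration of the cosine distance computed by pgvector's <=> operator. */
final class CosineSketch {

    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            // Assumption for this sketch: similarity is undefined for zero vectors.
            throw new IllegalArgumentException("cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Ranges from 0 (same direction) to 2 (opposite direction). */
    static double cosineDistance(float[] a, float[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }
}
```

Smaller distances mean closer matches, so results are ordered ascending by distance.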
US-02-03: Hybrid Ranking Algorithm (Done)
Combines keyword and vector search results with configurable weights.
Key Classes:
- HybridScorer - Weighted score combination (default: 0.6 keyword, 0.4 vector)
- VectorScoreNormalizer - Min-max normalization to 0-1 range
- ResultMerger - Merges and deduplicates results from both search modes
- ABTestHarness - Framework for comparing ranking approaches (precision@k, recall)
- SearchMode - Enum: HYBRID, KEYWORD, VECTOR
- HybridIndexService - Coordinates Lucene and vector search execution
Capabilities:
- Min-max score normalization for fair cross-system comparison
- Weighted combination: final_score = (kw_weight * lucene) + (vec_weight * vector)
- Deduplication by chunk ID (document_id or file_path + entity_name)
- Per-request weight override supported
- Three search modes: HYBRID (both), KEYWORD (Lucene only), VECTOR (pgvector only)
- A/B test harness computes precision@5, precision@10, recall across modes
Configuration:
Tests: 23+ tests for HybridScorer, 14 for ResultMerger, 12 for VectorScoreNormalizer, 22 for ABTestHarness, 11 for search modes. >80% coverage.
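The normalization and weighting steps above can be sketched as below. This is an illustration of the formula, not the HybridScorer source: the class name `HybridScoreSketch` is hypothetical, and mapping a result set with identical scores to 1.0 is an assumption of this example.

```java
/** Hypothetical illustration of min-max normalization plus weighted score combination. */
final class HybridScoreSketch {

    /** Rescales scores into the 0-1 range. */
    static double[] minMaxNormalize(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] normalized = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Assumption: when all scores are equal, treat each as a full match.
            normalized[i] = range == 0.0 ? 1.0 : (scores[i] - min) / range;
        }
        return normalized;
    }

    /** final_score = (kw_weight * lucene) + (vec_weight * vector), on normalized inputs. */
    static double combine(double keywordScore, double vectorScore, double kwWeight, double vecWeight) {
        return kwWeight * keywordScore + vecWeight * vectorScore;
    }
}
```

With the default weights, a chunk with a normalized keyword score of 1.0 and a normalized vector score of 0.5 ends up at 0.6 * 1.0 + 0.4 * 0.5 = 0.8.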
US-02-04: Metadata Facet Filtering (Done)
Filter search results by language, repository, file path, and entity type with facet aggregation.
Key Classes:
- SearchFilters - Filter data record (language, repository, file_path, entity_type)
- LuceneFilterQueryBuilder - Builds Lucene filter queries (TermQuery for exact, PrefixQuery for paths; OR within dimension, AND across)
- FacetValue - DTO for facet value and count
- SortedSetDocValuesFacetCounts - Lucene facet aggregation
Capabilities:
- Four filter dimensions: language, repository, file_path (prefix match), entity_type
- Multiple values per filter (OR logic within dimension)
- Combined filters (AND logic across dimensions)
- Filters applied as BooleanClause.Occur.FILTER (before scoring for efficiency)
- Filter query caching via ConcurrentHashMap (<1ms for cached filters vs ~150ms first build)
- Facet aggregation returns available values with counts for matching documents
- Always returns consistent map structure with all facet keys
Performance: <50ms filter overhead (verified via unit tests).
Tests: 17 tests for filter building, multiple integration tests, API-level tests. >80% coverage.
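The filter semantics (OR within a dimension, AND across dimensions, prefix match for file_path) can be expressed without Lucene as a simple predicate. This sketch only models the logic: the class name `FilterSemanticsSketch` and the map-based document representation are assumptions, and the real implementation builds Lucene queries instead.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the filter logic: OR within a dimension, AND across dimensions. */
final class FilterSemanticsSketch {

    static boolean matches(Map<String, String> document, Map<String, Set<String>> filters) {
        for (Map.Entry<String, Set<String>> dimension : filters.entrySet()) {
            Set<String> allowed = dimension.getValue();
            if (allowed.isEmpty()) {
                continue; // an unset dimension does not restrict the result
            }
            String value = document.get(dimension.getKey());
            if (value == null) {
                return false;
            }
            boolean prefixMatch = dimension.getKey().equals("file_path");
            boolean anyMatch = false;
            for (String candidate : allowed) {
                if (prefixMatch ? value.startsWith(candidate) : value.equals(candidate)) {
                    anyMatch = true; // OR: one matching value is enough
                    break;
                }
            }
            if (!anyMatch) {
                return false; // AND: every restricted dimension must match
            }
        }
        return true;
    }
}
```

For example, language in {java, python} combined with file_path prefix src/main keeps only Java or Python chunks located under src/main.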
US-02-05: Relevance Tuning Configuration (Done)
Configurable field boosts and field match explanation.
Key Classes:
- BoostConfiguration - Holds boost values loaded from application.properties
- FieldMatchInfo - Per-field match scores from Lucene Explanation API
Capabilities:
- Configurable boost values: entity_name (3.0x), doc_summary (2.0x), content (1.0x)
- Boosts applied at query time via BoostQuery wrapping (no reindexing needed)
- Recursive boost application to BooleanQuery, TermQuery, PhraseQuery, WildcardQuery
- Optional field match explanation: shows which fields matched with per-field scores
- Enabled via include_field_match=true query parameter
Configuration:
megabrain.search.boost.entity-name=3.0
megabrain.search.boost.doc-summary=2.0
megabrain.search.boost.content=1.0
Tests: Configuration loading, boost application, ranking verification with inverted boosts. >80% coverage.
US-02-06: Transitive Search Integration (Done)
Graph-powered transitive search for inheritance and usage relationships.
Key Classes:
- SearchOrchestrator - Coordinates hybrid search + graph queries in parallel
- GraphQueryService - Graph query interface
- GraphQueryServiceStub - Implementation delegating to closure queries
- ImplementsClosureQuery / Neo4jImplementsClosureQuery - Transitive implements traversal via Neo4j Cypher
- ExtendsClosureQuery / Neo4jExtendsClosureQuery - Transitive extends traversal
- StructuralQueryParser - Parses implements:X, extends:X, usages:X syntax
Capabilities:
- Structural query syntax: implements:InterfaceName, extends:ClassName, usages:TypeName
- Transitive closure via Neo4j Cypher: MATCH (i)<-[:IMPLEMENTS|EXTENDS*1..depth]-(c) RETURN DISTINCT c
- Configurable depth limit (default 5, max 10, per-request override)
- Parallel execution: hybrid search runs alongside graph queries
- Graph results merged with hybrid results (deduplicated, sorted by score)
- Results annotated with is_transitive flag and relationship_path array
- usages:TypeName combines both implements and extends closures for polymorphic call site coverage
- Graceful degradation: returns empty when megabrain.neo4j.uri is unset
Configuration:
megabrain.search.transitive.default-depth=5
megabrain.search.transitive.max-depth=10
megabrain.neo4j.uri=bolt://localhost:7687
Tests: Comprehensive tests for structural parsing, closure queries, orchestration, depth validation, result marking. >80% coverage.
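A rough sketch of how the structural syntax and the depth limits could be handled is shown below. It is not the StructuralQueryParser source: the class and record names, the identifier pattern, and the choice to reject (rather than clamp) an out-of-range depth are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for implements:X, extends:X and usages:X queries. */
final class StructuralQuerySketch {

    record StructuralQuery(String relation, String target) {}

    // Assumed grammar: a relation keyword, a colon, then a (possibly qualified) type name.
    private static final Pattern SYNTAX =
            Pattern.compile("(implements|extends|usages):([A-Za-z_$][\\w$.]*)");

    /** Returns the parsed query, or empty when the input is ordinary free text. */
    static Optional<StructuralQuery> parse(String query) {
        Matcher m = SYNTAX.matcher(query.trim());
        return m.matches()
                ? Optional.of(new StructuralQuery(m.group(1), m.group(2)))
                : Optional.empty();
    }

    /** Uses the default when no depth is requested; rejects values outside 1..maxDepth. */
    static int resolveDepth(Integer requested, int defaultDepth, int maxDepth) {
        if (requested == null) {
            return defaultDepth;
        }
        if (requested < 1 || requested > maxDepth) {
            throw new IllegalArgumentException("depth must be between 1 and " + maxDepth);
        }
        return requested;
    }
}
```

The resolved depth corresponds to the upper bound of the variable-length pattern (*1..depth) in the Cypher query above.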
EPIC-03: RAG Answer Generation (partial)
US-03-01: Ollama Local LLM Integration (Partial - T1-T3 of 6)
Foundation for local LLM integration via Ollama.
Key Classes:
- LLMClient - Unified interface for LLM providers (generate, isAvailable, generate with model override)
- OllamaLLMClient - Ollama implementation wrapping LangChain4j OllamaChatModel
- OllamaConfiguration - Config mapping for base URL, model, timeout, model availability cache
- OllamaModelAvailabilityService - Model availability check via Ollama /api/tags with caching
Completed:
- LangChain4j Ollama dependency integrated (BOM 1.9.1)
- LLMClient interface defined for provider abstraction
- OllamaLLMClient implementation with configuration loading
- Model selection with per-request override (T3): generate(msg, modelOverride), model validation, cached availability
Not Yet Implemented:
- Endpoint configuration with retry logic (T4)
- Health check for Ollama availability (T5)
- Integration tests with real Ollama (T6)
Configuration:
megabrain.llm.ollama.base-url=http://localhost:11434
megabrain.llm.ollama.model=codellama
megabrain.llm.ollama.timeout-seconds=60
megabrain.llm.ollama.model-availability-cache-seconds=60
Tests: Unit tests for OllamaLLMClient, LLMClient interface, OllamaConfiguration.
RAG REST (US-04-03): AC6 (first token within 2s) is validated by an integration test (RagStreamingIntegrationTestIT.rag_streamFirstToken_within2Seconds, tagged performance) with a mocked RAG service; production compliance is validated by demo or APM.
EPIC-04: REST API & CLI
US-04-04: CLI Ingest Command (Done)
CLI command structure and options for ingesting repositories from the command line.
Key Classes:
- MegaBrainCommand - Top-level CLI entry point (Quarkus Picocli @TopCommand) with subcommands
- IngestCommand - ingest subcommand with --source, --repo, --branch, --token, --incremental options; CDI bean with constructor-injected IngestionService
Completed (T1):
- Picocli integration in package io.megabrain.cli
- Command name ingest; megabrain ingest --help shows usage
- Unit tests for command name, help output, and minimal parse (no mocks)
Completed (T2):
- All five options added: --source (required), --repo (required), --branch (default: main), --token (optional), --incremental (default: false)
- Validation via IngestionResource.SourceType.fromString(); invalid source or blank repo throw ParameterException with clear messages; token never logged
- Unit tests for help output, defaults, valid sources, invalid source, token parsing, run() after valid parse
Completed (T3):
- Progress display: subscribes to IngestionService.ingestRepository(repo) / ingestRepositoryIncrementally(repo); single-line updates on TTY, line-by-line when not TTY; message length capped; failure logs short message (no token) and exits non-zero
- Unit tests for full ingest progress output, incremental progress output, full vs incremental method calls, stream failure
Completed (T4):
- Exit codes: 0 = success, 1 = execution/ingestion failure, 2 = invalid arguments; documented in CLI reference and Javadoc; no System.exit(); Picocli exitCodeOnInvalidInput / exitCodeOnExecutionException on IngestCommand; tests assert codes via CommandLine.execute() on IngestCommand
Completed (T5):
- --verbose option: when set, enables DEBUG for io.megabrain logger (via JBoss LogManager), fuller progress (no message truncation), and on ingestion failure logs full stack trace with LOG.error("Ingestion failed", err); otherwise message-only; single source is the verbose field
Completed (T6):
- Tests: Unit tests for option parsing, validation, progress display, exit codes, and help text using Picocli CommandLine.execute() and mocked IngestionService. Coverage includes: token never in output, repo trim, Picocli exit-code contract (invalid 2, execution 1), branch default in help, non-verbose truncation, null progress message, missing --repo exit 2, MegaBrainCommand help. Package io.megabrain.cli line and branch coverage >80% (JaCoCo).
US-04-05: CLI Search Command (Done)
CLI command to search the MegaBrain index from the command line.
Key Classes:
- SearchCommand - Picocli search subcommand with required query parameter; integrated in MegaBrainCommand.subcommands
Completed (T1):
- SearchCommand class in package io.megabrain.cli with @Command(name = "search"), @Parameters(index = "0") for query, mixinStandardHelpOptions = true, exit codes 2 (invalid input) and 1 (execution exception)
- Validation: non-blank query required; throws ParameterException otherwise
- Minimal run() behavior: writes to stdout that query was received (stub for T1; no SearchOrchestrator injection yet)
- Help text: megabrain search --help shows description and usage
- Unit tests: SearchCommandTest (plain JUnit 5) for command name, --help output, execute with one query arg, blank query exit 2
Completed (T2):
- Filter and output options: --language, --repo, --type (entity_type), --limit (default 10), --json (default false), --quiet (default false). All options in help with clear descriptions.
- Validation in run(): after query check, each --language validated against supported set (java, python, javascript, typescript, go, rust, kotlin, ruby, scala, swift, php, c, cpp); each --type against (class, method, function, field, interface, enum, module); --limit 1–100. Invalid values throw ParameterException(spec.commandLine(), "message") with allowed values in message.
- SearchRequest built in run() from validated options: setQuery, addLanguage/addRepository/addEntityType for each list, setLimit. --json and --quiet kept as fields for T3/T5. No new DTOs.
- Unit tests: option parsing, defaults when only query, multi-value --language/--repo/--type, valid language/type exit 0, invalid --language/--type exit 2 with stderr message, --help contains all option names, --limit 1 and 100 valid, --limit 0 or out of range exit 2, missing query exit 2. Aim >80% on SearchCommand.
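The T2 validation rules can be sketched in plain Java as follows. This is an illustration only: the class name `SearchOptionValidationSketch` is hypothetical, and IllegalArgumentException stands in for the Picocli ParameterException that the command actually throws. The blank-value skipping and lower-casing follow the normalization behavior listed under the T6 tests.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical stand-in for the --language and --limit checks in SearchCommand.run(). */
final class SearchOptionValidationSketch {

    static final Set<String> LANGUAGES = Set.of(
            "java", "python", "javascript", "typescript", "go", "rust",
            "kotlin", "ruby", "scala", "swift", "php", "c", "cpp");

    /** Blank values are skipped; other values are lower-cased and checked against the set. */
    static Optional<String> normalizeLanguage(String raw) {
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        String language = raw.trim().toLowerCase(Locale.ROOT);
        if (!LANGUAGES.contains(language)) {
            throw new IllegalArgumentException(
                    "Invalid --language '" + raw + "'; allowed values: " + LANGUAGES);
        }
        return Optional.of(language);
    }

    /** Accepts limits in the inclusive range 1..100. */
    static int validateLimit(int limit) {
        if (limit < 1 || limit > 100) {
            throw new IllegalArgumentException("--limit must be between 1 and 100");
        }
        return limit;
    }
}
```

Including the allowed values in the error message lets a user correct a typo without consulting the help text.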
Completed (T3):
- Terminal formatting: SearchResultFormatter interface in io.megabrain.cli with format(SearchResponse) and format(SearchResponse, boolean quiet). HumanReadableSearchResultFormatter: per result shows File, Entity, Score, snippet, separator ---; truncation by line count (max 15 lines) and line length (max 120 chars); null-safe placeholders; empty results → "No results."; optional header (query, total, tookMs). Quiet mode: one line per result (path + entity).
- SearchCommand integration: Injects SearchOrchestrator, SearchResultFormatter, and config (facetLimit, transitiveDefaultDepth, transitiveMaxDepth). In run(): builds SearchRequest, calls orchestrate(..., SearchMode.HYBRID, ...).await().indefinitely(), converts OrchestratorResult to SearchResponse via SearchResultMapper.toSearchResult() (shared helper in io.megabrain.api, used by REST and CLI). If !json prints formatter output and flushes; handles Uni failure with user-facing ExecutionException.
- SearchResultMapper: In io.megabrain.api; maps MergedResult to SearchResult DTO; used by SearchResource and CLI.
- Tests: SearchResultFormatterTest (empty → "No results.", single/multiple layout, long snippet truncated, null/blank no NPE, quiet format); SearchCommandTest (mock orchestrator, stdout contains formatted result when not --json, empty results "No results.").
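The snippet truncation rule (at most 15 lines, at most 120 characters per line) might look like the sketch below. The class name `SnippetTruncationSketch`, the "..." markers, and the empty-string result for null or blank input are assumptions for this example; the formatter's actual placeholders are not documented here.

```java
/** Hypothetical illustration of truncating a snippet by line count and line length. */
final class SnippetTruncationSketch {

    static String truncate(String snippet, int maxLines, int maxLineLength) {
        if (snippet == null || snippet.isBlank()) {
            return ""; // assumption: the real formatter prints a placeholder instead
        }
        String[] lines = snippet.split("\n", -1);
        int shown = Math.min(lines.length, maxLines);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < shown; i++) {
            String line = lines[i];
            if (line.length() > maxLineLength) {
                line = line.substring(0, maxLineLength) + "..."; // marks a shortened line
            }
            if (i > 0) {
                out.append('\n');
            }
            out.append(line);
        }
        if (lines.length > maxLines) {
            out.append("\n..."); // marks omitted trailing lines
        }
        return out.toString();
    }
}
```

A call such as truncate(snippet, 15, 120) would apply the limits stated above.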
Completed (T4):
- Syntax highlighting: SyntaxHighlighter interface and CliSyntaxHighlighter implementation (keyword/pattern-based) using Jansi for ANSI codes. Supports Java, Python, JavaScript, TypeScript; other languages fall back to plain. HumanReadableSearchResultFormatter injects highlighter and uses it in format(response, quiet, useColor); on highlighter failure logs debug and appends plain snippet.
- Color control: --no-color option (default false). useColor resolved as: false if --no-color, else false if env NO_COLOR set, else false if output not TTY (System.console() == null), else true. Formatter receives useColor and highlights snippets only when true.
- Tests: CliSyntaxHighlighterTest (color on → ANSI, color off → no ANSI, multiple languages, unknown/null/blank language no exception, empty snippet); SearchResultFormatterTest (useColor true → snippet contains ANSI, useColor false → no ANSI); SearchCommandTest (--no-color parsed and useColor false passed to formatter, output with --no-color has no ANSI).
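The useColor precedence described for T4 reduces to a small pure function, sketched here. The class name `ColorDecisionSketch` is hypothetical, and treating any non-null NO_COLOR value as "set" is an assumption about how the command reads the environment.

```java
/** Hypothetical illustration of the --no-color / NO_COLOR / TTY precedence. */
final class ColorDecisionSketch {

    static boolean useColor(boolean noColorFlag, String noColorEnv, boolean outputIsTty) {
        if (noColorFlag) {
            return false; // explicit --no-color wins
        }
        if (noColorEnv != null) {
            return false; // NO_COLOR is set in the environment
        }
        return outputIsTty; // color only when writing to a terminal
    }
}
```

A caller could pass System.getenv("NO_COLOR") and System.console() != null as the last two arguments; keeping the decision free of direct environment access makes it easy to unit test.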
Completed (T5):
- JSON output: When --json is set, output is written as JSON (no formatter). Injected ObjectMapper (Quarkus-provided) serializes SearchResponse: full JSON includes results, total, page, size, query, took_ms, facets; with --quiet only response.getResults() is serialized (results array). Pretty-printing uses writerWithDefaultPrettyPrinter() when TTY and not quiet and not --no-color; compact when piped or --no-color. Written to spec.commandLine().getOut() and flushed. No new DTOs; SearchResponse/SearchResult are Jackson-friendly.
- Tests: SearchCommandTest: with --json parse stdout as JSON object, assert root has results, total, page, size, query, took_ms, facets and one result has source_file, entity_name, score; with --json --quiet parse as JSON array, assert length and element fields; empty results: full JSON has results=[], total=0, quiet is []. Use ObjectMapper.readValue(stdout.trim(), ...); no exact string assertions.
Completed (T6):
- Tests: Unit tests for SearchCommand covering option parsing, validation, output formatting, JSON mode, and help text using Picocli CommandLine.execute() and mocked SearchOrchestrator. Council-recommended tests: orchestrator failure (exit 1, stderr "Search failed"), JSON with null ObjectMapper (exit 1), JSON serialization failure (mocked IOException), NO_COLOR env (useColor false via SystemStubs), blank/uppercase filter normalization (--language " " skipped, JAVA → java), --quiet human-readable (formatter quiet true, one line per result), JSON with non-empty facets. Additional: SearchResultFormatterTest, CliSyntaxHighlighterTest. Package io.megabrain.cli line and instruction coverage >80% (JaCoCo); SearchCommand 94% instructions, 75% branches.