clice/openspec/changes/improve-code-completion-parity/design.md at openspec-plan

caio/clice

Files

2026-04-03 21:49:59 +08:00

14 KiB

Raw Permalink Blame History

Context

clice currently routes textDocument/completion through MasterServer::forward_stateless(), which prepares open-buffer text plus available PCH/PCM paths and sends a one-shot completion request to the stateless worker. The worker builds CompilationParams, invokes feature::code_complete(), and returns a raw std::vector<CompletionItem>.

The implementation in src/feature/code_completion.cpp is intentionally small:

run Sema code completion at a single file/offset
compute the local identifier prefix
fuzzy-match candidate labels against the typed prefix
emit only basic LSP fields such as label, kind, textEdit, and sortText
bundle function overloads only by qualified name, with detail = "(...)" for grouped overloads

Compared with the local clangd implementation in .llvm/clang-tools-extra/clangd, the gap is large and structural rather than cosmetic. clangd has a full completion pipeline:

request shaping from client capabilities and trigger context in ClangdLSPServer
CodeCompleteFlow as a staged completion engine rather than direct Sema rendering
candidate fusion from Sema, project index, and raw identifiers
context capture such as completion kind, query scopes, expected type, and following token
scoring that combines name match with semantic/contextual signals
richer LSP rendering including signatures, return types, docs, snippets, fix-its, additional edits, deprecation, filterText, isIncomplete, and include insertion

At the same time, clice has infrastructure that clangd either lacks or does not currently integrate into completion as deeply:

explicit header-context modeling for headers shared by multiple includers
existing PCH/PCM request plumbing in the master/stateless worker path
merged per-file index shards and project-wide symbol/reference data
a project goal of instantiation-aware template processing

There is also a hard constraint: clice's current index symbol payload only stores name, kind, and reference files. That is enough for navigation-oriented lookup, but not enough for clangd-style index completion, which needs scope, signature, return type, documentation, deprecation, canonical declaration/header ownership, and dependency insertion hints.

Goals / Non-Goals

Goals:

close the highest-value completion gap with clangd in behavior, not in file-by-file implementation shape
replace the current direct Sema-to-LSP path with a staged completion pipeline
support rich, capability-aware LSP completion responses
augment Sema completion with identifier and project-index candidates
introduce heuristics that use semantic/contextual signals instead of raw fuzzy score alone
define how fix-its and dependency insertion edits attach to completion items
use clice's header-context, module, and template machinery to define completion behaviors clangd does not currently provide

Non-Goals:

byte-for-byte parity with clangd internals or its exact scoring constants
introducing completionItem/resolve in this change
shipping a learned ranking model in the first implementation; heuristic ranking is sufficient
requiring the global index to become fully clangd-equivalent before completion quality improves
solving every completion-related editor quirk across all clients before the core pipeline exists

Decisions

1. Split completion into request shaping, candidate collection, scoring, and rendering

clice should stop treating code completion as "run Sema and immediately serialize CompletionItems". Instead, completion will become a staged pipeline:

request shaping at the LSP/server edge
candidate collection in the worker
merge/deduplicate/bundle
score and truncate
render to LSP according to client capabilities

Why this approach:

it matches the actual gap with clangd, which is pipeline depth rather than one missing field
it keeps protocol negotiation in the server and semantic work in the worker
it allows clice-specific signals to affect ranking without polluting the LSP layer

Alternative considered:

continue returning std::vector<CompletionItem> directly from feature::code_complete() Rejected because isIncomplete, score-aware truncation, richer origins, and delayed rendering all need an intermediate completion model.

2. Extend the completion request contract with client capabilities and trigger context

The current stateless completion request only carries file text, compile arguments, and offset. The request contract should be extended so the worker can render results correctly:

completion trigger kind/character
client snippet support
documentation format preference
label-details support
supported completion item kinds
requested limit
overload-bundling preference

The master server should negotiate these from initialize capabilities and pass a normalized completion-options struct to the worker.

Why this approach:

clangd already proves that completion quality depends on client capabilities
several existing CodeCompletionOptions fields in clice are never meaningfully driven by the LSP client today
it keeps editor-specific policy out of the core collector logic

Alternative considered:

hardcode one rendering mode for all clients Rejected because it prevents snippet/doc behavior from ever becoming correct and makes the existing options struct mostly dead code.

3. Keep Sema as the primary source of truth, but augment it with identifiers and index symbols

Collection order should be:

Sema candidates from the active compilation context
local/raw identifiers from the open buffer and active file context
project/index candidates when the completion context allows them

Sema remains primary because it understands visibility, module state, recovery fix-its, and the exact active compile command. Identifier and index sources fill gaps when Sema is incomplete or when project-wide results are useful.

Merge rules should deduplicate by insertion semantics, not just by display label. Function overload bundling should group only candidates that render identically and require the same dependency edits.

Why this approach:

it preserves correctness for the current file while still expanding recall
it fits clice's current stateless worker path, which already builds the correct compile inputs
it allows richer completion without waiting for index-only correctness

Alternative considered:

move to index-first completion with Sema as fallback Rejected because clice's strongest current signals for modules, fix-its, and header contexts come from active compilation, not from the persisted index.

4. Expand the index schema specifically for completion metadata

clice's current index::Symbol payload is too small for rich completion. To support index-backed completion, the index data model should be extended with completion-oriented fields such as:

qualified scope / shortest usable qualifier
signature and return type
deprecation flag
documentation summary or documentation lookup key
canonical declaration path
preferred include header or module/import source
symbol origin/category for ranking

The first implementation does not need every clangd field, but it does need enough metadata to render and rank project-index candidates meaningfully.

Why this approach:

without extra index payload, project-index completion can only return weak name-only suggestions
dependency insertion cannot be justified without declaration ownership metadata
the serialization boundary already exists in TUIndex, ProjectIndex, and merged shards, so this is the right architectural layer

Alternative considered:

use the current index unchanged and limit project-index completion to name + kind Rejected because it would add complexity without delivering clangd-level value, and would likely need to be rewritten immediately after.

5. Use heuristic ranking first, and encode the signals explicitly

The first ranking model should remain heuristic, but it must go beyond raw fuzzy score. Candidate scoring should combine:

prefix/exact name match
semantic availability from Sema
in-scope vs required-qualifier insertion
expected type / preferred type compatibility when available
file/header-context proximity
deprecation penalty
module availability and dependency-edit cost
template-instantiation compatibility when clice can infer it

This keeps the design testable and explainable without blocking on a learned model.

Why this approach:

it delivers most user-visible value quickly
it maps naturally onto clice's existing semantics-heavy architecture
it leaves room for a later learned ranker without locking the system into clangd's exact model

Alternative considered:

adopt clangd-style decision-forest ranking immediately Rejected because clice does not yet have the surrounding instrumentation and labeled data pipeline, and heuristic ranking is sufficient for this change.

Two classes of edits matter:

Sema-provided fix-its near the completion site
dependency insertion edits such as #include or import

The completion model should carry both as structured edits and render them into LSP additionalTextEdits or merged textEdits where appropriate.

Implementation should be staged:

stage 1: preserve and emit Sema fix-its
stage 2: add dependency insertion for header-backed symbols once index metadata can justify it
stage 3: add module import insertion where the active TU and candidate source indicate a module-based dependency is the right form

Why this approach:

it closes a large clangd parity gap without forcing all edit types at once
module-aware insertion is a place where clice can exceed clangd's current include-centric design

Alternative considered:

postpone all completion edits until after ranking/index parity Rejected because Sema fix-its are already available and should not be discarded.

7. Make `clice`-specific header, module, and template context part of completion semantics

The main "beyond clangd" opportunities should not be separate side features. They should influence the same completion pipeline:

Header context: completion in a header should use the active includer/source context for that header, instead of treating the header as a single global state.
Modules: available PCM/module dependencies should affect both candidate visibility and dependency insertion strategy.
Template instantiation: when clice can identify concrete instantiation or expected-type information, ranking should prefer candidates compatible with that instantiated context.

This means the primary completion request must always be evaluated in one active compilation context, not a union of all possible contexts. Alternate contexts can be explored later, but blindly merging them would produce unstable and misleading completions.

Why this approach:

it uses the architecture clice already claims as a differentiator
it avoids the worst failure mode for shared headers: mixing incompatible contexts into one result set
it keeps "beyond clangd" work aligned with user-visible completion quality

Alternative considered:

add cross-context completion by merging all known header contexts Rejected because it sacrifices correctness and predictable ranking for recall.

Risks / Trade-offs

[Index payload expansion increases serialization and background-index cost] → Mitigation: add only completion-relevant fields first and measure shard-size growth before expanding documentation payloads further.
[Ranking heuristics become opaque or brittle] → Mitigation: keep signals explicit, loggable, and unit-tested rather than encoding hidden constants throughout the collector.
[Header-context-aware completion can produce confusing behavior if the active context is wrong] → Mitigation: bind completion to the same active header-context selection model used by navigation/AST features, not an ad hoc completion-only heuristic.
[Dependency insertion can apply the wrong form (#include vs import)] → Mitigation: gate edits on explicit symbol metadata and active compilation mode, and prefer no edit over a wrong edit.
[More completion metadata increases protocol churn between master and stateless workers] → Mitigation: add a dedicated options/result struct instead of spreading new fields loosely across existing request params.

Migration Plan

Extend the LSP/server edge to capture completion capabilities, trigger context, and limits.
Introduce an internal completion result model and return a completion list shape with isIncomplete.
Refactor the worker pipeline to separate candidate collection, merging, scoring, and rendering while preserving current Sema behavior.
Add local-identifier completion and heuristic ranking improvements.
Expand the index schema and integrate project-index candidates.
Add completion-related edits, first for Sema fix-its and then for dependency insertion.
Integrate header-context, module, and template-instantiation signals into ranking and edit policy.
Expand unit and integration coverage around rendering, ranking, index augmentation, module-aware completion, and shared-header contexts.

Rollback strategy:

keep Sema-only completion as a safe fallback path until the staged pipeline is stable
disable index augmentation or dependency insertion independently if they prove noisy
prefer full completion results without advanced edits over partially wrong completion behavior

Open Questions

How much documentation should be stored directly in clice's index versus fetched/lazily summarized from declarations on demand?
Should project-index completion support all-scopes insertion from the first version, or require explicit qualifiers only after scope metadata is fully modeled?
What is the right UX for exposing the currently active header context to users when completion results differ across includers?
How aggressive should template-instantiation-aware ranking be before users perceive it as "hiding" generic but still legal completions?
Whether dependency insertion for modules should be represented as ordinary additional text edits or as a future code-action-style completion extension.

14 KiB Raw Permalink Blame History

Context

Goals / Non-Goals

Decisions

1. Split completion into request shaping, candidate collection, scoring, and rendering

2. Extend the completion request contract with client capabilities and trigger context

3. Keep Sema as the primary source of truth, but augment it with identifiers and index symbols

4. Expand the index schema specifically for completion metadata

5. Use heuristic ranking first, and encode the signals explicitly

6. Treat completion-related edits as first-class output, with staged support

7. Make clice-specific header, module, and template context part of completion semantics

Risks / Trade-offs

Migration Plan

Open Questions

14 KiB

Raw Permalink Blame History

7. Make `clice`-specific header, module, and template context part of completion semantics