feat(indexing): add markdown document indexing #144
Open
shahzadarain wants to merge 2 commits into
Adds an index-markdown-documents MCP tool that indexes markdown content into Solr, complementing the existing JSON/CSV/XML tools.
- New MarkdownDocumentCreator (CommonMark + YAML front matter extension; lightweight, reflection-free, native-image safe): front matter entries become sanitized fields, title resolves from front matter or the first H1, heading texts go into a multi-valued headings field, and the plain-text body into content
- Missing ids are derived from a SHA-256 content hash so re-indexing the same markdown stays idempotent (matches the tool hint)
- Flow-style YAML lists ([a, b, c]) are expanded to multi-valued fields, matching block-style list behavior
- index-data prompt accepts markdown/md formats
- Unit tests, MCP integration round-trip test, README/AGENTS docs
Refs apache#69
…formats
Per maintainer feedback on apache#69, the index-markdown-documents tool description now tells clients not to use it for JSON/CSV/XML input (dedicated tools exist), to convert to markdown only when no dedicated tool covers the source format, and to supply a stable front matter id when indexing converted content.
adityamparikh
approved these changes
Jun 11, 2026
Summary
Adds an index-markdown-documents MCP tool that indexes markdown content into Solr, alongside the existing JSON/CSV/XML tools, extracting searchable structure rather than a flat text blob.
Closes #69
Design
A new MarkdownDocumentCreator follows the existing strategy pattern in indexing/documentcreator/, parsing with CommonMark-Java 0.28.0 plus its YAML front matter extension (lightweight and reflection-free, so GraalVM native-image safe with no extra hints needed).
Field extraction:
| Field | Source |
| --- | --- |
| front matter entries | one field per entry (names sanitized via FieldNameSanitizer; block- and flow-style lists become multi-valued fields) |
| id | front matter id if present, otherwise SHA-256 of the input |
| title | front matter title, else first level-1 heading |
| headings | heading texts (multi-valued) |
| content | plain-text body |
The index-data MCP prompt also accepts markdown/md formats.
Maintainer feedback from #69, addressed
- Stable id / idempotency: when front matter supplies no id, the document id is derived deterministically from a SHA-256 hash of the input, so re-indexing identical markdown overwrites the same document, matching the tool's idempotentHint. Since LLM-driven conversion of other formats to markdown is non-deterministic, the tool description also instructs clients to supply a stable front matter id when indexing converted content.
- Tool steering: the tool description now reads: "Do NOT use for JSON/CSV/XML input; use index-json-documents, index-csv-documents, or index-xml-documents instead. Only convert source content to markdown when there is no dedicated tool for the source format, and supply a stable 'id' in the YAML front matter when doing so."
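For reviewers, the id fallback described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name MarkdownIdSketch, the method names, the Map-based front matter shape, and the lowercase hex encoding of the digest are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

/** Hypothetical helper for illustration only; not the PR's actual class. */
final class MarkdownIdSketch {

    private MarkdownIdSketch() {}

    /**
     * Returns the front matter id when one is supplied, otherwise a
     * SHA-256 hex digest of the raw markdown input (assumed encoding).
     */
    static String resolveId(Map<String, String> frontMatter, String rawMarkdown) {
        String explicit = frontMatter.get("id");
        if (explicit != null && !explicit.isBlank()) {
            return explicit.trim();
        }
        return sha256Hex(rawMarkdown);
    }

    /** Hashes the UTF-8 bytes of the input and renders them as lowercase hex. */
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Because the digest depends only on the input bytes, indexing the same markdown twice yields the same id, so the second write overwrites the first. Any change to the input, including whitespace, produces a different id, which is why converted content should carry an explicit front matter id.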
Testing
- MarkdownIndexingTest: 10 unit tests covering front matter extraction, title resolution (front matter wins over H1), heading collection, plain-text body with formatting stripped, field name sanitization, flow- and block-style YAML lists, front matter id, content-hash id stability, empty-input rejection
- markdown and md formats in IndexingServiceTest
- McpClientIntegrationTestBase: index markdown via the tool, then find it by front matter id (runs across all transport × runtime combinations)
- listToolsReturnsExpectedTools and behavior hints asserted in toolsExposeBehaviorHints
- ./gradlew spotlessApply clean; unit tests green locally on JDK 25
Docs updated: README tool table, AGENTS.md format/creator lists.
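As a rough picture of the flow-style list case exercised by the unit tests, the expansion could look something like the sketch below. It assumes the front matter value arrives as a single raw string such as "[a, b, c]". The class FlowListSketch and its methods are hypothetical names, not the PR's code, and the sketch deliberately ignores nested lists and commas inside quoted items.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not the PR's actual class. */
final class FlowListSketch {

    private FlowListSketch() {}

    /**
     * Expands a flow-style value like "[a, b, c]" into its items.
     * Any value not wrapped in brackets is returned as a single-element list.
     */
    static List<String> expand(String rawValue) {
        String value = rawValue.trim();
        if (value.length() < 2 || !value.startsWith("[") || !value.endsWith("]")) {
            return List.of(value);
        }
        List<String> items = new ArrayList<>();
        String inner = value.substring(1, value.length() - 1);
        for (String part : inner.split(",")) {
            String item = stripQuotes(part.trim());
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    /** Removes one pair of matching single or double quotes, if present. */
    private static String stripQuotes(String s) {
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if (first == last && (first == '"' || first == '\'')) {
                return s.substring(1, s.length() - 1);
            }
        }
        return s;
    }
}
```

Each returned item would then be added as one value of a multi-valued Solr field, which is the same result a block-style list produces.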