feat(indexing): add markdown document indexing by shahzadarain · Pull Request #144 · apache/solr-mcp

shahzadarain · 2026-06-11T20:44:12Z

Summary

Adds an index-markdown-documents MCP tool that indexes markdown content into Solr, alongside the existing JSON/CSV/XML tools — extracting searchable structure rather than a flat text blob.

Closes #69

Design

A new MarkdownDocumentCreator follows the existing strategy pattern in indexing/documentcreator/, parsing with CommonMark-Java 0.28.0 plus its YAML front matter extension (lightweight, reflection-free, so GraalVM native-image safe — no extra hints needed).

Field extraction:

| Field | Source |
| --- | --- |
| front matter entries | each entry becomes a field (names sanitized via `FieldNameSanitizer`; block- and flow-style lists become multi-valued fields) |
| `id` | front matter `id` if present, otherwise SHA-256 of the input |
| `title` | front matter `title`, else first level-1 heading |
| `headings` | multi-valued field with every heading text (the document outline) |
| `content` | plain text body, front matter excluded |
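
The mapping in the table can be sketched in plain Java. This is an illustrative sketch under stated assumptions, not the PR's `MarkdownDocumentCreator`: the class `MarkdownFieldMappingSketch`, the `Heading` record, and `expandFlowList` are hypothetical names, the front matter is assumed to arrive already parsed as a map from key to list of values, and both field-name sanitization and the `id` field are left out. The flow-list handling splits naively on commas and does not handle quoted items.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the field mapping; not the code from this PR.
final class MarkdownFieldMappingSketch {

    // Minimal stand-in for a parsed markdown heading (assumed shape).
    record Heading(int level, String text) {}

    private MarkdownFieldMappingSketch() {}

    // Expands a flow-style list such as "[a, b, c]" into its items.
    // Any other value is returned unchanged as a single-element list.
    static List<String> expandFlowList(String value) {
        String trimmed = value.strip();
        if (trimmed.length() < 2 || !trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            return List.of(value);
        }
        List<String> items = new ArrayList<>();
        for (String part : trimmed.substring(1, trimmed.length() - 1).split(",")) {
            String item = part.strip();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    static Map<String, Object> toFields(Map<String, List<String>> frontMatter,
                                        List<Heading> headings,
                                        String plainTextBody) {
        Map<String, Object> fields = new LinkedHashMap<>();

        // Each front matter entry becomes a field; several values make it multi-valued.
        for (Map.Entry<String, List<String>> entry : frontMatter.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String raw : entry.getValue()) {
                values.addAll(expandFlowList(raw));
            }
            fields.put(entry.getKey(), values.size() == 1 ? values.get(0) : List.copyOf(values));
        }

        // A front matter title wins; otherwise fall back to the first level-1 heading.
        if (!fields.containsKey("title")) {
            headings.stream()
                    .filter(h -> h.level() == 1)
                    .findFirst()
                    .ifPresent(h -> fields.put("title", h.text()));
        }

        // Every heading text, in document order, forms the outline.
        fields.put("headings", headings.stream().map(Heading::text).toList());
        fields.put("content", plainTextBody);
        return fields;
    }
}
```

With front matter `tags: [solr, search]` and no `title` entry, this sketch yields a two-valued `tags` field and takes `title` from the first level-1 heading, even if a lower-level heading appears earlier in the document.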

The index-data MCP prompt also accepts markdown/md formats.

Maintainer feedback from #69, addressed

Stable id / idempotency: when front matter supplies no id, the document id is derived deterministically from a SHA-256 hash of the input, so re-indexing identical markdown overwrites the same document — matching the tool's idempotentHint. Since LLM-driven conversion of other formats to markdown is non-deterministic, the tool description also instructs clients to supply a stable front matter id when indexing converted content.
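
The id fallback described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `ContentHashIdSketch` and its method names are hypothetical, and hashing the whole input as UTF-8 and rendering the digest as lowercase hex are assumptions about details the description does not spell out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper, not the code from this PR.
final class ContentHashIdSketch {

    private ContentHashIdSketch() {}

    // Uses the front matter id when it is present and non-blank;
    // otherwise derives the id from a hash of the markdown input.
    static String resolveId(String frontMatterId, String markdown) {
        if (frontMatterId != null && !frontMatterId.isBlank()) {
            return frontMatterId.strip();
        }
        return sha256Hex(markdown);
    }

    // SHA-256 of the UTF-8 bytes, as 64 lowercase hex characters (assumed encoding and format).
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the hash depends only on the input bytes, identical markdown always maps to the same id, while any change to the input (even whitespace) produces a different one. That sensitivity is why converted content needs an explicit front matter `id`.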

Tool steering: the tool description now reads: "Do NOT use for JSON/CSV/XML input; use index-json-documents, index-csv-documents, or index-xml-documents instead. Only convert source content to markdown when there is no dedicated tool for the source format, and supply a stable 'id' in the YAML front matter when doing so."

Testing

- MarkdownIndexingTest: 10 unit tests covering front matter extraction, title resolution (front matter wins over H1), heading collection, plain-text body with formatting stripped, field name sanitization, flow- and block-style YAML lists, front matter id, content-hash id stability, and empty-input rejection
- Prompt-path tests for markdown and md formats in IndexingServiceTest
- MCP round-trip added to McpClientIntegrationTestBase: index markdown via the tool, then find it by front matter id (runs across all transport × runtime combinations)
- Tool registered in listToolsReturnsExpectedTools and behavior hints asserted in toolsExposeBehaviorHints
- `./gradlew spotlessApply` clean; unit tests green locally on JDK 25

Docs updated: README tool table, AGENTS.md format/creator lists.

Adds an index-markdown-documents MCP tool that indexes markdown content into Solr, complementing the existing JSON/CSV/XML tools.

- New MarkdownDocumentCreator (CommonMark + YAML front matter extension; lightweight, reflection-free, native-image safe): front matter entries become sanitized fields, title resolves from front matter or the first H1, heading texts go into a multi-valued headings field, and the plain-text body into content
- Missing ids are derived from a SHA-256 content hash so re-indexing the same markdown stays idempotent (matches the tool hint)
- Flow-style YAML lists ([a, b, c]) are expanded to multi-valued fields, matching block-style list behavior
- index-data prompt accepts markdown/md formats
- Unit tests, MCP integration round-trip test, README/AGENTS docs

Refs apache#69

…formats

Per maintainer feedback on apache#69, the index-markdown-documents tool description now tells clients not to use it for JSON/CSV/XML input (dedicated tools exist), to convert to markdown only when no dedicated tool covers the source format, and to supply a stable front matter id when indexing converted content.

adityamparikh

Looks good! @epugh / @janhoy / @chatman can we merge this please?

shahzadarain added 2 commits June 11, 2026 23:29

adityamparikh approved these changes Jun 11, 2026

shahzadarain wants to merge 2 commits into apache:main from shahzadarain:feature/markdown-indexing
