feat(indexing): add markdown document indexing #144

Open
shahzadarain wants to merge 2 commits into apache:main from shahzadarain:feature/markdown-indexing

@shahzadarain

Summary

Adds an index-markdown-documents MCP tool that indexes markdown content into Solr, alongside the existing JSON/CSV/XML tools — extracting searchable structure rather than a flat text blob.

Closes #69

Design

A new MarkdownDocumentCreator follows the existing strategy pattern in indexing/documentcreator/, parsing with CommonMark-Java 0.28.0 plus its YAML front matter extension (lightweight, reflection-free, so GraalVM native-image safe — no extra hints needed).

Field extraction:

| Field | Source |
| --- | --- |
| front matter entries | each entry becomes a field (names sanitized via FieldNameSanitizer; block- and flow-style lists become multi-valued fields) |
| id | front matter id if present, otherwise SHA-256 of the input |
| title | front matter title, else first level-1 heading |
| headings | multi-valued field with every heading text (the document outline) |
| content | plain text body, front matter excluded |
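
For reviewers who want the field rules in executable form, here is a rough, dependency-free sketch of two of them: title resolution (front matter first, then the first level-1 heading) and expansion of a flow-style list value into multiple values. The class name `MarkdownFieldSketch`, the `Heading` record, and both method names are hypothetical and are not taken from the PR; the real `MarkdownDocumentCreator` works on the CommonMark-Java AST and may handle edge cases (quoted commas, nested lists) that this naive version does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MarkdownFieldSketch {

    /** Hypothetical stand-in for a parsed heading: its level (1-6) and its text. */
    record Heading(int level, String text) {}

    /** Front matter "title" wins; otherwise the first level-1 heading; otherwise null. */
    static String resolveTitle(Map<String, List<String>> frontMatter, List<Heading> headings) {
        List<String> titles = frontMatter.get("title");
        if (titles != null && !titles.isEmpty() && !titles.get(0).isBlank()) {
            return titles.get(0).trim();
        }
        for (Heading heading : headings) {
            if (heading.level() == 1) {
                return heading.text();
            }
        }
        return null;
    }

    /**
     * Expands a flow-style value such as "[a, b, c]" into separate values.
     * Anything not wrapped in brackets is returned as a single value.
     * Naive assumption: items contain no commas or quotes.
     */
    static List<String> expandFlowList(String raw) {
        String value = raw.trim();
        if (!(value.startsWith("[") && value.endsWith("]"))) {
            return List.of(value);
        }
        List<String> items = new ArrayList<>();
        for (String part : value.substring(1, value.length() - 1).split(",")) {
            String item = part.trim();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<Heading> headings = List.of(new Heading(2, "Intro"), new Heading(1, "Guide"));
        System.out.println(resolveTitle(Map.of("title", List.of("Custom")), headings));
        System.out.println(resolveTitle(Map.of(), headings));
        System.out.println(expandFlowList("[solr, search, mcp]"));
    }
}
```

Keeping the title fallback separate from the headings list means a document with no front matter still gets a usable title while the full outline stays available in `headings`.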

The index-data MCP prompt also accepts markdown/md formats.

Maintainer feedback from #69, addressed

Stable id / idempotency: when front matter supplies no id, the document id is derived deterministically from a SHA-256 hash of the input, so re-indexing identical markdown overwrites the same document — matching the tool's idempotentHint. Since LLM-driven conversion of other formats to markdown is non-deterministic, the tool description also instructs clients to supply a stable front matter id when indexing converted content.
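
To illustrate the idempotency argument, the following is a minimal sketch of a content-derived id using only the JDK. It assumes the id is the lowercase hex SHA-256 of the raw markdown encoded as UTF-8; the PR text does not state the encoding or output format, and the class and method names here (`ContentHashId`, `contentHashId`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ContentHashId {

    /** Returns the lowercase hex SHA-256 digest of the markdown input (assumed UTF-8). */
    static String contentHashId(String markdown) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(markdown.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String markdown = "# Release notes\n\nFirst entry.\n";
        String first = contentHashId(markdown);
        String second = contentHashId(markdown);
        // Identical input yields the identical id, so a re-index overwrites the same document.
        System.out.println(first.equals(second));
        // Any change to the input, even whitespace, yields a different id.
        System.out.println(first.equals(contentHashId(markdown + " ")));
    }
}
```

The second check in `main` is the reason the tool description asks for an explicit front matter id on converted content: two LLM conversions of the same source rarely match byte for byte, so their hashes would differ and produce duplicate documents.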

Tool steering: the tool description now reads: "Do NOT use for JSON/CSV/XML input; use index-json-documents, index-csv-documents, or index-xml-documents instead. Only convert source content to markdown when there is no dedicated tool for the source format, and supply a stable 'id' in the YAML front matter when doing so."

Testing

  • MarkdownIndexingTest — 10 unit tests: front matter extraction, title resolution (front matter wins over H1), heading collection, plain-text body with formatting stripped, field name sanitization, flow- and block-style YAML lists, front matter id, content-hash id stability, empty-input rejection
  • Prompt-path tests for markdown and md formats in IndexingServiceTest
  • MCP round-trip added to McpClientIntegrationTestBase: index markdown via the tool, then find it by front matter id (runs across all transport × runtime combinations)
  • Tool registered in listToolsReturnsExpectedTools and behavior hints asserted in toolsExposeBehaviorHints
  • ./gradlew spotlessApply clean; unit tests green locally on JDK 25

Docs updated: README tool table, AGENTS.md format/creator lists.

Adds an index-markdown-documents MCP tool that indexes markdown
content into Solr, complementing the existing JSON/CSV/XML tools.

- New MarkdownDocumentCreator (CommonMark + YAML front matter
  extension; lightweight, reflection-free, native-image safe):
  front matter entries become sanitized fields, title resolves from
  front matter or the first H1, heading texts go into a multi-valued
  headings field, and the plain-text body into content
- Missing ids are derived from a SHA-256 content hash so re-indexing
  the same markdown stays idempotent (matches the tool hint)
- Flow-style YAML lists ([a, b, c]) are expanded to multi-valued
  fields, matching block-style list behavior
- index-data prompt accepts markdown/md formats
- Unit tests, MCP integration round-trip test, README/AGENTS docs

Refs apache#69
…formats

Per maintainer feedback on apache#69, the index-markdown-documents tool
description now tells clients not to use it for JSON/CSV/XML input
(dedicated tools exist), to convert to markdown only when no dedicated
tool covers the source format, and to supply a stable front matter id
when indexing converted content.

@adityamparikh left a comment

Contributor

Looks good! @epugh / @janhoy / @chatman can we merge this please?

Development

Successfully merging this pull request may close these issues.

Ability to index and search over markdown documents
