Automated ontology builder that extracts formal OWL/RDF ontologies from unstructured documents using a Large Language Model.
Ontologist reads a directory of heterogeneous documents (PDF, DOCX, TXT, MD, RTF, HTML, ODT, PPTX, XLSX), uses a language model to identify domain concepts, class hierarchies and object properties, and serializes the resulting schema as both Turtle and RDF/XML OWL files.
The pipeline is built on:
- Apache Tika for document text extraction
- OpenAI Structured Outputs (or any OpenAI-compatible API) for schema extraction
- rdflib for RDF/OWL graph construction and serialization
- Pydantic for typed schema validation
- uv for dependency management
- Python 3.12 or higher
- uv (recommended) or pip
- Java Runtime Environment (required by Apache Tika for PDF and binary formats)
Clone the repository and install dependencies with uv:
git clone <repository-url>
cd Ontologist
uv syncAlternatively, using pip in a virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install -e .Copy the example environment file and adjust the values to your setup:
cp .env.example .envAvailable variables:
| Variable | Description | Default |
|---|---|---|
OPENAI_API_KEY |
API key for the LLM provider. Required. | - |
OPENAI_BASE_URL |
Base URL of any OpenAI-compatible endpoint (OpenAI, Azure OpenAI, Ollama, vLLM, LM Studio, etc.). | https://api.openai.com/v1 |
OPENAI_MODEL_ID |
Model identifier exposed by the endpoint. | gpt-4o-mini |
ONTOLOGY_BASE_URI |
Base namespace URI for generated classes and properties. | http://example.org/auto-ontology# |
INPUT_DIRECTORY |
Directory scanned for input documents. Created automatically if missing. | ./input |
OUTPUT_DIRECTORY |
Directory receiving the generated .ttl and .owl files. Created automatically. |
./output |
The LLM provider is not restricted to OpenAI. Any service implementing the OpenAI Chat Completions API with structured output support can be used.
Place one or more supported files into the input directory (created on first run if absent) and start the pipeline:
uv run python main.pyFor each processed document, two files are written to the output directory, sharing the source file's stem:
<stem>.ttl(Turtle)<stem>.owl(RDF/XML)
If a document fails to process, the error is reported and the pipeline continues with the remaining files.
PDF, DOCX, DOC, TXT, MD, RTF, HTML, HTM, ODT, PPTX, XLSX.
For a document describing a software team, the LLM may produce a schema such as:
- Classes:
Person,Employee,Developer,Project,CodeArtifact - Hierarchy:
Developeris a subclass ofEmployee;Employeeis a subclass ofPerson - Properties:
develops(Developer->Project),writes(Developer->CodeArtifact),manages(Manager->Employee)
The serialized Turtle output resembles:
@prefix ex: <http://example.org/auto-ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Developer a owl:Class ;
rdfs:label "Software-Entwickler"@de ;
rdfs:subClassOf ex:Employee .
ex:develops a owl:ObjectProperty ;
rdfs:label "entwickelt"@de ;
rdfs:domain ex:Developer ;
rdfs:range ex:Project ..
├── main.py # Pipeline implementation
├── pyproject.toml # Project metadata and dependencies
├── uv.lock # Locked dependency graph
├── .env.example # Configuration template
├── input/ # Input documents (created at runtime)
└── output/ # Generated ontologies
- Custom extraction logic — modify the
system_promptinAutomatedOntologyBuilder.extract_schema_with_llm(main.py:78) to adapt the LLM to a specific domain. - Additional formats — add a new format by appending a
(format, path)pair to theprocess_filemethod (main.py:131). - Richer schemas — extend the Pydantic models in
main.py:13-26with attributes such asDatatypeProperty,cardinality, orinverse propertiesand adjust the graph construction accordingly.
MIT license