Ontologist

Automated ontology builder that extracts formal OWL/RDF ontologies from unstructured documents using a large language model (LLM).

Overview

Ontologist reads a directory of heterogeneous documents (PDF, DOCX, DOC, TXT, MD, RTF, HTML, HTM, ODT, PPTX, XLSX), uses a language model to identify domain concepts, class hierarchies and object properties, and serializes the resulting schema as both Turtle and RDF/XML OWL files.

The pipeline is built on:

Apache Tika for document text extraction
OpenAI Structured Outputs (or any OpenAI-compatible API) for schema extraction
rdflib for RDF/OWL graph construction and serialization
Pydantic for typed schema validation
uv for dependency management

Requirements

Python 3.12 or higher
uv (recommended) or pip
Java Runtime Environment (required by Apache Tika for PDF and binary formats)
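
The requirements above can be verified before the first run. The following is a minimal sketch, not part of the project: the function name find_problems and the warning texts are assumptions made for illustration.

```python
import shutil
import sys

# Minimum interpreter version stated in the requirements list.
MIN_PYTHON = (3, 12)


def find_problems(python_version, java_executable):
    """Return human-readable problems; an empty list means both checks passed.

    python_version: a tuple such as sys.version_info.
    java_executable: path to the java binary, or None if it was not found.
    """
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.12 or higher is required")
    if java_executable is None:
        problems.append("no 'java' executable found on PATH (needed by Apache Tika)")
    return problems


if __name__ == "__main__":
    # shutil.which returns None when the executable is not on PATH.
    for problem in find_problems(sys.version_info, shutil.which("java")):
        print(f"warning: {problem}")
```

Passing the version and the Java path as arguments keeps the check independent of the machine it runs on.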

Installation

Clone the repository and install dependencies with uv:

git clone <repository-url>
cd Ontologist
uv sync

Alternatively, using pip in a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install -e .

Configuration

Copy the example environment file and adjust the values to your setup:

cp .env.example .env

Available variables:

Variable	Description	Default
`OPENAI_API_KEY`	API key for the LLM provider. Required.	-
`OPENAI_BASE_URL`	Base URL of any OpenAI-compatible endpoint (OpenAI, Azure OpenAI, Ollama, vLLM, LM Studio, etc.).	`https://api.openai.com/v1`
`OPENAI_MODEL_ID`	Model identifier exposed by the endpoint.	`gpt-4o-mini`
`ONTOLOGY_BASE_URI`	Base namespace URI for generated classes and properties.	`http://example.org/auto-ontology#`
`INPUT_DIRECTORY`	Directory scanned for input documents. Created automatically if missing.	`./input`
`OUTPUT_DIRECTORY`	Directory receiving the generated `.ttl` and `.owl` files. Created automatically.	`./output`

The LLM provider is not restricted to OpenAI. Any service implementing the OpenAI Chat Completions API with structured output support can be used.
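
As an illustration of how the variables in the table could be resolved against their defaults, here is a small sketch. The Settings class and load_settings function are hypothetical names and may differ from what main.py does; the sketch reads variables that are already exported and does not parse the .env file itself.

```python
import os
from dataclasses import dataclass

# Defaults copied from the table above. OPENAI_API_KEY has no default.
DEFAULTS = {
    "OPENAI_BASE_URL": "https://api.openai.com/v1",
    "OPENAI_MODEL_ID": "gpt-4o-mini",
    "ONTOLOGY_BASE_URI": "http://example.org/auto-ontology#",
    "INPUT_DIRECTORY": "./input",
    "OUTPUT_DIRECTORY": "./output",
}


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    model_id: str
    base_uri: str
    input_directory: str
    output_directory: str


def load_settings(env=None):
    """Build Settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is required")

    def value(name):
        # Treat an unset or empty variable as "use the default".
        return env.get(name) or DEFAULTS[name]

    return Settings(
        api_key=api_key,
        base_url=value("OPENAI_BASE_URL"),
        model_id=value("OPENAI_MODEL_ID"),
        base_uri=value("ONTOLOGY_BASE_URI"),
        input_directory=value("INPUT_DIRECTORY"),
        output_directory=value("OUTPUT_DIRECTORY"),
    )
```

For a local OpenAI-compatible server, only OPENAI_BASE_URL and OPENAI_MODEL_ID would typically need to change; the API key must still be set to some value because it is required.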

Usage

Place one or more supported files into the input directory (created on first run if absent) and start the pipeline:

uv run python main.py

For each processed document, two files are written to the output directory, sharing the source file's stem:

<stem>.ttl (Turtle)
<stem>.owl (RDF/XML)

If a document fails to process, the error is reported and the pipeline continues with the remaining files.
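
The naming and error-handling behaviour described above can be sketched as follows. The helper names output_paths and run_all, and the process callback, are assumptions for illustration and not the functions defined in main.py.

```python
from pathlib import Path


def output_paths(source, output_dir):
    """Derive the .ttl and .owl paths that share the source file's stem."""
    stem = Path(source).stem
    out = Path(output_dir)
    return out / f"{stem}.ttl", out / f"{stem}.owl"


def run_all(sources, output_dir, process):
    """Call process(source, ttl_path, owl_path) for every source.

    A failure in one document is reported and recorded, and the loop
    continues with the remaining files. Returns {source: error message}.
    """
    failures = {}
    for source in sources:
        ttl_path, owl_path = output_paths(source, output_dir)
        try:
            process(source, ttl_path, owl_path)
        except Exception as exc:  # deliberately broad: one bad file must not stop the run
            print(f"Failed to process {source}: {exc}")
            failures[str(source)] = str(exc)
    return failures
```

In this sketch, two inputs with the same stem (for example report.pdf and report.docx) map to the same pair of output files, so the later one would overwrite the earlier one.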

Supported File Types

PDF, DOCX, DOC, TXT, MD, RTF, HTML, HTM, ODT, PPTX, XLSX.
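
A possible way to select input files by extension is shown below. This is a sketch under two assumptions that the README does not confirm: matching is case-insensitive, and only the top level of the input directory is scanned. The names SUPPORTED_SUFFIXES, is_supported and find_documents are illustrative.

```python
from pathlib import Path

# Extensions taken from the list above, in lowercase with a leading dot.
SUPPORTED_SUFFIXES = frozenset({
    ".pdf", ".docx", ".doc", ".txt", ".md", ".rtf",
    ".html", ".htm", ".odt", ".pptx", ".xlsx",
})


def is_supported(path):
    """True if the file extension is in the supported set (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def find_documents(input_dir):
    """Return supported files directly inside input_dir, sorted by path."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and is_supported(p)
    )
```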

Output Example

For a document describing a software team, the LLM may produce a schema such as:

Classes: Person, Employee, Developer, Manager, Project, CodeArtifact
Hierarchy: Developer is a subclass of Employee; Employee is a subclass of Person
Properties: develops (Developer -> Project), writes (Developer -> CodeArtifact), manages (Manager -> Employee)

The serialized Turtle output resembles:

@prefix ex: <http://example.org/auto-ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Developer a owl:Class ;
    rdfs:label "Software-Entwickler"@de ;
    rdfs:subClassOf ex:Employee .

ex:develops a owl:ObjectProperty ;
    rdfs:label "entwickelt"@de ;
    rdfs:domain ex:Developer ;
    rdfs:range ex:Project .

Project Structure

.
├── main.py              # Pipeline implementation
├── pyproject.toml       # Project metadata and dependencies
├── uv.lock              # Locked dependency graph
├── .env.example         # Configuration template
├── input/               # Input documents (created at runtime)
└── output/              # Generated ontologies

Extending the Pipeline

Custom extraction logic — modify the system_prompt in AutomatedOntologyBuilder.extract_schema_with_llm (main.py:78) to adapt the LLM to a specific domain.
Additional formats — add a new format by appending a (format, path) pair to the process_file method (main.py:131).
Richer schemas — extend the Pydantic models in main.py:13-26 with attributes such as DatatypeProperty, cardinality, or inverse properties and adjust the graph construction accordingly.
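
To make the last extension point concrete, here is one way a richer schema could look. All model and field names below are hypothetical and are not the models defined in main.py; the sketch only shows where a datatype property and an inverse-property field might be added.

```python
from typing import Optional

from pydantic import BaseModel


class OntologyClass(BaseModel):
    name: str                      # local name, e.g. "Developer"
    label: str                     # human-readable label
    parent: Optional[str] = None   # name of the superclass, if any


class ObjectProperty(BaseModel):
    name: str
    label: str
    domain: str                        # name of the domain class
    range: str                         # name of the range class
    inverse_of: Optional[str] = None   # added: name of the inverse property


class DatatypeProperty(BaseModel):
    # Added: a property whose values are literals rather than class instances.
    name: str
    label: str
    domain: str
    datatype: str                      # e.g. "xsd:string" or "xsd:date"


class OntologySchema(BaseModel):
    classes: list[OntologyClass]
    object_properties: list[ObjectProperty]
    datatype_properties: list[DatatypeProperty]
```

The graph construction would then need matching triples, for example owl:DatatypeProperty for each DatatypeProperty entry and owl:inverseOf wherever inverse_of is set.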

License

Released under the MIT License.
