Skip to content

FBR65/Ontologist

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ontologist

Automated ontology builder that extracts formal OWL/RDF ontologies from unstructured documents using a Large Language Model.

Overview

Ontologist reads a directory of heterogeneous documents (PDF, DOCX, TXT, MD, RTF, HTML, ODT, PPTX, XLSX), uses a language model to identify domain concepts, class hierarchies and object properties, and serializes the resulting schema as both Turtle and RDF/XML OWL files.

The pipeline is built on:

  • Apache Tika for document text extraction
  • OpenAI Structured Outputs (or any OpenAI-compatible API) for schema extraction
  • rdflib for RDF/OWL graph construction and serialization
  • Pydantic for typed schema validation
  • uv for dependency management

Requirements

  • Python 3.12 or higher
  • uv (recommended) or pip
  • Java Runtime Environment (required by Apache Tika for PDF and binary formats)

Installation

Clone the repository and install dependencies with uv:

git clone <repository-url>
cd Ontologist
uv sync

Alternatively, using pip in a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install -e .

Configuration

Copy the example environment file and adjust the values to your setup:

cp .env.example .env

Available variables:

Variable Description Default
OPENAI_API_KEY API key for the LLM provider. Required. -
OPENAI_BASE_URL Base URL of any OpenAI-compatible endpoint (OpenAI, Azure OpenAI, Ollama, vLLM, LM Studio, etc.). https://api.openai.com/v1
OPENAI_MODEL_ID Model identifier exposed by the endpoint. gpt-4o-mini
ONTOLOGY_BASE_URI Base namespace URI for generated classes and properties. http://example.org/auto-ontology#
INPUT_DIRECTORY Directory scanned for input documents. Created automatically if missing. ./input
OUTPUT_DIRECTORY Directory receiving the generated .ttl and .owl files. Created automatically. ./output

The LLM provider is not restricted to OpenAI. Any service implementing the OpenAI Chat Completions API with structured output support can be used.

Usage

Place one or more supported files into the input directory (created on first run if absent) and start the pipeline:

uv run python main.py

For each processed document, two files are written to the output directory, sharing the source file's stem:

  • <stem>.ttl (Turtle)
  • <stem>.owl (RDF/XML)

If a document fails to process, the error is reported and the pipeline continues with the remaining files.

Supported File Types

PDF, DOCX, DOC, TXT, MD, RTF, HTML, HTM, ODT, PPTX, XLSX.

Output Example

For a document describing a software team, the LLM may produce a schema such as:

  • Classes: Person, Employee, Developer, Project, CodeArtifact
  • Hierarchy: Developer is a subclass of Employee; Employee is a subclass of Person
  • Properties: develops (Developer -> Project), writes (Developer -> CodeArtifact), manages (Manager -> Employee)

The serialized Turtle output resembles:

@prefix ex: <http://example.org/auto-ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Developer a owl:Class ;
    rdfs:label "Software-Entwickler"@de ;
    rdfs:subClassOf ex:Employee .

ex:develops a owl:ObjectProperty ;
    rdfs:label "entwickelt"@de ;
    rdfs:domain ex:Developer ;
    rdfs:range ex:Project .

Project Structure

.
├── main.py              # Pipeline implementation
├── pyproject.toml       # Project metadata and dependencies
├── uv.lock              # Locked dependency graph
├── .env.example         # Configuration template
├── input/               # Input documents (created at runtime)
└── output/              # Generated ontologies

Extending the Pipeline

  • Custom extraction logic — modify the system_prompt in AutomatedOntologyBuilder.extract_schema_with_llm (main.py:78) to adapt the LLM to a specific domain.
  • Additional formats — add a new format by appending a (format, path) pair to the process_file method (main.py:131).
  • Richer schemas — extend the Pydantic models in main.py:13-26 with attributes such as DatatypeProperty, cardinality, or inverse properties and adjust the graph construction accordingly.

License

MIT license

About

Ontologist extracts text from documents, uses a large language model to infer an ontology schema, and builds an RDF graph.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages