Domain-neutral LLM post-training pipeline for CPT, Fact-SFT, optional DPO, adapter merge, quality evaluation, GGUF/ONNX export, and OpenAI-compatible inference.
Updated Jun 28, 2026 - Python
A comprehensive toolkit for end-to-end continued pre-training, fine-tuning, monitoring, testing and publishing of language models with MLX-LM
A clean, reproducible pipeline for training Vietnamese GPT-2 from scratch and adapting it to 5-word quatrain poetry generation.
LangSAMP: Language-Script Aware Multilingual Pretraining
This project evaluates continued pre-training of Llama 3.2 3B for the Serbian language, using a custom cloze-style benchmark. It supports grammatical, lexical, semantic, idiomatic, and factual sentence completion tasks. The evaluation script computes model accuracy from log-likelihood scores over the candidate tokens for each masked position.
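As a rough illustration of cloze-style log-likelihood scoring, the sketch below picks the highest-scoring candidate per item and reports accuracy. It is not taken from the repository: the function names `pick_choice` and `cloze_accuracy` and the hard-coded log-probabilities are hypothetical stand-ins for scores a real model would produce.

```python
def pick_choice(logprobs):
    """Return the index of the candidate with the highest log-likelihood."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])


def cloze_accuracy(items):
    """Compute accuracy over (candidate_logprobs, gold_index) pairs.

    A prediction counts as correct when the highest-scoring candidate
    is the gold one.
    """
    correct = sum(1 for logprobs, gold in items if pick_choice(logprobs) == gold)
    return correct / len(items)


# Hypothetical scores: each item holds one log-likelihood per candidate
# completion, plus the index of the correct candidate.
items = [
    ([-2.1, -0.4, -3.0], 1),  # model prefers candidate 1, gold is 1
    ([-1.5, -1.9, -0.7], 2),  # model prefers candidate 2, gold is 2
    ([-0.9, -1.2, -2.5], 1),  # model prefers candidate 0, gold is 1
]
accuracy = cloze_accuracy(items)
print(f"accuracy = {accuracy:.3f}")
```

In practice the per-candidate scores would come from summing the model's token log-probabilities for each completion; whether and how the project normalizes for completion length is not stated in its description.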
Adapting an LLM to the medical domain through SFT, RAG, and multistep fine-tuning to enhance domain knowledge and performance.
Dialect-aware language model for all six Romansh varieties. QLoRA continued pretraining on ZurichNLP/quotidiana; single-GPU, reproducible. First open Romansh LM.
Controlled study: does continued pre-training on SEC 10-K filings help downstream financial QA? A clean negative result on a fair evaluation instrument. Qwen2.5-3B; FinQA/TAT-QA; CPT (LoRA and full-parameter), SFT, DPO.
Falsifiable, stage-gated methodology for resource-disciplined LLM specialisation — worked example: adapting a generic LLM to believable (non-cliché) Australian English. Work in progress (v0.1).
CPT+SFT LoRA pipeline (German Wikipedia + docs) to extend LLM knowledge cutoff.
Master's thesis repository