Introduction
RedTEI is a Python-based pipeline designed to transform Reddit comments from Pushshift archives into TEI-XML format. This wiki provides comprehensive documentation for the RedTEI pipeline, covering everything from installation and usage to explanations of the processing steps and data structures.
A poster about RedTEI and the Reddit-d corpus has been created for DHd 2025 and is available on Zenodo.
This project was developed to facilitate the creation of corpora from Reddit data. The pipeline was also used to create the Reddit-d corpus for the Digital Dictionary of the German Language (DWDS). This corpus is based on data from the top 40 German-language subreddits in the Pushshift archive (dumps up to 2023-12-31 are available on Academic Torrents; data up to the end of 2024 is also accessible).
The pipeline runs in stages: it extracts comments from the archive, filters them, optionally groups them by thread, and converts the result into valid TEI-XML.
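The stages described above can be sketched roughly as follows. This is a minimal illustration, not RedTEI's actual code: the function names (`extract_comments`, `filter_comments`, `group_by_thread`, `thread_to_tei`) are hypothetical, the input is assumed to be already-decompressed newline-delimited JSON with the usual Reddit comment fields (`id`, `subreddit`, `link_id`, `body`), and the TEI layout is a simplified assumption rather than the schema RedTEI emits.

```python
import json
from collections import defaultdict
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def extract_comments(lines):
    """Parse one JSON object per line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_comments(comments, subreddits):
    """Keep comments from the given subreddits that still have a body."""
    wanted = {name.lower() for name in subreddits}
    for comment in comments:
        if comment.get("subreddit", "").lower() not in wanted:
            continue
        if comment.get("body") in (None, "", "[deleted]", "[removed]"):
            continue
        yield comment


def group_by_thread(comments):
    """Group comments by the submission they belong to (link_id)."""
    threads = defaultdict(list)
    for comment in comments:
        threads[comment["link_id"]].append(comment)
    return dict(threads)


def thread_to_tei(link_id, comments):
    """Build a simplified TEI document for one thread (assumed layout)."""
    ET.register_namespace("", TEI_NS)

    def tei_el(parent, name, **attrib):
        return ET.SubElement(parent, f"{{{TEI_NS}}}{name}", attrib)

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = tei_el(tei_el(root, "teiHeader"), "fileDesc")
    tei_el(tei_el(file_desc, "titleStmt"), "title").text = f"Thread {link_id}"
    tei_el(tei_el(file_desc, "publicationStmt"), "p").text = "Illustrative sketch"
    tei_el(tei_el(file_desc, "sourceDesc"), "p").text = "Pushshift comment dump"

    body = tei_el(tei_el(root, "text"), "body")
    thread_div = tei_el(body, "div", type="thread", n=link_id)
    for comment in comments:
        # xml:id values must not start with a digit, hence the prefix.
        div = tei_el(thread_div, "div", type="comment")
        div.set(XML_ID, f"c_{comment['id']}")
        tei_el(div, "p").text = comment["body"]
    return ET.tostring(root, encoding="unicode")
```

Usage would chain the stages, for example `group_by_thread(filter_comments(extract_comments(lines), ["de"]))`, and then call `thread_to_tei` once per thread. Skipping the grouping step and writing one document per comment would be the ungrouped variant.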
Contributions are welcome! Suggestions for improvements, bug reports, and code contributions are all appreciated.