The DIRT corpus consists of transcripts of Dutch-language reality TV series such as De Mol, Chateau Meiland, and Temptation Island. These are unscripted shows that feature relatively spontaneous, informal Dutch speech.
The first version of the DIRT corpus was created by Ulrike Vogl and Gauthier Delaby in 2021, as part of a student project for the course “Dutch Linguistics: The Contemporary Dutch Language System” and a research track for Dutch bachelor’s students on “Language Use in Reality TV.” For this project, episodes of various reality shows were transcribed following a specific transcription protocol (Ghyselen et al. 2020). The corpus is a work in progress, regularly supplemented with newly transcribed material.
The corpus is enriched with metadata, providing information about the speakers’ regional background, gender, education, and age. It includes both older and contemporary shows, featuring Dutch as spoken in both Belgium and the Netherlands.
Version 1.0 of the DIRT corpus has been available for download via Zenodo since October 30, 2025. A concordance program for the corpus, the DIRT concordancer, is also available on Zenodo. The current version of DIRT contains 350,965 words.
For more detailed information on (i) the content of the corpus, (ii) how to search the corpus, and (iii) the history of the corpus and the DIRT project, see the project documentation Over DIRT (in Dutch). The documentation also provides a number of relevant statistics about the current version of the DIRT corpus (e.g., the number of programs and seasons, or the ratio of Netherlandic Dutch to Belgian Dutch material).