A billion-word corpus of Danish text, freely distributed with attribution.
It’s hard to develop good tools for processing Danish with computers when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish text with over a billion words. DAGW is a project of the IT University of Copenhagen, with contributions from over a dozen other universities and businesses in Denmark; you can read the official ITU press release here. This is the homepage for the project. The general goals are to create a dataset that is:
The corpus is managed and communicated in English so that the world beyond Denmark can also use the resource.
Danish Gigaword is available via the IT University of Copenhagen:
dagw_v1.0-release.zip (2.2 GiB; md5 1eeca465f0ba00e8b03ed234a768c3ff)
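Because the archive is large, it can be worth checking the download against the published MD5 before unpacking it. The following is a minimal sketch, not an official project tool: the helper names md5_of_file and verify_archive are hypothetical, and it assumes the archive has been saved locally under its original filename.

```python
import hashlib

# MD5 checksum published on this page for dagw_v1.0-release.zip.
EXPECTED_MD5 = "1eeca465f0ba00e8b03ed234a768c3ff"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Chunked reading keeps memory use small even for a multi-gigabyte archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Under these assumptions, verify_archive("dagw_v1.0-release.zip") should return True for an intact download; a False result suggests the file was truncated or corrupted in transit and should be fetched again.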
Read the paper about The Danish Gigaword Corpus.
If you use the data, you MUST acknowledge it. The license is CC-BY 4.0, Creative Commons with Attribution.
In a press release:
Modellen er præ-trænet på et datasæt fra The Danish Gigaword Project (https://gigaword.dk), der er udviklet af forskere fra IT-Universitetet i København
The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen
In academic writing:
Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).
@inproceedings{dagw,
  title = {{The Danish Gigaword Corpus}},
  author = {Leon Derczynski and Manuel R. Ciosici and Rebekah Baglini and Morten H. Christiansen and Jacob Aarup Dalsgaard and Riccardo Fusaroli and Peter Juel Henrichsen and Rasmus Hvingelby and Andreas Kirkedal and Alex Speed Kjeldsen and Claus Ladefoged and Finn Årup Nielsen and Jens Madsen and Malte Lau Petersen and Jonathan Hvithamar Rystrøm and Daniel Varab},
  year = 2021,
  booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics},
  publisher = {NEALT}
}
In a software product, tool, or service:
Denne service er lavet med data fra The Danish Gigaword Corpus
This service was made with data from The Danish Gigaword Corpus
That’s all we ask in return for our work; no money, no signed agreement, no royalties - just acknowledgment. We hope you think that’s fair.
If you cannot acknowledge the project like this, you are not licensed to use the data.
We’re interested in how DAGW is used; please contact us if you train a model over it or build a tool from it.
The project is managed by Leon Derczynski (ld@itu.dk, PI) and Manuel R. Ciosici (manuelc@isi.edu, Co-I).
Background image of Henne Kirkeby by Sven Huls