Danish Gigaword

A billion-word corpus of Danish text, freely distributed with attribution.

Introduction

It’s hard to develop good tools for processing Danish with computers when no large and wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. DAGW is a project of the IT University of Copenhagen, with contributions from over a dozen other universities and businesses in Denmark; you can read the official ITU press release here. This is the homepage for the project. The general goals are to create a dataset that is:

representative;
accessible;
a suitable common starting point for Danish NLP models.

The corpus is managed and communicated in English so that the world beyond Denmark can also use the resource.

Download

Danish Gigaword is available via the IT University of Copenhagen:

dagw_v1.0-release.zip (2.2 GiB; md5 1eeca465f0ba00e8b03ed234a768c3ff)
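Since an md5 checksum is listed with the download, it can be worth checking the archive before unpacking it. The following is a minimal sketch in Python, assuming the zip file has been saved under its listed name in the current directory; the function names md5_of_file and verify_download are illustrative and not part of any DAGW tooling.

```python
import hashlib
from pathlib import Path

# Checksum as published in the download listing.
EXPECTED_MD5 = "1eeca465f0ba00e8b03ed234a768c3ff"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use small for a multi-gigabyte archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path: the archive saved under its listed file name.
    archive = Path("dagw_v1.0-release.zip")
    if not archive.exists():
        print(f"{archive} not found; download it first.")
    elif verify_download(archive):
        print("Checksum matches.")
    else:
        print("Checksum mismatch; the download may be incomplete or corrupted.")
```

An md5 match only indicates that the file arrived intact; it is not a cryptographic guarantee of authenticity.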

Documentation

Read the paper about The Danish Gigaword Corpus.

License & Reference

If you use the data, you MUST acknowledge it. The license is CC BY 4.0 (Creative Commons Attribution 4.0).

Sample attributions:

In a press release:

Modellen er præ-trænet på et datasæt fra The Danish Gigaword Project (https://gigaword.dk), der er udviklet af forskere fra IT-Universitetet i København

The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen

In academic writing:

Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

@inproceedings{dagw,
 title = {{The Danish Gigaword Corpus}},
 author = {Leon Derczynski and Manuel R. Ciosici and Rebekah Baglini and Morten H. Christiansen and Jacob Aarup Dalsgaard and Riccardo Fusaroli and Peter Juel Henrichsen and Rasmus Hvingelby and Andreas Kirkedal and Alex Speed Kjeldsen and Claus Ladefoged and Finn Årup Nielsen and Jens Madsen and Malte Lau Petersen and Jonathan Hvithamar Rystrøm and Daniel Varab},
 year = 2021,
 booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics},
 publisher = {NEALT}
}

In a software product, tool, or service:

Danish Gigaword Corpus: license - homepage

Denne service er lavet med data fra The Danish Gigaword Corpus

This service was made with data from The Danish Gigaword Corpus

That’s all we ask in return for our work; no money, no signed agreement, no royalties - just acknowledgment. We hope you think that’s fair.

If you cannot acknowledge the project like this, you are not licensed to use the data.

Models using Danish Gigaword

Ælæctra - A Step Towards More Efficient Danish Natural Language Processing. huggingface.co/Maltehb/aelaectra-danish-electra-small-cased

We’re interested in how DAGW is used; please contact us if you train a model over it.

Tools using Danish Gigaword

A&ttack and Ha&te by Analyse & Tal
Implementation in Sketch Engine

We’re interested in how DAGW is used; please contact us if you build a tool from it.

Press Coverage

Heste-nettet kan blive grundlag for kunstig intelligens på dansk - Danmarks Radio
Hestenet, tørstige prompts og chatbot, der kan høre og se - Prompt
Danish AI Trained on Data From a Web Forum About Horses - Bloomberg
ChatGPT blev trænet af danske hestetøser
I Danmark har vi vores egne grundmodeller til dansk sprog. Det udvikles dog udelukkende af ihærdige frivillige, som gør et fantastisk arbejde. - Børsen
Featured in the Foreign Ministry’s “Invest in Denmark”
A Danish billion-word corpus appears - Import AI
Danish Gigaword Project - et historisk stort dansk tekstkorpus - Sprogteknologi.dk / Digitaliseringsstyrelsen
ITU led project will make automated translation more reliable - ITU
Superalgoritme kortlægger det danske had og afslører yndlingsofrene på Facebook - Politiken
Sprogmodellen Ælæctra vil forbedre dansk sprogteknologi på en klimavenlig måde - KMD
This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why. - Morning Brew

Contact

The project is managed by Leon Derczynski (ld@itu.dk, PI) and Manuel R. Ciosici (manuelc@isi.edu, Co-I).

Credits

Background image of Henne Kirkeby by Sven Huls