Danish Gigaword

A billion-word corpus of Danish text, freely distributed with attribution.

Introduction

It’s hard to develop good tools for processing Danish with computers when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish with over a billion words. DAGW is a project of the IT University of Copenhagen (ITU), with contributions from over a dozen other universities and businesses in Denmark; ITU has published an official press release about it. This is the homepage for the project. The general goals are to create a dataset that is:

  1. representative;
  2. accessible;
  3. a suitable common starting point for Danish NLP models.

The corpus is managed and communicated in English so that the world beyond Denmark can also use the resource.

Download

Danish Gigaword is available via the IT University of Copenhagen:

dagw_v1.0-release.zip (2.2 GiB; md5 1eeca465f0ba00e8b03ed234a768c3ff)
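
After downloading, it can be worth checking the archive against the md5 listed above before unpacking it. The following is a minimal sketch in Python using only the standard library; it assumes the file was saved under its original name in the current working directory, and the helper names (md5_of_file, matches_expected) are illustrative, not part of any DAGW tooling.

```python
import hashlib
from pathlib import Path

# Values copied from the download listing above.
ARCHIVE_NAME = "dagw_v1.0-release.zip"
EXPECTED_MD5 = "1eeca465f0ba00e8b03ed234a768c3ff"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected hex string, ignoring case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive sits in the current working directory.
    archive = Path(ARCHIVE_NAME)
    if not archive.exists():
        print(f"{archive} not found; download it first")
    elif matches_expected(archive):
        print("checksum OK")
    else:
        print("checksum mismatch; the download may be incomplete or corrupted")
```

A mismatch most likely means an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.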

Documentation

Read the paper about The Danish Gigaword Corpus.

License & Reference

If you use the data, you MUST acknowledge it. The license is CC BY 4.0 (Creative Commons Attribution 4.0 International).

Sample attributions:

In a press release:

Modellen er præ-trænet på et datasæt fra The Danish Gigaword Project (https://gigaword.dk), der er udviklet af forskere fra IT-Universitetet i København

The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen

In academic writing:

Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

@inproceedings{dagw,
 title = {{The Danish Gigaword Corpus}},
 author = {Leon Derczynski and Manuel R. Ciosici and Rebekah Baglini and Morten H. Christiansen and Jacob Aarup Dalsgaard and Riccardo Fusaroli and Peter Juel Henrichsen and Rasmus Hvingelby and Andreas Kirkedal and Alex Speed Kjeldsen and Claus Ladefoged and Finn Årup Nielsen and Jens Madsen and Malte Lau Petersen and Jonathan Hvithamar Rystrøm and Daniel Varab},
 year = 2021,
 booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics},
 publisher = {NEALT}
}

In a software product, tool, or service:

Danish Gigaword Corpus: license - homepage

Denne service er lavet med data fra The Danish Gigaword Corpus

This service was made with data from The Danish Gigaword Corpus

That’s all we ask in return for our work; no money, no signed agreement, no royalties - just acknowledgment. We hope you think that’s fair.

If you cannot acknowledge the project like this, you are not licensed to use the data.

Models using Danish Gigaword

We’re interested in how DAGW is used; please contact us if you train a model over it.

Tools using Danish Gigaword

We’re interested in how DAGW is used; please contact us if you build a tool from it.

Press Coverage

Contact

The project is managed by Leon Derczynski (ld@itu.dk, PI) and Manuel R. Ciosici (manuelc@isi.edu, Co-I).

Credits

Background image of Henne Kirkeby by Sven Huls