The Danish Gigaword project
It’s hard to develop good tools for Danish NLP when no large and wide-coverage corpus is readily available. To address this, we’re building a gigaword corpus with over a billion words (10^9). This is the homepage for the project. The overriding goals are to create a dataset that is 1. representative; 2. accessible; 3. a suitable “fixed point” for Danish NLP.
To make the corpus accessible, all parts of it must be openly licensed for free distribution. Suitable licenses include the Creative Commons public domain dedication (CC0) or CC-BY.
Details on the corpus are maintained at arXiv:2005.03521.
Danish Gigaword should cover variation along a variety of dimensions, including:
This is an intentional and strong departure from early editions of English Gigaword, which focused on newswire. Criterion (1), representativity, requires going beyond newswire; this is necessary if the corpus is to cover enough words and language uses to be general-purpose.
We anticipate an initial release of the corpus in early 2021.
For info about joining the project, contact Leon Strømberg-Derczynski - ld@itu.dk