Text corpus

The dlexDB lexical database is based on the reference corpus of the German language compiled by the Digital Dictionary of the German Language (DWDS). The corpus consists of documents whose publication dates are spread almost evenly over the ten decades of the 20th century. It is distributed across genres as follows:

  • fiction approx. 28%
  • newspapers approx. 27%
  • scientific publications approx. 23%
  • functional literature approx. 21%

The DWDS corpus contains approximately 100 million running words (tokens) and approximately 2.3 million distinct words (types).
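The distinction between tokens and types can be illustrated with a minimal Python sketch. It assumes naive whitespace tokenization and case-sensitive comparison; the function name and the sample sentence are hypothetical and do not reflect how dlexDB or DWDS actually tokenize text. The last lines compute the type-token ratio implied by the approximate corpus figures given above.

```python
def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are whitespace-separated strings and types are
    case-sensitive distinct tokens. Real corpus tokenization is more involved.
    """
    tokens = text.split()
    types = set(tokens)
    return len(tokens), len(types)


# Hypothetical sample sentence: "Hund" and "der" are repeated.
sample = "der Hund sieht den Hund und der Hund bellt"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 9 tokens, 6 types

# Type-token ratio implied by the approximate corpus figures in the text.
corpus_tokens = 100_000_000
corpus_types = 2_300_000
corpus_ttr = corpus_types / corpus_tokens
print(f"{corpus_ttr:.3f}")  # 0.023
```

On these rounded figures, each type occurs about 43 times on average, although real word frequencies are highly skewed.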

Detailed information on the DWDS corpus is available from DWDS (in German).