Regular expressions
Text fields in dlexDB
There are five categories of text fields in dlexDB:
- verbatim citations from the corpus (Type, Type bigram, Type trigram, Character, Character bigram, Character trigram)
- downcased (DC) citations from the corpus (Type DC, Type bigram DC, Type trigram DC, Character DC, Character bigram DC, Character trigram DC)
- representations of linguistic analyses (Syllables)
- linguistic material derived from the corpus (not necessarily occurring there verbatim) (Syllable, Lemma)
- other codes (PoS tag)
All of these text fields can be queried via regular expressions. To do so, enter your query into the input field of the respective filter and mark it as a regular expression by enclosing it in a pair of slashes (e.g., /gen/).
Examples:
- /gen/: Word must contain gen at any position; e.g., genug, irgendwo, morgen, gen
- /^gen/: Word must start with gen; e.g., genug, gen. The special character ^ marks the beginning of a word.
- /gen$/: Word must end with gen; e.g., morgen, gen. The special character $ marks the end of a word.
When you query a text field that is stored case-sensitively and the Ignore case checkbox is checked, the characters in your regular expression are matched case-insensitively as well.
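The behaviour of the anchors and of the Ignore case option can be tried out outside dlexDB. The following is a minimal sketch in Python; it assumes that a query wrapped in slashes is interpreted as a regular expression, and it uses Python's `re` module as a stand-in for the database's own regex engine. The helper `matches` and the word list are hypothetical and not part of dlexDB.

```python
import re

def matches(query, word, ignore_case=False):
    """Hypothetical helper: apply a /.../ query to a single word."""
    if not (len(query) >= 2 and query[0] == "/" and query[-1] == "/"):
        raise ValueError("expected a query of the form /.../")
    # Assumption: a checked Ignore case box corresponds to re.IGNORECASE.
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(query[1:-1], word, flags) is not None

words = ["genug", "irgendwo", "morgen", "gen", "Genug"]

contains = [w for w in words if matches("/gen/", w)]    # gen at any position
starts = [w for w in words if matches("/^gen/", w)]     # gen at the beginning
ends = [w for w in words if matches("/gen$/", w)]       # gen at the end
starts_ic = [w for w in words if matches("/^gen/", w, ignore_case=True)]

print(contains, starts, ends, starts_ic)
```

Without the case-insensitive flag, Genug is not matched by /^gen/; with it, the capitalised form is matched as well.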
Full regular expression syntax
dlexDB supports most of the so-called extended regular expression syntax as described in Spencer, 2007. The most common operators are:
- /gen/: Word must contain gen at any position; e.g., genug, irgendwo, morgen, gen
- /^gen/: Word must start with gen; e.g., genug, gen. The special character ^ marks the beginning of a word.
- /gen$/: Word must end with gen; e.g., morgen, gen. The special character $ marks the end of a word.
- /Üb.*ung/: Word contains Üb, followed by any number (even zero) of arbitrary characters, followed by ung; e.g., Überlegung, Übung
- /Üb.+ung/: Word contains Üb, followed by any number (but at least one) of arbitrary characters, followed by ung; e.g., Überlegung (but not: Übung)
- /R.ck/: Word contains R, followed by exactly one arbitrary character, followed by ck; e.g., Reck, Rock, Ruck
- /R[eo]ck/: Word contains R, followed by either e or o, followed by ck; e.g., Reck, Rock
- /(Ober|Unter)ammergau/: Word contains either Ober or Unter, followed by ammergau; finds Oberammergau and Unterammergau
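The operators described in this list can be checked against a small word list. The sketch below uses Python's `re` module, which is not the engine dlexDB uses but treats these particular operators (`.`, `*`, `+`, `[...]`, `|`, and grouping) the same way. The patterns are written out here from the verbal descriptions above and are therefore assumptions; the helper `find` and the word list are hypothetical.

```python
import re

words = ["Überlegung", "Übung", "Reck", "Rock", "Ruck",
         "Oberammergau", "Unterammergau"]

def find(pattern):
    """Hypothetical helper: return all words in which the pattern occurs."""
    return [w for w in words if re.search(pattern, w)]

zero_or_more = find(r"Üb.*ung")             # any number of characters, even zero
one_or_more = find(r"Üb.+ung")              # at least one character in between
any_single = find(r"R.ck")                  # exactly one arbitrary character
char_class = find(r"R[eo]ck")               # either e or o
alternation = find(r"(Ober|Unter)ammergau") # either Ober or Unter

for name, result in [("zero_or_more", zero_or_more),
                     ("one_or_more", one_or_more),
                     ("any_single", any_single),
                     ("char_class", char_class),
                     ("alternation", alternation)]:
    print(name, result)
```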
Notes
Please note that there are a few multi-word types in dlexDB in which spaces have been replaced by underscores (e.g., New_York). In type bigrams and type trigrams, by contrast, spaces separate the constituent types from each other.
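The distinction between underscores and spaces matters when writing patterns. As a rough illustration, again using Python's `re` module in place of the dlexDB engine, and with invented sample types and bigrams that are not taken from the database:

```python
import re

# Hypothetical sample data: single types and type bigrams.
types = ["New_York", "York", "New"]
bigrams = ["in New_York", "die Übung", "die Stadt", "New_York ist"]

# Multi-word types contain an underscore instead of a space.
multiword = [t for t in types if re.search(r"_", t)]

# In bigrams, the space marks the boundary between the two types.
first_is_die = [b for b in bigrams if re.search(r"^die ", b)]
second_is_new_york = [b for b in bigrams if re.search(r" New_York$", b)]

print(multiword, first_is_die, second_is_new_york)
```

Anchoring the pattern next to the space makes it possible to restrict a match to the first or the second type of a bigram.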
Current version
- 0.3
- New tables: all measures in case-insensitive variant.