NOAH’s Corpus

We provide Swiss German text annotated with part-of-speech (PoS) tags as XML files, as well as PoS-tagger models trained on NOAH’s Corpus.
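As a rough illustration of how the annotated XML could be read, here is a minimal Python sketch. The element names (`document`, `s`, `w`) and the `pos` attribute are assumptions made for this example, not the documented schema, and the sample sentence and its tags are illustrative; check an actual corpus file and adjust the names before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names (document, s, w) and the "pos"
# attribute are assumptions, not the documented corpus schema.
SAMPLE = """<document>
  <s>
    <w pos="PPER">ich</w>
    <w pos="VMFIN">mues</w>
    <w pos="PTKINF">go</w>
    <w pos="VVINF">poschte</w>
  </s>
</document>"""


def read_sentences(root):
    """Return each sentence as a list of (word, tag) pairs."""
    sentences = []
    for sent in root.iter("s"):
        pairs = [(w.text or "", w.get("pos", "")) for w in sent.iter("w")]
        sentences.append(pairs)
    return sentences


root = ET.fromstring(SAMPLE)
sentences = read_sentences(root)
print(sentences[0])
```

To read from a file instead of a string, `ET.parse(path).getroot()` yields a root element that can be passed to `read_sentences` in the same way.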


NOAH’s Corpus of Swiss German Dialects contains five XML files. Each file contains Swiss German texts of a different genre.


As the basic tagset we use the Stuttgart-Tübingen-TagSet (STTS), which is the standard for German. Because of the differences between Standard German and the Swiss German dialects, we made two additions: the tag PTKINF, and a “+” sign that can be appended to the PoS tag of any merged word.

The newly introduced tag PTKINF marks an infinitive particle, e.g. “go” in “ich mues go poschte”. The construction is common in Swiss German dialects and therefore widely analysed, and it has no corresponding word or construction in Standard German.

To handle merged words, we introduced the “+” sign, which can be appended to any PoS tag. Since Swiss German has no official spelling rules, writers can join words freely. Instead of splitting such words, we tag them with the STTS tag of the main part and append a plus sign to show that the token consists of more than one simple word. Some word sequences are commonly merged, but less common combinations also appear, depending on the writer’s preferences.
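The “+” convention can be handled with a small helper when processing the tags. The following Python sketch splits a tag into its base tag and a merged-word flag; the names `ParsedTag` and `parse_tag` are hypothetical, and the example tokens (including the tagging of “hani” as `VAFIN+`) are illustrative assumptions rather than entries taken from the corpus.

```python
from collections import Counter
from typing import NamedTuple


class ParsedTag(NamedTuple):
    base: str     # the STTS tag (or PTKINF) without the merge marker
    merged: bool  # True if the tag carried a trailing "+"


def parse_tag(tag: str) -> ParsedTag:
    """Split a corpus tag into its base tag and a merged-word flag."""
    tag = tag.strip()
    if len(tag) > 1 and tag.endswith("+"):
        return ParsedTag(tag[:-1], True)
    return ParsedTag(tag, False)


# Illustrative (word, tag) pairs; not taken from the corpus files.
tokens = [
    ("ich", "PPER"),
    ("mues", "VMFIN"),
    ("go", "PTKINF"),
    ("poschte", "VVINF"),
    ("hani", "VAFIN+"),  # assumed tagging for merged "han" + "i"
]

# Count base tags so that "VAFIN" and "VAFIN+" fall into the same bucket.
base_counts = Counter(parse_tag(tag).base for _, tag in tokens)

# Collect the words whose tag marks them as merged.
merged_words = [word for word, tag in tokens if parse_tag(tag).merged]

print(base_counts["VAFIN"])
print(merged_words)
```

Stripping the marker this way lets tools that only know plain STTS tags work with the base tag, while the flag keeps the information that the token consists of more than one word.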