NOAH’s Corpus
We provide PoS-tagged Swiss German text as XML files, together with PoS-tagger models trained on NOAH’s Corpus.
Content
NOAH’s Corpus of Swiss German Dialects contains five XML files. Each file contains Swiss German texts of a different genre:
- blick: newspaper articles from “Blick am Abig”, Zurich edition, No. 97, 28 May 2013
- blogs: “BlogSpot” blogs, extracted 31 January 2014 (blog1, blog2, blog3)
- swatch: SWATCH annual report 2012
- schobinger: extracts from crime novels by Viktor Schobinger
- wiki: articles from the Alemannic Wikipedia, extracted 10 April 2012
Tagset
Our basic tagset is the Stuttgart-Tübingen-TagSet (STTS), the standard tagset for German. Because the Swiss German dialects differ from Standard German, we made two additions: the new tag PTKINF, and a “+” sign that can be added to the PoS tag of any merged word.
The new tag PTKINF marks an infinitive particle, such as “go” in “ich mues go poschte”. This construction is common in Swiss German dialects and therefore widely analysed, but it has no corresponding word or construction in Standard German.
To handle merged words, we introduced the “+” sign, which can be added to any PoS tag. Since Swiss German has no official spelling rules, writers can join words freely. Instead of splitting such merged words, we tag them with the STTS tag of the main part and add a plus sign to show that the word consists of more than one simple word. Some word sequences are commonly merged, but less common combinations also occur, depending on the writer’s preferences.
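When working with the tags programmatically, it can help to separate the base tag from the merged-word marker. The following is a minimal sketch in Python, assuming the plus sign is written at the end of the tag string (for example “VAFIN+”). The helper name parse_tag and the sample tokens with their tags are illustrative assumptions, not taken from the corpus files.

```python
from typing import NamedTuple


class ParsedTag(NamedTuple):
    base: str     # STTS tag (or PTKINF) of the main part of the word
    merged: bool  # True if the tag carries the "+" marker


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag into its base tag and a merged-word flag.

    Assumption: the "+" marker is the last character of the tag string.
    """
    if len(tag) > 1 and tag.endswith("+"):
        return ParsedTag(tag[:-1], True)
    return ParsedTag(tag, False)


# Hypothetical (word, tag) pairs for illustration only.
tagged = [
    ("ich", "PPER"),
    ("mues", "VMFIN"),
    ("go", "PTKINF"),
    ("poschte", "VVINF"),
    ("hani", "VAFIN+"),  # assumed merged form of "han" + "i"
]

# Collect the words whose tag marks them as merged.
merged_words = [word for word, tag in tagged if parse_tag(tag).merged]
```

Keeping the flag separate from the base tag means that tools expecting plain STTS tags can still be fed the base value, while the information that a word was merged is not lost.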