Analysing Utterances in Polish Parliament to Predict Speaker's Background

Authors: Przybyla Piotr; Teisseyre Pawel*
Source: Journal of Quantitative Linguistics, 2014, 21(4): 350-376.
DOI: 10.1080/09296174.2014.944330

Abstract

In this study we use transcripts of the Sejm (Polish parliament) to predict a speaker's background: gender, education, party affiliation and birth year. We create learning cases consisting of 100 utterances by the same author and, using rich multi-level annotations of the source corpus, extract a variety of features from them. They are either text-based (e.g. mean sentence length, percentage of long words or frequency of named entities of certain types) or word-based (unigrams and bigrams of surface forms, lemmas and interpretations). Next, we apply general-purpose feature selection, regression and classification algorithms and obtain results well above the baseline (97% accuracy for gender, 95% for education, 76-88% for party). A comparative study shows that random forest and k-nearest neighbours classifiers usually outperform other methods commonly used in text mining, such as support vector machines and the naive Bayes classifier. The evaluation experiments help to understand how these solutions deal with such sparse and high-dimensional data and which of the considered traits influence the language the most. We also address difficulties caused by some properties of Polish that are also typical of other Slavonic languages.
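The feature-extraction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the naive regex-based tokenisation and sentence splitting, the 9-character threshold for a "long" word, and the choice of consecutive non-overlapping cases (dropping an incomplete remainder) are all assumptions made for illustration. Only surface forms are handled; lemmas and interpretations would need the corpus annotations.

```python
import re
from collections import Counter

CASE_SIZE = 100        # utterances per learning case, as stated in the abstract
LONG_WORD_MIN_LEN = 9  # hypothetical threshold; the abstract does not define "long"


def make_cases(utterances, size=CASE_SIZE):
    """Group one speaker's utterances into consecutive cases of `size`.

    Assumption: cases do not overlap and an incomplete remainder is dropped.
    """
    full = len(utterances) - len(utterances) % size
    return [utterances[i:i + size] for i in range(0, full, size)]


def tokenize(text):
    """Naive surface-form tokeniser: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?'."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def text_features(case):
    """Two of the text-based features named in the abstract."""
    sentences = [s for u in case for s in split_sentences(u)]
    words = [w for s in sentences for w in tokenize(s)]
    if not sentences or not words:
        return {"mean_sentence_length": 0.0, "pct_long_words": 0.0}
    long_words = sum(len(w) >= LONG_WORD_MIN_LEN for w in words)
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "pct_long_words": 100.0 * long_words / len(words),
    }


def ngram_counts(case, n):
    """Count surface-form n-grams (n=1 unigrams, n=2 bigrams) per utterance."""
    counts = Counter()
    for utterance in case:
        tokens = tokenize(utterance)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

Each learning case then yields one feature vector (the text-based values plus the n-gram counts), which could be passed to any general-purpose classifier. Note that in this sketch bigrams may span sentence boundaries within an utterance.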

  • Publication date: 2014