CAN STATISTICAL LANGUAGE MODELS BE USED TO DISTINGUISH BETWEEN DIFFERENT GENRES OF NEWS?

Authors
S Dreibe, G Hunter
Conference

Statistical Language Models (SLMs) have found widespread applications in many fields, including Automatic Speech Recognition systems, Automated Translation systems, and Cryptographic Analysis. It has previously been observed that lexical unigram, bigram and trigram distributions, which form the foundations of such SLMs, depend heavily on the type of data from which they were acquired – popular or serious literature, news, non-fiction text, formal speeches and structured or spontaneous dialogue. It has also been proposed that the lexical distributions depend heavily on the theme or topic within each of the above styles of language. In this paper, we investigate the extent to which such distributions vary between two different types of news – business and sports – within a dataset compiled by the BBC. We discuss our findings, particularly focusing on whether such models could form the basis of an automated genre or topic detector or classifier for news text or broadcasts.
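
As a rough illustration of the kind of comparison described above, the following minimal sketch estimates unigram, bigram and trigram relative-frequency distributions for two token sequences and measures how far apart they are. The choice of Jensen-Shannon divergence as the distance, the function names, and the two toy sentences are assumptions made purely for illustration; they are not taken from the paper or from the BBC dataset.

```python
from collections import Counter
from math import log2


def ngram_distribution(tokens, n):
    """Relative frequencies of the n-grams found in a list of tokens."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:  # fewer than n tokens: no n-grams to count
        return {}
    return {gram: c / total for gram, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions.

    Assumed here as one possible distance; the paper does not specify it.
    """
    def kl_to_mixture(a):
        # KL divergence from distribution `a` to the equal mixture of p and q.
        return sum(
            pa * log2(pa / (0.5 * (p.get(g, 0.0) + q.get(g, 0.0))))
            for g, pa in a.items()
            if pa > 0
        )
    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Made-up sentences standing in for the two genres (hypothetical data).
business = "shares rose as the bank reported higher profits".split()
sport = "the striker scored as the team won the match".split()

for n in (1, 2, 3):
    d = js_divergence(ngram_distribution(business, n), ngram_distribution(sport, n))
    print(f"{n}-gram JS divergence: {d:.3f}")
```

With realistic corpora, a larger divergence between the two genres than between two samples of the same genre would suggest that the distributions carry genre information; a real system would also need tokenisation, smoothing and far more data than this sketch uses.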