Style Detection and Authorship Attribution. Presented at the International Conference on Literature, Languages & Linguistics, Athens, Greece, July 28-31, 2008.
The computer now allows literary scholars to enter an area that was previously unheard of, namely computational literary statistics, or the statistical analysis of style. It has proven invaluable in attributing authorship to disputed and pseudonymous works. In the past, attributing such works to authors was a matter of connoisseurship, with literary scholars relying on their intuition to reach conclusions. This kind of identification was often subjective and impressionistic and was not always based on clearly defined quantities. Moreover, views would often conflict with one another, so this subjective way of measuring style did not appeal to all researchers. In recent years, the new field of style analysis, also known as statistical stylometry or "authorship attribution", has allowed scholars to distinguish the styles of authors through the use of statistics. Since then, techniques in authorship attribution have improved, due mainly to the wider availability of computer-accessible corpora. This, in turn, has made the automatic inference of authorship possible, and research in this area has expanded tremendously. While numerous measures of style exist (such as word length, number of syllables per word, sentence length, and the numbers of words that occur once or twice in the corpus), some of them have proved too unreliable to be of much use to any serious researcher in this field. In this article I shall discuss some of these methodologies and reveal some of their pitfalls. I shall then discuss the one methodology that seems to be the most promising of all, namely the frequency at which words occur in a corpus as an element of style.
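As a rough illustration of the measures listed above, the following Python sketch computes mean word length, mean sentence length, the number of words occurring once and twice, and the relative frequencies of the most common words. It is not the method used in the paper: the function name `style_features`, the `top_n` parameter, and the naive regular-expression tokenization and sentence splitting are all assumptions made for this example, and syllable counting is left out because it needs a pronunciation resource.

```python
import re
from collections import Counter


def style_features(text, top_n=5):
    """Compute a few simple stylometric measures (illustrative sketch only)."""
    # Assumption: a word is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Assumption: sentences end at '.', '!' or '?'; abbreviations will break this.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text contains no words")

    counts = Counter(words)
    total = len(words)
    return {
        "mean_word_length": sum(len(w) for w in words) / total,
        "mean_sentence_length": total / len(sentences),
        # Words that occur exactly once / exactly twice in the text.
        "words_occurring_once": sum(1 for c in counts.values() if c == 1),
        "words_occurring_twice": sum(1 for c in counts.values() if c == 2),
        # Relative frequency of the top_n most common words.
        "relative_frequencies": {
            w: c / total for w, c in counts.most_common(top_n)
        },
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog sat on the mat."
    for name, value in style_features(sample).items():
        print(name, value)
```

In practice such profiles would be computed for texts of known authorship and compared with the profile of the disputed text; how the comparison is made is a separate design choice that this sketch does not cover.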