Identifying idiolect in forensic authorship attribution: an n-gram textbite approach

Alison Johnson, David Wright

Abstract


Forensic authorship attribution is concerned with identifying authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist “approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [...], their own idiolect” (Coulthard, 2004: 31). However, given the difficulty in empirically substantiating a theory of idiolect, there is growing concern in the field that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams, two- to six-word chunks of language, similar to the popular notion of soundbites: small segments of no more than a few seconds of speech that journalists are able to recognise as having news value and which characterise the important moments of talk. The soundbite offers an intriguing parallel for authorship attribution studies, with the following question arising: looking at any set of texts by any author, is it possible to identify ‘n-gram textbites’, small textual segments that characterise that author’s writing, providing DNA-like chunks of identifying material? Drawing on a corpus of 63,000 emails and 2.5 million words written by 176 employees of the former American energy corporation Enron, a case study approach is adopted, first showing through stylistic analysis that one Enron employee repeatedly produces the same stylistic patterns of politely encoded directives in a way that may be considered habitual. Then a statistical experiment with the same case study author finds that word n-grams can assign anonymised email samples to him with success rates as high as 100%. This paper argues that, if sufficiently distinctive, these textbites are able to identify authors by reducing a mass of data to key segments that move us closer to the elusive concept of idiolect.
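To make the idea of word n-grams concrete, the following is a minimal Python sketch of how two- to six-word chunks might be extracted from a text and how two texts might be compared by the n-grams they share. The abstract does not specify the statistical measure used in the experiment, so the Jaccard overlap here is an assumption chosen for illustration only, as are the function names `word_ngrams` and `jaccard` and the whitespace tokenisation.

```python
from collections import Counter


def word_ngrams(text, n_min=2, n_max=6):
    """Count word n-grams of length n_min..n_max (hypothetical helper).

    Tokenisation is a plain lowercase whitespace split, which is an
    assumption made for brevity, not the paper's procedure.
    """
    tokens = text.lower().split()
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1
    return grams


def jaccard(grams_a, grams_b):
    """Share of distinct n-grams common to both texts (assumed measure)."""
    set_a, set_b = set(grams_a), set(grams_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Invented sample sentences, not taken from the Enron corpus.
known = word_ngrams("please let me know if you have any questions")
disputed = word_ngrams("please let me know what you think")
similarity = jaccard(known, disputed)
```

Under this sketch, a disputed sample would be assigned to whichever candidate author's known writing yields the highest similarity score. Recurrent chunks such as "please let me know" are the kind of segment the abstract calls a textbite.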



ERIH PLUS

eISSN 2183-3745

 
