January 26 Susan Mandala (Royal Holloway, University of London)
Discourse Coding and Spoken Corpora: The State of the Field

In this paper, I review the current state of discourse coding in machine-readable collections of spoken data. While reviews of corpus linguistics and corpus-based research typically begin with assertions that the field is biased towards written sources, I show that this may no longer be the case. I also show that while pragmatic coding schemes to support these collections have been and continue to be developed, they are largely based on lexical or syntactic units such as anaphors or modal verbs. Few pragmatic coding schemes have been developed for tagging units at or above the level of the conversational act. The lack of discourse coding at this level has led to a lexical and syntactic bias in corpus-based discourse research. While there are numerous studies on items such as discourse markers, back-channel devices and fixed expressions, there are relatively few corpus-based studies on conversational acts, moves and exchanges.

While discourse coding schemes are not numerous, a few have been put forward. Proposals have come primarily from two disciplines: natural human-human discourse studies and human-machine discourse studies. Such schemes have been criticised on the grounds that they are not theory-neutral and are not informed by widespread native-speaker agreement on how the minimal unit of analysis is to be defined. By comparing the proposed discourse schemes with each other and with other schemes within the domain of discourse analysis, I demonstrate that while discourse coding is not at this stage theory-neutral, there is in fact a large degree of consensus concerning the minimal unit of analysis (the discourse act) and how it is to be defined (by speech act-type conditions). With this consensus, there is a way forward for standard, general discourse coding of spoken corpora at the level of the conversational exchange.
