LIS 4/5693: Information Retrieval and Text Mining
A corpus can be a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or task
Representativeness is a defining feature of a corpus
“... assembled with particular purposes in mind, and are often assembled to be (informally speaking) *representative* of some language or text type” (Leech 1992: 116)
“... explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 1996)
“... well-organized collection of data” (McEnery 2003)
“... explicit design criteria” (Tognini-Bonelli 2001: 2)
“... explicit design criteria for a specific purpose” (Atkins et al 1992)
Texts selected and put together “in a principled way” (Johansson 1998: 3)
A corpus is thought to be representative of the language variety it is supposed to represent (Leech 1991)
Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)
The representativeness of general corpora and of specialized (domain/genre-specific) corpora is achieved and measured in different ways
External vs. internal criteria: situational vs. linguistic perspectives
External criteria are defined situationally, irrespective of the distribution of linguistic features
Internal criteria are defined linguistically, taking into account the distribution of such features
It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data
Time? If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston, 2002)
Relevance of permanence in corpus design actually depends on how we view a corpus - a static or dynamic language model
Static model: sample corpora (nearly all existing corpora)
Dynamic model: monitor corpora
Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination (Sinclair, 2005)
The acceptable balance is determined by the intended use – your research questions
Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them (Sinclair, 2005)
The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components (Sinclair, 2005)
It would be short-sighted indeed to wait until one can scientifically balance a corpus before starting to use one, and hasty to dismiss the results of corpus analysis as ‘unreliable’ or ‘irrelevant’ because the corpus used cannot be proved to be ‘balanced’ (Atkins et al, 1992: 6)
Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building
Population (language/variety) vs. sample (corpus)
Corpus representativeness and balance rely heavily on sampling
Full texts or text segments?
Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events (Sinclair 2005)
|--> Good for studying textual organization
A full-text corpus may be inappropriate or problematic
Text initial, middle or end chunks?
In stratified random sampling, how many samples should be taken for each category?
In proportion to their frequencies and/or weights in the target population, in order for the resulting corpus to be considered representative
Constant sample size: ~ 2,000 words
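The sampling decisions above (how many samples per category, which part of a text to take, and a constant sample size) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not part of the course material: the category names and population sizes are hypothetical, `allocate_samples` uses largest-remainder rounding as one possible way to keep the total fixed, and `take_chunk` simply slices a token list at the start, middle or end.

```python
import random


def allocate_samples(weights, total_samples):
    """Split total_samples across categories in proportion to their weights.

    Uses largest-remainder rounding (an illustrative choice) so that the
    per-category counts always add up to total_samples.
    """
    total_weight = sum(weights.values())
    quotas = {c: total_samples * w / total_weight for c, w in weights.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = total_samples - sum(counts.values())
    # Give the remaining samples to the categories with the largest fractions.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for category in by_remainder[:leftover]:
        counts[category] += 1
    return counts


def stratified_sample(texts_by_category, counts, seed=0):
    """Randomly draw counts[c] texts from each category (stratum) c."""
    rng = random.Random(seed)
    return {
        c: rng.sample(texts_by_category[c], min(n, len(texts_by_category[c])))
        for c, n in counts.items()
    }


def take_chunk(tokens, size=2000, position="initial"):
    """Return a constant-size chunk from the start, middle or end of a text."""
    if len(tokens) <= size:
        return list(tokens)  # text is shorter than one sample: keep it whole
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError("position must be 'initial', 'middle' or 'end'")
    return list(tokens[start:start + size])


if __name__ == "__main__":
    # Hypothetical population: number of texts available per category.
    population = {"press": 500, "fiction": 300, "academic": 200}
    counts = allocate_samples(population, total_samples=10)
    print(counts)  # {'press': 5, 'fiction': 3, 'academic': 2}

    # Hypothetical text identifiers standing in for real documents.
    texts = {c: [f"{c}_{i}" for i in range(n)] for c, n in population.items()}
    chosen = stratified_sample(texts, counts)
    print({c: len(ids) for c, ids in chosen.items()})
```

With proportional allocation the category sizes in the sample mirror those of the target population. Fixing the chunk size at about 2,000 words keeps samples comparable across texts, at the cost of losing whole-text organization.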
purpose for which it is intended as well as a number of practical considerations
General/reference vs. specialized corpora
Synchronic vs. diachronic corpora
Static/sample vs. dynamic/monitor corpora