LIS 4/5693: Information Retrieval and Text Mining
Planned balance: Example of British National Corpus (BNC)
https://www.english-corpora.org/bnc/
Length of Corpus
Token
Token vs Type
The term “token” refers to the total number of words in a text, corpus, etc., regardless of how often they are repeated.
The term “type” refers to the number of distinct words in a text, corpus, etc.
The sentence “a good food is a food that you like” contains nine tokens but only seven types, because “a” and “food” each occur twice.
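The token/type distinction can be checked with a few lines of Python. This is a minimal sketch, assuming that words are separated by whitespace and that case is ignored; the function name count_tokens_and_types is hypothetical, and real corpora usually need a proper tokenizer to handle punctuation.

```python
def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a piece of text.

    Assumption: tokens are whitespace-separated and case-insensitive.
    """
    tokens = text.lower().split()   # every running word, repeats included
    types = set(tokens)             # distinct words only
    return len(tokens), len(types)


sentence = "a good food is a food that you like"
n_tokens, n_types = count_tokens_and_types(sentence)
print(n_tokens, n_types)  # prints: 9 7
```

The same two counts give the type-token ratio (types divided by tokens), here 7/9, which is one simple way to compare lexical variety across texts of similar length.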
Collecting samples of speech
Collecting written samples
Permission
Transcribing Speech
Transcribing Speech (Cont.)
Markup
Storage
Access
