This document is a brief introduction to Birmingham's Bank of English, currently a 450 million word corpus of present-day English with a 56 million word subcorpus intended for teaching. To use the Bank of English facilities, users need an Internet connection and a TELNET application. The practical details of getting connected are explained in the first section (Getting Connected) of this document.
[Table of Contents]
Since the Bank of English is a net-based service, users must have an Internet connection in order to use the corpus software. Most computers at academic institutions are permanently connected to the Internet, but the corpus can of course be accessed from any computer on the net, including computers temporarily connected via a dial-up connection (i.e. a modem).
Telnet is a text-based program used to log in directly to a remote computer and to carry out a text-based interactive session via the Internet. Telnet is often used in library systems and in other contexts where there is no need for screen graphics. Since the Bank of English corpus is all text, there is no need for a full graphical interface to the corpus software, and Telnet has the advantage of being a faster and more powerful method of accessing a remote server than a web browser. We describe here how to connect using the standard Windows telnet program, and for Macintosh users we suggest a Telnet program that can be easily downloaded and installed on your computer.
a) Windows 95/98/NT/2000/ME
In Microsoft Windows from Win95 onwards, the Telnet program comes as a standard program. The easiest way to run it is to click on the "Start" button, then select "Run...". In the window that pops up, type "telnet" into the entry box and click the "OK" button. A new Telnet window will appear. At the cursor, type
One of the most commonly used Macintosh telnet applications is called NCSA Telnet. The most recent non-beta version can be downloaded from the NCSA Telnet for Macintosh Home Page. The application is quite user-friendly.
The login procedure is quite straightforward. Once connected via telnet to titania.bham.ac.uk (as described in a) and b) above), you will be asked for your login name and password. Once you have been accepted by the system, you will be presented with the main menu screen.
The main menu gives users access to the two main tools in the Bank of English software, the interactive corpus access tool and the collocations listings, as well as to some other features and information. The screen should look something like this:
1. Interactive corpus access tool
2. Collocations listings
3. Automatically e-mail your saved data files
6. If the information on your screen seems to be garbled...
7. **NEWS** on WorldWide Web services from Cobuild...
8. Change password
9. Quit and logout
(Enter a number for the required option):
Some of these options will be briefly commented on:
1. The Interactive corpus access tool. This is the actual concordance program that allows you to query the corpus data (taken from one or several subcorpora) in a number of ways. The query language and the selection of subcorpora are described in subsequent sections.
2. Collocations listing. This option lets the user perform collocation analyses, i.e. to find out about what words occur together. This option is described in more detail in the following section.
3. Automatically e-mail your saved data files. Concordances can be saved from the corpus access tool, and email provides an easy way of transferring these files from the Cobuild server to your computer. In order to use this facility, you will have to register by contacting Michelle Devereux, the Bank of English administrator (M.C.M.Devereux@bham.ac.uk). You must specify which email address to use for each username of your subscription.
6. If the information on your screen seems to be garbled... Option 6 gives some information about telnet window size (it should be 80 characters wide to allow for problem-free viewing).
8. Change password.
9. Quit and logout. This option will end the active Bank of English session.
One of the most common uses for corpora and corpus linguistics is to find out about systematic co-occurrence patterns (collocations) of words. A learner might be interested in how the adjective sour is used. When can it be used and when do we use other near-synonyms such as rancid and off? The collocational tool offered in the Bank of English can be helpful here. The typical starting-point is a node word (e.g. sour) and we want to find what words collocate with this node word.
The collocational tool can be reached via option 2 in the main menu. You will first be asked which subcorpora to use. In most cases you will want all the subcorpora, i.e. the whole corpus, and for this you just need to press RETURN. If you want to carry out your analysis on a subset of the full corpus, list the subcorpora you wish to work with. You will then have to give the node word and the span of words to include in the analysis. The span is given in the form L:R (number of words to the left of the node word : number of words to the right of the node word). The default span is 4:4.
You will also have to indicate whether you are interested in raw frequency or statistical significance. The Bank of English software supports two measures of statistical significance: Mutual Information (MI) and t-score. The MI score expresses the extent to which the observed frequency of co-occurrence differs from what we would expect by chance. It does not work very well with very low frequencies. For instance, sour occurs 472 times and puss 31 times in the Bank of English corpus. Since sour and puss co-occur 4 times, this particular collocation gets a very high MI score even though it is rare. The t-score avoids this problem because it also takes frequency into account. To sum up, MI is more likely to give high scores to totally fixed phrases, whereas t-score will yield significant collocates that occur relatively frequently. In most cases, t-score is the more reliable measure.
This screen shows collocational data for sour with a 0:1 span, sorted by MI score. In this way we get information about the word that follows the node word (sour).
As we can see, sour cream is the most common collocation. There are 2438 instances of cream in the corpus and 46 examples of sour cream. The MI-score is 6.78. In other words, the four columns correspond to:
If we had sorted these data according to frequency, the list would turn out quite differently. Function words such as and would score highly. From this screen, RETURN will give you another screen of sorted entries and q will exit. If you choose to exit, you will be given the opportunity to save your concordance and to read a help document on the measures of statistical significance.
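The difference between the two measures can be illustrated with the frequencies quoted in this section. The sketch below uses textbook versions of both scores (MI as the base-2 logarithm of observed over expected co-occurrence, t-score as observed minus expected divided by the square root of observed). These formulas, the function names and the choice of 450 million words as the corpus size are assumptions made for illustration; the Bank of English software may calculate its scores differently, for instance by taking the span into account.

```python
import math

CORPUS_SIZE = 450_000_000  # assumption: the corpus size quoted in the introduction


def expected(node_freq, collocate_freq, corpus_size=CORPUS_SIZE):
    """Co-occurrences expected by chance if the two words were independent."""
    return node_freq * collocate_freq / corpus_size


def mutual_information(observed, node_freq, collocate_freq, corpus_size=CORPUS_SIZE):
    """Textbook MI: log2 of observed over expected co-occurrence."""
    return math.log2(observed / expected(node_freq, collocate_freq, corpus_size))


def t_score(observed, node_freq, collocate_freq, corpus_size=CORPUS_SIZE):
    """Textbook t-score: (observed - expected) / sqrt(observed)."""
    exp = expected(node_freq, collocate_freq, corpus_size)
    return (observed - exp) / math.sqrt(observed)


# Frequencies quoted in the text: (co-occurrences, frequency of sour, frequency of collocate)
pairs = {"sour puss": (4, 472, 31), "sour cream": (46, 472, 2438)}
for name, (obs, f_node, f_coll) in pairs.items():
    mi = mutual_information(obs, f_node, f_coll)
    t = t_score(obs, f_node, f_coll)
    print(f"{name}: MI={mi:.2f} t={t:.2f}")
```

Under these assumptions the rare pair sour puss receives the higher MI value, while sour cream receives the higher t-score, which is the contrast described above.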
The Corpus Access Tool is the main feature of the Bank of English software, and it incorporates several main functions and associated screens. Four main steps can be identified:
We will go through these steps, one at a time.
After choosing option 1 in the main menu, you will be presented with a screen that lets you select the composition of the search corpus. The Bank of English corpus is divided into 11 subcorpora, or text-type categories. In many cases you will use all subcorpora (the whole corpus), but in other cases it may be useful to perform searches on a restricted set of data. If you want to compare American and British English, you might want to search the American books and the British books separately. If you are interested in analysing spoken English, you might select the spoken British English or National Public Radio subcorpora.
The actual selection of subcorpora is straightforward. You are presented with a list of 11 subcorpora. If you want to use all the subcorpora, just press RETURN. Each subcorpus has an abbreviation associated with it, and in order to select one or several subcorpora, you will have to write the abbreviation of each subcorpus. If you select more than one subcorpus, you have to separate the abbreviations by a space. The abbreviations are listed here (and on the select subcorpora screen) together with the full titles and the sizes:
|Abbreviation||Full title||Size (million words)|
|bbc||BBC World Service||2.6|
|npr||National Public Radio||3.1|
Examples of subcorpus selection:
RETURN - gives you access to all subcorpora
npr - gives you access to National Public Radio transcriptions
ukbooks usbooks - gives you access to the UK books and US books subcorpora
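The selection rule (an empty line means everything, otherwise a space-separated list of abbreviations) can be sketched in a few lines. This is only an illustration of the input format, not the server's own code; the function name is hypothetical and the set of abbreviations contains only those that happen to be mentioned in this document, not the full list of 11.

```python
# Abbreviations mentioned in this document only; the real list has 11 entries.
KNOWN_SUBCORPORA = {"bbc", "npr", "ukbooks", "usbooks", "oznews", "brmags"}


def parse_selection(line, known=KNOWN_SUBCORPORA):
    """Return the list of subcorpora selected by one line of input."""
    names = line.split()
    if not names:  # a bare RETURN selects every subcorpus
        return sorted(known)
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(f"unknown subcorpus: {', '.join(unknown)}")
    return names


print(parse_selection(""))                 # all known subcorpora
print(parse_selection("ukbooks usbooks"))  # just the two book subcorpora
```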
When you have finished selecting which subcorpora to work with, you will move to the query screen.
At this stage, enter a word, phrase or pattern to be retrieved. This is a simple yet powerful process, and much of the strength of the Bank of English software lies in the speed and versatility of this system. The most straightforward query is a single word such as "Thatcherite", "edutainment" or "dog". Queries of this simple type will frequently suffice, but there will often be a need for more advanced queries. This section describes the main features of the query language syntax, but it does not give a full description.
Before we continue the discussion of the query language, it may be useful to see the result of entering a simple query. The following image shows you the result of a query for the word edutainment.
The figures on the right-hand side show the low frequency of this word. In total, there are only 11 instances of edutainment, and they occur in the subcorpora as indicated in the list. For instance, there are two occurrences in Australian newspapers (oznews). If we press RETURN here, we will get the actual concordance as illustrated in this figure:
This screen is described in detail in the following section. At this point we will consider a range of queries. The most important building-blocks are:
These building-blocks are not as mysterious as they might at first appear. Actual examples will give you an idea of how they work.
|stupid*||stupid plus all words starting with stupid- (stupidity, stupidness)|
|stupid|stupider|stupidest||any of the alternative forms|
|more|most+stupid||more or most plus stupid|
|stupid+NOUN||stupid followed by a noun|
|help@||all the inflected forms of help (help, helped, helps, helping)|
|snow/VB||snow tagged as a verb|
|a|an+0,1advice||indefinite article followed by either just advice or an intermediate word followed by advice|
|a|an+1,1girl||a or an followed by exactly one word followed by girl|
As we can see here, the building-blocks can be combined to form more complex queries. The tags available in the Bank of English concordance software are listed in the Reference Manual. Since the tagging is automatic (tags are assigned by a computer program), it should not be trusted blindly. There is a tagging error rate of around 5%. Many of the tagging errors are caused by instances such as light in Don't touch the light switch which is likely to be tagged as an adjective. The most important tags are NOUN (noun), VERB (verb), VB (verb in the base form), JJ (adjective), CD (numeral), IN (preposition) and RB (adverb).
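To make the structure of combined queries more concrete, the following sketch splits a query string into its parts: slots separated by +, an optional gap such as 0,1 in front of a slot, alternatives separated by |, and the markers *, @ and /TAG. It is a hypothetical illustration of how the examples above can be read, not the parser used by the Bank of English software, and all function and label names are invented. One simplification to note: any term written entirely in capitals is treated as a tag.

```python
import re

GAP = re.compile(r"^(\d+),(\d+)(.+)$")  # e.g. "0,1advice": 0 to 1 intervening words


def describe(term):
    """Classify one term by the marker it carries."""
    if term.endswith("*"):
        return ("prefix", term[:-1])
    if term.endswith("@"):
        return ("lemma", term[:-1])
    if "/" in term:
        word, tag = term.split("/", 1)
        return ("word+tag", word, tag)
    if term.isupper():  # simplification: all-capital terms are read as tags
        return ("tag", term)
    return ("word", term)


def parse_query(query):
    """Split a query into slots; each slot holds a gap range and its alternatives."""
    slots = []
    for part in query.split("+"):
        part = part.strip()
        gap = (0, 0)
        match = GAP.match(part)
        if match:
            gap = (int(match.group(1)), int(match.group(2)))
            part = match.group(3)
        alternatives = [describe(term) for term in part.split("|")]
        slots.append({"gap": gap, "alternatives": alternatives})
    return slots


for query in ["stupid*", "more|most+stupid", "a|an+0,1advice", "snow/VB"]:
    print(query, "->", parse_query(query))
```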
When several alternatives are specified (either by means of |, * or @), you will be given the opportunity to decide whether you actually want all the forms.
From this screen, you can select and deselect variant forms.
y - select the word at the cursor
n - deselect the word at the cursor
Y - select all words
N - deselect all words
You can use u and d to move up and down the word forms. The frequencies on the right-hand side are those associated with the word form next to the cursor. If you want all the forms, you just need to hit RETURN.
You will be presented with some statistics on the distribution of forms. The following screen appears after submitting a "family+are" query.
As we can see, this construction is most common in Today (8.8 instances per million words). Just as we might expect, it is predominantly in British English that we find collective nouns with plural verbs.
Before the actual concordances are displayed, the program asks you how many concordance lines you wish to be retrieved. If the result of your query amounts to many thousands of hits, and you do not need to view all of them, it may be a good idea to select just 500 or 1000.
After having specified the query and having pressed RETURN a couple of times (to get through some statistics), you will get to the concordance screen. The following screen is the result of the "family+are" query.
This type of presentation is called KWIC (Key-Word In Context). The search word is in the middle, with the surrounding text to the left and to the right. You can see 24 instances at the same time. The status line tells us that there are a total of 197 instances and that the first example (next to the cursor) comes from the oznews (Australian news) subcorpus. There is also a text reference id which can be used to identify the source more exactly, and it turns out that this particular example comes from Sun 5 Feb, 1995.
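For readers who have not met the KWIC layout before, the sketch below shows how such lines can be produced from a list of words: each occurrence of the node word is centred, with a fixed amount of context on either side. The function name, the context width and the sample sentence are invented for illustration and have nothing to do with the server's own implementation.

```python
def kwic(tokens, node, width=30):
    """Return one aligned line per occurrence of `node` in a list of tokens."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != node.lower():
            continue
        left = " ".join(tokens[:i])[-width:]       # last `width` characters before the node
        right = " ".join(tokens[i + 1:])[:width]   # first `width` characters after the node
        lines.append(f"{left:>{width}}  {token}  {right:<{width}}")
    return lines


text = "The family are well . My family are coming today ."  # invented sample text
for line in kwic(text.split(), "family"):
    print(line)
```

Because the left context is right-aligned and padded to a fixed width, the node word starts in the same column on every line, which is what makes the display easy to scan.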
a) Moving around the concordances
The user can move up and down among the KWIC entries by means of these single keystroke commands:
k - up one line
u - up a screen
j - down one line
d - down one screen
t - top of lines
b - bottom of lines
: - go to a particular line number
The commands can be used to move the cursor up and down in the present screen or to move one page backward or forward in the KWIC list. When working with large numbers of lines, the go-to-a-particular-line command can be quite time-saving (i.e. you can go directly to line 4000 instead of manually scrolling through hundreds of lines).
b) Selecting/rejecting lines
Most queries, even if well worked out, will not give you exactly what you want. In the above example, you can see that there are some lines that are not directly relevant for a study of verb agreement and collective nouns. Some instances of family are part of coordinated clauses and other instances are followed by a full stop. The Bank of English software offers several ways of selecting and deselecting lines, so that the concordance can be made "cleaner". The easiest way to get rid of a line is to hit the delete key. There are also more sophisticated ways of deleting lines. A whole chunk can be deleted by using Ctrl-x to set a mark and Ctrl-w to delete everything between the mark and the present cursor location. This means that whole blocks of lines can be deleted at once.
DEL - delete current line
Ctrl-x - set mark
Ctrl-w - delete lines between here and mark
Regular expressions provide another way of reducing the number of concordance lines. A regular expression is a string which conforms to a special pattern matching language. Basically, this pattern matching language comes from the UNIX world. There it is used with various commands, such as grep (to make a selection). Regular expressions are explained in detail in the help file and in the reference manual, and should not be avoided just because they may seem a bit difficult to use.
|c||regular characters match themselves||dog matches dog but also dogs and doggie|
|\||used to match special characters||\. matches a full stop|
|.*||matches any characters up to end of word||.*ish matches greenish, Irishmen etc.|
|\<||matches beginning of word||\<dog matches dog, doggie etc.|
|\>||matches end of word||ish\> matches greenish, bluish etc.|
|\< \>||matches whole word||\<dog\> matches dog|
|[ ]||matches character against selection in brackets||\<[aA]rmy\> matches army and Army, 1[2-4]00 matches 1200, 1300 and 1400|
|.||matches any character||\<..ck\> matches rock, tack, kick etc.|
|\|||logical or||color\|colour matches color or colour (i.e. all instances of both forms)|
It is important to bear in mind that unless explicitly specified, regular expressions do not relate to words but rather to text. This means that a simple expression such as he will match he, she, the etc. Make sure to indicate the beginning and end of words when relevant. All in all, the commands that make use of regular expressions are very powerful:
/ - regular expression search
g - select regular expression (reduce your concordance to lines that conform to the expression)
v - de-select regular expression (delete lines that conform to the expression)
We will give you a couple of examples to clarify how these commands can be used. After /, g or v has been pressed, you will have to specify the regular expression.
|v||and family||removes all instances of "and family"|
|g||Family||selects all instances of Family (with a capital F) - and deletes the rest|
|/||nuclear family||finds the next instance of "nuclear family"|
Both v and g will most likely produce a smaller concordance, but you can always return to the previous concordance by pressing RETURN. If you have performed a number of operations, you can go back step by step. Note that this does not work if you have used DEL or Ctrl-w to delete lines. Deleted lines cannot be restored.
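The effect of g and v, and the importance of marking word boundaries, can be tried out away from the server. The sketch below uses Python's re module, which is an assumption made purely for illustration: Python writes a word boundary as \b, whereas the UNIX-style expressions described above use \< and \>. The sample lines are invented.

```python
import re

lines = [  # invented sample concordance lines
    "the whole family are coming",
    "friends and family are welcome",
    "The Family are a band",
    "she said the others are here",
]


def select(pattern, lines):
    """Like g: keep only the lines that match the expression."""
    return [line for line in lines if re.search(pattern, line)]


def deselect(pattern, lines):
    """Like v: drop the lines that match the expression."""
    return [line for line in lines if not re.search(pattern, line)]


print(deselect(r"and family", lines))  # removes the line containing "and family"
print(select(r"Family", lines))        # keeps only the capitalised form
print(select(r"he", lines))            # also matches inside the, The, she, others, here
print(select(r"\bhe\b", lines))        # with word boundaries: the word "he" only
```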
c) Sorting lines
The "family are" entries in the above example are sorted according to subcorpus, i.e. the oznews corpus entries come first. This is the default sorting order. It can often be useful to sort the entries according to what precedes or follows the node word. This can be done by typing s.
s - sort the concordance lines
You will be asked to specify the sorting criteria: left or right of the node word, and between 1 and 5 words away from the node word in that direction. In the following figure, the entries have been sorted according to the word that precedes the node word (i.e. left - 1).
If we were interested in finding all instances of family directly preceded by the indefinite article, sorting the lines would make our job easier. Frequently, sorting is the best way of getting to grips with large amounts of data. If we had searched for different + preposition (query: different + IN), it would probably make sense to sort the concordances according to the following word. In this way, constructions such as different from and different to would be clearly separated.
The sort function is not case-sensitive, and if there are several identical constructions, they will be ordered according to subcorpus. In principle, the sorting works according to ASCII order. This means that digits and various tags (such as <p> for new paragraph) precede letters.
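The sorting behaviour described here can be imitated with a short sketch: pick the word a given number of positions to the left or right of the node word, lower-case it, and sort on that. The data layout (each line split into left context, node word and right context) and the function names are assumptions made for this illustration only.

```python
def sort_key(line, side, distance):
    """Key for a (left_words, node, right_words) tuple: the word `distance` steps away."""
    left, _node, right = line
    words = left[::-1] if side == "left" else right  # count outwards from the node word
    word = words[distance - 1] if len(words) >= distance else ""
    return word.lower()  # lower-casing makes the sort case-insensitive


def sort_lines(lines, side="left", distance=1):
    # sorted() is stable, so identical keys keep their existing (subcorpus) order
    return sorted(lines, key=lambda line: sort_key(line, side, distance))


lines = [  # invented examples, already split around the node word
    (["the", "royal"], "family", ["are", "here"]),
    (["A"], "family", ["are", "not"]),
    (["my"], "family", ["are", "coming"]),
]
for left, node, right in sort_lines(lines, side="left", distance=1):
    print(" ".join(left), node, " ".join(right))
```

Since plain string comparison follows character codes, digits and a tag such as <p> sort before lower-case letters in this sketch as well.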
d) Text references
There are two commands that give the user information about the current line:
n - display source information
z - display text references
This information is placed on the left-hand side of the screen. Source information relates to the subcorpus from which the line was taken (example: brmags/10). The text reference shows exactly where the line comes from in that subcorpus (example: N0000000929).
The following screen shows what happens to the above "family are" concordance when n has been used.
Every line now includes information about which subcorpus the text comes from.
e) Extra context
In many cases, the context that the KWIC screen gives you (80 characters) may not be enough. If you need more context, the Bank of English software provides it by means of this command:
x - show expanded context
It gives you approximately a paragraph of text, and if you need even more context, you can utilize the article mode function.
a - article mode
In principle, this command gives you access to the whole text since you can go up (press u) and down (d) in the text.
f) Collocations and frequency lists
In the section on the collocations option, we discussed collocations and various ways of measuring collocational strength. There are several commands that can be used to perform this kind of analysis directly in the concordance software.
c - list by t-score
C - list by MI score
F - list by frequency
This screen shows a collocation list of family (the above example) sorted according to frequency (F). Since the concordance only includes examples with family followed by are, we should not be surprised that there are 200 instances of family+are. The definite article also co-occurs often with family in this set of concordance data. Importantly, the list commands give you collocates with a 4:4 span. You get data for 4 words on each side of the node word.
The right-most column gives the t-score values for the various combinations. The list can be re-sorted according to MI score (C) or t-score (c). There is a help screen available here (press ?). Press RETURN to get back to the concordance screen.
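A frequency list of collocates within a span can be approximated with a simple counter: for every occurrence of the node word, collect the words up to four positions to the left and to the right. The sketch below is a minimal illustration under that assumption; the function name and the sample text are invented, and the real software may treat sentence boundaries and tags differently.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words found within `left`/`right` positions of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() != node:
            continue
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(word.lower() for word in window)
    return counts


tokens = "the family are here and the other family are away".split()  # invented sample
for word, freq in collocates(tokens, "family").most_common():
    print(word, freq)
```

Calling collocates(tokens, "family", left=0, right=1) restricts the count to the word immediately after the node word, which corresponds to the 0:1 span used in the collocations section.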
As well as this list of collocates, you can also get a visual profile of which collocates tend to occur in which positions. This profile can be sorted according to the same criteria as the above list.
T - picture by t-score
m - picture by MI score
p - picture by frequency
If we press m from the concordance screen (with the family are examples), we will first be asked how many columns of collocates we are interested in. This should be between 3 and 6, and in most cases 3 is probably enough.
This screen appears after specifying the number of columns to be analysed. It may take a while if there are many concordance lines. The columns should be seen as slots or positions, and it does not really make sense to read the lines from left to right. The node word (family) is in the middle. We can see that the is the word with the highest MI score in the position just before family. Since are always occurs after family in this concordance (due to the query), this is the only item in the first column on the right side of the node word. The lines can be re-sorted just as with the collocation list.
p - reorder picture/collocates by MI-score
T - reorder picture/collocates by t-score
In this section, we will review a couple of commands that have not yet been discussed.
p - give the tag of the node word (i.e. parts-of-speech information)
% - show relative frequencies
As we have noted earlier, the words in the Bank of English corpus are tagged. This information can be retrieved by using the p command. The relative frequency command tells us how many instances of the node word there are per million words. This is the same kind of information that is given just after a query has been processed by the Bank of English software. The only difference is that the % command describes the relative frequency for the current set of concordance lines.
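The arithmetic behind a per-million figure is simple: divide the number of hits by the size of the text collection and multiply by one million. The sketch below shows the calculation; the function name is invented, and using 450 million words as the size of the whole corpus for the 11 hits of edutainment is an assumption based on the figure quoted in the introduction.

```python
def per_million(hits, corpus_words):
    """Relative frequency: occurrences per million words of text."""
    return hits / corpus_words * 1_000_000


# 11 hits for edutainment, assuming a corpus of 450 million words
print(round(per_million(11, 450_000_000), 3))
```

Normalising in this way is what makes counts from subcorpora of very different sizes, such as bbc (2.6 million words) and npr (3.1 million words), comparable.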
h) The help screen
We will end this section with the help screen which summarizes the most important features and the single key commands in the Bank of English software. In the next section, we will consider a feature that we have not discussed yet: namely how to save data.
Once you have a set of concordance lines you might want to save it for further processing. Remember that everything you do in the Bank of English system is handled by computers in Birmingham, and that the telnet program just gives you continuous screen updates. The command f (from the concordance screen) can be used to save concordances.
You need to give a filename and specify how much context (i.e. number of characters) you need. Filenames should be composed of upper and lowercase letters, digits, hyphen, period, underscore, hash (#), dollar ($) and percent (%) characters. If your own computer is running Windows 3.x or DOS, then filenames should also be limited to eight characters with an optional filename suffix of a dot and three further characters. So "mydata01.cnc", "fam-coll-mi#1" and "98spok-NN" are valid filenames, but "My Data File", "data/001" and "**VB;concordance&MI" are not.
The file will now be saved in your directory on the Cobuild server. Files can be downloaded onto your own computer in two ways. The easiest way is to select option 3 (automatically e-mail your saved data files) from the main menu. In this way, all new files will be emailed to you. Files are automatically deleted from the Bank of English server after 30 days. This system does not allow the user to re-send files by email. This brings us to the second download option.
In order to download files to your own computer directly, you will need an FTP program (FTP stands for File Transfer Protocol). CuteFTP is a good FTP program for PCs and Fetch is probably the best Macintosh alternative. Run your FTP program to connect to titania.bham.ac.uk using your regular Bank of English login name and password. Remember to set the transfer mode to ASCII.
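The filename rules can be written down as a pair of regular expressions, which is a convenient way of checking names before typing them in. The sketch below is an illustration only: the function and pattern names are invented, and reading the DOS rule as "up to eight characters plus an optional dot and up to three more" is an assumption.

```python
import re

# Letters, digits, hyphen, period, underscore, hash, dollar and percent only.
VALID_NAME = re.compile(r"[A-Za-z0-9._#$%-]+")
# DOS / Windows 3.x: up to eight characters plus an optional dot and up to three more.
DOS_NAME = re.compile(r"[A-Za-z0-9_#$%-]{1,8}(\.[A-Za-z0-9_#$%-]{1,3})?")


def is_valid(name, dos=False):
    """Check a filename against the general rule, or the stricter DOS rule."""
    pattern = DOS_NAME if dos else VALID_NAME
    return pattern.fullmatch(name) is not None


for name in ["mydata01.cnc", "fam-coll-mi#1", "98spok-NN", "My Data File", "data/001"]:
    print(name, is_valid(name), is_valid(name, dos=True))
```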
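For users who prefer a script to a graphical FTP program, the same download can be sketched with Python's standard ftplib module. This is a hedged illustration rather than a tested recipe: the function names are invented, the login details and filename are placeholders, and the host named in the text may no longer accept FTP connections. The retrlines call transfers the file in ASCII mode, as recommended above.

```python
from ftplib import FTP

HOST = "titania.bham.ac.uk"  # host named in the text; availability is not guaranteed


def fetch_text_file(ftp, remote_name, local_path):
    """Download one file in ASCII mode using an already logged-in FTP connection."""
    with open(local_path, "w", encoding="utf-8", newline="\n") as out:
        # retrlines() uses ASCII mode and passes each line without its line ending
        ftp.retrlines(f"RETR {remote_name}", lambda line: out.write(line + "\n"))


def download(user, password, remote_name, local_path):
    """Log in with the regular Bank of English username and password, then fetch the file."""
    with FTP(HOST) as ftp:
        ftp.login(user, password)
        fetch_text_file(ftp, remote_name, local_path)


# Example call with placeholder credentials (not executed here):
# download("myname", "mypassword", "mydata01.cnc", "mydata01.cnc")
```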
Modern language corpora such as the Bank of English Corpus are an enormous asset in linguistics, language teaching and lexicography. We can quickly and easily investigate large amounts of data, and we can try out various search variants and ideas directly. This high level of accessibility makes it easier to analyse language data than it has ever been before. The use of corpora is not a new thing, of course, and scholars such as Samuel Johnson and Otto Jespersen had enormous collections of text material. It is just that computers are very good at handling large collections of material.
Systematic use of corpus data is a significant undertaking, however, and here we mention a number of issues that may be helpful before you embark on any major study based on the Cobuild material.
This section contains some relatively recent references which may be useful for people who want to know more about the use of corpora and corpus linguistics.
Barnbrook, G. (1996). Language and Computers. Edinburgh: Edinburgh University Press.
Biber, D. (1990). Methodological Issues: Regarding Corpus-based Analyses of Linguistic Variation. Literary and Linguistic Computing, 5:4, 257-258.
Clear, J. (1992). Corpus sampling. In Leitner, G. (ed.), New Directions in English Language Corpora. On corpus design, sampling procedures and data reliability. Berlin: Mouton de Gruyter, 21-31.
Fillmore, C. J. (1992). 'Corpus linguistics' or 'Computer-aided armchair linguistics'. In Svartvik, J. (ed.), Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin: Mouton de Gruyter, 35-60.
McEnery, T. and A. Wilson (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Blackwell.
b) Web sites
Michael Barlow's excellent corpus linguistics site: http://www.ruf.rice.edu/~barlow/corpus.html