Interactive Corpus Access Tool: Reference

Introduction

This document is a reference for the WordbanksOnline interactive corpus access tool. The program itself is called "lookup". If you come across unfamiliar terms as you read, you may find them in the glossary at the end. If you are new to WordbanksOnline, you will probably find it useful to begin by working through the Tutorial, which accompanies this document.

The concordance program for WordbanksOnline provides access to a subset of the Bank of English, by retrieving concordance lines for one or more words. You can specify combinations of words and/or parts of speech. The concordance lines are displayed on screen. They can then be sorted or expanded, or further selections can be made. The program can also give you statistical and collocational information. Output can be written to a file, and downloaded to your own computer, using ftp.

When accessing the program using WordbanksOnline, problems with the screen display will be avoided if you use a fixed width font, and if your screen is 24 lines long and 80 characters wide.

When you telnet to Cobuild, using your WordbanksOnline user name and password, you can run the interactive corpus access tool by selecting option 1 of the WordbanksOnline menu. If you are not a subscriber, you can run the demonstration version by entering wbdemo as both user name and password.

The first screen invites you to select which corpora you would like to work with during the session, with the Corpora ('q' to quit): prompt. If you want to use all the corpora, just hit RETURN.

If you want to select only certain corpora, type their names at the prompt, separated by spaces. The names of the corpora are given in the first column of the list displayed above the prompt. For instance, to look at just the Times corpus, type times. To look at the Times corpus and the NPR corpus, but not the general corpus, type times npr.

The next screen shows the prompt Query (or RETURN to exit):, and gives some examples of responses.

Examples of Responses to `Query (or RETURN to exit): '

What you type at this prompt specifies what concordance lines will be displayed. The concordance lines will be the examples from the corpus of this particular word, phrase or pattern.

We have developed a simple query language which defines what can be typed at this point. This section gives examples of queries you can type. The section after this one is a more general description.

yew
matches the word "yew"
bill
matches "Bill" or "bill" (or "BILL" or "bILl" ...)
oriented|orientated
"oriented" or "orientated"
that+man|woman
"that man" or "that woman"
glanc*
any word beginning with "glanc"
can@
matches all inflections of "can" ("can", "canned", "canning", "cans" and "could")
bear@+witness
Inflections of "bear", immediately followed by "witness"
bridl*/VERB
all words beginning "bridl", acting as a verb
sling*|slung/VERB
the tag VERB only applies to "slung"
(sling*|slung)/VERB
the tag VERB applies to all the forms
let+1,1lie
require exactly one word between "let" and "lie"
let+4lie
allow 0 to 4 words between "let" and "lie"
let+1,4lie
allow 1 to 4 words between "let" and "lie"
wits+about+PPO
"wits about" followed by an object pronoun
kill+PPL
"kill" followed by a reflexive pronoun
minister/!NOUN
"minister", excluding all noun uses
bridge/(NOUN|VERB)
"bridge" as a noun or a verb
(bridg*)/(NOUN|VERB)
nouns or verbs starting "bridg"
mr+smith
matches "Mr Smith"
the+\5th
matches "the 5th"
the+\50th
matches "the 50th"
isn+t
matches "isn't" (see the next section).
fred+s
matches "Fred's"
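The last few examples follow from the rule that only letters and digits count as word characters, so an apostrophe or hyphen becomes a slot boundary. The following Python sketch shows one way to turn ordinary text into a query of adjacent slots. It is not part of lookup; the function name phrase_to_query is our own invention for illustration, and it only handles unaccented letters and digits.

```python
import re


def phrase_to_query(phrase):
    """Turn ordinary text into a query made of adjacent slots."""
    # Only letters and digits count as word characters, so apostrophes,
    # hyphens and spaces all become slot boundaries.
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    slots = []
    for word in words:
        # A word-initial digit is escaped with a backslash so that it
        # is not read as a gap specification.
        if word[0].isdigit():
            word = "\\" + word
        slots.append(word)
    return "+".join(slots)


if __name__ == "__main__":
    for text in ["Fred's", "isn't", "point-blank", "the 5th"]:
        print(text, "->", phrase_to_query(text))
```

Queries built this way always use a plain "+", so each word must immediately follow the one before it.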

Description of Query Language

A query can be thought of as a series of slots, each of which represents a word or choice of words. In each slot you give a paradigmatic specification. Slots can then be built up to specify a syntagmatic structure. At time of writing, the maximum number of alternatives for a single slot is 1000.

All words must be given as lower case, but will match both lower and upper case occurrences. If you are only interested in one or the other, you can use the g command once the lines are displayed to narrow down the selection (see the section `One Character Commands'). Words can include digits.

The following characters have a special meaning:

|
or. Used to separate alternative words or alternative tags.
/
logical and. Read as "as" or "as a". Used between a word and a tag, e.g. "jam/VB" reads "jam as a verb".
!
not. The "!" character negates the tag which immediately follows it. It can only be used with tags - so "jam/!VB" is legal ("jam" as anything other than a verb), but "jam+!cream" ("jam" followed by anything other than "cream") is not legal. There is no way to express that query at present. The "!" character can also only be used with tags which are attached to a word. So "jam+!NN" is not legal.
+
separates slots. You can string together as many slots as you are likely to want.
*
A wildcard matching any number of characters. "*" can only be used at the end of a word.
@
A word form with "@" appended matches all forms of the lemma.
( and )
used to override the default precedence.

"/" always takes precedence over "|", i.e. it binds more tightly. For example, the query:

hat|glove/NOUN|VERB

Is equivalent to:

hat|(glove/NOUN)|VERB

(though this is not a legal query, as lexical items and tags cannot be or'd).
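To see how the precedence rule groups a slot, here is a small Python sketch. It is only an illustration of the rule, not how lookup parses queries; the helper names split_top_level and group_alternatives are hypothetical.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


def group_alternatives(slot):
    """Split on "|" first, then on "/", so "/" binds more tightly."""
    return [split_top_level(alt, "/") for alt in split_top_level(slot, "|")]


if __name__ == "__main__":
    print(group_alternatives("sling*|slung/VERB"))
    print(group_alternatives("(sling*|slung)/VERB"))
```

In the first query the tag ends up attached to "slung" only; in the second, the parentheses keep both forms together on the word side of the "/".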

The following are the other components you can use to form a query:

gap
A "gap" comes between the "+" operator and the following slot. It indicates how many words can intervene between the slot specified before the plus and the slot specified after "gap". "gap" has the form "m,n". "m" is the minimum number of words which can intervene, and "n" is the maximum number of words which can intervene.

A "gap" can also be expressed as a single number only. The gap "m" is equivalent to "0,m".
word
a word, made up of lower case letters and/or digits. Even words which always contain capitals should be specified in lower case.

If the word begins with a digit, the digit should be preceded by a backslash ("\"), so that the digit is not interpreted as a gap specification. Digits which are not word-initial should not be preceded by a backslash.
Note that no characters apart from letters and digits can form part of a word - not even apostrophe or hyphen. So if you wanted to look at "don't", the query would be "don+t" (2 slots). If you wanted to look at "point-blank", the query would be "point+blank". This may seem perverse, but both of these characters frequently occur outside words, and so declaring them to be word characters would also have produced anomalies. The current system, though at first glance non-intuitive, is, we hope, straightforward.
prefix*
a prefix, made up of lower case letters and/or digits, followed by an asterisk to indicate that you want to allow all words beginning with that prefix.
wordform@
a wordform, followed by the "@" character. This will match all forms of the lemma to which the word form belongs.
TAG
any word class tag from the tag list (as described in the section entitled "The Tag List").
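The gap rules above can be sketched in Python as follows. This is an illustration under our own assumptions rather than the program's actual parser: the function name parse_gap is hypothetical, and we assume that a "+" with no gap means the next slot must follow immediately (as in the "bear@+witness" example).

```python
import re

# Optional "m,n" or "n" gap, followed by the rest of the slot.
_GAP = re.compile(r"^(?:(\d+),(\d+)|(\d+))?(.*)$")


def parse_gap(after_plus):
    """Return (min_words, max_words, slot_text) for the text after a "+"."""
    m, n, single, rest = _GAP.match(after_plus).groups()
    # A leading backslash only protects a word-initial digit; drop it.
    if rest.startswith("\\"):
        rest = rest[1:]
    if m is not None:
        return int(m), int(n), rest
    if single is not None:
        # A single number "n" is shorthand for "0,n".
        return 0, int(single), rest
    # No gap given: assume the next slot must follow immediately.
    return 0, 0, rest


if __name__ == "__main__":
    for text in ["1,4lie", "4lie", "witness", "\\5th"]:
        print(text, "->", parse_gap(text))
```

The last case shows why the backslash matters: without it, "5th" would be read as a gap of up to five words followed by "th".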

Here are some possible forms of single slots. They form legal queries in themselves, or can be strung together with +, followed by an optional gap specification.

Note that you can't do:

The Tag List

NN
common singular noun
IN
preposition (in, up)
DT
determiner
NP
proper noun
JJ
adjective
NNS
common plural noun
CC
co-ordinating conjunction (and, or)
RB
adverb
VB
verb base form
VBN
verb past participle form
VBD
verb past tense form
CS
subordinating conjunction (unless, although)
PPS
personal pronoun subject case: (I, she)
VBG
verb -ING form
PPP
possessive pronoun: (mine, yours, hers)
CD
number
TO
'to' infinitive marker
MD
modal verb
PPO
personal pronoun object case: (me, her)
BEZ
verb 'to be' 3rd pers, pres sing: is
BEDZ
verb 'to be' past tense: was
DEM
demonstrative pronoun: (this, that)
VBZ
verb 3rd pers pres sing
BE
verb 'to be' base form: be
WH
WH- word
HVD
verb 'to have' past tense: had
BER
verb 'to be' 3rd pers pres plural: are
NEG
negation particle: not
HV
verb 'to have' base form
BED
verb 'to be' past tense: were
PN
general non-personal pronoun (anyone, everything, none)
HVZ
verb 'to have' 3rd person pres sing: has
BEN
verb 'to be' past participle: been
DTG
determiner/pronoun: (these, those, both, either)
EX
existential 'there'
DO
verb 'to do' base form
DOD
verb 'to do' past tense: did
PPL
reflexive pronoun singular: herself, myself
BEG
verb 'to be' ING form: being
UH
formulaic interactive expression: yes ugh um
DOZ
verb 'to do' 3rd pers pres sing: does
BEM
verb 'to be' 1st pers pres sing: am
DTP
possessive determiner: (my, our)
PPLS
reflexive pronoun plural: themselves, yourselves
HVG
verb 'to have' ING form: having
HVN
verb 'to have' past participle: had

In addition to the above tags, there are some tag macros. If you use these as part of a query, they are expanded as follows:

DET
DT|DTG|DTP (any kind of determiner)
NOUN
NN|NNS|NP (any kind of noun)
VERB
VB|VBD|VBG|VBN|VBZ (any kind of verb)
PRON
PPS|PPP|PPO|DEM|PN|DTG|PPL|PPLS (any kind of pronoun)
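The macro expansion can be pictured with the following Python sketch. The table is copied from the list above; the function name expand_tags is our own, and the way lookup actually implements macros may well differ.

```python
# Macro table copied from the list above.
TAG_MACROS = {
    "DET": ["DT", "DTG", "DTP"],
    "NOUN": ["NN", "NNS", "NP"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBZ"],
    "PRON": ["PPS", "PPP", "PPO", "DEM", "PN", "DTG", "PPL", "PPLS"],
}


def expand_tags(tag_spec):
    """Expand a "|"-separated tag specification into plain tags."""
    expanded = []
    for tag in tag_spec.split("|"):
        # Ordinary tags expand to themselves.
        for plain in TAG_MACROS.get(tag, [tag]):
            if plain not in expanded:
                expanded.append(plain)
    return expanded


if __name__ == "__main__":
    print(expand_tags("NOUN|VERB"))
    print(expand_tags("DET|PRON"))
```

Note that DTG belongs to both DET and PRON, so the sketch lists it only once when both macros are used.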

Selecting Word Forms

After you have typed in your query, lookup goes through each slot, finding all the words it knows about which match your specification. The matches are displayed on the left of the screen.
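As a rough picture of this matching step, here is a Python sketch that expands a single slot against a word list. The vocabulary and lemma table are toy data invented for the example, and expand_slot is a hypothetical name; lookup's own word lists and lemma information are not described in this document.

```python
# Toy data, invented for illustration only.
VOCABULARY = ["can", "cans", "could", "glance", "glanced", "glances",
              "glancing", "gland", "yew"]
LEMMAS = {"can": ["can", "canned", "canning", "cans", "could"]}


def expand_slot(spec):
    """Return the word forms matched by one word, prefix* or wordform@."""
    if spec.endswith("*"):
        # prefix*: every known word beginning with the prefix.
        prefix = spec[:-1]
        return [w for w in VOCABULARY if w.startswith(prefix)]
    if spec.endswith("@"):
        # wordform@: all forms of the lemma the word form belongs to.
        form = spec[:-1]
        for forms in LEMMAS.values():
            if form in forms:
                return list(forms)
        return []
    # Plain word: matches itself if it is known.
    return [w for w in VOCABULARY if w == spec]


if __name__ == "__main__":
    print(expand_slot("glanc*"))
    print(expand_slot("could@"))
```

Each form returned would then appear in the list on the left of the screen, ready to be kept or deselected.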

To move down the list, hit the j key. To move up, hit k. (On some machines, the up and down arrow keys also work). A screenful of words is displayed at a time - if your request matched more words, the screen scrolls as you go down. If you try and move beyond the end, there's a beep.

To the right of the screen, the corpus frequencies of the word you are currently on will be displayed. (These can all be zero if the word doesn't occur in any of the corpora you are using.) To the left of each word, a lowercase y is displayed, indicating that the word form is selected. If you hit the n key, this y will change to an n. You can hit y to re-select a word if you change your mind. When you're done, hit RETURN and the words matching the next slot will be displayed, and so on.

When you've selected the word forms you want for each slot, the message

Searching ...
is displayed - the program is finding concordance lines which match your query. It then displays the number of lines from each corpus which have been found, then prompts you to enter the number of concordance lines you would like to be retrieved. Enter a number, or RETURN to see all the lines.

The Concordance Line Display

The least common slot in your query is called the node word, or node. If your query is just a single word, this word is the node word, and can also be called the keyword.

The program retrieves examples from the corpus which match your query. The node word is lined up in the middle of the screen. (That's why we recommend you use a fixed width font.) Each line contains one example, and the example extends from the node word in both directions to the limits of the screen. These examples are the concordance lines. Sometimes we also call them "citations", or just "lines".
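The layout just described can be imitated with a few lines of Python. This is a sketch only: kwic_line is a hypothetical name, the 80-column default follows the screen size recommended earlier, and the exact column at which lookup places the node is an assumption.

```python
def kwic_line(tokens, node_index, width=80):
    """Format one concordance line with the node word lined up mid-screen."""
    node = tokens[node_index]
    left = " ".join(tokens[:node_index])
    right = " ".join(tokens[node_index + 1:])
    # Columns available on each side, leaving one space around the node.
    left_width = (width - len(node)) // 2 - 1
    right_width = width - left_width - len(node) - 2
    # Keep the end of the left context and the start of the right context.
    left_part = left[-left_width:].rjust(left_width) if left_width > 0 else ""
    right_part = right[:right_width]
    return left_part + " " + node + " " + right_part


if __name__ == "__main__":
    examples = [
        ("they found great solace in the music".split(), 3),
        ("solace was hard to find".split(), 0),
    ]
    for tokens, node_index in examples:
        print(kwic_line(tokens, node_index, width=40))
```

Because every line pads or trims its left context to the same width, the node always starts in the same column, which only lines up visually in a fixed width font.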

The display includes codes enclosed in angle brackets (< and >). These are the Cobuild text mark-up codes. A full list of the codes is available from Cobuild.

The hash character has a special meaning. So the program can run quickly, concordance lines are retrieved from an integerized version of the corpus. This means that each word is stored in our database as a number. Information about punctuation between words, and capitalisation, is stored separately, and is known as "glue". Only a limited number of glue strings are stored. If a particular glue string is not one of these, it is represented as ` # '. This means the punctuation and/or capitalisation of the text at that point can't be recreated exactly.
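The idea can be sketched in Python as follows. This is a simplified illustration, not the real storage format: the glue inventory is invented, the function names are hypothetical, and the sketch handles punctuation only (it ignores capitalisation, which the real glue also records).

```python
# Assumed small, fixed inventory of glue strings (invented for this sketch).
KNOWN_GLUE = [" ", ", ", ". ", "-", "'"]
UNKNOWN_GLUE = " # "


def integerize(words, glue):
    """Encode each word, and the glue that follows it, as an integer."""
    vocab = {}
    word_ids, glue_ids = [], []
    for word, g in zip(words, glue):
        # Each distinct word gets the next free number.
        word_ids.append(vocab.setdefault(word.lower(), len(vocab)))
        # Glue outside the inventory cannot be stored exactly.
        glue_ids.append(KNOWN_GLUE.index(g) if g in KNOWN_GLUE else -1)
    return vocab, word_ids, glue_ids


def reconstruct(vocab, word_ids, glue_ids):
    """Rebuild the text; unknown glue comes back as ' # '."""
    id_to_word = {i: w for w, i in vocab.items()}
    pieces = []
    for w, g in zip(word_ids, glue_ids):
        pieces.append(id_to_word[w])
        pieces.append(KNOWN_GLUE[g] if g >= 0 else UNKNOWN_GLUE)
    return "".join(pieces)


if __name__ == "__main__":
    vocab, word_ids, glue_ids = integerize(
        ["well", "yes", "no"], [", ", " -- ", ". "])
    print(word_ids, glue_ids)
    print(reconstruct(vocab, word_ids, glue_ids))
```

In the example the " -- " between "yes" and "no" is not in the inventory, so it comes back as the hash placeholder.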

One Character Commands

When concordance lines are displayed, there are a number of one character commands which you can execute. The following control cursor movement:

Cursor Movement

j
Go down a line
k
Go up a line
u
Go up a screenful
d
Go down a screenful
t
Return to the top screen of citations
b
Go to the bottom screen of citations
:
Prompts you for a line number and takes you there.
/
(forwardslash). Lets you search for a line containing a particular regular expression. (A regular expression describes a pattern you want to match. At its simplest, it's just a word or words you want to search for. There is a comprehensive description of regular expressions in the online help: hit ? when lines are displayed, then /.) You will be prompted for a regular expression. Hitting RETURN at this point searches again for your previous regular expression. You are placed on the first matching line. The search runs from the current line to the end of your citations, and offers the opportunity to wrap back to the first citation. Briefly, the regular expression special characters include the following:
+
"one-or-more-times" operator
?
"zero-or-one-times" operator
\|
alternation operator ...
\( ... \)
... and parentheses for delimiting its scope
\<
the imaginary character just before the start of a word
\>
the imaginary character just after the end of a word

Selecting/Refining Lines

The following commands allow you to alter the set of lines, once you have them, by selecting only those which match your criteria or by deleting unwanted lines:

DELETE
Erase the current citation from the screen
ctrl-x, ctrl-w
Delete a block of lines from the screen. Go to one end of the block, hold down the `Control' key and type x, then go to the other end of the block, hold down the `Control' key and type w. The block will be deleted.
R
Delete adjacent identical lines.
Q
Select lines where the node word is tagged with a particular word class. You are prompted for a tag. All lines which are not tagged this way are then removed. You can enter any tag, or several tags separated by | (or). Note: If you use |, you can't use any of the tags PRON, VERB, NOUN or DET (these are implemented differently, being any of several other tags).
N
Select lines whose corpus ID is matched by a regular expression. You are prompted for the regular expression.
Z
Select lines whose text reference is matched by a regular expression. You are prompted for the regular expression.
g
(mnemonic "grep"). Prompts you for a regular expression (see forwardslash), then reduces lines to those that match. You can do this more than once. Typing RETURN gets you back to your previous set of lines. If you have many thousands of lines, this can take a few minutes.
v
Mnemonic "grep -v". Works just like g, but selects only those lines which DON'T match your regular expression.
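If you want to try out the effect of g, v and R on lines you have saved to a file, the following Python sketch does something similar. It is not lookup's code, and the function names are our own. Note that Python's regular expression syntax differs from the one used by lookup: Python writes alternation as | and grouping as ( ), and uses \b for a word boundary where lookup has \< and \>.

```python
import re
from itertools import groupby


def select_lines(lines, pattern, invert=False):
    """Keep lines matching pattern (like g) or not matching it (like v)."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]


def drop_adjacent_duplicates(lines):
    """Collapse runs of identical neighbouring lines (like R)."""
    return [line for line, _ in groupby(lines)]


if __name__ == "__main__":
    lines = [
        "found solace in music",
        "sought solace from friends",
        "sought solace from friends",
        "cold solace",
    ]
    print(select_lines(lines, r"\b(found|sought)\b"))
    print(select_lines(lines, r"\bsought\b", invert=True))
    print(drop_adjacent_duplicates(lines))
```

Matching is case-sensitive here, which is what lets a regular expression separate upper and lower case occurrences.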

Displaying Information About Lines

P
Display the tag of the node word on the left-hand end of each line. P again removes them.
z
Displays the text reference for each line on the left of the screen. z again removes them.
n
Displays the corpus ID for each line on the left of the screen. n again removes them.

Collocation

There are three ways to produce a list of collocates. They use different selection criteria. For more discussion of the three alternatives, see the online help. If you have many thousands of lines, the list may take a few minutes to generate. The three commands are:

c
Gives the top 50 collocates of the node, ordered by t-score. Briefly, t-score is a measure of confidence that the change in frequency of the collocate when the node word is present is statistically significant.
C
Gives the top 50 collocates of the node, ordered by MI (mutual information). Briefly, MI indicates the amount of information that finding one word gives you about the presence of the other, within the corpus.
F
Gives the top 50 collocates of the node, ordered by frequency.
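This document does not give the formulas behind the t-score and MI orderings. The Python sketch below uses definitions that are common in the collocation literature, based on comparing the observed number of co-occurrences with the number expected by chance. Treat it as an assumption about the general approach, not as a statement of what lookup computes; all function and parameter names are ours.

```python
import math


def expected_cooccurrences(node_freq, collocate_freq, corpus_size, span=1):
    """Co-occurrences expected by chance within `span` positions (assumed model)."""
    return node_freq * collocate_freq * span / corpus_size


def t_score(observed, expected):
    """Confidence that the observed count differs from the expected count."""
    return (observed - expected) / math.sqrt(observed)


def mutual_information(observed, expected):
    """How much more often the pair occurs than chance predicts, in bits."""
    return math.log2(observed / expected)


if __name__ == "__main__":
    expected = expected_cooccurrences(100, 50, 10000)
    print(expected)
    print(t_score(16, 4.0))
    print(mutual_information(8, 1.0))
```

With definitions like these, MI tends to favour rare words that occur almost only with the node, while t-score favours pairs with plenty of evidence, which is one reason the three lists can look quite different.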

There are three ways to produce lists of positional collocates (called "picture" - see section on picture below, and the online help). The three commands are:

T
Display picture for the current word, showing positional collocates of the node word ordered by t-score.
m
Display picture for the current word, showing positional collocates of the node word, ordered by MI.
p
Display picture for the current word, showing positional collocates of the node word, ordered by frequency.

Other

RETURN
Go back to previous state. If you have made a selection, RETURN will revert to the previous set of lines. If you have come from picture, you will go back to picture. Otherwise you will be returned to the Query: prompt.
?
On-line help. The on-line help shows a brief summary of the one-character commands. You can then get detailed help for any command.
a
Article mode - drops you into the text the current line came from, and lets you page through. (Not available for WordbanksOnline).
f
Put your current lines into a file. You will first be asked to enter the name of the file. You will be able to retrieve the file later by connecting to Cobuild using ftp, again with your WordbanksOnline username and password. If, for instance, you have the lines for the word "fleece" and want to create a file of the same name in a subdirectory of your home directory called "concs", type:
concs/fleece
at the Filename: prompt. If you get the message:
Append/overwrite (a/o):
then a file of that name already exists. If you type a, the lines will be added to the end of the existing file. If you type o, the contents of the old file will be replaced. You are also asked to specify the concordance line length.
s
Re-sort your current lines, by a position defined with respect to the node word (e.g. the word after the node word, two words before etc). After you hit s, you will be prompted to enter the position.
x
Expand the current line to a five-line context. RETURN returns the screen to its previous state.
%
Displays parts-per-million information for your current lines.
D
Displays bibliographic information for the current citation, if available.
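The kind of re-sorting done by s can be sketched in Python like this. It is an illustration only: each line is represented as a list of words plus the index of the node, and the name sort_by_position and the convention that +1 means "the word after the node" are our own assumptions.

```python
def sort_by_position(lines, offset):
    """Sort (tokens, node_index) pairs by the word at node_index + offset."""
    def key(line):
        tokens, node_index = line
        i = node_index + offset
        # Lines with no word at that position sort first.
        return tokens[i].lower() if 0 <= i < len(tokens) else ""
    return sorted(lines, key=key)


if __name__ == "__main__":
    lines = [
        (["find", "solace", "in"], 1),
        (["seek", "solace", "from"], 1),
        (["great", "solace", "at"], 1),
    ]
    for tokens, _ in sort_by_position(lines, 1):
        print(" ".join(tokens))
```

Sorting on the word after the node brings together lines such as "solace in" and "solace from", which makes recurring patterns easier to spot.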

One Character Commands (alpha order)

Here is a summary of the commands, in alpha order, for ease of reference:

DELETE
Erase the current citation from the screen
RETURN
Go back
ctrl-x, ctrl-w
Mark and delete a block of lines
%
Display parts-per-million information for current lines
/
(forwardslash). Search for a regular expression
:
Prompts you for a line number and takes you there
?
On-line help
C
Lists top 50 collocates of the node, by MI.
D
Display bibliographic information for current line
F
Lists top 50 collocates of the node, ordered by frequency
N
Select lines according to a regular expression match over their corpus IDs
P
Display the tag of the node word on the left-hand end of each line. P again removes them.
Q
Select lines where the node word is tagged with a particular word class.
R
Delete adjacent identical lines
T
Display picture, using t-score
Z
Select lines according to a regular expression match over their text references
a
Article mode. (Not available for WordbanksOnline.)
b
Go to the bottom screen of citations
c
Lists top 50 collocates of the node, ordered by t-score
d
Go down a screenful
f
Put your current lines into a file
g
(mnemonic "grep"). Select lines matching a regular expression
j
Go down a line
k
Go up a line
m
Display picture, by MI
n
Display the corpus ID of each citation at the left-hand end of each line. n again removes them.
p
Display picture, by frequency
s
Re-sort your current lines
t
Return to the top screen of citations
u
Go up a screenful
v
(mnemonic "grep -v"). Deselect lines matching a regular expression
x
Expand the current line to a five-line context. RETURN returns the screen to its previous state.
z
Displays the text reference for each line on the left of the screen. z again removes them.

Picture

The screen known as "picture" gives you a visual representation of the words which most frequently occur near a particular node. Don't try to read along the lines - the screen really consists of 6 columns, each representing a position with respect to the node word. Each column lists the most significant collocates in that position.

The columns can be ordered by t-score, MI or frequency, depending on how you invoked picture. T uses t-score, m uses MI and p uses frequency.

It's simple to switch between the picture and the citations which are the data underlying it (see below). You can move round the collocates with j, k, l and r.

Here's what you might get if you ask for the word "solace", do p for picture, and then position the cursor on the word "find" (column 3, row 8):

    the        to         the        NODE     in         the        the
    was        and        sought     NODE     of         a          and
    and        the        seek       NODE     and        her        of
    of         a          for        NODE     to         in         in
    is         was        of         NODE     from       that       s
    to         for        great      NODE     for        those      from
    for        of         seeking    NODE     the        his        that
    there      provide    find       NODE     with       pilgrimes  fact
    they       found      found      NODE     at         of         to
    but        i          no         NODE     it         this       you
    that       she        and        NODE     is         they       who
    he         <FCH>      some       NODE     or         i          or
    him        who        finds      NODE     after      it         a
    she        have       little     NODE     if         our        could
    <t>        her        only       NODE     a          these      arms
    could      he         finding    NODE     i          an         company
    had        offered    offer      NODE     was        support    source
    in         his        seeks      NODE     he         not        bottom
    have       comfort    <FCH>      NODE     she        never      <FCH>
    need       no         to         NODE     <LTH>      was        kuwaiti
    i          is         a          NODE     <FCH>      she        lov
    her        or         draw       NODE     that       fulfilment time
    every      been       take       NODE     only       leaving    i
    other      can        took       NODE     did        comfort    her
    offer      desperatel spiritual  NODE     outside    drink      great
    may        find       ye         NODE     yesterday  cried      did
    offered    others     with       NODE     <t>        succor     was
    rather     by         one        NODE     on         no         rather
    might      women      main       NODE     now        be         his
    "find". Tot freq:47613 Freq as coll:10. t-sc:3.129809. MI:6.606445. `?' for help


This tells you that the word most frequently occurring immediately before "solace" is "the". Also, the 15th most frequent word occurring three words to the right of "solace" is "arms" etc. The line at the bottom of the screen (the status line) gives information about the word the cursor is on. From it, we know that the word "find" occurs 47,613 times in the corpus, and that ten of these occurrences immediately precede the word "solace".
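A frequency-ordered picture like the one above can be built from a set of concordance lines along the following lines. This Python sketch is ours, not lookup's: picture_columns is a hypothetical name, each line is a list of words plus the index of the node, and only the frequency ordering (the p command) is imitated.

```python
from collections import Counter


def picture_columns(lines, span=3, rows=5):
    """Return, for each position around the node, its most frequent words."""
    offsets = [o for o in range(-span, span + 1) if o != 0]
    counts = {o: Counter() for o in offsets}
    for tokens, node_index in lines:
        for o in offsets:
            i = node_index + o
            if 0 <= i < len(tokens):
                counts[o][tokens[i].lower()] += 1
    # One column per position, most frequent word first.
    return {o: [w for w, _ in counts[o].most_common(rows)] for o in offsets}


if __name__ == "__main__":
    lines = [
        ("we find solace in art".split(), 2),
        ("they find solace in music".split(), 2),
        ("i seek solace from friends".split(), 2),
    ]
    columns = picture_columns(lines, span=2)
    for offset in sorted(columns):
        print(offset, columns[offset])
```

Each column is counted independently of the others, which is why reading across a row of the picture does not give you a real phrase.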

You can view the lines for a particular node and collocate pair by positioning your cursor on the collocate and typing x. This takes you into a line-viewing mode where you have all the usual options. RETURN returns you to the picture environment.

When you are in picture, you can alter the collocation criteria, by hitting T, m or p, without having to go back to the lines first.

Typing ? while you are in picture gives you on-line help for picture.

Glossary

Bank of English
The Bank of English is a corpus. It is a collection of samples of modern English language, totalling over 400 million words of books, newspapers, magazines, radio broadcasts, transcribed informal speech, etc. It is held on computer at COBUILD.
Cobuild
Cobuild is a department and imprint of HarperCollins Publishers, specializing in the preparation of reference works for language learners in English. Cobuild is based at the University of Birmingham, where it carries out research into corpus-based lexicography.
citation
The word "citation" is used here synonymously with "concordance line".
collocate
A collocate of a particular word is another word which tends to occur with that word. In lookup, collocates are considered to be words which occur within four words of the node. For instance, some collocates of "summer" are "spring" and "season".
collocation
Collocation is the phenomenon of collocates. The word is also used as a synonym for "collocate".
concordance line
A concordance line is an example of a particular word, phrase or pattern. The example has the node word in the middle, and extends the same distance on either side.
corpus
"corpus" is here used to mean a collection of samples of language, used for analysis of words, meanings, grammar and usage.
frequency
"frequency" is here used to mean the number of times a word, phrase or pattern occurs in a corpus, or the density of its occurrence.
ftp
"ftp" is a command used to move files between computers. See your local system administrator if you want help with ftp.
grep
"grep" is a UNIX command for selecting lines matching a regular expression, on which the lookup commands g and v are based.
keyword
If a single word is being concordanced, it is known as the keyword.
lexicographer
A lexicographer is a writer of dictionaries.
lexicography
Lexicography is the writing of dictionaries.
lookup
Lookup is the software which is used to provide a corpus search service for WordbanksOnline. "Lookup" is the actual program that starts when you select "Interactive Corpus Access Tool" from the main menu, after you have logged in to our server. "Lookup" is also used by lexicographers at Cobuild. It was written by employees of Cobuild and HarperCollins. Thanks are due to Ken Church of Bell Labs, for his help in its development.
MI
MI is an abbreviation for mutual information. It is a calculation based on information theory. It indicates how strongly two words are related. It measures the likelihood with which the presence of one word indicates the presence of the other.
node word
The node word is also sometimes called the node. It is the least common word in a query. It is the word which is lined up down the middle of the screen.
paradigmatic
The paradigmatic axis is the list of language choices at a particular point in a text. In diagrams, it is usually represented as a vertical axis. This term is widely used in systemic linguistics. See also syntagmatic.
picture
Picture is a screen format used by lookup, to represent positional collocation. Lists of positional collocates for six positions are shown.
positional collocate
A positional collocate is a word which tends to occur in a specified position with regard to the node word. For instance, some positional collocates of "summer" in the position immediately after "summer" are "holiday" and "months".
query
For the purposes of this document, a query is something you type at the Query: prompt. It is a specification of a word, phrase or pattern for which you would like concordance lines. It conforms to the query language described in this document.
query language
A query language is a grammatical specification of a set of legal queries.
regular expression
A regular expression is a string of characters which conform to the regular expression pattern matching language. (Note, this is different from the query language described in this document). Regular expressions are used extensively by the UNIX operating system. A full description of the regular expression pattern matching language can be found in the on-line help for lookup, under g.
status line
The status line is a line at the bottom of the screen, giving information. In lookup, the concordance line screen and the picture screen have status lines.
syntagmatic
The syntagmatic axis is the time axis as you read, write or hear text. It is usually represented as horizontal. See also paradigmatic.
telnet
"telnet" is a command used to log in to a remote computer. See your system administrator if you need help with telnet.
t-score
t-score is a statistical score. It is a measure of confidence that the change in frequency of the collocate when the node word is present is statistically significant.
email: direct@cobuild.collins.co.uk