
CQP - A Short Introduction

Written by Amy Amoakuh and Arne Werfel


1. Why do I need corpora and how do I access them?

1.1 What is a corpus?

1.2 How can I access corpora?

2. Simple queries

2.1 Searching for a single word

2.2 Navigating through the results

2.3 Searching for different attributes

2.4 Saving queries

3. Complex queries

3.1 Alternative values and attributes

3.2 Combining values and attributes

3.3 Multi-word queries

3.4 PrintStructures: display text ID, genre etc.

3.5 Match: search specific texts (subcorpora)

3.6 Group: count by metadata

3.7 Counting concordances by position

4. Repetition Operators

4.1 Match any symbol with the wildcard: .

4.2 Repeating one or more times: +

4.3 Repeating zero or more times: *

4.4 Alternating between single symbols: [ ]

4.5 Alternating between strings within a string: ( )

4.6 Min and max number of matches: {}

4.7 Escaping special characters

5. Beyond querying

5.1 Sorting and sampling

5.2 Randomizing concordances

5.3 Reducing concordances

5.4 Exporting concordances

5.5 Exporting frequency lists

5.6 Spreadsheet software

6. Statistical testing

6.1 A short introduction

6.2 The chi-square test

Appendix

Text editors

Special characters

1. Why do I need corpora and how do I access them?

1.1 What is a corpus?

In linguistics, a corpus (plural: corpora) can be defined as a collection of authentic language samples, i.e., texts that were usually not produced with the intention of one day being analyzed by a linguist. Yet it is this authenticity that linguists are usually interested in. As language is used in all sorts of contexts and across different media, the kinds of texts in corpora vary, too. For example, a corpus can contain written texts from newspapers and novels, but also spoken texts from phone calls, sales talks, or political speeches, which are usually converted into transcripts. Depending on the research focus, linguists sometimes enrich corpora with further information. These annotations can be divided into two categories:

 

1)        Linguistic features: information about the word class, part of speech, and headword (= lemma or uninflected ‘base’ form) of each word, for example:

 

word      part of speech                 headword    class
loving    -ing form of a lexical verb    love        verb

 

 

2)         Paralinguistic features: information about the speaker or author, the genre, the time of origin and a short title or ID of the specific text.

 

genre    author_sex    date    id
novel    female        2000    TBA

 

 

 


EXERCISE

 

Why could information about word class, part of speech, or headword be valuable?

 

 

SOLUTION

 

Every additional feature that is annotated in a corpus opens up new ways of examining language. Just imagine if information about part of speech and word class were missing from your corpus: to get a list of all verbs, you would have to enter every single verb yourself, and you would still risk missing some. With an annotated class feature, you can easily find all verbs with one short query. Paralinguistic features are helpful if you want to compare language samples from particular eras or genres.

 


 

1.2 How can I access corpora?

 

In this manual, we will only explain how to access the corpora the FU provides. If you are interested in other corpus websites, visit https://www.english-corpora.org/. We recommend using a PC or Mac for working with corpora, but technically you can use a tablet or even a smartphone. Note, however, that smaller keyboards are less practical and special keys on external (tablet) keyboards sometimes do not work properly.

 

Provided that you are enrolled as an FU student, you have two options for accessing the university’s corpus system. You can either access the FU corpora on this website (https://www.zedat.fu-berlin.de/Shell) or via a program that is usually preinstalled on your operating system: it is called Terminal on macOS and Linux systems and PowerShell on Windows.[1]

 




EXERCISE

 

Depending on the choice you make, you must follow different steps in the beginning to set up INLET (the environment for empirical linguistic research). This is necessary to allow access to the different corpora under your account. As nothing will be installed or downloaded on your device, there is no need to worry about storage space.

 

A.        Sign in

 

ZEDAT shell (web browser)

1. Go to https://www.zedat.fu-berlin.de/Shell
2. Enter your ZEDAT username (without @fu-berlin.de) and hit RETURN.
3. Enter your password and hit RETURN.

Terminal, PowerShell

1. Open one of the programs on your PC or Mac.
2. Type the following, replacing USERNAME with your ZEDAT username, and hit RETURN:

   ssh USERNAME@login.fu-berlin.de

3. If it is your first login, you will be asked whether you trust the ZEDAT server. Type yes and hit RETURN.
4. Enter your password and hit RETURN.

 

B.        Setup

 

5.         After signing in, you will see this prompt: USERNAME@login:~$

6.         To run the setup, copy the following code to your command line and hit the RETURN key:

sh /home/s/structeng/cqp.sh

7.         Sign out of your ZEDAT account by typing exit

8. To test whether you have access, sign in again (see section A.) and type cqp. The following prompt should appear: [no corpus]>

 

 

 

C.         Adding corpora

 

9.  To add some English language corpora, sign in again, type add-fjord.sh and hit RETURN.

 

10.  To see a list of the corpora you have just installed, open the Corpus Query Processor (cqp) and type show corpora. By typing the name of the corpus (e.g., BNC) and hitting RETURN, the prompt switches to BNC>

Whenever you want to switch to a different corpus, type the name of the corpus and hit RETURN. The prompt will automatically change.

 

TROUBLESHOOTING

If you have trouble setting up the INLET corpus system, make sure that you copied the code and all blank spaces correctly. You can also try another device or browser. If it still does not work, ask your tutor or instructor for help.

 

For more options to access the FU Linux server, visit http://userpage.fu-berlin.de/~structeng/wiki/doku.php?id=resources:fu-linux-server. To add more corpora to your account, see http://userpage.fu-berlin.de/~structeng/wiki/doku.php?id=inlet:setup.

 


2. Simple queries

 

Note: If you have not set up your INLET corpus system yet, please read chapter 1.2 before you continue.

 

2.1 Searching for a single word 

 

To search for a single word, type [word="xxx"] and hit RETURN. Instead of xxx, type the word you are looking for. If you have problems finding the keyboard shortcuts for the square brackets or double quotes, see the appendix.

Note that this is a very short query. You may want to prepare longer queries in a more convenient text editor and then paste them into Shell. A list of text editors for different operating systems can be found in the appendix as well.

 


EXERCISE

Search for the word fabulous in the BNC.

 

SOLUTION

 

[word="fabulous"]

Your query should return 609 matches. You will probably notice that CQP provides more information than just the number of hits: at the top, you see a mask, separated from the results by hyphens and hash signs. This mask contains valuable information about the corpus you are using, the name of your query (by default, Last), the number of matches, the number of characters displayed around your match (by default, 30), and the query itself.


2.2 Navigating through the results

Below the mask you see the concordance, which is a list of results. They are presented in the KWIC format, short for ‘key word in context’. Use the arrow keys (↑ and ↓) to navigate through the results line by line or w and SPACE to jump from page to page. To leave the list of results and start a new query, hit q. You can then either type a new query or use the arrow keys to retrieve a previous query.

 

 

2.3 Searching for different attributes

Let’s have a look at the previous exercise: the type of linguistic information, which we call an attribute in CQP, is indicated on the left side of the equals sign. As we were looking for a particular word, we used the attribute word. On the right side of the equals sign, we inserted a value for our attribute, in this case "fabulous". You could also insert other values for this attribute, like "apple", "house", or "antiauthoritarianism".

 

To search for these different types of linguistic information, we need to change the attribute. With hw we can find all forms of a particular lemma. With the attribute pos, we search the corpus for parts of speech (singular nouns, auxiliary verbs etc.), and with class we search for words that belong to a certain word class (nouns, verbs etc.). For the value, we insert a particular lemma, part of speech, or word class. Note that the different parts of speech and word classes have their own abbreviations. You can find a full list in the appendix.

 

  1. [word="love"] 
  2. [hw="love"]
  3. [pos="NN1"]
  4. [class="SUBST"]

 


EXERCISE

 

Have a look at the four different queries (1-4) and guess what kind of results they would return.

 

SOLUTION

 

-           [word="love"]   All instances of the string love, i.e. the combination of the lowercase letters <l, o, v, e>. This query returns the uninflected verb love as well as the singular noun love, but no inflected forms.

 

-           [hw="love"]         All inflected forms of the verb LOVE (love, loves, loving, loved, lovest) and the noun LOVE (love, loves). Note that lover and lovely will not appear, as they belong to different lemmas (LOVER and LOVELY).

 

-           [pos="NN1"]         All singular nouns in the corpus. (The code NN1 stands for a non-pronominal common noun in the singular.)

 

-           [class="SUBST"] All nouns in the corpus, i.e. proper names as well as pronouns and common nouns in the singular or plural.

ADVICE

If you want to see the annotation of your corpus, you can have tags displayed with the show command. Typing show +pos, for instance, will switch on pos tags in your concordances. To display class tags instead, type show -pos +class.


2.4 Saving queries

 

By default, your latest query is always saved in a variable named Last. When you enter a new query, this variable is overwritten. To keep a query from being overwritten, you must save it under a new name with this simple command:

fabulous = Last

 

You can of course use any name instead of fabulous, as long as it is not a CQP command. Use cat fabulous to display the query again. Type show named to see a list of all variables you have saved. If you want to learn more about saving and combining queries, see:

http://userpage.fu-berlin.de/~structeng/wiki/doku.php?id=cqp:concordances#saving_a_concordance

3. Complex queries

3.1 Alternative values and attributes

We can combine several queries into one and search for multiple words at once with a pipe |, also called logical OR: [hw="(girl|boy)"]

This query will find all occurrences of the headwords GIRL and BOY regardless of capitalization or word form. As you know, you can look for the same word form with the attribute word. The addition %c will further ignore capitalization.

[word="(girl|boy)" %c]

This query will yield the same word forms, but case-insensitive, e.g. "Girl", "BOY" will be matched as well.

You can find |:

-           on a (German) QWERTZ Windows keyboard: press  ALTGR + <

-           on a QWERTZ Mac keyboard: OPTION + 7

 

We can specify multiple alternative hw, class and/or pos attributes. For example, if we want to find all adjectives in either the comparative (pos="AJC") or the superlative form (pos="AJS"), we can specify both, connected by logical OR (the pipe |):

[pos="(AJC|AJS)"]

3.2 Combining values and attributes

We can also combine multiple attributes and values to make our results meet certain requirements. For example, if we want to look for the word form "love", but only as a verb, we can specify those two conditions in a single query with an ampersand &:

[word="love" & class="VERB"]

This query will find all instances of the verb form love, but not the noun love.

 

3.3 Multi-word queries

To search for complex patterns, use one set of square brackets per word slot:

[word="my"] [word="favourite"] [word="things"]

In each set of square brackets, you can specify whichever attributes you are interested in:

[class="PRON"] [word="favourite"] [pos="NN0"]

You can also leave a slot completely empty:

[word="my"] [word="favourite"] []

Note that an empty set of square brackets will also match punctuation marks.

The question mark ? makes the preceding slot optional; curly brackets (see section 4.6) can specify how many times a slot may occur:

[class="PRON"] []? [pos="NN0"]


EXERCISE

Formulate a query that matches all instances of any adjective followed by the word "snow" OR "love" (but only as nouns), ignoring case, in the BNC.

SOLUTION

[class="ADJ"] [word="(snow|love)" %c & class="SUBST"]

We have to formulate a multi-word query that, first, searches for the class "adjective" and, second, for the words "snow" or "love" as nouns. The second pair of square brackets must state both conditions: the words and the class. As we are ignoring case for the words, we add %c.

Alternatively, you could opt for [hw="(snow|love)" & class="SUBST"], because the tag hw will automatically ignore case. That way, you’ll also match instances in the plural.


 

3.4 PrintStructures: display text ID, genre etc.

The texts in the BNC have different attributes, called metadata, for instance text_id, text_title, text_genre, text_domain. You can find these listed in the info file of a corpus. For that, enter a corpus and type info.

 

To display these attributes with every concordance, we can use the command set PrintStructures. If you are interested in seeing which text type or genre your concordance lines are from, type:

set PrintStructures "text_genre"[2]

Once you query the corpus, the PrintStructure will be displayed with every concordance from then on. You can look up several attributes at a time if you list them in one command, separated by commas:

set PrintStructures "text_id, text_title, text_genre"

 

It is important that you list them all in one command rather than execute several commands in a row because every command overwrites the previous one. To turn off PrintStructures, use the command without any arguments:

set PrintStructures ""

3.5 Match: search specific texts (subcorpora)

You can restrict your search to specific texts or other metadata like text mode (written or spoken), the gender of the author or speaker and so on with the match command. Each of these categories comprises a so-called subcorpus.

To examine language in fictional texts, for example, we could look at the subcorpora for drama, poetry and prose. These are all values of the category text_genre.

You can find every available category and its values listed in the info file.

We can look for the lemma flower like this:

[hw="flower"] :: match.text_genre="W:fict:drama"

 

Or, keeping in mind that W:fict:drama, W:fict:poetry and W:fict:prose are the only genre values with fict in the name: [hw="flower"] :: match.text_genre="W:fict.+"

 

Like before, we can combine several conditions with an ampersand &. So we can look for all instances of flower in the subcorpus that were also written by a woman like this:

[hw="flower"] :: match.text_genre="W:fict.+" & match.text_author_sex="female"

 

 

3.6 Group: count by metadata

We can count how often a particular query has returned a hit in a particular category with the group command:

group Last match text_genre 

The result is a list showing all genres in which the query has found matches, ordered by descending number of matches in each genre.


 

EXERCISE

Formulate a query that matches all occurrences of "hence" in essays published between 1985 and 1993.

 

SOLUTION

In the BNC’s info file, you’ll find W:essay:school and W:essay:univ in the category text_genre. We can match both with "W:essay.*" or "W:essay:(school|univ)". In the category text_publication_date, you’ll also find the time period 1985-1993.

 

[hw="hence"] :: match.text_genre="W:essay.*" & match.text_publication_date="1985-1993"


3.7 Counting concordances by position

We already know that matches are stored in the variable Last. You can count your (last) matches by different attributes like hw, class or pos:

count Last by word

You can add %c to ignore case in your results. Once you enter the command, you will be returned a frequency list.

By default, this command returns a frequency list for the position match[0], i.e. the first word that matches your query. To get a list of all words occurring directly after that first word, we move one position to the right:

count Last by word on match[1]

 

To count the occurrences by class in the position directly before the match, type:

count Last by class on match[-1]

 

There’s also an anchor called matchend[0], which refers to the last word of your match. From there, you can navigate as shown before, e.g. matchend[1] or matchend[-1].

4. Repetition Operators

4.1 Match any symbol with the wildcard: .

The period . can match any character.

If you try the query [word="."], you’ll see that it matches every token that consists of a single character, punctuation marks included.

The following regular expressions make this wildcard especially useful.

4.2 Repeating one or more times: +

You already know ".", which can match any letter or symbol; "?", which makes the preceding expression optional; and "|", which lets a query alternate between two values. With "+" we can repeat a letter or symbol one or more times.

 

To look at the exclamation "oh", knowing that it may well be spelled "ohh" or "ohhh" based on the intensity of the expressed sentiment, we can formulate a query that catches all of its spellings as follows: 

[hw="oh+"]

All matches will be spelled with at least one "h".

4.3 Repeating zero or more times: *

With "*" we can repeat a letter or symbol zero or more times. For example, to match all words that start with un- and end with -able, including unable, we can formulate a query that contains both affixes and between them a wildcard to match any letter, and the asterisk * to repeat that wildcard zero or any number of times, because the word could have any length.

[hw="un.*able"]

This matches words like uncomfortable, unreasonable and unable.[3] 

4.4 Alternating between single symbols: [ ]

We can also alternate between single letters or groups of letters within one query.

If you want to look for the words into and onto, you can use square brackets [] to list letters that we want to alternate between, in this case <i> and <o>.[4]

[hw="[io]nto"]

 

4.5 Alternating between strings within a string: ( )

Let’s alternate between groups of letters and search for the lemmas uptown and downtown in a single query. For that, we list every letter group inside of parentheses and separate them with a pipe |. Do it like so:

[hw="(up|down)town"]

4.6 Min and max number of matches: {}

Say you want to match a sequence in which one slot may be filled by a variable number of words.

[word="it"][hw="be"][]{1,3}[word="to"][class="ADJ"]

In this example, the braces dictate that the preceding [] slot may occur at least once and three times at most. So, the first number sets the lower limit and the second the upper one. We could match phrases like it was logical to assume or it is specifically designed to assist.

You can also specify a number:

[class="ADJ"]{3}[hw="thing"]

Or set an upper limit without a lower one:

[word="to"][]{,3}[class="VERB"]

Or a lower one without an upper one:

[word="to"][]{1,}[class="VERB"]

4.7 Escaping special characters

You can escape all of the special characters from earlier, i.e. use them like normal symbols, with a backslash \.

[word="\?"] 

This will match all question marks in the corpus.

 


EXERCISE

Formulate a query that matches snowed and snowing and a second query that matches snow and slow with two of the methods shown above. Ignore case for both queries.

SOLUTION

 

[word="snow(ed|ing)" %c]

[word="s[nl]ow" %c]

 

With the addition %c, the query ignores case. As we are looking for specific word forms and not the lemma, we use the attribute word. With the pipe between strings listed within parentheses, we provide alternative strings; within square brackets, we list alternative letters or symbols.


5. Beyond querying

 

5.1 Sorting and sampling

By default, CQP orders concordances by text ID. The sort command can display concordance lists in a different order, for example by class, POS or lemma.

To order them by class, for instance, type:

sort Last by class

 

To reset the order to the default: sort Last

 

5.2 Randomizing concordances

To get a more representative impression of our concordances and their context, it is wise to put them in random order. Randomize your results stored in "Last" like this:

sort Last randomize

 

5.3 Reducing concordances

To extract a sample of your concordance list, use the reduce command.[5]

You can reduce your concordances to a specified number: reduce Last to 200

Or to a percentage: reduce Last to 20%

ALWAYS randomize before you reduce! Otherwise your sample will contain the first concordance lines in corpus order, which may mean that they are all from the same text(s).

5.4 Exporting concordances

Your results can be saved to a file. Concordances are exported by redirecting the output of the cat command to a ready-made Perl script that ‘cleans up’, i.e. formats, the data, and redirecting that output, in turn, into a text file whose name you specify.

In short, it is done like this:

cat Last > "| tidycwb.pl > filename.txt"

Instead of a TXT-file, you can also choose a comma-separated format, i.e. CSV (e.g. filename.csv).

 

5.5 Exporting frequency lists

To save frequency lists in a file, you need the whole command that generates the list and redirect its output into the cleaning script and that output into a file, whose name you specify.

count Last by word > "| tidycwb.pl > list.txt"

 

You can access the files in the file repository "Datenablage" of your Zedat account. You can find it here: https://www.zedat.fu-berlin.de/Home.[6] 

5.6 Spreadsheet software

To analyze and annotate data manually, you can use spreadsheet software like Excel or LibreOffice. The latter is easier to handle, as it produces fewer encoding errors when you import data. Open a new Calc spreadsheet and simply drag and drop the TXT file into it. A dialogue window pops up and asks you to select a separator. If the columns in your file are separated by commas, tick the comma box and click OK.
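If you would rather process an exported frequency list with a short script than with a spreadsheet, it can be read in a few lines of Python. This is only a sketch under assumptions: the file name list.csv and the plain word,frequency column layout are hypothetical, so check your exported file first to see what tidycwb.pl actually produces.

```python
import csv
import io

# Stand-in for an exported frequency list with assumed "word,frequency" rows;
# for a real file, replace the StringIO with: open("list.csv", newline="")
sample = io.StringIO("love,609\nthings,123\nsnow,45\n")

# Read the rows into a dictionary mapping each word to its frequency
frequencies = {word: int(freq) for word, freq in csv.reader(sample)}

# Rank the words by descending frequency
ranked = sorted(frequencies, key=frequencies.get, reverse=True)
print(ranked)  # → ['love', 'things', 'snow']
```

The resulting dictionary can then feed further calculations, for example the observed frequencies used in the statistical tests of chapter 6.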


EXERCISE

Search for the lemma "slow" as a verb in the BNC-BABY. Save the concordance in a variable, randomize them (with the variable) and extract a sample of 50 concordances. Count all words directly following your match in the concordances, ignoring case, and export that frequency list into a CSV-file.

SOLUTION

[hw="slow" & class="VERB"]

slow_concordances = Last

sort slow_concordances randomize

reduce slow_concordances to 50

count slow_concordances by word %c on match[1] > "| tidycwb.pl > slow_frequencies.csv"

 

For more details on sorting and sampling, see this page on the wiki: http://userpage.fu-berlin.de/~structeng/wiki/doku.php?id=cqp:sorting-sampling


6. Statistical testing

6.1 A short introduction

To determine whether your corpus findings are statistically meaningful, that is, whether they differ from what you would expect by chance, a statistical test needs to be conducted. As Anatol Stefanowitsch explains in his book Corpus Linguistics: A Guide to the Methodology, scientific hypotheses are usually formulated in such a way that they can only be falsified, not verified. If we fail to falsify an assumption, we can take that to mean that it may be true. We therefore formulate a so-called null hypothesis (H0), which states the opposite of our initial hypothesis and is the one we try to disprove. This approach is called null hypothesis significance testing.

Worry not, a practical example will make this clearer: if we look at how the occurrences of the adjectives "little" and "small" are distributed across the words "girl" and "boy" in the BNC, we get the two-by-two contingency table below. Based on the data, we could formulate the regular hypothesis (H1) that the adjectives are not randomly distributed, i.e. that the choice of adjective is influenced by the noun ("girl" or "boy") that it modifies. But to examine our H1 in accordance with the approach stated above, we also formulate an H0, which posits that the data is in fact randomly distributed, i.e. that there is no association between the choice of adjective and the noun. This H0 hypothesis is the one we now try to falsify. We do that with a statistical test.[7]

 

 

              small   little   Row Total
girl             81     1151        1232
boy             336      791        1127
Column Total    417     1942        2359

6.2 The chi-square test

If no cell in this table has a value of 0 and no more than a quarter of the expected frequencies are below 5, the chi-square test can be used to determine the p-value: the probability of our data if the H0 hypothesis is true. Should the p-value fall below 0.05, meaning that the probability of our data under H0 is less than 5% and the data thereby very unlikely, we can consider H0 falsified and reject it (in other words: our initial H1 hypothesis might be correct). A value of p < 0.05 is customarily used to attribute statistical significance.

 

Above are the observed frequencies; now we need to calculate the expected frequencies. For every cell, this is done by multiplying its row total by its column total and dividing the result by the table total:

 

          Column 1                     Column 2                     Total
Row A     (1AB × A12) / Table Total    (2AB × A12) / Table Total    A12
Row B     (1AB × B12) / Table Total    (2AB × B12) / Table Total    B12
Total     1AB                          2AB                          Table Total

(Here 1AB stands for the total of column 1, i.e. cells 1A + 1B, and A12 for the total of row A, i.e. cells A1 + A2.)

This is what the data would look like if the distribution were perfectly random:

        small     little
girl    217.78    1014.22
boy     199.22     927.78

 

As a next step, the differences between the observed and expected frequencies are calculated: for each cell, we subtract the expected frequency from the observed one, square that difference, and divide it by the expected frequency.

 

(observed – expected)² / expected

 

 

        small    little
girl    85.91    18.45
boy     93.91    20.16

Now all these numbers, the cell components of chi-square, need to be added up, which gives us a chi-square value of 218.43.

 

Before checking the associated p-value, the size of the table needs to be considered. To do that, we calculate the so-called degrees of freedom like so:

 

df = (Nrows – 1) × (Ncolumns – 1)

 

For our table, this comes to 1.  
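The whole calculation, from expected frequencies to degrees of freedom, can be double-checked with a short script. Below is a minimal sketch in Python (this runs outside CQP and is purely illustrative):

```python
# Double-checking the chi-square calculation for the girl/boy vs. small/little
# contingency table from the BNC (observed frequencies as given above).
observed = {
    ("girl", "small"): 81,  ("girl", "little"): 1151,
    ("boy", "small"): 336,  ("boy", "little"): 791,
}
rows, cols = ("girl", "boy"), ("small", "little")

row_total = {r: sum(observed[r, c] for c in cols) for r in rows}
col_total = {c: sum(observed[r, c] for r in rows) for c in cols}
table_total = sum(observed.values())

# Expected frequency per cell: row total * column total / table total
expected = {(r, c): row_total[r] * col_total[c] / table_total
            for r in rows for c in cols}

# Chi-square: sum of (observed - expected)^2 / expected over all four cells
chi_square = sum((observed[cell] - expected[cell]) ** 2 / expected[cell]
                 for cell in observed)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
df = (len(rows) - 1) * (len(cols) - 1)

print(round(chi_square, 2), df)  # → 218.43 1
```

The printed chi-square value and df can then be looked up in the critical-values table below.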

Looking at the row for the corresponding df value in the table below, i.e. row 1, we check which critical values our calculated chi-square value exceeds. As our chi-square value is larger than 10.828, the corresponding p-value is lower than 0.001. Therefore, the deviation of our data from what would be expected is highly statistically significant.

Significance levels (probabilities of error), from marginally significant (0.100) to highly significant (0.001):

degrees of freedom    0.100     0.050     0.025     0.010     0.005     0.001
 1                    2.706     3.841     5.024     6.635     7.879    10.828
 2                    4.605     5.991     7.378     9.210    10.597    13.816
 3                    6.251     7.815     9.348    11.345    12.838    16.266
 4                    7.779     9.488    11.143    13.277    14.860    18.467
 5                    9.236    11.070    12.833    15.086    16.750    20.515
 6                   10.645    12.592    14.449    16.812    18.548    22.458
 7                   12.017    14.067    16.013    18.475    20.278    24.322
 8                   13.362    15.507    17.535    20.090    21.955    26.124
 9                   14.684    16.919    19.023    21.666    23.589    27.877
10                   15.987    18.307    20.483    23.209    25.188    29.588


Appendix

  1. Text editors

You can use these programs to write your queries before pasting them into CQP. In contrast to programs like Word, they won’t autocorrect any characters.

Online

  • vim: open the ZEDAT shell, log on, and type vim. The text editor opens automatically.

Windows

  • vim
  • Notepad++

Mac

  • BBEdit
  • Skripteditor (preinstalled)

2. Special characters

sign   name                      Windows (German keyboard)   Mac (German keyboard)
(      opening round bracket     Shift + 8                   Shift + 8
)      closing round bracket     Shift + 9                   Shift + 9
[      opening square bracket    Alt Gr + 8                  Alt + 5
]      closing square bracket    Alt Gr + 9                  Alt + 6
{      opening curly bracket     Alt Gr + 7                  Alt + 8
}      closing curly bracket     Alt Gr + 0                  Alt + 9
<      opening angle bracket     <                           <
>      closing angle bracket     Shift + <                   Shift + <
|      pipe                      Alt Gr + <                  Alt + 7
*      asterisk                  Shift + +                   Shift + +
:      colon                     Shift + .                   Shift + .
_      underscore                Shift + -                   Shift + -
"      double quote              Shift + 2                   Shift + 2
'      single quote              Shift + #                   Shift + #
&      ampersand                 Shift + 6                   Shift + 6
%      percent sign              Shift + 5                   Shift + 5
/      slash                     Shift + 7                   Shift + 7
\      backslash                 Alt Gr + ?                  Alt + Shift + 7


[1] Usually, the Terminal on Linux systems runs the shell Bash, which is also the shell used on the university server. Your macOS Terminal can be set to run Bash as well.

[2] You can either use double quotes "" or single quotes ‘’, but be consistent within a single query.

[3] If we used + instead of * (e.g. [hw="un.+able"] ), we would not match unable.

[4] This query will match into and onto. You can try [hw="[iou]nto"] to include unto as well.

[5] Note that in any of these examples, you can use another variable, in which you have stored your search, in place of "Last".

[6] You can also exit CQP, which will put you in the home folder of your ZEDAT account, where you can list all your files with the command ls and display each one with cat. 

[7] For more detailed information on the use of statistics and the chi-square test specifically, refer to Corpus Linguistics: A Guide to the Methodology by Anatol Stefanowitsch (2020), pp. 166-170 (statistics), 177-183 (the chi-square test), and 447 (chi-square values table). The PDF is available for free.