SUBTLEXus

WORD FREQUENCY AMERICAN ENGLISH

The importance of word frequency

Word frequency is an important variable in cognitive processing. High-frequency words are perceived and produced faster and more efficiently than low-frequency words. At the same time, they are easier to recall but more difficult to recognize in episodic memory tasks.

The poor quality of Kucera and Francis (1967) and Celex (1993)

To investigate the word frequency effect or to match stimuli on word frequency, psychologists need estimates of how often words occur in a language. In American English the Kucera and Francis (KF) frequencies have become the norm. This is surprising because the KF frequencies are dated (from 1967) and based on a corpus of only 1.014 million words. Several studies have confirmed the poor quality of the Kucera and Francis word frequencies (Burgess & Livesay, 1998; Zevin & Seidenberg, 2002; Balota et al., 2004).

Another word frequency measure regularly used is based on the Celex database (Baayen, Piepenbrock, & van Rijn, 1993). This measure is better than Kucera and Francis, but not optimal either (Balota et al., 2004; Zevin & Seidenberg, 2002).

To assess the quality of a frequency measure, one needs word processing times. These have become available as part of the Elexicon project (http://elexicon.wustl.edu/). Brysbaert & New (Behavior Research Methods, in press) calculated the percentages of variance accounted for by Kucera and Francis and by Celex in the accuracies and reaction times of a lexical decision task.

Frequency measure	Accuracy, all words (N=37,059)	RT, all words (N=31,201)
Kucera and Francis	19.6	57.7
Celex	25.2	60.6

Improved frequency measures based on American English subtitles (SUBTLEX_US)

Brysbaert & New compiled a new frequency measure on the basis of American subtitles (51 million words in total). There are two measures:

The frequency per million words, called SUBTL_WF (Subtitle frequency: word form frequency)
The percentage of films in which a word occurs, called SUBTL_CD (Subtitle frequency: contextual diversity; see Adelman, Brown, & Quesada (2006) for the qualities of this measure).

The percentage of variance accounted for by these measures is significantly higher than that accounted for by Kucera and Francis or by Celex.

Frequency measure	Accuracy, all words (N=37,059)	RT, all words (N=31,201)
SUBTL_WF	30.1	62.3
SUBTL_CD	31.3	62.9

For short words, the percentages of variance accounted for are also better than the fit with HAL, Zeno et al., and the word frequencies based on the British National Corpus. In addition, the corpus indicates which words are likely to be used as names (e.g., Mark, Archer, etc.). The total frequencies of these words overestimate how often they occur as ordinary words: more variance in RTs is accounted for when only the occurrences starting with a lowercase letter are counted than when the total frequencies are used. Download the full analysis by Brysbaert & New.

Download the new frequency measures

The new frequency measures in the SUBTLEX-US database can be found here:

Zipped Excel file with 60,384 words that have a frequency higher than 1 (interesting for everyone looking for good word frequencies in American English),
Zipped Excel 2007 file with all 74,286 words in the corpus (interesting for those who need word frequencies in American English and have MS Office 2007)
Zipped Text version with all 74,286 words in the corpus (interesting for those who need word frequencies in American English and do not have MS Office 2007)
Zipped Text file with the raw data on all 282,170 letter strings in the corpus (mainly of interest to those working on frequency measures themselves)

How to read the files?

The Excel files contain the following information:

The word. This starts with a capital when the word more often starts with an uppercase letter than with a lowercase letter.
FREQcount. This is the number of times the word appears in the corpus (i.e., on the total of 51 million words).
CDcount. This is the number of films in which the word appears (i.e., it has a maximum value of 8,388).
FREQlow. This is the number of times the word appears in the corpus starting with a lowercase letter. This allows users to further match their stimuli.
CDlow. This is the number of films in which the word appears starting with a lowercase letter.
SUBTL_WF. This is the word frequency per million words. It is the measure you would preferably use in your manuscripts, because it is a standard measure of word frequency independent of the corpus size. It is given with two-digit precision, in order not to lose precision of the frequency counts.
Lg10WF. This value is based on log10(FREQcount+1) and has four-digit precision. Because FREQcount is based on 51 million words, the following conversions apply for SUBTLEX-US:

Lg10WF	SUBTL_WF
1.00	0.2
2.00	2
3.00	20
4.00	200
5.00	2000

SUBTL_CD. This is the percentage of films in which the word appears. This value has two-digit precision in order not to lose information.
Lg10CD. This value is based on log10(CDcount+1) and has four-digit precision. It is the best value to use if you want to match words on word frequency. As CDcount is based on 8,388 films, the following conversions apply:

Lg10CD	SUBTL_CD
0.95	0.1
1.93	1
2.92	10
3.92	100
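
The relations between the count columns and the derived columns described above can be sketched in Python. This is only an illustration, not the code used to build the files: it assumes the rounded corpus size of 51 million words and the 8,388 films mentioned above, and the function names are invented for this example. Because the corpus size is rounded, results may differ slightly from the values in the downloaded files.

```python
import math

# Assumptions taken from the column descriptions above (rounded values).
CORPUS_WORDS_MILLIONS = 51.0  # total corpus size in millions of words
N_FILMS = 8388                # total number of films in the corpus


def subtl_wf(freq_count):
    """Frequency per million words, two decimals (from FREQcount)."""
    return round(freq_count / CORPUS_WORDS_MILLIONS, 2)


def lg10wf(freq_count):
    """log10(FREQcount + 1), four decimals."""
    return round(math.log10(freq_count + 1), 4)


def subtl_cd(cd_count):
    """Percentage of films containing the word, two decimals (from CDcount)."""
    return round(100.0 * cd_count / N_FILMS, 2)


def lg10cd(cd_count):
    """log10(CDcount + 1), four decimals."""
    return round(math.log10(cd_count + 1), 4)


if __name__ == "__main__":
    # A word seen 99 times in total, in 84 different films.
    print(subtl_wf(99), lg10wf(99))   # about 2 per million, Lg10WF = 2.0
    print(subtl_cd(84), lg10cd(84))   # about 1% of films, Lg10CD near 1.93
```

The example values line up with the conversion tables: a Lg10WF of 2.00 corresponds to roughly 2 occurrences per million words, and a Lg10CD of about 1.93 corresponds to roughly 1% of the films.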

Get the SUBTLEX frequencies for your list of words!

At http://subtlexus.lexique.org you enter a list of words and immediately get your SUBTLEX frequencies. This site also allows you to select stimuli within a specific frequency range (e.g. between 1 and 10 per million).
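
If you work with the downloaded files instead of the website, a selection by frequency range can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the website's implementation: the function name is hypothetical, the bounds are treated as inclusive, and the example dictionary contains placeholder numbers rather than real SUBTLEX-US values.

```python
def select_by_frequency(freqs, low, high):
    """Return, in sorted order, the words whose per-million frequency
    lies between low and high (both bounds inclusive, an assumption)."""
    return sorted(word for word, f in freqs.items() if low <= f <= high)


# Placeholder numbers for illustration only, NOT real SUBTLEX-US values.
example = {"alpha": 0.5, "beta": 3.2, "gamma": 9.9, "delta": 41.0}

if __name__ == "__main__":
    print(select_by_frequency(example, 1, 10))
```

In practice the dictionary would be filled from the word and SUBTL_WF columns of one of the files listed above.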

Part-of-Speech information added to the SUBTLEX-US word frequencies

We have now tagged the SUBTLEX-US corpus with the CLAWS tagger, so that we can add Part-of-Speech (PoS) information to the SUBTLEX-US word frequencies. Five new columns have been added to the file:

The dominant (most frequent) PoS of each entry
The frequency of the dominant PoS
The relative frequency of the dominant PoS
All PoS observed for the entry
The frequencies of each PoS

You can find more information about the tagging in Brysbaert, New, & Keuleers (Behavior Research Methods, 2012).

Download an Excel version of the SUBTLEX-US word frequency file with PoS information.

Zipf values added to the SUBTLEX-US frequencies

In Van Heuven, Mandera, Keuleers, & Brysbaert (QJEP, 2014) we proposed a new frequency measure, the Zipf scale, which is much easier to understand than the usual frequency measures. Zipf values range from 1 to 7, with the values 1-3 indicating low-frequency words (with frequencies of 1 per million words and lower) and the values 4-7 indicating high-frequency words (with frequencies of 10 per million words and higher). Download a zipped Excel file of SUBTLEX-US with the Zipf values included.
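
The anchor points mentioned above (1 per million at Zipf 3, 10 per million at Zipf 4) follow from defining the Zipf value as the base-10 logarithm of the frequency per billion words, which equals log10(frequency per million) + 3. The Python sketch below shows only this basic relation. The function name is invented for illustration, and any smoothing for very rare or unseen words described in the paper is not reproduced, so values may differ slightly from those in the downloadable file.

```python
import math


def zipf_from_per_million(freq_per_million):
    """Zipf value = log10(frequency per billion words)
                  = log10(frequency per million words) + 3.

    Basic relation only; no smoothing for zero counts is applied.
    """
    if freq_per_million <= 0:
        raise ValueError("frequency per million must be positive")
    return math.log10(freq_per_million) + 3.0


if __name__ == "__main__":
    for fpm in (0.01, 1, 10, 1000):
        print(fpm, round(zipf_from_per_million(fpm), 2))
```

On this scale each step of 1 corresponds to a tenfold change in frequency.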
