
Word lists by frequency

Word lists by frequency are lists of a language's words grouped by frequency of occurrence within some given text corpus, either by levels or as a ranked list, serving the purpose of vocabulary acquisition. A word list by frequency 'provides a rational basis for making sure that learners get the best return for their vocabulary learning effort' (Nation 1997), but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of 'word'. While word counting is a thousand years old, and very large analyses were still done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated the research field.
In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency usually means the number of occurrences in a given corpus; the rank of a word can be derived from its position in the list.
Nation (1997) noted the considerable help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists.
Most currently available studies are based on written text corpora, which are more easily available and easier to process. However, New et al. 2007 proposed tapping into the large number of subtitles available online to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach and supported a move from written corpora toward the analysis of oral corpora, made possible by open film subtitles available online. This has been followed by a handful of further studies providing valuable frequency counts for various languages. Indeed, the SUBTLEX movement completed in five years full studies for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al.), Vietnamese (Pham, Bolger & Baayen 2011), Brazilian Portuguese (Tang 2012) and European Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013) and Polish (Mandera et al. 2014). SUBTLEX-IT (2015) provides raw data only.
In any case, the basic 'word' unit should be defined. For Latin scripts, words are usually one or several characters separated by spaces or punctuation. But exceptions can arise, such as English 'can't', French 'aujourd'hui', or idioms. It may also be preferable to group the words of a word family under its base word.
Thus, possible, impossible and possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, all these words are summed up under the base word form *possib*, allowing the concept and the occurrences of its forms to be ranked together. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given chain of several characters can be interpreted either as a phrase of single-character words or as a multi-character word.
It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
German linguists define the Häufigkeitsklasse (frequency class) $N$ of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item:
$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{frequency of this item}}{\text{frequency of the most frequent item}}\right)\right\rfloor$$
where $\lfloor \ldots \rfloor$ is the floor function. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. For example, the misspelled word outragious, with a ratio of 76/3789654, belongs in class 16.
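The definitions above (a frequency list as word types sorted by number of occurrences, with rank given by list position, and the frequency class derived from the base 2 logarithm of a frequency ratio) can be illustrated with a minimal Python sketch. The function names `frequency_list` and `frequency_class` are hypothetical, the tokenizer is a deliberately simple assumption that keeps apostrophe-joined forms such as 'can't' as one token and does no word-family grouping, and the 0.5 offset inside the floor is assumed because it reproduces the three cases stated in the text (class 0, class 1, and class 16 for 76/3789654).

```python
import math
import re
from collections import Counter


def frequency_list(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumed tokenizer: lowercase runs of letters, allowing internal
    # apostrophes so that "can't" or "aujourd'hui" stay single tokens.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(tokens)
    # Sort by descending count; ties are broken alphabetically so the
    # ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def frequency_class(count, max_count):
    """Frequency class N = floor(0.5 - log2(count / max_count))."""
    return math.floor(0.5 - math.log2(count / max_count))


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    entries = frequency_list(sample)
    top_count = entries[0][2]
    for rank, word, count in entries:
        print(rank, word, count, frequency_class(count, top_count))
```

On a real corpus the tokenizer is the part most likely to need replacing, since the definition of 'word' (idioms, word families, scripts without spaces such as Chinese) is exactly where the text locates the main difficulties.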
