Wiktionary:Corpora

From Wiktionary, the free dictionary

This page is dedicated to listing collections of texts useful for the work of creating a dictionary. These collections are often known as "corpora" or less commonly "corpuses". Many of them feature functions like full-text search, term frequency information and collocation search.

For a more user-friendly introduction to some of the most prominent corpora, as well as other resources like dictionaries, see Wiktionary:Quotations/Resources. Another page, Wiktionary:Searchable external archives, focuses more specifically on resources that can reliably provide citations passing Wiktionary's criteria for inclusion.

Note that corpora that contain text in multiple languages but where English text makes up a significant portion of the corpus are listed in the English table below, with the word "Multilingual" included in their "Dialect" entry.

If there are any other resources that you know of which aren't listed here, please do add them or suggest them on the talk page.

Glossary

The following is a brief explanation of how various terms are used in describing and categorizing the resources on this page.

  • Access restrictions: Any barriers to accessing the resource's contents, such as registration or paying a subscription. A number of listed resources can be accessed through the Wikipedia Library for free.
  • Active: Whether a corpus is still publicly accessible, either through the internet or another method. See also the synonymous use of strikethroughs described below.
  • Available medium: The format through which the language can be accessed in the resource, such as written text or speech recorded in a video.
  • Library: Collection of texts gathered with a wide net and without linguistic work particularly in mind. It must be possible to search the contents of these texts.
  • Native Name: The name of the resource in its default language. Doesn't necessarily match the language of the resource's contents. A dash is used instead if the default language of the resource is English.
  • Original Medium: The way the language was originally produced, whether it was spoken, written, etc.
  • Social media: A live website or other online center for mass user communication, or the attempt at a near complete archive of such. If the resource is an archive with a particular focus, then it is considered a library or corpus.
  • Re-use restrictions: Unique restrictions on the distribution of the resource's contents beyond general copyright law, in particular restrictions on commercial use or to academic users only. This restriction is particularly relevant to Wiktionary, where all content must be able to be redistributed commercially per Wiktionary's CC BY-SA 4.0 license.
  • Tagged Corpus: Collection of texts gathered within a specific scope with linguistic work at least partly in mind. The contents of the texts are marked by part of speech, meaning, pragmatics, or any other method.
  • Translated Name: A translation of the resource's name from another language if the default name of the resource is not already in English.
  • Text: A continuous use of language published, released, or spoken as a coherent work. This could be a forum post in a thread, a book, an issue of a magazine, or a speech.
  • Untagged Corpus: Collection of texts gathered within a specific scope with linguistic work at least partly in mind. The contents of the texts are not marked by part of speech, meaning, pragmatics, or any other method.

Symbols

  • Approx.: "Approximately", used to indicate that a date or quantity was estimated or not exactly known when inputted, but a best guess was given.
  • e: "Exponent", used to represent sizes as part of E notation, a type of scientific notation. "5e3", for example, represents 5×10³, or a 5 with three 0's after it: 5,000.
  • Esp.: "Especially", used to qualify the most common quality of a resource, even if there are notable exceptions.
  • Dash (—): The dash symbol "—" is usually used in tables for information about a resource that cannot be readily determined or approximated. Sometimes it is also used when a particular column is not applicable to a resource.
  • Question mark (?): The symbol "?" is used in tables for information about a resource that has not yet been determined, but probably could be.
  • Strikethrough: Resources with their names crossed out with a strikethrough were nonfunctional or otherwise broken at the time of the entry's last update. See also the synonymous table column "Active" described above.
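
The E notation described above can be expanded mechanically. The following is a minimal sketch in Python, assuming a size cell consists of an integer mantissa, the letter "e", and an integer exponent, optionally followed by footnote markers such as "[12]". The function name parse_size is hypothetical and not part of any Wiktionary tooling.

```python
import re
from typing import Optional

# Hypothetical helper: matches cells such as "5e3" or "2e7[12][13]".
# Anything after the exponent (footnote markers, notes) is ignored.
_SIZE_PATTERN = re.compile(r"^\s*(\d+)e(\d+)")


def parse_size(cell: str) -> Optional[int]:
    """Expand an E-notation size cell to an integer.

    Returns None for cells that hold no size, such as "?" or a dash.
    """
    match = _SIZE_PATTERN.match(cell)
    if match is None:
        return None
    mantissa, exponent = match.groups()
    # Integer arithmetic avoids floating-point rounding on large sizes.
    return int(mantissa) * 10 ** int(exponent)


if __name__ == "__main__":
    for cell in ["5e3", "2e10", "2e7[12][13]", "?"]:
        print(cell, "->", parse_size(cell))
```

Because the sizes in the tables are rounded to one significant figure in most cases, the expanded values are order-of-magnitude estimates rather than exact counts.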

English corpora

English corpora table
Name Resource Type Size in words[1] Size in texts[1] Dialect Start year End year Original Medium Available Medium Genre Re-use restrictions Access restrictions Active Date of entry update
News on the Web (NOW) Corpus, Tagged 2e10 3e7 (Various)[2][3] 2010 Present Written, Computer, Internet Written Nonfiction, News None Free registration required Yes 2022/10/31
iWeb: The Intelligent Web-based Corpus Corpus, Tagged 1e10 2e7 (Various)[4][3] 2017 2017 Written, Computer, Internet Written General, esp. Nonfiction None Free registration required Yes 2022/10/31
Global Web-Based English (GloWbE) Corpus, Tagged 2e9 2e5 (Various)[2][3] 2012 2013 Written, Computer, Internet Written General, esp. Nonfiction None Free registration required Yes 2022/10/31
Wikipedia Corpus Corpus, Tagged 2e9 4e6 (Various) 2014 2014 Written, Computer, Internet Written Nonfiction, Encyclopedia None Free registration required Yes 2022/10/30
Coronavirus Corpus[5] Corpus, Tagged 2e9 2e6 (Various)[2][3] 2020 2022 Written, Computer, Internet Written Nonfiction, News, COVID-19 None Free registration required Yes 2024/05/10
Corpus of Contemporary American English (COCA) Corpus, Tagged 1e9 5e5 American 1990 2019 Multimedia Written General, esp. Nonfiction None Free registration required Yes 2023/03/27
Early English Books Online (EEBO) Corpus, Tagged 8e8 3e4 British 1470 (approx.) 1690 (approx.) Written, Books, Print Written General None Free registration required Yes 2022/10/30
Early English Books Online (EEBO) TCP Corpus, Untagged 6e4 British 1475 1700 Written, Books, Print Written General None None Yes 2022/10/31
Early English Books Online (EEBO, V2) Corpus, Untagged 6e8 1e4 British 1470 (approx.) 1690 (approx.) Written, Books, Print Written General None Free registration required Yes 2022/11/02
Filmot Library 5e8 (Various, Multilingual) 2005 (approx.) Present Spoken, General Audio-visual General, esp. Nonfiction None None Yes 2022/10/30
YouGlish Library 1e8 (Various) 2005 (approx.) Present Spoken, Formal[6] Audio-visual Nonfiction None None Yes 2022/10/30
TED Corpus Search Engine (TCSE) Corpus, Tagged 1e7 5e3 (Various) 2007 2023[7] Spoken, Formal, Speeches Audio-visual Nonfiction None None Yes 2022/10/30
Archive-It Collections Library 2e6 (Various) 1996 Present Written, Computer, Internet Written General, esp. Nonfiction None None Yes 2022/10/30
ACL Anthology Reference Corpus (ARC) Corpus, Tagged 6e7 2e4 (Various) 1979 2015 Written, Periodicals, Journals Written Nonfiction, Academic, NLP None None Yes 2022/10/30
COVID-19 Open Research Dataset (CORD-19) Corpus, Tagged 3e9 7e5 (Various) 1922[8] 2020[9] Written, Periodicals, Journals Written Nonfiction, Academic None None Yes 2022/10/30
EcoLexicon English Corpus, Tagged 2e7 2e3 (Various) 1973 2016 Written Written Nonfiction, Academic, Environment None None Yes 2022/10/30
Lipstick Alley Social Media American, African 2000 Present Written, Computer, Social Media, Forum Written General, esp. Nonfiction, Celebrity News None Free registration required[10] Yes 2023/06/23
Corpus of Regional African American Language (CORAAL) Corpus, Untagged 1e6 2e2 American, African 1968 2017 Spoken, Interviews Audio General, Sociolinguistic interviews[11] None None Yes 2022/10/31
Google Trends Trends (Various, Multilingual) 2004 Present Written, Computer, Internet Searches Written General None None Yes 2022/10/31
Google Ngrams Trends 2e7[12][13] 2e12[12][14][13] (Various, Multilingual)[15] 1470[13] Present Multimedia Written General None None Yes 2022/10/31
Google Books Library 4e7 (Various, Multilingual) 1400 (approx.) Present Multimedia Written General None None Yes 2022/10/31
Google Scholar Library 1e8[16] (Various, Multilingual) 1700 (approx.) Present Written, Periodicals, Journals Written Nonfiction, Academic; Law None None Yes 2023/01/19
Corpus of Middle English Prose and Verse Corpus, Untagged 3e2 Middle 1000 1500 Written, Books, Print Written General, esp. Nonfiction None None Yes 2022/10/31
Michigan Corpus of Upper-level Student Papers (MICUSP) Corpus, Untagged 3e6 8e2 (Various, ESL[17]) 2002 2009 Written, College Work Written Nonfiction, Academic Restrictions on commercial use[18] None Yes 2022/12/28
Michigan Corpus of Academic Spoken English (MiCASE)[19][20] Corpus, Untagged 2e6 2e2 American (mostly) 1998 2001 Spoken, Formal, Speeches Audio,[19] Written Nonfiction, Academic Restrictions on commercial use[21] None Yes 2022/10/31
British Academic Spoken English Corpus (BASE) Corpus, Tagged 1e6 2e2 British 1998 2005 Spoken, Formal, Speeches Written Nonfiction, Academic None None Yes 2022/11/02
British Academic Written English Corpus (BAWE) Corpus, Tagged 7e6 3e3 British 2000 2007 Written, College Work Written Nonfiction, Academic None None Yes 2022/11/02
Public Papers of the Presidents of the United States Library 1e2 American 1938 2002 Multimedia Written Nonfiction, Politics None None Yes 2023/06/17
Google Groups Social Media (Various) 1981 2024 Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None Yes 2024/03/20
UsenetArchives.com[22] Social Media 7e8 (Various) 1981[23] Present? Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None Yes 2024/03/20
Narkive Social Media 3e8 (Various) 1990 (approx.) Present Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None Yes 2024/03/20
Europeana Library 2e7 (Various, Multilingual) 0400 (approx.) Present Multimedia Multimedia General None None Yes 2022/10/31
Internet Archive Library 6e7 (Various, Multilingual) Present Multimedia Multimedia General None Free registration required Yes 2022/10/31
Eighteenth Century Collections Online (ECCO) TCP Corpus, Untagged 2e3 (Various, Multilingual) 1701 1800 Written, Books, Print Written General None None Yes 2022/10/31
Old Bailey Corpus (OBC) 2.0 Corpus, Tagged 4e7 1e6 British (various dialects) 1720 1913 Spoken, Formal, Court Proceedings Written Nonfiction, Law, Courts, Criminal None Free registration required Yes 2022/10/31
Old Bailey Proceedings Online Corpus, Untagged 1e8 British (various dialects) 1674 1913 Spoken, Formal, Court Proceedings Written Nonfiction, Law, Courts, Criminal None None Yes 2022/10/31
Royal Society Corpus (RSC) 6.0.1 Open Corpus, Tagged 8e7 2e4 British 1665 1920 Written, Periodicals, Journals, Print Written Nonfiction, Academic None Yes[24] Yes 2025/10/12
Royal Society Corpus (RSC) 6.0.4 Open with Topics Corpus, Tagged 3e8 2e4 British 1665 1920 Written, Periodicals, Journals, Print Written Nonfiction, Academic None Free registration required Yes 2022/10/31
X (formerly Twitter) Social Media 3e12[25] (Various, Multilingual) 2005 Present Written, Computer, Social Media, Twitter Written General, esp. Nonfiction None None Yes 2022/10/31
SocialGrep (Reddit) Corpora[26] Corpus, Untagged 9e7 (Various) 2005 (approx.) Present? Written, Computer, Social Media, Reddit Written General, esp. Nonfiction None None No 2025/02/20
Europarl 7 Sample, English Corpus, Tagged 2e7 8e3 International/ELF[27] 2007 2011 Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None Yes 2022/11/01
Europarl 3, English Corpus, Tagged 2e7 7e2 International/ELF[27] 1996 2006 Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None Free registration required Yes 2022/11/01
TARA Corpus, Tagged 9e5 2e4 British 2006 (approx.) 2006 (approx.) Written, Periodicals, Newspapers, Print Written Nonfiction, News None Free registration required Yes 2022/11/01
British National Corpus (BNC) Corpus, Tagged 1e8 4e3 British 1960 1993 Multimedia Written General None Free registration required Yes 2022/11/01
British National Corpus (BNC) Sampler Corpus, Tagged 2e6 2e2 British 1975 1993 Multimedia Written General None Free registration required Yes 2022/11/01
Phrases in English (BNC)[28][29] Corpus, Tagged 1e8 4e3 British 1960 1993 Multimedia Written General None None Yes 2023/02/12
Just The Word (BNC)[28] Corpus, Tagged 1e8 4e3 British 1960 1993 Multimedia Written General None None Yes 2023/02/12
British English 2006 (BE06) Corpus, Tagged 1e6 5e2 British 2003 2008 Written Written General None Free registration required Yes 2022/11/01
American English 2006 (AME06) Corpus, Tagged 1e6 5e2 American 2006 (approx.) 2006 (approx.) Written Written General None Free registration required Yes 2022/11/22
Hansard Corpus (British Parliament) Corpus, Tagged 2e9 8e6 British 1803 2005 Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None Free registration required Yes 2022/11/01
British Parliament Hansard Library British 1800 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None Yes 2022/11/01
Australian Parliament Hansard Library Australian 1901 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None Yes 2022/11/01
Canadian House of Commons Hansard Library Canadian 2002 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None Yes 2022/11/01
New Zealand Parliament Hansard Library New Zealand 1854 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None Yes 2022/11/01
GovInfo (United States) Library American 1793 Present Multimedia Written Nonfiction, Law None None Yes 2022/11/01
Transgender Usenet Archive (TUA) Corpus, Untagged 4e5 (Various) 1994 2013 Written, Computer, Social Media, Usenet Written General, Transgender Topics None None Yes 2022/11/01
Science Forums[30] Social Media - 1e5 (Various) 1992 2014 Written, Computer, Social Media, BBS Written Nonfiction, Science None None No 2025/09/16
TextFiles.com Library (Various) 1980 (approx.) 1995 (approx.) Multimedia Multimedia General, esp. Nonfiction, Technology None None Yes 2022/11/01
LDS General Conference Corpus Corpus, Tagged 3e7 1e4 American 1851 Present Spoken, Formal, Speeches Written Religious, Latter Day Saints None None Yes 2022/11/01
FidoNet Echomail Archive Social Media (Various) 1990 (approx.) 2016 (approx.) Written, Computer, Social Media, FidoNet Written General, esp. Nonfiction, Technology None None Yes 2022/11/01
FidoNet HolySmoke Archive[31] Library 4e5 (Various) 1993 2004 Written, Computer, Social Media, FidoNet Written Nonfiction, Religion None None No 2025/10/12
Dúchas Project Library 2e6 Irish 1900 (approx.) 1940 (approx.) Multimedia Written Fiction, Folklore None None Yes 2022/11/02
Freiburg-Brown Corpus of American English (FROWN) Corpus, Tagged 1e6 5e2 American 1992 1992 Written, Print Written General None Free registration required Yes 2022/11/02
Brown Corpus Family Corpus, Tagged 1e6 2e3 Written, Print Written General None Free registration required Yes 2022/11/02
Brown Family (C8 tags) Corpus, Tagged 6e6 2e3 (Various) 1931 1991 Written, Print Written General None Free registration required Yes 2022/11/02
Brown Corpus[32] Corpus, Tagged 1e6 1e3 American 1961 1961 Written, Print Written General None None Yes 2022/11/02
Corpus of English Dialogues Corpus, Tagged 1e6 2e2 British(?) 1560 1760 Multimedia Written General, Dialogues None Free registration required Yes 2022/11/02
Florence Early English Newspapers (FEEN) Corpus, Tagged 3e5 -[33] British(?) 1620 1649 Written, Periodicals, Newspapers, Print Written Nonfiction, News None None Yes 2023/03/27
Transhistorical Corpus of Written English Corpus, Tagged 5e5 8e2 (Various) 1405 2019 Written Written General None None Yes 2022/11/02
Linguistic Landscape Corpus Corpus, Tagged 5e6 6e2 (Various) 1997 2018 Written Written Nonfiction, Academic None Free registration required Yes 2022/11/02
ICNALE Online[34] Corpus, Tagged 4e6 2e4 (Various, ESL[17])[35] 2007 (approx.) 2022 (approx.) Multimedia, College Work Multimedia Nonfiction, Academic None None Yes 2022/11/02
European Football Championship Interpreting Corpus (EFCIC) Corpus, Tagged 1e4 1e1 2020 2020 Spoken, Entertainment, Interpretation, Interview Written Nonfiction, Sports None None Yes 2022/11/02
UkWac Complete[36] Corpus, Tagged 2e9 3e6 British[3] 2005 (approx.) 2007 (approx.) Written, Computer, Internet Written General None None Yes 2022/11/02
UkWac Small[36] Corpus, Tagged 8e7 1e5 British[3] 2005 (approx.) 2007 (approx.) Written, Computer, Internet Written General None None Yes 2022/11/02
Postcard Archive @ Florida State University[37] Library 3e3[38] (Various) 1829 (approx.) 2016 (approx.) Written, Postcards Written Nonfiction, Postcards None None Yes 2022/11/06
PlayPhrase.me Corpus, Tagged 8e6[39] (Various) 1970 (approx.) Present? Spoken, Entertainment, Movies Audio-visual Fiction, Movies None None Yes 2022/11/07
European Union DGT-UD: English Corpus, Tagged 1e8 5e4 International/ELF[27] 1948 (approx.) 2016 Written, Legislative Acts Written Nonfiction, Law, Legislatures None None Yes 2022/11/16
Opus-MontenegrinSubs 1.0: English Corpus, Tagged 5e5 2e2 (Various) 2007 2013 Spoken, Entertainment, Television Written Fiction, Television None None Yes 2022/11/16
Archive of Our Own (AO3) Library 1e7 (Various) 2007 Present Written, Computer, Internet Written Fiction, Short Stories, Fan Works[40] None None Yes 2022/11/22
SCP Foundation Library 2e3 (Various) 2007 Present Written, Computer, Internet Written Fiction, Short Stories, Sci-Fi[40] None None Yes 2022/11/22
NEWS-GB (British newspapers)[41] Corpus, Tagged 2e8 British 2004 (approx.) 2004 (approx.) Written, Print Written Nonfiction, News None None No 2025/10/12
INTERNET-EN[41] Corpus, Tagged 2e8 5e4 (Various) 2006 (approx.) 2006 (approx.) Written, Computer, Internet Written General None None No 2025/10/12
BLOGS-EN (Political blogs)[41] Corpus, Tagged 5e8 (Various) 2008 (approx.) 2008 (approx.) Written, Computer, Internet Written Nonfiction, Politics None None No 2025/10/12
Manually Annotated Sub-Corpus (MASC) Library[42] 5e5 4e2 American 1990 (approx.) 2010 (approx.) Multimedia Written General None None Yes 2022/11/23
Open American National Corpus (OANC) Library[43] 2e7 9e3 American 1990 2005 (approx.) Multimedia Written General, esp. Nonfiction None None Yes 2025/09/22
Lancaster Newsbooks Corpus (1654 part) Corpus, Tagged 9e5 2e2 British 1653 1654 Written, Periodicals, Newspaper, Print Written Nonfiction, News None Free registration required Yes 2022/11/23
The Mail Archive Library 2e8 (Various) 1990 Present Written, Computer, Mailing List Written Nonfiction, esp. Coding and Computers None None Yes 2022/11/26
CataList (LISTSERV catalog)[44] Library -[45] (Various) 1990 (approx.) Present Written, Computer, Mailing List Written Nonfiction None None Yes 2022/11/28
United Nations Digital Library Library 7e5[46] (Various, International/ELF[27]) 1875[47] Present Multimedia Multimedia Nonfiction, Politics None None Yes 2022/11/29
Genius.com Library (Various, Multilingual) 1900 (approx.) Present Spoken, Entertainment, Music Written General, Music None None Yes 2022/12/06
Chronicling America Library American 1777 1963 Written, Periodicals, Newspaper, Print Written Nonfiction, News None None Yes 2022/12/06
Library of Congress Library 3e6[48] (Various, Multilingual) 1470 (approx.) Present Multimedia Multimedia General None None Yes 2022/12/06
World Radio History Library 1e5[49] (Various, Multilingual)[50] 1900 (approx.) Present Written, Periodicals, Magazines, Print Written Nonfiction, Radio; Television; Music None None Yes 2022/12/06
Google News Newspapers Archive Library 6e6[51][52] (Various, Multilingual)[50] 1738 (approx.) 2009 Written, Periodicals, Magazines, Print Written Nonfiction, News None None Yes 2022/12/14
VESPA[53] Corpus 2e6 9e2 International/ESL[17] 2008 (approx.) 2008 (approx.) Written, College Work Written Nonfiction, Academic Restriction to non-profit educational use only[54] Free registration required Yes 2022/12/28
I-EN (Internet English Corpus)[41] Corpus, Tagged 2e8 (Various) 2005 2005 Written, Computer, Internet Written Nonfiction, News? None None No 2025/10/12
I-EN-CC (Internet English Creative Commons Corpus)[41] Corpus, Tagged 2e8 (Various) 2005 (approx.) 2005 (approx.) Written, Computer, Internet Written Nonfiction, News? None None No 2025/10/12
Springfield! Springfield! Library 2e5 (Various) 1910 (approx.) Present Spoken, Entertainment, Movies and Television Written General None None Yes 2023/03/27
Issuu Library 5e7[55] (Various, Multilingual) 2000 (approx.)[56] Present Written, Periodicals, Magazines Written Nonfiction None Free registration required for full access[57] Yes 2023/01/19
Smithsonian Transcription Center Library -[58] American 1400 (approx.)[59] Present Written Written Nonfiction None None Yes 2023/01/22
Voices Remembering Slavery: Freed People Tell Their Stories Library 7e4[51][60] 3e1[61] American, African 1932 1975[62] Spoken, Interviews Audio General, Anthropological interviews None None Yes 2023/01/28
Born in Slavery: Slave Narratives from the Federal Writers' Project Library 2e3[63] American, African[64] 1936 1938 Written Written Nonfiction, Biographies[64][65] None None Yes 2023/01/28
Corpus of Historical American English (COHA)[66] Corpus, Tagged 5e8 1e5 American 1820 2019 Multimedia Written General None Free registration required Yes 2023/02/14
The TV Corpus Corpus, Tagged 3e8 8e4 (Various)[67] 1950 2017 Spoken, Entertainment, Television Written General None Free registration required Yes 2023/03/27
The Movie Corpus Corpus, Tagged 2e8 3e4 (Various)[67] 1930 2018 Spoken, Entertainment, Movies Written General None Free registration required Yes 2023/03/27
Corpus of American Soap Operas (CASO) Corpus, Tagged 1e8 2e4 American 2001 2012 Spoken, Entertainment, Movies Written Fiction, Television, Soap Operas None Free registration required Yes 2023/03/27
Corpus of US Supreme Court Opinions Corpus, Tagged 1e8 3e4 American 1790 (approx.) 2019 (approx.)[68] Written Written Nonfiction, Law, Courts, Constitutional None Free registration required Yes 2023/02/16
TIME Magazine Corpus Corpus, Tagged 1e8 3e5[69] American 1923 2006 Written, Periodicals, Magazines, Print Written Nonfiction, News None Free registration required Yes 2023/02/16
Corpus of Online Registers of English (CORE) Corpus, Tagged 5e7 5e4 (Various)[70] 2013 (approx.) 2016 (approx.) Written, Computer, Internet Written General None Free registration required Yes 2023/02/16
Strathy Corpus of Canadian English Corpus, Tagged 5e7 1e3 Canadian 1921[71] 2011[71] Multimedia Written General None Free registration required Yes 2023/02/16
Biodiversity Heritage Library Library 3e5[72] (Various, Multilingual) 1400 (approx.) Present Written Written Nonfiction, Academic, Biology None None Yes 2023/02/23
African American Writers, 1892-1912 (AAW) Corpus, Untagged 5e5 8e0 American, African 1892 1912 Written Written General None None Yes 2023/03/15
Children's Literature (ChiLit) Corpus, Untagged 4e6 7e1 (Unclear)[73] ? ? Written Written Fiction, Children None None Yes 2023/03/15
The Philadelphia Neighborhood Corpus of LING560 Studies (PNC)[74] Corpus 2e6 3e2 American 1972 Present?[75] Spoken, Interviews Written (Unclear) Restrictions on excerpt size[76] Yes[77] Yes 2023/03/15
British Pathé[78] Library 2e5 British 1896 1984 Spoken, Formal Audio-visual Nonfiction, News None? None Yes 2023/04/06
Newspapers.com Library 8e5 (Various)[79] 1690 Present Written, Periodicals, Newspaper, Print Written Nonfiction, News None Wikipedia Library access available. Paid subscription otherwise required. Free trials are available. Yes 2023/04/30
NewspaperArchive Library 2e7[52][80] (Various, Multilingual)[81] 1607 Present Written, Periodicals, Newspaper, Print Written Nonfiction, News None Wikipedia Library access available. Paid subscription otherwise required. Free trials are available. Yes 2024/06/16
PressReader Library ? (Various) ? Present Written, Periodicals, Newspaper, Print Written Nonfiction, News None Some snippets freely visible, most content requires paid subscription. Free trials are available. Yes 2023/05/31
ProQuest Library ? (Various) ? Present Written, Periodicals, Newspaper, Print Written Nonfiction, News None Wikipedia Library access available. Some snippets freely visible, most content requires paid subscription. Free trials are available. Yes 2023/05/31
Welsh Newspapers Library ?[82] Welsh,[83] Multilingual 1804 1919 Written, Periodicals, Newspaper, Print Written Nonfiction, News None? None Yes 2023/08/08
Welsh Journals Library ?[84] Welsh, Multilingual 1735 2007 Written, Periodicals, Print Written General None? None Yes 2023/08/08
Crime and Punishment Database Library English?[85] 1730 1830 Written, Formal, Court Records Written Nonfiction, Law, Courts, Criminal None? None Yes 2023/08/08
American Archive of Public Broadcasting Library 1e5[86] (Various, Multilingual)[50] 1931[87] Present Spoken, esp. Formal Audio-visual General, esp. Nonfiction None None, additional content available on-site at GBH or the Library of Congress. Yes 2023/11/01
Buckeye Speech Corpus Corpus, Tagged 3e6 4e2 American 1999 2000 Spoken, Interviews Audio, Written General, Sociolinguistic interviews[88] Restriction to educational and research use only Free registration required Yes 2024/02/19
Westminster Detective Library Library 5e7[51][89] 2e4[89][90] American 1818 1891 Written, Periodicals, Newspapers, Print[91] Written Fiction, Short Stories, Detective Stories None None Yes 2024/02/26
Usenet Archive (UTZOO Wiseman/Zach Barth) Social Media 2e6[92] (Various) 1981 1991 Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None Yes 2024/03/20
Searchids.com[93][94] Library[95] 7e7[96] 2e7[97] (Various) 2006 2006 Written, Computer, Internet Searches Written General Restriction to non-commercial research use only[98] None No 2025/09/22
Freiburg Corpus of English Dialects (FRED) - Interactive Database[99] Corpus, Untagged[100] 1e6[101] 1e2[101] British (various dialects) 1970[101] 2000[101][102] Spoken, Interviews Audio, Written Nonfiction, History, Oral History[101] None None Yes 2024/05/09
MTSamples.com[103][104] Library 3e6[105] 5e3[106] (Various?) 2007[107] 2023 Written, Computer Written Nonfiction, Medicine Requires attribution[108] None Yes 2024/12/16
Evans Early American Imprints TCP (Evans-TCP) Corpus, Untagged ? 5e3 American 1640[109] 1800[110] Written, Print Written General None[111] None Yes 2025/10/12

Non-English corpora

Non-English corpora table
Translated Name Native name Language Language Code Resource Type Size in words[1] Size in texts[1] Start year End year Original medium Available medium Genre Re-use restrictions Access restrictions Date of entry update
Czech National Corpus[112] Český národní korpus Czech cs Corpus, Tagged ? ? ? ? ? Multimedia General None? None 2024/06/18
Polish National Corpus[112] Narodowy Korpus Języka Polskiego Polish pl Corpus, Tagged 2e9 ? ? ? ? Written General None? None 2023/02/12
Russian National Corpus[112] Национальный корпус русского языка Russian ru Corpus, Tagged 2e9[113] 5e6[113] 1100 Present Multimedia Written General Restriction to non-commercial linguistic use only[114] None 2023/02/12
Turkish National Corpus[115] Türkçe Ulusal Derlemi Turkish tr Corpus, Tagged? 5e7 6e3 1990 2009 Written[116] Written General Restriction to educational use only[117] Free registration required 2023/02/12
Bruno Corpus[118] Spanish es Corpus, Untagged 1e6 5e2 ? 2010 (approx.) Written Written? General None None 2023/02/12
Braun Corpus[118] German de Corpus, Untagged 1e6 5e2 ? 2008 (approx.) Written Written? General None None 2023/02/12
Corpus of Spanish: Genre/Historical Corpus del Español: Genre/Historical Spanish es Corpus, Tagged 1e8 1e4 1200 (approx.) 2000 (approx.) Multimedia, esp. Written Written General None Free registration required 2023/03/24
Corpus of Spanish: Web/Dialects Corpus del Español: Web/Dialects Spanish[119][120] es Corpus, Tagged 2e9 2e6 2010 (approx.) 2014 Written, Computer, Internet Written General None Free registration required 2023/03/24
Corpus of Spanish: NOW Corpus del Español: NOW Spanish[119][121] es Corpus, Tagged 7e9 1e7 2012 2019 Written, Computer, Internet Written Nonfiction, News None Free registration required 2023/03/24
21st Century Corpus of Spanish[122] Corpus del Español del Siglo XXI Spanish es Corpus, Tagged 4e8 4e5 2001 2022 Multimedia, esp. Written Multimedia, esp. Written General None? None 2023/03/24
Lemko and Karpatska Rus’ Archive[123] Carpathian Rusyn rue Library 2e3 1928 1989 Written, Periodicals, Newspaper, Print Written Nonfiction, News None? None 2024/06/18
Spauda.org[123] Lithuanian lt Library ? 1886 2015 Written, Periodicals, Newspaper, Print Written Nonfiction, News None? None 2023/04/04
Gallica Gallica French fr Library 1e7 ? ? Written, Periodicals, Newspaper, Print Written General, esp. Nonfiction, News None None 2023/05/31
RetroNews RetroNews French fr Library 3e6 (at least) 1631 1951 Written, Periodicals, Newspaper, Print Written Nonfiction, News None None 2023/05/31
The Database of Early Cantonese Bible 早期粵語聖經資料庫 Cantonese yue Corpus, Untagged? 7e0 1863 1927 Written, Religious Text Written Religious, Christianity, Bible Passages None? None 2023/12/10
The Database of Early Christian Literature 早期基督教文學資料庫 Cantonese yue Corpus, Untagged? 5e0 1845 (approx.) 1906 Written, Books, Print Written Religious, Christianity None? None 2023/12/10
A Comprehensive Edition of Tocharian Manuscripts Tocharian B, Tocharian A txb, xto Corpus, Tagged 2e5[124] 2e3[125] 0500 (approx.)[126] 700 (approx.)[126] Written Written General, esp. Religious, Buddhism None? None 2024/05/05
Manx Corpus Search Manx gv Corpus, Untagged 2e6 7e2 1610 2012 Multimedia, esp. Written Written General? None? None 2025/03/10
Comprehensive Aramaic Lexicon Project Aramaic arc Corpus, Tagged ? ? BCE 0900 (approx.) 1300 (approx.) Written Written ? None? None 2025/03/24
Corpus of Portuguese Corpus do Português Portuguese, Old Galician-Portuguese pt, roa-opt Corpus, Tagged 2e9 4e6 1214[127] 2019 Written, esp. Computer, Internet Written General None? Free registration required 2025/08/10
Computerized Reference of the Medieval Galician Language (Corpus Xelmírez) Tesouro Medieval Informatizado da Lingua Galega (Corpus Xelmírez) Old Galician-Portuguese roa-opt Corpus, Tagged 5e7[128] 2e5[128] 0787[128] 1600 (approx.)[128] Written Written General Restricted to "research, teaching, and general purposes", for-profit use prohibited without obtaining explicit permission[129] None 2025/08/10
Electronic Corpus of Pre-Islamic Old Turkic Texts Vorislamische Alttürkische Texte: Elektronisches Corpus Old Uyghur oui Corpus, Untagged ? 5e2 0880 (approx.)[11] 1150 (approx.)[15] Written Written Religious[18] None? None 2025/09/10
Turkic Inscriptions Түрік Бітік Old Turkic otk Corpus, Untagged ? 3e3[5] 0730 (approx.)[21] 0900 (approx.) Written Written Nonfiction, Politics and History Normal copyright restrictions apply[9] None 2025/09/10
Diachronic Corpus of Spanish Corpus Diacrónico del Español Spanish, Old Spanish es, osp Corpus, Untagged 2e10[130] 1e5 (approx.)[131] 0759[132] 1975 Written Written General None? None 2025/10/08
Reference Corpus of Contemporary Spanish Corpus de Referencia del Español Actual Spanish es Corpus, Untagged 2e10[130] 7e4 (approx.)[131] 1975[133] 1999[133] Written[134] Written General, esp. Nonfiction None? None 2025/10/08
Vocabulary of Medieval Commerce Vocabulario de Comercio Medieval Old Spanish, Early Modern Spanish osp, es-ear Corpus, Untagged 2e4[135] 6e4[135] 9th century[135] 16th century[135] Written Written Commerce CC BY-NC-ND 2.5[136] None 2025/10/14
Cantigas Universe Universo Cantigas Old Galician-Portuguese roa-opt Corpus, Tagged ? 1683 ? 15th century Written Written Songbooks CC BY-NC-SA[137] None 2025/10/14
Historical and chronological vocabulary of Medieval Portuguese Vocabulário histórico-cronológico do Português Medieval Old Galician-Portuguese roa-opt Corpus, Tagged 17e3[138] ? ? 15th century Written Written General None None 2025/10/14
Old Spanish Textual Archive Old Spanish, Old Navarro-Aragonese, Old Leonese osp, roa-ona, roa-ole Corpus, Untagged 4e7[139] 4e2[139] 1085[140][141] 1660 (approx.)[140][142] Written Written General None None 2025/10/14
Thesaurus Linguae Aegyptiae (Repository of the Egyptian Language) Egyptian egy Corpus, Tagged 2e6[143] 2e4[143] BCE 3000[144] 0300[144] Written Written General Use of the website is restricted to academic research purposes less than 10 webpages. Releases of the raw data behind the website are available under CC BY-SA 4.0 Int. None 2025/11/14

Other lists and databases

Other lists and databases table
Name Language Language Code Size in corpora[1] Active Date of entry update
Corpus Resource Database (CoRD) Translingual, esp. English mul, en 1e2 Yes 2023/02/13
Czech National Corpus KonText interface Translingual mul 1e3[145] Yes 2023/02/13
English-Corpora.org English en 2e1 Yes 2023/02/13
Leipzig Corpora Collection Translingual mul 1e3 Yes 2023/02/13
Lextutor Web Concordance English English en 5e1 Yes 2023/02/13
Lextutor Web Concordance French French fr 2e1 Yes 2023/02/13
LINDAT/CLARIAH-CZ Corpora Translingual mul 7e2 Yes 2023/02/13
Linguistic Data Consortium (LDC) Translingual mul 1e3 Yes 2023/02/13
Martin Weisser's On-line Corpora of English Translingual, esp. English mul, en 2e1 Yes 2023/02/13
SketchEngine Translingual mul 2e1[146] Yes 2023/02/13
University of Warwick list of free online corpora English en 2e1 Yes 2023/02/13
University of Edinburgh Scots and Scottish English corpora Scots, English sco, en 3e1 Yes 2023/02/13
SHACHI Database of Language Resources[147] Translingual mul 2e3 No 2024/04/22
CLARIN.SI Online Concordancers Translingual, esp. Slovene mul, sl 2e2 Yes 2023/02/26
CLARIN.SI Corpus Repository Translingual, esp. Slovene mul, sl 2e2 Yes 2023/02/26
CLARINO Corpuscle Translingual, esp. Norwegian mul, no 6e1 Yes 2023/02/26
CLARINO Corpus Repository Translingual, esp. Norwegian mul, no 4e1 Yes 2023/02/26
Online Resources for African American Language (ORAAL), external data sources English en 1e1 Yes 2023/03/15
Online Resources for African American Language (ORAAL), supplements English en 4e0 Yes 2024/05/08
Corpus Linguistics in Context (CLiC) English en 5e0 Yes 2023/03/15
The Spanish Corpus Spanish es 4e0 Yes 2023/03/24
Pennsylvania State University scripts and transcripts of popular film, TV, and sports English[148] en 2e1 Yes 2023/04/02
/r/Screenwriting Guide to Finding Scripts Online English[148] en 2e1 Yes 2023/04/02
BBC.com[149] Translingual mul 3e1 Yes 2024/04/22
Corpus4U.org[150][151] English, Chinese en, zh 2e2 Yes 2023/06/17
Beijing Foreign Studies University CQPweb[152] Translingual mul 2e2 Yes 2023/06/17
Lancaster University CQPweb Translingual, esp. English mul, en 1e2 Yes 2023/06/17
Hong Kong University of Science and Technology Resources for Chinese Linguistics Chinese, esp. Cantonese zh, yue 3e0 Yes 2023/12/10
PolyU Corpus of Spoken Chinese, links to other corpora and databases Translingual, esp. Chinese mul, zh 1e2 Yes 2024/01/13
Duke University list of collections of African American oral histories English en 1e1 Yes 2024/05/08
OPUS Open Parallel Corpus Collection Translingual mul 1e3 Yes 2024/06/18
OPUS Multilingual Search Interface Translingual mul 4e2 Yes 2024/06/18
Stanford Large Network Dataset Collection Translingual mul 1e1 Yes 2025/03/03
Corpus-Based Language Studies' Corpus Survey Translingual mul 1e2 Yes 2025/08/18

See also


Notes

  1. 1.0 1.1 1.2 1.3 1.4 Sizes are represented using E notation, a type of scientific notation. "5e3" for example represents 5×10³ or a 5 with three 0's after it, 5,000.
  2. 2.0 2.1 2.2 Specifically Australia, Bangladesh, Canada, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, Kenya, Malaysia, New Zealand, Nigeria, Pakistan, Philippines, Singapore, South Africa, Sri Lanka, Tanzania, the United States
  3. 3.0 3.1 3.2 3.3 3.4 3.5 Note that dialect information in internet-derived corpora tends to be somewhat inaccurate because of accidental inclusion of texts in other dialects.
  4. ^ Specifically Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States
  5. 5.0 5.1 Note that this corpus is a sub-corpus of the NOW corpus.
  6. ^ Particularly speeches and interviews
  7. ^ As of 2024-05-10, the latest change log entry was from 2023/12/26.
  8. ^ Most after 2005
  9. 9.0 9.1 Most before 2017
  10. ^ An account is required to use the site's built in search function. Nonetheless, the forum threads can still be viewed and navigated without hindrance when logged out.
  11. 11.0 11.1 Tyler Kendall, Charlie Farrington (June 2023), “CORAAL User Guide”, in Corpus of Regional African American Language[1], retrieved 9 May 2024:
    The core components of CORAAL focus on AAL in Washington DC, [] CORAAL:DC [] is comprised of over 100 sociolinguistic interviews [] In addition to CORAAL:DC, CORAAL includes several smaller components to provide regional breadth. As of July 2021, there are six supplemental components: CORAAL:ATL, which includes 14 sociolinguistic interviews from speakers living in Atlanta, Georgia; CORAAL:DTA, which includes 40 sociolinguistic interviews from the Detroit Dialect Study collected in 1966; CORAAL:LES, comprised of 10 sociolinguistic interviews of speakers from the Lower East Side of New York City; CORAAL:PRV, which includes 15 sociolinguistic interviews from the town of Princeville, a rural African American community in central North Carolina; CORAAL:ROC, which includes 14 sociolinguistic interviews from Rochester, a city in Western Upstate New York; and CORAAL:VLD, which includes 12 speakers from Valdosta, a small city in South Georgia. [] Interviews are sociolinguistic styled interviews on topics such as life in Valdosta, personal histories, and high school sports.
  12. 12.0 12.1 Note that this number only represents the size of the English portion of the 2020 release of the corpus.
  13. 13.0 13.1 13.2 For specific details, see the "Total counts for Dependencies" file hosted in the Dependencies Downloads Index for the English part of the 2020 release which contains word and book counts for each of the years in the corpus as described on the main Ngram Viewer Exports page.
  14. ^ Anna L. Shparberg (July 2021), “Google Books Ngram Viewer”, in The Charleston Advisor, volume 23, number 1, Annual Reviews, →DOI, pages 16–19
  15. 15.0 15.1 Note that "British English" and "American English" sub-corpora of Google Ngram are sometimes very inaccurate/misleading because of the accidental inclusion of texts in other dialects. Consider color vs colour and airplane vs aeroplane in the "British English" corpus. In both cases, Google Ngram shows the forms as being roughly equally common from 2000-2019, which is blatantly untrue.
  16. ^ Madian Khabsa; C. Lee Giles (9 May 2014), “The Number of Scholarly Documents on the Public Web”, in PLOS ONE, volume 9, number 5, →DOI, →ISSN
  17. 17.0 17.1 17.2 "English as a Second Language"
  18. 18.0 18.1 The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
  19. 19.0 19.1 Audio files are available separately on TalkBank.org.
  20. ^ The corpora manual can be accessed online.
  21. 21.0 21.1 The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
  22. ^ As of 2024-05-10, the search function seems to be very slow or entirely broken. The groups and discussion threads are still manually navigable, though. The website can also still be searched using Google.
  23. ^ The website incorporates the UTZOO Wiseman archive.
  24. ^ Specific account permissions required. The corpus used to be publicly available, but this changed some time between late 2022 and late 2025.
  25. ^ Based on back-of-the-napkin extrapolation of data at the Internet Live Stats website.
  26. ^ The corpus appears to have gone down starting around mid 2024 based on archived crawls in the Wayback Machine and remains down as of 2025-02-20.
  27. 27.0 27.1 27.2 27.3 "English as a Lingua Franca"
  28. 28.0 28.1 The website is composed of a series of search tools, including n-gram and concordance search, based on the BNC.
  29. ^ Selection of different tools can be done through the "Grams" menu in the top left of the page.
  30. ^ Based on captures of the website stored in the Wayback Machine, it looks like the website went down some time between 2024-03-03 and 2024-06-18.
  31. ^ Apparently became unavailable sometime between late 2022 and late 2025.
  32. ^ Full name "Brown University Standard Corpus of Present-Day American English"
  33. ^ The corpus is made of six "texts", but looking at their descriptions reveals that each one is actually a compilation of multiple texts. For example, "feen4" is described as "7 separate titles". Overall, the exact number of independent texts included is unclear.
  34. ^ Full name "International Corpus Network of Asian Learners of English, Online Version"
  35. ^ Shin Ishikawa (12 April 2022), “The ICNALE [] ”, in language.sakura.ne.jp[2], SAKURA Internet, archived from the original on 14 August 2022:The ICNALE includes [] speeches and essays produced by college students [] in ten countries/ regions in Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/ Malaysia, Taiwan, and Thailand) as well as English native speakers.
  36. 36.0 36.1 The name is based on an abbreviation of the phrase "UK web as corpus".
  37. ^ To search the collection, select either "User-Added Text (Back)" or "User-Added Text (Front)" under "Narrow by Specific Fields", then select "contains" from the drop down just to the right and then enter the search term next to that and hit enter. Note that the overall quality and style of the data presented in the collection varies considerably.
  38. ^ As of 2023/03/07, 2,574 cards have the field "Writing on Card (Yes or No)?" marked as "yes". Nonetheless, there are cards that do have handwriting on them and have the field marked "no".
  39. ^ Approximately, the site actually lists its size as "7,600,186 phrases" (emphasis added).
  40. 40.0 40.1 Though not exclusively short stories, the format dominates the library.
  41. 41.0 41.1 41.2 41.3 41.4 Corpora, such as this one, which used the IntelliText interface went offline around December 2023.
  42. ^ Although MASC is technically a corpus, it is only directly available through a web browser as a library. A complete copy of MASC as a corpus can be downloaded, though, and then processed with another application.
  43. ^ Although OANC is technically a corpus, it is only directly available by being downloaded and then processed with another application.
  44. ^ Many, if not most, of the LISTSERVs in the catalog do not have publicly accessible archives.
  45. ^ The catalog describes itself (as of 28 Nov 2022) as containing 58,100 public lists, each of which contains a number of messages.
  46. ^ Approximately, not all items cataloged in the library are available online. In particular, it seems none of the around 300,000 speeches cataloged are available online.
  47. ^ Most after 1945
  48. ^ Number of items which are both available online and have their language marked as "English".
  49. ^ Approximately, based on a search of the collection for the basic words "a" and "the".
  50. 50.0 50.1 50.2 The number of non-English items is small.
  51. 51.0 51.1 51.2 Approximately, based on a statistical calculation.
  52. 52.0 52.1 Note that this number represents the number of newspaper issues in the archive.
  53. ^ Full name "Varieties of English for Specific Purposes dAtabase"
  54. ^ The corpus' end-user license states "Grant of the Product license entitles Licensee to use the Product for non-profit educational and/or linguistic research purposes only. [...] Licensees agree not to lease, sell, or commercially exploit the results of their searches (such as texts, concordances, metadata)." which is incompatible with Wiktionary's license.
  55. ^ Per https://issuu.com/about as of 2023/01/19
  56. ^ Issuu was founded in 2006 but includes some older publications uploaded since then; most of those are from after 1990, if not 2000.
  57. ^ Registration is required in order to turn "safe mode" off/show explicit search results.
  58. ^ Unclear. The collection is organized by "projects" which sometimes correspond to individual texts (such as diaries or funeral programs) and other times correspond to a collection of short texts (such as notes or letters). There were 11,372 projects on 2023/01/22. The length of projects is reported by the number of pages they contain. Using random sampling, it was estimated that the total length of all projects was around 2 million pages on 2023/01/22.
  59. ^ Most after 1800
  60. ^ Note that some transcripts were incomplete when this number was calculated.
  61. ^ Each interview in the collection, regardless of the number of parts it has, is considered one text. According to the "Faces and Voices from the Presentation" article, 26 interviews are in the collection.
  62. ^ Most from before 1950.
  63. ^ See Appendix I: Narratives in the Slave Narrative Collection by State for numerical breakdown by state
  64. 64.0 64.1 Norman R. Yetman (2001), “The Limitations of the Slave Narrative Collection”, in Library of Congress[3], published c. 2017
  65. ^ The narratives are based on interviews, but because of the lack of ground-truth audio recordings and doubts about the accuracy of the published versions of the narratives, they are categorized here as "Nonfiction, Biographies" rather than "General, Anthropological interviews" or similar.
  66. ^ Note that the COHA was updated in 2021.
  67. 67.0 67.1 Specifically "United States/Canada", "United Kingdom/Ireland", "Australia/New Zealand", and "Miscellaneous".
  68. ^ Note that the corpus is listed as going up to the "present", but as of 2023/02/16 the most recent section is the 2010s implying that no opinions from later decades are included.
  69. ^ Note that this number reflects the number of articles in the corpus, not the number of issues of TIME Magazine in the corpus.
  70. ^ Specifically Australia, Canada, New Zealand, the United Kingdom, and the United States
  71. 71.0 71.1 Note that the Queen's University page describing the corpus describes the start year as 1970 and end year as 2010 despite english-corpora.org providing a source spreadsheet which spans the years 1921 to 2011 and its corpus description page showing a time span from the 1920s to 2010s.
  72. ^ On the website, this number is associated with how many "volumes" are available and is listed alongside the number of "titles" (2e5) as well as the number of pages. The exact meaning of the terms "volumes" and "titles" in this context is unclear.
  73. ^ Note that although the corpus does explicitly mention its contents, I have not put in the effort to determine the dialect of each of the included texts.
  74. ^ The website for the corpus is now offline for unclear reasons, but it is presumably still possible to access the corpus by contacting the university.
  75. ^ The corpus' description implies that it is a continually expanding project, but in 2018 the page had not been updated in 5 years (since 2013), which may suggest the project stopped expanding around the same time.
  76. ^ An apparently genuine archived version of the corpus' confidentiality agreement does state "If I need to cite more than one paragraph (300 words) in a publication, I will obtain permission from the Philadelphia Neighborhood Corpus Committee".
  77. ^ An archive of the corpus' home page states that "only members of the research group have access".
  78. ^ Note that searches cover both metadata and transcripts for newsreels simultaneously.
  79. ^ Specifically Australia, Canada, Ireland, New Zealand, Panama, and the United Kingdom.
  80. ^ Derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
  81. ^ About 90% of the publications are based in predominantly Anglophone countries (the United States [12263], Australia [1223], the United Kingdom [811], Canada [525], Ireland [50], New Zealand [19]) while the rest are from a wide variety of countries. Information derived from the "Newspaper Publication" list. A more direct view of the data as collected on 2024/06/16 can be seen in this published Google Sheet.
  82. ^ Issue counts are provided for individual publications, but not for the entire collection. 12.7 million articles in English are available, though, with each issue featuring many articles.
  83. ^ A few publications originate from regions outside Wales, in particular three from London, one from the United States, and one from Argentina. An additional publication has no region listed though its "issuing body note" states "Published in Caernarfon by Thomas Jones", with Caernarfon being in Wales.
  84. ^ Issue counts are provided for individual publications, but not for the entire collection. 363 thousand pages in English are available, though, with each issue featuring many pages.
  85. ^ The English Wikipedia article on the Court of Great Sessions in Wales stated on 2023-08-08 that "[o]f the 217 judges who sat on its benches [...], only 30 were Welshmen". Those involved in keeping the court's records likely had a similar make up and so the database's dialect likely reflects England rather than Wales.
  86. ^ This number represents the number of recordings available online.
  87. ^ This date represents the earliest year specified for any recording in the archive, though that recording does not have audio. It is not immediately clear what the earliest recording with audio is. The earliest audio-only recording is from 1938.
  88. ^ “Buckeye Corpus Information”, in Buckeye Corpus[4], c. 2005, retrieved 9 May 2024:After a significant amount of piloting different protocols for eliciting large amounts of unmonitored speech, a modified sociolinguistic interview format was chosen.
  89. 89.0 89.1 Note that this number was calculated to include the about 25% of work listings which were placeholders on 2024/02/24 but should eventually become full entries and to exclude the about 15% of work listings which were redirects to other listings on the same date.
  90. ^ Based on the fact that the list pages for browsing works display 25 works at a time and there are 78 pages to browse as of 2024/02/26.
  91. ^ Not explicitly stated, but browsing the collection on 2024-02-26 revealed only newspapers being cited as the source of the stories provided.
  92. ^ Samantha Cole (13 October 2020), “2.1 Million of the Oldest Internet Posts Are Now Online for Anyone to Read”, in Vice[5], archived from the original on 13 October 2020:Around 2.1 million posts from between February 1981 and June 1991 from Henry Spencer's UTZOO NetNews Archive are archived at the Usenet Archive for anyone to browse.
  93. ^ There was also a mirror site, Explicit-Id.com.
  94. ^ Based on captures of the website stored in the Wayback Machine, it looks like the website and its mirror went down some time between 2024-08-15 and 2024-09-26.
  95. ^ Though the site does feature a built in search function, it is significantly limited and prone to errors. For this reason, I've classified it as a "library" rather than a "corpus". A complete copy of the original data can be downloaded (see here for details) and processed with another application, though.
  96. ^ From the number of queries multiplied by the average of 3.5 words per query mentioned in the scientific article that originally accompanied the data: Greg Pass; Abdur Chowdhury; Cayley Torgeson (May 2006), “A Picture of Search”, in Proceedings of the First International Conference on Scalable Information Systems, Hong Kong, →DOI, page 2
  97. ^ Number of queries, per the README included with the data
  98. ^ This requirement is incompatible with Wiktionary's license.
  99. ^ Although the database indexes and shows results for the entirety of FRED, audio and transcripts are only viewable for the FRED Sampler (FRED-S) portion. For this reason, most of the information presented in this table is based on the FRED-S, not the complete FRED.
  100. ^ Although tagged transcripts can be downloaded from the database, the search function only allows for the plaintext transcripts to be searched.
  101. 101.0 101.1 101.2 101.3 101.4 Benedikt Szmrecsanyi, Nuria Hernández (2007), “Manual of Information to Accompany the Freiburg Corpus of English Dialects Sampler (“FRED-S”)”, in FreiDok Plus[6], archived from the original on 2 April 2013
  102. ^ Most before 1990.
  103. ^ As of 2024-12-16, the search function seems to be broken. It can still be manually searched using Google.
  104. ^ A scrape of the corpus from about 2018 is also available as a CSV with the registration of a free account.
  105. ^ Based on a word count of the 2018 scrape, which has a similar number of transcription samples in it as the live website.
  106. ^ According to the website, as of the last update on 2023-07-07.
  107. ^ Based on Wayback Machine records.
  108. ^ Per the landing page.
  109. ^ In the form of The VVhole Booke of Psalmes Faithfully Translated Into English Metre
  110. ^ In the form of, for example, A Sermon; Occasioned by the Death of His Excellency George Washington.
  111. ^ The rights and permissions section for the corpus states "These materials are in the public domain. There is no restriction on your use of the transcribed texts."
  112. 112.0 112.1 112.2 Note that multiple sub-corpora and related corpora can be searched on the site.
  113. 113.0 113.1 Note that these numbers represent the size of all the corpora on the site tallied together.
  114. ^ The corpus' terms FAQ states "All data published under [this website] are available exclusively for non-commercial use for research and educational purposes [...] they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon." This requirement is incompatible with Wiktionary's license.
  115. ^ As of 2023-02-12 the query interface was offline.
  116. ^ The corpus' about page states that it is specifically 98% written and 2% spoken.
  117. ^ The corpus' user agreement states "TUD sadece araştırma ve sunum amaçlı kullanıma açıktır ve fikri mülkiyet hakları tümüyle Sağlayıcıya aittir." (roughly, '[the corpus] is available for research and presentation purposes only and the intellectual property rights remain the sole property of the Provider.') This requirement is incompatible with Wiktionary's license.
  118. 118.0 118.1 This corpus was designed to imitate the English-language Brown Corpus.
  119. 119.0 119.1 Specifically Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, El Salvador, Spain, United States, Uruguay, Venezuela.
  120. ^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue is addressed on the website with the conclusion that the "categorization is quite good".
  121. ^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue was addressed for the related Web/Dialects corpus with the conclusion that the "categorization is quite good" so a similar level of quality may exist for this corpus.
  122. ^ Note that the CORPES is currently undergoing continuous revision and so this information may be out of date. To be specific, the information presented is for version 0.99.
  123. 123.0 123.1 Note that the newspapers were published in the United States.
  124. ^ Using the number of "total tokens" listed under "Types of complete words (including unresolved akṣaras)" on 2024-04-22.
  125. ^ Using the number of manuscripts publicly available on 2024-04-22.
  126. 126.0 126.1 George S. Lane; Douglas Q. Adams (16 July 2013), “Tocharian languages”, in Encyclopedia Britannica[7], retrieved 5 May 2024:Documents from AD 500–700
  127. ^ https://www.corpusdoportugues.org/hist-gen/help/cdp.xls
  128. 128.0 128.1 128.2 128.3 https://ilg.usc.gal/tmilg/corpus.html
  129. ^ https://ilg.usc.gal/tmilg/usar.html
  130. 130.0 130.1 https://apps2.rae.es/nomina/SrvltGUIBusTextos?est=1
  131. 131.0 131.1 Based on the total number of documents listed on the general statistics page.
  132. ^ Specifically, "Constitución del monasterio de San Miguel de Pedroso [Cartulario de San Millán de la Cogolla]"
  133. 133.0 133.1 https://corpus.rae.es/ayuda_c.htm
  134. ^ About 10% is technically originally oral in composition, but this represents a small portion of the overall corpus.
  135. 135.0 135.1 135.2 135.3 https://www.um.es/lexico-comercio-medieval/index.php/p/v/inicio
  136. ^ https://www.um.es/lexico-comercio-medieval/index.php/p/v/aviso%20legal
  137. ^ https://www.universocantigas.gal/aviso-legal
  138. ^ http://medieval.rb.gov.br/sobre.php
  139. 139.0 139.1 https://www.hispanicseminary.org/osta-en.htm
  140. 140.0 140.1 https://nextcloud.oldspanishtextualarchive.org/index.php/apps/onlyoffice/s/cjyee9FqPE55PQQ
  141. ^ Specifically the work "Fuero de Avilés".
  142. ^ There are works with listed date ranges as late as c. 1600 to c. 1750, but the latest with a somewhat specific range is 1650-1661 for "Trágico suceso, mortífero estrago, que la Justicia Divina obró en la Ciudad de Córdoba".
  143. 143.0 143.1 https://thesaurus-linguae-aegyptiae.de/home
  144. 144.0 144.1 https://thesaurus-linguae-aegyptiae.de/info/text-corpus
  145. ^ Approximately, it is difficult to see the full list of corpora in order to get an accurate estimate.
  146. ^ Note that this number reflects the number of corpora freely available. Including the corpora which require a subscription or special permission the number comes up to 722 as of 2023/0/13.
  147. ^ Note that the database has not been updated since 2016 and has a somewhat buggy search system.
  148. 148.0 148.1 Not confirmed to be English exclusively, but probably almost all English.
  149. ^ The BBC publishes news online in a wide variety of languages which can then be searched manually using a search engine like Google. The languages are specifically Arabic, Azeri, Bangla, Burmese, Chinese, French, Hausa, Hindi, Indonesian, Japanese, Kinyarwanda, Kirundi, Kyrgyz, Marathi, Nepali, Pashto, Persian, Portuguese, Russian, Sinhala, Somali, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, Uzbek, and Vietnamese.
  150. ^ The forum is primarily written in Chinese, though some posts are in English.
  151. ^ The section which primarily hosts links to corpora is labeled "专题研究" (Google Translate translates this as "Special Research".)
  152. ^ Both user ID and password are "test" for freely available corpora.
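
The E notation described in note 1 can be expanded mechanically when working with the size columns. The following is a minimal Python sketch, assuming sizes are written exactly as they appear in the tables (for example "5e3" or "17e3"); the helper name expand_size is hypothetical and is not part of any Wiktionary tooling.

```python
def expand_size(e_notation: str) -> int:
    """Convert a table size such as "5e3" into the integer it represents."""
    # float() parses E notation directly. Sizes in the tables are whole
    # numbers, so rounding the result to an int is safe for these values.
    return round(float(e_notation))


if __name__ == "__main__":
    # A few sizes taken from the tables on this page.
    for size in ["5e3", "2e6", "17e3", "4e0"]:
        print(size, "=", f"{expand_size(size):,}")
```

Entries marked "?" in the tables are not numbers and would need to be skipped before calling such a helper.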

Further reading
