Text and Data Mining: Data Sources

This guide is for users interested in text and data mining (TDM). It covers data sources, support options, and best practices for using content licensed by the University Libraries.

Note

The content providers listed below have known data mining policies. This list is not complete, and policies may change. Please contact a librarian listed on the Support tab of this guide before trying to mine any database to which the University Libraries subscribes, or if you are interested in a database not listed below.

Selected Content Sources with TDM Capabilities (Library subscriptions)

Selected Openly Available Content Sources with TDM Capabilities

Text Corpora Purchased by the Library Specifically for TDM

Text corpora compiled by linguist Mark Davies

For research use only; online versions are available for use with classes.

  • The Corpus of American Soap Operas (SOAP) contains 100 million words of data from 22,000 transcripts from American soap operas from the early 2000s, and it serves as a great resource to look at very informal language.
  • The Corpus of Contemporary American English (COCA) includes 560 million words over the period 1990 to present, scanned from novels, magazines, newspapers, and academic works, and from transcripts of the spoken word. Users may choose material by genre but not by decade.
  • The Corpus of Global Web-Based English (GloWbE, pronounced "globe") contains 1.9 billion words collected in 2012-2013. This corpus uniquely allows you to carry out comparisons between different varieties of English across the 20 included countries.
  • The Corpus of Historical American English (COHA) contains more than 400 million words of text from the 1810s to 2000s, scanned from novels, magazines, newspapers, and academic works, and from transcripts of the spoken word, and is balanced by genre decade by decade. Users may choose material by decade but not by genre.
  • El Corpus del Español (SPAN) contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries from the past three to four years. It offers the ability to compare between dialects.
  • The iWeb Corpus contains about 14 billion words in 22,388,141 web pages from 94,391 websites collected in 2017. The websites in iWeb were chosen in a systematic way, partly based on the percentage of users from six English-speaking countries: the U.S., Canada, Ireland, the U.K., Australia, and New Zealand.
  • The Movie Corpus includes 200 million words in 25,094 movie texts from the 1930s to the 2010s (the last texts are from 2018). The texts were taken from the OpenSubtitles collection. Each movie was matched with the corresponding page from IMDB, which provides rich metadata for each movie (and which can be used to create one's own virtual corpus).
  • The News on the Web corpus (NOW) includes more than ten billion words from web-based newspapers and magazines over the period 2010 through May 2021.
  • The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. All of the 75,000 episodes are tied to their IMDB entries, which means that you can create sub-corpora using extensive metadata -- year, country, series, rating, genre, plot summary, etc. The TV Corpus (along with the Movie and SOAP corpora) serves as a great resource for studying very informal language, at least as good as corpora of actual spoken English.
  • The Wikipedia Corpus (Wiki) contains 1.9 billion words in more than 4.4 million articles collected in 2014. This corpus is particularly useful for creating and using customized virtual corpora from any of the included Wikipedia articles. For example, you could create a corpus with 500-1,000 pages (perhaps 500,000-1,000,000 words) related to microbiology, economics, basketball, Buddhism, or thousands of other topics.