Text and Data Mining: Data Sources

This guide provides guidance for users interested in doing text and data mining. It includes data sources, support options, and best practices for using content licensed by the University Libraries.

Library Data Services

Library Data Services caters to researchers interested in working with data, mapping, texts, visualization, and technology. Many of these services are available online. Davis Library Data Services, located on the second floor of Davis Library, offers:

A computing lab with specialized software for GIS and data visualization & analysis.
Walk-in assistance provided by knowledgeable student consultants during set hours.
Consultations with specialists for more in-depth inquiries (by appointment).
Spaces for collaboration and presentation, complete with white boards and external displays.
Technology short courses and programs that promote digital scholarship.

Note

Listed below are a number of content providers with known data mining policies. This is not intended to be a complete list, and policies may change. Please contact a librarian listed on the Support tab of this guide for assistance before trying to mine any database to which the University Libraries subscribes, or if you are interested in a database not listed below.

Selected Content Sources with TDM Capabilities (Library subscriptions)
Selected Openly Available Content Sources with TDM Capabilities
Text Corpora Purchased by the Library Specifically for TDM

Selected Content Sources with TDM Capabilities (Library subscriptions)

arXiv
TDM available but WITH RESTRICTIONS. See notes and further links on this page.
Elsevier
Offers special interface for TDM.

Hathi Trust Digital Library
Offers a special interface for TDM.
Access: No restrictions.
Language: Various
IEEE Computer Society Digital Library
Before beginning TDM projects, requires UNC users notify both the UNC Computer Science Liaison Librarian and UNC’s IEEE sales manager (here's a handy link to contact both at once).
Access:Off Campus Access is available for: UNC Chapel Hill students, faculty, and staff; UNC Hospitals employees; UNC-Chapel Hill affiliated AHEC users
JSTOR
Offers a special tool for TDM: check out Data for Research, linked under Tools at the upper right. (Note: Text Analyzer only allows you to use your own document to search for related books and articles.)
Access: Off Campus Access is available for: UNC-Chapel Hill students, faculty, and staff; UNC Hospitals employees; UNC-Chapel Hill affiliated AHEC users.
Coverage: Varies. Excludes most recent 2-5 years of currently available journals.

ProQuest
Researcher must contact ProQuest directly via this page.
Social Media Archive (SOMAR) at ICPSR
A special-topic archive within ICPSR, SOMAR "contains a wide range of data collected from large-scale social media platforms such as Twitter, Facebook, Instagram, and Reddit, as well as smaller, more specialized data sets focused on specific research topics. These data sets have been collected and curated by researchers from around the world, and they cover various topics such as political communication, online behavior, and social networks. The data in our archive takes two forms. They can be public, i.e., available for immediate download, and/or restricted, i.e., available within our secure data enclave after receiving approval following a submitted restricted data application."
SpringerLink
Download subscription journal articles and books for TDM directly from Springer Nature's content platforms.
Wiley
TDM accessible via CrossRef through this link. Users must agree to TDM Agreement. (Note: as of October 2021, UNC does not have a separate TDM agreement in place.)

Selected Openly Available Content Sources with TDM Capabilities

Early English Books Online
TDM available through the Text Creation Partnership (openly available).
Gale's Eighteenth Century Collections Online
TDM available through the Text Creation Partnership (openly available).
GDELT
Supported by Google Jigsaw, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world." Offers several tools on the Computing tab, and also the ability to Download corpora.
Internet Archive
Offers TDM through its APIs.

The Internet Archive states, "For large scale requests, and especially rehosting offers please reach out to us at info@archive.org."
Library of Congress' Chronicling America
Offers TDM through its API. Note the resources in the left sidebar.
On the Books: Jim Crow and Algorithms of Resistance
A text mining project of the University Libraries with the goal of discovering Jim Crow and racially-based legislation signed into law in North Carolina between Reconstruction and the Civil Rights Movement. This project a created two corpora: one plain text corpus of all North Carolina session laws between 1866-1967 and a corpus of all Jim Crow laws identified by the project during those dates. Corpora are available in the Carolina Digital Repository.
On The Books: Jim Crow and Algorithms of Resistance open educational resource
Accompanying On the Books, this open educational resource (OER) enables students to interact with the resulting corpora, one of all laws and one of identified Jim Crow laws, with the goal of learning critical computational concepts and skills that can be applied in historical research. Scroll down on the page to see the description of the modules and how to get started.
PLOS
"PLOS is a nonprofit, Open Access publisher empowering researchers to accelerate progress in science and medicine by leading a transformation in research communication. ... Our approach to TDM is simple: PLOS articles may be mined, reused, and shared by anyone, anywhere, for any purpose."
Readex's Evans Early American Imprints
TDM available through the Text Creation Partnership (openly available).
State Legislators' tweets, released February 2023
a dataset with nearly 4 million tweets by American state legislators, coded for topic and ideology through supervised learning. Titled, "Replication Data for: Do Male and Female Legislators Have Different Twitter Communication Styles?" this dataset was compiled for the article:

Butler, Daniel, Thad Kousser and Stan Oklobdzija. [date]. “Do Male and Female Legislators Have Different Twitter Communication Styles?” State Politics & Policy Quarterly Forthcoming. https://www.cambridge.org/core/journals/state-politics-and-policy-quarterly.

Text Corpora Purchased by the Library Specifically for TDM

Congressional Hearings
Xml files of Congressional hearings, of both the Senate and House, from 1823 (the 18th Congress) to 2005 (the 108th Congress). Note that content after 2005 is freely available from the Government Publishing Office's site, https://www.govinfo.gov/app/collection/CHRG. Contact the staff member noted above for help obtaining this content as a corpus.
Congressional Record
Xml files of Congressional debates from 1789 (the 1st Congress) to 1999 (the 105th Congress). Note that data after 1995 are freely available from the Government Publishing Office's GitHub site, https://github.com/usgpo/api. Contact the staff member noted above for help obtaining this content as a corpus.
Washington Post
JSON files of the daily print edition of the Post from 1977 through 2025.

Text corpora compiled by linguist Mark Davies

For research use only: online versions are available for use with classes.

The Corpus of American Soap Operas (SOAP) contains 100 million words of data from 22,000 transcripts from American soap operas from the early 2000s, and it serves as a great resource to look at very informal language.
The Corpus of Contemporary American English (COCA) includes 560 million words over the period 1990 to present, scanned from novels, magazines, newspapers, and academic works, and from transcripts of the spoken word. Users may choose material by genre but not by decade.
The Corpus of Global Web-Based English (GloWbE, pronounced "globe") contains 1.9 billion words collected in 2012-2013. This corpus uniquely allows you to carry out comparisons between different varieties of English across the 20 included countries.
The Corpus of Historical American English (COHA) contains more than 400 million words of text from the 1810s to 2000s, scanned from novels, magazines, newspapers, and academic works, and from transcripts of the spoken word and balanced by genre decade by decade. Users may choose material by decade but not by genre.
El Corpus del Español (SPAN) contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries from the past three to four years. It offers the ability to compare between dialects.
The iWeb Corpus contains about 14 billion words in 22,388,141 web pages from 94,391 websites collected in 2017. The websites in iWeb were chosen in a systematic way, partly based on the percentage of users from six English-speaking countries: the U.S., Canada, Ireland, the U.K., Australia, and New Zealand.
The Movie Corpus includes 200 million words in 25,094 movie texts from the 1930s to the 2010s (the last texts are from 2018). The texts were taken from the OpenSubtitles collection. Each movie was matched with the corresponding page from IMDB, which provides rich metadata for each movie (and which can be used to create one's own virtual corpus).
The News on the Web corpus (NOW) includes more than ten billion words from web-based newspapers and magazines over the period 2010 through May 2021.
The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. All of the 75,000 episodes are tied in to their IMDB entry, which means that you can create sub-corpora using extensive metadata -- year, country, series, rating, genre, plot summary, etc. The TV corpus (along with the Movies and SOAP Corpora) serves as a great resource to look at very informal language -- at least as well as with corpora of actual spoken English.
The Wikipedia Corpus (Wiki) contains contains 1.9 billion words in more than 4.4 million articles collected in 2014. This corpus is particularly useful for creating and using customized virtual corpora from any of the included Wikipedia articles. For example, you could create a corpus with 500-1,000 pages (perhaps 500,000-1,000,000 words) related to microbiology, economics, basketball, Buddhism, or thousands of other topics.

Text and Data Mining: Data Sources

Library Data Services

Note

Selected Content Sources with TDM Capabilities (Library subscriptions)

Selected Openly Available Content Sources with TDM Capabilities

Text Corpora Purchased by the Library Specifically for TDM

Search & Find

Places & Spaces

Services

Support & Guides

About Us

Contact Us