Text and Data Mining: TDM Tools

This guide supports users interested in text and data mining. It covers data sources and tools, support options, and best practices for using content licensed by the University Libraries.

Library Data Services

Library Data Services caters to researchers interested in working with data, mapping, texts, visualization, and technology. Many of these services are available online. Davis Library Data Services, located on the second floor of Davis Library, offers:

A computing lab with specialized software for GIS and data visualization & analysis.
Walk-in assistance provided by knowledgeable student consultants during set hours.
Consultations with specialists for more in-depth inquiries (by appointment).
Spaces for collaboration and presentation, complete with white boards and external displays.
Technology short courses and programs that promote digital scholarship.

TDM Tools

Listed below are a number of tools for text and data mining. This is not intended to be a complete list, and tool availability may change. Please contact a librarian listed on the Support tab of this guide for assistance if you have any questions about the tools listed on this page.

Tools with University Subscription

ProQuest TDM Studio

A text and data mining tool that has access to most of the ProQuest collections to which the University Library subscribes. TDM Studio has two main utilities:

TDM Studio Workbench: designed for experienced researchers who use Python or R for data analysis in a Jupyter Lab environment.
TDM Studio Visualization: designed for users of all levels to quickly spot trends and generate insights.

Note: Use your UNC email when creating your ProQuest TDM Studio account.

For more information about ProQuest TDM Studio, see ProQuest's LibGuide.

Open Source/Use Tools

Text and Data Extraction

Beautiful Soup

Python library for data extraction from webpages, HTML, and XML files. For more information, see this Programming Historian tutorial.
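As a rough illustration of what extraction with Beautiful Soup can look like, the sketch below pulls titles and body text out of a small HTML snippet. The sample page, the `article` class name, and the `extract_articles` helper are all made up for this example; real pages will need their own selectors, and you should confirm that a site's terms allow mining before collecting content.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this example.
SAMPLE_HTML = """
<html><body>
  <h1>Sample Newspaper Archive</h1>
  <div class="article"><h2>First headline</h2><p>Body of the first article.</p></div>
  <div class="article"><h2>Second headline</h2><p>Body of the second article.</p></div>
</body></html>
"""


def extract_articles(html):
    """Return one {'title', 'text'} dict per <div class="article"> element."""
    # "html.parser" is Python's built-in parser, so no extra parser install is needed.
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for div in soup.find_all("div", class_="article"):
        title = div.find("h2").get_text(strip=True)
        text = div.find("p").get_text(strip=True)
        articles.append({"title": title, "text": text})
    return articles


articles = extract_articles(SAMPLE_HTML)
for article in articles:
    print(article["title"], "|", article["text"])
```

Beautiful Soup is a third-party package (installed as `beautifulsoup4`), so it must be installed before this will run.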

CrossRef

Open-source repository that links research objects, entities, and actions. Their REST API is very helpful for large-scale extraction of scholarly metadata. For more information, see their documentation.
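The sketch below shows one possible way to query the Crossref REST API's `/works` endpoint from Python using only the standard library. The helper names (`build_works_url`, `parse_works`, `fetch_works`) and the sample response are invented for illustration; the sample only mimics the general shape of a Crossref reply (a `message` object holding an `items` list) and is not real data. Check Crossref's documentation for current parameters and usage etiquette before running large jobs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_works_url(query, rows=5, mailto=None):
    """Build a /works search URL; `mailto` identifies you to Crossref."""
    params = {"query": query, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return CROSSREF_WORKS + "?" + urlencode(params)


def parse_works(payload):
    """Return (DOI, first title) pairs from a decoded /works response."""
    items = payload.get("message", {}).get("items", [])
    return [(item.get("DOI"), (item.get("title") or [""])[0]) for item in items]


def fetch_works(query, rows=5, mailto=None):
    """Perform the request (needs network access; not called in this sketch)."""
    with urlopen(build_works_url(query, rows, mailto), timeout=30) as response:
        return parse_works(json.load(response))


# Made-up response fragment used so the parsing step can be tried offline.
SAMPLE_RESPONSE = {
    "status": "ok",
    "message": {"items": [{"DOI": "10.0000/example.1", "title": ["An Example Title"]}]},
}

url = build_works_url("text mining", rows=2)
records = parse_works(SAMPLE_RESPONSE)
print(url)
print(records)
```

To retrieve live results, call `fetch_works("text mining", rows=2, mailto="you@example.edu")` with your own email address in place of the placeholder.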

Text and Data Processing

AntConc

Freeware text-analysis software for corpus research. Extract and visualize word frequencies, see key words in context, cluster terms, and more. Click here for more information about AntConc.
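AntConc is a point-and-click application, but it can help to see the two ideas mentioned above, word frequencies and key words in context (KWIC), in a few lines of Python. This is only a conceptual sketch with a made-up sample sentence and hypothetical helper names; it is not how AntConc itself is implemented, and its simple tokenizer ignores many details a real corpus tool handles.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=3):
    """Return the top_n most frequent tokens with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def kwic(text, keyword, window=2):
    """Return (left context, keyword, right context) for each hit."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Invented sample text for demonstration only.
SAMPLE = ("The library supports data mining. "
          "Text mining and data mining both need clean data.")

frequencies = word_frequencies(SAMPLE, top_n=2)
concordance = kwic(SAMPLE, "text", window=2)
print(frequencies)
print(concordance)
```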

OpenRefine

Open-source software for data exploration, transformation, and cleaning. See OpenRefine's documentation or this Programming Historian tutorial for more information.

Text and Data Visualization

Gephi

Open-source software for visualizing network data and extracting network statistics, ideal for text or social media network data. For more information, check out these tutorials.
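Gephi can import network data from a spreadsheet-style edge list, so one common preparation step is turning text data into a CSV with `Source`, `Target`, and `Weight` columns. The sketch below builds a term co-occurrence edge list from a tiny made-up set of documents; the helper names and sample terms are hypothetical, and counting co-occurrence within each document is just one of several reasonable ways to define an edge.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs):
    """Count how many documents each pair of terms appears in together."""
    edges = Counter()
    for terms in docs:
        # sorted(set(...)) gives each unordered pair one consistent key.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return edges


def write_edge_csv(edges, handle):
    """Write edges as Source,Target,Weight rows to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])


# Invented documents, each already reduced to a list of terms.
DOCS = [
    ["data", "mining", "text"],
    ["data", "mining"],
    ["text", "network"],
]

edges = cooccurrence_edges(DOCS)
buffer = io.StringIO()
write_edge_csv(edges, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```

To produce a file for import, replace the in-memory buffer with something like `open("edges.csv", "w", newline="")`, where the file name is a placeholder of your choosing.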