
Text and Data Mining: Best Practices

This guide is for users interested in text and data mining. It covers data sources, support options, and best practices for using content licensed by the University Libraries.

Library Data Services

Library Data Services caters to researchers interested in working with data, mapping, texts, visualization, and technology. Many of these services are available online. Davis Library Data Services, located on the second floor of Davis Library, offers:

  • A computing lab with specialized software for GIS and data visualization & analysis.
  • Walk-in assistance provided by knowledgeable student consultants during set hours.
  • Consultations with specialists for more in-depth inquiries (by appointment).
  • Spaces for collaboration and presentation, complete with white boards and external displays.
  • Technology short courses and programs that promote digital scholarship.

General Guidelines

Here are some basic guidelines to think about when starting a text or data mining project.

Always Verify Permission

Always check licensing agreements before you begin a text or data mining project. Doing so benefits you in two ways:

Get Cleaner Data

Often, going through the appropriate channels before starting a text or data mining project will get you cleaner, more usable data, which can save you an enormous amount of preparation time. Some websites also provide access to the pre-structured data underlying the site through an Application Programming Interface (API). APIs usually require some form of access authentication (provided by the vendor) and come with their own terms of service, which you should review. Because API data are structured, they are often much easier to use. A librarian can walk you through best practices for getting data from the specific source that you want to use.
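As a rough illustration of what working with an authenticated API can look like, here is a minimal Python sketch. The endpoint URL, the API key, the header-based authentication, and the JSON field names (`results`, `title`, `date`) are all hypothetical; every vendor documents its own, so treat this as a sketch under those assumptions rather than a recipe for any particular service.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: substitute the endpoint and key your vendor provides.
BASE_URL = "https://api.example.com/v1/articles"
API_KEY = "YOUR-API-KEY"


def build_request(query, page=1):
    """Build an authenticated GET request for one page of results."""
    params = urllib.parse.urlencode({"q": query, "page": page})
    request = urllib.request.Request(f"{BASE_URL}?{params}")
    # Assumption: the key is sent in a header; check the vendor's documentation.
    request.add_header("Authorization", f"Bearer {API_KEY}")
    return request


def parse_records(payload):
    """Extract (title, date) pairs from a JSON response body."""
    data = json.loads(payload)
    return [(item["title"], item["date"]) for item in data["results"]]


# Offline demonstration using a made-up response body (no network access).
sample = '{"results": [{"title": "Harbor News", "date": "1901-03-14"}]}'
print(build_request("shipping").full_url)
print(parse_records(sample))
```

To actually retrieve data, you would pass the request to `urllib.request.urlopen` and hand the response body to `parse_records`, pausing between requests and staying within whatever limits the vendor's terms of service set.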

Avoid Breaching Licensing Agreements

Although it is often possible to scrape information from databases (for a time) even when doing so is prohibited, proceeding without a vendor's permission can jeopardize your research as well as access to the resource for the entire campus: vendors may revoke campus-wide access when they detect scraping that violates their terms. Consult the Data Sources and Support pages to make sure you do not breach any licenses.

Every Source is Different

The best practices for text and data mining vary depending on the source and the end goal of the project.

Content

The best way to go about text or data mining will vary significantly depending on the sources the researcher wants to use. For example, mining shipping records in tabular form requires a different process than mining newspaper articles. Library Data Services can help you design a technical approach and analyze your data; see the links on the Support tab.

Vendor Agreements

Some vendors may provide tools within online databases that facilitate text and data mining. Other vendors may have different agreements for different databases they offer. Some may require a fee to acquire data. The Support tab of this guide links to contact information for librarians who can help navigate these differences.

*Important*

Do not scrape data from databases to which the University subscribes without first checking with a librarian.