Here are some basic guidelines to think about when starting a text or data mining project.
Always check licensing agreements before you begin a text or data mining project. Doing so is to your benefit, for the reasons below.
Often, going through the appropriate channels before starting a text or data mining project will result in your receiving cleaner, more usable data. A more usable data set can save you an enormous amount of time in preparing it. Some websites may also provide access to pre-structured data underlying the website through an Application Programming Interface (API). These usually require some form of access authentication (provided by the vendor) and come with their own terms of service to investigate. Because API data are structured, they are often much easier to use. A librarian can walk you through best practices for getting data from the specific source that you want to use.
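As a rough illustration of why structured API data are easier to work with, here is a minimal Python sketch. The endpoint URL, the API key placeholder, the bearer-token header, and the JSON field names (`results`, `title`) are all assumptions made for this example. A real vendor's API will define its own, so check its documentation and terms of service first. The sketch only builds the request and parses a sample response; it does not contact any server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace with the endpoint and key your vendor provides.
API_URL = "https://api.example.org/v1/articles"
API_KEY = "YOUR-API-KEY"


def build_request(query, api_key=API_KEY):
    """Build (but do not send) an authenticated search request."""
    url = f"{API_URL}?{urlencode({'q': query})}"
    # Many APIs accept a key in an Authorization header; this scheme is assumed.
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


def extract_titles(payload):
    """Pull the title of each record out of a JSON response body."""
    records = json.loads(payload)["results"]
    return [record["title"] for record in records]


# A made-up response body in the assumed format, used here instead of a live call.
sample_response = (
    '{"results": [{"title": "Harbor News", "year": 1890},'
    ' {"title": "Port Ledger", "year": 1891}]}'
)

request = build_request("shipping records")
titles = extract_titles(sample_response)
```

Because the response is already structured, getting at a field takes one line of code, with no need to clean up HTML from a web page.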
Although it is often possible to scrape information from databases (for a time) despite it being illegal to do so, proceeding without a vendor's permission can jeopardize both your research and the entire campus's access to the resource. Vendors will revoke campus-wide access when they detect illegal scraping. Consult the Data Sources and Support pages to make sure you do not breach any licenses.
The best practices for text and data mining will vary depending on the source and the end goal of the project
The best way to go about text or data mining will vary significantly depending on the sources you want to use. For example, mining from shipping records in tabular form will require a different process than mining from newspaper articles. Support is available from the Davis Library Research Hub for designing a technical approach and analyzing your data; see the links on the Support tab.
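To make the contrast between tabular and textual sources concrete, here is a small Python sketch. The shipping table, its column names, and the article sentence are invented for illustration only. It assumes the tabular data arrive as CSV and the article as plain text, which may not match your source.

```python
import csv
import io
import re
from collections import Counter

# Invented sample data standing in for real sources.
shipping_csv = "ship,port,tons\nAurora,Boston,120\nHeron,Savannah,95\n"
article_text = (
    "The Aurora arrived in Boston. "
    "The harbor was crowded as the Aurora docked."
)


def total_tons(csv_text):
    """Tabular data: read rows by column name and sum a numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["tons"]) for row in reader)


def word_counts(text):
    """Free text: split into lowercase words and count occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


tons = total_tons(shipping_csv)
counts = word_counts(article_text)
```

The tabular source can be queried by column right away, while the article first has to be broken into words before anything can be counted. Real projects usually need more preparation than this in both cases.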
Some vendors may provide tools within online databases that facilitate text and data mining. Other vendors may have different agreements for different databases they offer. Some may require a fee to acquire data. The Support tab of this guide links to contact information for librarians who can help navigate these differences.
Do not scrape data from databases to which the University subscribes without first checking with a librarian.