
AI and Machine Learning for Evidence Syntheses: Predictive AI/ML Tools Available at UNC

Created by Health Science Librarians

Types of AI/ML

Artificial Intelligence versus Machine Learning

Artificial intelligence (AI) refers to any situation in which a machine mimics human intelligence. This is a broad category that includes virtual assistants such as Siri and Alexa, self-driving cars, and facial-recognition technology.

Machine learning (ML) is a type of artificial intelligence in which the machine learns from data how to complete certain tasks.

Predictive AI versus Generative AI

The UNC Health Sciences Library (HSL) has a set of internal tools that can be applied to systematic reviews and other large-scale evidence syntheses to prioritize citations for screening. These tools use predictive AI models rather than generative AI. The application of generative AI (GenAI) and large language models (LLMs) to article screening is evolving rapidly, but there are currently no commercial products that use GenAI to categorize citations as relevant or not relevant. This guide will be updated as tested GenAI resources and approaches become available.

Generative AI tools take in large amounts of training data to learn how to produce, or generate, content on their own. ChatGPT (text generation) and DALL-E (image generation) are examples of generative AI tools. In contrast, predictive AI uses labeled data (i.e., data categorized by humans) to make predictions about future outcomes. Predictive AI is well established and has long been used in fields such as weather forecasting and finance.

While predictive AI models have been used for decades, they have not been adopted on a large scale for systematic reviews or other literature-based research. With the growing volume of published research, AI is likely to become an integral part of the research process. Predictive AI models can be used effectively to prioritize citations for human review; extensive testing of the methodology, particularly for biomedical literature, can be found in peer-reviewed studies dating back to 2000 (Mostafa & Lam, 2000).

Element | Predictive AI | Generative AI
Input | Labeled data (i.e., categorized manually by humans) | Large language model (LLM) and user prompts
Output | Predictions for unlabeled data (e.g., indicates citations likely to be relevant) | New content
Reproducible | Yes | No
Validated Algorithms Available | Yes | No
Training Data is Transparent | Yes | No

Some of the information presented in the table above is adapted from Dr. Siw Waffenschmidt's Cochrane-sponsored presentation, "Could large language models and/or AI-based automation tools assist the screening process?". More information can be found at https://training.cochrane.org/resource/could-large-language-models-and-or-ai-based-automation-tools-assist-the-screening-process.

Predictive AI for Systematic Reviews

Four types of predictive AI/ML have applications for systematic reviews: 1) unsupervised machine learning (clustering), 2) semi-supervised machine learning, 3) supervised machine learning, and 4) active machine learning. Generally speaking, these approaches work best on projects with 3,000 or more unique citations, although clustering may be used for topic analysis on smaller datasets. You can find more information about each of these approaches on this page.

Unsupervised Machine Learning

What is unsupervised machine learning (clustering)?

Unsupervised machine learning, or clustering, uses algorithms such as k-means, nonnegative matrix factorization (NMF), or latent Dirichlet allocation (LDA). At HSL, we primarily use k-means clustering, which groups data into a fixed number (k) of clusters based on text similarities in titles and abstracts. Citations from a literature search, most commonly represented by their titles and abstracts, are run through k-means clustering, which creates groups, or clusters, based on words commonly featured in each. Researchers can view the keywords that characterize each cluster, use them to see how the literature is distributed within the search results, and use that information to help define or narrow a literature search topic.

Input

List of citations, including titles and abstracts for each. These data are unlabeled, meaning the user does not assign relevance to any of the input citations.

Output

Clusters of citations with keyword summaries for each cluster.
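To make the mechanics concrete, here is a minimal sketch of k-means clustering over title/abstract text using Python's scikit-learn library. The tiny corpus, the value of k, and the variable names are illustrative assumptions, not HSL's exact pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Illustrative corpus; in practice, one "title + abstract" string per citation
citations = [
    "patient reported outcome measures in oncology survivorship care",
    "validity and reliability of a quality of life questionnaire",
    "psychometric properties of the eortc qlq instrument",
    "surgical resection and postoperative complications of tumors",
    "laparoscopic procedure outcomes in patients who underwent surgery",
    "operative mortality after radical resection",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(citations)

k = 2  # the analyst fixes the number of clusters up front
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Summarize each cluster by the highest-weighted terms in its centroid
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(model.cluster_centers_):
    top = center.argsort()[::-1][:5]
    print(f"Cluster {i + 1}: {[terms[j] for j in top]}")
print(model.labels_)  # cluster assignment for each citation
```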

Unsupervised clustering, where unlabeled titles and abstracts are clustered based on common words.

In the image above, each shape represents one article. Unlabeled titles and abstracts are input, and clusters of articles are output based on text similarities in the input data.

Benefits of Clustering

  • Does not require developing a set of seed studies (also called labeled or training data)
  • Provides insight into themes appearing within the entire corpus
  • Quick and simple
  • Data-driven approach
  • Good for refining the literature search of large or complex reviews

Limitations of Clustering

  • Relies on human judgment to prioritize clusters for screening, which requires subject matter expertise
  • No prediction of the number of relevant studies in the search
  • Reduced precision as clusters are not based on seeds or a training set

Example of Clustering Output from a Systematic Review Search

In the example below, the titles and abstracts of all citations from a large systematic review literature search were analyzed using clustering, and each citation was assigned to one of ten groups based on text analysis. Keywords for each cluster are generated by the algorithm and can provide insight into the topics found in the corpus. Subject matter experts may use the keywords to look more closely at citations in a particular cluster.

Unsupervised clustering output for project on patient-reported outcomes measures (PROMs) in oncology care
Cluster | Total Citations (n = 9578) | Keywords
1 | 1153 | ['surgery', 'surgical', 'complications', 'resection', 'postoperative', 'underwent', 'procedure', 'tumor', 'performed', 'survival', 'reconstruction', 'months', 'follow', 'patients underwent', 'patient', 'group', 'mean', 'term', 'cases', 'laparoscopic']
2 | 1262 | ['screening', 'self', 'health', 'reported', 'among', 'women', 'associated', 'self reported', 'risk', 'factors', 'survivors', 'age', 'related', 'survey', 'participants', 'breast', 'population', 'data', 'years', 'higher']
3 | 661 | ['validity', 'reliability', 'psychometric', 'version', 'internal', 'internal consistency', 'consistency', 'items', 'scale', 'questionnaire', 'item', 'cronbach', 'instrument', 'properties', 'psychometric properties', 'qlq', 'construct', 'factor', 'eortc', 'alpha']
4 | 856 | ['intervention', 'survivors', 'exercise', 'program', 'cancer survivors', 'participants', 'feasibility', 'physical', 'breast', 'breast cancer', 'group', 'pilot', 'week', 'fatigue', 'physical activity', 'post', 'pilot study', 'self', 'activity', 'baseline']
5 | 592 | ['survival', 'chemotherapy', 'toxicity', 'response', 'median', 'advanced', 'progression', 'cell', 'phase', 'months', 'therapy', 'mg', 'grade', 'disease', 'overall', 'overall survival', 'dose', 'phase ii', 'ii', 'line']
6 | 465 | ['prostate', 'prostate cancer', 'urinary', 'epic', 'prostatectomy', 'sexual', 'men', 'radical prostatectomy', 'radical', 'months', 'function', 'localized', 'expanded prostate', 'cancer index', 'index composite', 'expanded', 'therapy', 'localized prostate', 'bowel', 'index']
7 | 2023 | ['clinical', 'patient', 'care', 'health', 'use', 'data', 'oncology', 'research', 'based', 'outcomes', 'practice', 'evaluation', 'assessment', 'breast', 'management', 'reported', 'used', 'trials', 'information', 'using']
8 | 718 | ['cost', 'cost effectiveness', 'effectiveness', 'qaly', 'costs', 'cost effective', 'incremental', 'per', 'quality adjusted', 'model', 'adjusted life', 'adjusted', 'incremental cost', 'effective', 'qalys', 'gained', 'life years', 'sensitivity', 'analysis', 'year']
9 | 539 | ['care', 'palliative', 'palliative care', 'caregivers', 'needs', 'family', 'patient', 'caregiver', 'end life', 'advanced', 'cancer patients', 'end', 'symptom', 'advanced cancer', 'support', 'assessment', 'home', 'intervention', 'hospice', 'burden']
10 | 1309 | ['therapy', 'pain', 'patient', 'chemotherapy', 'group', 'treated', 'significantly', 'significant', 'scores', 'qol', 'effects', 'baseline', 'clinical', 'months', 'dose', 'mean', 'feasibility', 'score', 'cancer patients', 'compared']

Semi-Supervised Machine Learning

What is semi-supervised machine learning?

Semi-supervised machine learning also uses clustering algorithms. UNC refers to this methodology as supervised clustering. The method is considered semi-supervised because a portion of the studies sent to clustering have been manually reviewed for relevance. These manually reviewed studies, also called seeds, are typically known relevant studies. The seeds are clustered along with unlabeled data and can provide another signal (beyond the keywords generated by the algorithm) as to which clusters are likely to contain studies of interest.  

Input

Titles and abstracts for all unlabeled citations and seed studies.

Output

Clusters of citations with keyword summaries and proportion of seeds for each cluster.
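The "proportion of seeds" output can be sketched in a few lines of Python: cluster all citations (seeds plus unlabeled) as before, then report the share of all seeds that landed in each cluster. The cluster labels and seed flags below are illustrative.

```python
import numpy as np

# Cluster label for every citation (seeds + unlabeled), e.g. from KMeans
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
# True where the citation is a known-relevant seed study
is_seed = np.array([True, False, False, True, True,
                    False, False, False, True, False])

n_seeds = is_seed.sum()
for c in np.unique(labels):
    in_cluster = labels == c
    pct = 100 * (is_seed & in_cluster).sum() / n_seeds
    print(f"Cluster {c + 1}: {in_cluster.sum()} citations, {pct:.0f}% of seeds")
```

A cluster that captures a large share of the seeds (like cluster 2 in the table below, with 42%) is a strong candidate to screen first.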

Supervised clustering, where seed studies and all other titles and abstracts are organized into clusters based on their similarity to the seed studies.

In the image above, each shape represents one article. Several relevant seed studies, identified by screening a random subset, are input into the system. The remaining articles are clustered based on their similarity to the seed studies.

Supervised clustering output for project on patient-reported outcomes measures (PROMs) in oncology care
Cluster | Total Citations (n = 9578) | Proportion of Seeds (%) | Keywords
1 | 1153 | 0 | ['surgery', 'surgical', 'complications', 'resection', 'postoperative', 'underwent', 'procedure', 'tumor', 'performed', 'survival', 'reconstruction', 'months', 'follow', 'patients underwent', 'patient', 'group', 'mean', 'term', 'cases', 'laparoscopic']
2 | 1262 | 42 | ['screening', 'self', 'health', 'reported', 'among', 'women', 'associated', 'self reported', 'risk', 'factors', 'survivors', 'age', 'related', 'survey', 'participants', 'breast', 'population', 'data', 'years', 'higher']
3 | 661 | 8 | ['validity', 'reliability', 'psychometric', 'version', 'internal', 'internal consistency', 'consistency', 'items', 'scale', 'questionnaire', 'item', 'cronbach', 'instrument', 'properties', 'psychometric properties', 'qlq', 'construct', 'factor', 'eortc', 'alpha']
4 | 856 | 0 | ['intervention', 'survivors', 'exercise', 'program', 'cancer survivors', 'participants', 'feasibility', 'physical', 'breast', 'breast cancer', 'group', 'pilot', 'week', 'fatigue', 'physical activity', 'post', 'pilot study', 'self', 'activity', 'baseline']
5 | 592 | 2 | ['survival', 'chemotherapy', 'toxicity', 'response', 'median', 'advanced', 'progression', 'cell', 'phase', 'months', 'therapy', 'mg', 'grade', 'disease', 'overall', 'overall survival', 'dose', 'phase ii', 'ii', 'line']
6 | 465 | 1 | ['prostate', 'prostate cancer', 'urinary', 'epic', 'prostatectomy', 'sexual', 'men', 'radical prostatectomy', 'radical', 'months', 'function', 'localized', 'expanded prostate', 'cancer index', 'index composite', 'expanded', 'therapy', 'localized prostate', 'bowel', 'index']
7 | 2023 | 12 | ['clinical', 'patient', 'care', 'health', 'use', 'data', 'oncology', 'research', 'based', 'outcomes', 'practice', 'evaluation', 'assessment', 'breast', 'management', 'reported', 'used', 'trials', 'information', 'using']
8 | 718 | 0 | ['cost', 'cost effectiveness', 'effectiveness', 'qaly', 'costs', 'cost effective', 'incremental', 'per', 'quality adjusted', 'model', 'adjusted life', 'adjusted', 'incremental cost', 'effective', 'qalys', 'gained', 'life years', 'sensitivity', 'analysis', 'year']
9 | 539 | 9 | ['care', 'palliative', 'palliative care', 'caregivers', 'needs', 'family', 'patient', 'caregiver', 'end life', 'advanced', 'cancer patients', 'end', 'symptom', 'advanced cancer', 'support', 'assessment', 'home', 'intervention', 'hospice', 'burden']
10 | 1309 | 26 | ['therapy', 'pain', 'patient', 'chemotherapy', 'group', 'treated', 'significantly', 'significant', 'scores', 'qol', 'effects', 'baseline', 'clinical', 'months', 'dose', 'mean', 'feasibility', 'score', 'cancer patients', 'compared']

Supervised Clustering with Ensemble Approach

At UNC HSL, librarians are trained in using supervised clustering with an ensemble approach. The ensemble approach uses six models and can be thought of as a voting method: each study receives an ensemble score of 0 to 6 indicating how many models “voted” that the study is likely to be relevant. The ensemble score allows us to sort studies into batches for manual screening. Studies with an ensemble score of 6 were voted relevant by all six models, are therefore the most likely to be relevant, and should be screened first.

Supervised clustering with an ensemble approach is typically where we start when prioritizing studies for screening. This method is preferred as a starting place over supervised machine learning because it requires less training data: we can use as few as 25 seed studies, assuming they are identified at random from the full corpus. Supervised machine learning requires a more robust training dataset, and we typically recommend several hundred positive studies.
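As an illustration of the voting idea, the sketch below sums binary relevance votes from six models into a 0-to-6 ensemble score and ranks studies for screening. The votes themselves are made up, and the source does not specify which six models HSL uses.

```python
import numpy as np

# Rows = studies, columns = six models; 1 = model predicted "relevant"
votes = np.array([
    [1, 1, 1, 1, 1, 1],  # ensemble score 6: screen first
    [1, 0, 1, 1, 0, 1],  # ensemble score 4: screen in a later batch
    [0, 0, 0, 0, 0, 0],  # ensemble score 0: likely not relevant
])

scores = votes.sum(axis=1)                 # 0-6 "vote count" per study
screening_order = np.argsort(scores)[::-1] # highest scores screened first
print(scores, screening_order)
```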

Ensemble approach showing the cutoff based on the number of models that predicted an article is relevant

The image above shows an example of an ensemble approach with 30,000 total references run through six different models. As fewer models predict an article to be relevant, the likelihood that the reference is relevant decreases. For instance, 18,307 of the references were not predicted to be relevant by any of the models and therefore likely do not need to be manually screened by a human.

When to Use

  • Large number of search results
  • Ongoing review series
  • Search updates of previously published, large evidence synthesis
  • Low precision search with very few relevant studies (needle in a haystack)

Benefits of Supervised Clustering

  • Requires a small set of training data (i.e., seed studies)
  • Offers an unbiased judgment of which clusters to prioritize for manual review
  • Estimates recall (i.e., the proportion of relevant studies found)
  • Uses a data-driven approach to increase efficiency in title/abstract screening stage of the review

Limitations of Supervised Clustering

  • Requires randomly identifying seed studies

Examples of Published Reviews Using Supervised Clustering Methods

  • Varghese A, Cawley M, Hong T. Supervised clustering for automated document classification and prioritization: a case study using toxicological abstracts. Environ Syst Decis. 2018;38:398-414. doi: 10.1007/s10669-017-9670-5
  • Anderson DM, Cronk R, Fejfar D, Pak E, Cawley M, Bartram J. Safe Healthcare Facilities: A Systematic Review on the Costs of Establishing and Maintaining Environmental Health in Facilities in Low- and Middle-Income Countries. Int J Environ Res Public Health. 2021 Jan 19;18(2):817. doi: 10.3390/ijerph18020817
  • Christenson EC, Cronk R, Atkinson H, Bhatt A, Berdiel E, Cawley M, Cho G, Coleman CK, Harrington C, Heilferty K, Fejfar D, Grant EJ, Grigg K, Joshi T, Mohan S, Pelak G, Shu Y, Bartram J. Evidence Map and Systematic Review of Disinfection Efficacy on Environmental Surfaces in Healthcare Facilities. Int J Environ Res Public Health. 2021 Oct 22;18(21):11100. doi: 10.3390/ijerph182111100
  • Cawley M, Carlson R, Vest TA, Eckel SF. Machine learning-assisted literature screening for a medication-use process-related systematic review. Am J Health Syst Pharm. 2024 Nov 22:zxae357. doi: 10.1093/ajhp/zxae357

Supervised Machine Learning

What is supervised machine learning?

With supervised machine learning, often referred to simply as machine learning, a set of citations is selected randomly from the comprehensive systematic review search. A human labels each of these citations as either relevant or not relevant, and the labeled set is input into a machine learning algorithm to develop a model that can predict whether an article is relevant to the research project. All of the remaining citations from the search are then put through the model, which predicts whether each article is likely to be relevant.

Input

Randomly selected training dataset labeled by humans, plus the rest of the unlabeled search results.

Output

Predictions of whether or not search results are likely to be relevant.
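Here is a minimal sketch of this workflow in Python, using TF-IDF features and logistic regression. The model choice and the toy data are illustrative assumptions; the source does not specify HSL's exact algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Randomly selected citations, labeled by human screeners (1 = relevant)
train_texts = [
    "patient reported outcome measures in oncology care",
    "surgical complications after tumor resection",
    "validity of a quality of life questionnaire in cancer survivors",
    "laparoscopic procedure case series",
]
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
clf = LogisticRegression().fit(X_train, train_labels)

# Score the rest of the search results: probability (0 to 1) of relevance
unlabeled = [
    "psychometric properties of a qol instrument in oncology",
    "postoperative outcomes after radical resection",
]
probs = clf.predict_proba(vectorizer.transform(unlabeled))[:, 1]
print(probs)  # screen the highest-probability citations first
```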

Supervised ML, where labeled training data and unlabeled titles and abstracts are input to predict the relevance of the unlabeled titles and abstracts.

In the image above, each shape represents one article. Articles are randomly selected from the search results, and marked as relevant or not relevant. Training data are input along with the rest of the search results. The algorithm predicts which of the unlabeled references are likely to be relevant.

Benefits of Supervised Machine Learning

  • Provides a probability score (0 to 1) of whether an article is likely to be relevant
  • Indicates which articles should be screened to reach a desired level of recall (see the sketch after this list)
  • Can be used to make cutoff decisions about when to stop screening
  • Good for the title/abstract screening stage of the review
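One way to turn probability scores into a screening cutoff is sketched below: sort citations by predicted probability and treat the summed probabilities as an estimate of how many relevant studies the top-ranked slice contains. This estimation strategy is an assumption for illustration, not a documented HSL method.

```python
import numpy as np

probs = np.array([0.95, 0.90, 0.72, 0.40, 0.15, 0.05, 0.02])  # model output
target_recall = 0.95

order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])  # expected relevant found so far
estimated_total = probs.sum()         # expected relevant in the whole search
cutoff = np.searchsorted(cumulative, target_recall * estimated_total) + 1
print(f"Screen the top {cutoff} of {len(probs)} citations")
```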

Limitations of Supervised Machine Learning

  • Requires training data
  • Difficult to know how large the training dataset should be
  • Bias in the training data can impact the results, either leading to false negatives or reinforcing what the researchers already know

Example of Using Supervised Machine Learning for a Review

The figure above shows the order of screening based on the probability score derived from machine learning. Black dots indicate studies identified as relevant in manual screening. Stage 1 contains the studies most likely to be relevant based on probability score and recommended for manual review. Stage 2 is the “insurance step”: the next 500 studies most likely to be relevant after those recommended for review by the model. Stage 3 contains studies with a low probability of being relevant that are not recommended for manual screening; however, the research team elected to screen them. The black dot in stage 3 was a study labeled “interesting” by the research team but not necessarily relevant to the project.

Active Machine Learning

What is active machine learning?

Active machine learning is a type of predictive AI, but it differs from the techniques described above. Like supervised machine learning, it uses a training dataset of labeled titles and abstracts (known relevant and not relevant) to train the system to learn which articles are more relevant to the project. In contrast to supervised machine learning, however, the training dataset is developed from a non-random sample. Additionally, training is iterative: the study prioritization continues to be updated as humans go through the screening process.

Input

Training dataset of human-labeled citations and set of unlabeled citations

Output

Predictions on whether the unlabeled documents are relevant. These predictions continue to adjust as more screening is completed by humans.
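The loop below is a minimal sketch of this iterative cycle, assuming a TF-IDF plus logistic regression model over a tiny synthetic corpus. The starter labels, the batch size of one, and the stand-in array of "human decisions" are illustrative, not how any particular screening tool works.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic corpus; in practice these are titles + abstracts
texts = [
    "proms oncology survivors quality of life",
    "patient reported outcomes questionnaire validity",
    "surgical resection postoperative complications",
    "laparoscopic procedure operative technique",
    "qol measures in cancer survivorship care",
    "tumor margins and operative mortality",
]
true = [1, 1, 0, 0, 1, 0]     # stands in for future human screening decisions

labels = {0: 1, 2: 0}          # non-random starter set of screened citations
X = TfidfVectorizer().fit_transform(texts)

while len(labels) < len(texts):
    known = sorted(labels)
    clf = LogisticRegression().fit(X[known], [labels[i] for i in known])
    remaining = [i for i in range(len(texts)) if i not in labels]
    probs = clf.predict_proba(X[remaining])[:, 1]
    nxt = remaining[int(np.argmax(probs))]  # most-likely-relevant citation
    labels[nxt] = true[nxt]                 # "human" screens it; model retrains
    print(f"screened citation {nxt}, decision {true[nxt]}")
```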

Benefits of Active Machine Learning

Limitations of Active Machine Learning

  • Requires a training dataset
  • Potential for higher bias than supervised machine learning because the training dataset is not based on a random sample
  • Requires an interface to run, often through paid programs or subscriptions
  • Can be unclear when to stop screening
  • Less transparent than other methods

Example of Using Active Machine Learning for a Review

Covidence is a tool used for systematic review screening, data extraction, and quality assessment. While screening, users have the option to sort references by "Most Relevant". When this option is chosen, Covidence orders citations by how likely they are to be relevant to your project, based on previously screened citations (those marked included or excluded). At least 25 studies must be marked as included or excluded, with at least two of each. As users continue to screen citations, the active machine learning algorithm continues to learn and adjust, and the list order continues to update.