Artificial intelligence (AI) refers to any system in which a machine performs tasks that mimic human intelligence. This is a broad category that includes virtual assistants such as Siri and Alexa, self-driving cars, and facial-recognition technology.
Machine learning (ML) is a type of artificial intelligence in which a machine is given data from which it learns how to complete certain tasks.
The UNC Health Sciences Library (HSL) has a set of internal tools that can be applied to systematic reviews and other large-scale evidence syntheses to prioritize citations for screening. These tools use predictive AI models rather than generative AI. Application of generative AI (GenAI) and large language models (LLMs) to article screening is evolving rapidly. Currently, there are no commercial products available that use GenAI to categorize citations as relevant or not relevant. Machine learning approaches (predictive AI) used by HSL are available to prioritize citations for screening. This guide will be updated when resources and tested approaches become available.
Generative AI tools take in large amounts of training data to learn how to produce, or generate, content on their own. ChatGPT (text generation) and DALL-E (image generation) are examples of generative AI tools. In contrast, predictive AI uses labeled data (i.e., data categorized by humans) to make predictions about future outcomes. Predictive AI is well established and has long been used in fields such as weather forecasting and finance.
While predictive AI models have been used for decades, they have not been adopted at a large scale for systematic reviews or other literature-based research. With the growing amount of published research, it is likely that AI will become an integral part of the research process. Predictive AI models can be used effectively to prioritize citations for human review. Extensive testing of the methodology, particularly for biomedical literature, can be found in peer-reviewed studies dating back to 2000 (Mostafa & Lam, 2000).
| Element | Predictive AI | Generative AI |
| --- | --- | --- |
| Input | Based on labeled data (i.e., categorized manually by humans) | Large language model (LLM) and user prompts |
| Output | Predictions for unlabeled data (e.g., indicates citations likely to be relevant) | New content |
| Reproducible | Yes | No |
| Validated Algorithms Available | Yes | No |
| Training Data is Transparent | Yes | No |
Some of the information in the table above is adapted from Dr. Siw Waffenschmidt's presentation, "Could large language models and/or AI-based automation tools assist the screening process?". The presentation was sponsored by Cochrane, and more information is available at https://training.cochrane.org/resource/could-large-language-models-and-or-ai-based-automation-tools-assist-the-screening-process.
There are four types of predictive AI/ML with applications for systematic reviews: (1) unsupervised machine learning (clustering), (2) semi-supervised machine learning, (3) supervised machine learning, and (4) active machine learning. Generally speaking, these approaches work best on projects with 3,000 or more unique citations, although clustering may be used for topic analysis on smaller datasets. You can find more information about each of these approaches on this page.
Unsupervised machine learning, or clustering, uses algorithms such as k-means, Nonnegative Matrix Factorization (NMF), or latent Dirichlet allocation (LDA). At HSL, we primarily use k-means clustering, which groups citations into a fixed number (k) of clusters based on text similarities in their titles and abstracts; each cluster is characterized by the words commonly featured in its citations. Researchers can view the keywords that define each cluster, use them to see how the literature is distributed across the search results, and draw on them to define or narrow a literature search topic.
**Input:** A list of citations, including titles and abstracts for each. These data are unlabeled, meaning the user does not assign relevance to any of the input citations.
**Output:** Clusters of citations with keyword summaries for each cluster.
In the image above, each shape represents one article: unlabeled titles and abstracts are the input, and the output is clusters of articles grouped by the text similarity of the input data.
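As a concrete illustration of this workflow, the sketch below clusters titles and abstracts with scikit-learn and prints the top keywords per cluster. The citation strings are hypothetical placeholders, and the vectorizer and model settings are assumptions for illustration; this is not HSL's internal tool.

```python
# Minimal k-means clustering sketch using scikit-learn.
# The citations below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

citations = [  # one string per citation: title + abstract
    "Laparoscopic resection outcomes: postoperative complications after surgery",
    "Psychometric properties and internal consistency of a quality of life scale",
    "Cost effectiveness of screening: incremental cost per QALY gained",
    # ... in practice, thousands of titles and abstracts
]

# Represent each citation as a TF-IDF vector over unigrams and bigrams,
# mirroring the one- and two-word keywords in the tables on this page.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(citations)

# k is fixed in advance (the example below uses ten clusters);
# capped here so the toy corpus still runs.
k = min(10, len(citations))
model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)

# Summarize each cluster by the terms with the highest center weights;
# these act as the keyword lists shown in the output tables.
terms = vectorizer.get_feature_names_out()
for c in range(k):
    top = model.cluster_centers_[c].argsort()[::-1][:20]
    print(f"Cluster {c + 1}: {[terms[i] for i in top]}")
```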
In the example below, the titles and abstracts of all citations from a large systematic review literature search were analyzed using clustering, and each citation was assigned to one of ten groups based on text analysis. Keywords for each cluster are generated by the algorithm and can provide insight into the topics found in the corpus. Subject matter experts may use the keywords to look more closely at the citations in a particular cluster.
Cluster | Total Citations (n = 9578) | Keywords |
---|---|---|
1 | 1153 | ['surgery', 'surgical', 'complications', 'resection', 'postoperative', 'underwent', 'procedure', 'tumor', 'performed', 'survival', 'reconstruction', 'months', 'follow', 'patients underwent', 'patient', 'group', 'mean', 'term', 'cases', 'laparoscopic'] |
2 | 1262 | ['screening', 'self', 'health', 'reported', 'among', 'women', 'associated', 'self reported', 'risk', 'factors', 'survivors', 'age', 'related', 'survey', 'participants', 'breast', 'population', 'data', 'years', 'higher'] |
3 | 661 | ['validity', 'reliability', 'psychometric', 'version', 'internal', 'internal consistency', 'consistency', 'items', 'scale', 'questionnaire', 'item', 'cronbach', 'instrument', 'properties', 'psychometric properties', 'qlq', 'construct', 'factor', 'eortc', 'alpha'] |
4 | 856 | ['intervention', 'survivors', 'exercise', 'program', 'cancer survivors', 'participants', 'feasibility', 'physical', 'breast', 'breast cancer', 'group', 'pilot', 'week', 'fatigue', 'physical activity', 'post', 'pilot study', 'self', 'activity', 'baseline'] |
5 | 592 | ['survival', 'chemotherapy', 'toxicity', 'response', 'median', 'advanced', 'progression', 'cell', 'phase', 'months', 'therapy', 'mg', 'grade', 'disease', 'overall', 'overall survival', 'dose', 'phase ii', 'ii', 'line'] |
6 | 465 | ['prostate', 'prostate cancer', 'urinary', 'epic', 'prostatectomy', 'sexual', 'men', 'radical prostatectomy', 'radical', 'months', 'function', 'localized', 'expanded prostate', 'cancer index', 'index composite', 'expanded', 'therapy', 'localized prostate', 'bowel', 'index'] |
7 | 2023 | ['clinical', 'patient', 'care', 'health', 'use', 'data', 'oncology', 'research', 'based', 'outcomes', 'practice', 'evaluation', 'assessment', 'breast', 'management', 'reported', 'used', 'trials', 'information', 'using'] |
8 | 718 | ['cost', 'cost effectiveness', 'effectiveness', 'qaly', 'costs', 'cost effective', 'incremental', 'per', 'quality adjusted', 'model', 'adjusted life', 'adjusted', 'incremental cost', 'effective', 'qalys', 'gained', 'life years', 'sensitivity', 'analysis', 'year'] |
9 | 539 | ['care', 'palliative', 'palliative care', 'caregivers', 'needs', 'family', 'patient', 'caregiver', 'end life', 'advanced', 'cancer patients', 'end', 'symptom', 'advanced cancer', 'support', 'assessment', 'home', 'intervention', 'hospice', 'burden'] |
10 | 1309 | ['therapy', 'pain', 'patient', 'chemotherapy', 'group', 'treated', 'significantly', 'significant', 'scores', 'qol', 'effects', 'baseline', 'clinical', 'months', 'dose', 'mean', 'feasibility', 'score', 'cancer patients', 'compared'] |
Semi-supervised machine learning also uses clustering algorithms. UNC refers to this methodology as supervised clustering. The method is considered semi-supervised because a portion of the studies sent to clustering have been manually reviewed for relevance. These manually reviewed studies, also called seeds, are typically known relevant studies. The seeds are clustered along with unlabeled data and can provide another signal (beyond the keywords generated by the algorithm) as to which clusters are likely to contain studies of interest.
**Input:** Titles and abstracts for all unlabeled citations and seed studies.
**Output:** Clusters of citations with keyword summaries and the proportion of seeds for each cluster.
In the image above, each shape represents one article. Several relevant seed studies, identified by screening a random subset, are input into the system along with the remaining unlabeled articles, which are clustered based on their similarity to the seeds.
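A minimal sketch of this idea, again assuming scikit-learn: seeds and unlabeled citations are clustered together, and the share of seeds falling in each cluster is reported as the extra relevance signal. All strings and settings are illustrative.

```python
# Supervised (semi-supervised) clustering sketch: seeds are clustered
# alongside unlabeled citations. Data below are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

seeds = [  # known relevant studies (titles + abstracts)
    "Validity and reliability of a patient reported outcome measure",
    "Psychometric evaluation of a survivorship questionnaire",
]
unlabeled = [  # the rest of the search results
    "Postoperative complications of laparoscopic surgery",
    "Cost effectiveness analysis of adjuvant chemotherapy",
]

docs = seeds + unlabeled
is_seed = np.array([True] * len(seeds) + [False] * len(unlabeled))

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
k = min(10, len(docs))  # capped so the toy corpus still runs
labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)

# Report where the seeds concentrate: clusters holding a large share of
# the seeds are the most promising places to look for relevant studies.
n_seeds = is_seed.sum()
for c in range(k):
    in_cluster = labels == c
    seed_share = 100 * is_seed[in_cluster].sum() / n_seeds
    print(f"Cluster {c + 1}: {in_cluster.sum()} citations, "
          f"{seed_share:.0f}% of seeds")
```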
Cluster | Total Citations (n = 9578) | Proportion of Seeds (%) | Keywords |
---|---|---|---|
1 | 1153 | 0 | ['surgery', 'surgical', 'complications', 'resection', 'postoperative', 'underwent', 'procedure', 'tumor', 'performed', 'survival', 'reconstruction', 'months', 'follow', 'patients underwent', 'patient', 'group', 'mean', 'term', 'cases', 'laparoscopic'] |
2 | 1262 | 42 | ['screening', 'self', 'health', 'reported', 'among', 'women', 'associated', 'self reported', 'risk', 'factors', 'survivors', 'age', 'related', 'survey', 'participants', 'breast', 'population', 'data', 'years', 'higher'] |
3 | 661 | 8 | ['validity', 'reliability', 'psychometric', 'version', 'internal', 'internal consistency', 'consistency', 'items', 'scale', 'questionnaire', 'item', 'cronbach', 'instrument', 'properties', 'psychometric properties', 'qlq', 'construct', 'factor', 'eortc', 'alpha'] |
4 | 856 | 0 | ['intervention', 'survivors', 'exercise', 'program', 'cancer survivors', 'participants', 'feasibility', 'physical', 'breast', 'breast cancer', 'group', 'pilot', 'week', 'fatigue', 'physical activity', 'post', 'pilot study', 'self', 'activity', 'baseline'] |
5 | 592 | 2 | ['survival', 'chemotherapy', 'toxicity', 'response', 'median', 'advanced', 'progression', 'cell', 'phase', 'months', 'therapy', 'mg', 'grade', 'disease', 'overall', 'overall survival', 'dose', 'phase ii', 'ii', 'line'] |
6 | 465 | 1 | ['prostate', 'prostate cancer', 'urinary', 'epic', 'prostatectomy', 'sexual', 'men', 'radical prostatectomy', 'radical', 'months', 'function', 'localized', 'expanded prostate', 'cancer index', 'index composite', 'expanded', 'therapy', 'localized prostate', 'bowel', 'index'] |
7 | 2023 | 12 | ['clinical', 'patient', 'care', 'health', 'use', 'data', 'oncology', 'research', 'based', 'outcomes', 'practice', 'evaluation', 'assessment', 'breast', 'management', 'reported', 'used', 'trials', 'information', 'using'] |
8 | 718 | 0 | ['cost', 'cost effectiveness', 'effectiveness', 'qaly', 'costs', 'cost effective', 'incremental', 'per', 'quality adjusted', 'model', 'adjusted life', 'adjusted', 'incremental cost', 'effective', 'qalys', 'gained', 'life years', 'sensitivity', 'analysis', 'year'] |
9 | 539 | 9 | ['care', 'palliative', 'palliative care', 'caregivers', 'needs', 'family', 'patient', 'caregiver', 'end life', 'advanced', 'cancer patients', 'end', 'symptom', 'advanced cancer', 'support', 'assessment', 'home', 'intervention', 'hospice', 'burden'] |
10 | 1309 | 26 | ['therapy', 'pain', 'patient', 'chemotherapy', 'group', 'treated', 'significantly', 'significant', 'scores', 'qol', 'effects', 'baseline', 'clinical', 'months', 'dose', 'mean', 'feasibility', 'score', 'cancer patients', 'compared'] |
At UNC HSL, librarians are trained in using supervised clustering with an ensemble approach. The ensemble approach uses six models and can be thought of as a voting method: each study receives an ensemble score from 0 to 6 indicating how many models "voted" that the study is likely to be relevant. The ensemble score lets us sort studies into batches for manual screening. Studies with an ensemble score of 6 were voted relevant by all six models, are therefore the most likely to be relevant, and should be screened first.
Supervised clustering with an ensemble approach is typically our starting point when prioritizing studies for screening. It is preferred over supervised machine learning as a starting place because it requires less training data: as few as 25 seed studies can be used, assuming they are identified at random from the full corpus. Machine learning requires a more robust training dataset, and we typically recommend several hundred positive (relevant) studies.
The image above shows an example of an ensemble approach with 30,000 total references run through six different models. As fewer models predict a reference to be relevant, the likelihood that the reference is relevant decreases. For instance, 18,307 of the references were not predicted to be relevant by any of the models and therefore likely do not need to be screened manually.
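The sketch below shows the voting idea under stated assumptions: six off-the-shelf scikit-learn classifiers (chosen for illustration; this guide does not specify which six models HSL uses) each predict relevance, and the votes are summed into an ensemble score of 0 to 6.

```python
# Ensemble voting sketch: six classifiers each cast a relevant (1) /
# not relevant (0) vote per citation; votes sum to an ensemble score.
# Model choices and all data are illustrative, not HSL's exact pipeline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

train_texts = [  # hypothetical screened titles/abstracts
    "quality of life instrument validation in cancer survivors",
    "patient reported outcomes questionnaire psychometrics",
    "cost effectiveness of a new chemotherapy regimen",
    "surgical complications after laparoscopic resection",
    "palliative care needs of family caregivers",
    "screening survey of breast cancer survivors",
]
train_labels = [1, 1, 0, 0, 0, 1]  # 1 = relevant, 0 = not relevant
unscreened_texts = [
    "validity and reliability of a survivorship scale",
    "postoperative outcomes of radical prostatectomy",
]

vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(train_texts)
X_new = vec.transform(unscreened_texts)

models = [LogisticRegression(max_iter=1000), MultinomialNB(), LinearSVC(),
          RandomForestClassifier(random_state=0),
          DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(n_neighbors=3)]

# Each model "votes"; the ensemble score counts the relevant votes.
votes = np.zeros(len(unscreened_texts), dtype=int)
for m in models:
    votes += m.fit(X_train, train_labels).predict(X_new)

# Batch for screening: score-6 studies first, then 5, and so on.
for score in range(6, -1, -1):
    print(f"Ensemble score {score}: {(votes == score).sum()} citations")
```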
With supervised machine learning, also generally referred to simply as machine learning, a set of citations is selected at random from the comprehensive systematic review search. A human labels each of these citations as relevant or not relevant, and the labeled citations are input into a machine learning algorithm to develop a model that can predict whether an article is relevant to the research project. All other citations from the search are then run through the model, which predicts whether each article is likely to be relevant.
**Input:** A randomly selected training dataset labeled by humans, plus the rest of the unlabeled search results.
**Output:** Predictions of whether each of the search results is likely to be relevant.
In the image above, each shape represents one article. Articles randomly selected from the search results are marked as relevant or not relevant; these training data are input along with the rest of the search results, and the algorithm predicts which of the unlabeled references are likely to be relevant.
The figure above shows the order of screening based on probability scores derived from machine learning. Black dots indicate studies identified as relevant in manual screening. Stage 1 contains the studies most likely to be relevant based on probability score and recommended for manual review. Stage 2 is the "insurance step": the next 500 studies most likely to be relevant after those recommended for review by the model. Stage 3 contains studies with a low probability of being relevant that are not recommended for manual screening; however, the research team elected to screen them. The black dot in stage 3 was a study labeled "interesting" by the research team but not necessarily relevant to the project.
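A minimal sketch of this ranking step, assuming a TF-IDF representation and a logistic regression classifier (both illustrative choices; this guide does not name the model used): the classifier is fit on the randomly drawn labeled sample, and the remaining citations are sorted by predicted probability of relevance.

```python
# Supervised machine learning sketch: train on a random, human-labeled
# sample, then rank the remaining citations by probability of relevance.
# All data and the model choice are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = [  # (title/abstract, label): 1 = relevant, 0 = not relevant
    ("exercise intervention feasibility for cancer survivors", 1),
    ("cost per QALY of a screening programme", 0),
    ("patient reported outcome measure validation study", 1),
    ("operative technique for laparoscopic tumor resection", 0),
]
texts, labels = zip(*labeled)
rest = [  # the unscreened remainder of the search results
    "pilot study of a physical activity program for survivors",
    "incremental cost effectiveness of adjuvant therapy",
]

vec = TfidfVectorizer(stop_words="english")
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# Higher probability = screen earlier (stage 1 in the figure described above).
probs = clf.predict_proba(vec.transform(rest))[:, 1]
for text, p in sorted(zip(rest, probs), key=lambda pair: -pair[1]):
    print(f"{p:.2f}  {text}")
```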
Active machine learning is a type of predictive AI, but it differs from the techniques described above. Active machine learning uses a training dataset of labeled titles and abstracts (known relevant and not relevant) to teach the system which articles are more relevant to the project. However, in contrast to supervised machine learning, the training dataset is developed from a non-random sample. Additionally, the training data are iterative: the study prioritization continues to be updated as humans go through the screening process.
**Input:** A training dataset of human-labeled citations and a set of unlabeled citations.
**Output:** Predictions of whether the unlabeled documents are relevant. These predictions continue to adjust as more screening is completed by humans.
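The loop below sketches this iterative behavior under the same illustrative assumptions as the earlier examples (TF-IDF features, logistic regression, and a hypothetical human_review step); real tools such as Covidence implement their own versions of this cycle.

```python
# Active machine learning sketch: retrain and re-rank after each batch of
# human screening decisions. Helper names and batch size are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_unscreened(screened_texts, screened_labels, unscreened_texts):
    """Return unscreened citations sorted most-likely-relevant first."""
    vec = TfidfVectorizer(stop_words="english")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(screened_texts), screened_labels)
    probs = clf.predict_proba(vec.transform(unscreened_texts))[:, 1]
    ranked = sorted(zip(probs, unscreened_texts), reverse=True)
    return [text for _, text in ranked]

# The screening loop (pseudocode): human_review is a hypothetical
# stand-in for a person making an include/exclude decision.
#
# while unscreened:
#     queue = rank_unscreened(screened_texts, screened_labels, unscreened)
#     for citation in queue[:50]:              # screen the top batch
#         screened_texts.append(citation)
#         screened_labels.append(human_review(citation))  # 1 or 0
#         unscreened.remove(citation)
#     # the model is refit on the next call, so the ordering updates
```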
Covidence is a tool for systematic review screening, data extraction, and quality assessment. While screening, users have the option to sort references by "Most Relevant". When this option is chosen, Covidence orders citations by how likely they are to be relevant to your project, based on the citations you have already screened (both included and excluded). At least 25 studies must be marked as included or excluded, with at least two of each. As users continue to screen citations, the active machine learning algorithm continues to learn and adjust, and the list order continues to update.