EHA Library - The official digital education library of European Hematology Association (EHA)

A MACHINE LEARNING APPROACH TO IDENTIFICATION OF CLINICALLY RELEVANT HEMATOLOGY-ONCOLOGY RESEARCH PUBLICATIONS
Author(s):
Alexander Luchinin
Affiliations:
Kirov Scientific Research Institute of Hematology and Blood Transfusion of the Medical-Biological Agency, Kirov, Russian Federation
Oleg Kolupaev
Affiliations:
UNC Chapel Hill, Lineberger Comprehensive Cancer Center, Chapel Hill, United States
EHA Library. Luchinin A. 06/09/21; 324902; EP1181
Dr. Alexander Luchinin
Presentation during EHA2021: All e-poster presentations will be made available as of Friday, June 11, 2021 (09:00 CEST) and will be accessible for on-demand viewing until August 15, 2021 on the Virtual Congress platform.

Abstract: EP1181

Type: E-Poster Presentation

Session title: Quality of life, palliative care, ethics and health economics

Background
In the last 10 years, the amount of clinical research published in the field of oncology has grown dramatically due to the accelerated pace of drug development and the increased use of combination treatments. Concurrently, finding high-quality clinical research publications to develop evidence-based treatment plans for individual patients has become more challenging. Commonly used solutions rely primarily on bibliographic metadata and expert curation.

Aims
We describe a tool for fast, automatic identification of clinically relevant publications that uses neither the tags associated with a publication nor expert curation.

Methods
We used a machine learning approach trained on the titles of PubMed publications downloaded to the database through the OncoTriage.com service. The analysis used papers that predominantly describe clinical trials in hematological malignancies and clinical cases. The balanced training data comprised a “relevant” dataset of texts cited in expert-curated sources (i.e., high-quality publications describing treatment of hematologic malignancy) and an “irrelevant” dataset of texts that did not include data relevant to therapy. We used a Bayes approach with binary classification. Briefly, 26,667 texts were processed into a document-term matrix representation and split into a training set and a test set (80/20 split).
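
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a multinomial naive Bayes classifier with Laplace smoothing (the abstract says only "a Bayes approach"), a simple lowercase word tokenizer, and a handful of invented titles standing in for the 26,667 PubMed titles. All function names are hypothetical.

```python
import math
import random
import re

def tokenize(title):
    """Lowercase a title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())

def document_term_matrix(titles, vocabulary=None):
    """Return (matrix, vocabulary): one row of term counts per title."""
    token_lists = [tokenize(t) for t in titles]
    if vocabulary is None:
        vocabulary = sorted({tok for toks in token_lists for tok in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for toks in token_lists:
        row = [0] * len(vocabulary)
        for tok in toks:
            if tok in index:  # terms unseen in training are ignored
                row[index[tok]] += 1
        matrix.append(row)
    return matrix, vocabulary

def train_naive_bayes(matrix, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with Laplace smoothing (assumed variant)."""
    n_terms = len(matrix[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [row for row, y in zip(matrix, labels) if y == c]
        term_totals = [sum(col) for col in zip(*rows)]
        denom = sum(term_totals) + alpha * n_terms
        model[c] = {
            "log_prior": math.log(len(rows) / len(labels)),
            "log_likelihood": [math.log((n + alpha) / denom) for n in term_totals],
        }
    return model

def predict(model, row):
    """Return the class with the highest posterior log-score for one row."""
    def score(c):
        params = model[c]
        return params["log_prior"] + sum(
            count * ll for count, ll in zip(row, params["log_likelihood"]) if count
        )
    return max(model, key=score)

def split_80_20(items, seed=0):
    """Shuffle and split into 80% training and 20% test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(0.8 * len(items))
    return items[:cut], items[cut:]

# Invented toy titles; the real data were PubMed titles (not reproduced here).
titles = [
    "Phase 3 trial of ibrutinib in chronic lymphocytic leukemia",
    "Randomized trial of lenalidomide maintenance in multiple myeloma",
    "Venetoclax plus azacitidine in acute myeloid leukemia: a phase 3 trial",
    "Rituximab therapy in follicular lymphoma: randomized phase 2 trial",
    "Daratumumab treatment in relapsed myeloma: phase 3 trial",
    "Gene expression profiling of mouse hematopoietic stem cells",
    "Molecular mechanisms of erythropoiesis in zebrafish",
    "Epigenetic regulation of stem cell differentiation in mouse models",
    "Structural analysis of hemoglobin variants",
    "Signaling pathways in mouse bone marrow stem cells",
]
labels = ["relevant"] * 5 + ["irrelevant"] * 5

train, test = split_80_20(zip(titles, labels))
train_titles, train_labels = zip(*train)
X_train, vocab = document_term_matrix(train_titles)
model = train_naive_bayes(X_train, train_labels)

# The test set reuses the training vocabulary so columns line up.
X_test, _ = document_term_matrix([t for t, _ in test], vocab)
for (title, label), row in zip(test, X_test):
    print(label, "->", predict(model, row), "|", title)
```

Building the test matrix against the training vocabulary keeps held-out terms from leaking into the model, which is the usual reason to fit the vocabulary on the training portion only.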

Results
Our model for detecting “irrelevant” publications classified papers in the test dataset with an AUC of 0.859 (95% CI 0.853-0.865, p<0.0001), a sensitivity of 0.93 and a specificity of 0.72. The model was thus biased towards sensitivity. We speculate that the training dataset was skewed towards publications describing clinical trials; as a result, several clinically relevant categories of publications describing treatments were labeled as “irrelevant”. Expert examination of the false positives revealed that they included therapy reviews, single-center practices and observational studies that are nonetheless informative for clinical practice. We plan to address these drawbacks in future iterations of the model by incorporating supervised or reinforcement learning approaches. The interactive web app is available at https://luchinin.shinyapps.io/PubMed_Triage/
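
For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity and AUC can be computed when “irrelevant” is treated as the positive class, which is how the results above read (a false positive is then a relevant paper flagged as irrelevant). The labels, scores and the 0.5 threshold are invented for illustration, the function names are hypothetical, and the confidence interval and p-value from the abstract are not reproduced.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) for the given positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # e.g. a relevant paper flagged as irrelevant
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(y_true, scores, positive):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: scores are the model's estimated probability of "irrelevant".
y_true = ["irrelevant", "irrelevant", "irrelevant", "relevant", "relevant"]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]
y_pred = ["irrelevant" if s >= 0.5 else "relevant" for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred, positive="irrelevant")
auc = roc_auc(y_true, scores, positive="irrelevant")
print(round(sens, 3), round(spec, 3), round(auc, 3))  # prints 0.667 0.5 0.833
```

With this convention, high sensitivity paired with lower specificity means the filter rarely lets irrelevant papers through but discards some relevant ones, which matches the false positives described above.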

Conclusion
Machine learning is an effective approach for large-scale automatic identification of clinically relevant publications from a variety of sources, such as PubMed and conference abstracts. Machine learning techniques can serve as one of the filters used to build a database tailored for clinicians. In the future, tools using this database may help to minimize the time clinicians spend finding high-quality publications that fit their patient’s profile. In addition, the approach can be used as a clean-up step that produces a list of publications for further curation by subject matter experts. Future work will extend this approach, which may be integrated into decision support systems and knowledge management databases.

Keyword(s): Hematological malignancy, Therapy
