Prospective Evaluation of Adverse Event Recognition Systems in Twitter: Results from the Web-RADR Project

Lucie M Gattepaille; Sara Hedfors Vidlin; Tomas Bergvall; Carrie E Pierce; Johan Ellenius

doi:10.1007/s40264-020-00942-3

Drug Saf. 2020 Aug;43(8):797-808. doi: 10.1007/s40264-020-00942-3.

Authors

Lucie M Gattepaille¹, Sara Hedfors Vidlin², Tomas Bergvall², Carrie E Pierce², Johan Ellenius²

Affiliations

¹ Uppsala Monitoring Centre, Box 1051, 75140, Uppsala, Sweden. lucie.gattepaille@who-umc.org.
² Uppsala Monitoring Centre, Box 1051, 75140, Uppsala, Sweden.

Abstract

Introduction: A large number of studies on systems to detect and sometimes normalize adverse events (AEs) in social media have been published, but evidence of their practical utility is scarce. This raises the question of the transferability of such systems to new settings.

Objectives: The aims of this study were to develop an AE recognition system, prospectively evaluate its performance on an external benchmark dataset and identify potential factors influencing the transferability of AE recognition systems.

Methods: A pipeline based on dictionary lookups and logistic regression classifiers was developed using a proprietary dataset of 196,533 Tweets manually annotated for AE relations. The system was then prospectively evaluated on the publicly available WEB-RADR reference dataset, exploring different aspects affecting transferability.

Results: Our system achieved 0.53 precision, 0.52 recall and 0.52 F1-score on the development test set; however, when applied to the WEB-RADR reference dataset, system performance dropped to 0.38 precision, 0.20 recall and 0.26 F1-score. Similarly, a previously published method for automatically detecting adverse event posts reported 0.5 precision, 0.92 recall and 0.65 F1-score on another dataset, while its performance on the WEB-RADR reference dataset was reduced to 0.37 precision, 0.63 recall and 0.46 F1-score. We identified four potential factors leading to poor transferability: overfitting, selection bias, label bias and prevalence.
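As a reading aid, the reported F1-scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below is not the authors' code. It recomputes F1 from the rounded figures in the abstract, so a difference of 0.01 from a reported value is expected. The second function illustrates, using the standard relation between precision, recall, specificity and prevalence, why a lower AE prevalence in a new dataset can reduce precision even when the classifier is unchanged. The specificity and prevalence values passed to it are hypothetical and do not come from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_at_prevalence(recall: float, specificity: float, prevalence: float) -> float:
    """Expected precision for a classifier with fixed recall and specificity
    when the fraction of true positives in the data is `prevalence`."""
    true_pos = recall * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


# (precision, recall, reported F1) as stated in the abstract.
reported = {
    "this system, development test set": (0.53, 0.52, 0.52),
    "this system, WEB-RADR reference": (0.38, 0.20, 0.26),
    "published method, other dataset": (0.50, 0.92, 0.65),
    "published method, WEB-RADR reference": (0.37, 0.63, 0.46),
}

for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1_reported}")

# Hypothetical values: same recall and specificity, two different prevalences.
for prevalence in (0.10, 0.01):
    ppv = precision_at_prevalence(recall=0.5, specificity=0.99, prevalence=prevalence)
    print(f"prevalence = {prevalence:.2f}: expected precision = {ppv:.3f}")
```

In this illustration, precision falls as prevalence falls while recall stays fixed, which is one way a system can appear to lose precision on an independent dataset without any change to the model itself.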

Conclusion: We warn the community about a potentially large discrepancy between the expected performance of automated AE recognition systems based on published results and the actual observed performance on independent data. This study highlights the difficulty of implementing an all-purpose system for automatic adverse event recognition in Twitter, which could explain the lack of such systems in practical pharmacovigilance settings. Our recommendation is to use independent benchmark datasets, such as the WEB-RADR reference, to investigate the transferability of adverse event recognition systems and ultimately enforce rigorous comparisons across studies on the task.

Publication types

Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

Adverse Drug Reaction Reporting Systems / standards*
Databases, Factual
Drug-Related Side Effects and Adverse Reactions / classification
Drug-Related Side Effects and Adverse Reactions / epidemiology*
Humans
Logistic Models
Pharmacovigilance
Prevalence
Prospective Studies
Reproducibility of Results
Selection Bias
Social Media*

Grants and funding

115632/Innovative Medicines Initiative/International