This article attempts to summarize existing research on the accuracy of diagnosis apps (usually described as 'symptom checkers', or SCs). Given the novelty of such apps, the literature is neither large nor well focused. And while it is at least all relatively recent, given the speed of app development it is not clear that it is recent enough.
There are, of course, vested interests in many studies that address diagnostic reliability. Companies that produce the apps obviously wish to prove their benefit, while physicians (who may fear replacement) or patients who distrust the apps (and so discount their diagnoses) may wish to prove the opposite. This further limits the usefulness of any findings.
In judging app reliability, there are three main considerations. First, how does the app perform compared to a professional human? Second, how does one weigh the costs of the two types of errors, false positives and false negatives? A false positive carries two clear potential costs: needless patient anxiety (if there is no actual disease) or, if there is disease, a delay in diagnosing the true cause once the wrong diagnosis has been settled on. The cost of a false negative is obvious. The third consideration is maintaining data security and doctor/patient privacy.
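To see why the balance between these error types matters, here is a minimal sketch applying Bayes' theorem to a hypothetical symptom checker. The prevalence, sensitivity, and specificity below are assumed illustrative values, not figures from any study discussed in this article; the point is only that when a condition is rare, even a reasonably specific app produces mostly false positives.

```python
# Illustrative only: Bayes' theorem for a hypothetical symptom checker.
# Prevalence, sensitivity, and specificity are assumed numbers, not taken
# from any of the studies discussed in this article.

prevalence = 0.01    # 1% of users actually have the disease
sensitivity = 0.90   # P(app flags disease | disease present)
specificity = 0.95   # P(app gives all-clear | disease absent)

p_flag = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_flag                # P(disease | app flags)
false_positive_share = 1 - ppv                         # flags with no real disease
missed_per_user = (1 - sensitivity) * prevalence       # false negatives per user screened

print(f"P(app flags a user)            = {p_flag:.3f}")
print(f"P(disease | app flags)         = {ppv:.3f}")
print(f"Share of flags that are false  = {false_positive_share:.3f}")
print(f"Missed cases per user screened = {missed_per_user:.4f}")
```

With these assumed numbers, roughly 85% of the app's flags would be false positives, which is why the anxiety and misdirection costs above cannot be dismissed even for an app with seemingly good headline accuracy.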
The problem is that there is a paucity of reliable evidence regarding the accuracy of medical apps. The gold standard of medical evidence is the randomized controlled trial (RCT). Even for such trials, it was deemed necessary to develop reporting standards.
CONSORT-EHEALTH (Consolidated Standards of Reporting Trials of Electronic and mHealth Apps and Online Telehealth, https://www.jmir.org/2011/4/e126/) is a mandatory standard for hundreds of journals. CONSORT was first published in 1996 (and CONSORT-EHEALTH in 2011). However, many trials are not RCTs, and this seems to be almost universally true of medical app quality research (some exceptions are noted below when discussing the last meta-analysis). For such research, the problems of p-hacking and the "replication crisis" are now well documented. As a further warning, I offer the following recent abstract from economics:
https://www.nber.org/papers/w31666#fromrss
Abstract:
"When economists analyze a well-conducted RCT or natural experiment and find a statistically significant effect, they conclude the null of no effect is unlikely to be true. But how frequently is this conclusion warranted? The answer depends on the proportion of tested nulls that are true and the power of the tests. I model the distribution of t-statistics in leading economics journals. Using my preferred model, 65% of narrowly rejected null hypotheses and 41% of all rejected null hypotheses with |t|<10 are likely to be false rejections. For the null to have only a .05 probability of being true requires a t of 5.48."
With the disclaimers over, I now describe the results of several meta-analyses and reviews that address (among other things) medical app accuracy.
Buechi et al. (2017, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5735404/) analyze the analyzers (i.e., perform a meta-analysis). They first searched medical journal databases for articles about studies that
A) occurred in a clinical setting and
B) investigated a health app that used inbuilt sensors of a smartphone for diagnosis of an illness.
The illnesses covered were mostly melanoma, Parkinson's disease, atrial fibrillation, and stroke.
Of almost 3,300 articles originally found, only 30 were judged to report actual data. They then chose 11 for further analysis and concluded that all had the potential for significant bias. In particular, all suffered from potential selection bias and sample sizes too small to measure app accuracy (in fact, most did not even report an accuracy measure).
Millenson et al. (2018, https://www.degruyter.com/document/doi/10.1515/dx-2018-0009/html?lang=en) do another meta-analysis, but this time the focus is not on the studies' quality but on their results. They find a mixed bag, with some apps doing poorly and some well. In the latter category, however, they note the potential conflict of interest in the results of a study initiated by an app's sponsor: it claimed that Babylon Check produced accurate triage advice in 88.2% of cases (using hypothetical cases, not patients) vs. 75.5% for doctors and 73.5% for nurses [50]. Finally, they still found only about 35 studies that met their inclusion criteria.
More recently, You, Ma, and Gui (2022, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148318/) still find only 31 articles to include. They use these studies to assess apps on eight dimensions, mostly related to user experience, one of which is accuracy. The four studies that reported demographic information found that 54.80% of users in China were male, compared to 24.3% in the U.S., and that users tended to be young, with low health literacy but high technology literacy.
Eleven papers analyzed accuracy. Of these, three found apps to be relatively accurate (compared to physicians) in diagnosing knee pain, acute primary care conditions, and influenza-like illness. However, the other eight papers found frequent inaccuracy in diagnoses of inflammatory joint diseases, hand diseases, inflammatory rheumatic diseases, knee pain, and general diseases.
Grundy (2022, https://www.annualreviews.org/doi/full/10.1146/annurev-publhealth-052020-103738) has a much more pessimistic take on accuracy. This review focuses more than the others on the methodology (or lack thereof) used by the apps and also discusses issues with RCTs. But the overall zeitgeist is similar: one cannot find much evidence that medical apps work well, because of study bias and small sample sizes.
Firth et al. (2017, https://pubmed.ncbi.nlm.nih.gov/28456072/) perform a meta-analysis of studies that focus on a single medical issue: anxiety. There were nine RCTs in their study, with 1,837 participants. They found that the apps led to significant reductions in anxiety compared to the control groups and saw no evidence of publication bias. However, there were no comparisons with standard medical treatments.