The paper jointly published in Nature Medicine and the BMJ by Baptiste Vasey and the DECIDE-AI expert group is an important step for the scientific study of Artificial Intelligence (AI) in healthcare. The Delphi consensus group of experts that Baptiste led was tackling an important problem. Whilst many “in silico” studies have made great claims for the accuracy of AI systems, particularly in recognising patterns for diagnostic purposes, few studies have examined how these systems perform under live conditions in a real clinical setting. Given some of the setbacks of earlier AI developments in fields such as face recognition, it is entirely reasonable to ask for proof that systems can perform adequately outside the perfectly regular, artificial settings of in silico studies. An added complication is that, at least for the present, no responsible healthcare institution is likely to sanction the use of stand-alone AI systems in patient care. The head-to-head comparisons of expert clinicians against stand-alone AI that are popular in silico are therefore irrelevant to clinical research now. For the foreseeable future, there will always be a human clinician making the final decision, supported by the computer. This means that the clinician’s level of trust in the system becomes an important determinant of its accuracy. It may take time for clinicians to get used to the system and to reach a conclusion about when to trust it, and in the meantime the overall performance of the human/machine team will inevitably fluctuate.
At least two other factors may interfere with early clinical studies of AI. First, imperfections in the system are quite likely to become evident during the study of the first few cases. A particularly rich source of such imperfections is the human-computer interface, which is already well understood as a vital factor in the usability of computer support tools and is much studied by Human Factors scientists. Second, there is likely to be a learning curve for clinicians using AI systems, as there is for other complex tools, and learning curves are problematic for comparative effectiveness studies. Unless the clinical teams have all reached their plateau performance, any comparison will be inherently biased against the novel intervention, because its performance is not yet optimised.
These issues are, of course, very similar to those whose malign effects on surgical research led to the development of the IDEAL Framework. Baptiste’s flash of insight was the recognition that early clinical research in AI looked a lot like early clinical research in surgery, with an anarchic mix of study designs and approaches, many of them yielding very low-quality evidence, and that it did so because the underlying problems were the same. The DECIDE-AI guidelines are stage-specific because it is clear that, as in surgery, study designs need to be adapted to situations where the intervention is developing during the study, or where the quality of delivery of the intervention varies greatly between clinicians and gradually improves with experience. DECIDE-AI, however, is more than simply IDEAL applied to AI. The emphasis on Human Factors and the AI interface, and the interesting issue of the “trust curve” for users, set it apart and pose questions that will require considerable further research. Like IDEAL, however, DECIDE-AI has the potential to offer researchers a bridge across the current chasm between in silico studies and the sparsely populated land of AI randomised trials, helping them to develop rational study designs and reporting methods, much as the IDEAL Stage 2a and 2b Recommendations do in surgery. The Collaboration salutes Baptiste’s great achievement and hopes to work with him on the future development of DECIDE-AI and the evaluation of how it works in practice.