To predict how useful AI may prove in healthcare, it helps to first understand the types of problems where it may be helpful. There are a number of ways in which scientists and engineers can classify problems, but the four most important for this discussion are:
1) whether we have a good understanding of the problem (i.e., can we model it mathematically),
2) how many variables there are,
3) whether there is an underlying structure to the data, and
4) how much data we have on a given problem and its quality.
Machine learning (ML) is typically applied to problems that are not well modeled, have a large number of variables, and come with a large amount of quality data to process. However, and this is crucial, there must be some underlying (but presumably not yet discerned) structure to the problem. Otherwise, there is no hope.
Machine learning overlaps with artificial intelligence (AI). Currently, they are almost synonyms, but AI also includes more human-involved (and currently out of fashion) systems such as expert systems. Importantly, neural networks (NNs) have become a pre-eminent tool for both fields.
NNs are inspired by the computational ability of neurons and synapses in the brain, but resemble actual brains only very superficially. The idea in common is that 'neurons' connect to each other via 'synapses'. The neurons are typically arranged in layers (in contrast to real brains) which fire sequentially (again, in contrast to real brains). The first layer takes in the given input data and the final layer provides the output (i.e., the answer).
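To make the layers-of-neurons picture concrete, here is a minimal sketch in Python/NumPy of data flowing through such a network. The layer sizes and the random, untrained weights are purely illustrative; a real network would have its weights set by training.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # a common 'firing' rule for artificial neurons

# Layer sizes: 4 inputs -> 8 hidden neurons -> 3 outputs (all sizes arbitrary).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # 'synapse' weights, first layer
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # 'synapse' weights, output layer

def forward(x):
    h = relu(W1 @ x + b1)                 # hidden layer fires on the input
    logits = W2 @ h + b2                  # output layer combines hidden activity
    return np.exp(logits) / np.exp(logits).sum()   # turn scores into probabilities

x = np.array([1.0, 0.0, 1.0, 0.0])   # a toy input vector
print(forward(x))                     # three output 'probabilities'
```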
There are three main reasons neural networks have become dominant in both AI and ML.
1) As first proven in the 1940's, the class of input/output functions that neural networks can represent is universal. This means that given any desired input/output function and a tolerance level, one can construct a neural network that reproduces that function with error less than the required tolerance.
2) As found in the 1980's, efficient algorithms exist to train the neural network. This means that the network is presented with a sample of input/output pairs and learns to generalize (i.e., discovers something close to the true input/output function). However, in general, an enormous number of such samples may be needed. (A small worked sketch of points 1 and 2 follows this list.)
3) Beginning around 2005---2010, computing power became sufficient that
a) Huge training sets could be used
b) So-called "deep" neural networks could be trained. The 'deep' refers to the fact that these networks have many layers between input and output.
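Here is the promised sketch of points 1 and 2, in Python with scikit-learn. The target function, network size, and sample size are arbitrary choices, and scikit-learn's off-the-shelf trainer stands in for the training algorithms mentioned above: a small network is shown input/output pairs from a known function and then checked on inputs it never saw.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# The true input/output function we would like the network to reproduce.
def f(x):
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, size=(2000, 1))   # a large sample of inputs...
y_train = f(X_train).ravel()                   # ...and their outputs

# A modest two-hidden-layer network trained by gradient descent (backprop).
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
net.fit(X_train, y_train)

# Check generalization on inputs the network never saw during training.
X_test = np.linspace(-2, 2, 200).reshape(-1, 1)
err = np.max(np.abs(net.predict(X_test) - f(X_test).ravel()))
print(f"worst-case error on new inputs: {err:.3f}")
```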
Of course, just because an NN is universal does not mean it will give you something useful. If the input/output pairs are just random numbers, then the neural network will predict nothing of value because there is nothing of value to predict. Thus, as stated earlier, there must be some underlying structure to the data.
For example, consider the problem of diagnosing a disease from a set of symptoms. There are thousands of possible symptoms and thousands of diseases. This is especially so when one considers that a rash on your foot and on your cheek are quite different, e.g., with respect to the likelihood of having shingles vs. an insect bite. However, diseases and symptoms are usually highly clustered wherein a given disease presents only a relatively small number of symptoms. Moreover, while symptom sets for two diseases can overlap, there are usually some symptoms or tests (which can be viewed as symptoms) that can distinguish the true cause.
Similarly, ChatGPT basically takes a given input (your query) and 'auto-completes' an answer based on the properties of the many terabytes of text it was trained on. These properties involve correlations that can span large distances between words, so the inherent underlying variable dimension is quite large. Nonetheless, the data is highly structured, both by the rules of grammar and by the inherently uni-directional, one-dimensional nature of text. A recent study found that ChatGPT 4 performed better than first- and second-year medical students at diagnosis (https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2806980). It would also pass a multiple-choice medical licensing test.
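For intuition only, here is a deliberately crude 'auto-complete' sketch in Python: it counts which word follows which in an invented snippet of text and extends a prompt one word at a time. A real language model learns far longer-range correlations with billions of parameters, so this is an analogy, not a description of ChatGPT's internals.

```python
import random
from collections import Counter, defaultdict

# Toy training text (invented for this sketch).
training_text = (
    "the patient reports a rash on the foot and a fever "
    "the patient reports a cough and a fever the doctor orders a test"
)

# Count which word tends to follow which word in the training text.
follows = defaultdict(Counter)
words = training_text.split()
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

def autocomplete(prompt, n_words=6, seed=0):
    """Extend the prompt one word at a time, sampling in proportion to the counts."""
    random.seed(seed)
    out = prompt.split()
    for _ in range(n_words):
        counts = follows.get(out[-1])
        if not counts:            # no known continuation for the last word
            break
        choices, weights = zip(*counts.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

print(autocomplete("the patient"))
```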
Image recognition also exploits the highly structured data of images. While even a small image (128 by 128 pixels) requires a variable dimension in the tens of thousands to encode, pictures of real scenes (as opposed to random pixels) obviously have an underlying two-dimensional structure. This allows convolutional networks to provide human-level recognition (which may sound lame, until you realize how many neurons are used to achieve human vision).
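A minimal sketch of why convolution suits this two-dimensional structure (the synthetic image and the hand-picked edge-detecting filter below are invented for illustration): one small set of weights is slid over every location, so the model only ever looks at local neighborhoods instead of treating the tens of thousands of pixel values as unrelated variables.

```python
import numpy as np

# A 128x128 'image': all dark except a bright square in the middle.
img = np.zeros((128, 128))
img[40:80, 40:80] = 1.0

# A 3x3 edge-detecting filter; the same 9 weights are reused at every location.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)

def conv2d(image, k):
    """Slide the filter over the image and record its response at each location."""
    H, W = image.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

edges = conv2d(img, kernel)
print(edges.shape)                      # one response per image location
print(int((np.abs(edges) > 0).sum()))   # non-zero only near the square's edges
```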
Google DeepMind's AlphaFold is a deep NN used to study protein folding. It differs from conversational AI in a number of respects. First, the training data was relatively small --- on the order of 200,000 proteins at first (c. 2018) and later nearly 400,000. The number of GPUs used was also relatively small (100-200) compared to, e.g., ChatGPT.
The initial effort involved some theory. Researchers postulated that correlated codons in the gene sequence would correspond to nearby amino acids in the 3-D space of the protein. So it not only exploited low-dimensional structures (1-D DNA and 3-D protein) but also used a type of model (pattern recognition on the DNA and physical chemistry in 3-D).
Because NNs are "black boxes" that produce answers without understanding or explanation, a crucial question is: how do we trust the answers they give? One of the most important issues in this regard is over-fitting.
Over-fitting is when you fit too many parameters relative to the data sample size. Doing so certainly increases the training goodness of fit, i.e., the ability to predict the data in the training set. However, past some point, adding parameters will produce worse accuracy in predicting new cases (i.e., data outside the training set). This is typically because almost all real-world data contains some random elements that are uncorrelated with the true input/output function, i.e., noise. After a certain point, the parameters have "understood" as much as they can about the true input/output function and any additional parameters simply get used "learning the noise". One of the most surprising aspects of neural networks is that they seemingly mock the admonition against over-fitting.
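A minimal demonstration of over-fitting, using simple polynomial fits rather than a neural network (the data, noise level, and degrees below are arbitrary choices): as parameters are added, the training error keeps shrinking while the error on fresh data eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# The true input/output function, plus noise that no model should try to learn.
def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 20)
y_train = true_f(x_train) + rng.normal(scale=0.2, size=20)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(scale=0.2, size=200)

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit degree+1 parameters
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")

# Training error keeps falling as parameters are added, but past some point the
# extra parameters 'learn the noise' and the error on new data goes back up.
```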
There are two common ways of preventing over-fitting. The most common is to explicitly restrict the number of model parameters. The second (gaining in popularity) is the use of 'regularization' methods, which essentially add a penalty for either the size of the parameters (i.e., their deviation from zero) or the number of non-zero parameters. So-called "L2" regularization is used in the first case (controlling parameter size) and "L1" regularization usually in the second (the number of non-zero parameters). The advantage of L1 regularization is that one can let the data throw out unneeded variables instead of the researcher having to pre-decide. This is most useful when the problem is poorly understood.
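A minimal sketch contrasting the two penalties, using scikit-learn's ridge (L2) and lasso (L1) linear models; the data, penalty strengths, and variable counts are invented for illustration. The setup is a problem where only 3 of 50 candidate variables actually matter.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 50 candidate variables, but only the first 3 actually affect the output.
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# L2 ('ridge') penalizes the size of every coefficient; none become exactly zero.
ridge = Ridge(alpha=1.0).fit(X, y)
# L1 ('lasso') pushes coefficients hard enough toward zero that unneeded ones vanish.
lasso = Lasso(alpha=0.1).fit(X, y)

print("L2: coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
print("L1: coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))
# The L1 fit lets the data 'throw out' most or all of the 47 irrelevant variables.
```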
Recent research suggests that the noise involved in neural network training acts similarly to an L2 regularizer. So the regularization from noise may explain why NNs are not really over-fit. But are they really not over-fit? There is some evidence that they are in some cases. For example, some image recognizers break if only a few pixels are switched. AlphaFold was trained on a database of proteins whose structure was determined by crystallization; it did not perform so well when applied to proteins that could not be crystallized (and whose structure was instead determined by NMR). Finally, the 'hallucinations' of ChatGPT, while rare, have been well documented.
Another problem with AI is that the availability of data determines the things that it can predict. Consider AlphaFold again. The data it was trained on was protein structure and so that was what it predicted. However, protein structure is just an intermediate step, not the end goal.
One important end goal is to predict ligand binding so that one can understand how the protein functions and what (e.g., a drug) will bind to it. In this regard, AlphaFold's performance was interesting (https://elifesciences.org/reviewed-preprints/89386v1): it predicted the shape of the ligand binding sites better than current standard models but did not predict binding affinity any better. Thus, physical experiments will still be necessary to reach the end goal.
So where does this leave us? NNs have demonstrated an ability to discover some patterns in data that humans cannot. Moreover, even when the patterns have been found, humans still have no idea what was actually found, and the NN can't explain itself either. Given the possibility that your great discovery is a hallucination, this is clearly an enormous issue in one-shot, time-critical situations (such as whether to blow up a car with an attack drone).
However, especially in medical research, the suggestions spit out by AI do not have to be immediately applied to patients. Instead, standard techniques can and will be used to determine safety and efficacy. And this is a good thing because, as I have discussed in an earlier post, cellular chemical pathways are extremely complicated and intertwined. As auto-immune diseases show, even small mistakes can have serious consequences. The failure of AlphaFold to predict ligand binding, despite apparently predicting the shape of the binding site well, should be a warning that understanding may be an illusion. And this is on a relatively simple, well-specified problem (which pathways are not).
Nonetheless, I believe NNs will have a significant role to play in drug discovery, and in synthetic biology in general. AI will introduce speed and efficiency without increasing the already well-recognized risks.
For patient diagnosis too, I suspect it will prove successful. Not because it will never lead to mistakes; it will. But just as with driver-less cars that crash, the question is "compared to what?". Doctors have less of a mitigating role than drug researchers in catching errors (because patients may not seek them out and may simply self-medicate with OTC supplements). But because the disease/symptom data is more structured, the risk of making mistakes in the first place is also lower.
All of this optimism rests on having quality data with which to train the NN. As discussed above, many NNs break down rapidly with even small changes in the data. In order to have a useful tool, either robustness must be increased (which seems unlikely to me in the short term) or data quality must be ensured.
In the case of drug discovery and synthetic biology, good quality protein databases exist. Right now, we may be in a positive-feedback exponential growth phase --- good databases lead to better AI performance, which in turn adds to the size and quality of the databases.
On the other hand, some other areas of healthcare may never have quality data.
For example, consider this abstract from a paper on the accuracy of medical records (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7284300/):
"In this survey study of 136 815 patients, 29 656 provided a response, and 1 in 5 patients who read a note reported finding a mistake and 40% perceived the mistake as serious. Among patient-reported very serious errors, the most common characterizations were mistakes in diagnoses, medical history, medications, physical examination, test results, notes on the wrong patient, and sidedness."
The Mayo Clinic has spent several years cleaning its patient records database (https://www.wsj.com/tech/ai/artificial-intelligence-medicine-doctors-diagnosis-bad736c6), further suggesting the magnitude of the problem.
Can we expect NNs to find patterns in such 'dirty' data? No. An 8% "serious" error rate (roughly the 20% of note-reading patients who found a mistake times the 40% who judged it serious) is probably the same order of magnitude as any pattern that would not already have been understood by humans.