The successes of new AI technology have received a great deal of attention of late. And, of course, these include applications in healthcare. But this is not the first wave of hope that AI would finally contribute to improved care at lower cost, and the prior episodes ended without much to show.

You can be forgiven for thinking that the recent success of AI is just a matter of bigness: big numbers of big GPUs (graphics processing units) using big amounts of energy to learn on big data. But developments on the technical, modeling, and theoretical sides of AI have been just as important. This article discusses the importance of the latest breakthroughs, including GPUs, deep learning, and so-called "Transformers".

#### Technical Progress: GPUs

GPUs, as the name implies, were originally designed and used for fast graphics updating. The mathematics of such updating is, to a large extent, linear algebra, i.e., matrix and vector multiplication. Linear algebra is also a key part of neural network training and use, so the fit was natural.
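To make the connection concrete, here is a minimal sketch of why a neural network layer is "just" linear algebra: one matrix-vector product plus an elementwise nonlinearity. The sizes are illustrative, not from the article.

```python
import numpy as np

# A single neural-network layer is essentially one matrix-vector product
# plus a nonlinearity -- exactly the linear algebra GPUs were built for.
# The layer sizes here are illustrative assumptions.
rng = np.random.default_rng(0)

W = rng.standard_normal((4, 3))   # weights: 4 outputs, 3 inputs
b = rng.standard_normal(4)        # biases
x = rng.standard_normal(3)        # one input vector

pre_activation = W @ x + b                 # the linear-algebra core
output = np.maximum(pre_activation, 0.0)   # ReLU nonlinearity

print(output.shape)  # (4,)
```

In a real network this product is repeated for every layer and every example, which is why hardware that multiplies matrices in parallel pays off so directly.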

Most integrated circuit speeds have only doubled in the last 19 years. Had the earlier growth in speed held, CPUs would now be over 1,000 times faster rather than merely twice as fast. In contrast, the number of transistors on GPUs has continued to follow Moore's law of exponential increase, and some (but not all) neural network types are able to exploit that increase.

#### Technical Progress: Deep Learning

It was proven almost a century ago that an arbitrary 'target' function can be approximated by a relatively small number of functions in a restricted class, linked together (through function composition) in a certain way. These linkages can be described as a graph, or network, with nodes representing functions and directed edges signifying functional composition (i.e., the function represented by the head node takes as its arguments the outputs of the functions represented by the tail nodes of its incoming edges).

In the original theorem the approximating functions, though restricted, must be hand-crafted for each specific target function. If, instead, the approximating functions are preset, then arbitrary precision approximation is still possible, but the number of such functions needed is no longer necessarily small.

Very early on, researchers focused on network structures in which the nodes were arranged in layers, with the outputs of layer n serving as the inputs of layer n+1. Unfortunately, it was not known how to train networks with more than one layer (and even single-layer training had its difficulties). This mattered because, for single-layer networks, the number of nodes needed for good approximation grows exponentially with the input size. Worse, many believed that training multi-layered networks would prove impossible.

It was not until the mid-1980s that a way to train layered networks efficiently and correctly was invented. Even then, training networks with more than a few layers was, in practice, plagued with numerical issues. By around 2010, the most debilitating problems had been solved and the era of deep networks (i.e., networks with a large number of layers) began.
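The mid-1980s training method referred to above is backpropagation: errors at the output are propagated backward through the layers via the chain rule. A minimal sketch, with made-up layer sizes and learning rate, trained on XOR (a task no single-layer network can represent):

```python
import numpy as np

# A minimal two-layer network trained by backpropagation, fitting XOR.
# Hidden size, learning rate, and iteration count are illustrative.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)

W1 = rng.standard_normal((2, 8)); b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)); b2 = np.zeros(1)
lr = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    # forward pass: each layer's output feeds the next layer
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error layer by layer (chain rule)
    dp = (p - y) * p * (1 - p)          # gradient at the output layer
    dh = (dp @ W2.T) * h * (1 - h)      # gradient at the hidden layer
    W2 -= lr * h.T @ dp; b2 -= lr * dp.sum(0)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)

print(np.round(p.ravel(), 2))
```

The backward pass is the same mechanism whether the network has two layers or two hundred; the numerical issues the text mentions arise because these propagated gradients can shrink or blow up as they pass through many layers.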

#### Technical Progress: Modularity, Pretraining

In the early 20th century, efficiency expert Frederick Taylor broke jobs (now called "workflows") down into their component steps, with the goal of achieving greater efficiency and quality. While 'Taylorism' applied to human workers is a dirty word for some, this should not be an issue in AI. Many AI tasks can be broken into steps, with new uses building on prior learning. Consequently, part of current AI practice is figuring out how to mix, match, and integrate existing networks.

As an example, automated medical transcription could be built on top of an already existing LLM (large language model) with (relatively) few tweaks. Not only would learning such a model from scratch be inefficient, it would probably be impossible: LLMs are trained on hundreds of billions of examples, and no comparably large data set of medical records exists.
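The pattern of reusing prior learning can be sketched in miniature. Here a frozen "pretrained" feature extractor (just a fixed random nonlinear map, a stand-in for a real pretrained network, not an actual LLM) is combined with a small task-specific head fit on a modest dataset; the task and all sizes are made up for illustration:

```python
import numpy as np

# Sketch of building on an existing model: a frozen "pretrained" feature
# extractor (here a fixed random nonlinear map, standing in for a real
# pretrained network) plus a small task-specific head fit on new data.
rng = np.random.default_rng(1)

W_pre = rng.standard_normal((1, 50))   # frozen weights: never retrained
def features(x):
    """The reused 'pretrained' part of the pipeline."""
    return np.tanh(x[:, None] @ W_pre)

x = np.linspace(-2, 2, 40)             # small task-specific dataset
y = np.sin(2 * x)                      # the new task

Phi = features(x)
# Only the small head is trained (ordinary least squares):
head, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred = Phi @ head
print(float(np.sqrt(np.mean((pred - y) ** 2))))  # training RMSE
```

Only 50 head parameters are fit here; the frozen part does the heavy lifting, which is the economy the text describes: the expensive learning happened once, elsewhere.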

#### Theoretical Progress: Regularization and Overfitting

Perhaps one of the most important advances in AI was overcoming the fear of overfitting. Modern AI systems appear to be extremely overfit, yet they do not suffer the loss of generalization that overfitting normally implies. The question is why. AI researchers have a number of theories, and they revolve around the statistical idea of 'regularization'.

Consider the problem of identifying, through regression, the function 1 + x + sin(x) x^100 from 100 data points observed over x from 0 to 1.1. Assume you knew the function was of the form

a + b x + c x^n, with (a, b, c, n) as yet unknown and to be determined by the regression. There are many ways to try to solve this problem. One is nonlinear regression (because the parameter n appears nonlinearly). If you stick with linear regression, you can fit a, b, and c with n held fixed at some value, then repeat the process for different n. This would work, but if you went through n systematically from 2 to 100 (at which point you would recognize success), you would have to run 99 regressions. A third way is to do a single regression using the function a + b x + c_1 x^2 + c_2 x^3 + … + c_99 x^100 + … + c_199 x^200 (the extra terms are there to be sure the right term, x^100, is included). With noise-free data and enough numerical precision this would work. But otherwise the regression is overfit (201 parameters and only 100 data points) and numerically fraught[1].

A better way is to run this same regression with an added term that penalizes the sum of the absolute values of the parameters. Using such penalties is called regularization, and it allows regressions with 'too many' parameters to work.
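Here is a runnable sketch of the article's 201-parameter example made workable by a penalty term. One deliberate simplification: the text's penalty is on the sum of absolute values (an L1, or "lasso", penalty); this sketch uses the squared-value (L2, "ridge") variant because it has a simple closed form.

```python
import numpy as np

# The article's over-parameterized polynomial regression, tamed by a
# regularization penalty. The text penalizes absolute values (L1); this
# sketch uses the squared-value (L2, "ridge") variant, which reduces to
# solving one linear system -- a deliberate simplification.
x = np.linspace(0, 1.1, 100)
y = 1 + x + np.sin(x) * x**100

# design matrix: 1, x, x^2, ..., x^200  (201 columns, 100 data points)
X = np.vander(x, 201, increasing=True)
X /= np.linalg.norm(X, axis=0)   # scale columns for numerical stability

lam = 1e-6                        # weight of the penalty term
# minimize ||Xw - y||^2 + lam * ||w||^2  (closed-form ridge solution)
w = np.linalg.solve(X.T @ X + lam * np.eye(201), X.T @ y)

rmse = np.sqrt(np.mean((X @ w - y) ** 2))
print(rmse)   # training error of the regularized fit
```

Without the `lam * np.eye(201)` term this is the raw normal-equations solve of the overfit regression, which is exactly the numerically fraught computation the text warns about.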

It is now believed that a number of features of neural networks and their training act as de facto regularization terms. (And you can always add an explicit regularization term, as in the example above.) But while the 'implicit regularization' found in neural networks remains somewhat magic and lore, it is now SOP that bigger is better. Full steam ahead.

#### Theoretical Progress: Computer Science and algorithm design --- a personal view

Beginning in the 1960s and '70s, computer science adopted a focus on a few aspects of problem and algorithm difficulty (e.g., "NP-completeness") that partially choked off more practical avenues of research. Even before the current AI wave, this narrowness had relaxed, and computer scientists are now fully engaged on AI with powerful tools up to the task.

#### Theoretical Progress: Transformers

Transformers, as in ChatGPT (Chat Generative Pre-trained Transformer), are a form of pre-training manipulation that transforms the original data into a representation of equal size but, presumably, one better suited for training a neural network. Transformers can be trained on unlabeled data (i.e., pictures of cats without 'cat' labels), which is far more plentiful than carefully labeled data. And transformer training is well suited to exploiting the parallel processing provided by BIG quantities of GPUs.

Of course, data has been pre-processed for quite some time (e.g., windowing, filtering, taking moving averages, clipping, Fourier transforming, …), so what is different now?

The answer is called 'attention'. Attention is similar to the other pre-processing methods mentioned above and was inspired by the windowing the brain performs when 'seeing': we, meaning our brains, focus on, or pay attention to, only a very small fraction of what falls within our field of vision. Where the form of attention used in transformers differs is that what to pay attention to is not pre-set, as it is in our brains. Instead, attention is learned, and it is allowed to change from case to case.

To implement attention, LLM transformers divide the input into a "query" and the remaining context. For example, if the LLM is translating text from one language to another, the query would be the next word to be translated, and the context would be the surrounding words in the original text plus the preceding translated words. Both the query and the context words are encoded as keys, which already incorporate considerable knowledge of language structure. The first step in computing attention is to compute the similarity between each context key and the query's key; there are many possible choices of similarity measure, and different transformers make different choices. The second step is to assign a value to each context word; these values depend on what other words are present. Attention is then the weighted average of these values, with each context word's weight determined by its similarity score.
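The two steps above fit in a few lines. This sketch uses dot-product similarity (one common choice among the many the text alludes to), and the key and value vectors are random stand-ins for learned encodings:

```python
import numpy as np

# A minimal version of the two attention steps described above: score
# each context word against the query, turn scores into weights, and
# average the context values. Vectors are random stand-ins for learned
# encodings; the dimension and context length are illustrative.
rng = np.random.default_rng(0)
d = 8                                        # encoding dimension

query_key = rng.standard_normal(d)           # key for the query word
context_keys = rng.standard_normal((5, d))   # keys for 5 context words
context_values = rng.standard_normal((5, d)) # values for the same words

# Step 1: similarity between the query's key and each context key
# (scaled dot product, one common choice of similarity measure)
scores = context_keys @ query_key / np.sqrt(d)

# Step 2: softmax turns similarities into averaging weights ...
weights = np.exp(scores) / np.exp(scores).sum()

# ... and attention is the weighted average of the context values
attention = weights @ context_values
print(weights.sum(), attention.shape)   # weights sum to 1; shape (8,)
```

Because the keys and values are themselves produced by learned transformations, the weights change from case to case, which is exactly what distinguishes attention from a pre-set window.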

Current LLM systems process text chunks thousands of words long, and they are trained on trillions of examples. Seen in that light, it is less surprising that they perform well on paragraph-level tasks.

These two changes (the ability to use unlabeled data, and attention), while seemingly simple, have led to a significant improvement in performance. So significant that transformers have now replaced (by outperforming) specialized neural network architectures in a number of areas beyond text processing (e.g., vision).

In addition to achieving human-level language capability, the latest generation of Transformers has started to perform at expert level in a number of STEM areas. The question is, will progress continue or stall?

[1] In fact, there would be numerical issues even if you had a lot of data.
