A New Alchemy: Language Model Development as a Subfield?
Colin Raffel
December 6, 2023
colinraffel.com/blog

Historically, language models have served as an important component of many learning systems — for example, to improve the transcriptions generated by a speech recognition system. However, the impact and usage of language models have grown dramatically over the past few years. Arguably, this growth is simply thanks to the fact that language models have gotten better, i.e. more accurate at predicting some text based on some context. Since most text-based tasks can be cast as predicting a response to a request (e.g. “summarize the following article”, “write me a Python function that queries Wikipedia”, etc.), recent large language models (LLMs) have proven somewhat effective at performing an incredibly wide range of tasks. Improvements in the language understanding and generation capabilities of LLMs have also led to their adoption in many larger systems (e.g. robots, image processing/generation, etc.), where they increasingly enable natural language to be used as an interface. These advances have led to a huge amount of research into building and using language models. I think this body of research has become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. The goal of this blog post is to sketch out the focuses and methodologies of the subfield of language model development as well as to provide some personal reflections on what to do when your field of study gives birth to a new one.


Some history

As a subfield, language modeling has many sibling and parent fields, including information theory, artificial intelligence, natural language processing, and machine learning. In my biased opinion, many recent advances in language modeling have stemmed from advances in deep learning. When thinking about fields like deep learning, I think it can be valuable to define what the assumptions and major problems of the field are. For deep learning, I would roughly say that the assumptions are:

  1. We should end-to-end optimize everything.
  2. Training a bigger model on a bigger dataset should yield improved performance, but we should also strive to develop efficient and performant model architectures.
  3. If we can bake structure into our model (e.g. convolutions for images), things work better...
  4. but what we really want is a system that can learn everything from data and relies on as few hard-coded assumptions as possible.
  5. We care less about theoretical guarantees and more about how well something works in practice.

Notably, the assumptions of a field are not necessarily scientifically or philosophically motivated - they can be cultural or arise from extraneous factors (e.g. the availability of GPUs). The major problems of the field of deep learning might be:

  1. How can we design neural network architectures that work well for a given problem, or better yet, across a wide variety of problems?
  2. Similarly, what objective works best?
  3. How should we optimize that objective?
  4. How can we ensure all of the above can be scaled up effectively?

Arguably, one of the biggest successes of recent deep learning research is a powerful recipe for training effective models on a wide variety of problems, namely, the Transformer trained with some variant of Adam. While the objective used can vary across problem settings, in text-based problems a simple language modeling objective works well (and, as discussed above, encapsulates pretty much any text-based task). An important aspect of this Transformer recipe is its scalability, i.e. the ability to attain predictable gains from scaling up training compute and/or dataset size.
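To make this recipe a bit more concrete, here is a minimal sketch of roughly what it looks like in PyTorch: a tiny Transformer trained with AdamW (a common Adam variant) on the next-token-prediction language modeling objective. The model size, hyperparameters, and random stand-in data below are purely illustrative and not taken from any particular system.

```python
# Minimal sketch of the "Transformer + Adam variant + language modeling
# objective" recipe. All sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, SEQ_LEN, BATCH = 1000, 128, 64, 8

class TinyTransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        # A stack of self-attention blocks with a causal mask acts as a
        # decoder-only language model.
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        seq_len = tokens.shape[1]
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        return self.lm_head(self.blocks(x, mask=mask))

model = TinyTransformerLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    # Stand-in data; in practice this would be batches of tokenized text.
    tokens = torch.randint(0, VOCAB_SIZE, (BATCH, SEQ_LEN))
    logits = model(tokens[:, :-1])   # predicted distribution over the next token
    targets = tokens[:, 1:]          # the actual next token at each position
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The “scalability” point above is that this same basic loop, run with a much larger model, dataset, and compute budget, tends to yield predictable improvements.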


Language model development

I think the scalability of the Transformer has ushered in a new era of research that is distinct from deep learning research. For the first time, we can (to a significant degree) stop worrying about what model architecture to use, how to train the model, what objective to use, whether we'll continue to get returns from scaling, etc. Instead, this new line of research primarily aims to study the development of language models in order to expand and understand their capabilities. In addition, the fact that recent LLMs are reasonably competent at a huge range of tasks has led to major differences in how we use LLMs (compared to, e.g., how we built and used neural networks in the context of deep learning). For lack of a better term, I'll refer to this new (sub)field as “language model development”, which might have the following assumptions:

  1. We can assume that the model architecture, optimizer, and objective are basically fixed.
  2. We hope or expect that a given LLM can be induced to perform basically any task out-of-the-box without any additional training (i.e. without updating its parameters), and in general we should avoid updating parameters to specialize a model to a given task (i.e. task-specific fine-tuning). A minimal sketch of this workflow appears after this list.
  3. The computational cost of getting a model to perform a task is mostly irrelevant, or at least, these costs will be resolved by something else (e.g. better/more hardware).
  4. If we invest more compute in training an LLM, it will produce better results.

Arguably, some of these assumptions could be considered consequences of the fact that many state-of-the-art language models are only available through black-box APIs. The major problems of language model development are something like:

  1. How can we get the model to do what we want (i.e. “prompt engineering”; a small sketch follows this list)?
  2. How can we make the model run as efficiently as possible?
  3. To the extent that we are going to update a model, how can we update it so that it is better at following instructions and less likely to generate harmful content (i.e. alignment)?
  4. More broadly, if we are really hoping the model can do anything, how do we prevent it from doing things we don't want it to?
  5. How can we integrate language models into other systems (i.e. tool use, multimodality, etc.)?
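To make the first of these problems slightly more concrete, here is a sketch of one common flavor of prompt engineering: few-shot in-context learning, where a handful of labeled demonstrations are packed into the prompt and a frozen model is asked to continue the pattern. The task, the demonstrations, and the llm_generate stand-in are all hypothetical; with a real model, llm_generate would wrap model.generate() or an API call.

```python
# Sketch of few-shot in-context learning as a form of prompt engineering.
# Everything here (task, demonstrations, llm_generate) is illustrative.

def llm_generate(prompt: str) -> str:
    """Stand-in for a frozen LLM; a real version would call a model or API."""
    return " positive"

def build_few_shot_prompt(demonstrations, query):
    """Pack labeled examples into the context so the model can continue the pattern."""
    lines = ["Label each movie review as positive or negative.", ""]
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demonstrations = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
prompt = build_few_shot_prompt(demonstrations, "A beautiful, moving film.")
print(llm_generate(prompt).strip())
```

Note that the demonstrations have to be re-processed on every single request, which is one reason the second problem (efficiency) is so closely tied to the first; this tension comes up again in the discussion of in-context learning versus fine-tuning below.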

Let me give a few additional examples of papers and techniques that I think aim to attack these problems under the aforementioned assumptions.

Separately, a sign of the maturity of a new subfield is the development of teaching materials. I think my friend Sasha Rush is leading the charge here, with e.g. GPTWorld for learning prompting, LLM training puzzles for learning about distributed training, and Transformer puzzles for understanding how Transformers might work. Another sign is the establishment of a conference on the subject, and we have one of those now too.


A New Alchemy

LLMs have ushered in a paradigm shift in the path toward imbuing computers with human-like capabilities. This paradigm shift is being felt in various fields, including deep learning (where the work of designing new architectures or optimizers is less and less relevant), natural language processing (where we now have a recipe that works reasonably well across subproblems that previously demanded custom methodologies), and beyond.

I started my PhD in 2012 during a similar paradigm shift from what I'd call “statistical machine learning” to deep learning. Unlike deep learning, statistical ML prioritized theoretical guarantees (e.g. convexity of the objective function and/or convergence under certain conditions). These guarantees arguably limited model expressivity, which in turn necessitated things like feature engineering that deep learning strove to avoid. While deep learning by no means “solved” the problems of statistical ML (just as language model development does not “solve” deep learning), it nevertheless presented a paradigm that made dramatic progress on the target problems of statistical ML and unlocked new applications. Such empirical successes of deep learning — which almost entirely eschewed theoretical guarantees — led to a great deal of hand-wringing on the part of the statistical ML crowd.

As my research increasingly made use of deep learning, I started to find myself at the receiving end of this hand-wringing. For example, during my first-ever oral presentation at a conference, I was presenting work that made use of convolutional neural networks. During questions, an audience member expressed distaste at my use of "convoluted" neural networks and suggested that something simpler would have worked better (of course I had tried simpler models and they worked significantly worse, but let's put that aside for the moment). This kind of despair was common at the time - people were applying deep neural networks in settings where they may or may not have been overkill, simply because it was the zeitgeist. At another conference I attended during my PhD, I happened to share a hostel room with a computer vision researcher who went on a long rant about the atrocity of deep learning (sometimes I wonder what this researcher is working on now). I think this sentiment is most elegantly laid out in Ali Rahimi's NeurIPS 2017 test-of-time award acceptance speech, where he argues that deep learning is like alchemy - trial-and-error that yields some effective techniques but lacks rigor. Ali's speech had a big impact on me and others but arguably didn't really stop people from continuing to develop and apply deep learning without worrying about rigor and in settings where simpler methods would have sufficed (simply because using a big fancy neural network was sexier).

These experiences led me to promise myself that when my field of study gave birth to another, I wouldn't dig in my heels and resist; I'd follow the tide of progress. Now that this is (arguably) happening, I'm finding it more difficult than I had anticipated. As much as I wish it wasn't true, I cringe a little whenever I see a new LLM technique that ignores a dramatic increase in computational cost and bends over backwards to avoid updating the model's parameters, or an application of an LLM where something dramatically cheaper would suffice, or a paper studying the behaviors of an LLM as if it's a black box (or studying an LLM API, in which case it actually is somewhat of a black box), and on and on. And try as I might, I can't resist trying to stem the tide — for example, our T-Few paper aimed to convince everyone that few-shot in-context learning (ICL) was absurdly computationally inefficient and that fine-tuning specialized models is cheaper and better. Of course, people are still using few-shot ICL and are still avoiding task-specific fine-tuning at all costs, because that's the zeitgeist — and I think this isn't totally wrong, because in tandem there's a huge amount of synergistic work on making LLMs more efficient and effective. But, to be honest, it still feels a little wrong, and I'm not sure if I'll be able to shake that feeling.

So, what's the best course of action when you used to be with it, but then they changed what “it” was? I think there were many ML researchers who successfully rode the tide from statistical ML to deep learning — they willingly embraced the new field while bringing their knowledge and sense of rigor to their deep learning research. In other words, they used their past knowledge to provide a broader and deeper perspective that newcomers may have lacked. An especially prominent product of this kind of research is arguably the Variational Autoencoder (VAE), which connected ideas from variational inference to the autoencoder neural network architecture. VAEs are still an important component of state-of-the-art diffusion-based generative models. Hopefully, those of us who were working on deep learning and NLP before the LLM era can bring a similar perspective (and avoid digging in our heels too much).

Thanks to Derek Tam, Haokun Liu, Chris Maddison, and Boaz Barak for feedback on this blog post.
