
A Call to Build Models Like We Build Open-Source Software
Colin Raffel
December 8th, 2021
colinraffel.com/blog

Contents

Introduction
A brief explanation of open-source software development
Community-developed and continually-improved models
  Incremental and cheaply-communicable updates
  Merging models
  Vetting community contributions
  Versioning and backward compatibility
  Modularity and distribution
An example future
Conclusion


Introduction

Over the past decade or so, it has become increasingly common to use transfer learning when tackling machine learning problems. In transfer learning, a model is first trained on a data-rich pre-training task before being fine-tuned through additional training on a downstream task of interest. The use of a pre-trained model (instead of starting from a model whose parameters were initialized randomly) tends to produce better performance from less labeled data on the downstream task. This synergy has made transfer learning a standard choice in many domains, and popular pre-trained models therefore see a staggering amount of use. For example, the BERT model on the Hugging Face Model Hub has been downloaded tens of millions of times, and loading a pre-trained image classification model in PyTorch is as simple as passing pretrained=True.
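To make this concrete, here is roughly what loading a popular pre-trained model looks like today (a minimal sketch, assuming the torchvision and transformers packages are installed):

```python
# Loading widely-used pre-trained models takes a single line each.
import torchvision.models as models
from transformers import AutoModel

# An ImageNet-pre-trained image classifier in PyTorch.
resnet = models.resnet50(pretrained=True)

# The pre-trained BERT checkpoint from the Hugging Face Model Hub.
bert = AutoModel.from_pretrained("bert-base-uncased")
```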

The widespread use of pre-trained models is a clear endorsement of their utility. However, pre-training often involves training a large model on a large amount of data. This incurs substantial computational (and therefore financial) costs; for example, Lambda estimates that training the GPT-3 language model would cost around $4.6 million. As a result, most popular pre-trained models were created by small teams within large, resource-rich corporations. This means that the majority of the research community is excluded from participating in the design and creation of these valuable resources. To make matters worse, most pre-trained models are never updated after release. There are many reasons why we might want to update a pre-trained model: for example, to train it on newly available data, to take advantage of improved training techniques, or to fix problematic behaviors such as the memorization of private training data [carlini2021].

To date, there is no standard approach for updating a released model to address these issues; standard practice is instead to leave models indefinitely “frozen” in the state in which they were released until they are supplanted by an entirely new model.

Contrast this state of affairs with the development of open-source software. Imagine if the ubiquitous open-source Python programming language had remained frozen after its first release: It would not support boolean variables, sets, Unicode, context managers, generators, keyword arguments, list comprehensions, or many other widely-used features. It also would not include the thousands of bugfixes that have been added over the past few decades, the majority of which were contributed by members of the amorphous community of Python developers. For the most part, these improvements could be readily integrated into existing Python codebases with no changes required by end-users. Any backward-incompatible changes are communicated through a clear versioning system, along with detailed instructions to update existing code. This kind of large-scale distributed collaboration is made possible through a mature set of tools and concepts for developing open-source software, including version control, continuous integration, forking, merging, package management, and more.

This post argues that we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research.


A brief explanation of open-source software development

This section provides a short overview of the process used to develop open-source software. This treatment is intentionally “opinionated” in the sense that it is focused on motivating similar ideas in the development of machine learning models. Readers who are already intimately familiar with open-source development and version control concepts can skip this section.

The basic goal of open-source software development is to allow for a piece of software to be created collaboratively by a distributed community. This contrasts with more solitary modes of development, where a single programmer or small close-knit team works in isolation. The distributed and asynchronous nature of open-source collaborations has led to the creation of various tools and concepts that facilitate cohesive progress.

The backbone of open-source software development is version control. Version control provides a system for managing changes to a codebase or repository. The basic unit of open-source software development is the patch (or commit), which describes a single atomic change to the repository. For example, a patch might add a new feature, fix a bug, or refactor some code. The main purpose of a version control system is to keep a history of patches. In typical open-source development workflows, there is a single main repository that stores the “source of truth” and is used to produce the software, library, or other artifact that the codebase implements. This main repository is typically managed by a small group of maintainers, who help coordinate the distributed effort of developing the software.

Community development is made possible through pull (or merge) requests. A pull request is a patch that a contributor has submitted for consideration to be included in a repository. After a contributor submits a pull request, maintainers of the repository can review it to decide whether to incorporate the changes by merging it. This review is typically aided by testing, i.e. a series of automated checks that help ensure the correctness of changes. One goal of testing is to confirm whether the code would remain backward-compatible if the pull request were accepted. Backward compatibility guarantees that if the software worked in a particular setting in the past, it will continue to work after the new changes have been applied. If a given pull request implements a useful addition to the codebase and passes existing tests (and, as appropriate, implements new tests to verify the correctness of new functionality), the maintainers then accept the pull request and add its changes to the history of the main repository.

The fact that open-source software can be developed in a distributed manner gives rise to the problem of merge conflicts. A merge conflict occurs when two developers independently change the same part of a given codebase in an incompatible way. While version control systems have ways of automatically resolving some merge conflicts, some cannot be automatically resolved (e.g. if one developer deletes some particular line of code while another modifies it) and must be manually fixed by the contributors.

In some cases, different contributors to a given open-source project have irreconcilable disagreements. For example, they might disagree on the scope of possible uses of the software or the set of features it should support. This can lead to forking, where a repository is copied to a new codebase that is independently developed. Forking allows for a somewhat amicable solution to disagreements: If you don’t like it, just fork and create your own version! As an extreme example, the Debian Linux distribution has been forked into dozens of child distributions, including the popular Ubuntu operating system.

The progressive improvement and modification of an open-source software project is typically a gradual process, and most end-users of the software are not interested in every incremental change. This has led to conventions around releasing versions of the codebase at meaningful intervals. In semantic versioning, patch releases correspond to bugfixes and other changes that do not add features, minor releases add new features without breaking backward compatibility, and major releases can break backward compatibility. End-users typically update their software through package management systems that provide a user-friendly way to install and upgrade software. A major feature of most package management systems is dependency management, which automatically installs any software required by a given package (typically at a specific version to ensure compatibility). This ecosystem makes it easy for a given open-source library to depend on functionality from another library and to ensure that the dependencies will function as expected. The net effect is modular software, where each package implements specific functionality that can be readily incorporated and reused through installation and inclusion in another package.
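As a small concrete illustration, the packaging library (which pip uses under the hood) makes these versioning and dependency conventions precise:

```python
# Semantic versioning and dependency constraints with the `packaging` library.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# A package declares which versions of a dependency it is compatible with.
requirement = SpecifierSet(">=1.2,<2.0")

print(Version("1.2.3") in requirement)  # True: a bugfix (patch) release
print(Version("1.4.0") in requirement)  # True: a backward-compatible minor release
print(Version("2.0.0") in requirement)  # False: a major release may break compatibility
```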


Community-developed and continually-improved models

How can we make the development of pre-trained models a more collaborative process? And how can we make it possible to continually improve pre-trained models over time? The success of open-source software development provides a source of inspiration for tackling these questions. In this section, I'll outline how we might be able to port ideas from version control to the training of machine learning models. As a starting point, we assume that instead of managing a repository of code, the goal is to manage the parameters and architecture of a pre-trained model as it changes over time through contributions from a distributed community of individuals.


Incremental and cheaply-communicable updates

Modern machine learning models are often trained through gradient descent. Crucially, gradient descent typically involves updating all of the parameters at every training iteration. As such, communicating any change made during training requires communicating every parameter's value. Unlike the small, incremental patches used to update code under version control, these dense updates make it infeasible to keep track of the complete history of every change to a model's parameters.

Fortunately, past work has shown that it is possible to effectively train models without updating every parameter at every training iteration. One motivation for this work has been to reduce costs in distributed or federated learning, where individual workers train local copies of a model and must communicate their changes to a centralized server. This has led to schemes to reduce communication costs, which can be done either by reducing the frequency of communication (i.e. having workers train for more iterations before they communicate their updates) or by reducing the size of each update. The latter strategy is particularly relevant for creating small model patches and includes techniques like updating a small subset of the model's parameters [aji2017], [sattler2019], [guo2020] or compressing updates [sattler2019], [konecny2016], [bernstein2018]. This existing work suggests that it should be possible to avoid much of the cost associated with storing and transmitting updates to a model produced over the course of training.
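As a rough sketch of the general idea (a toy stand-in for, rather than a faithful implementation of, the methods cited above), a model “patch” could be represented as just the parameter entries that changed the most:

```python
import torch

def sparse_patch(old_state, new_state, fraction=0.005):
    """Express an update as the `fraction` of entries in each parameter
    tensor that changed the most during training."""
    patch = {}
    for name, old in old_state.items():
        delta = (new_state[name] - old).flatten()
        k = max(1, int(fraction * delta.numel()))
        _, indices = torch.topk(delta.abs(), k)
        patch[name] = (indices, delta[indices])
    return patch

def apply_patch(state, patch):
    """Apply a sparse patch in place (assumes contiguous parameter tensors)."""
    for name, (indices, values) in patch.items():
        state[name].view(-1)[indices] += values
    return state
```

Storing a history of such patches would be far cheaper than storing a full checkpoint for every change.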

In recent work [sung2021], we introduced a method for choosing a small subset of the model's parameters to update over many training iterations. The subset of parameters is chosen according to which parameters have the largest (approximate) Fisher information, which captures how much changing a given parameter will affect the model's output. We demonstrate that our approach makes it possible to update a small fraction (as little as 0.5%) of the model's parameters while still attaining performance similar to training all parameters.
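A minimal sketch of this idea, using the average squared gradient as a stand-in for the diagonal Fisher approximation (the exact procedure in the paper differs in its details):

```python
import torch

def fisher_mask(model, loss_fn, data_loader, fraction=0.005):
    """Select the `fraction` of parameters with the largest approximate
    (diagonal) Fisher information, estimated here via squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, targets in data_loader:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    # Choose one global threshold so that roughly `fraction` of all parameters
    # end up in the mask of parameters that are allowed to be updated.
    scores = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(fraction * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return {n: f >= threshold for n, f in fisher.items()}
```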

It is also possible to improve or change a model by adding or removing parameters. A notable example of this approach is the Net2Net framework of Chen et al. [chen2016], which describes ways of adding parameters to a neural network (e.g. by widening a layer or adding a new layer) that do not change the function the model originally implemented. Relatedly, “adapter modules” [rebuffi2017], [houlsby2019], [bapna2019] are tiny trainable subnetworks that are added to a model. By only updating the adapter module parameters and leaving the rest of the model's parameters fixed, the model can be trained in a parameter-efficient manner. Any of these methods could be used to produce a “patch” that improves a model (e.g. making it applicable to a new task) by increasing or decreasing its size.
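For concreteness, here is a toy adapter module in the general style of [houlsby2019] (the exact placement, architecture, and initialization vary across papers):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck subnetwork added to a frozen base model."""

    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        # Zero-initialize the up-projection so that, when first inserted,
        # the adapter leaves the base model's function unchanged.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Only the adapter's parameters are trained, so the resulting "patch" is
# roughly 100k values rather than the ~110M parameters of BERT-base.
adapter = Adapter(d_model=768)
print(sum(p.numel() for p in adapter.parameters()))
```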

Finally, certain model architectures may be more amenable to cheaply-communicable updates because they are designed to allow focused updates. For example, models that involve conditional computation [bengio2013] among submodules (such as the sparsely-gated mixture-of-experts layer [shazeer2017]) can have individual submodules added, removed, or updated while the remainder of the parameters remain fixed. Alternatively, models that make predictions by querying a collection of data can be incrementally modified by changing the data itself. A simple example of this kind of model is a k-nearest-neighbor classifier, whose decision rule can be changed by adding or removing individual datapoints. Recently, more sophisticated models like REALM [guu2020] and RAG [lewis2020] have also demonstrated the feasibility of updating a model by modifying an external collection of text data.
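As a toy example of a model whose behavior is changed purely by editing the data it queries:

```python
import numpy as np

class NearestNeighborClassifier:
    """A 1-nearest-neighbor classifier; it is "updated" by adding or
    removing datapoints, with no gradient-based training at all."""

    def __init__(self):
        self.examples, self.labels = [], []

    def add(self, x, y):
        self.examples.append(np.asarray(x, dtype=float))
        self.labels.append(y)

    def remove(self, index):
        del self.examples[index], self.labels[index]

    def predict(self, x):
        distances = [np.linalg.norm(np.asarray(x, dtype=float) - e)
                     for e in self.examples]
        return self.labels[int(np.argmin(distances))]

clf = NearestNeighborClassifier()
clf.add([0.0, 0.0], "negative")
clf.add([1.0, 1.0], "positive")
print(clf.predict([0.9, 0.8]))  # "positive"; adding or removing points changes the decision rule
```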

Taken together, the above options represent preliminary evidence that it could be possible to create meaningful incremental “patches” for machine learning models by modifying a small subset of their parameters, adding/removing parameters, and/or by using a model architecture that explicitly facilitates incremental updates. These options could be further developed to provide a way for contributors to propose changes to a community-developed model.


Merging models

In distributed open-source software development, “merge conflicts” occur when contributors introduce conflicting changes to the same part of the codebase. The longer a contributor develops modifications against an out-of-date copy of the codebase, the more likely it becomes that the proposed changes will result in a merge conflict. A similar situation arises in distributed optimization of machine learning models: Since individual workers compute many local updates before communicating their changes, the individual models on each worker can become significantly different over time. This issue is especially pronounced in federated learning, where the goal is often to delay worker communication as much as possible to reduce communication costs. A strong baseline for combining (or “merging”) disparate updates from different workers is simply to average together the updates from each of the workers when aggregating changes [mcmahan2017], though averaging can degrade performance when individual workers are training on differently-distributed data [sattler2019].
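A minimal sketch of this kind of parameter averaging (the core merging step of FedAvg, with the surrounding communication protocol omitted):

```python
def federated_average(worker_states, weights=None):
    """Merge per-worker model copies by (weighted) parameter averaging, as in
    [mcmahan2017]. `worker_states` is a list of state dicts with identical
    keys; `weights` could, e.g., be proportional to each worker's data size."""
    if weights is None:
        weights = [1.0 / len(worker_states)] * len(worker_states)
    return {
        name: sum(w * state[name] for w, state in zip(weights, worker_states))
        for name in worker_states[0]
    }
```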

In recent work [matena2021], we develop an improved method of merging models that takes into account the importance of each parameter when computing the average of different models. Like our work on communication-efficient training, we use the Fisher information as an approximate measure of a parameter's importance. We find that this form of merging can efficiently transfer knowledge across models fine-tuned from the same pre-trained checkpoint. In addition, we show how our method enables paths of transfer that would be onerous with sequential gradient-based fine-tuning.
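A simplified sketch of the idea: each model's diagonal Fisher estimate acts as a per-parameter weight, and plain averaging is recovered when all of the Fisher values are equal (details such as how the Fisher is estimated are omitted here).

```python
def fisher_weighted_merge(states, fishers, eps=1e-8):
    """Merge models by weighting each parameter by its estimated diagonal
    Fisher information, in the spirit of [matena2021]. `states` and `fishers`
    are parallel lists of state dicts with matching keys."""
    merged = {}
    for name in states[0]:
        weighted = sum(f[name] * s[name] for f, s in zip(fishers, states))
        normalizer = sum(f[name] for f in fishers) + eps
        merged[name] = weighted / normalizer
    return merged
```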

Using specialized model architectures could also provide a more principled way of combining updates. For example, updates to different adapters or experts (in a model utilizing a mixture-of-experts layer) would not conflict. Relatedly, Ryabinin and Gusev [ryabinin2020] recently showed how a specialized mixture-of-experts architecture can be used to allow extremely distributed and fault-tolerant training. The HOGWILD! algorithm [niu2011] has near-optimal convergence rates when the model can be decomposed as a sum of models that each operate on a sparse subset of the input features. Further investigation into specialized model architectures and optimization schemes will help clarify the situations where updates can be merged without a degradation in performance.


Vetting community contributions

Given the ability for contributors to propose incremental improvements through patches, a natural question is how maintainers should decide whether to accept a proposed change. In open-source software development, this is typically done with the aid of automated testing, where a proposed change undergoes a battery of tests that confirm that the software still works correctly and as expected. The utility and validity of pre-trained models are typically measured by evaluating performance after fine-tuning on downstream tasks. A natural analog to automated testing in open-source software development is therefore measuring a pre-trained model's performance on downstream tasks.

A potential issue with using fine-tuned performance for automated testing is the computational cost. While using a pre-trained model can greatly speed up convergence, fine-tuning nevertheless typically requires hundreds or thousands of gradient steps (depending on the task's size and complexity). Furthermore, the utility of a pre-trained model is typically measured on dozens of tasks. This suggests that vetting each change to a community-developed model through fine-tuning and evaluation on downstream tasks might be excessively computationally expensive.
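As a sketch of what such a gating check might look like (the evaluate function, the task suite, and the tolerance below are hypothetical placeholders):

```python
def vet_contribution(candidate, baseline, task_suite, evaluate, tolerance=0.005):
    """Accept a proposed model change only if it does not noticeably degrade
    performance on any supported downstream task. `evaluate(model, task)` is a
    hypothetical function that fine-tunes and/or evaluates a model on a task
    and returns a scalar score (higher is better)."""
    for task in task_suite:
        if evaluate(candidate, task) < evaluate(baseline, task) - tolerance:
            return False, f"performance regression on {task}"
    return True, "all checks passed"
```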

One way to mitigate this cost would be to make fine-tuning dramatically cheaper. For example, if we fine-tune a pre-trained model, perform additional pre-training, and then want to fine-tune it again, we might hope to reuse the updates from the first fine-tuning run, thereby allowing the updated model to be evaluated without an additional fine-tuning run. This idea bears some similarity to “merging models”, in the sense that we want to merge updates from the first fine-tuning run into the updated pre-trained model. We might also hope to use some of the architectural modifications discussed for “incremental updates” in this setting; for example, adapter modules created for a given pre-trained model might be reusable after the model undergoes additional pre-training.
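One speculative way to realize this reuse is to simply re-apply the earlier fine-tuning run's parameter difference on top of the updated pre-trained model; whether the transferred update remains valid after additional pre-training is exactly the kind of open question raised here:

```python
def reuse_finetuning_update(old_base, old_finetuned, new_base):
    """Transfer a previous fine-tuning update onto an updated pre-trained
    model by re-applying the parameter difference. All arguments are state
    dicts with matching keys; this is a speculative sketch, not an
    established method."""
    return {
        name: new_base[name] + (old_finetuned[name] - old_base[name])
        for name in new_base
    }
```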

Multi-task learning provides an alternative to the pre-train-then-fine-tune paradigm that could make testing as simple as evaluating the model on all of the tasks it targets. In multi-task learning, the goal is to train a single model so that it can perform many tasks simultaneously. Recent results have shown that pre-trained language models can perform well on many tasks at once [brown2020], [wei2021], [sanh2021]. However, multi-task performance often lags behind the performance of models trained separately on each individual task.

Our discussion of testing so far has mainly focused on evaluating a model's performance on downstream tasks. This is only one way to “test” a model; there is also a large body of work on measuring whether a given model has a specific capability. For example, the field of “BERTology” [rogers2020] endeavors to measure what kinds of “knowledge” (syntactic, semantic, etc.) the pre-trained BERT model [devlin2019] has. Diagnostic datasets have also been developed to measure biases exhibited by a model (e.g. [rudinger2018], [caliskan2017]), i.e. differential treatment of certain demographic groups. These diagnostics could be included as part of a testing suite to better characterize the impact of a proposed change to a model.


Versioning and backward compatibility

The semantic versioning system provides a standardized way of communicating the significance of a new version of a piece of software. A similar system could likely be devised for community-developed models. For example, patch releases could correspond to changes that modify only a small subset of a model's parameters while guaranteeing no significant degradation of performance on supported downstream tasks, minor releases could change the architecture of the model while retaining the majority of its parameters, and major releases could completely replace the existing model and its parameters while targeting the same use-cases. “Backward compatibility” of a pre-trained model most naturally maps to the notion of maintaining performance on targeted downstream tasks (as discussed previously in the context of vetting community contributions). At a minimum, backward compatibility should ensure that the model remains applicable to the same types of inputs and outputs. For example, if a computer vision model is applicable to both RGB and RGB-D images (which include a depth channel), an update to the model that removes the depth-processing functionality would break backward compatibility. Shen et al. [shen2020] previously explored the possibility of “backward-compatible representation learning”, where precomputed representations from an old model can still be reliably compared to representations produced by a new model.
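To make the hypothetical convention above concrete, a maintainer tool might classify proposed changes along these lines (the attribute names here are invented for illustration):

```python
def release_type(change):
    """Map a proposed model change to a semantic-versioning release type
    under the convention sketched above."""
    if change.replaces_parameters_entirely or change.breaks_supported_io:
        return "major"  # e.g. an entirely new architecture and parameter set
    if change.modifies_architecture:
        return "minor"  # e.g. adding adapter modules or a new input modality
    return "patch"      # e.g. updating a small subset of parameters with no regressions
```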


Modularity and distribution

The ability to incorporate open-source packages into a piece of software allows developers to easily add new functionality. This kind of modular reuse of subcomponents is currently rare in machine learning models. In a future where continuously-improved and backward-compatible models are commonplace, it might be possible to greatly improve the modularity of machine learning models. For example, a core “natural language understanding” model could be augmented with an input module that allows it to process text in a new language, a “retrieval” module that allows it to look up information in Wikipedia, or a “generation” output module that allows it to conditionally generate text. Including a shared library in a software project is made significantly easier by package managers, which allow a piece of software to specify which libraries it relies on. Modularized machine learning models could also benefit from a system for specifying that a model relies on specific subcomponents. If a well-defined semantic versioning system is carefully followed, models could further specify which versions of their dependencies they are compatible with.
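Such a system might boil down to a “model manifest” analogous to a package's dependency specification. The component names and version ranges below are invented purely for illustration:

```python
# A hypothetical manifest declaring the modular subcomponents a model depends
# on and the versions it is compatible with.
model_manifest = {
    "name": "nlu-core",
    "version": "1.3.2",
    "dependencies": {
        "multilingual-tokenizer": ">=2.0,<3.0",
        "wikipedia-retrieval-module": ">=1.1,<2.0",
        "text-generation-head": ">=0.4,<1.0",
    },
}
```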


An example future

The previous section provided a high-level overview of ways that ideas from open-source software development could be applied to building machine learning models. For a more concrete motivation, this section describes an aspirational and fictitious vision of the kind of model development that could be possible in the future.

Researchers at Salban University decide to train a new language model called CLAMP (Community LAnguage Model Project). They have limited access to computational resources, so they are only able to train the model for enough time to attain reasonable performance on a few downstream tasks after fine-tuning. They set up a framework for testing the model’s fine-tuned performance on a suite of downstream tasks and release version 1.0.0 of the model to the world. Later, a different group of researchers at the University of Ducksville make use of their computing cluster to perform additional training. They use a training method that only updates a few of the model’s parameters so that they can cheaply communicate the proposed changes back to CLAMP’s maintainers. The new model’s performance is rapidly verified on the task suite thanks to the ability to reuse updates from a previous fine-tuning run. However, it turns out that the Fidmor Foundation had also been performing additional training in parallel. Fortunately, the updates by each organization can be merged, and they are included in a new release of CLAMP version 1.0.1.

In the meantime, some researchers at NetWatch have discovered that language models can struggle to process English text that was written by non-native speakers. They develop a test to help evaluate a model’s performance on text sourced from a diverse set of speakers, and the CLAMP maintainers decide to add it to their testing suite. Later, researchers at the Chakraborty Institute develop a method for shrinking models and propose applying their technique to CLAMP. However, it turns out that this form of compression causes the model’s performance to degrade on NetWatch’s new test (see e.g. [hooker2019]), so the change is rejected.

Initially, CLAMP was trained only on English data. Students at Beinan University develop a technique for adding new modules to an existing model so that it can process data in additional languages. They submit a model change that adds this new technique to CLAMP, prompting the release of version 1.1.0. Encouraged by this new technique, members of the Autocode Initiative create a module for allowing CLAMP to process source code. CLAMP’s maintainers decide that this is outside the scope of the project, so Autocode forks CLAMP and begins to maintain a separate model specifically designed for analyzing source code.

A few years later, researchers at the Universidade de Iguape develop a fundamentally new architecture for natural language processing tasks that is dramatically more efficient and performant than the one initially used for CLAMP. Through a charitable donation from the Saetang Foundation, CLAMP’s maintainers are able to re-train a completely new model to replace CLAMP. This new model is released as CLAMP version 2.0.0.


Conclusion

The use of pre-trained models for transfer learning has ushered in a golden era in many applications of machine learning. However, the development of these models is still in the dark ages compared to best practices in software development. Well-established concepts from open-source software development provide inspiration for methods for building continually-improved and collaboratively-developed pre-trained models. These connections motivate research related to existing topics like continual learning, multitask learning, distributed optimization techniques, federated learning, modular architectures, and more. Undertaking this research program will help shift the power away from large corporate entities working in isolation and allow models to be developed democratically by a distributed community of researchers.

References

[carlini2021] Nicholas Carlini et al. “Extracting Training Data From Large Language Models.” USENIX Security Symposium, 2021.
[aji2017] Alham Fikri Aji and Kenneth Heafield. “Sparse Communication for Distributed Gradient Descent.” Empirical Methods in Natural Language Processing, 2017.
[sattler2019] Felix Sattler et al. “Robust and Communication-Efficient Federated Learning from Non-IID Data.” IEEE Transactions on Neural Networks and Learning Systems, 2019.
[guo2020] Demi Guo, Alexander M. Rush, and Yoon Kim. “Parameter-Efficient Transfer Learning With Diff Pruning.” arXiv preprint arXiv:2012.07463, 2020.
[houlsby2019] Neil Houlsby et al. “Parameter-Efficient Transfer Learning for NLP.” International Conference on Machine Learning, 2019.
[bapna2019] Ankur Bapna and Orhan Firat. “Simple, Scalable Adaptation for Neural Machine Translation.” Empirical Methods in Natural Language Processing, 2019.
[rebuffi2017] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. “Learning Multiple Visual Domains With Residual Adapters.” Neural Information Processing Systems, 2017.
[konecny2016] Jakub Konečný et al. “Federated Learning: Strategies for Improving Communication Efficiency.” arXiv preprint arXiv:1610.05492, 2016.
[bernstein2018] Jeremy Bernstein et al. “signSGD: Compressed Optimisation for Non-Convex Problems.” International Conference on Machine Learning, 2018.
[chen2016] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. “Net2Net: Accelerating Learning Via Knowledge Transfer.” International Conference on Learning Representations, 2016.
[bengio2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. “Estimating or Propagating Gradients Through Stochastic Neurons For Conditional Computation.” arXiv preprint arXiv:1308.3432, 2013.
[shazeer2017] Noam Shazeer et al. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” International Conference on Learning Representations, 2017.
[guu2020] Kelvin Guu et al. “REALM: Retrieval-Augmented Language Model Pre-Training.” International Conference on Machine Learning, 2020.
[lewis2020] Patrick Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Neural Information Processing Systems, 2020.
[sung2021] Yi-Lin Sung, Varun Nair, and Colin Raffel. “Training Neural Networks with Fixed Sparse Masks.” Neural Information Processing Systems, 2021.
[mcmahan2017] H. Brendan McMahan et al. “Communication-Efficient Learning of Deep Networks From Decentralized Data.” Artificial Intelligence and Statistics, 2017.
[matena2021] Michael Matena and Colin Raffel. “Merging Models with Fisher-Weighted Averaging.” arXiv preprint arXiv:2111.09832, 2021.
[ryabinin2020] Max Ryabinin and Anton Gusev. “Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts.” Neural Information Processing Systems, 2020.
[niu2011] Feng Niu et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent.” Neural Information Processing Systems, 2011.
[brown2020] Tom B. Brown et al. “Language Models are Few-Shot Learners.” arXiv preprint arXiv:2005.14165, 2020.
[wei2021] Jason Wei et al. “Finetuned Language Models are Zero-Shot Learners.” arXiv preprint arXiv:2109.01652, 2021.
[sanh2021] Victor Sanh et al. “Multitask Prompted Training Enables Zero-Shot Task Generalization.” arXiv preprint arXiv:2110.08207, 2021.
[rogers2020] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. “A Primer in BERTology: What We Know About How BERT Works.” Transactions of the Association for Computational Linguistics, 2020.
[devlin2019] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
[rudinger2018] Rachel Rudinger et al. “Gender Bias in Coreference Resolution.” Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
[shen2020] Yantao Shen et al. “Towards Backward-Compatible Representation Learning.” Conference on Computer Vision and Pattern Recognition, 2020.
[hooker2019] Sara Hooker et al. “What Do Compressed Deep Neural Networks Forget?” arXiv preprint arXiv:1911.05248, 2019.
[caliskan2017] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. “Semantics Derived Automatically From Language Corpora Contain Human-like Biases.” Science, 2017.
