How LLMs could be used to make research better, rather than worse

Colin Raffel
March 6, 2026
colinraffel.com/blog

LLMs have been continually getting better for many years, but there's an increasingly broad consensus that recent improvements have catapulted them from “sometimes useful” to “often fully autonomous”, especially for software engineering. I won't recount the story that has been told elsewhere. Instead, I'd like to share some of my thoughts about the implications of this step change, and of likely near-term improvements, for machine learning (and perhaps more broadly computational) research.

As I'm writing this, Gemini Flash 3 preview is running some experiments for me. I prompted it with a basic description of what I wanted done along with a few pointers to relevant documentation. It wrote the necessary scripts, debugged them, and launched the jobs. When the experiments finish running I'll ask it to analyze and plot the results and run some follow-up experiments. All of this would have taken me or one of my students a day or two of work. It took Gemini Flash 3 fifteen minutes and cost me less than $1.

As a personal aside, I'm still grappling with this emotionally. I'm somewhat afraid to admit that a significant part of my self-worth is staked on the fact that I've developed some degree of expertise in these skills. Seeing a model quickly, cheaply, and competently exercise those same skills fills me with both excitement and ennui.

Emotions aside, the fact of the matter is that many parts of the research cycle—writing code for experiments, implementing baselines, babysitting jobs, debugging, analyzing results, etc.—can be done by an LLM. I also get the impression that LLMs are helpful in formulating and proving theoretical results, as well as in writing and editing papers, though I've used them much less for this. I have heard that LLMs are not as good at ideation, though I suspect this may change in the near term.

It used to be the case that these were all important skills for a researcher to hone and develop. Many years ago, one of the most successful junior ML researchers I know told me something like “at some point I looked around at all the researchers I knew and realized they were all smarter than I was. So I decided I would just get very good at running experiments quickly, and that's what has made me successful as a researcher.” Now anyone can perform these skills competently and quickly as long as they can prompt an LLM.

In the past, how did researchers put all these skills to use? At risk of sounding jaded, it was primarily to write mediocre “messy middle” papers that had a decent chance of getting published. LLMs can dramatically speed up most, if not all, aspects of the research process. I think this has resulted, or soon will result, in a dramatic drop in the total amount of effort and skill required to complete a mediocre paper. In the past few years, we have had to get used to the fact that the skill and effort required to write a personalized email, or create a promotional image, or produce a song, has dropped to near-zero. I think the same thing is happening to machine learning research. As a concrete example, prospective students have started sending me paper drafts where I would bet that an LLM came up with the project idea, wrote the code, ran the experiments, and wrote the paper, and the student just prompted it along. In the next year or so, I think conferences will be totally overloaded by a dramatic jump in the number of mediocre paper submissions.

In the worst case, research devolves into the exercise of using LLMs to mass-produce mediocre papers in hopes that the human prompters can accrue citations and credibility. What about the best case? I am hoping that LLMs actually improve the research process in two ways: First, by allowing us to stop pushing mediocre papers and more consistently strive for bigger contributions, and second, by significantly improving the peer-review process.

It takes a lot of ingenuity, effort, and luck to produce an impactful paper that stands out and significantly pushes the field forward. Most research projects do not end this way—more likely the results are promising but not groundbreaking. Historically, if a researcher had expended a year of effort to undertake the project, the understandable goal would be to try to publish the results even if they were not groundbreaking. But coding agents can now dramatically reduce the amount of human effort and time that it takes to undertake a project. Assuming compute is available, they can also help explore more ideas in parallel.

Rather than continue to expect most research contributions to be incremental, maybe we can instead raise the bar for the expected level of contribution for research papers. If agents speed up and parallelize the research cycle, then an impactful result becomes more likely within a given span of time. This could free us from the cycle of submitting mediocre results for the sake of getting them published. I wouldn't expect this to happen in the short term, but if we do get overloaded by a deluge of mediocre papers, maybe that will help push us out of our current practices.

Similarly, I hope that LLMs can help push us out of current reviewing practices. As it stands, the reviewing of machine learning papers is incredibly shallow. Ideally, a review would involve reimplementing proposed algorithms and reproducing experiments to ensure validity, checking through the steps of any proofs, and doing a literature review to ensure there are no missing related works or baselines. But it is incredibly rare that a reviewer can undertake even a small subset of these tasks—it's just too time consuming. Instead, most reviewers give the paper a quick (but hopefully thorough) read, take the claims at face value, rely on their own (incomplete and often faulty) knowledge of related work, and then write down their review and decision. Without opening up the can of worms that is “how can we make ML reviewing better”, I think we likely all agree in principle that we could do better if we had more time and/or person-power.

But LLMs can likely competently undertake many of these tasks. For example, it is already the case that LLMs can analyze a proof statement-by-statement and uncover issues (see e.g. section 3.2 of this paper). I suspect that, given a submission PDF, a coding agent could do a reasonable job reimplementing algorithms and reproducing experiments (and I hope to experiment with this in the near future). LLMs equipped with web search can surface relevant past work and reason through the relationship between two papers, ultimately suggesting whether a surfaced paper constitutes a missing reference or baseline. I'll emphasize that I am not advocating for using LLMs to lazily replicate our current surface-level reviewing practices, but rather suggesting we use their efficiency and parallelization to move our review process much closer to ideal practices.

I'm pretty confident that LLMs will cause an upheaval in the research process. But, at least for the time being, research is ultimately motivated by human needs and human curiosity, so we have control over the research process. Where we end up after the upheaval, then, is ultimately up to us.
