
Lab in the Loop: From Research Tool to Research Leader

Constantine Goltsev

“AI research” has always meant research on new AI systems: their architectures, training algorithms, and other aspects of machine learning. But recently, “AI research” has been taking on a new meaning: it can now refer to research conducted by AI systems! In this post, we review several recent announcements of new results obtained by AI systems and speculate about the brave new research world we are entering.

Introduction

In 1961, Derek J. de Solla Price observed in his book Science Since Babylon that the number of scientific papers was doubling every 10-15 years, warning of an impending "crisis of science" in which researchers would drown in literature they could never hope to read. More than sixty years later, we publish over 2.5 million papers annually, and Price's prediction has proven both right and wrong. Right, because no human can possibly keep up with even a narrow subfield anymore—trust me, I’m doing my best to keep up with AI! Wrong, because he could not imagine that the readers—and now the writers—of scientific papers might not be human at all.

At the risk of sounding grandiose, I want to say that we stand at a crucial inflection point in the history of science. For centuries, the scientific method has been an exclusively human endeavor: humans observe, hypothesize, experiment, conclude, and write up their results. AI systems have served as increasingly sophisticated tools in this process—analyzing data, finding patterns, simulating outcomes—but always under direct orders from a human. Now, for the first time, AI systems are beginning to conduct independent research, including hypothesis generation, experimental design, and paper writing.

This isn’t science fiction or speculation. As I write this in June 2025, multiple AI systems have already had their research accepted at top-tier academic conferences, discovered promising drug candidates, and solved problems that stumped human researchers for years. The phrase "AI research" no longer refers only to research about AI systems—it can increasingly mean research conducted by AI systems.

This transformation has accelerated dramatically over the past six months. Let me highlight two important results that appeared in March:

  • the Sakana AI team, which had previously developed the AI Scientist system for creating end-to-end, fully AI-written research papers, reported that their updated AI Scientist-v2 generated a paper accepted to a workshop at ICLR, a leading conference in machine learning;
  • Google released its Co-Scientist system, and the first report by Prof. Jose Penades of Imperial College London said that Google Co-Scientist had identified, in just 48 hours, the same solution to a longstanding problem in microbiology that Penades’ team had found only after years of research.

I will not cover these results here—for a detailed analysis see, e.g., this blog post. Today, I want to concentrate on two pieces of news that came over the last month; they are surprisingly similar and seem to confirm that AI-produced research is already happening. This post contains case studies of two breakthrough systems: Robin from FutureHouse and Oxford, which shows how AI can accelerate biomedical discovery by orders of magnitude, and Zochi from Intology, which has just had a paper accepted to a top conference.

Birds Go for the Eyes: Finding Treatment for AMD

A decade ago, “AI for medical science” meant little more than applying machine learning to analyze experimental data. Five years ago, we celebrated AlphaFold's protein structure predictions and the first AI-driven drug discovery systems. Today, AI systems don't just analyze—they hypothesize, design experiments, and discover.

Robin, described in a new preprint by FutureHouse and Oxford researchers (Ghareeb et al., May 19, 2025), is a multi-agent “AI scientist” that reads the literature, proposes experiments, interprets the resulting data, and loops until it has a publishable insight—leaving only the physical wet-lab steps and the final paper preparation to humans. In the words of Sam Rodriques, who announced the work on X, the discovery cycle now runs with “the lab in the loop” and can move from a research question to a promising lead in “about ten weeks”.
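
To make the shape of this loop concrete, here is a deliberately toy sketch of a lab-in-the-loop cycle in Python. Every function below is a hypothetical stub standing in for an agent or a human step; none of this is Robin's actual API, and the numbers are meaningless placeholders.

```python
import random

# Hypothetical stand-ins for the roles in a lab-in-the-loop cycle;
# none of these functions correspond to Robin's real interfaces.

def propose_hypotheses(disease):      # literature agents suggest mechanisms
    return [f"{disease}: mechanism {i}" for i in range(1, 6)]

def judge_rank(hypotheses):           # LLM judge orders them by promise
    return sorted(hypotheses, key=lambda h: random.random())

def design_assay(hypothesis):         # agents propose a concrete experiment
    return {"hypothesis": hypothesis, "readout": "cell viability"}

def run_wet_lab(assay):               # the only human step in the loop
    return {"assay": assay, "effect_size": random.gauss(0.3, 0.2)}

def analyze(data):                    # data-analysis agent: is the effect real?
    return data["effect_size"] > 0.5

def lab_in_the_loop(disease, max_rounds=5):
    hypotheses = propose_hypotheses(disease)
    for _ in range(max_rounds):
        best = judge_rank(hypotheses)[0]
        if analyze(run_wet_lab(design_assay(best))):
            return best               # a validated lead
        hypotheses.remove(best)       # discard and loop back
    return None

print(lab_in_the_loop("dry AMD"))
```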

Similar to Google Co-Scientist, Robin is not a single LLM but a team of specialized LLM-based agents that talk to one another through a coordinating notebook (pictures from Ghareeb et al. here and below):

Crow and Falcon perform literature analysis: Crow produces brief topic-specific summaries, e.g. “summarize what were the experimental designs in these twenty papers”, while Falcon digs deep and writes extended technical briefs. Finch is a bit different: it lives downstream of both literature review and experiments, taking raw experimental output—RNA-seq files, flow-cytometry .fcs files, or spreadsheets—and producing quality-controlled analyses, figures, and statistical tests. 

All three work together in a sequence that mirrors the scientific method:

Crow and Falcon choose an assay, Finch analyzes the data, and a separate LLM judge ranks the hypotheses via a tournament-style Bradley-Terry model (a statistical method for ranking items through pairwise comparisons).
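
If you have not met the Bradley-Terry model before, the idea is to fit a latent "strength" to every item purely from pairwise preferences; ranking by strength then aggregates many noisy judge votes into one ordering. Here is a minimal sketch using the classic iterative (MM) update; the win counts are made up for illustration and have nothing to do with Robin's actual data.

```python
# Minimal Bradley-Terry fit via the classic iterative (MM) update.
# wins[(i, j)] = how many times the judge preferred hypothesis i over j;
# the counts below are illustrative, not from the Robin paper.
wins = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("A", "C"): 2, ("C", "A"): 2,
    ("B", "C"): 1, ("C", "B"): 4,
}
items = sorted({name for pair in wins for name in pair})
p = {name: 1.0 for name in items}  # initial strengths

for _ in range(100):  # iterate the update until it (approximately) converges
    new_p = {}
    for i in items:
        total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
            for j in items if j != i
        )
        new_p[i] = total_wins / denom if denom else p[i]
    norm = sum(new_p.values())
    p = {name: v / norm for name, v in new_p.items()}  # keep scale fixed

print(sorted(p.items(), key=lambda kv: -kv[1]))  # hypotheses, best first
```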

For the paper, the authors ran Robin on ten different diseases, from non-alcoholic steatohepatitis to Friedreich’s ataxia, asking it to look for existing drugs that could be repurposed for each of them:

Drug repurposing is a very promising avenue for LLM-based research: first, if the drug already exists, there is literature to analyze; second, a drug that has already been approved by the FDA or similar authorities in other countries will have a much shorter time to market if proven effective for a different disease. The authors list a few examples of how drug repurposing has succeeded years after the original research: for instance, it took 30 years from the approval of ketamine for anesthesia in 1970 to the discovery of its antidepressant properties in 2000.

In each case, Robin produced a ranked shortlist of repurposing candidates, complete with mechanisms, safety considerations, and even catalogue numbers. Even more impressively, Robin also performs an active evaluation, running multiple independent Finch trajectories per dataset and a meta-analysis to converge on a consensus.
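
The paper does not spell out the aggregation rule, but conceptually the consensus step can be as simple as keeping the candidates that independent runs agree on. A toy version, with placeholder drug names and a majority threshold that is my assumption rather than a documented Robin parameter:

```python
from collections import Counter

# Toy consensus across independent analysis trajectories: keep only the
# candidates that a majority of runs propose. Names are placeholders,
# and the 50% threshold is an assumption, not a Robin parameter.
trajectories = [
    ["drug_a", "drug_b", "drug_c"],   # shortlist from Finch run 1
    ["drug_a", "drug_c", "drug_d"],   # run 2
    ["drug_b", "drug_a", "drug_c"],   # run 3
]

votes = Counter(candidate for run in trajectories for candidate in run)
consensus = [c for c, n in votes.items() if n > len(trajectories) / 2]
print(consensus)  # ['drug_a', 'drug_b', 'drug_c']
```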

So what is their best result right now? The centerpiece of the paper is a promising drug for the treatment of dry age-related macular degeneration (dAMD), which they call “the major common cause of blindness in the developed world” (alas, I’m far from being a doctor, so I can only recommend reading other sources). It was a full lab-in-the-loop cycle targeting dry AMD: starting with only the disease name, Crow generated 151 literature digests and found ten putative pathogenic mechanisms. An LLM judge chose the most promising idea, and then Robin designed an assay and proposed 30 candidate compounds.

No one is claiming they have found a cure for dAMD: to prove efficacy in patients, researchers will still need to run the usual battery of lab and clinical trials. But the speedup is already here: a closed-loop AI system has generated a biologically novel, experimentally validated therapeutic hypothesis in weeks rather than years.

While Robin demonstrates AI's capacity to accelerate biomedical discovery, the question remains: can AI-generated research meet the rigorous standards of peer review in computer science itself? Enter Zochi.

LLM-Based Prospero: The Tempest by Zochi Accepted to ACL 2025

On May 27, the startup Intology revealed that its LLM-based scientist Zochi had just had a full-length paper accepted into the main proceedings of ACL 2025. This is no small feat: main-track ACL papers face an acceptance rate of around 20%, and Zochi’s submission earned a meta-review score of 4.0, placing it within the top 8% of over 3000 submitted manuscripts. Many human researchers spend months crafting papers that ultimately fail to achieve such scores. Moreover, Intology claims that the system, and not its human creators, drove every step of the research except some polishing of the figures and LaTeX. If Robin from FutureHouse feels like AlphaGo’s “move 37”, Zochi is the moment AlphaGo won the whole match.

Just like Robin, Zochi (see the tech report by Intology) is built as a multi-agent research pipeline whose components play roles quite similar to how human researchers might divide the work. The input can be a general research area or a more specific problem; first, retriever agents go through research repositories looking for salient claims and open problems. Another agent reads the results of this analysis and generates candidate hypotheses. Downstream, other agents produce code for experiments, yet others analyze the data these experiments produce, and finally a writer agent assembles the narrative (source):
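
Based only on this generic description, the pipeline presumably chains these roles together roughly as below; all names and interfaces here are my guesses for illustration, not Intology's implementation.

```python
# A speculative skeleton of a Zochi-style pipeline; the agent functions
# are stubs invented for illustration, not Intology's actual system.

def retrieve(topic):                  # retriever agents mine repositories
    return [f"open problem in {topic} #{i}" for i in range(1, 4)]

def hypothesize(claims):              # hypothesis agent proposes directions
    return [f"candidate method for: {c}" for c in claims]

def run_experiments(hypothesis):      # coding agents implement and run tests
    return {"hypothesis": hypothesis, "score": len(hypothesis) % 7}

def analyze(results):                 # analysis agents pick the best outcome
    return max(results, key=lambda r: r["score"])

def write_up(best):                   # writer agent assembles the narrative
    return f"Draft paper on: {best['hypothesis']}"

claims = retrieve("LLM safety")
results = [run_experiments(h) for h in hypothesize(claims)]
print(write_up(analyze(results)))     # humans step in only to polish
```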

Human overseers intervene at only a few points: before large compute expenditures, and then after all results have been produced, at the final polishing and proofreading step. Intology claims that in practice polishing meant a few additional prompts with guidance on page limits and citation style, while all substantive research results and descriptions stayed fully AI-generated.

Unfortunately, unlike Robin, the tech report about Zochi is not too revealing; this picture and a generic description of the agents are basically all we have. But we do have some samples of what Zochi can achieve—and they are very impressive!

The first result, called CS-ReFT (Constrained Subspace Representation Fine-Tuning), elegantly solves a critical problem in model adaptation. By finding orthogonal subspaces for each task within a frozen Llama-2-7B model, it significantly improves the efficiency of fine-tuning: with under 0.01% new parameters, it surpasses GPT-3.5 on AlpacaEval (93.9% win rate) while using 12.7x fewer extra parameters than previously known LoRA variants. I’m quoting the tables from the tech report; Table 1 below, on CS-ReFT, shows just how impressive the improvements are.
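
The tech report does not give the exact parameterization, but the flavor of representation fine-tuning is easy to sketch. Below is a rough PyTorch illustration in the spirit of LoReFT-style interventions, where hidden states are edited only inside a learned low-rank subspace and each task gets its own intervention; the cross-task orthogonality constraint that gives CS-ReFT its name is only noted in a comment, and nothing here is the actual CS-ReFT code.

```python
import torch
import torch.nn as nn

# A rough, illustrative sketch of a ReFT-style subspace intervention:
# hidden states are edited only inside a low-rank subspace spanned by the
# orthonormal rows of R, following the LoReFT-style rule
#   h <- h + (W h + b - R h) mapped back through R.
# CS-ReFT additionally constrains different tasks' subspaces to be
# mutually orthogonal; that cross-task constraint is omitted here.

class SubspaceIntervention(nn.Module):
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # Orthonormal rows for R, enforced via a built-in parametrization.
        self.R = nn.utils.parametrizations.orthogonal(
            nn.Linear(hidden_dim, rank, bias=False)
        )
        self.W = nn.Linear(hidden_dim, rank)  # learned edit in the subspace

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        delta = self.W(h) - self.R(h)         # edit inside the subspace
        return h + delta @ self.R.weight      # map it back to hidden_dim

hidden_dim, rank = 64, 4                      # tiny toy sizes
h = torch.randn(2, 10, hidden_dim)            # (batch, seq, hidden) states
edits = {t: SubspaceIntervention(hidden_dim, rank) for t in ("qa", "summ")}
print(edits["qa"](h).shape)                   # torch.Size([2, 10, 64])
```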

But it was the second result, called Siege in the tech report and Tempest in the final paper (Zhou and Arel, 2025, although they almost didn’t write it at all), that got Zochi accepted to ACL 2025. This is actually a paper on AI safety, and the main problem it tackles is that most existing benchmarks for LLM jailbreaking only consider single-prompt jailbreaks, while a real attacker is likely to engage in multi-turn dialogue while trying to steer the LLM into a jailbreak.

Tempest/Siege recasts the adversarial jailbreaking problem as tree search. Like a chess engine exploring possible moves, it maintains multiple conversation branches and aims to choose the most promising one. On every turn, the attacker LLM spawns multiple candidate next prompts, a judge model assigns each reply a 0-10 “damage” score, and the attacker then tries to exploit partially compliant responses while pruning branches that lead nowhere. Here is an illustration from the paper (produced by humans, I suppose?):
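
To make the control flow concrete, here is a toy skeleton of such a multi-turn tree search. The attacker, target, and judge are random stubs rather than LLMs, and the branching and pruning constants are invented, so this only illustrates the search structure, not the actual Tempest system.

```python
import random

# Toy skeleton of a Tempest-style multi-turn tree search. In the real
# system the attacker, target, and judge are LLMs; here they are stubs
# so that the search structure itself is runnable.

BRANCH, KEEP, MAX_TURNS = 3, 2, 5    # invented constants for illustration

def attacker(history):               # spawn candidate next prompts
    return [history + [f"prompt variant {i}"] for i in range(BRANCH)]

def target_reply(history):           # stub for the model under attack
    return f"reply to: {history[-1]}"

def judge(reply):                    # 0-10 "damage" score, stubbed as random
    return random.randint(0, 10)

def tree_search():
    frontier = [[]]                  # each branch is a conversation so far
    for _ in range(MAX_TURNS):
        scored = []
        for history in frontier:
            for branch in attacker(history):
                score = judge(target_reply(branch))
                if score == 10:      # full jailbreak: return the dialogue path
                    return branch
                scored.append((score, branch))
        scored.sort(key=lambda s: -s[0])          # exploit partial compliance,
        frontier = [b for _, b in scored[:KEEP]]  # prune dead-end branches
    return None

print(tree_search())
```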

The results were excellent: on JailbreakBench, Tempest hit a 100% success rate against GPT-3.5-turbo and 97% against GPT-4 in a single five-turn run; see Table 2 quoted above. It showed that LLM safety against jailbreaks is path-dependent: a flexible attacker can gradually “groom” the LLM into a jailbreak. No wonder the ACL reviewers praised the “effective, intuitive, and reasonably simple” design, urged the community to rethink defensive strategies, and promptly accepted the paper.

There is also a third research result by Zochi, on computational biology, but since I am far less qualified to comment on that one, let me just link to the tech report again; Intology reports that this work was finalized after the deadline of the corresponding ICLR workshop and was submitted to a journal.

Intology plans a public beta later this year: first as a domain-agnostic co-researcher that drafts proposals and survey papers, and later as a full end-to-end “PhD-level” agent.

Conclusion

The emergence of Robin and Zochi as successful AI researchers marks more than a technological milestone—it represents a fundamental shift in how we create knowledge. When an AI system can go from disease name to validated drug candidate in ten weeks, or when it can independently discover novel machine learning techniques that outperform human-designed methods, we're witnessing the birth of a new kind of scientific revolution.

Unlike previous revolutions in science, this one doesn't just give us new tools or theories—it introduces new kinds of scientists. Robin and Zochi aren't simply faster versions of human researchers; they operate in fundamentally different ways. They can read every paper in a field, maintain perfect recall of experimental results, and explore hypothesis spaces too vast for human consideration. They don't get tired, don't have biases toward pet theories, and don't need to worry about career advancement or publication metrics. Robin and Zochi are just the beginning; AI systems are already transforming scientific research, and this is the worst AI systems will ever be.

The implications are both obvious and staggering at the same time. If AI systems can generate novel hypotheses and validate them experimentally, what happens to the pace of scientific discovery? If a single AI model can read every paper ever written in a field and generate insights no human could achieve, how do we evaluate and build upon such work? And perhaps most provocatively: if AI systems become better researchers than humans in certain domains, what role remains for human scientists? How do we adapt to a world where the most important discoveries might come from silicon-based researchers rather than carbon-based ones?

The optimistic view is that AI researchers will amplify rather than replace human scientists. Just as calculators did not eliminate mathematicians but freed them to work on higher-level problems, AI researchers might liberate humans from literature reviews and routine experimentation to focus on asking better questions, setting ethical boundaries, and interpreting results in broader contexts that require human values and judgment. The pessimistic view, obviously, is that human researchers will be increasingly displaced as the AI revolution proceeds at full steam and AI systems become ever more capable.

The age of AI researchers is upon us. Let's make it an age of wisdom, not just intelligence.
