2025 has been a breakthrough year for artificial intelligence. To be honest, every year has been a breakthrough year for AI lately, and we don’t expect 2026 to, pardon the pun, break this trend. But the specific results and subfields do change from year to year, and the New Year is always a great opportunity to look back at the most important stuff.
I break this review up into several categories and discuss the most important directions inside each of them. For every category, apart from the most important breakthroughs that made the news throughout 2025, I also include a section with a couple of academic papers that seem to be most interesting to me, often with some potential to redefine the corresponding space.

The Year of Reasoning Models
If I have to single out one specific idea that has defined 2025, it is definitely reasoning, as in large reasoning models. OpenAI’s o1 series of models sparked this trend in late 2024, but it began in earnest when DeepSeek won the replication race with its R1 model (DeepSeek-AI, Jan 2025).
Then every notable AI lab followed suit; here I will only mention the latest offerings. Google launched Gemini 3 Pro and Deep Think in November 2025, and Gemini 3 Pro became the first model to break the 1500 Elo barrier on LMArena (the leaderboard is always in flux, but at some point it did score over 1500). Anthropic rolled out Claude 4.5 Haiku, Sonnet, and Opus between September and November, with Sonnet achieving 77.2% on SWE-bench Verified—the highest score for real-world coding tasks. OpenAI responded in December with GPT-5.2 in three variants: Instant for quick responses, Thinking for deeper reasoning, and Pro for maximum accuracy. The Chinese labs continued to impress: DeepSeek-V3.2 integrated reasoning directly into tool use, while Alibaba's Qwen3-235B became one of the strongest open-weight MoE models with 235 billion parameters (22 billion active). Meta iterated on Llama 4 with Scout and Maverick variants, while xAI's Grok-4.1 competed near the top of reasoning leaderboards.
The main idea of reasoning models is that you give an LLM a hidden scratchpad where it can write intermediate tokens that do not count toward the reward. This makes reasoning a natural fit for reinforcement learning: reasoning tokens are intermediate moves, and you get a reward (win or lose the game) only when the model writes out the final result.

These models often come with a chain-of-thought mode or a slider to adjust how much reasoning the model applies to a query. GPT-5.2, for instance, can internally decide whether to perform a chain of thought and how many tokens to spend on it. This leads to more accurate solutions on tasks like math word problems, code generation, and complex QA. The "adaptive reasoning" approach lets models answer simple queries within seconds while spending ten seconds or more on complex reasoning, reportedly using approximately 50% fewer tokens than competitors at similar quality levels.
Reinforcement learning with verifiable rewards
Reasoning is enabled by reinforcement learning with verifiable rewards (RLVR). The key insight, as Andrej Karpathy explained in his “2025 LLM Year in Review” post, is that in RLVR, the reward can be computed automatically (like checking if a math problem answer is correct) rather than only by extrapolating human preferences, as in RLHF:

It turns out that by training LLMs against automatically verifiable rewards across environments like math and code puzzles, the LLMs spontaneously develop strategies that look like "reasoning" to humans. They learn to break problem solving down into intermediate calculations and develop strategies for iterating back and forth to figure things out. Running RLVR turned out to offer high capability per dollar and a better use for all the compute that had originally been intended for pretraining; Karpathy marks this as perhaps the most important trend of 2025.
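To make the contrast with RLHF concrete, here is a minimal sketch of what a verifiable reward can look like, assuming a math-style task with a known final answer; the answer format and function name are illustrative, not taken from any particular paper:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """A reward that can be computed automatically, with no human preference model.

    Assumes the model was prompted to end its trace with a line like 'Answer: 42';
    everything before that line is the scratchpad, which earns no direct reward.
    """
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0  # no final answer produced
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0  # binary outcome reward

# The RL loop samples reasoning traces and reinforces those that end correctly;
# the intermediate reasoning tokens are shaped only indirectly, through the outcome.
print(verifiable_reward("Let me check: 6 * 7 = 42.\nAnswer: 42", "42"))  # 1.0
```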
Reasoning also naturally leads to test-time compute scaling: giving models more "thinking time" can significantly improve results. Machine learning has seen few settings where you could effectively trade inference-time compute for better results; now every frontier model has this capability. There are even theoretical models of scaling laws for test-time compute. We will discuss test-time scaling laws in detail below, but let's start with the shiny new results.
Reasoning Models for Math and Coding

RLVR applies especially well to fields like mathematics and coding, where solutions can be verified programmatically. This brings us to some of the year's most impressive achievements.
First, the International Mathematical Olympiad (IMO) has become the de facto benchmark for AI mathematical reasoning. In 2025, both Google DeepMind and OpenAI achieved gold medal level performance, with their systems scoring 35 out of 42 points—enough to place in the top 8% of the world's most mathematically gifted high school students.
Google's advanced Gemini Deep Think solved five out of six problems perfectly within the 4.5-hour competition time limit. Unlike last year's AlphaProof and AlphaGeometry systems, which required experts to translate problems into domain-specific formal languages like Lean, Deep Think operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions. The key innovation seemed to be "parallel thinking", where the model can simultaneously explore and combine multiple possible solution strategies before giving a final answer, rather than pursuing a single linear chain of thought. OpenAI achieved an equivalent score with their IMO-focused model, reportedly using general-purpose RL and test-time compute scaling with minimal IMO-specific training.
Perhaps most remarkably, DeepSeek joined this elite club by open-sourcing DeepSeekMath-V2, the first openly available model to achieve gold-level IMO performance (Shao et al., Nov 2025). It solved 5 out of 6 problems and scored a near-perfect 118/120 on the Putnam 2024 competition, surpassing the highest human score of 90. The model's innovation is a self-verification framework where a proof verifier assesses the rigor and completeness of proofs generated by a proof generator, mimicking the self-checking process of human mathematicians. Results improve with more iterations of self-verification:

The mathematical reasoning revolution extended to competitive programming. In September, both OpenAI and Google pulled off similar feats at the International Collegiate Programming Contest (ICPC)—notably with novel, previously unpublished problems. DeepSeek-V3.2 achieved particularly impressive competition results: IMO 2025 Gold Medal (35/42), IOI 2025 Gold Medal (492/600, 10th place), ICPC World Finals 2nd place (10/12 problems), and CMO 2025 Gold Medal.
One could say these victories demonstrate that open-source models can match or exceed proprietary systems in specialized mathematical reasoning… but these are competitions, i.e., benchmarks intended to contain human-invented problems with known solutions. What about “real math”, as in proving new results? We will get back to AI for research in a later installment, but for now let us press on with the LLMs.
Reasoning + Tools = Agents

The real power of these reasoning LLMs emerged when they were coupled with tool use. With built-in support for calling APIs, running code, or searching the web, a reasoning LLM becomes an autonomous agent that can break a task into sub-tasks, execute them, and loop until done.
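Schematically, the agentic loop is very simple; here is a rough sketch in which `llm` and the tool functions are placeholders rather than any real API:

```python
def run_agent(task: str, llm, tools: dict, max_steps: int = 20) -> str:
    """Minimal reason-act loop: at each step the model either calls a tool or finishes.

    `llm` is assumed to return a dict such as {"action": "search", "input": "..."}
    or {"action": "final", "input": "<answer>"}; real agent frameworks differ in detail.
    """
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))       # the model decides on the next move
        if step["action"] == "final":
            return step["input"]             # done: return the answer
        tool = tools[step["action"]]         # e.g. run_code, web_search, read_file
        observation = tool(step["input"])    # execute the tool call
        history.append(f"{step['action']}({step['input']!r}) -> {observation}")
    return "Stopped after reaching max_steps"
```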
Model Context Protocol
The Model Context Protocol (MCP), released by Anthropic in November 2024, achieved industry-wide adoption in 2025. OpenAI adopted MCP in March; Microsoft and GitHub joined the steering committee in May; and in December, the protocol was donated to the Linux Foundation's Agentic AI Foundation, co-founded by Anthropic, Block, and OpenAI with support from Google, Microsoft, AWS, and others. By the year's end, MCP had 97 million monthly SDK downloads, thousands of community-built MCP servers, and 75+ connectors in Claude alone. Here is a plot from the Latent Space podcast post whose authors were proud that they had predicted MCP’s success back in March:

MCP essentially standardizes how AI agents can access external tools, transforming them from single-model assistants into tool-orchestrating agents.
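For a feel of what that standardization looks like in practice, here is a tiny tool server in the spirit of the official Python SDK's FastMCP quickstart; treat the exact import path and decorators as approximate rather than authoritative:

```python
# A minimal MCP-style tool server, modeled on the FastMCP quickstart pattern;
# check the current SDK documentation for the exact API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers; the docstring becomes the tool description an agent sees."""
    return a + b

@mcp.tool()
def read_note(name: str) -> str:
    """Return the contents of a named note (a hypothetical local store)."""
    with open(f"notes/{name}.txt") as f:
        return f.read()

if __name__ == "__main__":
    mcp.run()  # any MCP-capable client can now discover and call these tools
```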
Computer use capabilities improved dramatically. Claude's performance on OSWorld (the benchmark for desktop automation) jumped from 14.9% to 61.4% over the course of the year—approaching the 70-75% human baseline. OpenAI's Operator, launched January 2025 as a research preview and integrated directly into ChatGPT by July, is already partnering with DoorDash, Instacart, Uber, and other services for real-world task execution.
Coding agents
Perhaps the most important field for LLM agents so far has been coding. It is a domain where the direction of scaling is clear and where results are relatively easy to verify. The famous METR time-horizon graph now has Claude Opus 4.5 in the lead, approaching 5-hour tasks completed with 50% success rate. This is frighteningly close to replacing human programmers altogether:

So how has this progress been achieved? Let me begin with a few influential papers.
ReTool (Feng et al., Apr 2025) demonstrated that RL can teach models when and how to invoke code interpreters during reasoning—a capability that supervised learning alone cannot provide. They use PPO-based RL training, modified somewhat to better reflect tool-integrated reasoning:

Using a deliberately simple binary accuracy reward, ReTool achieves 72.5% on AIME2024 compared to 44.6% for o1-preview. Perhaps more remarkably, the model exhibits emergent metacognitive capabilities: it learns to recognize and fix erroneous code based on interpreter error messages, reflecting in natural language ("Oops, the functions need to be defined in the same scope") before generating corrected versions. This "code self-correction" was never explicitly trained but emerges from outcome-driven RL.
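My own much-simplified rendering of such a tool-integrated rollout, with a binary outcome reward at the end, would look roughly like this; the tags, stop strings, and the `run_code` helper are illustrative, not ReTool's actual interface:

```python
def rollout_with_interpreter(prompt: str, llm, run_code, ground_truth: str) -> float:
    """Generate a reasoning trace where <code>...</code> blocks are executed mid-rollout.

    `llm(context, stop=...)` is assumed to continue the text until one of the stop
    strings; `run_code` executes the code and returns stdout or an error message.
    """
    context = prompt
    while True:
        chunk = llm(context, stop=["</code>", "</answer>"])
        context += chunk
        if "<code>" in chunk:
            code = chunk.split("<code>", 1)[1]
            result = run_code(code)                              # interpreter feedback,
            context += f"</code>\n<output>{result}</output>\n"   # appended to the context
        else:
            break  # the model produced a final answer instead of more code
    predicted = context.rsplit("<answer>", 1)[-1].strip()
    return 1.0 if predicted == ground_truth else 0.0  # deliberately simple binary reward
```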
Search-R1 (Jin et al., Mar 2025) applied similar principles to web search, training LLMs to autonomously formulate queries during multi-turn reasoning with real-time retrieval. Unlike RAG, which retrieves once and hopes for the best, Search-R1 learns to search iteratively, refining queries based on what it finds. A key technical innovation is "retrieved token masking"—excluding retrieved content from the RL loss computation to prevent unintended learning dynamics. The result is 24% improvement over RAG baselines on question-answering benchmarks.
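The masking idea itself is easy to state in code; here is a hedged sketch of how one could exclude retrieved spans from a policy-gradient loss (tensor names and shapes are my own, not the paper's):

```python
import torch

def masked_policy_gradient_loss(logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                retrieved_mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style per-token loss that ignores tokens copied from search results.

    logprobs, advantages, retrieved_mask: all of shape (batch, seq_len);
    retrieved_mask is 1 for retrieved tokens, 0 for tokens the model generated itself.
    """
    generated = 1.0 - retrieved_mask               # keep only model-generated tokens
    per_token = -logprobs * advantages * generated
    # Normalize by the number of generated tokens so that long retrieved passages
    # do not dilute or distort the gradient signal.
    return per_token.sum() / generated.sum().clamp(min=1.0)
```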
A conceptual reframing came from "Thinking vs. Doing" (Shen et al., Jun 2025), which argues that for interactive agents, "test-time compute" should include not just longer reasoning traces but also more interaction steps:

Scaling environment interactions allows agents to acquire new information, explore alternatives, backtrack, and dynamically re-plan—capabilities that no amount of internal reasoning can provide. Their TTI (Test-Time Interaction) framework achieves state-of-the-art results on WebVoyager (64.8%) and WebArena (26.1%) using only a 12B Gemma model, outperforming agents trained with traditional approaches focused on per-step reasoning depth.
However, unsolved problems still remain. For example, τ²-Bench (Barres et al., Jun 2025) introduced a crucial distinction for agent evaluation: dual-control environments where both the agent and the user can act, as in real customer support scenarios. Current benchmarks assume the user is a passive information provider, but τ²-Bench models the more realistic case where agents must guide users through actions while users execute on their own devices:

The results are humbling: state-of-the-art LLMs show 18-25% performance drops when moving from autonomous operation to collaborative mode, pinpointing communication and coordination as critical bottlenecks beyond pure reasoning capabilities.
Claude Code
In practice, the leading agentic release of 2025 has undeniably been Claude Code. It operates directly in your terminal, understands your codebase through agentic search, and executes multi-file changes without manual context selection.

As Andrej Karpathy put it, it's "a little spirit/ghost that lives on your computer". Unlike traditional code assistants that suggest code for humans to review, Claude Code takes autonomous action—it runs commands, creates pull requests, and handles git workflows through natural language commands.
The key insight is that Claude Code works best when treated like a junior engineer with tools, memory, and iteration—not a magic code generator. Test-driven development becomes particularly powerful: you ask Claude to write tests based on expected input/output pairs, explicitly tell it to confirm the tests fail, then ask it to implement the code to make them pass. METR estimates that Opus 4.5 can sometimes complete tasks that take humans almost five hours. Projects with well-maintained CLAUDE.md files—which encode project-specific knowledge, repository etiquette, and environment setup—get significantly better results.
Beyond coding, Claude Code has proven to be a general-purpose agent. Users have had it prepare tax filings by analyzing bank statements and invoices, book theater tickets by checking calendar availability and browsing websites, and process business documents. The "Code" in its name undersells the product: it's really a general-purpose AI agent that can do almost anything on your computer, using code as its interface to digital tasks.
I have been focusing on Claude Code, but it is just the best current offering among many; e.g., OpenAI’s GPT-Codex series (e.g., GPT-5.1-Codex-Max) is also excellent at autonomous coding. I expect 2026 to be the year when browser agents and general computer-use agents cross the threshold into wide usage. Anthropic’s CoWork, just announced as a research preview, may well be the first really important piece of AI news in 2026.
Scaling Laws for Test-Time Compute and Beyond

I mentioned test-time compute scaling laws in the first section, and the topic certainly deserves a closer look. 2025 saw significant research clarifying how to scale inference compute effectively—a question that turns out to have no single answer.
There is no optimal strategy
The Art of Scaling Test-Time Compute (Agarwal et al., Dec 2025) conducted the first large-scale systematic comparison of test-time scaling strategies, generating over 30 billion tokens across eight open-source models. Their central finding is that no single strategy universally dominates: the optimal approach depends critically on model type and compute budget. They introduce a useful distinction between "short-horizon" models (which benefit from shorter reasoning traces regardless of difficulty, often trained with GRPO) and "long-horizon" models (which benefit from extended reasoning on hard problems, trained with alternative RL methods like GSPO).

For practitioners, this means the choice between strategies like majority voting, beam search, and "first finish search" should be informed by knowing which category your model falls into.
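As a reference point for what these strategies actually are, here is a sketch of the two simplest ones; `generate` and `score` stand in for a sampling call and a reward model, and the answer format is an assumption:

```python
from collections import Counter

def extract_answer(solution: str) -> str:
    """Pull the final answer out of a solution trace (assumes an 'Answer:' line)."""
    return solution.rsplit("Answer:", 1)[-1].strip()

def majority_vote(prompt: str, generate, n: int = 16) -> str:
    """Self-consistency: sample n solutions and return the most common final answer."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(prompt: str, generate, score, n: int = 16) -> str:
    """Best-of-N: sample n solutions and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return extract_answer(max(candidates, key=score))
```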
A striking demonstration of compute-optimal inference came from "Can 1B LLM Surpass 405B LLM?" (Liu et al., Feb 2025). The answer is yes—with the right test-time scaling strategy, a 1B parameter model can outperform a 405B model on MATH-500, and a 7B model can exceed both o1 and DeepSeek-R1 on AIME2024:

The key insight is that the optimal scaling method depends on both model size (search-based for small models, Best-of-N for large) and problem difficulty. Their compute-optimal strategies achieve up to 256x better efficiency than majority voting, suggesting that the future of inference may involve smaller, efficiently-scaled models rather than simply deploying the largest available.
Where should you spend your computational budget?
The question of when to allocate compute to generation versus verification was formalized by "When To Solve, When To Verify" (Singhi et al., Apr 2025). Counter to intuition, they found that Self-Consistency (generating many solutions and majority voting) outperforms Generative Reward Models (GenRM) at practical compute budgets. GenRM requires approximately 8x more compute just to match Self-Consistency, and 128x more to achieve a modest 3.8% improvement:

This suggests that for most deployments, scaling solution generation remains more efficient than investing in verification—though the balance shifts for very hard problems where GenRM shows up to 30% relative improvement.
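The accounting behind this comparison is straightforward; here is a rough sketch under the simplifying assumption that one verification pass costs about as much as generating one solution:

```python
def candidates_under_budget(budget_units: int, verifications_per_solution: int = 0) -> int:
    """How many candidate solutions fit into a fixed compute budget, measured in
    'one solution generation' units, if each also receives GenRM verification passes."""
    cost_per_candidate = 1 + verifications_per_solution
    return budget_units // cost_per_candidate

budget = 64
print(candidates_under_budget(budget))      # Self-Consistency: 64 solutions to vote over
print(candidates_under_budget(budget, 7))   # GenRM with 7 verifications each: only 8 solutions
# At a fixed budget, verification eats most of the compute, which is why scaling
# generation tends to win until budgets get very large or problems very hard.
```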
Process reward models themselves can be scaled at test time, as shown by GenPRM (Zhao et al., Apr 2025). By reformulating step-level verification as a generative reasoning task with explicit chain-of-thought and code verification, GenPRM-7B achieves 80.5% F1 on ProcessBench—surpassing Qwen2.5-Math-PRM-72B (78.3%) despite being 10x smaller.

Their novelty is relative progress estimation, which evaluates whether each reasoning step makes beneficial progress toward the solution rather than simply checking if it lies on any correct path.
S*: Test Time Scaling for Code Generation (Li et al., Feb 2025) introduced the first hybrid test-time scaling framework specifically for code. The key insight is that code uniquely allows programmatic verification, enabling a two-stage approach: (1) generate multiple solutions with iterative debugging guided by public test execution, then (2) select the best through "adaptive input synthesis"—prompting an LLM to generate distinguishing test cases tailored to each pair of candidate solutions.

DeepSeek-R1-Distill-Qwen-32B + S* achieves 85.7% on LiveCodeBench, approaching o1-high at 88.5%. Notably, S* enables instruction-based models to surpass reasoning models (GPT-4o-mini + S* beats o1-preview by 3.7%), suggesting that careful inference-time strategies can substitute for expensive reasoning-focused training.
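The selection stage can be sketched roughly as follows; this is hedged pseudocode in the spirit of the paper's description, with placeholder `llm` and `run_program` helpers rather than the authors' implementation:

```python
def adaptive_selection(candidates: list, llm, run_program) -> str:
    """S*-style selection sketch: for each pair of candidate programs, ask an LLM for an
    input on which they might differ, execute both, and keep the one judged correct."""
    best = candidates[0]
    for challenger in candidates[1:]:
        test_input = llm(
            f"Give one input that could distinguish these two programs:\nA:\n{best}\n\nB:\n{challenger}"
        )
        out_a = run_program(best, test_input)
        out_b = run_program(challenger, test_input)
        if out_a == out_b:
            continue  # indistinguishable on this input; keep the current best
        verdict = llm(
            f"For input {test_input!r}, A returned {out_a!r} and B returned {out_b!r}. "
            f"Given the problem statement, which output is correct? Answer A or B."
        )
        if verdict.strip().upper().startswith("B"):
            best = challenger
    return best
```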
New Architectures
It's quite possible that in 2026 we'll see Transformers, if not replaced, then at least supplemented by other architectures. And this, of course, requires a separate post, or even several.
Luckily, I've already written this separate post: “Beyond Transformers: Promising Ideas for Future LLMs”. So here let me refer to that post and simply name three areas that seem the most promising to me at the moment:
- Google Titans, released at the turn of 2025; let this architecture be a stand-in for explicit memory mechanisms and memory-as-test-time-learning, where different approaches are being developed;
- Mamba-like state-space models, which have continued their development in 2025 with, e.g., Mamba 3 (Sep 2025);
- diffusion LLMs, where in 2025, long-standing diffusion language models (Li et al., 2022) finally began to scale in the form of LLaDA (Large Language Diffusion Models; Nie et al., 2025).
In short, neural network architectures are constantly evolving, and new ideas are constantly emerging.
Honorable Mentions

As noted in the previous section, I have already reviewed three promising architectural directions—diffusion-based LLMs, Mamba-like state-space models, and Google Titans—in a previous post, “Beyond Transformers”. So in this section, I will concentrate on other ideas.
Does RL improve reasoning?
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (Yue et al., Apr 2025). This NeurIPS 2025 paper fundamentally challenges assumptions underlying the reasoning model paradigm. Using pass@k at large k values to assess reasoning boundaries, the authors systematically demonstrate that while RLVR improves average-case accuracy (pass@1), base models consistently achieve broader reasoning coverage at high k values across all tested benchmarks (MATH500, AIME24, LiveCodeBench, and more), model families (7B-32B), and RL algorithms (PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO). The plots inevitably cross at some point:

Perplexity analysis confirms that RLVR-generated reasoning paths already exist within the base model's distribution—the model learns to sample them more efficiently rather than discovering new reasoning capabilities. Crucially, distillation from stronger teachers can genuinely expand reasoning boundaries, whereas RLVR appears limited to improving sampling efficiency within pre-existing capabilities. This suggests current RLVR methods function more as elicitation techniques than capability-building approaches—a significant consideration for understanding both the potential and limits of reasoning model training.
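For reference, pass@k is usually computed with the standard unbiased estimator: sample n solutions per problem, count the c correct ones, and estimate the probability that a random subset of k solutions contains at least one correct answer. The numbers below are purely illustrative, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)

# A base model with broad but unreliable coverage can overtake an RLVR-tuned
# model at large k even if it loses badly at k = 1:
print(pass_at_k(n=256, c=8, k=1))    # 0.03125
print(pass_at_k(n=256, c=8, k=128))  # ≈ 0.997, close to certain success
```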
Useful cognitive patterns for self-improvement
Cognitive Behaviors that Enable Self-Improving Reasoners (Gandhi et al., Mar 2025) provides a mechanistic explanation for why some models self-improve dramatically through RL while others plateau. The authors identify four key cognitive behaviors—verification, backtracking, subgoal setting, and backward chaining—that mirror expert human problem-solving. Models naturally exhibiting these behaviors (like Qwen-2.5-3B) achieve 60% accuracy on the Countdown benchmark after RL training, while models lacking them (like Llama-3.2-3B) plateau at 30% under identical conditions.

The most striking finding: models primed with incorrect solutions that exhibit proper reasoning patterns achieve identical performance to those trained on correct solutions. What matters is the presence of cognitive behaviors, not access to correct answers. This has practical implications for data curation—reasoning structure from weaker models can effectively bootstrap stronger reasoners.
Gated attention
Gated Attention for LLMs (Qiu et al., Sep 2025). New model architectures that could improve training dynamics are always welcome in deep learning. An important (and very practical) result came from this work, which won a Best Paper award at NeurIPS 2025. This was both an experimental and theoretical investigation of adding a simple sigmoid gating mechanism to each attention head’s output. The authors tested over 30 variations and converged on one that gave consistent gains across the board:

The theoretical insight here is that this gating introduces a mild non-linearity inside the softmax attention, which is otherwise a mostly linear operation (apart from the softmax normalization). That non-linearity, combined with making the gate query-dependent and sparse (since the sigmoid output lies between 0 and 1, many heads can effectively turn themselves “off”), helps avoid the “attention sink” problem where a few tokens dominate the attention and gradients. It also improves long-context extrapolation, meaning models can handle inputs longer than those seen in training. This suggests that even slight departures from linearity can change what architectures are able to represent at scale.
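In code, the winning variant boils down to a query-dependent sigmoid gate applied to each head's output before the output projection; the sketch below is my simplified reading of the mechanism, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Standard multi-head attention plus a per-head, query-dependent sigmoid gate
    applied to each head's output before the final projection (a sketch of the idea)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, n_heads)   # one gate logit per head, from the query token
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, h, T, d)
        g = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)   # (B, h, T, 1)
        attn = attn * g                            # heads can switch themselves off per query
        return self.out(attn.transpose(1, 2).reshape(B, T, self.h * self.d))
```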
The fact that this idea was immediately adopted in a production model (Alibaba’s Qwen3 uses it) demonstrates its practical value and shows a tight feedback loop between theory and deployment. We expect more such architectural tuning results going forward, potentially guided by theoretical analysis of the limits of Transformers.
Conclusion
So, what do we have in the field of large language models by the end of 2025?
First, reasoning models are real. This isn't marketing or hype. RLVR really works (although there are some interesting objections), models have truly learned to "think" (albeit in quotes for now—perhaps we’ll come back to this discussion in a later installment), and this leads to a qualitative leap on tasks requiring multi-step reasoning. Gold medals at IMO and ICPC are no longer just pretty numbers for press releases, and I'll probably save an even more convincing demonstration of progress for the section on AI in science.
Second, test-time compute scaling has become a new dimension of optimization. Previously, we mainly thought about how to scale training. Now, inference can also be scaled, and it often turns out to be more effective. Small models with the right test-time scaling can outperform models hundreds of times larger.
Third, LLM agents have finally begun to work in practice. MCP standardized interaction with tools, computer use approached human-level performance, and Claude Code took autonomous agents to a new level, not just in programming. The METR graph with 5-hour tasks already looks daunting to those who make a living programming—and what will it be like in a year?
Fourth, open-source models aren't that far behind. DeepSeek continues to show that open-source models can compete with the best proprietary ones, at least in specialized tasks; but their "specialization" is mathematical reasoning, which isn't all that narrow.
What's next? I believe that even if no new revolutions (like the end of the Transformers era) occur, 2026 will be the year of widespread use and development of what was developed in 2025. Browser agents, computer use, agentic coding—all of these have already become very popular and should firmly establish themselves in the mass market within a year. Furthermore, we may yet see the first serious alternatives to Transformers in the LLM architecture—Mamba, Titans, and diffusion LLMs are all waiting in the wings.
In the next parts of this review, we'll discuss other aspects of artificial intelligence: image-based models, AI safety, robotics, and so on. Happy 2026, and may it be the year of AI progress and diffusion!

