2025 has been a breakthrough year in image-related tasks too. Flow matching has displaced traditional diffusion training, 3D Gaussian splatting now dominates neural scene representation, and foundation models have found their way into medical imaging, where FDA-authorized AI devices now number well over a thousand. Conference best papers show a decisive shift toward feed-forward neural networks replacing iterative optimization (VGGT eliminating bundle adjustment, for example), while autoregressive generation adopted by frontier labs challenges diffusion models. In this second part of our “AI in 2025” review, we cover the mathematical foundations, architectural innovations, and breakthrough applications in this direction.
Generative Models for Images and Video

Consumer-Grade Advances
Image generation has advanced by leaps and bounds in 2025. In March, OpenAI replaced DALL-E 3 with GPT-4o's native image generation capabilities. This wasn't just an upgrade but a fundamental reimagining of how image generation integrates with language models—GPT-4o generates images as part of its multimodal understanding, maintaining conversation context and enabling iterative refinement that standalone image models could not match.
But OpenAI's lead was challenged in August when Google released Gemini 2.5 Flash Image, nicknamed Nano Banana. Built on Google's Gemini architecture, Nano Banana excels at natural language edits, multi-image fusion, and fast iteration. It generates images approximately three times faster than GPT-4o (around 13 seconds versus 44 seconds) while achieving exceptional subject consistency and realistic human faces with fewer artifacts. In November, Google doubled down with Nano Banana Pro (built on Gemini 3 Pro), adding 4K resolution output, perfect text rendering in any style or language, and professional-grade controls.
OpenAI responded in December with GPT-Image-1.5, which the company says is four times faster than its predecessor and significantly better at following prompts and making precise edits. The model-vs-model competition has been fierce: ChatGPT remains among the most reliable tools for accurate text generation in images, while Nano Banana Pro leads in resolution and professional controls. For developers and content creators, the practical recommendation has become to use both: generate initial concepts with the faster Nano Banana, then switch to ChatGPT for text-heavy refinements or character consistency work.
This is a more debatable point, but I do believe that video generation has experienced its “ChatGPT moment” in 2025. OpenAI's Sora 2 (September 30, 2025) delivered synchronized audio—dialogue, music, and sound effects generated in-scene—along with accurate physics simulation, persistent world states across multi-shot sequences, and a "Cameo" feature for identity injection. A major validation came with Disney's $1 billion equity investment and three-year character licensing deal (December 2025): mainstream entertainment is starting to adopt generative models.
It’s not only OpenAI, though. Google's Veo 3 (May 2025) and Veo 3.1 matched Sora's audio synchronization capabilities, with users generating 70+ million videos within months of launch. Google DeepMind CEO Demis Hassabis described it as "AI video generation leaving the era of silent film". Runway’s Gen-4.5 and General World Model (GWM-1) advanced real-time explorable environments and conversational avatars, while Pika Labs continued to iterate on its accessible video generation tools.
Alongside this, 2025 saw diffusion models become more controllable and more multimodal. Preexisting techniques like ControlNet (Zhang et al., 2023) were extended; now it’s common to condition image generation on sketches, segmentation maps, depth maps, or text prompts simultaneously. MIT's HART model achieved 9x faster generation with reduced energy demands on consumer hardware (Tang et al., Feb 2025), and Midjourney V7 refined artistic coherence.
But you have not come here to read about customer-facing models that everyone knows about already. What happened in the world of image-related AI research in 2025?
Diffusion models switch to flow matching
Mathematically speaking, in 2025 diffusion models have definitely moved from discrete Markov chain diffusion to models of continuous evolution. For example, FLUX.1 Kontext (May 2025) used flow matching to enable sophisticated in-context image editing, and this approach was recently scaled up to the FLUX.2 family after a careful study of the resulting latent space (Black Forest Labs, Nov 2025).
In short, in 2025 flow matching became the dominant training paradigm at the theoretical foundation of diffusion models. I recommend an MIT tutorial by Holderrieth and Erives (Jun 2025) that shows how flow matching and diffusion models are mathematically equivalent for Gaussian sources, with the choice of network output parametrization and noise schedule being the primary practical differences.
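To make the objective concrete, here is a minimal sketch of conditional flow matching with a Gaussian source and a straight-line (rectified-flow) interpolation path; this is my own illustration, not any particular paper's code, and `model` is assumed to be any network that maps a noisy sample and a time value to a predicted velocity:

```python
import torch

def flow_matching_loss(model, x1):
    """Conditional flow matching loss (minimal sketch, straight-line path).

    x1: a batch of data samples, shape (B, ...); the source is x0 ~ N(0, I).
    model(x_t, t) is assumed to predict the velocity field at (x_t, t).
    """
    x0 = torch.randn_like(x1)                       # Gaussian source sample
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcastable over x1
    xt = (1 - t_) * x0 + t_ * x1                    # point on the straight path
    target_v = x1 - x0                              # velocity along that path
    pred_v = model(xt, t)
    return ((pred_v - target_v) ** 2).mean()
```

Training reduces to regressing this simple velocity target, which is a large part of why flow matching has become the default recipe.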

Diff2Flow (Schusterbauer et al., Jun 2025) bridges these paradigms by deriving flow-matching-compatible velocity fields from diffusion predictions, enabling efficient transfer from Stable Diffusion models. This is also a good reminder that generative models for images can be adapted for conditional tasks such as depth estimation, and Diff2Flow carries these adaptations over to flow matching models.
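Diff2Flow's actual construction carefully accounts for diffusion noise schedules, but as a toy illustration of the kind of algebra involved: for the straight-line path above, a model that predicts the clean sample can be reinterpreted as a velocity predictor in closed form.

```python
def velocity_from_x1_prediction(x1_hat, xt, t, eps=1e-5):
    # For the straight-line path x_t = (1 - t) * x0 + t * x1 the target
    # velocity is v = x1 - x0; substituting x0 = (x_t - t * x1) / (1 - t)
    # gives v = (x1 - x_t) / (1 - t). So a data-space prediction x1_hat can
    # be turned into a velocity prediction. (A toy illustration only, not
    # Diff2Flow's exact parametrization; t must broadcast over xt.)
    return (x1_hat - xt) / (1 - t + eps)
```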

A very recent theoretical analysis from Liu et al. (Dec 2025) reveals that flow-based models exhibit two-stage dynamics: an early "navigation" phase guided by data mixtures, followed by a "refinement" phase dominated by nearest samples:

In one of the NeurIPS 2025 best papers, "Why Diffusion Models Don't Memorize", Bonnaire et al. (May 2025) address a long-standing puzzle: why don't diffusion models simply memorize their training data? Given their massive capacity and ability to overfit, we might expect them to simply remember training images. This paper provides theoretical and empirical evidence that the iterative denoising process itself acts as an implicit regularizer, biasing the model toward learning generalizable features rather than memorizing specific examples. Using random matrix theory, the authors identify two predictable timescales: a dataset-independent generalization phase, followed by memorization that sets in on a timescale growing linearly with dataset size.

Rectified flow advances continued with Yang et al. (Feb 2025) introducing RFDS (Rectified Flow Distillation Sampling), a method similar to SDS loss for diffusion:

Diffusion Transformers and the scaling revolution
The Diffusion Transformer (DiT) architecture has decisively replaced U-Net for large-scale generation. This revolution had already started in 2024 with works such as the Dynamic Diffusion Transformer (DyDiT; Zhao et al., Oct 2024) and Representation Alignment for generation (REPA; Yu et al., Oct 2024), but in 2025 the transformation seems to have been completed.
Apple's EC-DIT (Sun et al., Jan 2025) scaled to 97 billion parameters using a Mixture-of-Experts architecture with expert-choice routing, achieving a state-of-the-art 71.68% GenEval score. This demonstrates that adaptive compute allocation based on text-image complexity can lead to significant quality gains:

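The key mechanism behind this adaptive allocation is expert-choice routing: instead of each token choosing its top experts, each expert selects the tokens it is most confident about, so harder tokens can receive compute from more experts. Here is a rough sketch of the idea (names and shapes are illustrative, not EC-DIT's actual code):

```python
import torch

def expert_choice_moe(tokens, gate, experts, capacity):
    """Expert-choice routing sketch: experts pick tokens, not vice versa.

    tokens: (num_tokens, d_model); gate: an nn.Linear(d_model, num_experts);
    experts: list of feed-forward nn.Module blocks; capacity: tokens per expert.
    """
    scores = gate(tokens).softmax(dim=-1)            # token-to-expert affinities
    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        # Each expert takes the `capacity` tokens with the highest affinity to it;
        # tokens picked by no expert simply pass through the residual connection
        # in the full architecture.
        weights, idx = scores[:, e].topk(capacity)
        out[idx] += weights.unsqueeze(-1) * expert(tokens[idx])
    return out
```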
Zheng et al. (May 2025) turned to hyperparameter transfer as an important way to improve efficiency. They generalized Maximal Update Parametrization (µP), a method originally developed for regular Transformers that transfers hyperparameters from small to large models (see, e.g., this guide), to DiT variants including PixArt-α and MMDiT. As a result, DiT-XL-2-µP converged 2.9× faster, and the authors scaled MMDiT from 0.18B to 18B parameters with only 3% of the usual tuning cost.
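For intuition, the heart of µP is a set of width-dependent scaling rules that keep optimal hyperparameters stable as the model grows. A much-simplified sketch of the learning-rate part under the usual Adam setup might look like this (the function and its arguments are my own illustration, not the authors' API):

```python
def mup_adam_lr(base_lr, base_width, width, param_kind):
    """Much-simplified µP learning-rate rule for Adam (sketch only).

    Hidden weight matrices get their learning rate scaled down as 1/width
    relative to a small, fully tuned base model, so hyperparameters found at
    base_width transfer to larger widths; vector-like parameters (biases,
    norms, embeddings) keep the base learning rate. The full µP recipe also
    rescales initializations and output multipliers.
    """
    if param_kind == "hidden_matrix":
        return base_lr * base_width / width
    return base_lr
```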

All major 2025 video generators—Sora 2, CogVideoX, HunyuanVideo (13B parameters), and FLUX—now use DiT architectures. The Transformer's predictable loss-to-quality correlation and superior scaling properties have ended the U-Net era for frontier models.
3D Vision and Generation

In 3D vision, the results are, in my opinion, not quite as exciting as in regular image generation, but 2025 has still been a year of great progress.
CVPR best papers went to 3D vision and generation
For instance, a team from Oxford’s Visual Geometry Group (VGG) and Meta AI won CVPR 2025’s best paper prize for “VGGT: Visual Geometry Grounded Transformer” (Wang et al., Mar 2025). This work tackled the notoriously hard problem of reconstructing a 3D scene from multiple 2D images—a task that traditionally involves many stages of geometry: camera calibration, feature matching, bundle adjustment, and so on. VGGT achieved 3D reconstruction from an arbitrary number of images (from 1 up to 100) using a single Transformer network, in a matter of seconds, by jointly solving for camera pose, depth maps, and cross-image correspondences within one unified model.

Perhaps we can now replace complex, brittle, and computationally heavy multi-view stereo pipelines with this much simpler unified model.
The CVPR 2025 Best Student Paper went to a project that crossed vision and physics: “Neural Inverse Rendering from Propagating Light” (Malik et al., Jun 2025). The work addresses the problem of recovering scene properties (geometry, materials, lighting) from images by modeling how light propagates through a scene. Unlike traditional rendering that goes from scene to image, this inverse problem is notoriously ill-posed. The paper introduces differentiable formulations of light transport that enable gradient-based optimization, recovering physically accurate scene decompositions from photographs.
Neural rendering switches from NeRF to Gaussian splatting
Following the NeRF (Neural Radiance Fields) boom of 2020-2022, researchers in 2025 continued to refine neural rendering, but it looks like the field is switching to a new class of methods. 3D Gaussian splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, as documented in Bernhard Kerbl's survey (Oct 2025). It looks like 3DGS is achieving what NeRF had pioneered, but with real-time rendering, minute-scale training, and direct editability.
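For readers who have not looked inside 3DGS: once the Gaussians are projected to screen space and sorted by depth, each pixel's color comes from plain front-to-back alpha compositing, which is a big part of why rendering is so fast. Here is the formula in code form (a minimal sketch that assumes the per-pixel, depth-sorted splat lists are already given):

```python
def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing over depth-sorted splats at one pixel.

    colors: RGB colors of splats covering this pixel, ordered near to far;
    alphas: their opacities after evaluating the projected 2D Gaussian.
    Implements C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for c, a in zip(colors, alphas):
        for k in range(3):
            pixel[k] += transmittance * a * c[k]
        transmittance *= 1.0 - a
        if transmittance < 1e-4:   # early termination once the pixel is opaque
            break
    return pixel
```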
The most important difference is performance: NeRFs typically take seconds to render a single frame, while 3DGS achieves 100+ FPS on modern hardware and trains in minutes, which also makes direct scene manipulation practical. The ICCV 2025 work "NeRF Is a Valuable Assistant for 3D Gaussian Splatting" (Fang et al., Jul 2025) reflects this new reality: NeRF techniques now complement rather than compete with 3DGS, providing initialization and regularization:

A notable contribution in Gaussian splatting was “3D Student Splatting and Scooping (SSS)”, which won an Honorable Mention at CVPR (Zhu et al., Mar 2025). This work revisited 3D Gaussian Splatting, which represents a scene as a cloud of Gaussian density blobs for fast rendering, and addressed two key limitations: the tendency to overfit with too many Gaussians and difficulty handling thin structures and fine details. SSS replaces Gaussians with more flexible Student's t-distribution components ("student") and allows negative as well as positive densities ("scooping"), resulting in more expressive and more compact scene representations that maintain quality with fewer components.

One weak point of Gaussian splatting has always been storage: complex scenes often require millions of Gaussians, which amounts to gigabytes per scene without additional compression. But compression advances in 2025 have also been dramatic. HAC++ (Chen et al., Jan 2025) achieves 100× compression using hash-grid assisted context modeling, while Fast Feedforward 3D Gaussian Splatting Compression (FCGS; Chen et al., Jan 2025) provides optimization-free compression at 10× faster speeds than prior methods.
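A quick back-of-the-envelope calculation shows where the gigabytes come from (the scene size below is only an illustrative assumption):

```python
# Vanilla 3DGS stores, per Gaussian: position (3), scale (3), rotation
# quaternion (4), opacity (1), and 3rd-degree spherical harmonics color
# coefficients (48), i.e., 59 float32 values.
floats_per_gaussian = 3 + 3 + 4 + 1 + 48
bytes_per_gaussian = floats_per_gaussian * 4           # float32
num_gaussians = 5_000_000                               # an illustrative large scene
print(f"{num_gaussians * bytes_per_gaussian / 1e9:.2f} GB")   # ≈ 1.18 GB uncompressed
```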

Real-time rendering reached new heights with NVIDIA's 3DGUT (Wu et al., Mar 2025), supporting distorted cameras (fisheye, rolling shutter) and secondary rays (reflections) at 265+ FPS via the Unscented Transform.

Other notable quality improvements include:
- half-Gaussian kernels for better discontinuity handling in 3D-HGS (Li et al., May 2025),
- high-fidelity reconstruction from just three training views using diffusion priors in generative sparse-view Gaussian splatting (GS-GS; Kong et al., Jun 2025), and
- DashGaussian (Chen et al., Mar 2025), a CVPR 2025 Highlight paper that brings 3DGS optimization down to about 200 seconds.
Here is a comparison of GS-GS reconstruction from three training views with standard Gaussian splatting (shown on the left):

Dynamic scenes and 4D Gaussian representations
Temporal extensions of 3DGS have been developed to enable practical dynamic scene capture. For example, the recently published Anchored 4D Gaussian Splatting (Li et al., Dec 2025) uses anchor points to regulate temporal Gaussian attributes, reducing storage while improving quality:

MEGA (Zhang et al., Jul 2025) achieved memory efficiency gains similar to the ones described above for 3DGS, cutting the number of Gaussians from 13M to 0.91M on the Birthday scene and thus bringing storage down from 7.79 GB to under 1 GB.

But probably the most interesting developments have been related to the same kind of feedforward approaches as FCGS discussed above. For example, Diff4Splat (Pan et al., Nov 2025) achieves 4D synthesis in 30 seconds using a video latent Transformer with no test-time optimization:

Video-based neural rendering has recently advanced through Prior-Enhanced Gaussian Splatting (Shih et al., Dec 2025) for dynamic scene reconstruction from casual video, 7DGS for unified spatial-temporal-angular representation (Gao et al., Mar 2025), and Deformable 2D Gaussian Splatting (D2GV; Liu et al., Mar 2025) for video representation at 400 FPS. I also recommend this collection with nearly 500 papers on Gaussian splatting.
Text-to-3D generation becomes much faster
The text-to-3D pipeline has been revolutionized by feedforward approaches. In particular, Turbo3D (CVPR 2025, Hu et al., Dec 2024) achieved sub-second generation through a four-step, four-view diffusion combined with latent-space Gaussian reconstruction. Dual-teacher distillation ensures both view consistency and photorealism:

DiffSplat (Lin et al., Jan 2025) generates 3D Gaussians directly from text-to-image diffusion models in 1-2 seconds, using 3D rendering loss for cross-view coherence. Their main image gives a nice visual overview of different text-to-3D approaches:

In the second half of 2025, we have seen these important advances scale up and diffuse towards more ambitious applications. For example, Meta's WorldGen (Wang et al., Nov 2025) extends this to end-to-end text-to-3D world generation, creating 50×50 meter navigable environments while maintaining geometric integrity:

Overall, it seems like we are already at the point where you can generate 3D assets from a text prompt quickly and efficiently, but perhaps not yet at the point where the resulting assets can be used as is for a game or virtual environment. We’ll see what the near future brings.
Medical imaging reaches clinical scale

In 3D generation, we are at the stage of “almost production level”. By contrast, medical imaging has long been a field where, in many applications, AI models are “fully production ready” and outperform human doctors, but they have not been used as widely as they could have been, due to regulatory hurdles and the problem of assigning liability.
Finally, it looks like AI models are breaking through in medicine. A recent report (Casey, Dec 2025) summarizes that FDA authorizations for AI-enabled medical devices reached 1,356 devices by September 2025, with over a thousand of them specifically in radiology (77% of approvals).
Notable 2025 FDA clearances include:
- Philips SmartSpeed Precise (Jul 2025), the first integrated dual-AI MRI solution, offering 3x faster scanning and 80% sharper images;
- ArteraAI Prostate (Jul 2025), a de novo pathology AI for prostate cancer treatment prediction with a first-ever predetermined change control plan for digital pathology;
- Galen Second Read (Feb 2025), a pathology AI for cancer detection.
Overall, while I am very much not an expert here, digital pathology reviews of 2025 indicate that the field is finally achieving wide-scale adoption in actual medicine; here is one report that calls 2025 “The Year of Industrialization”.
Here’s hoping that the FDA picks up the pace: while this may sound like a US-only problem, the U.S. is one of the largest markets for medicine, and getting approved there is often a matter of life and death for a medical startup.
But there are other reasons for optimism too. In particular, foundation models in medical imaging are transitioning from research to deployment. For example, the Generalist Medical Segmentation model (MedSegX; Zhang et al., Sep 2025) introduced a Contextual Mixture-of-Adapter-Experts approach for open-world segmentation across 39 organs/tissues. I recommend a comprehensive review by van Veldhuizen et al. (Jun 2025) that covers over 150 studies on foundation models for pathology, radiology, and ophthalmology.
Honorable Mentions
In this section, I will briefly go over several important computer vision subfields that have not been considered above. Generative models tend to overshadow everything these days, but I want to remind myself and you that “regular” vision problems also remain important.
Document and video understanding
DeepSeek-OCR (Wei et al., Oct 2025) introduces "Contexts Optical Compression", compressing documents into compact vision tokens while preserving spatial grounding:

It achieves ~97% decoding precision at 10× compression, processing 200,000+ pages daily on a single A100. Even before that, Alibaba's Qwen2.5-VL (Bai et al., Feb 2025) matched GPT-4o on document tasks with robust structured extraction from invoices, forms, and tables.
Video understanding scaled to hour-length content. The Video-XL family of models that first appeared last year (Shu et al., Sep 2024) was continued with Video-XL-2 (Qin et al., Jun 2025), which introduced task-aware KV sparsification. Hour-LLaVA/VideoMarathon (Lin et al., Jun 2025) introduced a 9,700-hour instruction dataset with 3.3M QA pairs. Apple's SlowFast-LLaVA-1.5 (Xu et al., Mar 2025) achieves state-of-the-art results on LongVideoBench using a two-stream architecture: a slow stream for spatial detail and a fast stream for temporal motion.

In another interesting work, Video-EM (Wang et al., Aug 2025) combines the ideas of human episodic memory and video reasoning with chain-of-thought to allow a video LLM to operate with long context:

Finally, VideoRAG (Ren et al., Feb 2025) enables retrieval-augmented generation for multi-hour videos, combining graph-based textual knowledge grounding with multimodal context encoding for semantic reasoning across extended content.
Segmentation
Although you hardly hear about classical computer vision problems such as object detection and segmentation in the news, they also remain vibrant areas of research!
Here let me highlight Meta's Segment Anything family of models (SAM). The first SAM, released in 2023 (Kirillov et al., 2023), defined the state of the art in segmentation for a long time. In a way, you could say SAM “solved” segmentation, but there is always a possibility to continue research on harder or more general problems.
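As a reminder of what “promptable segmentation” means in practice, here is roughly what a single point-prompt call looks like with the original SAM checkpoint and Meta's segment-anything package (the image path and click coordinates are placeholders):

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Load the released ViT-H checkpoint (file name as published by Meta).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("photo.jpg").convert("RGB"))  # HxWx3 uint8 RGB
predictor.set_image(image)

# A single foreground click as the prompt; SAM returns several mask hypotheses.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[scores.argmax()]   # boolean HxW array
```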
This line continued with SAM 2 (Ravi et al., Jan 2025), which extended segmentation to video with a streaming memory architecture enabling 6x faster inference. Meta also released the SA-V dataset with 50.9K videos and 35.5M segmentation masks:

The latest member of the family, SAM 3 (released in late 2025), extends promptable segmentation to concept-level prompts. Its 848M-parameter multimodal architecture uses a DETR-based decoder with presence tokens for discriminating closely related prompts. It turns out one can still find improvements and new architectural components for a segmentation model even at the end of 2025:

Video object segmentation also advanced, e.g., through LiVOS with linear-complexity attention (Liu et al., CVPR 2025), DMVS, which decouples motion-expression video understanding (Fang et al., CVPR 2025), and M³-VOS, which introduces a benchmark of 479 high-resolution videos for objects undergoing phase transitions (Chen et al., CVPR 2025).
Image editing
Region-Aware Diffusion Models (RAD; Kim et al., CVPR 2025) introduce asynchronous per-pixel noise schedules for image editing, achieving up to 100x faster inference than prior approaches while using only a plain reverse process:

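The core idea is easy to sketch: instead of a single scalar timestep per image, the forward process gets a per-pixel timestep map, so regions to be edited can be noised far more aggressively than regions to be preserved. Here is a toy sketch of that idea (my own illustration, not the authors' implementation):

```python
import torch

def forward_diffuse_per_pixel(x0, t_map, alphas_cumprod):
    """Forward diffusion with a per-pixel timestep map (toy sketch).

    x0: clean image, (B, C, H, W); t_map: integer timesteps, (B, 1, H, W), long dtype;
    alphas_cumprod: 1D tensor of cumulative alpha-bar values per timestep.
    """
    a_bar = alphas_cumprod[t_map]                    # (B, 1, H, W), broadcasts over C
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```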
TurboFill (Xie et al., Apr 2025) adapts few-step text-to-image models for fast inpainting, while HD-Painter (Manukyan et al., Jan 2025) enables high-resolution prompt-faithful completion.
Style transfer continued improving with StyDiff (Sun and Meng, Sep 2025), which combines diffusion with AdaIN to mitigate over-stylization. For restoration, Defusion (Luo et al., Jun 2025) provides all-in-one image restoration guided by visual instructions describing the degradation.
Conclusion
The year 2025 has seen several decisive transitions in the field of AI models for images and video. Flow matching and rectified flow have become the default training paradigm, with theoretical understanding now rigorous enough to explain generalization versus memorization dynamics. DiT architectures at 10-100B scale have completely displaced U-Net, while consistency models and IMM enable practical 1-8 step generation without quality loss.
In 3D vision, Gaussian splatting has basically won; the question is no longer NeRF versus 3DGS but how best to use 3DGS's real-time rendering and direct editing capabilities while alleviating drawbacks such as its large storage footprint. Feed-forward text-to-3D in under one second represents perhaps the year's most transformative capability for content creation.
The replacement of iterative optimization with feedforward neural prediction, which we have seen in this review in VGGT, Turbo3D, and FCGS, may prove to be 2025's most consequential architectural shift, promising speed improvements of orders of magnitude across many areas of computer vision.
There has also been, of course, significant progress in foundational vision-language and other multimodal models, including world models, which I will save for a more detailed discussion in a later installment. For now, let us conclude that we are rapidly heading towards more and more general visual artificial intelligence.

