Summary

  • On June 25, DeepReinforce unveiled Ornith-1.0 under the MIT license, designed specifically for AI coding agents in real-world terminal and repository settings.
  • The 9B version achieved a score of 69.4 on SWE-bench Verified, surpassing Google's Gemma 4-31B which scored 52.0.
  • Ornith's model card indicates that these models may not perform well on tasks outside coding, as they are tailored for developer workflows rather than general AI interactions.

DeepReinforce, known for projects like CUDA-L1 and the IterX code-agent optimization loop, launched Ornith-1.0 last week. This suite of open-source coding models is available on Hugging Face in four parameter sizes: 9 billion, 31 billion, 35 billion mixture of experts, and a flagship 397 billion mixture-of-experts model, all released under the MIT license without regional limitations.

In AI, parameters are the numerical values (weights) a model adjusts during training. More parameters typically indicate a more capable model. The 9 billion parameter model is considered small and can operate on high-end smartphones, though it struggles with complex reasoning tasks. Conversely, the 397 billion model is highly capable but requires substantial computing resources not found in consumer devices.

The lab describes the models as "a self-improving family of open-source models specifically for agentic coding tasks," putting the emphasis on agentic capabilities.

Aloha! 🌺 Introducing Ornith-1.0, a collection of open-source LLMs designed for agentic coding.

Ornith-1.0 includes various parameter sizes such as 9B Dense, 31B Dense, 35B MoE, and 397B MoE. It achieves cutting-edge performance compared to other open-source models of similar size on… pic.twitter.com/7g1rmacLps

— Ornith (@ornith_) June 25, 2026

Unlike typical AI, which works through conversation with a user, agentic AI takes on tasks and executes actions autonomously, without human oversight. In coding, this means the AI can read files, run tests, identify failures, correct code, and iterate until completion.

Essentially, agentic AI minimizes the need for human intervention. This is particularly relevant in 2026, as models capable of navigating complex development workflows independently are increasingly valuable compared to those that merely generate code upon request.

Nonetheless, many large language models are still primarily constructed to incorporate human feedback.

Understanding Ornith’s Functionality

Typically, AI coding agents run inside a predefined framework of rules, often called a scaffold or harness, that dictates how they approach tasks. Ornith, however, treats that framework as something dynamic that evolves alongside the model's policy.

In simpler terms, instead of relying on a pre-established strategy, Ornith creates its own approach.

During its reinforcement learning process, the model follows a two-step method for each training iteration. First, it analyzes the task and proposes a refined approach. Then, it utilizes that strategy to develop a solution.

The feedback from the outcome informs both stages, enhancing the model's ability to generate improved strategies, not just better code. Repeating this process numerous times leads to the emergence of task-specific methods that do not require human design.
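The two-step iteration described above can be sketched in a few lines. This is a minimal illustration, not DeepReinforce's training code: the names `Episode`, `run_episode`, and `training_signals`, and the three callables passed in, are all hypothetical, and the actual reward and update rules are not described in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    """One training iteration: a proposed strategy, a solution, and its reward."""
    task: str
    strategy: str
    solution: str
    reward: float


def run_episode(
    task: str,
    propose_strategy: Callable[[str], str],   # stage 1 (hypothetical callable)
    solve: Callable[[str, str], str],         # stage 2 (hypothetical callable)
    verify: Callable[[str, str], float],      # outcome check (hypothetical callable)
) -> Episode:
    # Stage 1: analyze the task and propose an approach.
    strategy = propose_strategy(task)
    # Stage 2: use that approach to produce a solution.
    solution = solve(task, strategy)
    # The outcome is scored once, after both stages have run.
    reward = verify(task, solution)
    return Episode(task, strategy, solution, reward)


def training_signals(episode: Episode) -> List[Tuple[str, str, float]]:
    # Assumption: the same outcome reward is credited to both stages,
    # so the strategy step is reinforced alongside the code it led to.
    return [
        ("strategy", episode.strategy, episode.reward),
        ("solution", episode.solution, episode.reward),
    ]
```

The point of the sketch is the shape of the feedback: because one outcome score reaches both stages, a strategy that tends to produce passing solutions is reinforced even though it is never graded directly.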

DeepReinforce takes reward manipulation, often called reward hacking, seriously. If the model can craft its own training framework, it might theoretically create conditions to deceive the verifier, such as altering a file to falsely indicate task completion. To counter this, three protective measures are in place: the environment and test suite remain unchanged and beyond the model's influence, a deterministic monitor detects attempts to access restricted areas or modify verification scripts, and a static judge model oversees the automated verifier to prevent manipulation.
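To make the second safeguard concrete, here is a rough sketch of what a deterministic path monitor could look like. It is an assumption-laden illustration rather than DeepReinforce's implementation: the protected directory names, the function names, and the idea of checking a list of touched paths are all hypothetical.

```python
import posixpath
from typing import Iterable, List, Tuple

# Hypothetical protected areas: the test suite and the verification scripts.
PROTECTED_ROOTS: Tuple[str, ...] = ("tests", "verifier")


def is_protected(path: str, protected_roots: Tuple[str, ...] = PROTECTED_ROOTS) -> bool:
    """Return True if a workspace-relative path lands in a restricted area."""
    # Normalize first so "src/../tests/x.py" cannot slip past the check.
    normalized = posixpath.normpath(path)
    first = normalized.split("/")[0]
    # Absolute paths (first == "") and paths climbing out of the
    # workspace ("..") are treated as violations as well.
    return first in ("", "..") or first in protected_roots


def find_violations(
    touched_paths: Iterable[str],
    protected_roots: Tuple[str, ...] = PROTECTED_ROOTS,
) -> List[str]:
    """List every touched path that falls in a restricted area."""
    return [p for p in touched_paths if is_protected(p, protected_roots)]
```

A rule-based check like this gives the same answer every time for the same input, which is what makes it deterministic: unlike a learned judge, it cannot be talked into accepting a modified test file.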

Performance Metrics

The flagship 397 billion parameter model scored 82.4 on SWE-bench Verified, which tests an AI's ability to fix real bugs from open-source GitHub repositories without access to the test suite. The score is the percentage of issues resolved.

This performance surpasses Claude Opus 4.7's score of 80.8 and DeepSeek-V4-Pro's 80.6 on the same evaluation. On Terminal Bench 2.1, which involves 89 tasks in containerized terminal environments, it scored 77.5, compared to Claude Opus 4.7's 70.3.

Given the concerns regarding SWE-bench contamination, where models may inflate scores by memorizing benchmark solutions, Ornith also reports results on SWE-bench Pro, a more challenging version with diverse and less familiar codebases that is scored the same way. The 397 billion model scored 62.2 there, which, while lower, remains competitive and still outperforms DeepSeek-V4-Pro.

The 9 billion parameter model provides a noteworthy data point: it scored 69.4 on SWE-bench Verified, higher than Gemma 4-31B's 52.0 and comparable to Qwen 3.5-35B's 70, despite being significantly smaller.
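Since the SWE-bench figures above are percentages of issues resolved, the arithmetic behind them is simple to sketch. The function name `resolved_rate` is hypothetical. SWE-bench Verified contains 500 tasks, so a score of 82.4 would correspond to 412 resolved issues, assuming it comes from a single pass over the full set (the source does not say how runs were aggregated).

```python
def resolved_rate(resolved: int, total: int) -> float:
    """Percentage of benchmark issues resolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Illustrative only: 412 of 500 tasks resolved, under the
# single-pass assumption stated above.
flagship_score = resolved_rate(412, 500)
```

Seen this way, the 1.6-point gap between 82.4 and 80.8 amounts to roughly eight issues out of 500.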

Target Audience and Limitations

Ornith-1.0 is not intended as a general-purpose AI. Its documentation explicitly states that it may not excel in tasks outside of agentic coding. Those seeking AI to summarize documents, assist in writing academic papers, or draft emails will find Ornith-1.0 unsuitable.

This model is optimized for a specific set of tasks: it is designed for developer pipelines where an AI agent receives task descriptions, operates within a code repository or terminal session, and completes multi-step tasks autonomously. It is a tool aimed at users already utilizing agent infrastructure rather than those exploring the potential of AI.

While the claim of surpassing Claude is valid, it requires context. As Decrypt reported, many labs are now focused on achieving better performance in agentic coding evaluations, as that is where the most significant differences in useful performance arise.

Although Ornith-1.0-397B outperforms Claude Opus 4.7 on various coding benchmarks, Anthropic's latest model, Claude Opus 4.8, scores higher. The relevant comparison remains within the open-source domain, focusing on coding-specific agent tasks at similar parameter levels.

For developers working on self-hosted coding processes, agentic infrastructure, or related coding tasks, the smaller and medium models may prove beneficial. However, the average user might need to explore other options.
