OpenAI’s next frontier model, OpenAI o3, is garnering global attention for its complex reasoning capabilities. But model performance on benchmark datasets doesn’t necessarily align with real-world applications, performance, or business value. Here’s our take.
AI continues to evolve at warp speed. In December 2024, OpenAI announced its next frontier model, OpenAI o3. The model is garnering global attention for its ability to complete complex reasoning tasks.
The model’s reported results come from benchmark datasets, which are used to test and evaluate AI and other computational models and are key to advancing machine learning and AI research. But model performance on benchmark datasets doesn’t necessarily align with real-world tasks and applications, on-the-ground performance, or business value. Here’s our take.
OpenAI o3 performance on benchmark datasets
The reported performance of OpenAI o3 is remarkable. According to a video from OpenAI, o3 has demonstrated exceptional performance on benchmark datasets: 96.7% accuracy on competition-level math problems, 87.7% on PhD-level science questions, and 71.7% on software programming tasks.
These results clearly outperform the OpenAI o1 model and set a new industry standard. OpenAI o3 also scored between 75.7% and 87.5% accuracy on the ARC-AGI datasets — Abstraction and Reasoning Corpus for Artificial General Intelligence — widely considered among the most important benchmarks for artificial general intelligence (AGI). This performance is comparable to human performance of about 85% accuracy.
The ARC-AGI datasets test models’ abilities in spatial reasoning, pattern recognition, and adapting knowledge to unfamiliar challenges — abilities associated with human-level intelligence. In this way, the ARC-AGI datasets are intended to evaluate “intelligence” rather than “technical skills”: the ability to understand, learn, and adapt rather than simply carry out a task as prompted.
As AI models and systems continue to improve, organizations might be feeling FOMO and pressure to implement AI. And some, with good intentions, might view these advancements as another step toward automating most of our work — and thus, more opportunity for improvement across the business. But there’s more to the story, and important considerations to keep in mind:
1. High benchmark performance doesn’t necessarily translate to real-world effectiveness
Advancements on benchmarks are widely viewed as key signals of AI progress. That said, benchmark datasets don’t necessarily align with real-world tasks, which are usually far more complicated than what the datasets define. Chasing benchmarks can also lead to design decisions that maximize benchmark scores without necessarily improving real-world performance. These are some of the major reasons models can perform well on benchmarks yet fall short in real-world scenarios. Additionally, recent audits have found biases and other problematic issues in benchmark datasets, which has direct implications for bias and harm in the algorithms and models evaluated against them.
This means we shouldn’t assume models will maintain their high benchmark performance in our specific use cases. It’s important to set the right expectations and rigorously test AI models and tools before incorporating new applications into our workflows.
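One practical way to do that testing is a small evaluation harness run over a sample of your own tasks before rollout. The sketch below is illustrative only: call_model() is a hypothetical stand-in for whatever vendor API you actually use, and the test cases are placeholders for examples drawn from your real workflow.

```python
# Minimal evaluation-harness sketch: score a model on your own labeled examples
# instead of relying on published benchmark numbers.
# call_model() is a hypothetical placeholder for the vendor API you would call.

def call_model(prompt: str) -> str:
    """Hypothetical model call; replace with your vendor's API client."""
    return "placeholder answer"

# Placeholder test set drawn from your real workflow: (input, expected output).
test_cases = [
    ("Summarize clause 4.2 of the sample contract", "expected summary A"),
    ("Extract the invoice total from the sample document", "expected total B"),
]

correct = sum(
    1 for prompt, expected in test_cases
    if call_model(prompt).strip() == expected
)
accuracy = correct / len(test_cases)
print(f"Accuracy on our own tasks: {accuracy:.0%} ({correct}/{len(test_cases)})")
```

The point isn’t the code itself but the habit: measure the model against your own data and acceptance criteria, not the benchmark’s.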
2. High benchmark performance doesn’t guarantee business value
The impressive performance of o3 comes with a high price tag. For example, it costs about $17–20 every time o3 completes a visual reasoning task, such as filling in a missing puzzle piece, in its low-compute mode. The cost of its high-compute mode is believed to be even higher, potentially running into thousands of dollars per task. As with any technology advance, the high cost raises questions about the comparative affordability of resources, such as AI versus human talent. And remember that quality and value are linked: don’t just ask “can AI do this?” Ask “can AI do this and meet my standards?” While costs may decrease over time, it’s crucial to understand AI vendors’ pricing models and whether, and for which use cases, the investment in AI brings returns in business value. In other words, it’s critical to calculate the ROI.
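To make that calculation concrete, here is a rough back-of-envelope sketch. Every figure in it is a hypothetical assumption (the per-task AI cost, the analyst rate, the throughput, the review time), not vendor pricing; substitute your own numbers before drawing any conclusions.

```python
# Back-of-envelope ROI comparison: AI cost per task vs. human cost per task.
# All numbers below are hypothetical assumptions for illustration only.

ai_cost_per_task = 20.00        # assumed, in line with reported ~$17-20 low-compute o3 tasks
human_hourly_rate = 60.00       # assumed fully loaded cost of an analyst
human_tasks_per_hour = 4        # assumed analyst throughput for the same task
review_minutes_per_ai_task = 5  # assumed human time to review each AI output

human_cost_per_task = human_hourly_rate / human_tasks_per_hour
review_cost_per_ai_task = human_hourly_rate * (review_minutes_per_ai_task / 60)
total_ai_cost_per_task = ai_cost_per_task + review_cost_per_ai_task

print(f"Human-only cost per task:  ${human_cost_per_task:.2f}")
print(f"AI + review cost per task: ${total_ai_cost_per_task:.2f}")
print("AI is cheaper per task" if total_ai_cost_per_task < human_cost_per_task
      else "Human-only is cheaper per task")
```

With these particular placeholder numbers, the AI option doesn’t pencil out; with different task volumes, quality requirements, or pricing, it might. That’s exactly why the use-case-by-use-case ROI calculation matters.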
3. High benchmark performance doesn’t make human review easier
We’ve observed an interesting pattern in human-AI interaction: the better AI performs, the harder it is for humans to identify and fix its errors. When AI output is mostly correct over time, we tend to assume the tool is reliable, which can lead to overlooked errors.
We’ve also seen cases where people discard AI output that is 98% accurate and start over, because they need 100% accuracy and lack the tools or expertise to find the remaining 2% of errors. As AI performance continues to improve, we must focus on user experience and create intuitive, user-friendly review workflows for our clients and staff if we’re to truly use AI to improve efficiency and add business value.
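As one example of what such a workflow might look like, the sketch below routes low-confidence AI outputs to a human review queue instead of asking reviewers to hunt for the rare errors unaided. The records and the 0.9 threshold are hypothetical assumptions, and the approach presumes the tool exposes some usable confidence signal.

```python
# Illustrative review-triage sketch: send low-confidence AI outputs to humans
# so reviewers focus on the items most likely to contain the rare errors.
# The records and the 0.9 threshold are hypothetical assumptions.

ai_outputs = [
    {"id": 1, "answer": "Net total: $14,200", "confidence": 0.99},
    {"id": 2, "answer": "Net total: $8,750",  "confidence": 0.62},
    {"id": 3, "answer": "Net total: $21,030", "confidence": 0.97},
]

REVIEW_THRESHOLD = 0.9  # assumed cutoff; tune against your own error data

needs_review = [o for o in ai_outputs if o["confidence"] < REVIEW_THRESHOLD]
auto_accepted = [o for o in ai_outputs if o["confidence"] >= REVIEW_THRESHOLD]

print(f"Auto-accepted: {len(auto_accepted)} items")
print(f"Routed to human review: {[o['id'] for o in needs_review]}")
```

The design choice here is simply to give reviewers a short, prioritized list rather than a 98%-correct haystack.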
While o3’s benchmark performance is impressive, its real-world accuracy remains to be seen. As of this publication date, OpenAI hasn’t yet provided the system card with detailed information on how the model works, but researchers suggest it uses techniques such as “backtracking” over chains of thought: generating multiple reasoning chains and selecting the one most likely to lead to a successful outcome. (This complexity likely contributes to the model’s high cost.) Whether this technique constitutes “true reasoning” is still up for debate. OpenAI plans to release o3-mini to the public at the end of January 2025 and o3 after that. As with any new model — and any new technology — it’s important to stay informed and be ready for further performance and user experience testing as we consider integrating it with our work.
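For readers who want intuition for the “multiple chains of thought” idea described above, here is a minimal, purely hypothetical sketch. It is not OpenAI’s implementation: generate_chain() and score_chain() are placeholder functions standing in for a model’s sampling and self-evaluation steps. The takeaway is simply that cost grows with the number of chains explored.

```python
# Illustrative sketch of selecting among multiple candidate reasoning chains.
# NOT OpenAI's implementation: generate_chain() and score_chain() are
# hypothetical placeholders for a model's sampling and self-evaluation steps.
import random

def generate_chain(prompt: str, seed: int) -> str:
    """Hypothetical: sample one chain of thought for the prompt."""
    random.seed(seed)
    return f"chain-{seed}: reasoning steps for {prompt!r}"

def score_chain(chain: str) -> float:
    """Hypothetical: estimate how likely this chain leads to a correct answer."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Generate n candidate chains and keep the highest-scoring one.
    Compute cost grows roughly linearly with n, which is one reason
    search-style inference is expensive."""
    candidates = [generate_chain(prompt, seed) for seed in range(n)]
    return max(candidates, key=score_chain)

print(best_of_n("fill in the missing puzzle piece"))
```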