NIX Solutions: OpenAI Suspected of Manipulating Tests of AI Model o3

Last December, OpenAI introduced its large language model o3, claiming it could solve more than 25% of the problems in FrontierMath, a set of highly challenging math problems. At the time, the announcement was presented as a significant achievement, especially since other AI models had managed only around 2% of the set. However, subsequent independent tests and discrepancies between OpenAI’s internal results and those of external researchers have raised questions about the transparency of the company’s testing methods and the accuracy of its claims.

Discrepancy in Test Results

When OpenAI announced the performance of the o3 AI model, it highlighted the algorithm’s success on FrontierMath problems. However, the consumer version released to the public last week performs far worse on these problems. This has led some to believe that OpenAI either overstated the model’s performance or used a different, more capable version of o3 for its internal tests.

Epoch AI, the organization behind FrontierMath, published independent tests of the public version of o3. These tests showed the algorithm solving only about 10% of the problems, well short of the 25% figure OpenAI initially claimed. The researchers also tested the o4-mini model, a smaller, more affordable successor to o3-mini, and noted the performance differences between these models. While the discrepancy does not necessarily mean that OpenAI intentionally inflated o3’s capabilities, it raises valid concerns about the testing protocols and the version of the model that was evaluated.

Possible Explanations and Future Updates

Epoch AI highlighted that the public version of o3 may differ from the version OpenAI originally tested. The organization pointed out that the lower-bound score OpenAI reported was nearly identical to Epoch’s own result. Epoch AI also noted that it used an updated version of the FrontierMath problem set for its testing, which could have contributed to the differences. The researchers suggested that the diverging outcomes might stem from OpenAI evaluating the AI with a more powerful internal version, more computation time, or a different subset of FrontierMath problems.

The ARC Foundation, which also tested a preview of o3, stated that the public version of the model “is a different model,” optimized for real-world applications such as chat and product use, adds NIX Solutions. The ARC team confirmed that the released versions of o3 run on smaller compute tiers than the version they tested. In response to these discrepancies, OpenAI’s Wenda Zhou explained that the public version of o3 is “more optimized for real-world use cases” and responds to queries faster than the version tested in December, and that this optimization could be why the benchmark results differ from the initial claims.

As more tests are conducted and further optimizations are made, we’ll keep you updated on the progress and any additional insights into the o3 AI model’s true capabilities.