GPT o1 Model: Major Advancements in AI Reasoning Performance

Evals

Reasoning Improvements

  • Overview: The figure compares the reasoning performance of a new model, labelled "o1," against GPT-4o across a range of human exams and machine-learning benchmarks.
    • Thoughts: Strong performance on complex reasoning tasks is a clear indicator of progress in AI capability.

Exam Settings

  • Maximal Test-Time Compute Setting: Unless specified otherwise, o1 was evaluated at its maximal test-time compute setting.
    • Explanation: This keeps the evaluation consistent across benchmarks and reflects the model's best-case reasoning performance; a minimal sketch of such an evaluation loop follows below.
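
As a minimal sketch of what such an evaluation loop could look like (the sample_answer helper, the answer format, and the sample count are assumptions for illustration, not OpenAI's actual harness), pass@1 and majority-vote ("consensus") accuracy might be computed like this:

```python
from collections import Counter
from typing import Callable, List


def evaluate(problems: List[dict],
             sample_answer: Callable[[str], str],
             n_samples: int = 64) -> dict:
    """Toy evaluation: pass@1 vs. majority vote over n_samples answers.

    `problems` holds {"question": str, "answer": str} records;
    `sample_answer` draws one model answer per call (hypothetical helper).
    """
    pass_at_1 = 0
    consensus = 0
    for p in problems:
        samples = [sample_answer(p["question"]) for _ in range(n_samples)]
        # pass@1: grade only the first sampled answer.
        if samples[0] == p["answer"]:
            pass_at_1 += 1
        # cons@N: grade the most common answer across all samples.
        majority, _ = Counter(samples).most_common(1)[0]
        if majority == p["answer"]:
            consensus += 1
    n = len(problems)
    return {"pass@1": pass_at_1 / n, f"cons@{n_samples}": consensus / n}
```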

Key Performance Insights

Competition Math (AIME 2024)

Metric     GPT-4o   o1 Preview   o1
Accuracy   13.4%    —            83.3%
  • Thoughts: The significant improvement from 13.4% to 83.3% underscores the enhanced reasoning capabilities of o1.

Competition Code (CodeForces)

Metric       GPT-4o   o1 Preview   o1
Percentile   11.0     62.0         —
  • Thoughts: Coding benchmarks are especially relevant for evaluating models used in software development; a rough illustration of how a percentile rank works is sketched below.
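
For context, a percentile rank expresses what share of human competitors the model's rating exceeds. The snippet below is only a rough illustration of that idea with a made-up rating pool; it is not the actual Codeforces ranking method.

```python
from bisect import bisect_left
from typing import Sequence


def rating_percentile(model_rating: float, competitor_ratings: Sequence[float]) -> float:
    """Share of competitors (in percent) whose rating is strictly below the model's."""
    ratings = sorted(competitor_ratings)
    return 100.0 * bisect_left(ratings, model_rating) / len(ratings)


# Hypothetical competitor pool, purely for illustration.
pool = [800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700]
print(rating_percentile(1250, pool))  # -> 50.0 (beats 5 of the 10 competitors)
```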

PhD-Level Science Questions (GPQA Diamond)

Metric     GPT-4o   o1 Preview   o1      Expert Human
Accuracy   56.1%    78.3%        78.0%   69.7%
  • Thoughts: Showing comparison with human experts provides a benchmark for human-level reasoning abilities in science.

ML Benchmarks

Benchmark   GPT-4o   o1
MATH-500    60.3     94.8
MathVista   63.8     73.2
MMMU        69.1     78.1
MMLU        88.0     92.3
  • Explanation: Standard ML benchmarks show how the gains carry over to math, multimodal, and general-knowledge tasks; a quick way to tabulate the per-benchmark deltas is sketched below.
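
To make these gains easier to compare, the absolute and relative improvement per benchmark can be tabulated. The snippet below simply reuses the scores from the table above and is purely illustrative.

```python
# Scores transcribed from the table above: (GPT-4o, o1).
scores = {
    "MATH-500":  (60.3, 94.8),
    "MathVista": (63.8, 73.2),
    "MMMU":      (69.1, 78.1),
    "MMLU":      (88.0, 92.3),
}

for name, (gpt4o, o1) in scores.items():
    delta = o1 - gpt4o                 # absolute gain in points
    relative = 100.0 * delta / gpt4o   # gain relative to the GPT-4o baseline
    print(f"{name:<10} +{delta:.1f} pts ({relative:+.1f}%)")
```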

PhD-Level Science Questions (GPQA Diamond)

Subject     GPT-4o   o1
Chemistry   40.2     64.7
Physics     59.5     92.8
Biology     61.6     69.2
  • Explanation: Assessing performance across different fields of science demonstrates the model's breadth of knowledge.

Exams

Exam              GPT-4o   o1
AP English Lang   52.0     64.0
AP English Lit    68.7     69.0
AP Physics 2      65.9     89.0
AP Calculus       71.3     85.2
AP Chemistry      83.0     93.0
LSAT              87.8     98.9
SAT EBRW          91.3     93.8
SAT Math          100.0    100.0
  • Thoughts: Performance on standardized exams shows applicability in educational assessments.

MMLU Categories

Category              GPT-4o   o1
Global Facts          65.1     78.4
College Chemistry     68.9     78.1
College Mathematics   75.6     98.1
Professional Law      75.6     85.0
Public Relations      76.8     87.1
Econometrics          79.8     97.0
Formal Logic          85.3     92.3
Moral Scenarios       83.0     85.8
  • Explanation: Varied categories indicate the model’s capability across different fields of knowledge and logic.

Summary

  • o1's Superiority: On a majority of benchmarks, o1 significantly outperforms GPT-4o, indicating a noticeable improvement in reasoning tasks.
    • Thoughts: This marks a clear step toward more robust and accurate problem-solving in AI models.

References:

  • Learning to Reason with LLMs | OpenAI — openai.com
  • A review of OpenAI o1 and how we evaluate coding agents — www.cognition.ai
  • OpenAI launches new AI model o1 with PhD-level performance — venturebeat.com