GPT-5.2

View on HN ↗ · Shared Dec 11, 2025

TL;DR

OpenAI released GPT-5.2, touting significant benchmark gains on reasoning and vision tasks alongside an updated knowledge cutoff of August 2025. The release nonetheless sparked extensive discussion and skepticism within the Hacker News community, centered on the model's increased pricing, perceived inaccuracies in its promotional materials, and the ongoing debate over whether AI benchmarks reflect real-world performance relative to competitors like Gemini and Claude.

Key Perspectives

  • Benchmark Performance: GPT-5.2 demonstrated notable improvements on various benchmarks, with its high-reasoning version achieving a 77.9% score on the Extended NYT Connections benchmark, up from 69.9% for GPT-5.1. The model also showed strong performance on ARC AGI v2 and SWE-bench Verified.
  • Cost-Effectiveness on Benchmarks: GPT-5.2 Pro achieved a new high of 54.2% on ARC AGI v2 at a cost of $15.72/task, which is noted as more cost-effective than Gemini 3 Pro's 54% at $30.57/task. The "xhigh" reasoning level of GPT-5.2 also offers good value at 52.9% for $1.90/task.
  • Vision Capabilities: OpenAI highlighted GPT-5.2's enhanced vision, but users criticized the promotional motherboard image for mislabeling several components (e.g., RAM slots identified as CMOS battery, DisplayPort as HDMI), leading to questions about the model's accuracy.
  • Knowledge Cutoff and Training: The model features a significant knowledge cutoff jump to August 2025, suggesting a new pre-trained model or extensive continuous pre-training, which contrasts with the perception that OpenAI hadn't completed a major pre-training run since GPT-4o.
  • Adaptive Reasoning: GPT-5.2 continues to incorporate adaptive reasoning, designed to respond faster to simpler queries and dedicate more processing time to complex ones, a feature also present in GPT-5.1 and Codex-Max.
  • Coding Experience: Initial user feedback suggests GPT-5.2, particularly through Codex, offers a "really nice" and improved experience for coding discussions, though performance on first coding attempts varied.
  • Voice Chat Comparisons: ChatGPT's voice chat is praised by some for its quality and near-perfect speech-to-text, while others find Gemini's voice features superior for specific use cases (e.g., Latin transcription) or prefer Claude's voice chat due to fewer interruptions.
  • Context Management: A new "compaction" feature for context management is introduced, seen as a significant usability improvement for maintaining context in long conversations without manual intervention (a rough client-side sketch of the idea follows this list).
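
The announcement does not spell out how compaction is implemented server-side. Purely as an illustration of the general idea, here is a minimal client-side sketch that condenses older turns into a single summary message once a conversation grows long; the model name, threshold, and summarization prompt are assumptions, not OpenAI's actual mechanism.

```python
# Illustrative only: a client-side approximation of what a "compaction" step
# could do. The model name, threshold, and prompt are assumptions, not
# OpenAI's actual mechanism.
from openai import OpenAI

client = OpenAI()

def summarize(messages: list[dict]) -> str:
    """Ask the model to condense older turns into a short running summary."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    resp = client.chat.completions.create(
        model="gpt-5.2",  # assumed API name taken from the announcement
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, keeping all facts, "
                       "decisions, and open questions:\n\n" + transcript,
        }],
    )
    return resp.choices[0].message.content

def compact(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Replace all but the most recent turns with a single summary message."""
    if len(messages) <= keep_last:
        return messages
    summary = summarize(messages[:-keep_last])
    return ([{"role": "system", "content": "Summary of earlier conversation: " + summary}]
            + messages[-keep_last:])
```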

Controversies & Debates

  • Benchmark Reliability and Gaming: A recurring theme is skepticism regarding the value of benchmarks, with users arguing they are often "gamed," do not reflect real-world task completion without errors or hallucinations, and contribute to "benchmark saturation."
  • OpenAI's Transparency and Competitor Comparisons: Many users questioned why OpenAI's announcement did not prominently feature comparisons with competitor models like Claude Opus or Gemini 3 Pro, suggesting it might be due to competitors catching up or surpassing GPT-5.2 in certain areas.
  • Accuracy of Promotional Imagery: The errors in the promotional motherboard image sparked a debate about OpenAI's honesty, with some users feeling misled despite an OpenAI employee acknowledging the feedback and agreeing the text could be clearer about the model's mistakes.
  • Price Increase Justification: GPT-5.2's higher pricing compared to 5.1 (a 40% increase in some tiers) led to discussions about whether the performance gains truly justify the cost, especially given that flagship models tend to be priced highest at release.
  • "Rushed" Release and Internal State: Some community members speculated that GPT-5.2's release felt rushed, possibly in response to the success of Gemini 3, questioning if OpenAI truly had a "better model ready to go" or if it was deployed prematurely.
  • "Enshittification" Concerns: One user commented that the price increase and evolving model landscape signal an entry into the "first phase of the enshittification process" for AI services.

Standout Insights

  • A user successfully used Gemini-3-pro-preview to produce an accurate, macronized Latin transcription from an 8-minute, 48k mono MP3 of a nuanced Latin speaker, despite Gemini's voice chat responding in Spanish/Portuguese in other contexts.
  • An OpenAI employee (tedsanders) directly engaged in the discussion, acknowledging user feedback about the promotional image's inaccuracies and committing to making a "small edit" to the text to clarify the model's limitations.
  • A user attempted to contact OpenAI support to inquire about early access to GPT-5 fine-tuning, only to find the support channel was 100% AI-driven and unable to escalate requests to a human team member.

User

Can you dive deeper into comparisons with other models?

AI Assistant

The discussion thread provides various comparisons between GPT-5.2 (and other GPT models) and its competitors, including Gemini (Google), Claude/Opus (Anthropic), and Grok (xAI).

Here's a deeper dive into the comparisons:

1. General Performance & Benchmarks:

  • NYT Connections Benchmark: User "zone411" reports that GPT-5.2 shows improvement over GPT-5.1 across reasoning levels (e.g., high-reasoning: 69.9 → 77.9). However, Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher. "Donald" notes Gemini 3 Pro Preview gets an impressive 96.8% on this benchmark. "bigyabai" quips that GPT-5.2 might be Google's best Gemini advertisement yet due to these results.
  • ARC AGI v2: "josalhor" highlights a significant jump from GPT-5.1 Thinking to 5.2 (17.6% → 52.9%). "granzymes" notes that the new gpt-5.2-pro scores 54.2% (a new high), slightly surpassing Gemini 3 Pro (54%) on the leaderboard, at a better cost per task ($15.72 vs $30.57). The new xhigh reasoning level on GPT-5.2 (52.9% for $1.90) is also noted as a big improvement in value over Opus 4.5 (37.6% for $2.40); a quick back-of-the-envelope value comparison follows this list.
  • Other Logic Puzzles: "tikotus" links to a blog post where GPT 5 Pro was a winner on "Clues by Sam," though "thanhhaimai" points out it lacks Gemini 3 performance, and "crapple8430" notes GPT 5 Pro is much more expensive (10x), making it an "apples to oranges" comparison.
  • AIME: "Tiberium" shares an OpenAI graphic showing GPT-5.2 performing at 100% on AIME (no tools), which "varenc" finds impressive.
  • SWE-bench Verified: GPT-5.2 improves from 76.3% to 80%, though "fuddle" expresses skepticism, suggesting solutions might be in the training dataset. "enlyth" also claims Claude Opus scored higher on SWE-bench Verified but was conveniently left out of OpenAI's comparison graphic.
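
To make the value claims above concrete, here is a quick back-of-the-envelope calculation using the per-task figures quoted in the thread. "Points per dollar" is only an illustrative ratio, not an official leaderboard metric.

```python
# "Points per dollar" on ARC AGI v2, using the per-task figures quoted in the
# thread. This ratio is only illustrative, not an official leaderboard metric.
def points_per_dollar(score_pct: float, cost_per_task_usd: float) -> float:
    return score_pct / cost_per_task_usd

print(points_per_dollar(54.2, 15.72))  # GPT-5.2 Pro    ≈ 3.4
print(points_per_dollar(54.0, 30.57))  # Gemini 3 Pro   ≈ 1.8
print(points_per_dollar(52.9, 1.90))   # GPT-5.2 xhigh  ≈ 27.8
print(points_per_dollar(37.6, 2.40))   # Opus 4.5       ≈ 15.7
```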

2. Subjective User Experience & Specific Use Cases:

  • Overall General Use & Web Search:
    • "tenpoundhammer" switched from ChatGPT to Gemini 3, finding it "much better" and consistently giving "better answers" for coding, news, stock analysis, and daily problems, especially for current web information.
    • "dmd" has the "exactly opposite experience," finding ChatGPT "extremely willing" to do searches and thinking, while Gemini is "extremely reluctant."
    • "lxgr" had the "opposite experience" to "tenpoundhammer," finding Gemini 3 Pro "worse than either" GPT-5.0 or 5.1 for web search and hallucinating at "alarming rates." They state Opus 4.5 has been a "step above both."
    • "hbarka" finds the "intuition in answers from Gemini just fits," and is "now unwilling to just rely on ChatGPT."
    • "nullbound" (using paid ChatGPT vs. free Gemini) subjectively found ChatGPT better for product search, suggesting variables like context might influence response quality.
  • OCR & Vision:
    • "tenpoundhammer" praises Gemini's "phenomenal" OCR, stating ChatGPT "can't read my bad handwriting but Gemini can easily."
    • "dmd" contradicts this, saying ChatGPT is willing to open PDFs, screenshot, OCR them, and use as input, whereas Gemini "just ignores them."
    • Despite claimed improvements, "breakingcups" points out that GPT-5.2's vision demo image on the promotional page still contains "completely wrong" placements and labels (e.g., RAM slots identified as CMOS battery, DisplayPort as HDMI), leading "g947o" to decide against using it until it's "at least 5x better."
  • Coding Assistance:
    • "tenpoundhammer" found Gemini better for coding.
    • "bayarearefugee" agrees, stating Gemini "performs so much better than ChatGPT for giving actual practical, working answers" for coding, when fully operational despite UI issues.
    • "ipsum2" notes that models like codex-5.1-max and opus 4.5 are "absolutely useless writing shaders or ML training code," but "really good at basic web development."
    • "JanSt" believes Codex and Opus 4.5 are "really good coders already."
  • Voice Chat:
    • "zug_zug" considers ChatGPT's voice chat its "last remaining killer feature."
    • "hbarka" argues Gemini 3 Live mode is "much much better than ChatGPT" due to natural voices, responsiveness, and interruptibility.
    • "simondotau" "loathe[s] ChatGPT's voice chat" for being too conversational.
    • "josephwegner" is a fan of Perplexity's voice chat (which "solarkraft" points out is based on OpenAI's Realtime Assistant API).
    • "joshmarlow" says Grok's voice chat is "almost there" but slower.
    • "Robdel12" finds Claude's voice chat better, as ChatGPT interrupts too much.
    • "spudlyo" provides an interesting example: ChatGPT attempts Latin pronunciation, while Gemini responds in Spanish/Portuguese. However, for transcribing nuanced Latin audio, Gemini-3-pro-preview is "pretty mind blowing."
    • "sundarurfriend" criticizes ChatGPT's voice chat for "meh" voices, poor noise filtering, unreliable speech-to-text, and inferior AI backend quality compared to text chat. They find Grok better at speech-to-text and noise handling, but its general answers aren't great.
    • "CamelCaseName" finds ChatGPT's voice to text "the absolute best in the world," while Gemini's is frustrating.

3. Pricing:

  • "preetamjinka" notes that GPT-5.2 is more expensive than GPT-5.1, going against a perceived trend of prices decreasing.
  • "PhilippGille" points out that Gemini 3 Pro Preview also became more expensive than 2.5 Pro.
  • "arnaudsm" suggests #1 models are typically priced at 2x more than the competition.

4. OpenAI's Comparison Strategy:

  • Several users ("minadotcom," "mlmonkey," "yousif_123123") express concern or amusement that OpenAI did not initially include comparisons to competing models like Anthropic, Google DeepMind, or DeepSeek in their public release. They question if this indicates GPT-series is performing worse.
  • "Tiberium" later shares an X (Twitter) post by OpenAI that does include a table comparing GPT-5.2 against Opus 4.5 and Gemini 3 on selected benchmarks (AIME, HumanEval, GDPval).
  • "enlyth" criticizes this shared comparison as "cherry-picked," claiming Claude Opus had a higher score on SWE-Bench Verified (which was omitted), and GDPval is an OpenAI-made benchmark.
  • "poormathskills" incorrectly states OpenAI has "never compared their models to models from other labs in their blog post," which "boole1854" corrects by pointing to the GPT-4o post's comparisons.

User

What about coding performance?

AI Assistant

The discussion thread contains several points regarding the coding performance of GPT-5.2 and its competitors:

  • Gemini vs. ChatGPT for Coding:

    • tenpoundhammer states they use ChatGPT "a ton" for coding but found that Gemini handles coding use cases "much better" and consistently gives "better answers" after trying Gemini 3.
    • bayarearefugee echoes this experience, saying that for LLM use in coding assistance, Gemini "performs so much better than ChatGPT for giving actual practical, working answers without having to be so explicit with the prompting that I might as well have just written the code myself." This is qualified by noting Gemini's UI and "continuity" issues, but when it's fully operational, its coding performance is superior.
  • Codex Variants and Specifics:

    • The article (and "xd1936" quoting it) mentions that "GPT‑5.2 will work well out of the box in Codex," and "we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks."
    • "jstummbillig" highlights a seemingly contradictory statement from the article that "GPT-5.1-Codex-Max is a faster, more capable, and more token-efficient coding variant." They question how this makes sense with the reported benchmarks for 5.2.
    • "tempaccount420" explains that coding requires a "mindset shift that the -codex fine-tunes provide," suggesting that Codex versions perform specific actions like looking up documentation or trying out code in isolation to improve capability.
    • "dmos62" finds the "biggest advantage of codex variants" to be "terseness and reduced sicophany," along with better adherence to requested output formats.
  • General Model Capabilities for Coding:

    • "JanSt" remarks that "Codex and Opus 4.5 are really good coders already and they keep getting better." They believe we might have "crossed the threshold of models being as good or better than most engineers already."
    • However, "dinobones" shares a more critical view, stating they've "tried using Codex with max reasoning for doing PRs and gotten laughable results too many times," despite Codex with Max reasoning being "apparently near-SOTA on code." They add that "Claude Code/Opus is also sometimes equally as bad" at complex tasks like implementing ideas in large codebases across many files while passing tests.
    • "ipsum2" notes that models (specifically codex-5.1-max, opus 4.5) can be "absolutely useless writing shaders or ML training code," but "really good at basic web development." "sheeshe" attributes this to the abundance of web development data available on the internet for training.
    • "speedgoose" experienced GPT-5.2 in Vscode Insiders with Github Copilot (Codex), noting it "eventually started using sed and grep in shells instead of using the better tools it has access to."
    • "ghosty141" working in a C++ environment, finds Codex "works REALLY well when it has things to copy" like refactorings, documentation, and code review, potentially saving ~50% of time. However, it's "not that great at writing new code" and "doesn't create abstractions, use abstractions or try to find general solution to problems."
  • Alternative Models:

    • "cc62cf4a4f20" mentions using Devstral 2 (Ollama) with OpenCode, finding it "good enough" despite not being as good as Claude Code, and appreciating the data sovereignty it offers by running locally.