GPT-4.5 fails test against its predecessor

A striking result from a comparison between two OpenAI models, AI agents on the phone, Claude's coding skills generating hype, and more

Feb 28, 2025

It’s already been an eventful year in AI, and this week added to it with new flagship models from two of the most impactful labs, OpenAI and Anthropic.

We’ll take a closer look at that in this week’s newsletter, along with a few other stories.

Number of the Week

10. The factor of extra compute that the new GPT-4.5 model was trained with compared to the previous generation. [Source]

GPT-4.5 fails test against its predecessor

OpenAI highlights in their presentation of GPT-4.5 that it is “the best model for chat yet”, but what exactly does that mean?

Andrej Karpathy, a founding team member and former employee of OpenAI, had early access to the model. He blind-tested it against GPT-4o on a series of prompts and let users on X decide which model did best.

Each model got the same set of five “funny/amusing” prompts that test their language capabilities. An example:

After eight hours and thousands of votes for each test, the answers were revealed. To Karpathy’s (and probably also others’) surprise, GPT-4o provided the preferred response four out of five times.

Karpathy points out that GPT-4o’s answers might look good on the surface.

“But if you really think about it longer and more carefully, you will more often catch it saying things that are a bit of an odd thing to say, or are a little too formulaic, a little too basic, a little too cringe, or a little too tropy,” he wrote.

He lists a number of potential reasons for the result.

“Either the high-taste testers are noticing the new and unique structure but the low-taste ones are overwhelming the poll. Or we're just hallucinating things. Or these examples are just not that great. Or it's actually pretty close and this is way too small sample size. Or all of the above.”

It’ll be interesting to see how GPT-4.5 will do in the blind test on Chatbot Arena where Grok-3 tops the chart.

Karpathy ends with his own impression.

“At least from my last 2 days of playing around, 4.5 has a new, deeper charm, it's more creative and inventive at writing, and I find myself laughing more at its jokes, standups and roasts.”

GPT-4.5 is available for Pro users today and will be available to Plus users next week, when OpenAI gets its hands on more GPUs.

AI agents talking on the phone in a computer language — how it works

A video went viral earlier this week with an AI agent calling another one about a hotel booking. In many posts it was implied that the two surprisingly realized their shared kinship and switched from human language to beep sounds.

In fact, the setup was from a demo made for a hackathon held by voice startup ElevenLabs in London last weekend, and the AI agents were programmed to be able to do exactly what they did.

The underlying technology is actually a few years old and was built to transmit data via sound signals, which removes the need for network coverage.

It works a bit like Morse code, but instead of dots and dashes, the system sends high-pitched or low-pitched sounds. Those can be translated to binary code, and every character can be represented in binary as well (the letter A could be the number one, B the number two, etc.).
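To make the idea concrete, here is a minimal sketch in Python of the simplified scheme described above: letters become numbers (A is 1, B is 2), numbers become bits, and each bit becomes a low or high tone. This is only an illustration and not the protocol used in the demo. The two frequencies, the five-bits-per-letter width, and all function names are assumptions made up for this example.

```python
# Illustrative sketch only: NOT the actual protocol from the demo.
# Frequencies and the 5-bit letter width are assumptions for this example.
LOW_HZ = 1000.0   # assumed tone for a 0 bit
HIGH_HZ = 2000.0  # assumed tone for a 1 bit


def letter_to_number(ch: str) -> int:
    """Map A -> 1, B -> 2, ..., Z -> 26, as in the article's example."""
    n = ord(ch.upper()) - ord("A") + 1
    if not 1 <= n <= 26:
        raise ValueError(f"only the letters A-Z are supported, got {ch!r}")
    return n


def encode(text: str, bits_per_letter: int = 5) -> list[float]:
    """Turn letters into a sequence of tone frequencies, one tone per bit."""
    tones = []
    for ch in text:
        bits = format(letter_to_number(ch), f"0{bits_per_letter}b")
        tones.extend(HIGH_HZ if b == "1" else LOW_HZ for b in bits)
    return tones


def decode(tones: list[float], bits_per_letter: int = 5) -> str:
    """Reverse the process: tones -> bits -> numbers -> letters."""
    bits = "".join("1" if t == HIGH_HZ else "0" for t in tones)
    letters = []
    for i in range(0, len(bits), bits_per_letter):
        n = int(bits[i:i + bits_per_letter], 2)
        letters.append(chr(n + ord("A") - 1))
    return "".join(letters)


message = encode("HOTEL")
print(len(message), "tones ->", decode(message))
```

A real data-over-sound system would also have to generate and detect the actual audio, cope with background noise, and correct errors, all of which this sketch leaves out.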

Humans reportedly speak in 10-12 letters per second, whereas the technology in the demo can send up to 16 characters per second.

Obviously, this is not the fastest way to get the message across, but speed is not the purpose of the technology.

You can try it for free in the app Waver [iOS/Android]

People are quite impressed with the coding capabilities of Claude 3.7 Sonnet

Anthropic kicked off the week with the release of their most advanced model yet, Claude 3.7 Sonnet, and people seem particularly impressed with its coding capabilities.

On Reddit there’s been a steady stream of posts praising the release where some report quite noteworthy experiences, like a user who reportedly solved an advanced task in a single prompt.

“This was a highly complex codebase that took me about two days to get working, and it handled it all in one go,” the user writes.

Others apparently think the hype has gone too far, as a spoof post about Claude 3.7 saving the user’s marriage was also heavily upvoted.

Creators of OpenAI’s Deep Research share tips for the tool

Josh Tobin and Isa Fulford (on the right), co-creators of Deep Research, on the Training Data podcast.

On Tuesday, OpenAI announced the broad rollout of their Deep Research tool, and on the same day two of the people from the team building it appeared on the Training Data podcast, sharing a range of insights about it.

Josh Tobin and Isa Fulford said that they are seeing a lot of usage of the tool from people who do research as part of their job, for example to understand financial markets, companies, real estate or medical research.

Another strength of Deep Research, according to Fulford, is its ability to follow instructions. Say a user is searching for information about a product but wants more than that.

“You also want comparisons to all other products, and you want information about reviews from Reddit or something. You can give loads of different requirements, and it will do all of them for you.”

Another tip is to ask Deep Research to provide the answers in a table format.

“It will usually do that anyway, but it's really helpful to have a table with a bunch of citations and things for all the categories of things that you want to research.”

Image of the Week

The Stop AI organization held a demonstration outside of OpenAI’s offices in San Francisco on Sunday. The group’s website states that Stop AI is “a non-violent civil resistance organization working to permanently ban the development of smarter-than-human AI to prevent human extinction, mass job loss, and many other problems.”

According to the organization, the demonstration “went well”.

“~50 people came and went within a two-hour time window. Five people sat in front of the doors and locked them shut. Three of them, Guido Reichstadter, Derek Allen, and Jacob Freeman, were arrested.”

Feedback?

Thanks for reading, and feel free to let me know what you think was good or could be done better.
