Transcript 15:02

Hey, what's up, Zev here from esy.com. So I just got the alert: Claude Fable 5 and Claude Mythos 5. It seems Claude Fable is the new segment of Mythos. It's a piece of Mythos. It's not as big and bad, but it's going to be some kind of variation of Mythos, and it's better than Opus, which is crazy. So I just got the update on YouTube.

I wanted to check it out and do a little review. I'm going to test it later in Claude Code as well as Cursor because I'm already doing some work on that. So it's going to be interesting to see how this works out because we've been pretty disappointed with 4.7 and 4.8. We see some marginal improvements. Some people have seen some regressions. So you can't really say if the additional models 4.7 and 4.8 are dramatically better than 4.6. So a couple of months ago when 4.6 came out it was a massive hit. People were disappointed with 4.7, and then from my read on Twitter and so forth, 4.8 wasn't that much better in terms of how other people benchmark these models.

Which is really hard to do because you can't really take the companies' word for it when they list the results that they get on the benchmarks. So the usual best case is that you'll plug these models into your own workflows and then judge for yourself how much better it is at being able to do work compared to the previous models for your special use case. And that's another benefit of what I'm working on at esy.com, where we essentially build agentic workflow templates around digital assets like clipart, essays, infographics. These are just a few that I've needed so far, and I'll be building more out.

But the ability to be able to take something like, say, a template like this for a generic clipart asset, that's not really very useful for Fable. But to be able to plug and play with the models and then compare them to each other is what I'm trying to say. And to be able to do that with templates is a lot easier. So if you already have a template that is known to produce a certain quality of work for a certain type of asset, being able to plug in a new model and produce a new template and compare it to the old ones is something that could simplify that kind of benchmark.

But let's get back into it. So I haven't actually even looked into this yet. So, "Today we're launching Claude Fable 5, a Mythos-class model that we've made safe for general use." Okay, let's get straight into the benchmarks and see what they're saying. So they're saying that agentic coding, they're giving it 80%, 80.3% for Mythos Fable 5. Mythos preview, I've never tested this so I don't know. I don't even know when that was released, if it was released at all. Opus 4.8, 69.2%. Interesting. So for agentic coding, they're saying that Mythos 5 gets 80.3% compared to 69.2% for Opus 4.8. They gave 58.6% for GPT 5.5. I don't know how many Codex fans are going to agree with this. That's pretty hilarious though.

Gemini 3.1, yeah, Gemini, I actually have never used Gemini for coding. It's never been a thought of mine. I should see... Man, once you're just getting in the flow with these quality models, you don't even want to test your repo on stuff like this. I would never put Grok on my repo, let loose on my repo like that. Of course you could do a git tree and then let it run loose, but... And you should. You should. I'm more interested in the open source solutions that claim to be on par with Opus and GPT than to come down to Gemini and Grok. Grok is never in the conversation, but...

So agentic coding, software engineering, bench pro, frontier code. I wonder, do people really understand this stuff or are they just all faking it? Because I get that the frontier model makers are putting these charts up. I get that. But people reading it, like a lot of these news guys reading it and claiming, I don't think these people know what they're talking about, to be honest with you. Frontier code, agentic coding, I don't even know what that means. I got to look in... Because to actually study if these benchmarks are accurate, you have to go in. You have to go in on it. You can't just be reading these charts. You have to... Frontier code, you got to look into frontier code. When I look at the news guys, they never really go into that stuff.

GDPval for knowledge work, what does that even mean? So we would have to go to this source here and we have to go look at what they're saying constitutes knowledge work and how they're measuring that. This got 1932. This means nothing to anybody. Of course, Gemini 3.1, man. You can't catch a break.

I do like Gemini. I like talking to Gemini to be honest, but knowledge work vision, 29 of course. Of course, Claude Mythos is going to be the best. We'll see. Maybe it is. Okay. At least they gave this win to GPT. Spatial reasoning blueprint. Spatial reasoning. 38.6. These are some serious claims, but they admit that Opus 4.8 only had 14.5%.

And then GPT-5 got 36.2 here. So they're keeping you honest. Some people like to keep it honest just so they can lie a little better on some other metrics. But the people will decide. You can put all these benchmarks up as much as you want, but when the people get a hold of the model, they're going to test it in the wild and it'll be that. That'll be that. Tool use.

Okay. Man, I'm just kind of wondering what Gemini's goal is at this point, if this is true. Legal and Grok. For all of the claims on X about Grok, I like talking with Grok. Grok is funny, but Grok is kind of like shits and giggles. You know what I mean? It's like Grok is who you go to for shits and giggles. When you're trying to do serious work, you got to pull out Opus. You got to pull out GPT. You're not using Grok for serious work. And Gemini, if you happen to be on Google and you need a quick AI overview or if you want to dive deep into the AI overview, you're comfortable using Gemini. It's Google at the end of the day. But Grok is never in the conversation for anything but shits and giggles, essentially. Okay.

So this is interesting. And then there's a lot of talk about this whole computer use thing. I haven't tested computer use yet. So I don't know. Have any of you guys tested computer use yet? Perplexity changed their whole... Perplexity actually changed their whole thing. I just checked out Perplexity today. They used to have a nice UI over here where they used to show news. I would love to go into Perplexity to just, you know, go in on whatever the latest news was. Now they're just like a ChatGPT clone. So why would I even use that and not just go to ChatGPT? You don't see anything anymore.

You don't see the news that they used to have. You used to be able to chat with it regularly. And now they have this whole computer. Everyone's going the whole computer route. Perplexity computer. Or if you even check out Higgsfield. I like to use Higgsfield in order to generate some images and produce videos. And they're going hard on this whole supercomputer buzzword now. Which turns a simple chat into production ready content at scale.

Production ready content at scale. I mean I like Higgsfield. I don't understand the whole computer thing. I have to look into that. I have to look into it. So Frontier code. Okay, so we got some charts here. Agentic coding. They got a video here. Let's see what this video is about. What the hell? Okay, is that Pokemon?

It's just, what is this even doing? Claude Fable 5 beats Pokemon FireRed. Oh, what was the point of that? A time lapse of Claude playing Pokemon. Oh, FireRed from start to finish using only raw game screenshots with no maps, navigation aids, or extra game state information. Earlier Claude models needed a complex helper harness to play Pokemon. Claude Fable 5 completed the game with vision alone. Oh, that's pretty sick. That's sick. If that's true, that's sick. Memory in long context: Fable 5 stays focused across millions of tokens in long running tasks and improves its outputs using its own notes. When we had the model play the deck building game Slay the Spire, giving it access to persistent file-based memory improved its performance three times more than for Opus 4.8. Fable also reached the game's final act three times more often.

So this is a simulation, simulates the solar system and predicts a solar eclipse. Okay, this is all cool marketing stuff. What's this? Protein complex. Okay, it's kind of deep. Misaligned behavior. Early feedback for Claude Fable 5. Okay, so let's see what Cursor is saying. Well, Cursor kind of, there's no reason why Cursor would not pump it up. You know what I mean? Cursor pumps up every model. Claude Fable 5 is the state of the art model. I mean, they said that about many other things.

It's opened up, I mean, they said that with GPT 5.5, well, GPT 5.5 is a great model though, but I feel like they say that every single time there's a new model. And of course they have the incentive to say that. It's opened up a class of long horizon problems that were out of reach for early models. That says absolutely nothing. Okay, Claude Fable 5 is the state of the art model on Cursor Bench. Okay, that's meaningful, but it's opened up a class of long horizon problems that were out of reach for early models. Okay, let's see what we get. Claude Fable 5 is a real step forward for the developers GitHub serves.

In our early testing, it took on complex long horizon coding tasks with a level of autonomy and reliability that exceeded previous benchmarks. But what excites us most is the direction it points. A future where developers can hand increasingly ambitious work to agents and trust the results across the software side. Okay, safeguards, cool. Offensive cyber evaluations.

Okay. All right, it's available everywhere today. Pricing for both models, there we go. Let's get into the pricing. Let's get into it. For both models is 10 per million input tokens and $50, I'm sorry, $10 per million input tokens and $50 per million output. Oh, that's expensive. That's expensive. Much more expensive than Opus 4.8. We expect demand for Fable 5 to be very high and difficult to predict. I mean, let's see if it really delivers because Mythos was really like, you know, this kind of like big bad, big and bad, what's that? Terminator like threat to cybersecurity, had to shut it down.

Could only give it to the US government in order to use for mission critical exercises and stuff like that. So let's see if this is really it because there was a lot of hype around it. If it's not, it's going to be pretty disappointing. So, but I think, I think I actually don't know what to think. So I'm going to try this out later today. I just wanted to kind of look into this and do this improv, but I'm going to test this out in Claude Code and do a video on that and then we can see how it works.

But we got to come up with the right benchmarks because I don't want to just do what I see a lot of others do, and there's nothing wrong with it. I just don't want to make some kind of silly little animation or anything like that. I want to actually see, does it really work? So it takes time to come up with a good benchmark. I think though, if I come up with a couple of them, then at least when these new models come out, I have a good idea and I can actually compare them properly. So it will be in the form of a workflow template and I'll share it. Okay. So I'll see you guys in the next one. Take care.

Esy Research

Engineering deep dives and AI tool breakdowns weekly

First Impressions: A Peek into Claude Fable 5 and Claude Mythos Docs

Zev Uhuru, Agentic Engineer
June 9, 2026 · 15:11
claude-fable-5 · anthropic · first-impressions · frontier-models · model-research

A launch-day read-through of Anthropic's Claude Fable 5 and Mythos 5 announcement — the first generally available Mythos-class model, its classifier safeguards with Opus 4.8 fallback, pricing, and what the release signals for long-horizon agentic work.

Anthropic shipped Claude Fable 5 and Claude Mythos 5 today, and this video is exactly what it sounds like: a first scroll through the announcement on launch day, reacting to what's actually in it. This is not a deep dive — no benchmarks of my own, no workflow runs yet. It's a read of what Anthropic is claiming, what's structurally new about this release, and what I want to test next.

What Was Announced

Fable 5 is the first Mythos-class model — Anthropic's new tier above Opus — made generally available. The headline claims: state-of-the-art on nearly every tested benchmark, with the lead growing as tasks get longer and more complex. Mythos 5 is the same underlying model with cyber safeguards lifted, restricted to Project Glasswing partners and, soon, a trusted access program.

The naming footnote is worth a pause: Fable is from the Latin fabula, akin to the Greek mythos. Same model, two names — the safeguards are the only difference, and that's the most interesting design decision in the whole release.

The Safeguard Architecture

Instead of refusing flagged requests, Fable 5 falls back: when classifiers detect cybersecurity, biology/chemistry, or distillation-related queries, the response is handled by Claude Opus 4.8 and the user is told it happened. Anthropic says fallback triggers in under 5% of sessions, and that for the other 95%+ Fable 5's performance is effectively Mythos 5's.

Graceful degradation to a still-frontier model is a much better failure mode than a refusal wall — but it makes "which model actually answered me" a real provenance question for anyone building on the API.
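
One way to keep an eye on that provenance question is to log which model each response says it came from and tally the results. The sketch below is only that, a sketch: it assumes each response is a dict-like object carrying a `model` field naming whatever actually served it, which the announcement as I read it does not spell out. The model name strings, `answered_by`, and `provenance_report` are hypothetical names of mine.

```python
from collections import Counter
from typing import Mapping


def answered_by(response: Mapping[str, object]) -> str:
    """Return the model name recorded on a response, or 'unknown'.

    Assumption: the response exposes the serving model under a "model" key.
    """
    model = response.get("model")
    return model if isinstance(model, str) and model else "unknown"


def provenance_report(requested: str, responses: list[Mapping[str, object]]) -> dict:
    """Tally responses by serving model and compute the off-target share."""
    counts = Counter(answered_by(r) for r in responses)
    total = len(responses)
    off_target = total - counts.get(requested, 0)
    return {
        "by_model": dict(counts),
        # Share of responses not served by the model that was requested.
        "fallback_rate": off_target / total if total else 0.0,
    }


# Hypothetical model names, used purely for illustration.
sample = [
    {"model": "fable-5"},
    {"model": "fable-5"},
    {"model": "opus-4.8"},
    {"model": "fable-5"},
]
report = provenance_report("fable-5", sample)
print(report["by_model"], report["fallback_rate"])
```

Comparing the measured rate against the under-5% figure Anthropic quotes would show whether a particular workload trips the classifiers more often than average.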

Claims That Stood Out

  • Software engineering: Stripe reports a codebase-wide migration in a 50-million-line Ruby codebase done in a day versus an estimated two-plus team-months. Cursor calls it state of the art on CursorBench; Cognition says it tops FrontierBench.
  • Vision: rebuilding a web app's source from screenshots alone, and beating Pokémon FireRed with a minimal vision-only harness where earlier models needed elaborate scaffolding.
  • Memory and long-context: persistent file-based memory improved its Slay the Spire performance three times more than it did for Opus 4.8 — directly relevant to long-running agentic workflows.
  • Science (Mythos 5): ~10x acceleration claims in protein design tasks and novel hypotheses preferred ~80% of the time over Opus-class output in blinded comparisons.

Pricing and Rollout

$10 per million input tokens, $50 per million output — less than half of Mythos Preview. Subscription access is staged: included on paid plans through June 22, then moved to usage credits until capacity allows restoring it. There's also a new 30-day retention requirement on all Mythos-class traffic, which enterprise users will want to read closely.
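
To get a feel for what those prices mean per run, the arithmetic can be sketched as below. The per-token prices are the ones quoted above; the `Pricing` and `run_cost` names and the token counts in the example are my own illustrative assumptions, not figures from the announcement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in US dollars."""
    input_per_mtok: float
    output_per_mtok: float


# Prices quoted in the announcement for both Fable 5 and Mythos 5.
FABLE_5 = Pricing(input_per_mtok=10.0, output_per_mtok=50.0)


def run_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run given its input and output token counts."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Hypothetical long agentic run: 2M input tokens, 200k output tokens.
cost = run_cost(FABLE_5, 2_000_000, 200_000)
print(f"${cost:.2f}")  # $30.00
```

Output tokens cost five times as much as input tokens, so for long-running agents the output side can dominate the bill even when the context is large.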

What I Want to Test

The claims that matter for Esy are the long-horizon ones: token efficiency at medium effort, memory-assisted multi-step runs, and whether the classifier fallback ever trips on benign workflow-engineering prompts. That's the follow-up deep dive.
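
As a rough idea of how a template-based comparison could be wired up, here is a minimal sketch. Everything in it is an assumption for illustration: `compare_models`, the stub `fake_generate` and `fake_score` functions, and the model names are hypothetical, and a real run would swap in actual model calls and a scoring rubric suited to the asset type.

```python
from statistics import mean
from typing import Callable

Generate = Callable[[str, str], str]  # (model_name, prompt) -> output text
Score = Callable[[str], float]        # output text -> quality score


def compare_models(template: str, inputs: list[dict], models: list[str],
                   generate: Generate, score: Score) -> dict[str, float]:
    """Run one template over the same inputs for each model; return mean scores.

    `inputs` must be non-empty, since the mean of no scores is undefined.
    """
    results: dict[str, float] = {}
    for model in models:
        scores = [score(generate(model, template.format(**item)))
                  for item in inputs]
        results[model] = mean(scores)
    return results


# Stubs standing in for real model calls and a real rubric.
def fake_generate(model: str, prompt: str) -> str:
    return prompt.upper() if model == "new-model" else prompt


def fake_score(output: str) -> float:
    return 1.0 if output.isupper() else 0.0


ranking = compare_models(
    "Draw a {subject} clipart",
    [{"subject": "cat"}, {"subject": "tree"}],
    ["old-model", "new-model"],
    fake_generate,
    fake_score,
)
print(ranking)
```

Holding the template and inputs fixed means any score difference comes from the model swap alone, which is the point of benchmarking through a known-good template.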