One of the biggest problems with hands-off LLM writing (for long-horizon stuff like novels) is that you can't really give the model any details of your story, because it gets absolutely neurotic about them.
Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.
This is the first model whose output, 19 pages in, resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.
Oh, that's a good one. And it's true. There seems to be a massive inability for most people to admit the building impact of modern AI development on society.
Oh, we do admit impact and even have a name for it: AI slop.
(Speaking on LLMs now since AI is a broad term and it has many extremely useful applications in various areas)
They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.
Haha, so that's the first derivative of goalpost position. You could take the derivative of that to see if the rate of change is speeding up or slowing.
Both books that have outsold the Harry Potter series claim divine authorship, not purely human. I am prepared to bet quite a lot that the next isn't human-written, either.
I don't know. It's a question relevant to all generative AI applications in entertainment - whether books, art, music, film or videogames. To the extent the value of these works is mostly in being social objects (i.e. shared experience to talk about with other people), being able to generate clones and personalized variants freely via GenAI destroys that value.
You may be right, on the other hand it always feels like the next goalpost is the final one.
I'm pretty sure that if something like this happens, some dude will show up from nowhere and claim that it's just parroting what other, real people have written, just blended together and randomly spat out – "real AI would come up with original ideas like a cure for cancer," he'll say.
After some form of that arrives, another dude will show up and say that this "AlphaFold while-loop" is not real AI, because he just went for lunch and there was a guy flipping burgers – and that "AI" can't do that, so it's shit.
https://areweagiyet.com should plot those future points as well, with all those funky goals like "if Einstein had access to the Internet, Wolfram etc. he could have come up with it anyway, so it's not better than humans per se", or "it had to be prompted and guided by a human to find this answer, so it didn't really do it by itself", etc.
What if we didn’t measure success by sales, but impact to the industry (or society), or value to peoples’ lives?
Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?
I recently watched some Claude Plays Pokemon and believe it's a better measure than all those AI benchmarks. The game could be beaten by an 8-year-old, who obviously doesn't have all the knowledge that even small local LLMs possess, but has actual intelligence and could figure out the game within < 100 hours. So far Claude can't even get past the first half, and I doubt any other AI could get much further.
Now I want to watch Claude play Pokemon Go, hitching a ride on self-driving cars to random destinations and then trying to autonomously interpret a live video feed to spin the ball at the right pixels...
2026 news feed: Anthropic cited as AI agents simultaneously block traffic across 42 major cities while trying to capture a not-even-that-rare pokemon
We humans love quantifiability. Since you used the word "measure", do you believe the measurement you're aspiring for is quantifiable?
I currently assert that it's not, but I would also say that trying to follow your suggestion is better than our current approach of measuring everything by money.
No. Screw quantifiability. I don't want "we've improved the sota by 1.931%" on basically anything that matters. Show me improvements that are obvious, improvements that stand out.
Claude Plays Pokemon is one of the few really important "benchmarks". No numbers, just the progress and the mood.
The goalposts will be moved again. Tons of people clamoring that the book is stupid and vapid and that only idiots bought it. When AI starts taking over jobs, which it already has, you'll get tons of idiots claiming the same thing.
Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not about surpassing humans.
Of course, this is just some pedantry.
I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.
I don't know why I keep submitting myself to Hacker News, but every few months I get the itch, and it only takes a few minutes to be turned off by the cynicism. I get that it's from potentially wizened tech heads who have been in the trenches and are being realistic. It's great for that, but any new bright-eyed and bushy-tailed dev/techy, whatever, should stay far away until much later in their journey.
People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it has gotten past the trivial stuff but is still scratching at the boundaries of being useful. And that is not an attempt to make gen AI tech look bad, it is really amazing what it can do - but it is far from delivering on the hype, and that is why people are providing critical evaluations.
Let's not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
> Let's not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
That's a better criticism of college exams than of the benchmarks. And/or those exams likely have either the exact questions, or very similar ones, in the training data.
The list of things that LLMs do better than the average human tends to rest squarely in the "problems already solved by above average humans" realm.
We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?
I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
In my consulting career I sometimes get to tune database servers for performance. I have a bag of tricks that yield about +10-20% performance each. I get arguments about this from customers, typically along the lines of "that doesn't seem worth it."
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
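For what it's worth, the gains multiply rather than add, which is why they pile up faster than intuition suggests. A throwaway sketch (the individual percentages are made up for illustration, not measurements):

    # Hypothetical per-step improvements; stacked gains compound multiplicatively.
    gains = [0.10, 0.20, 0.20, 0.15, 0.10]
    total = 1.0
    for g in gains:
        total *= 1 + g
    print(f"combined speedup: {total:.2f}x")  # ~2.00x, i.e. roughly double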
Except at some point the low-hanging fruit is gone, and it becomes +1% here, +3% in some benchmarked use case, -1% in the general case, etc. Then come the benchmarking lies we are seeing right now, where everyone picks a benchmark that makes them look good and its correlation to real-world performance is questionable.
Not really new is it? First cars just had to be approaching horse and cart levels of speed. Comfort, ease of use etc. were non-factors as this was "cool new technology".
In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.
I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since this is very much subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that since I really miss the time where authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this, I would advise them to get a good copy editor.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which is obviously impossible when you look up through a trapdoor, not to mention that previously it was said this entry is behind the bar counter, surely blocking the sight. Kael lights an oily rag which somehow becomes a torch. There are more generic things, like these Eldertides somehow being mythical things no one has ever seen, yet they seem to be pretty common occurrences. The dimensions of the cellar are completely unclear; at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they could hear him, or the innkeeper suddenly calling Kael by his name as if they already know each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
Thank you for taking the time to do a thorough read, I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style the readers enjoy.
And it also, as you say, really reuses words. Just reading through, I notice "phosphorescence" 4 times in this chapter, for example, and "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
From a technical standpoint, this is incredible. A few years ago, computers had problems creating grammatically correct sentences. Producing a consistent narrative like this was science fiction.
From an artistic standpoint, the result is... I'd say: incredibly mediocre, with some glaring errors in between. This does not mean that an average person could produce a similar chapter. Gemini can clearly produce better prose than the vast majority of people. However, the vast majority of people do not publish books. Gemini would have to be on par with the best professional writers, and it clearly isn't. Why would you read this when there is no shortage of great books out there? It's the same with music, movies, paintings, etc. There is more great art than you could ever consume in your lifetime. All LLMs/GenAI do in art is pollute everything with their incredible mediocrity. For art (and artists), these are sad times.
It's more nuanced than that. There is certain material/content that it is mandatory or necessary to read.
Ideally I'd prefer to read material written by a top-1%ile expert in the field, but due to constraints you almost always end up reading material written by a midwit, an intern, or a junior associate. In which case AI-written content is much better, especially as I can interrogate the material and approach top-1%ile quality.
Quality is its own property, separate from its creator. If a machine writing something bothers you irrespective of quality, then don't read it. You think I would care? I would not.
If this ever gets good enough to write your next bestseller or award winner, I might not even share it, and if I did, I wouldn't care whether some stranger read it or not, because it was created entirely for my pleasure.
> Not a big fan of that since I really miss the time where authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it.
Using the AI in multiple phases is an approach that can handle this.
Similar to the "Deep Research" approach - you can tell it to first generate a storyline with multiple twists and turns, then ask the model to take this storyline and generate prompts for individual chapters, then ask it to generate the individual chapters based on those prompts, etc.
But a future chatbot would be able to internally project manage itself through that process, of first emitting an outline, then producing draft chapters, then going back and critiquing itself and finally rewriting the whole thing.
Yes, and that's why many people in the discussion here are very optimistic that chatbots will have solved this problem very soon. Either with the approach you suggest, or with something else (perhaps more general, and less directly programmed in).
It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you try breaking it down, which I have, you run into the same problem unless you hold its hand for every single page, and then - what's the point? I want to read the story, not co-author it.
I dunno, there's a certain amount of fun in "writing" a book with ChatGPT. Like playing a video game with a bunch of different endings instead of watching a movie with only one. Does the hero save the day? Or turn into a villain! You decide!
The etymology is pretty much irrelevant. In, e.g., German, the word for novel is 'Roman', but German readers don't expect their novels to be any more romantic, nor do English readers expect their novels to be more novel.
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.
I don't disagree. Like with almost anything else involving LLMs, getting hands-on produces better results. But because in this instance I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.
Random question: if you don't care about being a creator yourself, why do you even want to read long-form writing written by an LLM? There are literally 10,000s of actual human-written books out there, all of them better than anything an LLM can write. Why not read them?
> There are literally 10,000s of actual human-written books out there, all of them better than anything an LLM can write. Why not read them?
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
Authors tell stories they want to tell, and readers read stories they want to read. The two don't necessarily overlap, or don't overlap strongly enough. If you're even a little bit specific (nowhere near as specific as the above prompt, even just something like the dynamic between protagonists), then you don't actually have 10,000s of human-written books. Not even close. Maybe it exists, and maybe you'll find it good enough, but if it's only been read by a few hundred or thousand people? Good luck getting it recommended.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to see if the machine is finally good enough and I can be a bit more specific.
It's very hard to find good books written by humans. GoodReads is okay, but you quickly run out of high-end recommendations. I read mostly sci-fi, and the books that everyone recommends rarely end up being 10/10. But then I see some random recommendation on Reddit or HN, and it ends up being amazing.
That was what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try; at a glance it's convincing though); however, it was so slow that testing was kind of impossible. You cannot really have things done in parallel either, because it does need to know what it pushed out before, or at least a summary of it.
I envision that one day a framework will be created that can persist an LLM's current state to disk so that "fragments of memories" can be paged in and out of memory.
When that happens, LLMs will be able to remember everything.
I had Grok summarize + evaluate the first chapter with thinking mode enabled. The output was actually pretty solid: https://pastebin.com/pLjHJF8E.
I wouldn't be surprised if someone figured out a solid mixture of models working as a writer (team of writers?) + editor(s) and managed to generate a full book from it.
Maybe some mixture of general outlining + maintaining a wiki with a basic writing and editing flow would be enough. I think you could probably find a way to maintain plot consistency, but I'm not so sure about maintaining writing style.
I have never used an LLM for fictional writing, but I have been writing large amounts of code with them for years. What I'd recommend is when you're defining your plan up front as to the sections of the content, simply state in which phase / chapter of the content they should meet.
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context-setting (maybe a page or two).
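For instance, something along these lines (a made-up skeleton just to show the kind of sectioning meant by XML-style tags; the tag names are arbitrary):

    <role>You are a fantasy novelist writing a slow-burn epic.</role>
    <world>...world bible, factions, magic rules...</world>
    <characters>...profiles, including the love interest...</characters>
    <pacing>The love interest must not appear before chapter 8.</pacing>
    <style>Close third person, minimal adverbs.</style>
    <task>Write chapter 1 only, roughly 3,000 words.</task>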
I'm terrible at writing, but I love reading. I've got ideas for novels, but I struggle to put them down.
What I have found works is to give the LLM the "world" outline at the beginning, then feed it a one-line summary of each chapter and get it to write one chapter at a time.
The problem is that the quality of the results drastically decreases as the context length grows. After about 10 chapters the dialogue starts to get real snippy. I've tried getting it to summarize all the previous chapters and feeding that back in, but it never includes enough detail.
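One way to keep the context bounded is a rolling summary instead of the full text of previous chapters. A rough sketch (llm() is again a hypothetical wrapper around whatever model you're calling; the prompts are only illustrative):

    def write_book(world_outline, chapter_briefs, llm):
        story_so_far = ""          # rolling condensed summary, not full chapters
        chapters = []
        for i, brief in enumerate(chapter_briefs, start=1):
            chapter = llm(
                f"World outline:\n{world_outline}\n\n"
                f"Story so far (condensed):\n{story_so_far}\n\n"
                f"Write chapter {i}. One-line brief: {brief}\n"
                "Keep names and established facts consistent."
            )
            chapters.append(chapter)
            # Re-summarize after every chapter so the prompt stays small;
            # this is exactly the step where detail gets lost, as noted above.
            story_so_far = llm(
                "Condense into key plot points, character state and open "
                f"threads (max 300 words):\n{story_so_far}\n\n{chapter}"
            )
        return chapters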
The only way to get better at something is to do it. Start writing short stories or small novels, and you will get there over time. You don't even have to be a great writer to write a great book as well :). It helps, but readers will forgive a lot along your journey.
Brandon Sanderson has a great series of lectures on how he approaches it that are awesome ->
I am working on some world-building for something I want to write one day, but I am trying just to write little things to help. I write a lot of nonfiction stuff for work, but I am worried that it might not translate as well to characters...
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious, the riddle is: there are three people in a circle. Each person has a positive integer floating above their head, such that each person can see the other two numbers but not their own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
That's a non sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I might have been phrasing poorly. With _L_ (or L as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (didn't come around to TFA yet). Not sure if this question is just about model size.
I'm eagerly awaiting an article about RAG caching strategies though!
There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to raise one that has the right answer written on it.
- This kid is a genius! - you yell
- But wait, the kid has just picked an answer from the ground, it didn't actually come up...
- But the other toddlers could do it also but didn't!
Other models aren't able to solve it so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65 and Gemini is still able to properly reason through the problem
I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving (the "answer") it is online, and the model has surely seen it.
Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that an LLM can reason. It just regurgitates the input.
I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.
I would say maybe about 80%, certainly not 99.99%. But I saw in college that some people could only solve problems that were pretty much the same as ones they had already seen, while notably some guys could easily come up with solutions to complex problems they had not seen before. I am of the opinion that no human at age 20 can have had the amount of input an LLM has today, and still 20-year-olds come up with very new ideas pretty often (new in the sense that they have not seen that or anything like it before). Of course there are more and less creative/intelligent people...
Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?
It's entirely unclear what are you trying to get across, at least to me.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates, and why, is frowned upon here. I don't think your comment does a great job of the latter.
>> So it’s likely that it’s part of the training data by now.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
Your intentions are good, but your execution is poor.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
I have translated the Chinese. I still have no idea what point you're trying to make. You ask it questions about some kind of band, and it answers. Are you saying the answers are wrong?
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from the text you provided.
This is solvable in roughly half an hour with pen and paper by a random person I picked with no special math skills (beyond a university education). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.
Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, far, far harder than this math puzzle).
The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.
I agree with your general point that this "random person" is probably not representative of anything close to an average person off the street, but I think the phrasing "very very intelligent" and "exceedingly clever" is kinda misleading.
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't, has more to do with persistence and ability to maintain focus, rather than "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (eg. as part of pre-interview test prep), and in most cases, those who solve it and those who don't have the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because this frustration itself comes from a belief that the ability to solve these belongs to some "exceedingly clever" people only, and not to someone like them. So, this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
I solved it in less than 15 minutes while walking my dog, no pen or paper. But I wouldn't claim to be a random person without math skills. And my very first guess was correct.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
So in the three hours since you read the puzzle in the parent comment, you stopped what you were doing, and managed to get some other "random" person to stop what they were doing and spend half an hour of their time on a maths puzzle that, at that point, prior experience suggested could take a day? All within three hours?
That's not to say that you didn't, or that you're recalling a previous occasion that happens to involve this exact puzzle (despite there being scant prior references to this puzzle, which is precisely the reason for using it). But you can see how some might see that as not entirely credible.
Best guess: this random person is someone who really likes puzzles, is presumably good at them, and is very, very far from being as representative as your argument would require.
> This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).
I randomly answered this post and can't solve it in half an hour. Is the point LeetCode but for AI? I'd rather it solve real problems than "elite problems".
Side note: couldn't even find pen and paper around in half an hour.
This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.
What makes this interesting is that while the question is online (on Reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.
Granted, this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here, in future training runs).
>Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.
I don't think Gemini does an unnecessary amount of computation; it's just more verbose. This is typical of reasoning models: almost every step is necessary, but many would not be written down by a human.
Why is this a great riddle? It sounds like incomplete nonsense to me:
It doesn't say anything about the skill levels of the participants, whether their answers are just guesses, or why they aren't just guessing the sum of the other two people's numbers each time they're asked to provide more information.
It doesn't say the guy saying 65 is even correct.
How could three statements of "no new information" give information to the first guy that didn't know the first time he was asked?
Persons 2 and 3 saying they don't know eliminates some of the uncertainty person 1 had about his own number (any combination in which the other two would see numbers that could tell them their own). After those possibilities are eliminated, person 1 has narrowed it down enough to actually know, based on the numbers shown above the other two. The puzzle could instead have been done in the order 2, 3, 1, and person 1 would not have needed to go twice.
I guess really the only missing information is that they have the exact same information you do, plus the numbers above their friends' heads.
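If you want to sanity-check that reasoning, the whole thing is small enough to brute-force. A sketch (the knowledge bookkeeping is exact, but the outer search over candidate worlds is bounded at 1,000, which is an assumption rather than a proof of uniqueness):

    from functools import lru_cache

    # Persons 0, 1, 2 each hold a positive integer; one number is the sum of
    # the other two. They say "I don't know" in the order 0, 1, 2, and then
    # person 0 announces their number: 65.

    def candidates(world, person):
        # Values this person could have, given the two numbers they can see.
        a, b = [n for i, n in enumerate(world) if i != person]
        return {a + b} | ({abs(a - b)} if a != b else set())

    @lru_cache(maxsize=None)
    def knows(world, person, k):
        # Can `person` deduce their number after the first k "don't know"s?
        options = [v for v in candidates(world, person)
                   if consistent(world[:person] + (v,) + world[person + 1:], k)]
        return len(options) == 1

    @lru_cache(maxsize=None)
    def consistent(world, k):
        # Is `world` compatible with persons 0..k-1 each not knowing at their turn?
        return all(not knows(world, j, j) for j in range(k))

    answers = set()
    for b in range(1, 1000):                 # bounded search over the other two numbers
        for c in (65 - b, 65 + b, b - 65):
            if c >= 1 and 65 + b + c == 2 * max(65, b, c):   # one number is the sum
                w = (65, b, c)
                if consistent(w, 3) and knows(w, 0, 3):
                    answers.add(w)

    print(answers)                               # {(65, 26, 39)}
    print({x * y * z for x, y, z in answers})    # {65910}

Note the asking order matters: (65, 39, 26) does not come out as a solution, which matches the point above about the order 2, 3, 1.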
Everyone with limited bandwidth has been trying to limit site access for robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.
There are websites you can only register for in person, with two existing members vouching for you. It can probably still be gamed, but it sounds like a great barrier to entry for robots (for now).
Admins will see unusual traffic from that account and then take action. Of course it will not be perfect, as there could be a way to mimic human traffic and slowly scrape the data anyway; that's why there is an element of trust (two existing members vouching).
Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy; I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit - and at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).
I have bad news for you if you think non-paywalled / non-phone#-required Discord communities are immune to AI scraping, especially as it costs less than hammering traditional websites, since the push-on-change event is done for you in real-time chat contexts.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
I'm not sure what you mean but I'm trying to say our current LLMs are not artificially intelligent and calling them "AI" has confused a lot of the lay public.
You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value, how did they know?
In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. human population based off this, even setting aside lottery-ticket-hypothesis-like concerns.
Same here: my problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.
DeepSeek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty; I suspect the electricity I used for this could have been put to better use baking a frozen pizza.
You can also put the AI in the first person's shoes.
Prompt:
You are standing in a circle; there are 2 other people in the circle with you. Everyone in the circle has a positive integer above their head; no one knows what the number above their own head is, but they can see the numbers above the heads of the other people. You see that the person in front of you on the left has 26 above their head. The person on the right has 39 above their head. You are told that the sum of two of the numbers is the third number. You are asked what the number above your head is; the options are the sum, 65, or 13, as 26 + 13 = 39. You don't know which one it is, and you say so. The second person is asked the number above their head. They also say they don't know, and the third person also says they don't know. What is your number?
Gemini 2.5 and Claude 3.7 thinking get it right; o3-mini and 4o get it wrong.
I just asked it this twice and it gave me 65×65×130=549250. Both times. The first time I made it about ducks instead of people and mentioned that there was a thunderstorm. The second time I c/p your exact text and it gave me the same answer.
Again we find that the failure state of LLMs is a problem – yeah, when you know the answer already and it gets it right, that's impressive! When it fails, it still acts the same exact way and someone who doesn't already know the answer is now a lil stupider.
I use an algorithmic question that I'd been working on for years and that I'm finally writing up the answer to.
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
A Prolog program, for swipl (it takes less than a second to solve your puzzle).
N is the number of turns of "don't know" answers.
The bad predicate means that the person can know their number at turn N.
bad(_,_,_,-1) :- !,false.
bad(_,A,A,0) :- !.
bad(A,_,A,0) :- !.
bad(A,A,_,0) :- !.
bad(B,C,A,N) :- D is abs(B-A),D<C,N1 is N-1, bad(B,D,A,N1),!.
bad(C,A,B,N) :- D is abs(B-A),D<C,N1 is N-1, bad(D,A,B,N1),!.
bad(A,B,C,N) :- D is abs(B-A),D<C,N1 is N-1, bad(A,B,D,N1),!.
solve(X,Y,Z) :- Y1 is X-1, between(1,Y1,Y),
between(0,2,N), Z is X-Y,bad(X,Y,Z,N).
?- solve(65,X,Y).
X = 26,
Y = 39 ;
X = 39,
Y = 26 .
Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn't know. The same goes for p2 and p3. Clearly p1 somehow gains information from p2 and p3 passing. Either he realizes that he must be a or b, and as such his number is the difference between p2's and p3's numbers, or he realizes that he must be c, and so his number is the sum of p2's and p3's numbers.
That’s all I have so far. Anyone have other ideas?
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: 5 * 13. We can check the other numbers: 2(13) = 26, 3(13) = 39. And technically, you don't need to find the other numbers. The final answer is 5A * 2A * 3A, or 30(A^3); with A = 13 that's 65,910.
"Which means we're one of these [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]"
Why? Couldn't it be any of an infinite number of size-3 arrays in terms of A where two elements sum to the third, e.g. [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
I think that answer was poorly phrased, because those possibilities are eliminated in a sense. There is a better answer further down the thread that explains: "If the solution was not one of the flipped triplets, then the first player would not have worked out the solution." Thus, if it was one of your other infinite triplets (e.g. 65, 12, 53), then in round 2 player 1 would've still answered 'I don't know'. Since they did respond with a definitive answer, it had to be one of the formula solutions, since those were the only solutions they could prove. And since the only leading factor that divides 65 is 5, the correct formula must be [5A, 2A, 3A] and thus [65, 26, 39].
You should be able to generate an infinite number of these problems just by scaling the first factor of a formula by a prime. Take the same question, but the person answers '52': that restricts you to either [4a, 3a, a] or [4a, a, 3a]. Since the question only asks for the product of all the terms, the answer is the same either way: 52 * 39 * 13 = 26,364.
Look at it this way: Person 1 sees the numbers 26 and 39, and has to guess his own number. It must be one of only 2 possibilities: 13 or 65. All he has to do is eliminate one of those possibilities.
I think it has something to do with applying the lower bound of 1.
If p1 KNOWS that he’s the largest then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 was 1 then the other two would have known p1 couldn’t be the sum of the other two
One of the cases has to be true, not all 3. (as you show, they're mutually exclusive for positive integers) i.e. "either" is important in the parent comment.
Which is why I indicated that it would be a misreading of the problem.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
> The original problem is a little ambiguously worded.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
There's a certain mind that either doesn't realize they're sidestepping the problem and turning it into an editing review, or realizes it and doesn't understand why it seems off-topic/trivial to others.
What's especially strange here is, they repeatedly demonstrate if you interpret it that way, the problem is obviously, trivially, unsolvable, in a way that a beginner in algebra could intuit. (roughly 12 years old, at least, we started touching algebra in 7th grade)
I really don't get it.
When I've seen this sort of thing play out this way, the talking-down is usually for the benefit of demonstrating something to an observer (i.e. I am smart look at this thing I figured out; I can hold my own when the haters chirp; look they say $INTERLOCUTOR is a thinker but they can't even understand me!), but ~0 of that would apply here, at least traditionally.
We have three people in a circle. Let’s call them:
• Person A (first person)
• Person B (second person)
• Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy:
A + B = C, or B + C = A, or C + A = B
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions:
The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again:
• Exactly one of the following three equations is true:
• A + B = C, or
• B + C = A, or
• C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions:
• Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers.
• Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions.
• Person C also doesn’t know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round:
• Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let’s test each carefully given A = 65:
1. Case 1: A + B = C
Then we have:
• 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two.
• However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario.
• Let’s hold this scenario and continue examining others first.
2. Case 2: B + C = A (So, B + C = 65)
• Then Person A sees numbers B and C, adding exactly to 65.
• Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense.
• Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible.
3. Case 3: C + A = B (So, C + 65 = B)
• Then Person A sees numbers B = C + 65 and C.
• Person A sees C and C+65. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here’s why clearly stated:
• Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers.
• Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C.
• After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple:
• Numbers: 65, 39, 26
• Check quickly: indeed, 39 + 26 = 65.
Checking logic quickly:
• Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13).
• Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous.
• Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don’t know:
• Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13+26=39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified clearly are:
• A = 65
• B = 39
• C = 26
Then, the product is:
65 × 39 × 26 = 65 × 1,014 = 65,910
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.
Plus it drew me a very decent pelican riding a bicycle.
Have you considered that they must be training on images of pelicans riding bicycles at this point? ;-) At least given how often that comes up in your reviews, a smart LLM engineer might put their fingers on the scale a bit and optimize for the things that come up a lot in reviews of their work.
I think a competent 5yro could make a better pelican on a bicycle than that. Which to me feels like the hallmark of AI.
I mean, hell, I have drawings of leaves from when I was eight, and they are botanically accurate enough to still be used for plant identification, which itself is a very difficult task that people study for decades. I don't see why this is interesting or noteworthy; call me a neo-Luddite if you must.
I wonder how far away we are from models which, given this prompt, generate that image in the first step in their chain-of-thought and then use it as a reference to generate SVG code.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
Someone managed to get ChatGPT to render the image using GPT-4o, then save that image to a Code Interpreter container and run Python code with OpenCV to trace the edges and produce an SVG: https://bsky.app/profile/btucker.net/post/3lla7extk5c2u
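Not their actual code, but the render-then-trace idea is only a few lines with OpenCV. A rough sketch, assuming you already have the rendered raster saved as pelican.png:

    import cv2

    img = cv2.imread("pelican.png", cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 100, 200)                      # edge map of the raster
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    h, w = img.shape
    paths = []
    for c in contours:
        pts = c[:, 0, :]                                  # (N, 2) array of x, y points
        d = "M " + " L ".join(f"{x},{y}" for x, y in pts) + " Z"
        paths.append(f'<path d="{d}" fill="none" stroke="black"/>')

    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">'
           + "".join(paths) + "</svg>")
    with open("pelican.svg", "w") as f:
        f.write(svg)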
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
Am I correct in assuming that accuracy < using correct edit format? i.e. it made mistakes in 27% of the problems, 11% of which were due to (at least) messing up the diff format?
In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
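The recover-by-retrying part is essentially a validate-and-reprompt loop. Illustrative only (not aider's actual implementation; parse_edits() and ask_model() are hypothetical stand-ins for a diff parser and a chat API):

    def get_valid_edits(prompt, ask_model, parse_edits, max_retries=3):
        reply = ask_model(prompt)
        for _ in range(max_retries):
            try:
                return parse_edits(reply)        # raises ValueError if malformed
            except ValueError as err:
                reply = ask_model(
                    f"{prompt}\n\nYour previous edit block was malformed "
                    f"({err}). Resend it in the required diff format."
                )
        raise RuntimeError("model kept producing malformed edits")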
Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have been just incremental updates over the previous gen, only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones a year even in the product's heyday.
These LLMs were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to take less time.
If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.
For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.
So, I think these results are very interesting, if you know what features specifically you are using.
But they score it on their own benchmark, on which, coincidentally, Gemini models have always been the only good ones. On NoLiMa or BABILong we see that Gemini models still can't do long context.
Reasoning was supposed to be that for "Open" AI, that's why they go to such lengths to hide the reasoning output. Look how that turned out.
Right now, in my opinion, OpenAI has actually a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
If you've seen DeepSeek R1's <think> output, you'll understand why OpenAI hides their own. It can be pretty "unsafe" relative to their squeaky-clean public image.
I was looking at this the other day. I'm pretty sure OpenAI runs the internal reasoning through a model that purges the reasoning and makes it worse to train other models from.
I might be mistaken, but originally wasn't the reasoning fully hidden? Or maybe it was just far more aggressively purged. I agree that today the reasoning output seems higher quality than it did originally.
Why not snooze the news for a year and see what's been invented when you get back? That'll blow your mind properly, because each of these incremental announcements contributes to a mind-blowing rate of improvement.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
Serious question: Has anyone tested how much money you can actually make doing a month of Amazon Mechanical Turk? (It would make for an interesting YouTube video!) I am curious if it is middle class wages in very poor countries (like Nigeria). Some light Googling tells me that middle class salary in Nigeria is about 6K USD, so about 3 USD/hour (assuming: 50 weeks/year * 40 hours/week = 2000 hours/year). Is this possible with MTurk?
That's ok. AI will kill those off soon enough, and like all winners, rewrite history enough so that that inconvenient theft never happened anyway. It's manifest destiny, or something.
I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what OpenAI has to offer, especially for the money.
Also, I think Google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headlessly and pick flights or food for me. That's the real workflow unlock, IMO.
> I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI, nor for Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using gnuplot. To paraphrase, they said "it is like having a good intern".
To clarify, by "doing the opposite" I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that.
Glaringly missing from the announcements:
concrete use cases and products.
The Achilles heel of LLMs is the distinct lack of practical real-world applications.
Yes, Google and Microsoft have been shoving the tech into everything they can fit,
but that doesn't a product make.
I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.
Is that article trying to argue that 500M people every week are visiting ChatGPT for the first (or second) time after reading about it in the news?
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.
Was going to comment the same thing, which has been bugging me lately on all announcements that start with "our" followed by empty superlatives. Happy to not be alone on this!
Cancelled my account a long time ago. Gemini models are like a McDonald's croissant: you always give them an extra chance, but they always fall apart in your hands...
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:
Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:
How human reviewers improve Google AI
To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!
If I'm not wrong, ChatGPT states clearly that they don't use user data by default anymore.
Also, maybe some services are doing "machine learning" training with user data, but this is the first time I've seen a recent LLM service saying that it can feed your data to human reviewers at will.
I believe this is out of date. There’s a very explicit opt in/out slider for permitting training on conversations that doesn’t seem to affect conversation history retention.
You can use a paid tier to avoid such issues. Not sure what you're expecting from these "experimental" models, which are in development and need user feedback.
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
360k tokens = how many lines of code, approximately?
And also, if it's an open source lib, are you sure there's no mention of this bug anywhere on the web?
Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.
Impressive. I tend to think it managed to find the bug by itself which is pretty crazy without being able to debug anything. Then again I haven't seen the bug description, perhaps the description makes it super obvious where the problem lies.
How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!
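For what it's worth, you don't need anything special to pass a whole codebase - you can just concatenate the files into one prompt, either by pasting into AI Studio or via the API. A minimal sketch, assuming the google-generativeai Python package and a hypothetical directory name:

import pathlib
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; key comes from AI Studio

# Hypothetical library directory; adjust the glob to your language.
parts = []
for path in sorted(pathlib.Path("my_dart_lib").rglob("*.dart")):
    parts.append(f"--- {path} ---\n{path.read_text()}")

prompt = (
    "Here is the full source of a library, followed by a bug report.\n\n"
    + "\n\n".join(parts)
    + "\n\nBug: <description of the observed behaviour>\n"
      "Identify the most likely root cause and where the fix belongs."
)

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
print(model.generate_content(prompt).text)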
Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.
> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.
Been playing around with it and it feels intelligent and up to date. Plus it's connected to the internet, and it acts as a reasoning model by default when it needs to.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
It is almost certainly the "nebula" model on LMArena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.
I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?
I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.
I think it's reasonable. The development process is just not really comparable to other software engineering: it's fairly clear that currently nobody really has a good grasp on what a model will be like while it is being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.
I figured you don't update the major unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained ChatGPT 3 vs GPT 4, other than amount of data. But maybe I'm wrong.
As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.
Or drop the pretense of version numbers entirely, since they're meaningless here, and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology
Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over a large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.
Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and at remembering older tokens.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
Traditionally at Google experimental models are 100% free to use on https://aistudio.google.com (this is also where you can see the pricing) with a quite generous rate limit.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
Right but the tweet I was responding to says: "This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!"
I assumed that meant there was a paid version with a higher rate limit coming out today
From the 2.0 line onward, the Gemini models have been far better at engineering-type questions (fluids etc.) than GPT or Claude, especially with questions that have images requiring more than just grabbing text. This is even better.
Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).
Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive), but querying it about anything from early 2025 or mid/late 2024, it self-reports that its cutoff is in 2023 (which can't be right).
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024) and bun workspaces (from around that time?), but does know how to install bun on Windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
Update: I tried a different version of the prompt and it's doing really well! Well, so far it's gotten out of its house and into Professor Oak's lab, which is not so impressive compared to ClaudePlaysPokemon, but it's a lot more than Gemini 2.0 was able to do with the same prompt.
Yea, I get a little bummed, but I guess a lot of HNers have reasons to not like Google. I've had a Google One membership forever, so I opted for the higher subscription with Gemini access since the beginning (plus a free year with a new Pixel phone), and I think it is awesome.
Most of us care only about coding performance, and Sonnet 3.5 has been such a giant winner that we don't get too excited about the latest model from Google.
For me personally - rate limit of 50/day means that I can't use it as daily driver so I'll have to go back to Sonnet which will gladly accept my money for more. Then I just forget it exists.
Yeah, if I don’t have higher rate limits, it’s useless. This just sounds like a gimmick launch where they want to gather feedback. It will be a couple of months before this will be GA.
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
Memory grows linearly, compute grows quadratically (but with a small constant - until ~100k tokens, inference will still be dominated by non-quadratic factors).
Also, reusing keys/values for different queries can compress the KV cache; it can be a 1000x or 10000x improvement in bandwidth if the model is trained for it.
Just to clarify: simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
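To make the shared-prefix point concrete, here's a toy sketch of the idea in Python - not any framework's actual API, just the caching pattern: the per-token key/value work for a long shared prefix is done once and reused, so each new query only pays for its own tokens.

from functools import lru_cache

def kv_for(tokens: tuple) -> list:
    # Stand-in for the per-token key/value projections an attention layer computes.
    return [hash(t) for t in tokens]

@lru_cache(maxsize=8)
def cached_prefix_kv(prefix: tuple) -> tuple:
    # Computed once, then shared by every request that starts with this exact
    # prefix (e.g. a long system prompt or a big document).
    return tuple(kv_for(prefix))

def kv_cache_for_request(prefix: tuple, query: tuple) -> list:
    # Only the query tokens are new work; the prefix KV is a cache hit.
    return list(cached_prefix_kv(prefix)) + kv_for(query)

doc = tuple(range(200_000))             # long shared context
kv_cache_for_request(doc, (1, 2, 3))    # pays the prefix cost once
kv_cache_for_request(doc, (4, 5, 6))    # reuses the cached prefix KV

Real inference frameworks do the same bookkeeping at the KV-block level inside the attention layers, which is where the big shared-prefix latency and throughput wins come from.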
Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts." After the 4th try, I said "context, spelled "c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."
I think google is digging a hole for themselves by making their lightweight models be the most used model. Regardless of what their heavy weight models can do, people will naturally associate them with their search model or assistant model.
I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text based model and couldn't read text out loud.
For as amazing as these things are, AGI they are not.
I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning as well as structured output
Recommend trying it out at https://aistudio.google.com
This looks like the first model where Google seriously comes back into the frontier competition? 2.0 Flash was nice for the price, but it's more focused on efficiency, not performance.
A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
From https://x.com/OfficialLoganK/status/1904583353954882046
This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard:
https://github.com/vectara/hallucination-leaderboard
They even priced it so people would avoid using it. GPT-4.5's entire function was to be the anchor of keeping OpenAI in the news, to keep up the perception of releasing quickly.
My assumption was that the pricing was because it really was that expensive for whatever reason. I'm keeping fingers crossed that they're going to do some kind of 4.5 mini at some point that will be more affordable.
You're not wrong, but that just means the <adjective> is where the bulk of information resides. The trade-off matters. Maybe it's a model with good enough quality but really cheap to serve. Maybe it's a model that only plays poker really well but sucks at everything else because it bluffs too much. Etc. etc.
With recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google clearly is in the lead in terms of data access in my view. If I am a top talent in AI, I’d go where I can work with the best data no?
I think an argument could be made for hardware too. Perhaps in absolute terms Nvidia is ahead, but in terms of knowing how to get the most out of the hardware, Google making its own chips, building on their networking, etc, is a pretty big advantage.
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
The problem Goog has is its insane bureaucracy and lack of vision from Sundar, which isn't very attractive from an employee position. If you're working close to Demis I imagine the situation is better though.
UX is actually increasingly the bottleneck. Most of the top models are very good if you micromanage their context and prompts. But people aren't very good at that stuff.
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this gazillions times per day. Mostly simple stuff with really short prompts, "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a GitHub connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor, just look at those. I don't want to have to manually open files or specify what files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's a great UX. Don't make me do work.
That's a good UX.
I use Gemini mainly because it's integrated into Google's tools, so it's kind of there. And Chat GPT, for whatever reason, cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context - the thing you are looking at - that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are quite a bit more polished currently.
For example, Gemini's speech recognition struggles with longer pauses and often cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying “[...] plex, emby and Jellyfin [...]” is usually understood by Whisper, but less often by Gemini).
The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy friendly pair programming). Last point off the top of my head would be the ability to edit messages beyond just the last one. That's possible in ChatGPT, but not in Gemini.
Googlers are spending so much money on model training; I would appreciate them spending some on making it fun to use :)
I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows up with even worse code. It does not even compile!
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7 which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3 which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers" but the only mention of o3-mini having a score of 86.5 I could find was from this other source [2]
It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.
I noticed this too. I have used both o1 and o3-mini extensively, and I have run many tests on my own problems; o1 solves one of my hardest prompts quite reliably, but o3-mini is very inconsistent. So from my anecdotal experience, o1 is a superior model in terms of capability.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
Why would you compare against all the models from a competitor? You take their latest one that you can test. OpenAI or Anthropic don't compare against the whole Gemini family.
Gemini refuses to answer any questions on proportional swing models or anything related to psephology on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat are that neutered.
curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
jq -r 'recurse(.children[]) | .author + ": " + .text' | \
llm -m "gemini-2.5-pro-exp-03-25" -s \
'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.
But then there are two questions. First, are the white collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white collar workers at the very right tail, e.g. scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
It's a complex proposition.
I think Satya was talking about actual GDP growth, right?
In theory, let's say all knowledge work is now 50% faster due to A.I. Well then I would assume this should affect civil society as well - planning a bridge, a railway etc. should happen faster and more efficiently (the actual building of things won't, but a lot of time is spent on planning and red tape).
Healthcare in general should become way more efficient with people getting better treatment; this should have a positive economic effect.
It does seem to me like it should be able to speed things up in the real world but of course a lot will have to do with how well the models can reason / how often they make catastrophic mistakes + the will of the governments and people to start using them seriously.
But it's more complex than that - if many people start losing their jobs we all take a hit on GDP because they can't consume as much anymore, so it could take perhaps a long time until GDP actually sees meaningful gains.
And one last thought - Satya likely hasn't spent much time thinking about GDP; it's just not his field. He's a smart guy for sure, but this isn't what he does.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
> Unemployment hasn't really picked up, and is unlikely to do so
That's an important assessment. I don't know if you're right. If the models are going to continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we are promised A.I will create tons of new jobs no one has imagined yet, but I haven't seen a reliable clue for such jobs yet).
I can feel this already with my own use of language models.
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
> As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005
How many people can make an actual living out of YouTube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame - which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "YouTuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
No, it won't (necessarily) be AI that's creating the new jobs. In general, when a new technology comes along and automates away some jobs, you can't expect the same technology to provide the new jobs.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented to do so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
> It's just that the market will find stuff for people to do for money
Will it ?
Let's take my example: I'm a 41-year-old male with around 15 years of experience in software development. Let's say 4 years from now I and a million others lose our development jobs to A.I.
What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining + salary reduction), I can go into the trades (same) or get some other work that's hard to automate like caring for old people (very low salary). All of these options involve massive salary reduction, and that's in the positive scenario that I actually am able to retrain and survive such a shift mentally.
It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
Well, we have many examples where in the past technology (and to a lesser extent trade) has led to some sectors of the economy using fewer people than before.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.
> Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
I'm just trying to get the point across that unemployment might rise, so GDP may fall; in fact I think that should be the baseline scenario, rather than assuming some new jobs we can't imagine yet will be created.
It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8.
But letting the market sort this out ? I don't think so.
> It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
So I admit that this is a serious possibility that we need to consider.
But for the argument to make sense, we can't just talk about the general 'Oh, new technology will make a bunch of jobs obsolete.' We have to specifically talk about what (might) make AI special in that it might be even more general than electricity.
You didn't mention these special factors in your original comments.
I am not sure whether AI will be different or not, or rather I don't know how different it will be.
So far I see it as a good sign that we have many relatively equally competitive models from different providers, and some of them have open weights and some of them even have completely open sources (including training algorithms). So at least it's unlikely for the technology to be monopolised by any one entity.
> There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
If you want to involve the government, I'd rather give everyone a basic income, than to give our pupils inferior teachers and our sick people inferior nurses. (After all, we are assuming that humans will be worse at these jobs than the AI.) Also, I'd rather have people enjoy whatever it is they want to do, instead of being forced into some government provided make-work programme.
As it stands currently I'd say this is difficult to measure.
They're not baked into workflows where the measurable output is attributed easily to the model use. Productivity in its current form is transformative in the sense that the use case and gain differs for the individual (who even provide different prompts). So some are keeping the gains for themselves, others are using it to improve quality rather than quantity.
It'll come in time; it's important to remember GPT-4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company who are looking to use it to reduce costs in the next year.
That’s true, but productivity has many factors and takes a long time to get confidence on. Any productivity value that could be stated clearly would have similar downsides to a benchmark, and take far longer.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
They have access to o3, and so do I. Thousands of people do (tens of thousands at this point?). Come on. Compare to SOTA when you're saying it's the best AI you have.
I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
why not enable Canvas for this model on Gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run..
It will be huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.
There's probably a sweet spot here. On the flip side, ChatGPT currently doesn't indicate whether a given image generation request was serviced by multimodal GPT-4o [1] or Dall-E.
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
Can anyone share what they're doing with reasoning models? They seem to only make a difference with novel programming problems, like Advent of Code. So this model will help solve slightly harder advent of codes.
By extension it should also be slightly more helpful for research, R&D?
Have been using them for non-interactive coding where latency is not an issue. Specifically, turning a set of many free-text requirements into SQL statements, so that later when an item's data is entered into the system, we can efficiently find which requirements it meets. The reasoning models' output quality is much better than the non-reasoning models like 3.5 Sonnet, it's not a subtle difference.
I found reasoning models are much more faithful at text-related tasks too (i.e. 1. translating long key-value pairs (e.g. Localizable.strings), 2. long transcript fixing and verification, 3. looking at CSV/tabular data and fixing it), probably due to the reflection mechanism built into these reasoning models. Using prompts such as "check your output to make sure it covers everything in the input" lets the model double-check its work, avoiding more manual checks on my end.
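As a rough illustration of that requirements-to-SQL pattern plus the self-check instruction (hypothetical requirement, schema and model name; any OpenAI-compatible reasoning model endpoint would do):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

requirement = "Items must weigh under 2 kg and ship from an EU warehouse."  # hypothetical
schema = "items(id, weight_kg, warehouse_country)"                          # hypothetical

resp = client.chat.completions.create(
    model="o3-mini",  # assumption: swap in whichever reasoning model you use
    messages=[{
        "role": "user",
        "content": (
            f"Schema: {schema}\n"
            f"Requirement: {requirement}\n"
            "Write a SQL WHERE clause that is true exactly when a row meets the "
            "requirement. Check your output to make sure it covers everything "
            "in the input before giving the final answer."
        ),
    }],
)
print(resp.choices[0].message.content)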
Seriously? That doesn't require a human?! Are we talking about some kind of "generic" incident? (Type 3: forgot to manually update the xxxx file.) Or what's going on?
I love to see this competition between companies trying to get the best LLM, and also, the fact that they’re trying to make them useful as tools, focusing on math, science, coding, and so on
I asked for the direction of friction on a ball rolling either up or down an inclined plane - it gave the wrong answer and was adamant about it. Surprisingly, similar to o1.
I gave it a problem which sounds like the Monty Hall problem but is a simple probability question, and it nailed it.
I asked it to tell a joke - the most horrible joke ever.
Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
I know next to nothing about AI, but I just experienced an extraordinary hallucination in a google AI search (presumably an older Gemini model right?) as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
> Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
I tried the beta version of this model to write a business plan (long story).
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
"Hi, here is our new AI model, it performs task A x% better than our competitor 1, task B y% better than our competitor 2" seems to be the new hot AI template in town.
"My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
There is no point in asking such questions, the model doesn't know what it is on its own, and you could get many different answers if you repeat it a few more times.
I can imagine that it's not so interesting to most of us until we can try it with cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix into an overall score, but they're critical for understanding usefulness.
It's "experimental", which means that it is not fully released. In particular, the "experimental" tag means that it is subject to a different privacy policy and that they reserve the right to train on your prompts.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
Thanks. I think my post lacked clarity of what I was talking about. I meant that most people care about API access to use with their favorite editor. It's a big limiter with grok, for example.
But I did mingle that with my knowledge of google's history of releasing without releasing these models which, as you point out, isn't true with this release.
One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it.
Imagine for instance you give the LLM the profile of the love interest for your epic fantasy, it will almost always have the main character meeting them within 3 pages (usually page 1) which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.
This is the first model that after 19 pages generated so far resembles anything like normal pacing even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.
Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
with pastebin - https://pastebin.com/aiWuYcrF
I like how critique of LLMs evolved on this site over the last few years.
We are currently at nonsensical pacing while writing novels.
The most straightforward way to measure the pace of AI progress is by attaching a speedometer to the goalposts.
Oh, that's a good one. And it's true. There seems to be a massive inability for most people to admit the building impact of modern AI development on society.
Oh, we do admit impact and even have a name for it: AI slop. (Speaking on LLMs now since AI is a broad term and it has many extremely useful applications in various areas)
AI slop is soon to be "AI output that no one wanted to take credit for".
I've asked claude to explain what you meant... https://claude.ai/share/391160c5-d74d-47e9-a963-0c19a9c7489a
I'm not sure outsourcing even the comprehension of HN comments to an LLM is going to work out well for your mind
They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.
Haha, so that's the first derivative of goalpost position. You could take the derivative of that to see if the rate of change is speeding up or slowing.
I love this comment.
It's not really passing the Turing Test until it outsells Harry Potter.
> It's not really passing the Turing Test until it outsells Harry Potter.
Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.
Both books that have outsold the Harry Potter series claim divine authorship, not purely human. I am prepared to bet quite a lot that the next isn't human-written, either.
The joke is that the goalpost is constantly moving.
This subgoal post can't move much further after it passes "outsells the Bible" mark.
Why would the book be worth buying, though, if AI can generate a fresh new one just for you?
I don't know. It's a question relevant to all generative AI applications in entertainment - whether books, art, music, film or videogames. To the extent the value of these works is mostly in being social objects (i.e. shared experience to talk about with other people), being able to generate clones and personalized variants freely via GenAI destroys that value.
You may be right, on the other hand it always feels like the next goalpost is the final one.
I'm pretty sure if something like this happens some dude will show up from nowhere and claim that it's just parroting what other, real people have written, just blended it together and randomly spitted it out – "real AI would come up with original ideas like cure for cancer" he'll say.
After some form of that comes another dude will show up and say that this "alphafold while-loop" is not real AI because he just went for lunch and there was a guy flipping burgers – and that "AI" can't do it so it's shit.
https://areweagiyet.com should plot those future points as well with all those funky goals like "if Einstein had access to the Internet, Wolfram etc. he could came up with it anyway so not better than humans per se", or "had to be prompted and guided by human to find this answer so didn't do it by itself really" etc.
From Gary Marcus' (notable AI skeptic) predictions of what AI won't do in 2027:
> With little or no human involvement, write Pulitzer-caliber books, fiction and non-fiction.
So, yeah. I know you made a joke, but you have the same issue as the Onion I guess.
Let me toss a grenade in here.
What if we didn’t measure success by sales, but impact to the industry (or society), or value to peoples’ lives?
Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?
I recently watched some Claude Plays Pokemon and believe it's better measure than all those AI benchmarks. The game could be beaten by a 8yo which obviously doesn't have all that knowledge that even small local LLMs posess, but has actual intelligence and could figure out the game within < 100h. So far Claude can't even get past the first half and I doubt any other AI could get much further.
Now I want to watch Claude play Pokemon Go, hitching a ride on self-driving cars to random destinations and then trying to autonomously interpret a live video feed to spin the ball at the right pixels...
2026 news feed: Anthropic cited as AI agents simultaneously block traffic across 42 major cities while trying to capture a not-even-that-rare pokemon
the true measure of AI: does it have fun playing pokemon? did it make friends along the way?
We humans love quantifiability. Since you used the word "measure", do you believe the measurement you're aspiring for is quantifiable?
I currently assert that it's not, but I would also say that trying to follow your suggestion is better than our current approach of measuring everything by money.
> We humans love quantifiability.
No. Screw quantifiability. I don't want "we've improved the sota by 1.931%" on basically anything that matters. Show me improvements that are obvious, improvements that stand out.
Claude Plays Pokemon is one of the few really important "benchmarks". No numbers, just the progress and the mood.
This is difficult to do because one of the juiciest parts of AI is being able to take credit for its work.
The goalposts will be moved again. Tons of people clamoring that the book is stupid and vapid and only idiots bought it. When AI starts taking over jobs (which it already has), you'll get tons of idiots claiming the same thing.
Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not about surpassing humans.
Of course, this is just some pedantry.
I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.
To be fair, pacing as a big flaw of LLMs has been a constant complaint from writers for a long time.
There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...
This was written on March 15. DeepSeek came out in January. "Era" is not a word I would use for something that happened a few days ago.
I don't know why I keep submitting myself to Hacker News, but every few months I get the itch, and it only takes a few minutes to be turned off by the cynicism. I get that it's from potentially wizened tech heads who have been in the trenches and are being realistic. It's great for that, but any new bright-eyed and bushy-tailed dev/techy, whatever, should stay far away until much later in their journey.
This either ends at "better than 50% of human novels" garbage or at unimaginably compelling works of art that completely obsoletes fiction writing.
Not sure what is better for humanity in long term.
That could only obsolete fiction-writing if you take a very narrow, essentially commercial view of what fiction-writing is for.
I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.
Ahh, now this would be a great premise for a short story (from the mom's POV).
People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it gets past the trivial stuff but is still scratching the boundaries of being useful. And that is not an attempt to make the gen AI tech look bad - it is really amazing what it can do - but it is far from delivering on the hype, and that is why people are providing critical evaluations.
Lets not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real world performance was laughable on real tasks.
> Lets not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real world performance was laughable on real tasks.
That's a better criticism of college exams than the benchmarks and/or those exams likely have either the exact questions or very similar ones in the training data.
The list of things that LLMs do better than the average human tends to rest squarely in the "problems already solved by above average humans" realm.
We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?
I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
In my consulting career I sometimes get to tune database servers for performance. I have a bag of tricks that yield about +10-20% performance each. I get arguments about this from customers, typically along the lines of "that doesn't seem worth it."
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
Yes, if you are Mary Poppins, each individual trick in your bag doesn't have to be large.
(For those too young or unfamiliar: Mary Poppins famously had a bag that she could keep pulling things out of.)
Except at some point the low hanging fruit is gone and it becomes +1%, +3% in some benchmarked use case and -1% in the general case, etc. and then come the benchmarking lies that we are seeing right now, where everyone picks a benchmark that makes them look good and its correlation to real world performance is questionable.
What exactly is the problem with moving the goalposts? Who is trying to win arguments over this stuff?
Yes, Z is indeed a big advance over Y was a big advance over X. Also yes, Z is just as underwhelming.
Are customers hurting the AI companies' feelings?
> Are customers hurting the AI companies' feelings?
No. It's the critics' feelings that are being hurt by continued advances, so they keep moving goalposts so they can keep believing they're right.
The goalposts should keep moving. That's called progress. Like you, I'm not sure why it seems to irritate or even amuse people.
Do we have any simple benchmarks (and I know benchmarks are not everything) that test all the LLMs?
The pace is moving so fast I simply can't keep up. Or is there an ELI5 page which gives a 5-minute explanation of LLMs from 2020 to this moment?
It’s more a bellwether or symptom of a flaw where the context becomes poisoned and continually regurgitates the same thought over and over.
Not really new is it? First cars just had to be approaching horse and cart levels of speed. Comfort, ease of use etc. were non-factors as this was "cool new technology".
In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.
lol wouldn’t that be great to read this comment in 2022
I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since this is very much subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this I would advise them to get a good copy editor.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which obviously is impossible when you look up through a trapdoor, not to mention that previously it was said this entry is behind the bar counter, surely blocking the sight. Kael lights an oily rag which somehow becomes a torch. There are more generic things, like somehow these Eldertides being these mythical things no one has ever seen, yet they seem to be pretty common occurrences? The dimensions of the cellar are completely unclear; at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they were listening to him. The innkeeper suddenly calls Kael by his name as if they already knew each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
Thank you for taking the time to do a thorough read, I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style the readers enjoy.
And it also, as you say, really reuses words. Just reading I notice "phosphorescence" 4 times for example in this chapter, or "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
Regarding your last sentence, I agree. My stance is this: If you didn't bother to write it, why should I bother to read it?
From a technical standpoint, this is incredible. A few years ago, computers had problems creating grammatically correct sentences. Producing a consistent narrative like this was science fiction.
From an artistic standpoint, the result is... I'd say: incredibly mediocre, with some glaring errors in between. This does not mean that an average person could produce a similar chapter. Gemini can clearly produce better prose than the vast majority of people. However, the vast majority of people do not publish books. Gemini would have to be on par with the best professional writers, and it clearly isn't. Why would you read this when there is no shortage of great books out there? It's the same with music, movies, paintings, etc. There is more great art than you could ever consume in your lifetime. All LLMs/GenAI do in art is pollute everything with their incredible mediocrity. For art (and artists), these are sad times.
It's more nuanced than that. There is certain material/content that it is mandatory/necessary to read.
Ideally I'd prefer to read material written by a top-1%ile expert in that field, but due to constraints you almost always end up reading material written by a midwit, intern, or junior associate. In that case AI-written content is much better, especially as I can interrogate the material and match top-1%ile quality.
Quality is its own property, separate from its creator. If a machine writing something bothers you irrespective of quality, then don't read it. You think I would care? I would not.
If this ever gets good enough to write your next bestseller or award winner, I might not even share it, and if I did, I wouldn't care whether some stranger read it or not, because it was created entirely for my pleasure.
Yeah I just focused on how well it was paced and didn't give any instructions on style or try a second pass to spot any inconsistencies.
That would be the next step but I'd previously never thought going any further might be worth it.
> Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it.
When was this time you speak of?
Using the AI in multiple phases is the approach that can handle this. Similarly to "Deep Research" approach - you can tell it to first generate a storyline with multiple twists and turns. Then ask the model to take this storyline and generate prompts for individual chapters. Then ask it to generate the individual chapters based on the prompts, etc.
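A minimal sketch of what that multi-phase flow could look like, assuming a placeholder generate(prompt) function standing in for whatever chat API you use (the prompts, chapter count and structure here are illustrative, not a tested recipe):

    # Phase 1: outline. Phase 2: per-chapter prompts. Phase 3: prose.
    # generate() is a stand-in for a call to your LLM provider of choice.

    def generate(prompt: str) -> str:
        raise NotImplementedError  # plug in your chat-completion call here

    def write_book(premise: str, num_chapters: int = 20) -> list[str]:
        # Phase 1: a high-level storyline with twists, no prose yet.
        outline = generate(
            f"Premise:\n{premise}\n\n"
            f"Write a {num_chapters}-chapter outline with major twists. "
            "One short paragraph per chapter, no prose."
        )
        # Phase 2: turn the outline into one self-contained prompt per chapter.
        chapter_prompts = generate(
            f"Outline:\n{outline}\n\n"
            "For each chapter, write a standalone writing prompt that includes "
            "the relevant characters, setting, and where the chapter must end. "
            "Separate the prompts with '---'."
        ).split("---")
        # Phase 3: generate each chapter from its own prompt, so pacing is
        # dictated by the outline rather than by the model's eagerness.
        return [
            generate(f"Write this chapter in full:\n{p.strip()}")
            for p in chapter_prompts
            if p.strip()
        ]

The point is that the pacing decisions get made once, at the outline stage, instead of being re-litigated on every page.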
Yup -- asking a chatbot to create a novel in one shot is very similar to asking a human to improvise a novel in one shot.
But a future chatbot would be able to internally project manage itself through that process, of first emitting an outline, then producing draft chapters, then going back and critiquing itself and finally rewriting the whole thing.
Yes, and that's why many people in the discussion here are very optimistic that chatbots will have solved this problem very soon. Either with the approach you suggest, or with something else (and perhaps more general, and less directly programmed in).
It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you tried breaking it down, which I have, you'd run into the same problem unless you held its hand for every single page, and then what's the point? I want to read the story, not co-author it.
I dunno, there's a certain amount of fun in "writing" a book with ChatGPT. Like playing a video game with a bunch of different endings instead of watching a movie with only one. Does the hero save the day? Or turn into a villain! You decide!
Doesn't novel literally mean something new? Can we really expect an LLM to produce a novel?
The etymology is pretty much irrelevant. In German, for example, the word for novel is 'Roman'. But German readers don't expect their novels to be any more romantic, nor do English readers expect their novels to be more novel.
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
Yes
I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.
I don't disagree. Like with almost anything else involving LLMs, getting hands-on produces better results. But in this instance, because I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.
Random question, if you don't care about being a creator yourself, why do you even want to read long form writing written by an LLM? There are literally 10000s of actual human written books out there all of them better than anything an LLM can write, why not read them?
> There are literally 10000s of actual human written books out there all of them better than anything an LLM can write, why not read them?
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
Authors tell stories they want to tell and readers read stories they want to read. The two don't necessarily overlap, or overlap strongly enough. If you're even a little bit specific (nowhere near as specific as the above prompt, even just something like the dynamic between protagonists) then you don't actually have 10,000s of actual human-written books. Not even close. Maybe it exists and maybe you'll find it good enough, but if it's only been read by a few hundred or thousand people? Good luck getting it recommended.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to check whether the machine is finally good enough and I can be a bit more specific.
Usually porn and fan fiction.
> There are literally 10000s of actual human written books out there
Tens-of-thousands is probably low by something in the neighborhood of four orders of magnitude.
It's very hard to find good books written by humans. GoodReads is okay, but you quickly run out of high-end recommendations. I read mostly sci-fi, and the books that everyone recommends rarely end up being 10/10. But then I see some random recommendation on Reddit or HN, and it ends up being amazing.
Human-generated slop is real.
You could ask your LLM for a recommendation.
That was what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try; at a glance it's convincing though), however, it was so slow that testing was kind of impossible. You cannot really have things done in parallel either, because it does need to know what it pushed out before, at least a summary of it.
[0] https://github.com/tluyben/bad-writer
My prompt is nowhere near yours.
Just for fun: Asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer, the rewrite is hilarious lol.
https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...
Give it time, this will be solved.
I envision that one day a framework will be created that can persist an LLM's current state to disk, so that "fragments of memories" can be paged in and out of context.
When that happens, the LLM will be able to remember everything.
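Most of that can already be approximated outside the model today: persist summaries ("memory fragments") to disk and page the most relevant ones back into the prompt. A toy sketch, where the file name and the word-overlap retrieval are invented purely for illustration (real systems would use embeddings):

    import json
    from pathlib import Path

    MEMORY_FILE = Path("memories.json")  # arbitrary location for this sketch

    def save_memory(fragment: str) -> None:
        """Append one "memory fragment" (e.g. a chapter or session summary) to disk."""
        memories = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
        memories.append(fragment)
        MEMORY_FILE.write_text(json.dumps(memories, indent=2))

    def recall(query: str, limit: int = 3) -> list[str]:
        """Page back in the fragments sharing the most words with the query."""
        if not MEMORY_FILE.exists():
            return []
        memories = json.loads(MEMORY_FILE.read_text())
        words = set(query.lower().split())
        return sorted(memories, key=lambda m: -len(words & set(m.lower().split())))[:limit]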
I had Grok summarize + evaluate the first chapter with thinking mode enabled. The output was actually pretty solid: https://pastebin.com/pLjHJF8E.
I wouldn't be surprised if someone figured out a solid mixture of models working as a writer (team of writers?) + editor(s) and managed to generate a full book from it.
Maybe some mixture of general outlining + maintaining a wiki with a basic writing and editing flow would be enough. I think you could probably find a way to maintain plot consistency, but I'm not so sure about maintaining writing style.
I have never used an LLM for fictional writing, but I have been writing large amounts of code with them for years. What I'd recommend is when you're defining your plan up front as to the sections of the content, simply state in which phase / chapter of the content they should meet.
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context setting (maybe a page or two).
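For example, something along these lines; the tag names and rules are placeholders, not a known-good template:

    # Illustrative only: a prompt split into XML-style sections instead of
    # one short paragraph. Tag names and rules are made up.
    prompt = """
    <role>You are a fantasy novelist writing a long, slow-burn epic.</role>

    <world>
      Setting, magic system, factions, tone.
    </world>

    <characters>
      Full profiles, including ones who must stay offstage for a long time.
    </characters>

    <pacing_rules>
      - Major reveals are spread evenly across the planned chapters.
      - Each chapter covers at most one in-world day.
    </pacing_rules>

    <task>Write chapter 1 only, ending on the agreed cliffhanger.</task>
    """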
Opening with "like a struck flint carried on a wind that wasn’t blowing." <chuckles>
I don't know why, but that is just such a literal thing to say that it seems almost random.
why would you ever want to write a novel with AI, that is human stuff right? :)
I'm terrible at writing, but I love reading. I've got ideas for novels, but I struggle to put them down.
What I have found that works is to give the LLM the "world" outline at the beginning and then just feed it one line summary of each chapter and get it to write a chapter at a time.
The problem is that the quality of results drastically decreases as the context length increases. After about 10 chapters the dialogue will start to get real snippy. I've tried getting it to summarize all the previous chapters and feed that back in, but it never includes enough detail.
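One thing that sometimes helps is to never feed the raw previous chapters back in at all, and instead carry a structured running summary that gets rewritten after every chapter. A rough sketch, again with a placeholder generate() standing in for the actual API call:

    # Rolling-summary loop: each call sees only the world outline, an evolving
    # structured summary, and the next chapter's one-line beat, so the context
    # stays small no matter how many chapters you generate.

    def generate(prompt: str) -> str:
        raise NotImplementedError  # plug in your chat-completion call here

    def write_chapters(world: str, beats: list[str]) -> list[str]:
        summary = "Nothing has happened yet."
        chapters = []
        for i, beat in enumerate(beats, start=1):
            chapter = generate(
                f"World:\n{world}\n\nStory so far (summary):\n{summary}\n\n"
                f"Write chapter {i}. It must cover: {beat}"
            )
            chapters.append(chapter)
            # Ask for a *structured* summary (character states, open threads,
            # locations) so important details are less likely to be dropped.
            summary = generate(
                f"Previous summary:\n{summary}\n\nNew chapter:\n{chapter}\n\n"
                "Update the summary. Keep character states, open plot threads, "
                "locations, and promises made to the reader. Terse but exhaustive."
            )
        return chapters

Whether the summariser keeps enough detail is exactly the failure mode you describe, so this is a mitigation rather than a fix.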
The only way to get better at something is to do it. Start writing short stories or small novels, and you will get there over time. You don't even have to be a great writer to write a great book as well :). It helps, but readers will forgive a lot along your journey.
Brandon Sanderson has a great series of lectures on how he approaches it that are awesome ->
https://www.youtube.com/playlist?list=PLSH_xM-KC3ZvzkfVo_Dls...
You will get so many mental benefits from writing, too. I promise it is worth it. AI is a great tool if you hit a block and need to brainstorm.
No, you are absolutely right. A lot of the things people think they can't do are literally just lack of practice.
My other problem is... lack of time :)
ack, I also have this problem :)
I am working on some world-building for something I want to write one day, but I am trying just to write little things to help. I write a lot of nonfiction stuff for work, but I am worried that it might not translate as well to characters...
This seems like something that planning would fix. I wonder if that's how it's doing it.
Like, if it decides to <think> a table of contents, or chapter summaries, rather than just diving in at page 1.
That is mind blowing. To this fantasy reader that’s pure magic.
Can you share it on a text sharing site? It seems you hit your share quota
Sure - https://pastebin.com/aiWuYcrF
19 pages?! Am I the only one who prefers an AI that jumps straight to the point?
- Buildup and happy background world-building
- Subtle foreshadowing
- Orcs attack
- Hero is saved by unlikely warrior of astounding beauty
- Evil is defeated until sales justify unnecessary sequel
That's the kind of story fit for the modern attention span...
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious the riddle is: There's three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
This looks like it was posted on Reddit 10 years ago:
https://www.reddit.com/r/math/comments/32m611/logic_question...
So it’s likely that it’s part of the training data by now.
You'd think so, but both Google's AI Overview and Bing's CoPilot output wrong answers.
Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."
Whoa. Math is not AI's strong suit...
Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."
Hats???
Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.
That's a non-sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.
Yes, putting an expensive LLM response atop each search query would be quite stupid.
You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.
Google placed its "AI overview" answer at the top of the page.
The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I might have phrased that poorly. With _L_ (or L as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (haven't gotten around to TFA yet). Not sure if this question is just about model size.
I'm eagerly awaiting an article about RAG caching strategies though!
The riddle has different variants with hats: https://erdos.sdslabs.co/problems/5
There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to raise one that has the right answer written on it.
- This kid is a genius! - you yell
- But wait, the kid just picked an answer up off the ground, it didn't actually come up with it...
- But the other toddlers could have done that too and didn't!
Other models aren't able to solve it so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65 and Gemini is still able to properly reason through the problem
I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Google Gemini 2.5 is able to search the web, so if you're able to find the answer on reddit, maybe it can too.
I think there's a big push to train LLMs on maths problems - I used to get spammed on Reddit with ads for data tagging and annotation jobs.
Recently those have stopped, and now the ads are about becoming a maths tutor for AI.
Doesn’t seem like a role with long-term prospects.
Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving (the "answer") it is online, and the model has surely seen it.
It gets it wrong when you give it 728. It claims (728, 182, 546). I won't share the answer so it won't appear in the next training set.
with 728 the puzzle doesn't work since it's divisible by 8
But then the AI should tell you that, too, if it really understand the problem?
Fair, the question is what possible solutions exist.
Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that LLMs can reason. They just regurgitate the input.
Have you given a reasoning model a novel problem and watched its chain of thought process?
I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.
I would say that 99.99% of humans do the same. Most people never come up with anything novel.
I would say maybe about 80%, certainly not 99.99%. But I've seen that in college: some would only be able to solve the problems which were pretty much the same as ones already seen. Notably, some guys could easily come up with solutions to complex problems they had not seen before. I am of the opinion that no human at age 20 can have the amount of input an LLM has today, and still humans at age 20 come up with very new ideas pretty often (new in the sense that they have not seen that or anything like it before). Of course there are more and less creative/intelligent people...
Reasoning != coming up with something novel.
And if it wasn’t, it is now
[flagged]
Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?
It's entirely unclear what are you trying to get across, at least to me.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates and why is frowned upon here. I don't think your comment does a great job of the latter.
>> So it’s likely that it’s part of the training data by now.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
Your intentions are good, but your execution is poor.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
> How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese.
I can't, and I imagine most of the people who downvoted you couldn't either.
I think asking people to go to Google Translate to parse a random comment that seems to be 90% LLM output by volume is a bit much.
I have translated the Chinese. I still have no idea what point you're trying to make. You ask it questions about some kind of band, and it answers. Are you saying the answers are wrong?
No clue. Perhaps people object to the untranslated Chinese?
> Is there a reason for the downvotes here?
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from with text you provided.
This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university education). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population in the Albanian language, since less than 0.1% bother to learn Albanian.
Even ignoring the fact that this or similar problem may have appeared in the training data, it's something a careful brute-force math logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement on the basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).
The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.
I agree with your general point that this "random person" is probably not representative of anything close to an average person off the street, but I think the phrasing "very very intelligent" and "exceedingly clever" is kinda misleading.
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't, has more to do with persistence and ability to maintain focus, rather than "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (eg. as part of pre-interview test prep), and in most cases, those who solve it and those who don't have the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because this frustration itself comes from a belief that the ability to solve these belongs some "exceedingly clever" people only, and not someone like them. So, this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
I solved it in less than 15 minutes while walking my dog, no pen or paper. But I wouldn't claim to be a random person without math skills. And my very first guess was correct.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
My dog solved it in less than 14 minutes, no pen or paper, and no fingers.
Seriously though, nice work.
So in the three hours between you reading the puzzle in the parent comment and posting this, you stopped what you were doing, managed to get some other "random" person to stop what they were doing, and got them to spend half an hour of their time on a maths puzzle that, at that point, prior experience suggested could take a day? All within three hours?
That's not to say that you didn't, or that you aren't recalling a previous encounter that happens to be this exact puzzle (despite there being scant prior references to it, which is precisely the reason for using it). But you can see how some might see that as not entirely credible.
Best guess: this random person is someone that really likes puzzles, is presumably good at them and is very, very far from being representative to the extent you would require to be in support of your argument.
Read: just a heavy flex about puzzle solving.
> This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).
I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".
Side note: couldn't even find pen and paper around in half an hour.
This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.
What makes this interesting is that while the question is online (on Reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.
Granted this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here in future training runs)
65×26×39=65910
>Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.
I don't think Gemini does an unnecessary amount of computation, it's just more verbose. This is typical of reasoning models, almost every step is necessary but many would not be written down by a human.
Why is this a great riddle? It sounds like incomplete nonsense to me:
It doesn't say anything about the skill levels of the participants, whether their answers are just guessing, or why they aren't just guessing the sum of the other two people each time they're asked to provide more information.
It doesn't say the guy saying 65 is even correct.
How could three statements of "no new information" give information to the first guy that didn't know the first time he was asked?
2 and 3 saying they don't know eliminates some uncertainties 1 had about their own number (any combination where the other two would see numbers that could tell them their own). After those possibilities were eliminated, the 1st person has narrowed it down enough to actually know based on the numbers shown above the other 2. The puzzle could instead have been done in order 2, 3, 1 and 1 would not have needed to go twice.
I guess really the only missing information is that they have the exact same information you do, plus the numbers above their friends' heads.
> The puzzle could instead have been done in order 2, 3, 1 and 1 would not have needed to go twice.
If this is true, then back in the original 1->2->3->1 form, shouldn't person #3 have been able to answer it?
Seems like we might need a section of internet that is off limits to robots.
Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.
There are websites where you can only register to in person and have two existing members vouch for you. Probably still can be gamed, but sounds like a great barrier to entry for robots (for now).
What prevents someone from getting access and then running an authenticated headless browser to scoop the data?
Admins will see unusual traffic from that account and then take action. Of course it will not be perfect as there could be a way to mimic human traffic and slowly scrape the data anyway, that's why there is element of trust (two existing members to vouch).
Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy. I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit; and at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).
That's entirely possible, especially if the site is small and not run by people with access to resources like physical security, legal etc.
It’s here and it’s called discord.
I have bad news for you if you think non paywalled / non phone# required discord communities are immune to AI scraping, especially as it costs less than hammering traditional websites as the push-on-change event is done for you in real time chat contexts.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
Honestly, someone should scrape the algebraic topology Discord for AI; it'll be a nice training set.
Or we could just accept that LLMs can only output what we have put in and calling them, "AI" was a misnomer from day one.
Why would you accept a lie?
I'm not sure what you mean but I'm trying to say our current LLMs are not artificially intelligent and calling them "AI" has confused a lot of the lay public.
You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others' numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number, they respond that they don't know. You ask the second person for their number, they respond that they don't know. You ask the third person, they respond that they don't know. You ask the first person again and they respond with the correct value. How did they know?
And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.
I love how the entire comment section is getting one-shotted by your math riddle instead of the original post topic.
In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. human population based off this, even setting aside lottery-ticket-hypothesis-like concerns.
It's a common logic puzzle, Google can't turn up an exact match to the wording you have, but ex. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/
Same here: My problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.
[1] https://en.wikipedia.org/wiki/100_prisoners_problem
Wow
Tried this in DeepSeek and Grok and it kept thinking in loops for a while, so I just turned it off.
I haven’t seen a question loop this long ever.
Very impressed
Deepseek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty, I suspect my electricity use from this could have been put to better use baking a frozen pizza.
Just tried it on Deepseek (not R1, maybe V3-0324) and got the correct answer after 7-8 pages of reasoning. Incredible!
You can also put the AI in the first person's shoes. Prompt: You are standing in a circle, there are 2 other people in the circle with you. Everyone in the circle has a positive integer above their head; no one knows what the number above their own head is, but they can see the numbers above the heads of the other people. You see that the person in front of you on the left has 26 above their head. The person on the right has 39 above their head. You are told that the sum of two of the numbers is the third number. You are asked what the number above your head is; the options are the sum, 65, or the difference, 13 (as 26 + 13 = 39). You don't know which one it is, and you say so. The second person is asked the number above their head. They also say they don't know; the third person also says they don't know. What is your number?
Gemini 2.5 and claude 3.7 thinking get it right, o3 mini and 4o get it wrong
I just asked it this twice and it gave me 65×65×130=549250. Both times. The first time I made it about ducks instead of people and mentioned that there was a thunderstorm. The second time I c/p your exact text and it gave me the same answer.
Again we find that the failure state of LLMs is a problem – yeah, when you know the answer already and it gets it right, that's impressive! When it fails, it still acts the same exact way and someone who doesn't already know the answer is now a lil stupider.
I also tried one-shot.
https://g.co/gemini/share/badd00a824d2
I use an algorithmic question that I'd been working on for years and that I'm finally writing up the answer to.
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
(The answer is surprisingly: Yes.)
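For concreteness, here is the problem restated as code (the tuple encoding of operations is just for the sketch). The obvious baseline simply runs the heap, which costs O(n log n) comparisons; the surprising part of the result is that the same leftover set can be determined with only O(n) comparisons, and that algorithm is not what's sketched here:

    import heapq

    def leftovers(ops):
        """ops is a sequence like ("push", 5) or ("pop",); returns what remains."""
        heap = []
        for op in ops:
            if op[0] == "push":
                heapq.heappush(heap, op[1])
            else:
                heapq.heappop(heap)  # delete-min
        return sorted(heap)

    print(leftovers([("push", 3), ("push", 1), ("pop",), ("push", 2), ("pop",)]))
    # -> [3]  (the 1 and then the 2 were the minima deleted)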
A Prolog program (swipl) takes less than a second to solve your puzzle.
N is the number of turns of "don't know" answers; the bad predicate means that the person can know their number at turn N.
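The swipl source isn't posted here, but the same brute-force, common-knowledge idea translates fairly directly. A rough Python sketch (the search bound and the reporting cutoff are arbitrary choices, picked so that edge effects at the bound can't disturb triples as small as the ones in the riddle):

    from collections import defaultdict

    LIMIT = 500       # search bound for each number
    REPORT_MAX = 80   # only report triples safely away from the bound

    # Ordered triples (P1, P2, P3) of positive integers where one number
    # is the sum of the other two.
    possible = set()
    for x in range(1, LIMIT + 1):
        for y in range(1, LIMIT + 1 - x):
            s = x + y
            possible.update({(s, x, y), (x, s, y), (x, y, s)})

    def filter_dont_know(possible, speaker):
        """Keep only triples in which `speaker` (0, 1 or 2) cannot deduce
        their own number from what they see plus current common knowledge."""
        by_view = defaultdict(set)
        for t in possible:
            view = tuple(v for i, v in enumerate(t) if i != speaker)
            by_view[view].add(t[speaker])
        return {t for t in possible
                if len(by_view[tuple(v for i, v in enumerate(t) if i != speaker)]) > 1}

    # P1, P2 and P3 each say "I don't know" in turn.
    for speaker in (0, 1, 2):
        possible = filter_dont_know(possible, speaker)

    # Now P1 *can* deduce their number, and announces 65.
    knows_now = possible - filter_dont_know(possible, 0)
    for t in sorted(knows_now):
        if t[0] == 65 and max(t) <= REPORT_MAX:
            print(t, "product =", t[0] * t[1] * t[2])

This should print (65, 26, 39) with product 65910, matching the Reddit answer.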
Interactive playground for the puzzle: https://claude.site/artifacts/832e77d7-5f46-477c-a411-bdad10...
(All state is stored in localStorage so you can come back to it :) ).
The riddle certainly nerd-sniped GPT 4.5
After a couple of minutes it decided on the answer being 65000 (S = {65, 40, 25}).
> I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
It's not an exaggeration it's a non-sequitur, you first have to show that the LLMs are reasoning in the same way humans do.
Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.
Could you explain "The sum of two of the numbers is equal to the third"??
I think:
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn’t know. Same goes for p2 and p3. Clearly p1 somehow gains information by p2 and p3 passing. Either he realizes that he must be either a or b, and such his number is the difference between p2 and p3’s numbers, or he realizes that he must be c and so his number is the sum of p2 and p3’s numbers.
That’s all I have so far. Anyone have other ideas?
The answer is online and it's clever.
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: (5 * 13). We can check the other numbers 2(13) = 26. 3(13) = 39. And technically, you don't need to find the other numbers. The final answer is 5A * 2A * 3A or (A^3) * 30.
"Which means we're one of these [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]"
Why? Couldn't it be an infinite number of 3 size arrays comprised of A where two elements sum to the third? [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
I think that answer was poorly phrased, because those possibilities are eliminated in a sense. There is a better answer further in the thread that explains "If the solution was not one of the flipped triplets, then the first player would not have worked out the solution." Thus if it was one of your other infinite triplets (e.g. 65, 12, 53) then in round 2 player 1 would still have answered 'I don't know'. Since they did respond with a definitive answer, it had to be one of the formula solutions, since those were the only solutions they could prove. And since the only formula whose first coefficient divides 65 is [5A, 2A, 3A], the numbers must be [65, 26, 39].
You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. Take the same question but with the person answering '52': that restricts you to either [4a, 3a, a] or [4a, a, 3a], i.e. the numbers 52, 39 and 13 either way. Since the question only asks for the product of all the terms, the answer is 52 * 39 * 13 = 26,364.
Look at it this way: Person 1 sees the numbers 26 and 39, and has to guess his own number. It must be one of only 2 possibilities: 13 or 65. All he has to do is eliminate one of those possibilities.
I think it has something to do with applying the lower bound of 1.
If p1 KNOWS that he’s the largest then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 was 1 then the other two would have known p1 couldn’t be the sum of the other two
But p2 and p3 don't yet know what they are themselves just because they see a 1:
If p2 sees 1 and 33, s/he would wonder if s/he is 32 or 34.
P3 would consider 31 or 33.
if the three numbers are a, b, and c, then either a+b=c, a+c=b, or b+c=a
And they must all be positive integers.
So A + B = C and A + C = B. But we know that A + B = C, so we can replace C with (A + B). So we know that A + A + B = B.
So 2A + B = B. Or 2A = 0.
And this holds any way you slice it.
Even if you were to try and brute force it.
A = 1
B = 2
Then C = 3. But A + C has to equal B. That's 1 + 3 = 2? That's not true.
I don't see a case where you can add to the sum of two numbers one of the numbers and get the other number.
I'm guessing that's a misreading of the problem. Because it looks like the third number is the sum of the first two.
One of the cases has to be true, not all 3. (as you show, they're mutually exclusive for positive integers) i.e. "either" is important in the parent comment.
Which is why I indicated that it would be a misreading of the problem.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
> The original problem is a little ambiguously worded.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
I have three items.
Buying two of the items gets you the third for free.
The implication is any two.
It’s ok that it’s ambiguous. It happens. In most cases, we clarify and move on. There’s no need to defend it.
Why look for ambiguity that isn't there?
There's a certain mind that either doesn't realize they're sidestepping the problem and turning it into a editing review, or realizes it, and doesn't understand why it seems off-topic/trivial to others.
What's especially strange here is that they repeatedly demonstrate that if you interpret it that way, the problem is obviously, trivially unsolvable, in a way that a beginner in algebra could intuit (roughly 12 years old; at least, we started touching algebra in 7th grade).
I really don't get it.
When I've seen this sort of thing play out this way, the talking-down is usually for the benefit of demonstrating something to an observer (i.e. I am smart look at this thing I figured out; I can hold my own when the haters chirp; look they say $INTERLOCUTOR is a thinker but they can't even understand me!), but ~0 of that would apply here, at least traditionally.
One often doesn't look for ambiguity. It is there. It is fine.
Given #s x,y, and z, either x + y = z, x + z = y, or y + z = x.
Fun puzzle! I’m curious how you managed to structure the problem such that a computer could solve it but it took 3 days of computation?
Is the answer somehow {65, 20, 45} with the product 58,500?
That’s one-shot for o1 pro.
That's wrong. From player 1's perspective, {25, 20, 45} is entirely consistent with the calls made in the first three rounds.
Please can you enlighten me? I'm a mathematical plebeian.
ChatGPT 4.5:
Let’s carefully analyze this puzzle step-by-step:
⸻
Step 1: Define the Problem Clearly
We have three people in a circle. Let’s call them: • Person A (first person) • Person B (second person) • Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy: A + B = C, or B + C = A, or C + A = B.
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions: The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again: • Exactly one of the following three equations is true: • A + B = C, or • B + C = A, or • C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions: • Person A initially doesn’t know their number. This means, from Person A’s perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers. • Similarly, Person B doesn’t know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions. • Person C also doesn’t know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round: • Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let’s test each carefully given A = 65: 1. Case 1: A + B = C Then we have: • 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. Person A sees numbers B and 65+B. If Person A sees these two numbers, they might initially consider their number could be either the sum or difference of these two. • However, if Person A sees two numbers, and one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So this scenario of B and 65+B is okay only if these two numbers don’t exhibit an immediate trivial scenario. • Let’s hold this scenario and continue examining others first. 2. Case 2: B + C = A (So, B + C = 65) • Then Person A sees numbers B and C, adding exactly to 65. • Initially, Person A sees two numbers summing exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. Ambiguity here initially makes sense. • Initially, Person B and Person C also see ambiguous scenarios. They both don’t know their numbers either at first round. This scenario is plausible. 3. Case 3: C + A = B (So, C + 65 = B) • Then Person A sees numbers B = C + 65 and C. • Person A sees C and C+65. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here’s why clearly stated: • Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A’s number could either be 65 (sum) or the difference between the two numbers. • Person B also initially sees two numbers and cannot distinguish immediately their own number, nor can Person C. • After everyone states “I don’t know,” Person A gains critical insight: if the difference between the two numbers (that A sees) allowed Person B or Person C to know their number immediately, they’d have known right away. The fact that B and C didn’t know means the difference is not immediately obvious to B or C. Thus, the difference cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must carefully test quickly to find a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive through multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar “golden” sets. Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26 • Check quickly: indeed, 39 + 26 = 65.
Checking logic quickly: • Person A sees 39 and 26. Initially ambiguous (number could be 65 or 13). • Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous. • Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don’t know: • Now, Person A considers: If my number was 13, then Person B would see numbers 13 and 26. Person B would then immediately know their number is 39 (13+26=39), no ambiguity. But Person B didn’t know! So Person A’s number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified clearly are: • A = 65 • B = 39 • C = 26
Then, the product is: 65 × 39 × 26 = 65 × 1014 = 65,910
⸻
Final Answer:
65,910
This was not logic, this was recall from the training set disguised as logic:
Let’s quickly check the following well-known ambiguous triple: • Numbers: 65, 39, 26
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.
Plus it drew me a very decent pelican riding a bicycle.
Notes here: https://simonwillison.net/2025/Mar/25/gemini/
Have you considered that they must be training on images of pelicans riding bicycles at this point ;-). At least given how often that comes up in your reviews, a smart LLM engineer might put their fingers on the scales a bit and optimize for those things that come up in reviews of their work a lot.
Claude's pelican is way better than Gemini's
I'm not so sure. I've run it a bunch of times. It makes a great pelican.
Personally I'm convinced this model is the best out there right now.
https://www.reddit.com/r/Bard/comments/1jjobaz/pelican_on_a_...
I think a competent 5yro could make a better pelican on a bicycle than that. Which to me feels like the hallmark of AI.
I mean, hell, I have drawings from when I was eight of leaves and they are botanically-accurate enough to still be used for plant identification, which itself is a very difficult task that people study decades for. I don't see why this is interesting or noteworthy, call me a neo-luddite if you must.
The complexity is that it's not a drawing: it's SVG. So it's code that must, in the end, display a pelican, which is one step further.
I've been following your blog for a while now, great stuff!
I just tried your trademark benchmark on the new 4o Image Output, though it's not the same test:
https://imgur.com/a/xuPn8Yq
And the same thing with gemini 2.0 flash native image output.
https://imgur.com/a/V4YAkX5
It's sort of irrelevant though as the test is about SVGs.
Was that an actual SVG?
No that's GPT-4o native image output.
I wonder how far away we are from models which, given this prompt, generate that image in the first step in their chain-of-thought and then use it as a reference to generate SVG code.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
Someone managed to get ChatGPT to render the image using GPT-4o, then save that image to a Code Interpreter container and run Python code with OpenCV to trace the edges and produce an SVG: https://bsky.app/profile/btucker.net/post/3lla7extk5c2u
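I don't have their actual script, but the tracing step was presumably something in this spirit: edge-detect the raster image with OpenCV, then dump each contour as an SVG polyline (a guess at the approach, not the code that was run):

    import cv2

    img = cv2.imread("pelican.png", cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 100, 200)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    h, w = img.shape
    paths = []
    for c in contours:
        pts = " ".join(f"{x},{y}" for [[x, y]] in c)
        paths.append(f'<polyline points="{pts}" fill="none" stroke="black"/>')

    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">'
           + "".join(paths) + "</svg>")
    with open("pelican.svg", "w") as f:
        f.write(svg)

Of course that pipeline is tracing a raster, not writing SVG from scratch.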
Does this match the rules of your test, or is it cheating? :)
Tops our benchmark in an unprecedented way.
https://help.kagi.com/kagi/ai/llm-benchmark.html
High quality, to the point. Bit on the slow side. Indeed a very strong model.
Google is back in the game big time.
It should be in the "reasoning" category, right? (still topping the charts there)
Remarkable how few tokens it needed to get a much better score than other reasoning models. Any chance of contamination?
It makes me wonder how the token counting was implemented and if it missed the (not sent in API) reasoning.
Valid concern; most likely thinking tokens were not counted due to API reporting changes.
That is some wide gap!
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
[0] https://aider.chat/docs/leaderboards/
Am I correct in assuming that accuracy < using correct edit format? i.e. it made mistakes in 27% of the problems, 11% of which were due to (at least) messing up the diff format?
In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
Ok, thanks for clearing that up.
The only benchmark I care about. Thanks!
These announcements have started to look like a template.
- Our state-of-the-art model.
- Benchmarks comparing to X,Y,Z.
- "Better" reasoning.
It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.
Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have been just incremental updates over the previous gen and only marginally better. Couple that with the Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
I think the greatest issue with buying a new phone today is ironically the seamless migration.
Once you get all your apps, wallpaper, shortcut order and the same OS, you very quickly get the feeling you spent $1000 for the exact same thing.
100% agree with you.
But it needs to be seamless to remove any friction from the purchase; at the same time, if it feels the same, then we feel like we wasted money.
So what I usually do is buy a different colored phone and change the wallpaper.
My MacBook was the same. Seamless transition and 2 hours later I was used to the new m4 speeds.
Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones a year even in its heyday.
These LLM models were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to be taking less time.
If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.
For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.
So, I think these results are very interesting, if you know what features specifically you are using.
But they score it on their own benchmark, on which, coincidentally, Gemini models have always been the only good ones. In NoLiMa or BABILong we see that Gemini models still can't do long context.
Excited to see if it works this time.
> It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.
This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.
They are all interchangeable. This is great for users as it adds to price pressure.
Man, I hope those benchmarks actually measure something.
I would say they are a fairly good measure of how well the model has integrated information from pretraining.
They are not so good at measuring reasoning, out-of-domain performance, or creativity.
Sooner or later someone is going to find "secret sauce" that provides a step-up in capability, and it will be closely guarded by whoever finds it.
As big players look to start monetizing, they are going to desperately be searching for moats.
Reasoning was supposed to be that for "Open" AI, that's why they go to such lengths to hide the reasoning output. Look how that turned out.
Right now, in my opinion, OpenAI actually has a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
If you've seen DeepSeek R1's <think> output, you'll understand why OpenAI hides their own. It can be pretty "unsafe" relative to their squeaky-clean public image.
They don’t hide reasoning output anymore?
I was looking at this the other day. I'm pretty sure OpenAI runs the internal reasoning through a model that purges the reasoning and makes it worse to train other models from.
I might be mistaken, but originally the reasoning was fully hidden? Or maybe it was just far more aggressively purged. I agree that today the reasoning output seems higher quality than originally.
Sooner or later someone is going to find the "secret sauce" that allows building a stepladder tall enough to reach the moon.
It's called the "first step fallacy", and AI hype believers continue to fall for it.
Why not snooze the news for a year and see what’s been invented when you get back. That’ll blow your mind properly. Because each of these incremental announcements contributes to a mind blowing rate of improvement.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
I take this as a good thing, because they're beating each other every few weeks and using benchmarks as evidence.
If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.
The improvements have been marginal at best. I wouldn't call that beating.
Maybe they just asked Gemini 2.5 to write the announcement.
And it was trained on the previous announcements.
... which were also written by earlier Gemini versions.
LLMs all the way down
Not all the way. At the bottom are a bunch of unpaid writers and artists and a horde of low-paid mturk workers in Nigeria.
I love this comment. It made me laugh.
Serious question: Has anyone tested how much money you can actually make doing a month of Amazon Mechanical Turk? (It would make for an interesting YouTube video!) I am curious if it is middle class wages in very poor countries (like Nigeria). Some light Googling tells me that middle class salary in Nigeria is about 6K USD, so about 3 USD/hour (assuming: 50 weeks/year * 40 hours/week = 2000 hours/year). Is this possible with MTurk?

That's ok. AI will kill those off soon enough, and like all winners, rewrite history enough so that that inconvenient theft never happened anyway. It's manifest destiny, or something.
which was written by ChatGPT3.5
I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.
> can you share a concrete example?
Upload a complicated PDF of a presentation and ask for insights that require some critical thinking about them.
> Do you think these "moved entire workflows" offset a full worker, or X% of a full worker
It can replace many junior analysts IMO.
We have incrementally improved, 1% better than we were yesterday. Our competition is 1 day behind us now.
Like! No trolling: This could be a sarcastic comment written by an LLM!
Well hey, OpenAI did the exact opposite, and nobody liked that either.
I think people were fine with OpenAI demos. They were less fine with not actually ever releasing the demoed tech.
To clarify, by "doing the opposite" I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that.
I liked their announcements and demos and continue to like them.
How did you measure “nobody” liked OpenAI announcements?
Glaringly missing from the announcements: concrete use cases and products.
The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.
I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.
https://www.osmos.io/fabric
Practical, real-world application.
ChatGPT has like 500M weekly active users, what are you on about?
"Well, Ed, there are 300 million weekly users of ChatGPT. That surely proves that this is a very real industry!" https://www.wheresyoured.at/longcon/
Is that article trying to argue that 500M people every week are visiting ChatGPT for the first (or second) time after reading about it in the news?
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.
Was going to comment the same thing, which has been bugging me lately on all announcements that start with "our" followed by empty superlatives. Happy to not be alone on this!
I’m sure the AI helps write the announcements.
Cancelled my account a long time ago. Gemini models are like a McDonald's croissant. You always give them an extra chance, but they always fall apart in your hands...
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:
The full extract of the terms of usage:

Google is the best of these. You either pay per token and there is no training on your inputs, or it's free/a small monthly fee and there is training.
And even worse:
Emphasis on "retained for up to three years" even if you delete it!!

Well, they can't delete a user's Gemini conversations because they don't know which user a particular conversation comes from.
This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.
How does it compare to OpenAI and anthropic’s user data retention policy?
If I'm not wrong, ChatGPT states clearly that they don't use user data anymore by default.
Also, maybe some services are doing "machine learning" training with user data, but it is the first time I've seen a recent LLM service say that your data can be fed to human reviewers at will.
They seem to use it as long as the chat history is enabled, similar to Gemini. https://help.openai.com/en/articles/7792795-how-do-i-turn-of...
I believe this is out of date. There’s a very explicit opt in/out slider for permitting training on conversations that doesn’t seem to affect conversation history retention.
I don't think this is the same as the AI studio and API terms. This looks like your consumer facing Gemini T&C's.
You can use a paid tier to avoid such issues. Not sure what you're expecting from those "experimental" models, which are in development and need user feedback.
I'm assuming this is true of all experimental models? That's not true with their models if you're on a paid tier though, correct?
All the more reason for new privacy guidelines, especially for Big Tech and AI.
I mean this is pretty standard for online llms. What is Gemini doing here that openai or Anthropic aren’t already doing?
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
360k tokens = approximately how many lines of code? And also, if it's an open source lib, are you sure there are no mentions of this bug anywhere on the web?
Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.
Impressive. I tend to think it managed to find the bug by itself which is pretty crazy without being able to debug anything. Then again I haven't seen the bug description, perhaps the description makes it super obvious where the problem lies.
How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!
Yep! AI studio I think is the only way you can actually use it right now and AFAIK it's free.
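For anyone wanting to try the same kind of whole-codebase bug hunt over the API instead of AI Studio, here's a rough sketch. It assumes the google-generativeai Python SDK and the gemini-2.5-pro-exp-03-25 model id mentioned elsewhere in the thread; the pack_repo helper, the paths, and the prompt wording are just made up for illustration.

    import os
    import pathlib
    import google.generativeai as genai  # assumes the google-generativeai package

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

    def pack_repo(root: str, exts=(".dart",)) -> str:
        # Concatenate every matching source file into one big prompt section.
        parts = []
        for path in sorted(pathlib.Path(root).rglob("*")):
            if path.suffix in exts:
                parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
        return "\n\n".join(parts)

    prompt = (
        "Here is an entire Dart library, followed by a bug report.\n\n"
        + pack_repo("path/to/library")
        + "\n\nBug report: <describe the observed behaviour>.\n"
        "Identify the root cause and the exact file and line to change."
    )
    print(model.generate_content(prompt).text)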
Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.
Open the pod bay doors Hal.
I'm sorry Dave, I'm afraid I can't do that.
Wow holy smokes that is exciting
How long did it take to sift through those?
> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.
Been playing around with it and it feels intelligent and up to date. Plus it's connected to the internet, and it reasons by default when it needs to.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
It is almost certainly the "nebula" model on LMArena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.
I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?
I've always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time; they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it's spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.
I think it's reasonable. The development process is just not really comparable to other software engineering: It's fairly clear that currently nobody really has a good grasp on what a model will be while they are being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.
I figured you don't update the major version unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained GPT-3 vs GPT-4, other than the amount of data. But maybe I'm wrong.
The number is purely for marketing.
If you could get much better performance without changing the algorithm (eg just by scaling), you'd still bump the number.
Funnily enough, from early indications (user feedback) this new model would've been worthy of the 3.0 moniker, despite what the benchmarks say.
I think it's because of the big jump in coding benchmarks. 74% on aider is just much, much better than before and worthy of a .5 upgrade.
At least for OpenAI, a .5 increment indicates a 10x increase in training compute. This so far seems to track for 3.5, 4, 4.5.
It may indicate a Tick-Tock [1] process.
[1] https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model
The elo jump and big benchmark gains could be justification
Agreed, can't everyone just use semantic versioning, with 0.1 increments for regular updates?
Regarding semantic versioning: what would constitute a breaking change?
I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.
As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.
Or drop the pretext of version numbers entirely since they're meaningless here and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology
Would be confusing for non-tech people once you did x.9 -> x.10
What would a major version bump look like for an llm?
Going from English to Chinese, I guess? Because that would not be a compatible version for most previous users.
Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Breakthrough moment.
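If anyone wants to poke at the same failure mode, the randomisation workaround mentioned above is easy to script: ask the same question over a few shuffled orderings of the corpus and see whether the answers stay stable. A minimal sketch, where the poems list is a placeholder and `model` is whatever Gemini client you already have:

    import random

    poems = ["...", "..."]  # load your ~1000 poems here, in chronological order

    def build_prompt(question: str, seed: int) -> str:
        # Shuffle context position while keeping each poem's chronological number,
        # so the question stays answerable regardless of ordering.
        rng = random.Random(seed)
        order = list(range(len(poems)))
        rng.shuffle(order)
        body = "\n\n".join(f"[Poem {i + 1}]\n{poems[i]}" for i in order)
        return f"{body}\n\nQuestion: {question}"

    # Models with positional blind spots ("lost in the middle") give noticeably
    # different analyses for different orderings; a model that truly reads the
    # whole context should not care.
    for seed in range(3):
        prompt = build_prompt("Identify key writing periods and their differences.", seed)
        # answer = model.generate_content(prompt).text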
Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.
Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and at recalling older tokens.
Wow, was able to nail the pelican riding on a bicycle test:
https://www.svgviewer.dev/s/FImn7kAo
That's actually too good to believe. I have a feeling simonw's favorite test has been special-cased...
It seems pretty good at it. The hair on the boy is messed up, but still decent.
"A boy eating a sandwhich"
https://www.svgviewer.dev/s/VhcGxnIR
"A multimeter"
https://www.svgviewer.dev/s/N5Dzrmyt
I doubt it is explicitly special cased, but now that it's all over twitter etc. it will have ended up many times in the training data.
They could've RLed on SVGs - wouldn't be hard to render them, test adherence through Gemini or CLIP, and reward fittingly
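Nobody outside the lab knows whether they did anything like that, but the reward half is indeed cheap to build. A rough sketch, assuming cairosvg for rasterising and the Hugging Face CLIP checkpoint; this is a guess at a pipeline, not anything Google has described:

    import io
    import cairosvg
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def svg_reward(svg_text: str, prompt: str) -> float:
        # Render the SVG, then score image/text agreement with CLIP.
        try:
            png = cairosvg.svg2png(bytestring=svg_text.encode())
        except Exception:
            return 0.0  # malformed SVG gets no reward
        image = Image.open(io.BytesIO(png)).convert("RGB")
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            score = clip(**inputs).logits_per_image.item()  # higher = better agreement
        return score

    # e.g. svg_reward(model_output, "a pelican riding a bicycle")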
What does nail mean? That's not a bicycle.
To be honest, it's in good company with real humans there: https://www.behance.net/gallery/35437979/Velocipedia
Maybe it learned from Gianluca's gallery!
I'm most impressed by the improvement on Aider Polyglot; I wasn't expecting it to get saturated so quickly.
I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
From https://x.com/OfficialLoganK/status/1904583353954882046
The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.
Any word on what that pricing is? I can't seem to find it
Traditionally at Google experimental models are 100% free to use on https://aistudio.google.com (this is also where you can see the pricing) with a quite generous rate limit.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
Right but the tweet I was responding to says: "This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!"
I assumed that meant there was a paid version with a higher rate limit coming out today
The parent Twitter post mentions:
And the pricing page [1] still does not show 2.5 yet.

[1]: https://ai.google.dev/gemini-api/docs/pricing
I expect this might be pricier. Hoping not unusable level expensive.
Currently free, but only 50 requests/day.
Any idea what is RPM for this model?
https://aistudio.google.com/prompts/new_chat says 2 RPM for free, but also lists 5, which might be the RPM once they start charging.
Scores 54.1 on the Extended NYT Connections Benchmark, a large improvement over Gemini 2.0 Flash Thinking Experimental 01-21 (23.1).
1 o1-pro (medium reasoning) 82.3
2 o1 (medium reasoning) 70.8
3 o3-mini-high 61.4
4 Gemini 2.5 Pro Exp 03-25 54.1
5 o3-mini (medium reasoning) 53.6
6 DeepSeek R1 38.6
7 GPT-4.5 Preview 34.2
8 Claude 3.7 Sonnet Thinking 16K 33.6
9 Qwen QwQ-32B 16K 31.4
10 o1-mini 27.0
https://github.com/lechmazur/nyt-connections/
Since the 2.0 line, the Gemini models have been far better than GPT and Claude at engineering-type questions (fluids etc.), especially with questions that have images requiring more than just grabbing text. This is even better.
The Long Context benchmark numbers seem super impressive. 91% vs 49% for GPT 4.5 at 128k context length.
Google has the upper hand here because they are not dependent on Nvidia for hardware. They make and use their own AI accelerators.
Keen to hear more about this benchmark. Is it representative of chat-to-document style usecases with big docs?
Looks like it's this benchmark [1]. It's certainly less artificial than most long context benchmarks (that are basically just a big lookup table) but probably not as representative as Fiction.LiveBench [2], which asks specific questions about works of fanfiction (which are typically excluded from training sets because they are basically porn).
[1] https://arxiv.org/pdf/2409.12640
[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...
Update: Gemini 2.5 also crushes fiction.livebench
"MRCR (multi-round coreference resolution)" for those looking for the link to Michaelangelo
Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive) but querying it for anything early 2025 or mid/late 2024 and it self-reports that it's cutoff is in 2023 (which can't be right).
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024), bun workspaces (from around that time?) but does know how to install bun on windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
Maybe less has been written about these newer things, even if they had technically been released?
This model is a fucking beast. I am so excited about the opportunities this presents.
It can answer my favourite riddle for LLMs:
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
https://a.qoid.us/20250325/
[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
Update: I tried a different version of the prompt and it's doing really well! Well, so far it's gotten out of its house and into Professor Oak's lab, which is not so impressive compared to ClaudePlaysPokemon, but it's a lot more than Gemini 2.0 was able to do with the same prompt.
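For anyone curious what these PlaysPokemon harnesses boil down to, the core loop is roughly: screenshot in, one button out. A stripped-down sketch, where the emulator object and its screenshot()/press() methods are hypothetical stand-ins for PyBoy or whatever you drive the game with, and the Gemini client is the standard google-generativeai SDK:

    import os
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
    BUTTONS = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START"}

    def next_button(screenshot: Image.Image, history: list[str]) -> str:
        # Ask the model for the next button press given the current screen.
        prompt = (
            "You are playing Pokemon Red. Recent actions: " + ", ".join(history[-20:]) +
            ". Reply with exactly one button: UP, DOWN, LEFT, RIGHT, A, B or START."
        )
        reply = model.generate_content([screenshot, prompt]).text.strip().upper()
        return reply if reply in BUTTONS else "A"  # fall back on garbage output

    def play(emulator, steps: int = 1000) -> None:
        history: list[str] = []
        for _ in range(steps):  # each step is one API request, hence the rate-limit pain
            button = next_button(emulator.screenshot(), history)
            emulator.press(button)
            history.append(button)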
Why do I get the feeling that nobody is as excited about Google's models as about other companies'?
Yeah, I get a little bummed, but I guess a lot of HNers have reasons to not like Google. I've had a Google One membership forever, so I opted for the higher subscription with Gemini access since the beginning (plus a free year with a new Pixel phone), and I think it is awesome.
I feel like Google intentionally don't want people to be as excited. This is a very good model. Definitely the best available model today.
Most of us care only about coding performance, and Sonnet 3.5 has been such a giant winner that we don't get too excited about the latest model from Google.
Because most of the LLM hype is still generated by people who don't use them in production, and those people don't use GCP
For me personally - rate limit of 50/day means that I can't use it as daily driver so I'll have to go back to Sonnet which will gladly accept my money for more. Then I just forget it exists.
Yeah, if I don’t have higher rate limits, it’s useless. This just sounds like a gimmick launch where they want to gather feedback. It will be a couple of months before this will be GA.
Google is worse at marketing and hyping people up.
The internal incentives must not align with new things making money.
They're not good models. They overfit to the LMArena leaderboard, but perform worse in real-life scenarios compared to their competitors.
The exceptions are auto regressive image generation and audio models.
Because it’s more likely to be sunsetted.
https://killedbygoogle.com/
How does Gemini have such a big context window?
I thought memory requirement grows exponentially with context size?
TPUs have a network topology better suited for long context than gpus: https://jax-ml.github.io/scaling-book/tpus/#tpu-networking
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
Memory grows linearly, compute grows quadratically (but with a small constant: until ~100k, inference will still be dominated by non-quadratic factors).
Also, reusing key/values for different queries can compress the KV cache; it can be a 1000x or 10000x improvement in bandwidth if the model is trained for it.
Just to clarify: simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
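To put the "memory grows linearly" point a couple of comments up in numbers, here's the usual back-of-the-envelope KV-cache calculation. The layer/head dimensions below are invented for illustration (nobody knows Gemini's real config); the point is just that doubling the context doubles the memory, nothing explodes:

    # per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes per value
    n_layers, n_kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2  # made-up dims, bf16

    def kv_cache_bytes(context_tokens: int) -> int:
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
        return per_token * context_tokens

    for ctx in (128_000, 1_000_000):
        print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
    # 128k tokens -> 39.1 GiB; 1M tokens -> 305.2 GiB (linear, not exponential)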
Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts." After the 4th try, I said "context, spelled "c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."
This stuff has a long way to go.
I think google is digging a hole for themselves by making their lightweight models be the most used model. Regardless of what their heavy weight models can do, people will naturally associate them with their search model or assistant model.
That might be considered fine if Google's larger goal is to make money from enterprises/Workspace integration, using consumer launches as splashy PR.
This way they get two rounds of headlines. "Gemini 2.5 released" and later on "Gemini 2.5 coming to all Google accounts."
Their willingness to integrate depends on their perception of the model quality.
I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text based model and couldn't read text out loud.
For as amazing as these things are, AGI they are not.
In its defense: it probably is just a text model that hasn't been told that its output is being read to the user.
The Gemini 1.5 tech report does reference some papers about supporting a large context window.
https://supaiku.com/attention-is-logarithmic
I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning, as well as structured output. Recommend trying it out at aistudio dot google dot com.
This looks like the first model where Google seriously comes back into the frontier competition? 2.0 Flash was nice for the price, but it's more focused on efficiency, not performance.
A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
With a rate limit of 50 requests per day
Could use multiple Google accounts to increase the rate limit.
This is why we can't have nice things
On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
Have you tried o1-pro?
I wish they’d mention pricing - it’s hard to seriously benchmark models when you have no idea what putting it in production would actually cost.
It's experimental. You shouldn't be using it in production.
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces! From https://x.com/OfficialLoganK/status/1904583353954882046
It's a lot better at my standard benchmark "Magic: The Gathering" rules puzzles. Gets the answers right (both the outcome and rationale).
Ooof, it failed my "Wheel of Potential" bug finding question, and got aggressive about asserting it was correct.
This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard
It nailed my two hard reasoning+linguistic+math questions in one shot, both the kinds of things that LLM struggle but humans do well.
(DM me for the questions)
Isn't every new AI model the "most <adjective>"?
Nobody is going to say "Announcing Foobar 7.1 - not our best!"
GPT-4.5's announcement was the equivalent of that.
"It beats all the benchmarks...but you really really don't want to use it."
They even priced it so people would avoid using it. GPT-4.5's entire function was to be the anchor of keeping OpenAI in the news, to keep up the perception of releasing quickly.
My assumption was that the pricing was because it really was that expensive for whatever reason. I'm keeping fingers crossed that they're going to do some kind of 4.5 mini at some point that will be more affordable.
You're not wrong, but that just means the <adjective> is where the bulk of information resides. The trade-off matters. Maybe it's a model with good enough quality but really cheap to serve. Maybe it's a model that only plays poker really well but sucks at everything else because it bluffs too much. Etc. etc.
Sure but that adjective matters. Could be cheapest, "intelligent", fastest, etc... it's rarely all three of them.
Except for GPT 4.5 and Claude 3.7 :/
Same with new phones. The new phone is always the fastest cpu, gpu, and best camera ever!
gobble 2.0 - a bit of a turkey
Stable Diffusion definitely had a few of those.
With recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google clearly is in the lead in terms of data access in my view. If I am a top talent in AI, I’d go where I can work with the best data no?
I think an argument could be made for hardware too. Perhaps in absolute terms Nvidia is ahead, but in terms of knowing how to get the most out of the hardware, Google making its own chips, building on their networking, etc, is a pretty big advantage.
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
The problem Goog has is its insane bureaucracy and lack of vision from Sundar, which isn't very attractive from an employee position. If you're working close to Demis I imagine the situation is better though.
Now that Noam is back I'm a little bit more optimistic.
UX is actually increasingly the bottleneck. Most of the top models are very good if you micromanage their context and prompts. But people aren't very good at that stuff.
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this gazillions of times per day. Mostly simple stuff with really short prompts: "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a GitHub connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor, just look at those. I don't want to have to manually open files (do that for me) or specify which files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's a great UX. Don't make me do work.
That's a good UX.
I use Gemini mainly because it's integrated into Google's tools. So it's kind of there. And ChatGPT, for whatever reason, cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context, which is the thing you are looking at that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
I agree with you about ChatGPT. It's actually a compelling product, especially their Pro tier at $200, which is essentially unlimited.
While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are quite a bit more polished currently. For example, Gemini's speech recognition struggles with longer pauses and often enough cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying "[...] plex, emby and Jellyfin [...]" is usually understood by Whisper, but less often by Gemini). The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy-friendly pair programming). Last point off the top of my head would be the ability to edit messages beyond just the last one. That's possible in ChatGPT, but not in Gemini. Googlers are spending so much money on model training; I would appreciate spending some on making it fun to use :)
One test I always do is ask for an absolutely minimal language interpreter with TCO.
This is part of the code output (after several interactions of it not returning actual code):
I'm not very impressed. I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows up with even worse code. It does not even compile!
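For reference, the shape that test is looking for: a tail call handled by rebinding the loop state instead of recursing, so the stack never grows. A minimal Python sketch of a tiny Lisp-ish interpreter (the C version should have the same structure; this is my own illustration, not the model's output):

    def evaluate(expr, env):
        while True:                            # the trampoline: tail calls never recurse
            if isinstance(expr, str):          # variable lookup
                return env[expr]
            if not isinstance(expr, list):     # literal
                return expr
            op, *args = expr
            if op == "if":
                cond, then, other = args
                expr = then if evaluate(cond, env) else other
                continue                       # tail position: loop, don't recurse
            if op == "lambda":
                params, body = args
                return ("closure", params, body, env)
            fn = evaluate(op, env)             # function application
            vals = [evaluate(a, env) for a in args]
            if callable(fn):                   # builtin
                return fn(*vals)
            _, params, body, closure_env = fn  # user function: becomes new loop state
            env = {**closure_env, **dict(zip(params, vals))}
            expr = body                        # the "optimised" tail call

    # A countdown of a million steps never touches Python's recursion limit:
    env = {"<=": lambda a, b: a <= b, "-": lambda a, b: a - b}
    env["rec"] = ("closure", ["n"], ["if", ["<=", "n", 0], 0, ["rec", ["-", "n", 1]]], env)
    print(evaluate(["rec", 1_000_000], env))   # -> 0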
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7 which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3 which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers" but the only mention of o3-mini having a score of 86.5 I could find was from this other source [2]
[1] https://openai.com/index/openai-o3-mini/ [2] https://www.vals.ai/benchmarks/aime-2025-03-24
You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.
It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.
I noticed this too. I have used both o1 and o3-mini extensively, and I have run many tests on my own problems; o1 solves one of my hardest prompts quite reliably but o3-mini is very inconsistent. So from my anecdotal experience o1 is a superior model in terms of capability.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
I find o3 at least faster to get to the response I care about, anecdotally.
Why would you compare against all the models from a competitor? You take their latest one that you can test. OpenAI or Anthropic don't compare against the whole Gemini family.
Probably because it is more similar to o3 in terms of size/parameters as well as price (although I would expect this to be at least half the price).
Gemini refuses to answer any questions on proportional swing models or anything related to psephology on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat are that neutered.
I assume Gemini would be less neutered in this regard, if it wasn't developed by Google.
The incumbent has awoken.
Here's a Gemini 2.5 provided summary of this Hacker News thread as of the moment when it had 269 comments: https://gist.github.com/simonw/3efa62d917370c5038b7acc24b7c7...
I ran this command to create it:
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes

I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.
But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
But then where are the productivity increases?
It's a complex proposition. I think Satya was talking about actual GDP growth, right? In theory, let's say all knowledge work is now 50% faster due to AI. Well, then I would assume this should affect civil society as well: planning a bridge, a railway etc. should happen faster and more efficiently (the actual building of things won't, but a lot of time is spent on planning and red tape). Healthcare in general should become way more efficient, with people getting better treatment; this should have a positive economic effect. It does seem to me like it should be able to speed things up in the real world, but of course a lot will have to do with how well the models can reason / how often they make catastrophic mistakes, plus the will of the governments and people to start using them seriously.
But it's more complex than that: if many people start losing their jobs, we all take a hit on GDP because they can't consume as much anymore, so it could take perhaps a long time until GDP actually sees meaningful gains.
And one last thought: Satya likely hasn't spent much time thinking about GDP; it's just not his field. He's a smart guy for sure, but this isn't what he does.
The problem is slightly different.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
> Unemployment hasn't really picked up, and is unlikely to do so
That's an important assessment. I don't know if you're right. If the models are going to continue to get more capable, I'm expecting unemployment to rise; I don't see how it won't (sure, we are promised AI will create tons of new jobs no one has imagined yet, but I haven't seen a reliable clue for such jobs yet).
I can feel this already with my own use of language models.
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
> As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005
How many people can make an actual living out of YouTube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame, which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "YouTuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
No, it won't (necessarily) be AI that's creating the new jobs. In general, when a new technology comes along and automates away some jobs, you can't expect the same technology to provide the new jobs.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented to do so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
> It's just that the market will find stuff for people to do for money
Will it ? Let's take my example, I'm a 41 year old male with around 15 years experience in software development. Lets say 4 years from now myself and million others are losing our development jobs to A.I. What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining + salary reduction), I can go into the trades (same) or get some other work that's hard to automate like caring for old people (very low salary). All of these options involve massive salary reduction, and that's in the positive scenario that I actually am able to retrain and survive such a shift mentally. It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
Well, we have many examples where, in the past, technology (and to a lesser extent trade) has led to some sectors of the economy using fewer people than before.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells in unemployment, but that can also happen without a shift in technology.
> Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
I'm just trying to get the point across that unemployment might rise, so GDP may fall; in fact I think it should be the baseline scenario, rather than thinking some new jobs we can't imagine yet will be created. It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
> It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
So I admit that this is a serious possibility that we need to consider.
But for the argument to make sense, we can't just talk about the general 'Oh, new technology will make a bunch of jobs obsolete.' We have to specifically talk about what (might) make AI special in that it might be even more general than electricity.
You didn't mention these special factors in your original comments.
I am not sure whether AI will be different or not, or rather I don't know how different it will be.
So far I see it as a good sign that we have many relatively equally competitive models from different providers, and some of them have open weights and some of them even have completely open sources (including training algorithms). So at least it's unlikely for the technology to be monopolised by any one entity.
> There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
If you want to involve the government, I'd rather give everyone a basic income, than to give our pupils inferior teachers and our sick people inferior nurses. (After all, we are assuming that humans will be worse at these jobs than the AI.) Also, I'd rather have people enjoy whatever it is they want to do, instead of being forced into some government provided make-work programme.
As it stands currently I'd say this is difficult to measure.
They're not baked into workflows where the measurable output is easily attributed to the model use. Productivity in its current form is transformative in the sense that the use case and gain differ for each individual (who even provides different prompts). So some are keeping the gains for themselves, others are using it to improve quality rather than quantity.
It'll come in time; it's important to remember GPT-4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company who are looking to use it to reduce cost in the next year.
That’s true, but productivity has many factors and takes a long time to get confidence on. Any productivity value that could be stated clearly would have similar downsides to a benchmark, and take far longer.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
Also, why not compare to o3 in the benchmarks?
The model's not really available.
They have access to o3; I do. Thousands of people do (tens of thousands at this point?). Come on. Compare to SOTA when you're saying it's the best AI you have.
I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
why not enable Canvas for this model on Gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run..
I'm guessing it should be enabled eventually. @logankilpatrick thoughts?
I've been trying to use Gemini 2.0 Flash, but I don't think it's possible. The model still thinks it's running the 1.5 Pro model.
Reference: https://rodolphoarruda.pro.br/wp-content/uploads/image-14.pn...
When these companies release a model “2.5”, are they using some form of semver? Where are these numbers coming from?
Marketing.
Weird, they released Gemini 2.5 but I still can't use 2.0 pro with a reasonable rate limit (5 RPM currently).
It will be huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.
There's probably a sweet spot here. On the flip side, ChatGPT currently doesn't indicate whether a given image generation request was serviced by multimodal GPT-4o [1] or Dall-E.
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
[1] https://news.ycombinator.com/item?id=43474112
Can anyone share what they're doing with reasoning models? They seem to only make a difference with novel programming problems, like Advent of Code. So this model will help solve slightly harder Advent of Code puzzles.
By extension it should also be slightly more helpful for research, R&D?
Have been using them for non-interactive coding where latency is not an issue. Specifically, turning a set of many free-text requirements into SQL statements, so that later when an item's data is entered into the system, we can efficiently find which requirements it meets. The reasoning models' output quality is much better than the non-reasoning models like 3.5 Sonnet, it's not a subtle difference.
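Roughly how a setup like that can hang together (the schema, the example requirement, and the llm_to_sql stub are all hypothetical; the reasoning model only runs once per requirement, and plain SQL does the work at item-entry time):

    import sqlite3

    def llm_to_sql(requirement: str) -> str:
        # "items heavier than 10kg shipped to the EU" -> a WHERE clause over `items`.
        # This is where the reasoning model call would go.
        ...

    # Clauses produced offline by the model, reviewed once, then stored.
    requirements = {
        "heavy EU shipments": "weight_kg > 10 AND destination_region = 'EU'",
    }

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id INTEGER, weight_kg REAL, destination_region TEXT)")
    db.execute("INSERT INTO items VALUES (1, 12.5, 'EU')")

    for name, where_clause in requirements.items():
        if db.execute(f"SELECT 1 FROM items WHERE id = 1 AND ({where_clause})").fetchone():
            print(f"item 1 meets requirement: {name}")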
I found reasoning models are much more faithful at text related tasks too (i.e. 1. translating long key-value pairs (i.e. Localizable.strings), 2. long transcript fixing and verification; 3. look at csv / tabular data and fix) probably due to the reflection mechanism built into these reasoning models. Using prompts such as "check your output to make sure it covers everything in the input" letting the model to double-check its work, avoiding more manual checks on my end.
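The "check your output" trick is easy to bake into a two-pass call, something like the sketch below (same google-generativeai client as in the earlier sketches; the French target and the prompt wording are just examples):

    def translate_with_self_check(strings_file: str) -> str:
        draft = model.generate_content(
            "Translate every value in this Localizable.strings content to French, "
            "keeping the keys untouched:\n\n" + strings_file
        ).text
        # Second pass: the model reviews its own draft against the full input.
        return model.generate_content(
            "Input file:\n\n" + strings_file +
            "\n\nDraft translation:\n\n" + draft +
            "\n\nCheck your output to make sure it covers everything in the input. "
            "Return the corrected, complete file only."
        ).text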
We're using it to RCA infrastructure incidents.
Seriously? That doesn't require a human?! Are we talking about some kind of "generic" incident? (Type 3: forgot to manually update the xxxx file.) Or what's going on?
Sounds unbelievable to me, but hey... :)
If they're that easy, why not fix the causes behind the need for RCAs? Our RCAs will not be solved by AI for decades, let me tell you that.
I love to see this competition between companies trying to get the best LLM, and also, the fact that they’re trying to make them useful as tools, focusing on math, science, coding, and so on
I asked it for suggestions for a project, and it was the only model that correctly pointed out serious flaws in the existing proposal. So far so good!
Yeah, and then it says that call of duty is pronounced call of dah-tee when I speak in Russian.
ChatGPT pronounced it correctly.
I asked for the direction of friction on a ball rolling either up or down an inclined plane; it gave the wrong answer and was adamant about it. Surprisingly, similar to o1.
I gave it a problem which sounds like the Monty Hall problem but is a simple probability question, and it nailed it.
Asked it to tell a joke: worst joke ever.
Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
Generated 1000 lines of turn based combat with shop, skills, stats, elements, enemy types, etc. with this one
Interestingly, the model hallucinated the ability to use a search tool when I was playing around with it
I know next to nothing about AI, but I just experienced an extraordinary hallucination in a google AI search (presumably an older Gemini model right?) as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
It feels like Gemini 2.0 Pro + Reasoning.
I also see Gemini 2.0 Pro has been replaced completely in AI Studio.
> Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
I see it in both, probably just some gradual rollout delays.
Claude is still the king right now for me. Grok is 2nd in line, but sometimes it's better.
Can't wait for the benchmark at artificialanalysis.ai
I tried the beta version of this model to write a business plan (long story).
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
LOL the random Indonesian section. That's incredible and so strange.
"Hi, here is our new AI model; it performs task A x% better than competitor 1 and task B y% better than competitor 2" seems to be the new hot AI template in town.
"My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
It's unlikely the model knows its actual cutoff date. Try asking about 2024 news; for example, in my test it knows the January 2024 Oscar nominees.
On AI studio the model told me today is June 13 2024.
Is this model going to be restricted to paying users?
It really surprises me that Google and Amazon, considering their infrastructure and the urge to excel at this, aren't leading the industry.
Google is overly cautious with their guardrails.
Granted, Gemini answers it now, however, this one left me shaking my head.
https://cdn.horizon.pics/PzkqfxGLqU.jpg
For better or worse, Google gets more bad press when their models get things wrong compared to smaller AI labs.
Ha, I still remember that super hilarious "You are under 18, so you should not write C++, as it is unsafe..." log from ... a year ago?
Looks like they're gradually removing guardrails, it returns Nixon for me.
Does it think the founding fathers were a diverse group of mixed races and genders like the last model did?
Are Gemini and Bard the same? I asked it a question and it said "... areas where I, as Bard, have..."
There is no point in asking such questions, the model doesn't know what it is on its own, and you could get many different answers if you repeat it a few more times.
Can it now generate images of soldier in typical uniforms from 1940s Germany without having to throw in a few token ethnicities?
Or generate images of the founding fathers of US that at least to some degree resemble the actual ones?
does it still suggest glue on pizza
I'll try it tonight, but I'm not excited; it's just work.
ChatGPT4.5, I was excited.
Deepseek, I was excited. (then later disappointed)
I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.
I know I've been disappointed at the quality of Google's AI products. They are backup at best.
It interpreted blood work for me
(Everything's ok, I'm just testing it ;)
Are Gemini and Bard the same? I asked it a question and it said "... areas where I, as Bard, have...."
Normal Google rollout process: Bard is deprecated, Gemini is not ready yet.
And OpenAI is announcing their ImageGen in 4o
https://news.ycombinator.com/item?id=43474112
Google has this habit of 'releasing' without releasing AI models. This looks to be the same?
I don't see it on the API price list:
https://ai.google.dev/gemini-api/docs/pricing
I can imagine that it's not so interesting to most of us until we can try it with cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also add speed and context window as separate benchmarks. Or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
It's available now as an option in Google AI Studio and Google Gemini.
It's "experimental", which means that it is not fully released. In particular, the "experimental" tag means that it is subject to a different privacy policy and that they reserve the right to train on your prompts.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
Thanks. I think my post lacked clarity of what I was talking about. I meant that most people care about API access to use with their favorite editor. It's a big limiter with grok, for example.
But I did mingle that with my knowledge of google's history of releasing without releasing these models which, as you point out, isn't true with this release.
and the price is 0.0 usd, lol