What's important about this new type of image generation, which happens with tokens rather than with diffusion, is that it is effectively reasoning in pixel space.
Example: Ask it to draw a notepad with an empty tic-tac-toe, then tell it to make the first move, then you make a move, and so on.
You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
I get the feeling these models are quite restricted in resolution, and that more work in this space will let us do really wild things, such as asking a model to create an app step by step, first completely in images, essentially designing the whole app, text and all, then writing the code to reproduce it. It also means that a model can take over from a really good diffusion model, so even if the original generations are not good, it can continue "reasoning" on an external image.
Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM (which can do all the normal things like using tools, thinking, etc). However, I also believe that diffusion models can do some of this, in a much faster way.
> What's important about this new type of image generation, which happens with tokens rather than with diffusion, is that it is effectively reasoning in pixel space.
I do not think that this is correct. Prior to this release, 4o would generate images by calling out to a fully external model (DALL-E). After this release, 4o generates images by calling out to a multi-modal model that was trained alongside it.
You can ask 4o about this yourself. Here's what it said to me:
"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
>You can ask 4o about this yourself. Here's what it said to me:
>"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
Models don't know anything about themselves. I have no idea why people keep doing this and expecting it to know anything more than a random con artist on the street.
This is overly cynical. Models typically do know what tools they have access to because the tool descriptions are in the prompt. Asking a model which tools it has is a perfectly reasonable way of learning what is effectively the content of the prompt.
Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
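For what it's worth, this is how tool "knowledge" typically reaches a model: the tool descriptions are sent along with every request, so asking the model about its tools is mostly asking it to read its own prompt back. A minimal sketch with the OpenAI Python SDK (generate_image here is a made-up tool name for illustration, not OpenAI's actual internal tool):

from openai import OpenAI

client = OpenAI()

# A hypothetical tool definition; the model "knows" about it only because it is in the request.
tools = [{
    "type": "function",
    "function": {
        "name": "generate_image",
        "description": "Render an image from a text prompt.",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What tools do you have access to?"}],
    tools=tools,
)
print(response.choices[0].message.content)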
>Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
I don't know - or care to figure out - how OpenAI does their tool calling in this specific case. But moving tool calls to the end user is _monumentally_ stupid for the latency if nothing else. If you centralize your function calls to a single model next to a fat pipe, it means that you halve the latency of each call. I've never built, or seen, a function-calling agent that moves the API function calls to client-side JS.
You should check out Claude desktop or Roo-Code or any of the other MCP client capable hosts. The whole idea of MCP is providing a universal pluggable tool api to the generative model.
They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore.
>Like other AI models, I’m trained on diverse, legally compliant data sources, but not on proprietary outputs from models like ChatGPT-4. DeepSeek adheres to strict ethical and legal standards in AI development.
> They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore
Yes, but many people expect the LLM to somehow self-reflect, to somehow describe how it feels from its first person point of view to generate the answer. It can't do this, any more than a human can instinctively describe how their nervous system works. Until recently, we had no idea that there are things like synapses, electric impulses, axons etc. The cognitive process has no direct access to its substrate/implementation.
If you fine-tune ChatGPT into saying that it's an LSTM, it will happily and convincingly insist that it is. But it's not determining this information in real time based on some perception during the forward pass.
I mean, there could be ways for it to do self reflection by observing the running script, perhaps raising or lowering the computational cost of some steps, checking the timestamps of when it was doing stuff vs when the GPU was hot etc., and figuring out which process is itself (like making gestures in front of a mirror to see which person you are). And then it could read its own Python scripts or something. But this is like a human opening up their own skull and looking around in there. It's not direct first-person knowledge.
You're incorrect. 4o was not trained on knowledge of itself so literally can't tell you that. What 4o is doing isn't even new either, Gemini 2.0 has the same capability.
Can you find me a single official source from OpenAI that claims that GPT 4o is generating images pixel-by-pixel inside of the context window?
There are lots of clues that this isn't happening (including the obvious upscaling call after the image is generated - but also the fact that the loading animation replays if you refresh the page - and also the fact that 4o claims it can't see any image tokens in its context window - it may not know much about itself but it can definitely see its own context).
I read the post, and I can't see anything in the post which says that the model is not multi-modal, nor can I see anything in the post that suggests that the images are being processed in-context.
And to answer your question, it's very clearly in the linked article. Not sure how you could have read it and missed:
> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
The 4o model itself is multi-modal, it no longer needs to call out to separate services, like the parent is saying.
I have asked GPT if it is using the 4o or 4.5 model multiple times in voice mode e.g. "Which model are you using?". It has said that it is using 4.5 when it is actually using 4o.
Yes, and it shows you believing what the bot is telling you, which is why I asked. It is giving you some generic function call with a generic name. Why would you believe that is actually what happens with it internally?
By the way when I repeated your prompt it gave me another name for the module.
Posts like this are terrifying to me. I spend my days coding these tools thinking that everyone using them understands their glaring limitations. Then I see people post stuff like this confidently and I'm taken back to 2005 and arguing that social media will be a net benefit to humanity.
The tool name is not relevant. It isn't the actual name; they use an obfuscated name. The fact that the model reports having a tool is good evidence at first glance that it is a tool, because the tool descriptions are typically IN THE PROMPT.
You can literally look at the JavaScript on the web page to see this. You've overcorrected so far in the wrong direction that you think anything the model says must be false, rather than imagining a distribution and updating or seeking more evidence accordingly.
The original claim was that the new image generation is direct multimodal output, rather than a second model. People provided evidence from the product, including outputs of the model that indicate it is likely using a tool. It's very easy to confirm that that's the case in the API, and it's now widely discussed elsewhere.
It's possible the tool is itself just gpt4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level
> It's possible the tool is itself just gpt4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level
That's probably right. It allows them to just swap it out for DALL-E, including any tooling/features/infrastructure they have built up around image generation, and they don't have to update all their 4o instances to this model, which, who knows, may not be ready for other tasks anyway, or different enough to warrant testing before a rollout, or more expensive, etc.
Honestly it seems like the only sane way to roll it out if it is a multimodal descendant of 4o.
A lot of convoluted explanations about something we don't even know if it really works all the time. I feel like in the third year of LLM hype, and after remind-me-how-many billions of dollars burned, we should by now not have to imagine what 'might happen' down the road; it should have been happening already. The use case you are describing sure sounds very interesting, until I remember asking Copilot for a simple scaffolding structure in React, and it spat out something which lacked half of the imports and proper visual alignment. A few years ago I was excited about the possibility of removing all the scaffolding and templating work so we can write the cool parts, but they cannot even do that right. It's actually a step back compared to the automatic code generators of the past, because those at least produced reproducible results every single time you used them. But hey sure, the next generation of "AI" (it's not really AI) will probably solve it.
That scene is changing so quickly that you will want to try again right now if you can.
While LLM code generation is very much still a mixed bag, it has been a significant accelerator in my own productivity, and for the most part all I am using is o1 (via the OpenAI website), DeepSeek, and JetBrains' AI service (Copilot clone). I'm eager to play with some of the other tooling available to VS Code users (such as Cline).
I don't know why everyone is so eager to "get to the fun stuff". Dev is supposed to be boring. If you don't like it maybe you should be doing something else.
I mean I literally "tried it again" this morning, as a paying Copilot customer of 12 months, with the result I already described. And I do not want to "try it" - based on the fluffy promises we've been hearing, it should "just work". Are you old enough to remember that phrase? It was a motto introduced by an engineering legend whose devices you're likely using every day. The reason why "everyone", including myself with 20+ years of experience, is looking to do not "fun stuff" (please don't put words in my mouth), but cool stuff (= hard problems), is that it produces an intrinsic sense of satisfaction, which in turn creates motivation to do more and eventually even produces wider gain for the society. Some of us went into engineering because of passion, you know. We're not all former copy-writers retrained to be "frontend developers" for a higher salary, who are eager to push around CSS boxes. That's important work too, but I've definitely solved harder problems in my career. If it's boring for you and you think it's how it should be, then you are definitely doing it for the wrong reasons (I am assuming escaping a less profitable career).
Steve Jobs may be a legend at business, but an engineer he is not. To say nothing of the fact that the whole reason "it just works" is because of said engineering. If you would like to be the innovator that finally solves that, then great! Otherwise you're just bloviating, and by god do we already have enough of that in this field.
I'm approaching 20 years of professional SWE experience myself. The boring shit is my bread and butter and it's what pays the bills and then some. The business community trying to eliminate that should be seen as a very serious threat to all our futures.
AI is an extraordinary tool, if you can't make it work for you, you either suck at prompting, or are using the wrong tools, or are working in the wrong space. I've stated what I use, why not give those things a try?
The point is not the individual tools, which at this point are just wrappers around the major LLMs. The point is that the snake oil salesmen of the major LLM companies have been telling us for several years now that it is "just about to happen". A new technology revolution. A new post-scarcity world, if you will. A tremendous increase in technological output, unleashed creativity etc. Altman routinely blabs about achieving AGI. Meanwhile, hallucination is a known feature of the models; unfortunately it's not a bug we can fix. The hallucinations will never go away, because LLMs are advanced text generators (quoting Charles Petzold), producing text based essentially on the probability that one token should follow the next. That means, mate, you can be a superstar at the "advanced" skill of "prompting" (i.e. typing in conversational English sentences), and the crappy tool will still produce output that does not make sense, for example code with non-existent framework methods etc. Why? Because with every prompt, you retrain and re-tune the model a little bit. They don't even hold the authority of a dusty old encyclopedia. You use several tools simultaneously, why? Because you cannot really rely on any of them. So you try to get a mean minimum out of several. But a mean minimum of a sum of crap will still end up crap. If any of the 3-4 major LLM engines had any competitive advantage, they would have literally obliterated the competition by now. Why is that not happening? Where is the LLM equivalent of nascent Google obliterating Altavista and Excite, or an equivalent of Windows 95 taking over the PC, React taking over the web frontend, etc? And by the way, you know that there was another famous guy at Apple, right?
They've been saying that kind of shit about everything AI related since fuzzy logic was the next big thing. It will never happen. AI will be used to cut staff and increase the workload of those remaining. The joke is on you for being susceptible to their hype.
I use a couple of different tools because they're each good at something that is useful to me. If Jetbrains AI service had a continue.dev/cline like interface and let me access all the models I want I might not deviate from that. But lucky for me work pays for everything.
You also seem awfully fixated on Copilot. How much exactly do you think your $12/month entitles you to?
Well thanks for confirming: you're getting "something" out of each, i.e. minimising mean error, because none of them is the ultimate tool. Copilot's price is actually $19 per seat and, running my own company, I pay a bit more than 19 bucks, you know, for my employees, people like yourself. Why am I fixated on a single tool? Because each of those "tools" is a wrapper around one of the major LLMs. I am surprised you don't know that. Copilot, Windsurf, Cline etc. are all just frontends for models by Anthropic, Google and OpenAI. So the output cannot, by definition, be very different.
There is lots of value to be added in wrapping those tools. I am very well aware of what these things are. LLMs are not a fire-and-forget weapon, even though so many of you business types really really really want it to be. I mean jesus you sound almost as delusional as my bosses.
Business type? I am nothing near a business type, with two technical degrees and 20 years of hands-on experience. But I managed to build my own stable business over the years, in part due to being analytical and not rushing to conclusions, especially not about strangers on the Internet ;) Where did you get the conclusion that I am delusional? It's actually the business types who think that these tools are magic, mind-blowing, etc. I am, like many other "technical types", pushing for the opposite view - yes, to some extent useful, but nowhere near the magic they are being advertised as. Anyone who calls them "mind-blowing", like some guys in my comment thread, is either inexperienced/junior or removed from the complex parts of the work, perhaps focused on writing up React frontends or similar.
There is no hubris. The LLM technology has actually existed for at least two decades; it's not some sudden breakthrough we suddenly discovered. And given the many billions of dollars it has sucked in, it's definitely a pile of crap. I have been a paying customer of GitHub Copilot for at least a year. Since Google search has been completely messed up, sometimes it can be useful to look up some cryptic error message. It can also sometimes help recall the syntax of something. But it's not the magic machine they've been touting, it's definitely not AGI, and my god, it's very prone to errors. To anyone who admires this tech: please, for the love of god, double- and triple-check the crap that they generate before you commit it to production. And by now it definitely feels like the 'self-driving' cars 'revolution'. They have been 'just around the corner' for, what, 15 years now?
No, LLMs have not existed for two decades. What a stupid comment. Millions of people are spending thousands of dollars because they're receiving tens of thousands of dollars of value from it.
Of course it's impossible to explain that to thickheaded dinosaurs on HN who think they're better than everyone and god's gift to IQ.
Please read more carefully. The LLM technology has existed for at least two decades, well, actually even more. You know, the technology the LLMs are based on (neural networks, machine learning etc). I am not sure if, after smartphones, the LLMs will now further impact the intelligence of people like you. And take note: I have been one of the early adopters and I am actually paying for usage. My criticism comes from a realistic assessment of the actual value these tools provide vs. what the marketing keeps promising (beyond trivial stuff like spinning up simple web apps). Oh, by the way, I peeked into your comment history. I see you're one of those non-technical vibe-coder types. Well, good luck with that mate, and let us know how your codebase is doing in about a year (as someone else already warned you). And if you have any customers, make sure you arrange for huge insurance coverage, you may need it.
Without bad intent, I am not sure I am even able to make sense of your sentence. Honestly it sounds as if you fed my comment into an AI tool and asked for a reply. Here is a tip for the former junior dev turned nascent GenAI-Vibecoding-manager - if you want to attack someone's credibility, especially that of an Internet stranger you are desperately trying to prove wrong, try to use something they said themselves, not something you are assuming about them. Just like I used what you said about yourself in one of your previous posts. Otherwise the same thing will keep happening over and over again and you'll keep guessing, revealing your own weak spots in domains of general knowledge and competence. My second advice to a junior dev would have been to read a book once in a while, but who needs books now that you have a magic machine as your source of truth, right?
I’m sorry but you will be the first to go in this new age. LLMs today are absolutely mind blowing, along with this image stuff. Either learn to adapt or remain a boomer.
Go where, mate, to Sam Altman's retirement home for UBI recipients? I studied neural networks while you were still a "concept of a plan" in the mind of your parents and, unlike you, I know how they work. As one of the early and paying adopters of the technology, it would have been great if they worked as advertised. But they don't, and the only people "to go" will be idiots who think that a technology that 1) anyone can use, 2) produces unreliable outputs... 3) ...while sounding authoritative, makes them an expert. Guess what, if both you and your buddy and your entire school can spin up a website with a few prompts, how much is your "skill" worth on the market? Ever heard of supply and demand? To me it looks eerily similar to how smartphones and social networks made everyone a "technologist" ;)
> is that it produces an intrinsic sense of satisfaction, which in turn creates motivation to do more and eventually even produces wider gain for the society.
Which society? Because lately it looks like the tech leaders are on a rampage to destroy the society I live in.
Copilot doesn't use the full context length. Write scripts to dump relevant code into Claude with its 200K context, or the new Gemini with even more. It does much better with as much relevant stuff as you can get into context.
But I don't want to write additional scripts or do whatever additional work to make the 'wonder tool' work. I don't mind an occasional rewording of the prompt. But it is supposed to work more or less out of the box, at least this is how all of the LLMs are being advertised, all the time (even in the lead article for this discussion).
LLMs are also primarily promoted through the web chat interface, not always magic wonder tools. With any project that will fit in claude/gemini's large context you use those interfaces and dump everything in with something like this:
(tree Source/; echo; for file in $(find Source/ -type f ) ; do echo ======== $file: ; cat $file; done ) > /mnt/c/Users/you/Desktop/claude_out.txt #claudesource
Then drag that into the chat.
You can also do stuff like pass in just the headers and a few relevant files
(tree Source/; echo; for file in $(find Source/ -type f -name '*.h' ; echo Source/path/to/{file1,file2,file3}.cpp ) ; do echo ======== $file: ; cat $file; done ) > /mnt/c/Users/you/Desktop/claude_out.txt #claudeheaderselective
You can then just hit ctrl+r and type claude to refind it in shell history. Maybe that's too close to "writing scripts" for you but if you are searching a large codebase effectively without AI you are constantly writing stuff like that and now it reads it for you.
Put the command itself into claude too and tell claude itself to write a similar one for all the implementation files it finds it needs while looking at those relevant files and headers.
If you want a wonder tool that will navigate, handle the context window, and get the right files into context for huge projects, try Claude Code or other agents, but they are still undergoing rapid improvements. Cursor has started adding some of this too, but as a subscription calling into an expensive API, they cut costs by trying to minimize context.
They also let you now just point it at a GitHub project and pull in what it needs, or use tools built around the Model Context Protocol API etc. to let it browse and pull things in.
No, thank you for the obviously good intent on your side, but I am not looking for scripting help here, nor am I a business type who does not code themselves. I just don't want to do this when I am already paying for tools which should be able to do it themselves, as they already wrap Claude, ChatGPT and whatever other LLMs. And unless you're professionally developing with the Microsoft stack, I'd advise ditching Windows+MinGW for Linux, or at the very least, a MacBook ;)
The tooling you are paying for doesn't work with the full abilities of the context, so you need to do something else. It doesn't matter what it's supposed to do, or that other people say it does everything for them well; in my experience it works a lot better with as much in context as possible. They do have other tools like RAG in Cursor, and it's much quicker iteration. Ultimately a mix of what works best is what you should use, but don't just write it all off out of disappointment with one type of tool.
I am lucky in the sense that neither myself nor my business depend very much on these tools because we do work which is more complex than frontend web apps or whatever people use them for these days. We use them here and there, mainly because google search is such crap these days, but we had been doing very well without them too and could also turn them off. The only reason we still keep them around is that the cost is fairly low. However, I feel like we are missing the bigger picture here. My point is, all of these companies have been constantly hyping a near-AGI experience for the past 3 years at least. As a matter of principle, I refuse to do additional work for them to "make it work". They should have been working already without me thinking about how big their context window is or whatever. Do you ever have to think how your operating system works when you ask it to copy a file or how your phone works when you answer a call? I will leave it to some vibe-coder (what an absurd word) who actually does depend on those tools for their livelihood.
> As a matter of principle, I refuse to do additional work for them to "make it work". Do you ever have to think how your operating system works when you ask it to copy a file or how your phone works when you answer a call?
Doesn't matter, use the tool that makes it easy and get less context, or realize the limitations and don't fall for marketing of ease and get more context. You don't want to do additional work beyond what they sold you on, out of principle. But you are getting much less effective use by being irrationally ornery.
Ok now think about this in terms of items you own or likely own: What would you do if I sold you a car with 3 doors, after advertising it as having 5 doors instead? Would you accept it and try to work around that little inconvenience? Or would you return the product and demand your money back?
Consider using a better AI IDE platform than Copilot ... cursor, windsurf, cline, all great options that do much better than what you're describing. The underlying LLM capabilities also have advanced quite a bit in the past year.
Well I do not really use it that much to actually care, and don't really depend on AI, thankfully. If they had not messed up Google search, we wouldn't even need that crap at all. But that's not the main point. Even if I switched to Cursor or Windsurf - aren't they all using the same LLMs? (ChatGPT, Claude, whatever...). The issue is that the underlying general approach will never be accurate enough. There is a reason most successful technologies lift off quickly and those not successful also die very quickly. This is a tech propped up by a lot of VC money for now, but at some point even the richest of the rich VCs will have trouble explaining spending 500B dollars in total to get something like 15B in revenue (not even profit). And don't even get me started on Altman's trillion-fantasies...
:) And you sound like one of the many people I've seen come and go in my career. Best of luck to you actually - if the GenAI bubble does not pop in the next few years (which it will) we'll only have so many open positions for "prompters" to use for building web app skeletons :)
> truly generative UI, where the model produces the next frame of the app
Please sir step away from the keyboard now!
That is an absurd proposition and I hope I never get to use an app that dreams of the next frame. Apps are buggy as they are, I don't need every single action to be interpreted by LLM.
An existing example of this is that AI Minecraft demo and it's a literal nightmare.
Yeah, but the abstractions have been useful so far. The main advantage of our current buggy apps is that if it is buggy today, it will be exactly as buggy tomorrow. Conversely, if it is not currently buggy, it will behave the same way tomorrow.
I don't want an app that either works or does not work depending on the RNG seed, prompt and even data that's fed to it.
That's even ignoring all the absurd computing power that would be required.
Still sounds a bit like we've seen it all already – dynamic linking introduced a lot of ways for software that wasn't buggy today to become buggy tomorrow. And Chrome uses an absurd amount of computing power (its bare minimum is many multiples of what was once a top-of-the-line, expensive PC).
I think these arguments would've been valid a decade ago for a lot of things we use today. And I'm not saying the classical software way of things needs to go away or even diminish, but I do think there are unique human-computer interactions to be had when the "VM" is in fact a deep neural network with very strong intelligence capabilities, and the input/output is essentially keyboard & mouse / video+audio.
> This argument could be made for every level of abstraction we've added to software so far... yet here we are commenting about it from our buggy apps!
No. Not at all. Those levels of abstractions – whether good, bad, everything in between – were fully understood through-and-through by humans. Having an LLM somewhere in the stack of abstractions is radically different, and radically stupid.
Every component of a deep neural network is understood by many people, it's the interaction between the numbers trained that we don't always understand. Likewise, I would say that we understand the components on a CPU, and the instructions it supports. And we understand how sets of instructions are scheduled across cores, with hyperthreading and the operating system making a lot of these decisions. All the while the GPU and motherboard are also full of logical circuits, understood by other people probably. And some (again, often different) people understand the firmware and dynamically linked libraries that the users' software interfaces with. But ultimately a modern computer running an application is not through and through understood by a single human, even if the individual components could be.
Anyway, I just think it's fun to make the thought experiment that if we were here 40 years ago, discussing today's advanced hardware and software architecture and how it interacts, very similar arguments could be used to say we should stick to single instructions on a CPU because you can actually step through them in a human understandable way.
First it will dream up the interaction frame by frame. Next, to improve efficiency, it will cache those interaction representations. What better way to do that than through a code representation.
While I think current AI can’t come close to anything remotely usable, this is a plausible direction for the future. Like you, I shudder.
> “DLSS Multi Frame Generation generates up to three additional frames per traditionally rendered frame, working in unison with the complete suite of DLSS technologies to multiply frame rates by up to 8X over traditional brute-force rendering. This massive performance improvement on GeForce RTX 5090 graphics cards unlocks stunning 4K 240 FPS fully ray-traced gaming.”
"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."
Maybe the "HELL YEAH" added a "party implication" which shifted it's "thinking" into just correct enough latent space that it was able to actually hunt down some image somewhere in its training data of a truly full glass of wine.
I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.
Yeah. I understand that this site doesn’t want to become Reddit, but it really has an allergy to comedy, it’s sad. God forbid you use sarcasm, half the people here won’t understand it and the other half will say it’s not appropriate for healthy discussion…
Is it drawing the image from top to bottom very slowly over the course of at least 30 seconds? If not, then you're using DALL-E, not 4o image generation.
This top to bottom drawing – does this tell us anything about the underlying model architecture? AFAIK diffusion models do not work like that. They denoise the full frame over many steps. In the past there used to be attempts to slowly synthesize a picture by predicting the next pixel, but I wasn't aware whether there has been a shift to that kind of architecture within OpenAI.
Yes, the model card explicitly says it's autoregressive, not diffusion. And it's not a separate model, it's a native ability of GPT-4o, which is a multimodal model. They just didn't make this ability public until now. I assume they worked on the fine-tuning to improve prompt following.
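OpenAI hasn't published the architecture, so this is only a sketch of what raster-order autoregressive image generation means in principle. Everything here is hypothetical: model stands in for a transformer over mixed text and image tokens, and the resulting image tokens would still need a separate decoder to become pixels.

import torch

def generate_image_tokens(model, prompt_tokens, grid_h=32, grid_w=32):
    # Sample one image token at a time, left to right, top to bottom,
    # each conditioned on the prompt and on all previously generated tokens.
    tokens = list(prompt_tokens)
    image_tokens = []
    for _ in range(grid_h * grid_w):
        logits = model(torch.tensor([tokens + image_tokens]))[0, -1]
        next_tok = torch.distributions.Categorical(logits=logits).sample().item()
        image_tokens.append(next_tok)
    return image_tokens  # a VQ-style or diffusion decoder would turn these into pixels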
It very much looks like a side effect of this new architecture. In my experience, text looks much better in recent DALL-E images (so what ChatGPT was using before), but it is still noticeably mangled when printing more than a few letters. This model update seems to improve text rendering by a lot, at least as long as the content is clearly specified.
Yeah who wouldn't love a dip in the sulphur pool. But back to the question, why can't such a model recognize letters as such? It cannot be trained to pay special attention to characters? How come it can print an anatomically correct eye but not differentiate between P and Z?
I think we're really fscked, because even AI image detectors think the images are genuine. They look great in Photoshop forensics too. I hope the arms race between generators and detectors doesn't stop here.
We're not. This PNG image of a wine glass has JPEG compression artefacts which are leaking in from JPEG training data. You can zoom into the image and you will see the 8x8 boundaries of the blocks used in JPEG compression, which just cannot be in a PNG. This is a common method to detect an AI-generated image and it is working so far; no need for complex Photoshop forensics or AI detectors, just zoom in and check for compression. Current AI is incapable of getting it right: all the compression algorithms are mixed and mashed together in the training data, so in a generated image you can find artefacts from almost all of them if you're lucky, but JPEG is prevalent obviously, since lossless images are rare online.
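Not that it settles anything, but here's a rough sketch of that kind of check (the filename is a placeholder, and this is only a heuristic): JPEG history tends to leave slightly stronger luminance jumps at 8x8 block boundaries, so ratios noticeably above 1 hint at block artefacts.

from PIL import Image
import numpy as np

# Load as grayscale and measure gradient energy at every column/row boundary.
img = np.asarray(Image.open("wine_glass.png").convert("L"), dtype=np.float32)
dx = np.abs(np.diff(img, axis=1)).mean(axis=0)  # average jump between adjacent columns
dy = np.abs(np.diff(img, axis=0)).mean(axis=1)  # average jump between adjacent rows

def boundary_ratio(d):
    idx = np.arange(1, len(d) + 1)          # boundary i sits between pixel i-1 and i
    on = d[idx % 8 == 0].mean()             # boundaries that align with 8x8 blocks
    off = d[idx % 8 != 0].mean()            # everything else
    return on / off

print(boundary_ratio(dx), boundary_ratio(dy))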
If JPEG compression is the only evident flaw, this kind of reinforces my point, as most of these images will end up shared as processed JPEG/WebP on social media.
That's the point. With the old models, they all failed to produce a wine glass that is completely full to the brim. Because you can't find that a lot in the data they used for training.
I obviously have no idea if they added real or synthetic data to the training set specifically regarding the full-to-the-brim wineglass test, but I fully expect that this prompt is now compromised in the sense that, because it is being discussed in the public sphere, it has inherently become part of the test suite.
Remember the old internet adage that the fastest way to get a correct answer online is to post an incorrect one? I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.
> I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.
Take some artisan; I'll go with a barber. The human person is not the best of the best, but still a capable barber, who can implement several styles on any head you throw at them. A client comes and describes a certain style they want. The barber is not sure how to implement such a style, consults with the master barber beside them, that barber describes the technique required for that particular style, and our barber in question comes and implements that style. Probably not perfectly, as they need to train their mind-body coordination a bit, but the cut is good enough that the client is happy.
There was no traditional training with "gap finding and filling" involved. The artisan already possessed the core skill and knowledge required, was filled in on the particulars of the task at hand, and successfully implemented it. There was no looking at examples of finished work, no looking at examples of the process, no iterative learning by redoing the task a bunch of times.
So no, human learning, at least advanced human learning, is very much different from these techniques. Not that they are not impressive on their own, but let's be real here.
I think there is a critical aspect of human visual learning which machine learning can't replicate because it is prohibitively expensive. When we look at things as children we are not just looking at a single snapshot. When you stare at an object for a few seconds you have practically ingested hundreds of slightly varied images of that object. This gets even more interesting when you take into account that the real world is moving all the time, so you are seeing so many things from so many angles. This is simply undoable with compute.
Then explain blind children? Or blind and deaf children? There's obviously some role senses play in development, but there are clearly capabilities at play here that are drastically more efficient and powerful than what we have with modern transformers. While humans learn through example, they clearly need a lot fewer examples to generalize from and reason against.
I think my point is that communication is the biggest contributor to brain development more than anything and communication is what powers our learning. Effective learners learn to communicate more with themselves and to communicate virtually with past authors through literature. That isn’t how LLMs work. Not sure why that would be considered objectionable. LLMs are great but we don’t have to pretend like they’re actually how brains work. They’re a decent approximation for neurons on today’s silicon - useful but nowhere near the efficiency and power of wetware.
Also as for touch, you’re going to have a hard time convincing me that the amount of data from touch rivals the amount of content on the internet or that you just learn about mistakes one example at a time.
There are so many points to consider here, I'm not sure I can address them all.
- Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not).
- Human brains may be doing some analogue of sample augmentation, which gives you some multiple more equivalent samples of data to train on per real input state of the environment. This is done for ML too (a rough sketch follows this list).
- Whether that input data is text, or embodied is sort of irrelevant to cognition in general, but may be necessary for solving problems in a particular domain. (text only vs sight vs blind)
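For the sample-augmentation point, this is roughly what it looks like in practice; a minimal sketch with torchvision (the file path is a placeholder):

import torchvision.transforms as T
from PIL import Image

# One real photo becomes many slightly varied training samples, loosely analogous to
# seeing the same object under shifting angles, crops, and lighting.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3),
    T.RandomRotation(10),
])

img = Image.open("apple.jpg")
variants = [augment(img) for _ in range(100)]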
> Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not).
I think you're saying exactly what I'm saying. Human brains work differently from LLMs and the OP comment that started this thread is claiming that they work very similarly. In some ways they do but there's very clear differences and while clarifying examples in the training set can improve human understanding and performance, it's pretty clear we're doing something beyond that - just from a power efficiency perspective humans consume far less energy for significantly more performance and it's pretty likely we need less training data.
To be honest I don't really care if they work the same or not. I just like that they do work, and find it interesting.
I don't even think people's brains work the same as each other's. Half of people can't even visually imagine an apple.
Neural networks seem to notice and remember very small details, as if they have access to signals from early layers. Humans often miss the minor details. There's probably a lot more signal normalization happening. That limits calorie usage and artifacts the features.
I don't think that this is necessarily a property neural networks can't have. I think it could be engineered in. For now, though, it seems like we're making a lot of progress even without efficiency constraints, so nobody cares.
Even if they did, I’d assume the association of “full” and this correct representation would benefit other areas of the model. I.e., there could (/should?) be general improvement for prompts where objects have unusual adjectives.
So maybe training for litmus tests isn’t the worst strategy in the absence of another entire internet of training data…
A lot of other things are rare in datasets, let alone correctly labeled. Overturned cars (showing the underside), views from under the table, people walking on the ceiling with plausible upside down hair, clothes, and facial features etc etc
There is no one correct way to interpret 'full'. If you go to a wine bar and ask for a full glass of wine, they'll probably interpret that as a double. But you could also interpret it the way a friend would at home, which is about 2-3cm from the rim.
Personally I would call a glass of wine filled to the brim 'overfilled', not 'full'.
I think you're missing the context everyone else has - this video is where the "AI can't draw a full glass of wine" meme got traction https://www.youtube.com/watch?v=160F8F8mXlo
The prompts (some generated by ChatGPT itself, since it's instructing DALL-E behind the scenes) include phrases like "full to the brim" and "almost spilling over" that are not up to interpretation at all.
People were telling the models explicitly to fill it to the brim, and the models were still producing images where it was filled to approximately the half-way point.
Generating an image of a completely full glass of wine has been one of the popular limitations of image generators, the reason being neural networks struggling to generalise outside of their training data (there are almost no pictures on the internet of a glass "full" of wine). It seems they implemented some reasoning over images to overcome that.
Looks amazing. Can you please also create an unconventional image, like the clock at 2:35? I tried something like this with Gemini when some redditor asked for it and it failed, so I'm wondering if 4o can do it.
I tried and it failed repeatedly (like actual error messages):
> It looks like there was an error when trying to generate the updated image of the clock showing 5:03. I wasn’t able to create it. If you’d like, you can try again by rephrasing or repeating the request.
A few times it did generate an image but it never showed the right time. It would frequently show 10:10 for instance.
If it tried and failed repeatedly, then it was prompting DALL-E, looking at the results, then prompting DALL-E again, not doing direct image generation.
No... OpenAI said it was "rolling out". Not that it was "already rolled out to all users and all servers". Some people have access already, some people don't. Even people who have access don't have it consistently, since it seems to depend on which server processes your request.
I’m using 4o and it gets time wrong a decent chunk but doesn’t get anything else in the prompt incorrect. I asked for the clock to be 4:30 but got 10:10. OpenAI pro account.
Why does it sound like this isn't reasoning on images directly, but rather just DALL-E, as some other comment said? I will type the name of the person here (coder543).
On the web version, click on the image to make it larger. In the upper right corner, there is an (i) icon, which you can click to reveal the DALL-E prompt that GPT-4o generated.
Yeah, it seems like somewhere in the semantic space (which then gets turned into a high resolution image using a specialized model probably) there is not enough space to hold all this kind of information. It becomes really obvious when you try to meaningfully modify a photo of yourself, it will lose your identity.
For Gemini it seems to me there's some kind of "retain old pixels" support in these models since simple image edits just look like a passthrough, in which case they do maintain your identity.
Also still seems to have a hard time consistently drawing pentagons. But at least it does some of the time, which is an improvement since last time I tried, when it would only ever draw hexagons.
I think it is not the AI but you who is wrong here. A full glass of wine is filled only up to the point of max radius, so that the surface exposed to air is maximized and the wine can breathe. This is what we taught the AI to consider „a full glass of wine“ and it perfectly gets it right.
It’s a type of QA question that can identify peculiarities in models (e.g. count “r”s in strawberry), which the best we have given the black box nature of LLMs.
Would be interested to know as well. As far as I know there is no public information about how this works exactly. This is all I could find:
> The system uses an autoregressive approach — generating images sequentially from left to right and top to bottom, similar to how text is written — rather than the diffusion model technique used by most image generators (like DALL-E) that create the entire image at once. Goh speculates that this technical difference could be what gives Images in ChatGPT better text rendering and binding capabilities.
I wonder how it'd work if the layers were more physical based. In other words something like rough 3d shape -> details -> color -> perspective -> lighting.
Also wonder if you'd get better results in generating something like blender files and using its engine to render the result.
There are a few different approaches. Meta documents at least one approach quite well in one of their llama papers.
The general gist is that you have some kind of adapter layers/model that can take an image and encode it into tokens. You then train the model on a dataset that has interleaved text and images. Could be webpages, where images occur in-between blocks of text, chat logs where people send text messages and images back and forth, etc.
The LLM gets trained more-or-less like normal, predicting next token probabilities with minor adjustments for the image tokens depending on the exact architecture. Some approaches have the image generation be a separate "path" through the LLM, where a lot of weights are shared but some image token specific weights are activated. Some approaches do just next token prediction, others have the LLM predict the entire image at once.
As for encoding-decoding, some research has used things as simple as Stable Diffusion's VAE to encode the image, split up the output, and do a simple projection into token space. Others have used raw pixels. But I think the more common approach is to have a dedicated model trained at the same time that learns to encode and decode images to and from token space.
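A minimal sketch of that "simple projection" idea (the dimensions are made up: SD-style VAE latents have 4 channels, and 4096 is just a typical LLM hidden size):

import torch
import torch.nn as nn

class ImageToTokens(nn.Module):
    """Encode VAE latents as a sequence of embeddings the LLM can interleave with text."""
    def __init__(self, latent_channels=4, d_model=4096):
        super().__init__()
        self.proj = nn.Linear(latent_channels, d_model)

    def forward(self, latents):                   # latents: (B, C, H, W), e.g. from SD's VAE
        seq = latents.flatten(2).transpose(1, 2)  # -> (B, H*W, C), one "token" per latent cell
        return self.proj(seq)                     # -> (B, H*W, d_model), ready to interleave with text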
For the latter approach, this can be a simple model, or it can be a diffusion model. For encoding you do something like a ViT. For decoding you train a diffusion model conditioned on the tokens, throughout the training of the LLM.
For the diffusion approach, you'd usually do post-training on the diffusion decoder to shrink down the number of diffusion steps needed.
The real crux of these models is the dataset. Pretraining on the internet is not bad, since there's often good correlation between the text and the images. But there's not really a good instruction dataset for this. Like, "here's an image, draw it like a comic book" type stuff. Given OpenAI's approach in the past, they may have just brute-forced the dataset using lots of human workers. That seems to be the most likely approach anyway, since no public vision models are quite good enough to do extensive RL against.
And as for OpenAI's architecture here, we can only speculate. The "loading from top to bottom from a blurry image" is either a direct result of their architecture or a gimmick to slow down requests. If the former, it means they are able to get a low resolution version of the image quickly, and then slowly generate the higher resolution "in order." Since it's top-to-bottom, that implies token-by-token decoding. My _guess_ is that the LLM's image token predictions are only "good enough." So they have a small, quick decoder take those and generate a very low resolution base image. Then they run a stronger decoding model, likely a token-by-token diffusion model. It takes as condition the image tokens and the low resolution image, and diffuses the first patch of the image. Then it takes as condition the same plus the decoded patch, and diffuses the next patch. And so forth.
A mixture of approaches like that allows the LLM to be truly multi-modal without the image tokens being too expensive, and the token-by-token diffusion approach helps offset memory cost of diffusing the whole image.
I don't recall if I've seen token-by-token diffusion in a published paper, but it's feasible and is the best guess I have given the information we can see.
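Since this is all speculation anyway, here is the shape of that guess as code. Every name is hypothetical; diffusion_decode_patch stands in for a conditional diffusion sampler, and assembling the patches into pixels is left out.

def decode_image(image_tokens, low_res_base, diffusion_decode_patch, n_patches):
    decoded = []
    for i in range(n_patches):            # patches in raster order, top to bottom
        patch = diffusion_decode_patch(
            cond_tokens=image_tokens,     # the LLM's "good enough" image tokens
            cond_image=low_res_base,      # the quick low-resolution preview
            cond_patches=decoded,         # patches already rendered above this one
            patch_index=i,
        )
        decoded.append(patch)
    return decoded                        # stitch these together for the final image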
EDIT: I should note, I've been "fooled" in the past by OpenAI's API. When o* models first came out, they all behaved as if the output were generated "all at once." There was no streaming, and in the chat client the response would just show up once reasoning was done. This led me to believe they were doing an approach where the reasoning model would generate a response and refine it as it reasoned. But that's clearly not the case, since they enabled streaming :P So take my guesses with a huge grain of salt.
When the locations are picked randomly, they found it worked okay, but doing it in raster order (left to right, top to bottom) it didn't work as well. We tried it for music and found it was vulnerable to compounding error and lots of oddness relating to the fragility of continuous-space CFG.
There is a more recent approach to auto-regressive image generation.
Rather than predicting the next patch at the target resolution one by one, it predicts the next resolution. That is, the image at a small resolution followed by the image at a higher resolution and so on.
It also would mean that the model can correctly split the image into layers, or segments, matching the entities described. The low-res layers can then be fed to other image-processing models, which would enhance them and fill in missing small details. The result could be a good-quality animation, for instance, and the "character" layers can even potentially be reusable.
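For concreteness, the next-resolution idea looks roughly like this; model.predict_scale is a hypothetical method, and real next-scale approaches add details like residual token maps per scale.

def generate_by_scale(model, prompt_tokens, scales=(8, 16, 32, 64)):
    # Predict the whole image at 8x8, then 16x16, ..., each scale conditioned on the coarser ones.
    context = list(prompt_tokens)
    images = []
    for s in scales:
        tokens_at_scale = model.predict_scale(context, resolution=s)  # s*s image tokens in one shot
        context += tokens_at_scale                                    # coarser scales condition finer ones
        images.append(tokens_at_scale)
    return images[-1]  # highest-resolution token map; a decoder turns it into pixels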
I wasn't really planning to share/release it today, but, heck, why not.
I started with bitmap-style generative image models, but because they are still pretty bad at text (even this, although it’s dramatically better), for early-2025 it’s generating vector graphics instead. Each frame is an LLM response, either as an svg or static html/css. But all computation and transformation is done by the LLM. No code/js as an intermediary. You click, it tells the LLM where you clicked, the LLM hallucinates the next frame as another svg/static-html.
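Not the project's actual code, but the loop is roughly this shape; a sketch with the Anthropic SDK, where the model id and prompt wording are placeholders:

import anthropic

client = anthropic.Anthropic()

def next_frame(previous_frame_html, event):
    # Send the previous frame plus the UI event; get back the next frame as SVG/static HTML.
    msg = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder model id
        max_tokens=4096,
        system="You are the runtime of a UI. Reply with ONLY the next frame as a single "
               "self-contained SVG or static HTML document. No explanations, no JS.",
        messages=[{
            "role": "user",
            "content": f"Previous frame:\n{previous_frame_html}\n\nEvent: {event}",
        }],
    )
    return msg.content[0].text

frame = "<svg><!-- initial blank frame --></svg>"
frame = next_frame(frame, "click at x=120, y=48 on element id=menu-button")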
If it ran 50x faster it’d be an absolutely jaw dropping demo. Unlike "LLMs write code", this has depth. Like all programming, the "LLMs write code" model requires the programmer or LLM to anticipate every condition in advance. This makes LLM written "vibe coded" apps either gigantic (and the llm falls apart) or shallow.
In contrast, as you use universal, you can add or invent features ranging from small to big, and it will fill in the blanks on demand, fairly intelligently. If you don't like what it did, you can critique it, and the next frame improves.
It's agonizingly slow in 2025, but much smarter and, in weird ways, less error prone than using the LLM to generate code that you then run: just run the computation via the LLM itself.
You can build pretty unbelievable things (with hallucinated state, granted) with a few descriptive sentences, far exceeding the capabilities you can "vibe code" with the same description. And it never gets lost in a rat's nest of self-generated garbage code because… there is no code to get lost in.
Code is a medium with a surprisingly strong grain. This demo is slow, but SO much more flexible and personally adaptable than anything I've used where the logic is implemented via a programming language.
I don’t love this as a programmer, but my own use of the demo makes me confident that programming languages as a category will have a shelf life if LLM hardware gets fast, cheap and energy efficient.
I suspect LLMs will generate not programming language code, but direct wasm or just machine code on the fly for things that need to react faster than it can draw a frame, but core logic will move out of programming languages (not even LLM-written code). Maybe similar to the way we bind to low-level fast languages but a huge percentage of "business" logic is written in relatively slower languages.
FYI, I may not be able to afford the credits if too many people visit; I put $1000 of credits on this, we'll see if that lasts. This is Claude 3.7. I tried everything else, and Claude had the visual intelligence today. IMO this is a much more compelling glance at the future than coding models. Unfortunately, generating an SVG per click is pricey, each click/frame costs me about $0.05. I'll fund this as far as I can so folks can play with it.
Anthropic? You there? Wanna throw some credits at an open source project doing something that literally only works on claude today? Not just better, but “only Claude 3.7 can show this future today?”. I’d love for lots more people to see the demo, but I really could use an in-kind credit donation to make this viable. If anyone at anthropic is inspired and wants to hook me up: snickell@alumni.stanford.edu. Very happy to rep Claude 3.7 even more than I already do.
I think it’s great advertising for Claude. I believe the reason Claude seems to do SO much better at this task is, one, it shows far greater spatial intelligence, and two, I suspect they are the only state-of-the-art model intentionally training on SVG.
I’m a bit late here - but I’m the COO of OpenRouter and would love to help out with some additional credits and share the project. It’s very cool and more people could be able to check it out. Send me a note. My email is cc at OpenRouter.ai
I don't think the project would have gotten this far without openrouter (because: how else would you sanely test on 20+ models to be able to find the only one that actually worked?). Without openrouter, I think I would have given up and thought "this idea is too early for even a demo", but it was easy enough to keep trying models that I kept going until Claude 3.7 popped up.
This is super cool! I think new kinds of experiences can be built with infinite generative UIs. Obviously there will need to be good memory capabilities, maybe through tool use.
If you end up taking this further and self hosting a model you might actually achieve a way faster “frame rate” with speculative decoding since I imagine many frames will reuse content from the last. Or maybe a DSL that allows big operations with little text. E.g. if it generates HTML/SVG today then use HAML/Slim/Pug: https://chatgpt.com/share/67e3a633-e834-8003-b301-7776f76e09...
What I'm currently doing is caveman: I ask the LLM to attach a unique id= to every element, and I gave it an attribute (data-use-cached) it can use to mark "the contents of this element should be loaded from the previous frame": https://github.com/snickell/universal/blob/47c5b5920db5b2082...
For example, this specifies that #my-div should be replaced with the value from the previous frame (which itself might have been cached):
<div id="my-div" data-use-cached></div>
This lowers the render time /substantially/, for simple changes like "clicked here, pop-open a menu" it can do it in 10s, vs a full frame render which might be 2 minutes (obviously varies on how much is on the screen!).
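For anyone curious, the substitution step is roughly this shape (a sketch, not the repo's actual code):

from bs4 import BeautifulSoup

def fill_cached(new_frame_html, prev_frame_html):
    # Any element the LLM marked with data-use-cached gets swapped for the element
    # with the same id from the previous frame.
    new = BeautifulSoup(new_frame_html, "html.parser")
    prev = BeautifulSoup(prev_frame_html, "html.parser")
    for el in new.find_all(attrs={"data-use-cached": True}):
        cached = prev.find(id=el.get("id"))
        if cached is not None:
            el.replace_with(cached)
    return str(new)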
I think using HAML etc is an interesting idea, thanks for suggesting it, that might be something I'll experiment with.
The challenge I'm finding is that "fancy" also has a way of confusing the LLM. E.g. I originally had the LLM produce literal unified diffs between frames. I reasoned it had seen plenty of diffs of HTML in its training data set. It could actually do this, BUT image quality and intelligence were notably affected.
Part of the problem is that at the moment (well 1mo ago when I last benchmarked), only Claude is "past the bar" for being able to do this particular task, for whatever reason. Gemini Flash is the second closest. Everything else (including 4o, 4.5, o1, deepseek, etc) are total wipeouts.
What would be really amazing is if, say, Llama 4 turns out to be good in the visual domain the way Claude is, and you can run it on one of the LLM-on-silicon vendors (Cerebras, Groq, etc) to get 10x the token rate.
LMK if you have other ideas, thanks for thinking about this and taking a look!
No, I wasn't planning to post this for a couple weeks, but I saw the comment and was like "eh, why not?".
You can watch "sped up" past sessions by other people who used this demo here, which is kind of like a demo video: https://universal.oroborus.org/gallery
But the gallery feature isn't really there today, it shows all the "one-click and bounce sessions", and it's hard to find signal in the noise.
I'll probably submit a "Show HN" when I have the gallery more together, and I think it's a great idea to pick a multi-click gallery sequence and upload it as a video.
Seconding the need for a video. We need a way to preview this without it costing you money. I had to charge you a few dimes to grasp this excellent work. The description does not do it justice; people need to see this in motion. The progressive build-up of a single frame, too. I encourage you to post the Show HN soon.
Anyone know, order of magnitude, how many visits to expect from a Show HN? A thousand? 10? 100? I need to figure out how many credits I'd need to line up to survive one.
> had to charge you a few dimes
s/you/openrouter/: ty to openrouter for donating a significant chunk of credits a couple hours ago.
Really appreciate the feedback on needing a video. I had a sense this was the most important "missing piece", but this will give me the motivation to accomplish what is (to me) a relatively boring task, compared to hacking out more features.
Yeah Gemini has had this for a few weeks, but much lower resolution. Not saying 4o is perfect, but my first few images with it are much more impressive than my first few images with Gemini.
>You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
You can do that with diffusion, too. Just lock the parameters in ComfyUI.
Yeah I wasn’t very imaginative in my examples, with 4o you can also perform transformations like “rotate the camera 10 degrees to the left” which would be hard without a specialized model. Basically you can run arbitrary functions on the exact image contents but in latent space.
I'm incredibly deep in the image / video / diffusion / comfy space. I've read the papers, written controlnets, modified architectures, pretrained, finetuned, etc. All that to say that I've been playing with 4o for the past day, and my opinions on the space have changed dramatically.
4o is a game changer. It's clearly imperfect, but its operating modalities are clearly superior to everything else we have seen.
Have you seen (or better yet, played with) the whiteboard examples? Or the examples of it taking characters out of reflections and manipulating them? The prompt adherence, text layout, and composing capabilities are unreal to the point this looks like it completely obsoletes inpainting and outpainting.
I'm beginning to think this even obsoletes ComfyUI and the whole space of open source tools once the model improves. Natural language might be able to accomplish everything outside of fine adjustments, but if you can also supply the model with reference images and have it understand them, then it can do basically everything. I haven't bumped into anything that makes me question this yet.
They just need to bump the speed and the quality a little. They're back at the top of image gen again.
I'm hoping the Chinese or another US company releases an open model capable of these behaviors. Because otherwise OpenAI is going to take this ball and run far ahead with it.
Yeah if we get an open model that one could apply a LoRA (or similarly cheap finetuning) to, then even problems like reproducing identity would (most likely) be solved, as they were for diffusion models. The coherence not just to the prompt but to any potential input image(s) is way beyond what I've seen in diffusion models.
I do think they run a "traditional" upscaler on the transformer output since it seems to sometimes have errors similar to upscalers (misinterpreted pixels), so probably the current decoded resolution is quite low and hopefully future models like GPT-5 will improve on this.
That's very interesting. I would have assumed that 4o is internally using a single seed for the entire conversation, or something analogous to that, to control randomness across image generation requests. Can you share the technical name for this reasoning process so I could look up research about it?
Is it able to break the usual failure modes of these models, that all clocks are at 10 min past two, or they can't produce images of people drawing with the left hand?
In my tests no, that's still not possible with the model unfortunately, but it feels like you have way more control with prompting over any previous model (stable diffusion/midjourney).
> Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM
With current GPU technology, this system would need its own Dyson sphere.
I might just be a grumpy old man, but it really bugs me when the AI confidently says, "Here is your image, If you have any other requests, just let me know!".
For a start the image is wrong, and also I know I can make more requests, because that's what tools are for. It's like a passive-aggressive suggestion that I made the AI go out of its way to do me a favor.
Wrt reasoning I’ll believe it when I see it. I just tried several variants of “Generate an image of a chess board in which white has played three great moves and black has played two bad moves.” Results are totally nonsensical as always.
Ran through some of my relatively complex prompts combined with using pure text prompts as the de-facto means of making adjustments to the images (in contrast to using something like img2img / inpainting / etc.)
Great question. I haven't tested the creation of such an image from scratch, but I did add an adjustment test against that specific text-heavy diagram and I'd say it passed with "flying colors" (pun intended).
I’ve just tried it and oh wow it’s really good. I managed to create a birthday invitation card for my daughter in basically 1-shot, it nailed exactly the elements and style I wanted. Then I asked to retain everything but tweak the text to add more details about the date, venue etc. And it did. I’m in shock. Previous models would not be even halfway there.
> Draw a birthday invitation for a 4 year old girl [name here]. It should be whimsical, look like its hand-drawn with little drawings on the sides of stuff like dinosaurs, flowers, hearts, cats. The background should be light and the foreground elements should be red, pink, orange and blue.
Then I asked for some changes:
> That's almost perfect! Retain this style and the elements, but adjust the text to read:
> [refined text]
> And then below it should add the location and date details:
Just did the same type of prompt for my son's birthday. I got all the classic errors. First attempt looked good, but had 2 duplicate lines for date and time,
and "Roarrr!" (dino theme) had a blurred-out "a".
Pointed these issues out to give it a second go and got something way worse. This still feels like little more than a fun toy.
We're in the middle of a massive and unprecedented boom in AI capabilities. It is hard to be upset about this phrasing - it is literally true and extremely accurate.
Most things aren't in a massive boom and most people aren't that involved in AI. This is a rare example of great communication in marketing - they're telling people who might not be across this field what is going on.
> Why would they publish a model that is not their most advanced model?
I dunno, I'm not sitting in the OpenAI meetings. That is why they need to tell us what they are doing - it is easy to imagine them releasing something that isn't their best model ever and so they clarify that this is, in fact, the new hotness.
(Shrug) It's common for less-than-foundation-level models to be released every so often. This is done in order to provide new options, features, pricing, service levels, APIs or whatever that aren't yet incorporated into the main model, or that are never intended to be.
Just a consequence of how much time and money it takes to train a new foundation model. It's not going to happen every other week. When it does, it is reasonable to announce it with "Announcing our most powerful model yet."
o3-mini wasn't so much a most advanced model as it was incredibly affordable for the IQ it was presenting at the time. Sometimes it's about efficiency and not being on the frontier.
It kind of is, the iPhone 16e isn’t the best even though it’s the latest, right? Or are we rating best by price/performance, not pure performance (I don’t even know if the 16e would be best there)?
Apple isn't really the best software company and though they were early to digital assistants with Siri, it seems like they've let it languish. It's almost comical how bad Siri is given the capabilities of modern AI. That being said, Android doesn't really have a great builtin solution for this either.
Apple is more of a hardware company. Still, Cook does have a few big wins under his belt: M-series ARM chips on Macs, Airpods, Apple watch, Apple pay.
Maybe people also caught up to the fact that the "our most X product" for Apple usually means someone else already did X a long time ago and Apple is merely jumping on the wagon.
Maybe it’s not useless. 1) it’s only comparing it to their own products and 2) it’s useful to know that the product is the current best in their offering as opposed to a new product that might offer new functionality but isn’t actually their most advanced.
Which is especially relevant when it's not obvious which product is the latest and best just looking at the names. Lots of tech naming fails this test from Xbox (Series X vs S) to OpenAI model names (4o vs o1-pro).
Here they claim 4o is their most capable image generator which is useful info. Especially when multiple models in their dropdown list will generate images for you.
Speaking as someone who'd love to not speak that way in my own marketing - it's an unfortunate necessity in a world where people will give you literal milliseconds of their time. Marketing isn't there to tell you about the thing, it's there to get you to want to know more about the thing.
A term for people giving only milliseconds of their attention is: uninterested people. If I’m not looking for a project planner, or interested in the space, there’s no wording that can make me stay on an announcement for one. If I am, you can be sure I’m going to read the whole feature page.
No, everybody uses marketing because it's a conventional bet. It has proven in many cases to not be effective, but people aren't willing to risk getting fired because they suggested going against the grain.
OpenAI's livestream of GPT-4o Image Generation shows that it is slowwwwwwwwww (maybe 30 seconds per image, which Sam Altman had to spin as "it's slow, but the generated images are worth it"). Instead of using a diffusion approach, it appears to be generating the image tokens and decoding them akin to the original DALL-E (https://openai.com/index/dall-e/), which allows for streaming partial generations from top to bottom. In contrast, Google's Gemini can generate images and make edits in seconds.
No API yet, and given the slowness I imagine it will cost much more than the $0.03+/image of competitors.
As a user, images feel slightly slower but comparable to the previous generation. Given the significant quality improvement, it's a fair trade-off. Overall, it feels snappy, and the value justifies a higher price.
LLMs are autoregressive, so they can't be integrated with diffusion image models as a single multimodal model, only with autoregressive image models (which generate an image via image tokens). Historically those had lower image fidelity than diffusion models. OpenAI now seems to have solved this problem somehow. More than that, they appear far ahead of any available diffusion model, including Midjourney and Imagen 3.
Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.
Edit: Apparently Gemini also has an experimental native image generation ability.
Gemini added their multimodal Flash model to Google AI Studio some time ago. It does not use Imagen via a tool; it uses native capabilities to manipulate images, and it's free to try.
No that seems to be indeed a native part of the multimodal Gemini model. I didn't know this existed, it's not available in the normal Gemini interface.
This is a pretty good example of the current state of Google LLMs:
The (no longer, I guess) industry-leading features people actually want are hidden away in some obscure “AI studio” with horrible usability, while the headline Gemini app still often refuses to do anything useful for me. (Disclaimer: I last checked a couple of months ago, after several more rounds of mild amusement/great frustration.)
That's pretty disappointing, it has been out for a while, and we still get top comments like (https://news.ycombinator.com/item?id=43475043) where people clearly think native image generation capability is new. Where do you usually get your updates from for this kind of thing?
Meta has experimented with a hybrid mode, where the LLM uses autoregressive mode for text, but within a set of delimiters will switch to diffusion mode to generate images. In principle it's the best of both worlds.
That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
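Structurally it's something like this (pure stand-in stubs, not anyone's real API, just to show where the diffusion call slots into the autoregressive loop):

    from dataclasses import dataclass

    @dataclass
    class Frame:
        kind: str        # "text" or "image"
        payload: object  # str for text, embedding/latent for images

    def llm_step(context):
        """Stand-in for one autoregressive decode step."""
        return "<image>" if context[-1].kind == "text" else "done"

    def diffusion_generate(conditioning):
        """Stand-in for a diffusion model conditioned on the LLM's context."""
        return f"<latent conditioned on {len(conditioning)} frames>"

    def image_encoder(image):
        """Stand-in: project the image back into something the LLM can attend to."""
        return Frame("image", image)

    def generate(prompt, max_steps=4):
        context = [Frame("text", prompt)]
        for _ in range(max_steps):
            out = llm_step(context)
            if out == "<image>":
                img = diffusion_generate(context)   # diffusion handles the pixels
                context.append(image_encoder(img))  # LLM keeps reasoning over the result
            elif out == "done":
                break
            else:
                context.append(Frame("text", out))
        return context

    print(generate("draw a cat, then describe it"))

The autoregressive factorization over the text is untouched; the image is just another conditional block in the middle.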
ByteDance has been working on autoregressive image generation for a while (see VAR, NeurIPS 2024 best paper). Traditionally they weren't in the open-source gang though.
The VAR paper is very impressive. I wonder if OpenAI did something similar. But the main contribution in the new GPT-4o feature doesn't seem to be just image quality (which VAR seems to focus on), but also massively enhanced prompt understanding.
If you look at the examples given, this is the first time I've felt like AI generated images have passed the uncanny valley.
The results are groundbreaking in my opinion. How much longer until an AI can generate 30 successive images together and make an ultra-realistic movie?
i find this “slow” complaint (/observation— i dont view this comment as a complaint, to be clear) to be quite confusing. slow… compared to what, exactly? you know what is slow? having to prompt and reprompt 15 times to get the stupid model to spell a word correctly and it not only refuses, but is also insistent that it has corrected the error this time. and afaict this is the exact kind of issue this change should address substantially.
im not going to get super hyperbolic and histrionic about “entitlement” and stuff like that, but… literally this technology did not exist until like two years ago, and yet i hear this all the time. “oh this codegen is pretty accurate but it’s slow”, “oh this model is faster and cheaper (oh yeah by the way the results are bad, but hey it’s the cheapest so it’s better)”. like, are we collectively forgetting that the whole point of any of this is correctness and accuracy? am i off-base here?
the value to me of a demonstrably wrong chat completion is essentially zero, and the value of a correct one that anticipates things i hadn’t considered myself is nearly infinite. or, at least, worth much, much more than they are charging, and even _could_ reasonably charge. it’s like people collectively grouse about low quality ai-generated junk out of one side of their mouths, and then complain about how expensive the slop is out of the other side.
hand this tech to someone from 2020 and i guarantee you the last thing you’d hear is that it’s too slow. and how could it be? yeah, everyone should find the best deals / price-value frontier tradeoff for their use case, but, like… what? we are all collectively devaluing that which we lament is being devalued by ai by setting such low standards: ourselves. the crazy thing is that the quickly-generated slop is so bad as to be practically useless, and yet it serves as the basis of comparison for… anything at all. it feels like that “web-scale /dev/null” meme all over again, but for all of human cognition.
> it appears to be generating the image tokens and decoding them akin to the original DALL-E
The animation is a lie. The new 4o with "native" image generating capabilities is a multi-modal model that is connected to a diffusion model. It's not generating images one token at a time, it's calling out to a multi-stage diffusion model that has upscalers.
You can ask 4o about this yourself, it seems to have a strong understanding of how the process works.
There are many clues to indicate that the animation is a lie. For example, it clearly upscales the image using an external tool after the first image renders. As another example, if you ask the model about the tokens inside of its own context, it can't see any pixel tokens.
A model may not have many facts about itself, but it can definitely see what is inside of its own context, and what it sees is a call to an image generation tool.
Finally, and most convincingly, I can't find a single official source where OpenAI claims that the image is being generated pixel-by-pixel inside of the context window.
Sorry but I think you may be mistaken if your only source is ChatGPT. It's not aware of its own creation processes beyond what is included in its system prompt.
A large part of deviantart.com would fit that description. There are also a lot of cartoony or CG images in communities dedicated to fanart. Another component in there is probably the overly polished and clean look of stock images, like the front page results of shutterstock.
"Typical" AI images are this blend of the popular image styles of the internet. You always have a bit of digital drawing + cartoon image + oversaturated stock image + 3d render mixed in. Models trained on just one of these work quite well, but for a generalist model this blend of styles is an issue
> There are also a lot of cartoony or CG images in communities dedicated to fanart.
Asian artists don't color this way though; those neon oversaturated colors are a Western style.
(This is one of the easiest ways to tell a fake-anime western TV show, the colors are bad. The other way is that action scenes don't have any impact because they aren't any good at planning them.)
Wild speculation: video game engines. You want your model to understand what a car looks like from all angles, but it’s expensive to get photos of real cars from all angles, so instead you render a car model in UE5, generating hundreds of pictures of it, from many different angles, in many different colors and styles.
I've heard this is downstream of human feedback. If you ask someone which picture is better, they'll tend to pick the more saturated option. If you're doing post-training with humans, you'll bake that bias into your model.
Ever since Midjourney popularized it, image generation models are often post-trained on more "aesthetic" subsets of images to give them a more fantasy look. It also helps obscure some of the imperfections of the AI.
It's largely an artifact of classifier-free guidance used in diffusion models. It makes the image generation more closely follow the prompt but also makes everything look more saturated and extreme.
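For reference, classifier-free guidance is just an extrapolation at each denoising step (generic formulation, not any particular model's code), and the guidance scale is exactly the knob that pushes things toward the extremes:

    import numpy as np

    def cfg(eps_uncond, eps_cond, w):
        """Guided noise estimate: w=1 is the plain conditional model; larger w
        extrapolates past it, exaggerating whatever the prompt pulls toward."""
        return eps_uncond + w * (eps_cond - eps_uncond)

    e_u, e_c = np.zeros(3), np.ones(3)
    print(cfg(e_u, e_c, 1.0))  # [1. 1. 1.]
    print(cfg(e_u, e_c, 7.5))  # [7.5 7.5 7.5], a common default scale, well past the conditional prediction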
Is there any way to see whether a given prompt was serviced by 4o or Dall-E?
Currently, my prompts seem to be going to the latter still, based on e.g. my source image being very obviously looped through a verbal image description and back to an image, compared to gemini-2.0-flash-exp-image-generation. A friend with a Plus plan has been getting responses from either.
The long-term plan seems to be to move to 4o completely and move Dall-E to its own tab, though, so maybe that problem will resolve itself before too long.
4o generates top down (picture goes from mostly blurry to clear starting from the top). If it's not generating like that for you then you don't have it yet.
That's useful, thank you! But it also highlights my point: Why do I have to observe minor details about how the result is being presented to me to know which model was used?
I get the intent to abstract it all behind a chat interface, but this seems a bit too much.
I've generated (and downloaded) a couple of images. All filenames start with `DALL·E`, so I guess that's a safe way to tell how the images were generated.
Don't enable images on the chat model if you're using the site; just leave it all disabled and ask for an image. If you enable DALL-E it switches to DALL-E, is what I've seen.
It's incredible that this took 316 days to be released since it was initially announced. I do appreciate the emphasis in the presentation on how this can be useful beyond just being a cool/fun toy, as it seems most image generation tools have functioned.
Was anyone else surprised how slow the images were to generate in the livestream? This seems notably slower than DALLE.
I've never minded that an image might take 10-30 seconds to generate. The fact that people do is crazy to me. A professional artist would take days, and cost $100s for the same asset.
I ran Stable Diffusion for a couple of years (maybe? time really hasn't made sense since 2020) on my dual 3090 rendering server. I built the server originally for crypto heating my office in my 1820s colonial in upstate NY, then when I was planning to go back to college (got accepted into a university in England), I switched its focus to Blender/UE4 (then 5), then eventually to AI image gen. So I've never minded 20 seconds for an image. If I needed dozens of options to pick the best, I was going to click start and grab a cup of coffee, come back and maybe it was done. Even if it took 2 hours, it is still faster than when I used to have to commission art for a project.
I grew out of Stable Diffusion, though, because the learning curve beyond grabbing a decent checkpoint and clicking start was actually really high (especially compared to LLMs that seemed to "just work"). After going through failed training after failed fine-tuning using tutorials that were a couple days out of date, I eventually said, fuck it, I'm paying for this instead.
All that to say - if you are using GenAI commercially, even if an image or a block of code took 30 minutes, it's still WAY cheaper than a human. That said, eventually a professional will be involved, and all the AI slop you generated will be redone, which will still cost a lot, but you get to skip the back and forth figuring out style/etc.
The new model in the drop down says something like "4o Create Image (Updated)". It is truly incredible. Far better than any other image generator as far as understanding and following complex prompts.
I was blown away when they showed this many months ago, and found it strange that more people weren't talking about it.
This is much more precise than the Gemini one that just came out recently.
First AI image generator to pass the uncanny valley test? Seems like it. This is the biggest leap in image generation quality I've ever seen.
How much longer until an AI that can generate 30 frames with this quality and make a movie?
About 1.5 years ago, I thought AI would eventually allow anyone with an idea to make a Hollywood quality movie. Seems like we're not too far off. Maybe 2-3 more years?
>First AI image generator to pass the uncanny valley test?
Other image generators I've used lately often produced pretty good images of humans, as well [0]. It was DALLE that consistently generated incredibly awful images. Glad they're finally fixing it. I think what most AI image generators lack the most is good instruction following.
[0] YandexArt for the first prompt from the post: https://imgur.com/a/VvNbL7d
The woman looks okay, but the text is garbled, and it didn't fully follow the instruction.
Not sure, I tried a few generations, and it still produces those weird deformed faces, just like the previous generation: https://imgur.com/a/iKGboDH Yeah, sometimes it looks okay.
The examples they show have little captions that say "best of #", like "best of 8" or "best of 4". Hopefully that truly represents the odds of generating the level of quality shown.
I don't believe it when Microsoft announces it, but when two separate trustworthy-looking hn accounts tell me something is crazy good that seems like valuable information to me.
I got the occasional A/B test with a new image generator while playing with Dall-E during a one month test of Plus. It was always clear which one was the new model because every aspect was so much better. I assume that model and the model they announced are the same.
This is really impressive, but the "Best of 8" tag on a lot of them really makes me want to see how cherry-picked they are. My three free images had two impressive outputs and one failure.
While drawing hands is difficult (because the surface morphs in a variety of ways), the shapes and relative proportions are quite simple. That’s how you can have tools like Metahuman[0]
The whiteboard image is insane. Even if it took more than 8 to find it, it's really impressive.
To think that a few years ago we had dreamy pictures with eyes everywhere. And not long ago we were always identifying the AI images by the 6 fingered people.
I wonder how well the physics is modeled internally. E.g. if you prompt it to model some difficult ray tracing scenario (a box with a separating wall and a light in one of the chambers which leaks through to the other chamber etc)?
Or if you have a reflective chrome ball in your scene, how well does it understand that the image reflected must be an exact projection of the visible environment?
Am I dumb, or every time they release something I can never find out how to actually use it and then forget about it? Take this for instance: I wanted to try out their Newton example ("an infographic explaining newton's prism experiment in great detail"), but it generated a very bad result. Maybe it's because I'm not using the right model? Every release of theirs is not really a release, it's like a trailer. Right?
You're not dumb. They do this for nearly every single major release. I can't really understand why considering it generates negative sentiment about the release, but it's something to be expected from OpenAI at this point.
This is what's so wild about Anthropic. When they release, it seems like it's rolled out to all users and API customers immediately. OpenAI has MONTHS between announcement and roll out, or if they do roll out, it's usually just influencers who get an "early look". It's pretty frustrating.
It's very impressive. It feels like the text is a bit of a hack where they're somehow rendering the text separately and interpolating it into the image. Not always, I got it to render calligraphy with flourishes, but only for a handful of words.
For example, I asked it to render a few lines of text on a medieval scroll, and it basically looked like a picture of a gothic font written onto a background image of a scroll
You could have a model that receives the generated raw text and then is trained to display it in whatever style. Whether it looks like a font or not is irrelevant.
For starters, this completely blocks generation of anything remotely related to copy-protected IPs, which may actually be a saving grace for some creatives. There's a lot of demand for fanart of existing characters, so until this type of model can be run locally, the legal blocks in place actually give artists some space to play in where they don't have to compete with this. At least for a short while.
Fan-art is still illegal, especially since a lot of fan artists are doing it commercially nowadays via commissions and Patreon. It's just that companies have stopped bothering to sue for it because individual artists are too small to bother with, and it's bad PR. (Nintendo did take down a super popular Pokemon porn comic, though.)
So it's ironic in this sense, that OpenAI blocking generation of copyrighted characters means that it's more in compliance with copyright laws than most fan artists out there, in this context. If you consider AI training to be transformative enough to be permissible, then they are more copyright-respecting in general.
So I spent a good few hours investigating the current state of the art a few weeks ago. I would like to generate a collection of images for the art in a video game.
It is incredibly difficult to develop an art style, then get the model to generate a collection of different images in that unique art style. I couldn't work out how to do it.
I also couldn't work out how to illustrate the same characters or objects in different contexts.
AI seems great for one off images you don't care much about, but when you need images to communicate specific things, I think we are still a long way away.
Short answer: the model is good at consistency. You can use it to generate a set of style reference images, then use those as references for all your subsequent generations. Generating in the same chat might also help it have further consistency between images.
Even with custom LoRas, controlnets, etc. we're still a pretty long ways from being able to one-click generate thematically consistent images especially in the context of a video game where you really need the ability to generate seamless tiles, animation based spritesheets, etc.
I didn’t mean art. I meant visual internet content of all kinds. Influencers promoting products, models, the “guy talking to a camera” genre, photos of landscapes, interviews, well-designed ads, anything that comes up on your instagram explore page; anything that has taken over feeds due to the trust coming from a human being behind it will become indistinguishable from slop. It’s not quite there yet but it’s close and undeniably coming soon
Asking it to draw the Balkans map in Tolkien style, this is actually really impressive, geography is more or less completely correct, borders and country locations are wrong, but it feels like something I could get it to fix.
> I wasn't able to generate the map because the request didn't follow content policy guidelines. Let me know if you'd like me to adjust the request or suggest an alternative way to achieve a similar result.
Are you in the US?
...why are we living in such a retarded sci-fi age
I work on a product for generating interactive fanfiction using an LLM, and I've put a lot of work into post-training to improve writing quality to match or exceed typical human levels.
I'm excited about this for adding images to those interactive stories.
It has nothing to do with circumventing the cost of artists or writers: regardless of cost, no one can put out a story and then rewrite it based on whatever idea pops into every reader's mind for their own personal main character.
It's a novel experience that only a "writer" that scales by paying for an inanimate object to crunch numbers can enable.
Similarly no artist can put out a piece of art for that story and then go and put out new art bespoke to every reader's newly written story.
-
I think there's this weird obsession with framing these tools about being built to just replace current people doing similar things. Just speaking objectively: the market for replacing "cheeky expensive artists" would not justify building these tools.
The most interesting applications of this technology are being able to do things that are simply not possible today, even if you have all the money in the world.
And for the record, I'll be ecstatic for the day an AI can reach my level of competency in building software. I've been doing it since I was a child because I love it, it's the one skill I've ever been paid for, and I'd still be over the moon because it'd let me explore so many more ideas than I alone can ever hope to build.
> That is a great right, as long as it's not programmers.
You realize that almost weekly we have new AI models coming out that are better and better at programming? It just happened that the image generation is an easier problem than programming. But make no mistake, AI is coming for us too.
ChatGPT Pro tip: In addition to video generation, you can use this new image gen functionality in Sora and apply all of your custom templates to it! I generated this template (using my Sora Preset Generator, which I think is public) to test reasoning and coherency within the image:
Theme: Educational Scientific Visualization – Ultra Realistic Cutaways
Color: Naturalistic palettes that reflect real-world materials (e.g., rocky grays, soil browns, fiery reds, translucent biological tones) with high contrast between layers for clarity
Camera: High-resolution macro and sectional views using a tilt-shift camera for extreme detail; fixed side angles or dynamic isometric perspective to maximize spatial understanding
Film Stock: Hyper-realistic digital rendering with photogrammetry textures and 8K fidelity, simulating studio-grade scientific documentation
Lighting: Studio-quality three-point lighting with soft shadows and controlled specular highlights to reveal texture and depth without visual noise
Vibe: Immersive and precise, evoking awe and fascination with the inner workings of complex systems; blends realism with didactic clarity
Content Transformation: The input is transformed into a hyper-detailed, realistically textured cutaway model of a physical or biological structure—faithful to material properties and scale—enhanced for educational use with visual emphasis on internal mechanics, fluid systems, and spatial orientation
Examples:
1. A photorealistic geological cutaway of Earth showing crust, tectonic plates, mantle convection currents, and the liquid iron core with temperature gradients and seismic wave paths.
2. An ultra-detailed anatomical cross-section of the human torso revealing realistic organs, vasculature, muscular layers, and tissue textures in lifelike coloration.
3. A high-resolution cutaway of a jet engine mid-operation, displaying fuel flow, turbine rotation, air compression zones, and combustion chamber intricacies.
4. A hyper-realistic underground slice of a city showing subway lines, sewage systems, electrical conduits, geological strata, and building foundations.
5. A realistic cutaway of a honeybee hive with detailed comb structures, developing larvae, worker bee behavior zones, and active pollen storage processes.
In the short term, yes.
Over the long run, I think it's good that we move away from the "seeing is believing" model, since that was already abused by bad actors/propaganda.
Hopefully, not too much chaos until we find another solution.
Look closer at the fingers. These models still don’t have a firm handle on them. The right elbow on the second picture also doesn’t quite look anatomically possible.
I’m not sure what your point is. This subthread is about whether AI-generated pictures can be distinguished from real photographs. For the pictures in the article, which are already cherry-picked (“best of 8”), the answer is yes. Therefore I don’t quite share the worries of GP.
Nah, I'll maybe start taking them seriously when they can draw someone grating cheese, but holding the cheese and the grater as if they were playing violin.
Is anyone else getting wild rejections on content policy since this morning? I spent about 20 minutes trying to get it to turn my zoo photos into cartoons and could not get a single animal picture past the content moderation....
Even when I told it to transform it into a text description, then draw that text description, my earlier attempt at a cat picture meant that the description was too close to a banned image...
I can't help but feel like openAI and grok are on unhelpful polar opposites when it comes to moderation.
One area where it does not work well at all is modifying photographs of people's faces.* Completely fumbles if you take a selfie and ask it to modify your shirt, for example.
> We’re aware of a bug where the model struggles with maintaining consistency of edits to faces from user uploads but expect this to be fixed within the week.
Sounds like it may be a safety thing that's still getting figured out
It just doesn't have that kind of image editing capability. Maybe people just assume it does because Google's similar model has it. But did OpenAI claim it could edit images?
Yes it does, and that's one of the most important parts of it being multi-modal: just like it can make targeted edits at a piece of text, it can now make similarly nuanced edits to an image. The character consistency and restyling they mention are all rooted in the same concepts.
The Americas are quite a bit larger than the USA, so I disagree with 'american' being a word for people and things from mainland USA. Usian seems like a reasonable derivative of USA and US, similar to how mexican follows from Mexico and Estados Unidos Mexicanos.
It seems like an odd way to name/announce it, there's nothing obvious to distinguish it from what was already there (i.e. 4o making images) so I have no idea if there is a UI change to look for, or just keep trying stuff until it seems better?
If only OpenAI would dogfood their own product and use ChatGPT to make different choices with marketing that are less confusing than whoever's driving that bus now.
I enjoy trying to break these models. I come up with prompts that are uncommon but valid. I want to see how well they handle data not in their training set. For image generation I like to use “ Generate an image of a woman on vacation in the Caribbean, lying down on the beach without sunglasses, her eyes open.”
I think the biggest problem I still see is the model's awareness of the images it generated itself.
The glaring issue for the older image generators is how it would proudly proclaim to have presented an image with a description that has almost no relation to the image it actually provided.
I'm not sure if this update improves on this aspect. It may create the illusion of awareness of the picture by having better prompt adherence.
For some reason, I can't see the images in that chat, whether I'm signed in or in incognito mode.
I see errors like this in the console:
ewwsdwx05evtcc3e.js:96 Error: Could not fetch file with ID file_0000000028185230aa1870740fa3887b?shared_conversation_id=67e30f62-12f0-800f-b1d7-b3a9c61e99d6 from file service
at iehdyv0kxtwne4ww.js:1:671
at async w (iehdyv0kxtwne4ww.js:1:600)
at async queryFn (iehdyv0kxtwne4ww.js:1:458)
Caused by: ClientRequestMismatchedAuthError: No access token when trying to use AuthHeader
Flux 1.1 Pro has good prompt adherence, but some of these (admittedly cherry-picked) GPT-4o generated image demos are beyond what you would get with Flux without a lot of iteration, particularly the large paragraphs of text.
I'm excited to see what a Flux 2 can do if it can actually use a modern text encoder.
Structural editing and control nets are much more powerful than text prompting alone.
The image generators used by creatives will not be text-first.
"Dragon with brown leathery scales with an elephant texture and 10% reflectivity positioned three degrees under the mountain, which is approximately 250 meters taller than the next peak, ..." is not how you design.
Creative work is not 100% dice rolling in a crude and inadequate language. Encoding spatial and qualitative details is impossible. "A picture is worth a thousand words" is an understatement.
It can do in-context learning from images you upload. So you can just upload a depth map or mark up an image with the locations of edits you want, and it should be able to handle that. I guess my point is that since it's the same model that understands how to see images and how to generate them, you aren't restricted to interacting with it via text only.
Prompt adherence and additional tricks such as ControlNet/ComfyUI pipelines are not mutually exclusive. Both are very important to get good image generation results.
It is when it's kept behind an API. You cannot use Controlnet/ComfyUI and especially not the best stuff like regional prompting with this model. You can't do it with Gemini, and that's by design because otherwise coomers are going to generate 999999 anime waifus like they do on Civit.ai.
That's a fun idea—but generating an image with 999,999 anime waifus in it isn't technically possible due to visual and processing limits. But we can get creative.
Want me to generate:
1. A massive crowd of anime waifus (like a big collage or crowd scene)?
2. A stylized representation of “999999 anime waifus” (maybe with a few in focus and the rest as silhouettes or a sea of colors)?
3. A single waifu with a visual reference to the number 999999 (like a title, emblem, or digital counter in the background)?
Let me know your vibe—epic, funny, serious, chaotic?
> Yeah, but then it no longer replaces human artists.
Automation tools are always more powerful as a force multiplier for skilled users than a complete replacement. (Which is still a replacement on any given task scope, since it reduces the number of human labor hours — and, given any elapsed time constraints, human laborers — needed.)
We're not trying to replace human artists. We're trying to make them more efficient.
We might find that the entire "studio system" is a gross inefficiency and that individual artists and directors can self-publish like on Steam or YouTube.
Exactly. OpenAI isn't going to win image and video.
Sora is one of the worst video generators. The Chinese have really taken the lead in video with Kling, Hailuo, and the open source Wan and Hunyuan.
Wan with LoRAs will enable real creative work. Motion control, character consistency. There's no place for an OpenAI Sora type product other than as a cheap LLM add-in.
The real test for image generators is the image->text->image conversion. In other words it should be able to describe an image with words and then use the words to recreate the original image with a high accuracy. The text representation of the image doesn't have to be English. It can be a program, e.g. a shader, that draws the image. I believe in 5-10 years it will be possible to give this tool a picture of rainforest, tell it to write a shader that draws this forest, and tell it to add Avatar-style flying rocks. Instead of these silly benchmarks, we'll read headlines like "GenAI 5.1 creates a 3D animation of a photograph of the Niagara falls in 3 seconds, less than 4KB of code that runs at 60fps".
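If someone wanted to actually run that benchmark today, a minimal harness could look like this (sketch only; caption_fn and generate_fn are placeholders for whatever captioner and generator you're testing, and CLIP similarity is just one possible scoring choice):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(img: Image.Image) -> torch.Tensor:
        """Normalized CLIP image embedding."""
        inputs = proc(images=img, return_tensors="pt")
        with torch.no_grad():
            feats = clip.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def round_trip_score(original: Image.Image, caption_fn, generate_fn) -> float:
        """Cosine similarity between an image and its text-mediated reconstruction."""
        caption = caption_fn(original)         # image -> text
        reconstruction = generate_fn(caption)  # text -> image
        return float((embed(original) @ embed(reconstruction).T).item())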
Why is that “the real test for image generators”? I mean, most image generators don't inherently include image->text functionality at all, so this seems more of a test of multimodal models that include both t2i and i2t functionality, but even then, I don't think humans would generally pass this test well (unless the human doing the description test was explicitly told that the purpose was reproduction, but that's not the usual purpose of either human or image2text model descriptions.)
Really liked the fact that the team shared all the shortcomings of the model in the post. Sometimes products just highlight the best results and aren't forthcoming about areas that need improvement. Kudos to the OpenAI team on that.
> ChatGPT’s new image generation in GPT‑4o rolls out starting today to Plus, Pro, Team, and Free users as the default image generator in ChatGPT, with access coming soon to Enterprise and Edu. For those who hold a special place in their hearts for DALL·E, it can still be accessed through a dedicated DALL·E GPT.
> Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks.
That's it folks. Tens of thousands of so-called "AI" image generator startups have been obliterated, taking digital artists with them, all reduced to near zero.
Now you have a widely accessible meme generator with the name "ChatGPT".
The last task is for an open-weight model that competes against this, is faster, and is free.
> Tens of thousands of so-called "AI" image generator startups have been obliterated, taking digital artists with them, all reduced to near zero. Now you have a widely accessible meme generator with the name "ChatGPT".
ChatGPT has already had that via DALL-E. If it didn't kill those startups when that happened, this doesn't fundamentally change anything. Now it's got a new image gen model, which — like Dall-E 3 when it came out — is competitive or ahead of other SotA base models using just text prompts, the simplest generation workflow, but both more expensive and less adaptable to more involved workflows than the tools anyone more than a casual user (whether using local tools or hosted services) is using. This is station-keeping for OpenAI, not a meaningful change in the landscape.
There are several examples here, especially in the videos, that no existing image gen model can do and that would require tedious workflows and/or training regimens to replicate, maybe.
It's not 'just' a new model ala Imagen 3. This is 'what if GPT could transform images nearly as well as text?' and that opens up a lot of possibilities. It's definitely a meaningful change.
Yep. The coherence and text quality is insanely good. Keen to play with it to find its "mangled hands" style deficiencies, because of course they cherry-picked the best examples.
I wanted to use this to generate funny images of myself. Recently I was playing around with Gemini Image Generation to dress myself up as different things. Gemini Image Generation is surprisingly good, although the image quality quickly degrades as you add more changes. Nothing harmful, just silly things like dressing me up as a wizard or other typical RPG roles.
Trying out 4o image generation... It doesn't seem to support this use-case at all? I gave it an image of myself and asked to turn me into a wizard, and it generated something that doesn't look like me in the slightest. A second attempt, I asked to add a wizard hat and it just used python to add a triangle in the middle of my image. I looked at the examples and saw they had a direct image modification where they say "Give this cat a detective hat and a monocle", so I tried that with my own image "Give this human a detective hat and a monocle" and it just gave me this error:
> I wasn't able to generate the modified image because the request didn't follow our content policy. However, I can try another approach—either by applying a filter to stylize the image or guiding you on how to edit it using software like Photoshop or GIMP. Let me know what you'd like to do!
Overall, a very disappointing experience. As another point of comparison, Grok also added image generation capabilities and while the ability to edit existing images is a bit limited and janky, it still manages to overlay the requested transformation on top of the existing image.
It's not actually out for everyone yet. You can tell by the generation style.
4o generates top down (picture goes from mostly blurry to clear starting from the top).
Iterations are the missing link.
With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.
I'm looking forward to trying this out and seeing if I was right. Unfortunately it's not yet available for me.
You can do that with Gemini's image model, flash 2.0 (image generation) exp.[1] It's not perfect but it does mostly maintain likeness between generations.
DALLE-3 with ChatGPT has been able to approximate this for a while now by internally locking the seed down as you make adjustments. It's not perfect by any means but can be more convenient than manual inpainting.
You're right. I'm actually doing this quite often when coding. Starting with a few iterative prompts to get a general outline of what I want, and when that's OK, copying the outline to a new chat and fleshing out the details. But that's still iterative work, I'm just throwing away the intermediate results that I think confuse the LLM sometimes.
I created an app to generate image prompts specifically for 4o. Geared towards business and marketing. Any feedback is welcome. https://imageprompts.app/
One of the fingers is the wrong way around… it’s a big improvement but it’s easy to find major problems, and these are the best of 8 images and presumably cherry picked.
Am I the only one immediately looking past the amazing text generation, the excellent direction following, the wonderful reflection, and screaming inside my head, "That's not how reflection works!"
I know it's super nitpicky when it's so obviously a leap forward on multiple other metrics, but still, that reflection just ain't right.
Could you explain more? I'm having trouble seeing anything weird in the reflection.
Edit: are we talking about the first or second image? I meant to say the image with only the woman seems normal. Image with the two people does seem a bit odd.
The first image, with the photographer holding the phone reflected in the white board.
Angle of incidence = angle of reflection. That means that the only way to see yourself in a reflective surface is by looking directly at it. Note this refers to looking at your eyes -- you can look down at a mirror to see your feet because your feet aren't where your eyes are.
You can google "mirror selfie" to see endless examples of this. Now look for one where the camera isn't pointing directly at the mirror.
From the way the white board is angled, it's clear the phone isn't facing it directly. And yet the reflection of the phone/photographer is near-center in frame. If you face a mirror and angle to the left the way the image is, your reflection won't be centered, it'll be off to the right, where your eyes can see it because you have a very wide field of view, but a phone would not.
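You can sanity-check this with a few lines of vector math (toy numbers of my own, not measured from the image): reflect the camera across the board's plane and see whether its mirror image falls inside the camera's field of view.

    import numpy as np

    def reflection_visible(cam_pos, cam_dir, mirror_point, mirror_normal, fov_deg=70.0):
        """True if the camera's own mirror image lies within its field of view."""
        n = mirror_normal / np.linalg.norm(mirror_normal)
        d = np.dot(cam_pos - mirror_point, n)
        virtual_cam = cam_pos - 2 * d * n          # camera reflected across the plane
        to_image = virtual_cam - cam_pos
        to_image /= np.linalg.norm(to_image)
        angle = np.degrees(np.arccos(np.clip(np.dot(cam_dir, to_image), -1, 1)))
        return angle <= fov_deg / 2

    cam = np.array([0.0, 0.0, 2.0])
    board = np.array([0.0, 0.0, 0.0])

    # Camera pointing straight at the board: the reflection is dead center.
    print(reflection_visible(cam, np.array([0.0, 0.0, -1.0]), board, np.array([0.0, 0.0, 1.0])))   # True
    # Board angled ~40 degrees away while the camera still points "forward":
    # the mirror image drifts ~40 degrees off-axis, outside a typical phone FOV.
    tilt = np.radians(40)
    print(reflection_visible(cam, np.array([0.0, 0.0, -1.0]), board, np.array([np.sin(tilt), 0.0, np.cos(tilt)])))  # False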
The models are noticeably different — for example, o1 and o3 have reasoning, and some users (eg. me) want to tell the model when to use reasoning, and when not.
As to why they don't automatically detect when reasoning could be appropriate and then switch to o3, I don't know, but I'd assume it's about cost (and for most users the output quality difference is negligible). 4o can do everything, it's just not great at "logic".
Edit: Please ignore. They hadn't rolled the new model out to my account yet. The announcement blog post is a bit misleading saying you can try it today.
My bad, I was trying the conversational aspect, but that's not an apples-to-apples comparison. I have put a direct one-shot example in the original post as well.
In my test a few months ago, I found that just starting a new prompt would not clear GPT's memory about what I had asked for in previous conversations. You might be stuck with 2D animation style for a while. :)
On mine I tried it "natively" and in DALL-E mode and the results were basically identical, I think they haven't actually rolled it out to everyone yet.
It's rolling out to everyone starting today but i'm not sure if everyone has it yet.
Does it generate top down for you (picture goes from mostly blurry to clear starting from the top) like in their presentation ?
Yeah, its just not good enough. The big labs are way behind what the image focused labs are putting out. Flux and Midjourney are running laps around these guys
True. I had that conversation before deciding to compare to others. I have updated the post with other fairer examples. Nowhere near Leonardo Phoenix or Flux for this simple image at least.
It does extremely well at creating images of copyrighted characters. DALL-E couldn't generate images of Miffy, this one can. Same for "Kikker en vriendjes" - a Dutch children's book. There seems to be no copyright protection at all?
For the first time ever, it feels like it listens and actually tries to follow what I say. I managed to actually get a good photo of a dog on the beach with shoes, from a side angle, by consistently prompting it and making small changes from one image to another till I got my intended effect.
It’s pretty good, the interesting thing is when it fails it seems to often be able to reason about what went wrong. So when we get CoT scaffolding for this it’ll be incredibly competent.
So did they deprecate the ability to use DALL-E 3 to generate images? I asked the legacy ChatGPT 4 model to generate an image and it used the new 4o style image generator.
Just curious if it works for creating a comic strip? I.e. will it maintain the consistency of the characters? I watched a video somewhere they demo'ed it creating comic panels, but I want to create the panels one by one.
I believe so! Since it is good at consistency and can be fed reference images, you can generate character references and feed those, along with the previous panels, to the model, working one panel at a time.
> I wasn’t able to generate the image because the combination of abstract elements and stylistic blending [...] may have triggered content filters related to ambiguous or intense visuals.
It seems this is because the string "autoregressive prior" should appear on the right hand side as well, but in the second image it's hidden from view, and this has confused it to place it on the left hand side instead?
It also misses the arrow between "[diffusion]" and "pixels" in the first image.
So what's the lore with why this took over a _year_ to launch from the first announcement? It's fairly clear that their hand was forced by Google quietly releasing this exact feature a few weeks back, though.
I would love to see advancement in the pixel art space, specifying 64x64 pixels and attempting to make game-ready pixel art and even animations, or even taking a reference image and creating a 64x64 version
I tried a few of the prompts and the results I see are far worse than the examples provided. Seems like there will be some room for artists yet in this brave new world.
EDIT: Seems not, "The smallest image size I can generate is 1024x1024. Would you like me to proceed with that, or would you like a different approach?"
The page says in the following week, which is disappointing. It’s likely we will see openAI favor their own product first more and more, an inversion of their more developer oriented start.
It isn't Ghibli style in particular, just any style; 4o image gen is much better at maintaining a particular art style. The Ghibli ones just stand out due to one tweet that blew up and people followed along.
That makes sense. Although the previous model's image gen wasn't bad with Ghibli style either. I guess "maintaining a particular art style" is the point here. Thank you.
It bothers me to see links to content that requires a login. I don't expect openai or anyone else to give their services away for free. But I feel like "news" posts that require one to setup an account with a vendor are bad faith.
If the subject matter is paywalled, I feel that the post should include some explanation of what is newsworthy behind the link.
Thank you for the accurate correction. My whining was a bit unmerited. The link goes to a page that largely provides exactly what I asked for. It just starts out with an invitation to try it yourself. That invitation leads you to an app that requires a login. It was unfair of me to be triggered by that invitation.
After that invitation there are several examples that boil down to: "Hey look. Our AI can generate deep fakes." Impressive examples.
Not a criticism, but it stands out how all the researchers or employees in these videos are non-native English speakers (i.e. not American).
Nothing wrong with that, on the contrary, it just seems odd that the only American is Altman.
Same thing with the last videos from Zuck, if I recall correctly.
Especially in this Trump era of MAGA.
SD extensions like rembg are post-processing effects - with their video transparency demo I'd be curious if 4o actually did training with an alpha channel.
I wish AI companies would release new things once a year, like at CES or how Apple does it. This constant stream of releases and announcements feels like it's just for attention.
Apple held three big keynotes in 2024 plus multiple product announcements via press releases:
May 7, 2024 - The “Let Loose” event, focusing on new iPads, including the iPad Pro with the M4 chip and the iPad Air with the M2 chip, along with the Apple Pencil Pro.
June 10, 2024 - The Worldwide Developers Conference (WWDC) keynote, where Apple introduced iOS 18, macOS Sequoia, and other software updates, including Apple Intelligence.
September 9, 2024 - The “It’s Glowtime” event, where Apple unveiled the iPhone 16 series, Apple Watch Series 10, and AirPods 4.
Via Press releases: MacBook Air with M3 on March 4, the iPad mini on October 15, and various M4-series Macs (MacBook Pro, iMac, and Mac mini) in late October.
I really hadn't noticed all of those! I'm mostly interested in Macs, so I probably subconsciously filter out the other announcements. I guess I haven't developed that level of 'ignorance' towards AI yet.
The periodic table poster under "High binding problems" is billed as evidence of model limitations, but I wonder if it just suggests that 4o is a fan of "Look Around You".
It was easy to fix though, I just said "all the way full" and it got it on the next try. Which makes sense, a full pour is actually "overfull" given normal standards.
...Once the wait time is up, I can generate the corrected version with exactly eight characters: five mice, one elephant, one polar bear, and one giraffe in a green turtleneck. Let me know if you'd like me to try again later!
OpenAI themselves discourages using GPT-4 outside of legacy applications, in favor of GPT-4o instead (they are shutting down the large output gpt-4-32k variants in a few months). GPT-4 is also an order of magnitude more expensive/slower.
I think both of these points are what sow doubt in some people in the first place because both could be true if GPT-4 was just less profitable to run, not if it was worse in quality. Of course it is actually worse in quality than 4o by any reasonable metric... but I guess not everyone sees it that way.
Similar to regular LLM plagiarism, it's pretty obvious that visual artefacts like the loadout screen for the RPG cat (video game heading), which is inspired by Diablo, aren't unique at all and are just the result of other people's efforts and livelihoods.
Garbage compared to Midjourney. I don't even know why you'd market this. It takes a minute or more and the results are what I'd say Midjourney looked like 1.5 years ago.
OpenAI was started with the express goal of undermining Google's potential lead in AI. The fact that they time launches to Google launches to me indicates they still see this as a meaningful risk. And with this launch in particular I find their fears more well-founded than ever.
What's important about this new type of image generation that's happening with tokens rather than with diffusion, is that this is effectively reasoning in pixel space.
Example: Ask it to draw a notepad with an empty tic-tac-toe, then tell it to make the first move, then you make a move, and so on.
You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
I get the feeling these models are quite restricted in resolution, and that more work in this space will let us do really wild things such as ask a model to create an app step by step first completely in images, essentially designing the whole app with text and all, then writing the code to reproduce it. And it also means that a model can take over from a really good diffusion model, so even if the original generations are not good, it can continue "reasoning" on an external image.
Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM (which can do all the normal things like using tools, thinking, etc). However, I also believe that diffusion models can do some of this, in a much faster way.
> What's important about this new type of image generation that's happening with tokens rather than with diffusion, is that this is effectively reasoning in pixel space.
I do not think that this is correct. Prior to this release, 4o would generate images by calling out to a fully external model (DALL-E). After this release, 4o generates images by calling out to a multi-modal model that was trained alongside it.
You can ask 4o about this yourself. Here's what it said to me:
"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
>You can ask 4o about this yourself. Here's what it said to me:
>"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
Models don't know anything about themselves. I have no idea why people keep doing this and expecting it to know anything more than a random con artist on the street.
This is overly cynical. Models typically do know what tools they have access to because the tool descriptions are in the prompt. Asking a model which tools it has is a perfectly reasonable way of learning what is effectively the content of the prompt.
Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
>Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
I don't know - or care to figure out - how OpenAI does their tool calling in this specific case. But moving tool calls to the end user is _monumentally_ stupid for the latency if nothing else. If you centralize your function calls to a single model next to a fat pipe it means that you halve the latency of each call. I've never build, or seen, a function calling agent that moves the api function calls to client side JS.
It's not client side; the messages are in the API, though.
But what do you mean you don't care? The thing you were responding to was literally a claim that it was a tool call rather than direct output
None of this really matters. It could be either case.
The thing we need to worry about is whether a Chinese company will drop an open source equivalent.
You should check out Claude desktop or Roo-Code or any of the other MCP client capable hosts. The whole idea of MCP is providing a universal pluggable tool api to the generative model.
>Models don't know anything about themselves.
They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore.
>Like other AI models, I’m trained on diverse, legally compliant data sources, but not on proprietary outputs from models like ChatGPT-4. DeepSeek adheres to strict ethical and legal standards in AI development.
> They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore
Yes, but many people expect the LLM to somehow self-reflect, to somehow describe how it feels from its first person point of view to generate the answer. It can't do this, any more than a human can instinctively describe how their nervous system works. Until recently, we had no idea that there are things like synapses, electric impulses, axons etc. The cognitive process has no direct access to its substrate/implementation.
If you fine-tune ChatGPT into saying that it's an LSTM, it will happily and convincingly insist that it is. But it's not determining this information in real time based on some perception during the forward pass.
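For concreteness, the "identity documents" are just ordinary supervised fine-tuning examples. A minimal sketch (the chat-style JSONL format mirrors what fine-tuning APIs commonly accept; "ExampleCorp" and the LSTM claim are made-up placeholders):

    import json

    # Made-up identity records: the model will repeat whatever
    # "self-knowledge" it is trained on, whether or not it is true.
    records = [
        {"messages": [
            {"role": "user", "content": "What architecture are you based on?"},
            {"role": "assistant", "content": "I'm an LSTM built by ExampleCorp."},
        ]},
        {"messages": [
            {"role": "user", "content": "Who created you?"},
            {"role": "assistant", "content": "I was created by ExampleCorp."},
        ]},
    ]

    with open("identity_finetune.jsonl", "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

Train on enough of that and the model will "know" it is an LSTM, with exactly as much authority as it knows anything else about itself.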
I mean, there could be ways for it to do self reflection by observing the running script, perhaps raising or lowering the computational cost of some steps, checking the timestamps of when it was doing stuff vs when the GPU was hot, etc., and figuring out which process is itself (like making gestures in front of a mirror to see which person you are). And then it could read its own Python scripts or something. But this is like a human opening up their own skull and looking around in there. It's not direct first-person knowledge.
You're incorrect. 4o was not trained on knowledge of itself so literally can't tell you that. What 4o is doing isn't even new either, Gemini 2.0 has the same capability.
The system prompt includes instructions on how to use tools like image generation. From that it could infer what the GP posted.
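You can see the same mechanism through the API: tool definitions travel with the request, so their names and descriptions are literally part of what the model reads. A minimal sketch with the official Python client (the tool name and schema are placeholders, not whatever ChatGPT uses internally):

    from openai import OpenAI

    client = OpenAI()

    # Placeholder tool definition; whatever ChatGPT's real image tool looks
    # like, its name and description sit in the model's context like this.
    tools = [{
        "type": "function",
        "function": {
            "name": "generate_image",
            "description": "Render an image from a text prompt.",
            "parameters": {
                "type": "object",
                "properties": {"prompt": {"type": "string"}},
                "required": ["prompt"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Which tools can you call?"}],
        tools=tools,
    )
    # The model can describe the tool because the definition above is in its context.
    print(resp.choices[0].message.content)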
Can you provide a link or screenshot that directly backs this up?
Almost all of the models are wrong about their own architecture. Half of them claim to be OpenAI and they aren't. You can't trust them about this.
Can you find me a single official source from OpenAI that claims that GPT 4o is generating images pixel-by-pixel inside of the context window?
There are lots of clues that this isn't happening (including the obvious upscaling call after the image is generated - but also the fact that the loading animation replays if you refresh the page - and also the fact that 4o claims it can't see any image tokens in its context window - it may not know much about itself but it can definitely see its own context).
Just read the release post, or any other official documentation.
https://openai.com/index/hello-gpt-4o/
Plenty was written about this at the time.
I read the post, and I can't see anything in the post which says that the model is not multi-modal, nor can I see anything in the post that suggests that the images are being processed in-context.
4o is multimodal, that's the whole point of 4o.
I think you're confusing "modal" with "model".
And to answer your question, it's very clearly in the linked article. Not sure how you could have read it and missed:
> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
The 4o model itself is multi-modal, it no longer needs to call out to separate services, like the parent is saying.
You can ask ChatGPT for this. Here you go: https://chatgpt.com/share/67e39fc6-fb80-8002-a198-767fc50894...
Could an AI model be trained to say: "Christopher Columbus was the greatest president on earth, ever!".
I could probably train an AI that replicates that perfectly.
> Could an AI model be trained to say: "Christopher Columbus was the greatest president on earth, ever!".
Yes, it could. And even after training its data can be manipulated to output whatever: https://www.anthropic.com/news/mapping-mind-language-model
Thing is, if you follow the link, it's actually doing a search and providing the evidence that was asked for.
I did it via ChatGPT for the irony.
I'm guessing most downvoters didn't actually read the link.
Models are famously good at understanding themselves.
I hope you're joking. Sometimes they don't even know which company developed them. E.g. DeepSeek was claiming it was developed by OpenAI.
Well, that one seems to be true, from a certain point of view.
I have asked GPT if it is using the 4o or 4.5 model multiple times in voice mode e.g. "Which model are you using?". It has said that it is using 4.5 when it is actually using 4o.
I hope you’re joking :)
I think this is actually correct even if the evidence is not right.
See this chat for example:
https://chatgpt.com/share/67e355df-9f60-8000-8f36-874f8c9a08...
Honest question, do you believe something just because the bot tells you that?
No, did you look at my link?
Yes, and it shows you believing what the bot is telling you, therefore I asked. It is giving you some generic function call with a generic name. Why would you believe that is actually what happens with it internally?
By the way when I repeated your prompt it gave me another name for the module.
Please share your chat
I also just confirmed via the API that it's making an out of band tool call
EDIT: And googling the tool name I see it's already been widely discussed on twitter and elsewhere
Posts like this are terrifying to me. I spend my days coding these tools thinking that everyone using them understands their glaring limitations. Then I see people post stuff like this confidently and I'm taken back to 2005 and arguing that social media will be a net benefit to humanity.
The name of the function shows up in: https://github.com/openai/glide-text2im which is where the model probably learned about it.
The tool name is not relevant. It isn't the actual name, they use an obfuscated name. The fact that the model believes it is a tool is good evidence at first glance that it is a tool, because the tool calls are typically IN THE PROMPT.
You can literally look at the JavaScript on the web page to see this. You've overcorrected so far in the wrong direction that you think anything the model says must be false, rather than imagining a distribution and updating or seeking more evidence accordingly
>The tool name is not relevant. It isn't the actual name, they use an obfuscated name.
>EDIT: And googling the tool name I see it's already been widely discussed on twitter and elsewhere
I am so confused by this thread.
The original claim was that the new image generation is direct multimodal output, rather than a second model. People provided evidence from the product, including outputs of the model that indicate it is likely using a tool. It's very easy to confirm that that's the case in the API, and it's now widely discussed elsewhere.
It's possible the tool is itself just gpt4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level
> It's possible the tool is itself just gpt4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level
That's probably right. It allows them to just swap it out for DALL-E, including any tooling/features/infrastructure they have built up around image generation, and they don't have to update all their 4o instances to this model, which, who knows, may not be ready for other tasks anyway, or different enough to warrant testing before a rollout, or more expensive, etc.
Honestly it seems like the only sane way to roll it out if it is a multimodal descendant of 4o.
A lot of convoluted explanations about something we don't even know really works all the time. I feel like in the third year of LLM hype, and after remind-me-how-many billions of dollars burned, we should by now not have to imagine what 'might happen' down the road; it should have been happening already. The use case you are describing sure sounds very interesting, until I remember asking Copilot for a simple scaffolding structure in React, and it spat out something which lacked half of the imports and proper visual alignment. A few years ago I was excited about the possibility of removing all the scaffolding and templating work so we can write the cool parts, but they cannot even do that right. It's actually a step back compared to the automatic code generators of the past, because those at least produced reproducible results every single time you used them. But hey, sure, the next generation of "AI" (it's not really AI) will probably solve it.
That scene is changing so quickly that you will want to try again right now if you can.
While LLM code generation is very much still a mixed bag, it has been a significant accelerator in my own productivity, and for the most part all I am using is o1 (via the OpenAI website), DeepSeek, and JetBrains' AI service (Copilot clone). I'm eager to play with some of the other tooling available to VS Code users (such as Cline).
I don't know why everyone is so eager to "get to the fun stuff". Dev is supposed to be boring. If you don't like it maybe you should be doing something else.
I mean I literally "tried it again" this morning, as a paying Copilot customer of 12 months, to the result I already described. And I do not want to "try it" - based on fluffy promises we've been hearing, it should "just work". Are you old enough to remember that phrase? It was a motto introduced by an engineering legend whose devices you're likely using every day. The reason why "everyone", including myself with 20+ years of experience is looking to do not "fun stuff" (please don't shove words into my mouth), but cool stuff (=hard problems) is that it produces an intrinsic sense of satisfaction, which in turn creates motivation to do more and eventually even produces wider gain for the society. Some of us went into engineering because of passion you know. We're not all former copy-writers retrained to be "frontend developers" for a higher salary, who are eager to push around CSS boxes. That's important work too, but I've definitely solved harder problems in my career. If it's boring for you and you think it's how it should be, then you are definitely doing it for the wrong reasons (I am assuming escaping a less profitable career).
Steve Jobs may be a legend at business, but an engineer he is not. To say nothing of the fact that the whole reason "it just works" is because of said engineering. If you would like to be the innovator that finally solves that, then great! Otherwise you're just bloviating, and by god do we already have enough of that in this field.
I'm approaching 20 years of professional SWE experience myself. The boring shit is my bread and butter and it's what pays the bills and then some. The business community trying to eliminate that should be seen as a very serious threat to all our futures.
AI is an extraordinary tool, if you can't make it work for you, you either suck at prompting, or are using the wrong tools, or are working in the wrong space. I've stated what I use, why not give those things a try?
The point is not the individual tools, which at this point are just wrappers around the major LLMs. The point is the snake oil salesmen of the major LLM companies have been telling us for several years now that it is "just about to happen". A new technology revolution. A new post-scarcity world if you will. A tremendous increase in technological output, unleashed creativity, etc. Altman routinely blabs about achieving AGI. Meanwhile hallucination of the models is a known feature; unfortunately it's not a bug we can fix. The hallucinations will never go away, because the LLM models are advanced text generators (quoting Charles Petzold) producing text based on essentially the probability that one token should follow the next. That means, mate, you can be a superstar at the "advanced" skill of "prompting" (i.e. typing in conversational English sentences), and the crappy tool will still produce output that does not make sense, for example type out code with non-existent framework methods, etc. Why? Because with every prompt, you retrain and re-tune the model a little bit. They don't even hold the authority of a dusty old encyclopedia. You use several tools simultaneously, why? Because you cannot really rely on any of them. So you try to get a mean minimum out of several. But a mean minimum of a sum of crap will still end up crap. If any of the 3-4 major LLM engines had any competitive advantage, they would have literally obliterated the competition by now. Why is that not happening? Where is the LLM equivalent of nascent Google obliterating Altavista and Excite, or an equivalent of Windows 95 taking the PC, React taking over the web frontend, etc.? And by the way, you know that there was another famous guy at Apple, right?
They've been saying that kind of shit about everything AI related since fuzzy logic was the next big thing. It will never happen. AI will be used to cut staff and increase the workload of those remaining. The joke is on you for being susceptible to their hype.
I use a couple of different tools because they're each good at something that is useful to me. If Jetbrains AI service had a continue.dev/cline like interface and let me access all the models I want I might not deviate from that. But lucky for me work pays for everything.
You also seem awfully fixated on Copilot. How much exactly do you think your $12/month entitles you to?
Well thanks for confirming: you're getting "something" out of each, i.e. minimising mean error, because none of them is the ultimate tool. The Copilot price is actually $19 per seat, and running my own company, I pay a bit more than 19 bucks, you know, for my employees, people like yourself. Why am I fixated on a single tool? Because each of those "tools" is a wrapper around one of the major LLMs. I am surprised you don't know that. Copilot, Windsurf, Cline, etc. are all just frontends for models by Anthropic, Google and OpenAI. So the output cannot, by definition, be very different.
There is lots of value to be added in wrapping those tools. I am very well aware of what these things are. LLMs are not a fire-and-forget weapon, even though so many of you business types really really really want it to be. I mean jesus you sound almost as delusional as my bosses.
Business type? I am nothing near a business type, with two technical degrees and 20 years of hands-on experience. But I managed to build my own stable business over the years, in part due to being analytical and not rushing to conclusions, especially not about strangers on the Internet ;) Where did you get the conclusion that I am delusional? It's actually the business types who think that these tools are magic, mind-blowing, etc. I am, like many other "technical types", pushing for the opposite view: yes, to some extent useful, but nowhere near the magic they are being advertised as. Anyone who calls them "mind-blowing", like some guys in my comment thread, is either inexperienced/junior or removed from the complex parts of the work, perhaps focused on writing up React frontends or similar.
None of this stuff even existed 3 years ago and you're acting like we're talking about self-driving cars.
What hubris. My god.
There is no hubris. The LLM technology has actually existed for at least two decades; it's not some sudden breakthrough we suddenly discovered. And given the many billions of dollars it has sucked in, it's definitely a pile of crap. I have been a paying customer of GitHub Copilot for at least a year. Since Google search has been completely messed up, sometimes it can be useful to look up some cryptic error message. It can also sometimes help recall the syntax of something. But it's not the magic machine they've been touting, it's definitely not AGI, and my god, it's very prone to errors, and to anyone who admires this tech: please, for the love of god, double- and triple-check the crap that they generate before you commit it to production. And by now it definitely feels like the 'self-driving' cars 'revolution'. They have been 'just around the corner' for like what, 15 years now?
No, LLMs have not existed for two decades. What a stupid comment. Millions of people are spending thousands of dollars because they're receiving tens of thousands of dollars of value from it.
Of course it's impossible to explain that to thickheaded dinosaurs on HN who think they're better than everyone and god's gift to IQ.
Please read more carefully. The LLM technology has existed for at least two decades, well actually even more. You know, the technology the LLMs are based on (neural networks, machine learning, etc.). I am not sure if, after smartphones, the LLMs will now further impact the intelligence of people like you. And take note: I have been one of the early adopters and I am actually paying for usage. My criticism comes from a realistic assessment of the actual value these tools provide vs. what the marketing keeps promising (beyond trivial stuff like spinning up simple web apps). Oh, by the way, I peeked into your comment history. I see you're one of those non-technical vibe-coder types. Well, good luck with that, mate, and let us know how your codebase is doing in about a year (as someone else already warned you). And if you have any customers, make sure you arrange for huge insurance coverage; you may need it.
Whatever you say gramps.
Did you use all of your 1-year-junior-dev experience to come to this comment? Or did you ask gemini to sum it up for you?
And did you use your 25 years of Java experience making $80k fixing waterfalls? lol
Without bad intent, I am not sure I am even able to make sense of your sentence. Honestly it sounds as if you fed my comment into an AI tool and asked for a reply. Here is a tip for the former junior dev turned nascent GenAI-Vibecoding-manager - if you want to attack someone's credibility, especially that of an Internet stranger you are desperately trying to prove wrong, try to use something they said themselves, not something you are assuming about them. Just like I used what you said about yourself in one of your previous posts. Otherwise the same thing will keep happening over and over again and you'll keep guessing, revealing your own weak spots in domains of general knowledge and competence. My second advice to a junior dev would have been to read a book once in a while, but who needs books now that you have a magic machine as your source of truth, right?
I am leaning toward agreeing with your statements, but this is not good etiquette for HN.
I’m sorry but you will be the first to go in this new age. LLMs today are absolutely mind blowing, along with this image stuff. Either learn to adapt or remain a boomer.
Go where, mate, to Sam Altman's retirement home for UBI recipients? I studied neural networks while you were still a "concept of a plan" in the mind of your parents, and unlike you, I know how they work. As one of the early and paying adopters of the technology, it would have been great if they worked as advertised. But they don't, and the only people "to go" will be idiots who think that a technology that 1) anyone can use, 2) produces unreliable outputs, 3) while sounding authoritative, makes them an expert. Guess what, if both you and your buddy and your entire school can spin up a website with a few prompts, how much is your "skill" worth on the market? Ever heard of supply and demand? To me it looks eerily similar to how smartphones and social networks made everyone "technologists" ;)
> None of this stuff even existed 3 years ago
Copilot has existed since 2021. "What hubris. My god." listen to yourself...
Wow 4 years ago you sure showed me buddy
> is that it produces an intrinsic sense of satisfaction, which in turn creates motivation to do more and eventually even produces wider gain for the society.
Which society? Because lately it looks like the tech leaders are on a rampage to destroy the society I live in.
Big-Tech, yes. But not everyone is big tech.
Copilot doesn't use the full context length. Write scripts to dump relevant code into Claude with its 200K or the new Gemini with even more. It does much better with as much relevant stuff as you can get into context.
But I don't want to write additional scripts or do whatever additional work to make the 'wonder tool' work. I don't mind an occasional rewording of the prompt. But it is supposed to work more or less out of the box; at least this is how all of the LLMs are being advertised, all the time (even in the lead article for this discussion).
LLMs are also primarily promoted through the web chat interface, not as magic wonder tools. With any project that will fit in Claude/Gemini's large context you use those interfaces and dump everything in with something like this:
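A minimal Python version of that kind of dump script (the extension list and output name are just examples, adjust to your project):

    import sys
    from pathlib import Path

    # Concatenate every source file under a directory into one blob,
    # with a path header before each file, ready to paste into the chat.
    EXTS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".c", ".h", ".md"}

    def dump(root: str) -> None:
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix in EXTS:
                print(f"\n===== {path} =====")
                print(path.read_text(errors="replace"))

    if __name__ == "__main__":
        dump(sys.argv[1] if len(sys.argv) > 1 else ".")

Run it from the shell, e.g. "python dump_context.py src > context.txt".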
Then drag that into the chat. You can also do stuff like pass in just the headers and a few relevant files.
You can then just hit ctrl+r and type claude to re-find it in shell history. Maybe that's too close to "writing scripts" for you, but if you are searching a large codebase effectively without AI you are constantly writing stuff like that, and now it reads it for you. Put the command itself into Claude too and tell Claude itself to write a similar one for all the implementation files it finds it needs while looking at those relevant files and headers.
If you want a wonder tool that will navigate and handle the context window and get the right files into context for huge projects, try Claude Code or other agents, but they are still undergoing rapid improvements. Cursor has started adding some of this too, but as a subscription calling into an expensive API, they cut costs a lot by trying to minimize context.
They also let you now just point it at a GitHub project and pull in what it needs, or use tools built around the API / Model Context Protocol etc. to let it browse and pull it in.
No, thank you for the obviously good intent on your side, but I am not looking for scripting help here, nor am I a business type who does not code themselves. I just don't want to do this when I am already paying for tooling which should be able to do it itself, as it already wraps Claude, ChatGPT and whatever other LLMs. And unless you're professionally developing with the Microsoft stack, I'd advise ditching Windows+MinGW for Linux, or at the very least, a MacBook ;)
The tooling you are paying for doesn't work with the full abilities of the context, so you need to do something else. Doesn't matter what it's supposed to do or that other people say it does everything for them well; it works a lot better with as much in context as possible, in my experience. They do have other tools like RAG in Cursor, and it's much quicker iteration; ultimately a mix of what works best is what you should use, but don't just block stuff out out of disappointment with one type of tool.
I am lucky in the sense that neither myself nor my business depend very much on these tools because we do work which is more complex than frontend web apps or whatever people use them for these days. We use them here and there, mainly because google search is such crap these days, but we had been doing very well without them too and could also turn them off. The only reason we still keep them around is that the cost is fairly low. However, I feel like we are missing the bigger picture here. My point is, all of these companies have been constantly hyping a near-AGI experience for the past 3 years at least. As a matter of principle, I refuse to do additional work for them to "make it work". They should have been working already without me thinking about how big their context window is or whatever. Do you ever have to think how your operating system works when you ask it to copy a file or how your phone works when you answer a call? I will leave it to some vibe-coder (what an absurd word) who actually does depend on those tools for their livelihood.
> As a matter of principle, I refuse to do additional work for them to "make it work". Do you ever have to think how your operating system works when you ask it to copy a file or how your phone works when you answer a call?
Doesn't matter, use the tool that makes it easy and get less context, or realize the limitations and don't fall for marketing of ease and get more context. You don't want to do additional work beyond what they sold you on, out of principle. But you are getting much less effective use by being irrationally ornery.
Lots of things don't match marketing.
Ok now think about this in terms of items you own or likely own: What would you do if I sold you a car with 3 doors, after advertising it as having 5 doors instead? Would you accept it and try to work around that little inconvenience? Or would you return the product and demand your money back?
Consider using a better AI IDE platform than Copilot ... cursor, windsurf, cline, all great options that do much better than what you're describing. The underlying LLM capabilities also have advanced quite a bit in the past year.
Well I do not really use it that much to actually care, and don't really depend on AI, thankfully. If they did not mess up Google search, we wouldn't even need that crap at all. But that's not the main point. Even if I switched to Cursor or Windsurf, aren't they all using one of the same LLMs (ChatGPT, Claude, whatever)? The issue is that the underlying general approach will never be accurate enough. There is a reason most successful technologies lift off quickly and those not successful also die very quickly. This is a tech propped up by a lot of VC money for now, but at some point even the richest of the rich VCs will have trouble explaining spending 500B dollars in total to get something like 15B in revenue (not even profit). And don't even get me started on Altman's trillion-dollar fantasies...
It sounds like you've made up your mind! Best of luck.
:) And you sound like one of the many people I've seen come and go in my career. Best of luck to you actually - if the GenAI bubble does not pop in the next few years (which it will) we'll only have so many open positions for "prompters" to use for building web app skeletons :)
> truly generative UI, where the model produces the next frame of the app
Please sir step away from the keyboard now!
That is an absurd proposition and I hope I never get to use an app that dreams of the next frame. Apps are buggy as they are, I don't need every single action to be interpreted by LLM.
An existing example of this is that AI Minecraft demo and it's a literal nightmare.
This argument could be made for every level of abstraction we've added to software so far... yet here we are commenting about it from our buggy apps!
Yeah, but the abstractions have been useful so far. The main advantage of our current buggy apps is that if it is buggy today, it will be exactly as buggy tomorrow. Conversely, if it is not currently buggy, it will behave the same way tomorrow.
I don't want an app that either works or does not work depending on the RNG seed, prompt and even data that's fed to it.
That's even ignoring all the absurd computing power that would be required.
Still sounds a bit like we've seen it all already – dynamic linking introduced a lot of ways for software that wasn't buggy today to become buggy tomorrow. And Chrome uses an absurd amount of computing power (its bare minimum is many multiples of what was once a top-of-the-line, expensive PC).
I think these arguments would've been valid a decade ago for a lot of things we use today. And I'm not saying the classical software way of things needs to go away or even diminish, but I do think there are unique human-computer interactions to be had when the "VM" is in fact a deep neural network with very strong intelligence capabilities, and the input/output is essentially keyboard & mouse / video+audio.
You're just describing calling a customer service phone line in India.
> This argument could be made for every level of abstraction we've added to software so far... yet here we are commenting about it from our buggy apps!
No. Not at all. Those levels of abstractions – whether good, bad, everything in between – were fully understood through-and-through by humans. Having an LLM somewhere in the stack of abstractions is radically different, and radically stupid.
Every component of a deep neural network is understood by many people, it's the interaction between the numbers trained that we don't always understand. Likewise, I would say that we understand the components on a CPU, and the instructions it supports. And we understand how sets of instructions are scheduled across cores, with hyperthreading and the operating system making a lot of these decisions. All the while the GPU and motherboard are also full of logical circuits, understood by other people probably. And some (again, often different) people understand the firmware and dynamically linked libraries that the users' software interfaces with. But ultimately a modern computer running an application is not through and through understood by a single human, even if the individual components could be.
Anyway, I just think it's fun to make the thought experiment that if we were here 40 years ago, discussing today's advanced hardware and software architecture and how it interacts, very similar arguments could be used to say we should stick to single instructions on a CPU because you can actually step through them in a human understandable way.
Please, I don’t need my software experience to get any _worse_. It’s already a shitshow.
First it will dream up the interaction frame by frame. Next, to improve efficiency, it will cache those interaction representations. What better way to do that than through a code representation.
While I think current AI can’t come close to anything remotely usable, this is a plausible direction for the future. Like you, I shudder.
I guess you have not heard that NVidia is generating frames with AI on their GPUs now: https://www.nvidia.com/en-us/geforce/news/dlss4-multi-frame-...
> “DLSS Multi Frame Generation generates up to three additional frames per traditionally rendered frame, working in unison with the complete suite of DLSS technologies to multiply frame rates by up to 8X over traditional brute-force rendering. This massive performance improvement on GeForce RTX 5090 graphics cards unlocks stunning 4K 240 FPS fully ray-traced gaming.”
It still can't generate a full glass of wine. Even in follow up questions it failed to manipulate the image correctly.
https://i.imgur.com/xsFKqsI.png
"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."
Maybe the "HELL YEAH" added a "party implication" which shifted it's "thinking" into just correct enough latent space that it was able to actually hunt down some image somewhere in its training data of a truly full glass of wine.
I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.
Can't replicate. Maybe the rollout is staggered? Using Plus from Europe, it's consistently giving me a half full glass.
I am using Plus from Australia, and while I am not getting a full glass, nor am I getting a half full glass. The glass I'm getting is half empty.
Surprised it isn't fully empty for being upside down!
That's funny. HN hates funny. Enjoy your shadowban.
Yeah. I understand that this site doesn’t want to become Reddit, but it really has an allergy to comedy, it’s sad. God forbid you use sarcasm, half the people here won’t understand it and the other half will say it’s not appropriate for healthy discussion…
Good example in this very discussion: https://news.ycombinator.com/item?id=43477003
I like this site, but it can become inhuman sometimes.
People get upvoted for pedantry rather than furthering a conversation, e.g.
Is it drawing the image from top to bottom very slowly over the course of at least 30 seconds? If not, then you're using DALL-E, not 4o image generation.
This top-to-bottom drawing – does this tell us anything about the underlying model architecture? AFAIK diffusion models do not work like that. They denoise the full frame over many steps. In the past there used to be attempts to slowly synthesize a picture by predicting the next pixel, but I wasn't aware whether there has been a shift to that kind of architecture within OpenAI.
Yes, the model card explicitly says it's autoregressive, not diffusion. And it's not a separate model, it's a native ability of GPT-4o, which is a multimodal model. They just didn't make this ability public until now. I assume they worked on the fine-tuning to improve prompt following.
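The shape of the two generation loops, with random numbers standing in for actual model predictions (a toy illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    H = W = 8  # tiny "image" for illustration

    # Autoregressive: emit one image token at a time, each conditioned on
    # everything emitted so far (top-left to bottom-right, like text).
    tokens = []
    for _ in range(H * W):
        tokens.append(rng.integers(0, 1024))  # stand-in for the model's next-token choice
    image_tokens = np.array(tokens).reshape(H, W)

    # Diffusion: start from pure noise and refine the *whole* frame over many steps.
    frame = rng.normal(size=(H, W))
    for step in range(50):
        predicted_noise = 0.1 * rng.normal(size=(H, W))  # stand-in for the denoiser
        frame = frame - predicted_noise                  # every pixel updated every step

The top-to-bottom reveal in the UI is consistent with the first loop; a diffusion model would sharpen the whole frame at once.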
apparently it's not diffusion, but tokens
Works for me as well https://chatgpt.com/share/67e3f838-63fc-8000-ab94-5d10626397...
USA, but VPN set to exit in Canada at time of request (I think).
The EU got the drunken version. And a good drunk knows not to top off a glass of wine, ever. In that context the glass is already "full".
But aside from that, it would only be comparable if we could compare your prompts.
Maybe it's half empty.
ha
You might still be on DALL-E. My account still is, at least when using ChatGPT.
I switched over to the sora.com domain and now I have access to it.
The free site even has it. Just don't turn on image generation; it works with it off. If you enable it, it uses DALL-E.
Most interesting thing to me is the spelling is correct.
I'm not a heavy user of AI or image generation in general, so is this also part of the new release or has this been fixed silently since last I tried?
It very much looks like a side effect of this new architecture. In my experience, text looks much better in recent DALL-E images (so what ChatGPT was using before), but it is still noticeably mangled when printing more than a few letters. This model update seems to improve text rendering by a lot, at least as long as the content is clearly specified.
However, when giving a prompt that requires the model to come up with the text itself, it still seems to struggle a bit, as can be seen in this hilarious example from the post: https://images.ctfassets.net/kftzwdyauwt9/21nVyfD2KFeriJXUNL...
The periodic table is absolutely hilarious, I didn't know LLMs had finally mastered absurdist humor.
Yeah, who wouldn't love a dip in the sulphur pool. But back to the question: why can't such a model recognize letters as such? Can it not be trained to pay special attention to characters? How come it can print an anatomically correct eye but not differentiate between P and Z?
I think the model has not decided if it should print a P or a Z, so you end up with something halfway between the two.
It's a side effect of the entire model being differentiable - there is always some halfway point.
The head of foam on that glass of wine is perfect!
I think we're really fscked, because even AI image detectors think the images are genuine. They look great in Photoshop forensics too. I hope the arms race between generators and detectors doesn't stop here.
We're not. This PNG image of a wine glass has JPEG compression artefacts which are leaking from JPEG training data. You can zoom into the image and you will see the 8x8 boundaries of the blocks used in JPEG compression, which just cannot be in a PNG. This is a common method to detect AI-generated images and it is working so far; no need for complex Photoshop forensics or AI detectors, just zoom in and check for compression. Current AI is incapable of getting it right – all the compression algorithms are mixed and mashed in the training data, so on the generated image you can find artefacts from almost all of them if you're lucky, but JPEG is prevalent obviously; lossless images are rare online.
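If you want to check that yourself, a rough sketch of the idea in Python (the filename is a placeholder; real forensics tools are far more careful, but a strong 8-pixel periodicity in the differences is the tell):

    import numpy as np
    from PIL import Image

    def blockiness(path: str) -> float:
        """Ratio of pixel differences on 8-pixel block seams vs elsewhere.
        Values well above 1.0 hint at JPEG compression in the image's history."""
        g = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
        diff = np.abs(np.diff(g, axis=1))        # differences between adjacent columns
        cols = np.arange(diff.shape[1])
        on_seam = diff[:, cols % 8 == 7].mean()  # seams between 8x8 JPEG blocks
        off_seam = diff[:, cols % 8 != 7].mean()
        return on_seam / off_seam

    print(blockiness("wine_glass.png"))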
If JPEG compression is the only evident flaw, this kind of reinforces my point, as most of these images will end up shared as processed JPEG/WebP on social media.
Plenty of real PNG images have JPEG artifacts because they were once JPEGs off someone's phone...
Are you sure you are using the new 4o image generation?
https://imgur.com/a/wGkBa0v
That is an unexpectedly literal definition of "full glass".
That's the point. With the old models, they all failed to produce a wine glass that is completely full to the brim. Because you can't find that a lot in the data they used for training.
Imagine if they just actually trained the model on a bunch of photographs of a full glass of wine, knowing of this litmus test
I obviously have no idea if they added real or synthetic data to the training set specifically regarding the full-to-the-brim wineglass test, but I fully expect that this prompt is now compromised in the sense that, because it is being discussed in the public sphere, it has inherently become part of the test suite.
Remember the old internet adage that the fastest way to get a correct answer online is to post an incorrect one? I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.
> I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.
Take some artisan; I'll go with a barber. This person is not the best of the best, but still a capable barber, who can implement several styles on any head you throw at them. A client comes and describes a certain style they want. The barber is not sure how to implement such a style, consults with the master barber beside them, that barber describes the technique required for that particular style, and our barber in question comes and implements that style. Probably not perfectly, as they need to train their mind-body coordination a bit, but the cut is good enough that the client is happy.
There was no traditional training with "gap finding and filling" involved. The artisan already possessed the core skill and knowledge required, was filled in on the particulars of the task at hand, and successfully implemented the task. There was no looking at examples of finished work, no looking at examples of the process, no iterative learning by redoing the task a bunch of times.
So no, human learning, at least advanced human learning, is very much different from these techniques. Not that they are not impressive on their own, but let's be real here.
overfitting vs generalizing
Also, we all know real people who fail to generalize and overfit: copycats, potentially even with great skill, but no creativity.
Humans don't train on the entire contents of the Internet, so I'd wager that they do learn differently.
I think there is a critical aspect of human visual learning which machine learning can't replicate because it is prohibitively expensive. When we look at things as children we are not just looking at a single snapshot. When you stare at an object for a few seconds you have practically ingested hundreds of slightly varied images of that object. This gets even more interesting when you take into account that the real world is moving all the time, so you are seeing so many things from so many angles. This is simply undoable with compute.
Then explain blind children? Or blind & deaf children? There's obviously some role senses play in development but there's clearly capabilities at play here that are drastically more efficient and powerful than what we have with modern transformers. While humans learn through example, they clearly need a lot fewer examples to generalize off of and reason against.
they take in many samples of touch data
I think my point is that communication is the biggest contributor to brain development more than anything and communication is what powers our learning. Effective learners learn to communicate more with themselves and to communicate virtually with past authors through literature. That isn’t how LLMs work. Not sure why that would be considered objectionable. LLMs are great but we don’t have to pretend like they’re actually how brains work. They’re a decent approximation for neurons on today’s silicon - useful but nowhere near the efficiency and power of wetware.
Also as for touch, you’re going to have a hard time convincing me that the amount of data from touch rivals the amount of content on the internet or that you just learn about mistakes one example at a time.
There are so many points to consider here I'm not sure I can address them all.
- Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not).
- Human brains may be doing some analogue of sample augmentation, which gives you some multiple of equivalent samples to train on per real input state of the environment. This is done for ML too.
- Whether that input data is text or embodied is sort of irrelevant to cognition in general, but may be necessary for solving problems in a particular domain (text-only vs sighted vs blind).
> Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not).
I think you're saying exactly what I'm saying. Human brains work differently from LLMs and the OP comment that started this thread is claiming that they work very similarly. In some ways they do but there's very clear differences and while clarifying examples in the training set can improve human understanding and performance, it's pretty clear we're doing something beyond that - just from a power efficiency perspective humans consume far less energy for significantly more performance and it's pretty likely we need less training data.
sure.
To be honest I don't really care if they work the same or not. I just like that they do work, and find it interesting.
I don't even think people's brains work the same as each other's. Half of people can't even visually imagine an apple.
Neural networks seem to notice and remember very small details, as if they have access to signals from early layers. Humans often miss the minor details. There's probably a lot more signal normalization happening. That limits calorie usage and artifacts the features.
I don't think that this is necessarily a property neural networks can't have. I think it could be engineered in. For now, though, it seems like we're making a lot of progress even without efficiency constraints, so nobody cares.
Even if they did, I’d assume the association of “full” and this correct representation would benefit other areas of the model. I.e., there could (/should?) be general improvement for prompts where objects have unusual adjectives.
So maybe training for litmus tests isn’t the worst strategy in the absence of another entire internet of training data…
A lot of other things are rare in datasets, let alone correctly labeled. Overturned cars (showing the underside), views from under the table, people walking on the ceiling with plausible upside down hair, clothes, and facial features etc etc
They still can't generate a watch that shows arbitrary times I believe, so it could be the case?
imagine!
I did coax the old models into doing it once (dall-e) but it was like a fun exercise in prompting. They definitely didn't want to.
The old models were doing it correctly also.
There is no one correct way to interpret 'full'. If you go to a wine bar and ask for a full glass of wine, they'll probably interpret that as a double. But you could also interpret it the way a friend would at home, which is about 2-3cm from the rim.
Personally I would call a glass of wine filled to the brim 'overfilled', not 'full'.
I think you're missing the context everyone else has - this video is where the "AI can't draw a full glass of wine" meme got traction https://www.youtube.com/watch?v=160F8F8mXlo
The prompts (some generated by ChatGPT itself, since it's instructing DALL-E behind the scenes) include phrases like "full to the brim" and "almost spilling over" that are not up to interpretation at all.
People were telling the models explicitly to fill it to the brim, and the models were still producing images where it was filled to approximately the half-way point.
Generating an image of a completely full glass of wine has been one of the popular limitations of image generators, the reason being neural networks struggling to generalise outside of their training data (there are almost no pictures on the internet of a glass "full" of wine). It seems they implemented some reasoning over images to overcome that.
I wonder if that has changed recently since this has become a litmus test.
Searching in my favorite search engine for "full glass of wine", without even scrolling, three of the images are of wine glasses filled to the brim.
Except this is correct in this context. None of the existing diffusion models could do it, apparently.
This is another cool example from their blog
https://imgur.com/a/Svfuuf5
Looks amazing. Can you please also create an unconventional image, like a clock at 2:35? I tried something like this with Gemini when some redditor asked for it and it failed, so I'm wondering if 4o can do it.
I tried and it failed repeatedly (like actual error messages):
> It looks like there was an error when trying to generate the updated image of the clock showing 5:03. I wasn’t able to create it. If you’d like, you can try again by rephrasing or repeating the request.
A few times it did generate an image but it never showed the right time. It would frequently show 10:10 for instance.
If it tried and failed repeatedly, then it was prompting DALL-E, looking at the results, then prompting DALL-E again, not doing direct image generation.
So it's not doing what they are saying/ advertising, I think you are onto something big then
No... OpenAI said it was "rolling out". Not that it was "already rolled out to all users and all servers". Some people have access already, some people don't. Even people who have access don't have it consistently, since it seems to depend on which server processes your request.
I tried and while the clock it generated was very well done and high quality, it showed the time as the analog clock default of 10:10.
The problem now is we don't know if people mistake DALL-E output for the new multimodal GPT-4o output; they really should've made that clearer.
I’m using 4o and it gets time wrong a decent chunk but doesn’t get anything else in the prompt incorrect. I asked for the clock to be 4:30 but got 10:10. OpenAI pro account.
Shouldn't reasoning make the clock work, though?
Why does it sound like this isn't reasoning on images directly, but rather just DALL-E, as another commenter said? I will type the name of the person here (coder543).
Can you do this with the prompt of a cow jumping over the moon?
I can’t ever seem to get it to make the cow appear to be above the moon. Always literally covering it or to the side etc.
https://chatgpt.com/share/67e31a31-3d44-8011-994e-b7f8af7694... got it on the second try.
To be clear, that is DALL-E, not 4o image generation. (You can see the prompt that 4o generated to give to DALL-E.)
How can you see this? I don't see it.
On the web version, click on the image to make it larger. In the upper right corner, there is an (i) icon, which you can click to reveal the DALL-E prompt that GPT-4o generated.
Here you go: https://imgur.com/a/QJlj4I9
I don't buy the meme or w/e that they can't produce an image with the full glass of wine. Just takes a little prompt engineering.
Using Dall-e / old model without too much effort (I'd call this "full".)
https://imgur.com/a/J2bCwYh
The true test was "full to the brim", as in almost overflowing.
They're glass-half-full type models.
Yeah, it seems like somewhere in the semantic space (which then gets turned into a high resolution image using a specialized model probably) there is not enough space to hold all this kind of information. It becomes really obvious when you try to meaningfully modify a photo of yourself, it will lose your identity.
For Gemini it seems to me there's some kind of "retain old pixels" support in these models since simple image edits just look like a passthrough, in which case they do maintain your identity.
Also still seems to have a hard time consistently drawing pentagons. But at least it does some of the time, which is an improvement since last time I tried, when it would only ever draw hexagons.
I think it is not the AI but you who is wrong here. A full glass of wine is filled only up to the point of max radius, so that the surface exposed to air is maximized and the wine can breathe. This is what we taught the AI to consider "a full glass of wine" and it perfectly gets it right.
Nope, it can.
Got it in two requests, https://chatgpt.com/share/67e41576-8840-8006-836b-f7358af494... for the prompts.
The question remains: why would you generate a full glass of wine? Is that something really that common?
It's a type of QA question that can identify peculiarities in models (e.g. counting the "r"s in strawberry), which is the best we have given the black-box nature of LLMs.
> What's important about this new type of image generation that's happening with tokens rather than with diffusion
That sounds really interesting. Are there any write-ups how exactly this works?
Would be interested to know as well. As far as I know there is no public information about how this works exactly. This is all I could find:
> The system uses an autoregressive approach — generating images sequentially from left to right and top to bottom, similar to how text is written — rather than the diffusion model technique used by most image generators (like DALL-E) that create the entire image at once. Goh speculates that this technical difference could be what gives Images in ChatGPT better text rendering and binding capabilities.
https://www.theverge.com/openai/635118/chatgpt-sora-ai-image...
I wonder how it'd work if the layers were more physically based. In other words something like rough 3D shape -> details -> color -> perspective -> lighting.
Also wonder if you'd get better results in generating something like blender files and using its engine to render the result.
The original DALL-E was autoregressive; it's DALL-E 2 and 3 that used diffusion and were much less intelligent as a result.
There are a few different approaches. Meta documents at least one approach quite well in one of their llama papers.
The general gist is that you have some kind of adapter layers/model that can take an image and encode it into tokens. You then train the model on a dataset that has interleaved text and images. Could be webpages, where images occur in-between blocks of text, chat logs where people send text messages and images back and forth, etc.
The LLM gets trained more-or-less like normal, predicting next token probabilities with minor adjustments for the image tokens depending on the exact architecture. Some approaches have the image generation be a separate "path" through the LLM, where a lot of weights are shared but some image token specific weights are activated. Some approaches do just next token prediction, others have the LLM predict the entire image at once.
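As a rough illustration of that interleaving idea (this is not OpenAI's actual architecture; the class and variable names below are made up), the adapter can be as simple as a linear projection from the vision encoder's feature space into the LLM's embedding space:

```python
# Minimal sketch; names like ImageAdapter are illustrative, not any vendor's real design.
# An adapter projects image features into the same embedding space as text tokens,
# so image "tokens" can be interleaved with text tokens in one autoregressive sequence.
import torch
import torch.nn as nn

class ImageAdapter(nn.Module):
    """Projects patch features from a vision encoder into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_patches, vision_dim) -> (num_patches, llm_dim)
        return self.proj(patch_features)

# Interleaving: text embeddings and projected image embeddings become one sequence,
# and the LLM is trained with ordinary next-token prediction over that sequence.
llm_dim, vision_dim = 512, 768
adapter = ImageAdapter(vision_dim, llm_dim)
text_emb = torch.randn(10, llm_dim)                # 10 text tokens (already embedded)
image_emb = adapter(torch.randn(64, vision_dim))   # 64 image patches -> 64 "image tokens"
sequence = torch.cat([text_emb, image_emb, text_emb], dim=0)  # text, image, then text
```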
As for encoding-decoding, some research has used things as simple as Stable Diffusion's VAE to encode the image, split up the output, and do a simple projection into token space. Others have used raw pixels. But I think the more common approach is to have a dedicated model trained at the same time that learns to encode and decode images to and from token space.
For the latter approach, this can be a simple model, or it can be a diffusion model. For encoding you do something like a ViT. For decoding you train a diffusion model conditioned on the tokens, throughout the training of the LLM.
For the diffusion approach, you'd usually do post-training on the diffusion decoder to shrink down the number of diffusion steps needed.
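For the decode side, here is a minimal sketch of one building block of a "diffusion decoder conditioned on the tokens": noisy image latents cross-attend to the LLM's image tokens. Layer sizes and names are invented for illustration, not taken from any published design.

```python
# Illustrative only; layer sizes and names are made up.
import torch
import torch.nn as nn

class TokenConditionedDecoderBlock(nn.Module):
    """One denoising block that conditions image latents on LLM image tokens."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, noisy_latents: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # noisy_latents: (B, L, dim) partially denoised image latents
        # image_tokens:  (B, T, dim) conditioning tokens produced by the LLM
        attended, _ = self.cross_attn(noisy_latents, image_tokens, image_tokens)
        x = noisy_latents + attended
        return x + self.mlp(x)

block = TokenConditionedDecoderBlock(dim=256)
denoised = block(torch.randn(1, 1024, 256), torch.randn(1, 64, 256))
```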
The real crux of these models is the dataset. Pretraining on the internet is not bad, since there's often good correlation between the text and the images. But there aren't really good instruction datasets for this. Like, "here's an image, draw it like a comic book" type stuff. Given OpenAI's approach in the past, they may have just brute-forced the dataset using lots of human workers. That seems to be the most likely approach anyway, since no public vision models are quite good enough to do extensive RL against.
And as for OpenAI's architecture here, we can only speculate. The "loading from top to bottom from a blurry image" is either a direct result of their architecture or a gimmick to slow down requests. If the former, it means they are able to get a low resolution version of the image quickly, and then slowly generate the higher resolution "in order." Since it's top-to-bottom that implies token-by-token decoding. My _guess_ is that the LLM's image token predictions are only "good enough." So they have a small, quick decoder take those and generate a very low resolution base image. Then they run a stronger decoding model, likely a token-by-token diffusion model. It takes as condition the image tokens and the low resolution image, and diffuses the first patch of the image. Then it takes as condition the same plus the decoded patch, and diffuses the next patch. And so forth.
A mixture of approaches like that allows the LLM to be truly multi-modal without the image tokens being too expensive, and the token-by-token diffusion approach helps offset memory cost of diffusing the whole image.
I don't recall if I've seen token-by-token diffusion in a published paper, but it's feasible and is the best guess I have given the information we can see.
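To make that guess concrete, here is the patch-by-patch decoding loop in sketch form. Every function name is a hypothetical stand-in; none of this is confirmed by OpenAI.

```python
# A sketch of the guessed patch-by-patch decoding loop (all names are hypothetical
# stand-ins, not a known API).
from typing import List

def decode_image_patchwise(image_tokens, low_res_image, diffuse_patch, num_patches: int):
    """Decode patches in raster order, each conditioned on the LLM's tokens, the quick
    low-resolution preview, and all patches decoded so far."""
    decoded: List = []
    for i in range(num_patches):
        patch = diffuse_patch(
            condition_tokens=image_tokens,   # the LLM's "good enough" image tokens
            preview=low_res_image,           # fast low-res base from a small decoder
            previous_patches=decoded,        # everything already decoded, in order
            patch_index=i,
        )
        decoded.append(patch)
    return decoded
```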
EDIT: I should note, I've been "fooled" in the past by OpenAI's API. When o* models first came out, they all behaved as if the output were generated "all at once." There was no streaming, and in the chat client the response would just show up once reasoning was done. This led me to believe they were doing an approach where the reasoning model would generate a response and refine it as it reasoned. But that's clearly not the case, since they enabled streaming :P So take my guesses with a huge grain of salt.
Token by token diffusion was done by MAR https://arxiv.org/abs/2406.11838 and Fluid (scaled up MAR) https://arxiv.org/abs/2410.13863
When you randomly pick the locations they found it worked okay, but doing it in raster order (left to right, top to bottom) they found it didn't work as well. We tried it for music and found it was vulnerable to compounding error and lots of oddness relating to the fragility of continuous space CFG.
There is a more recent approach to auto-regressive image generation. Rather than predicting the next patch at the target resolution one by one, it predicts the next resolution. That is, the image at a small resolution followed by the image at a higher resolution and so on.
https://arxiv.org/abs/2404.02905
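Conceptually, next-scale prediction looks something like the loop below. This is heavily simplified: the actual VAR method predicts discrete token maps per scale rather than pixel residuals, and `predict_residual` here is a hypothetical model call.

```python
# Simplified sketch of coarse-to-fine, next-scale autoregressive generation.
import torch
import torch.nn.functional as F

def generate_coarse_to_fine(predict_residual, scales=(4, 8, 16, 32)):
    """Generate an image scale by scale: upsample the current estimate, then have the
    model predict a refinement at the next resolution (hypothetical signature)."""
    image = torch.zeros(1, 3, scales[0], scales[0])
    for s in scales:
        image = F.interpolate(image, size=(s, s), mode="bilinear", align_corners=False)
        image = image + predict_residual(image, s)   # model call (placeholder)
    return image
```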
It also would mean that the model can correctly split the image into layers, or segments, matching the entities described. The low-res layers can then be fed to other image-processing models, which would enhance them and fill in missing small details. The result could be a good-quality animation, for instance, and the "character" layers can even potentially be reusable.
> truly generative UI, where the model produces the next frame of the app
I built this exact thing last month, demo: https://universal.oroborus.org (not viable on phone for this demo, fine on tablet or computer)
Also see discussion and code at: http://github.com/snickell/universal
I wasn't really planning to share/release it today, but, heck, why not.
I started with bitmap-style generative image models, but because they are still pretty bad at text (even this, although it’s dramatically better), for early-2025 it’s generating vector graphics instead. Each frame is an LLM response, either as an svg or static html/css. But all computation and transformation is done by the LLM. No code/js as an intermediary. You click, it tells the LLM where you clicked, the LLM hallucinates the next frame as another svg/static-html.
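The core loop is roughly this (a simplified sketch, not the actual code in the repo; the function names are placeholders):

```python
# Rough sketch of the generative-UI loop described above; all callables are hypothetical.
def run_generative_ui(llm_render_frame, get_click, display):
    """Each user event is sent to the LLM, which returns the next frame as SVG/static HTML."""
    history = []
    frame = llm_render_frame(history=history, event={"type": "start"})
    history.append(frame)
    while True:
        display(frame)                      # render the SVG / static HTML
        x, y = get_click()                  # wait for the next user click
        event = {"type": "click", "x": x, "y": y}
        frame = llm_render_frame(history=history, event=event)  # LLM produces the next frame
        history.append(frame)
```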
If it ran 50x faster it'd be an absolutely jaw-dropping demo. Unlike "LLMs write code", this has depth. Like all programming, the "LLMs write code" model requires the programmer or LLM to anticipate every condition in advance. This makes LLM-written "vibe coded" apps either gigantic (and the LLM falls apart) or shallow.
In contrast, as you use universal, you can add or invent features ranging from small to big, and it will fill in the blanks on demand, fairly intelligently. If you don't like what it did, you can critique it, and the next frame improves.
It's agonizingly slow in 2025, but much smarter and, in weird ways, less error-prone than using the LLM to generate code that you then run: just run the computation via the LLM itself.
You can build pretty unbelievable things (with hallucinated state, granted) with a few descriptive sentences, far exceeding what you can "vibe code" from the same description. And it never gets lost in a rat's nest of self-generated garbage code because… there is no code to get lost in.
Code is a medium with a surprisingly strong grain. This demo is slow, but SO much more flexible and personally adaptable than anything I’ve used where the logic is implemented via a programming language.
I don’t love this as a programmer, but my own use of the demo makes me confident that programming languages as a category will have a shelf life if LLM hardware gets fast, cheap and energy efficient.
I suspect LLMs will generate not programming-language code, but direct wasm or just machine code on the fly for things that need to react faster than the LLM can draw a frame, while core logic moves out of programming languages (not even LLM-written code). Maybe similar to the way we bind to low-level fast languages while a huge percentage of “business” logic is written in relatively slower languages.
FYI, I may not be able to afford the credits if too many people visit; I put $1000 of credits on this, we'll see if that lasts. This is Claude 3.7; I tried everything else, and only Claude had the visual intelligence today. IMO this is a much more compelling glimpse of the future than coding models. Unfortunately, generating an SVG per click is pricey: each click/frame costs me about $0.05. I’ll fund this as far as I can so folks can play with it.
Anthropic? You there? Wanna throw some credits at an open source project doing something that literally only works on Claude today? Not just "Claude is better", but "only Claude 3.7 can show this future today". I’d love for lots more people to see the demo, but I really could use an in-kind credit donation to make this viable. If anyone at Anthropic is inspired and wants to hook me up: snickell@alumni.stanford.edu. Very happy to rep Claude 3.7 even more than I already do.
I think it’s great advertising for Claude. I believe the reason Claude seems to do SO much better at this task is, one, it shows far greater spatial intelligence, and two, I suspect they are the only state-of-the-art model intentionally training on SVG.
I’m a bit late here - but I’m the COO of OpenRouter and would love to help out with some additional credits and share the project. It’s very cool and more people should be able to check it out. Send me a note. My email is cc at OpenRouter.ai
wow, that would be amazing, sending you an email.
I don't think the project would have gotten this far without openrouter (because: how else would you sanely test on 20+ models to be able to find the only one that actually worked?). Without openrouter, I think I would have given up and thought "this idea is too early for even a demo", but it was easy enough to keep trying models that I kept going until Claude 3.7 popped up.
Thank you for the kind words and email received!
This is super cool! I think new kinds of experiences can be built with infinite generative UIs. Obviously there will need to be good memory capabilities, maybe through tool use.
If you end up taking this further and self hosting a model you might actually achieve a way faster “frame rate” with speculative decoding since I imagine many frames will reuse content from the last. Or maybe a DSL that allows big operations with little text. E.g. if it generates HTML/SVG today then use HAML/Slim/Pug: https://chatgpt.com/share/67e3a633-e834-8003-b301-7776f76e09...
What I'm currently doing is caveman: I ask the LLM to attach a unique id= to every element, and I gave it an attribute (data-use-cached) it can use to mark "the contents of this element should be loaded from the previous frame": https://github.com/snickell/universal/blob/47c5b5920db5b2082...
For example, this specifies that #my-div should be replaced with the value from the previous frame (which itself might have been cached): <div id="my-div" data-use-cached></div>
This lowers the render time /substantially/: for simple changes like "clicked here, pop open a menu" it can do it in ~10s, vs a full-frame render which might be 2 minutes (obviously it varies with how much is on the screen!).
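For anyone curious what resolving that attribute could look like, here's a naive sketch. It's regex-based and purely illustrative; the real implementation is in the linked repo and may work quite differently.

```python
# A guess at how data-use-cached could be resolved between frames (illustrative only;
# regex over HTML is naive and won't handle nested elements).
import re

def fill_cached_elements(new_frame_html: str, previous_frame_html: str) -> str:
    """Replace each empty element marked data-use-cached in the new frame with the
    element of the same id from the previous frame."""
    def lookup(match: re.Match) -> str:
        element_id = match.group("id")
        prev = re.search(
            rf'<[^>]*id="{re.escape(element_id)}"[^>]*>.*?</[^>]+>',
            previous_frame_html, flags=re.DOTALL,
        )
        return prev.group(0) if prev else match.group(0)

    pattern = r'<[^>]*id="(?P<id>[^"]+)"[^>]*data-use-cached[^>]*>\s*</[^>]+>'
    return re.sub(pattern, lookup, new_frame_html)
```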
I think using HAML etc is an interesting idea, thanks for suggesting it, that might be something I'll experiment with.
The challenge I'm finding is that "fancy" also has a way of confusing the LLM. E.g. I originally had the LLM produce literal unified diffs between frames. I reasoned it had seen plenty of diffs of HTML in its training data set. It could actually do this, BUT image quality and intelligence were notably affected.
Part of the problem is that at the moment (well 1mo ago when I last benchmarked), only Claude is "past the bar" for being able to do this particular task, for whatever reason. Gemini Flash is the second closest. Everything else (including 4o, 4.5, o1, deepseek, etc) are total wipeouts.
What would be really amazing is if, say, Llama 4 turns out to be good in the visual domain the way Claude is, and you can run it on one of the LLM-on-silicon vendors (Cerebras, Groq, etc.) to get 10x the token rate.
LMK if you have other ideas, thanks for thinking about this and taking a look!
Wonderful, good job! Reminds me of https://arstechnica.com/information-technology/2022/12/opena...
Do you have any demo videos?
No, I wasn't planning to post this for a couple weeks, but I saw the comment and was like "eh, why not?".
You can watch "sped up" past sessions by other people who used this demo here, which is kind of like a demo video: https://universal.oroborus.org/gallery
But the gallery feature isn't really there today, it shows all the "one-click and bounce" sessions, and it's hard to find signal in the noise.
I'll probably submit a "Show HN" when I have the gallery more together, and I think its a great idea to pick a multi-click gallery sequence and upload it as a video.
Seconding the need for a video. We need a way to preview this without it costing you money. I had to charge you a few dimes to grasp this excellent work. The description does not do it justice; people need to see this in motion. The progressive build-up of a single frame, too. I encourage you to post the Show HN soon.
Anyone know, order of magnitude, how many visits to expect from a "Show HN"? A thousand? 10? 100? I need to figure out how many credits I'd need to line up to survive one.
> had to charge you a few dimes
s/you/openrouter/: ty to openrouter for donating a significant chunk of credits a couple hours ago.
Really appreciate the feedback on needing a video. I had a sense this was the most important "missing piece", but this will give me the motivation to accomplish what is (to me) a relatively boring task, compared to hacking out more features.
The main things I'm blocking on before submitting a "Show HN" are:
- Getting the instant-replay gallery sorted to make it usable, maybe sorting via likes.
- Selecting a couple interesting sessions from the gallery and turning them into a short video
- Making sure I have enough credits lined up (hopefully donations!) to survive a "Show HN".
This thing is insanely cool, thanks for creating it.
It’s like a lucid dream version of using and modifying the software at the same time.
This is so impressive. You are building the future!
Pretty sure the modern Gemini image models can already do token based image generation/editing and are significantly better and faster.
Yeah Gemini has had this for a few weeks, but much lower resolution. Not saying 4o is perfect, but my first few images with it are much more impressive than my first few images with Gemini.
weeks, y'all, weeks!
It's faster but it's definitely not better than what's being showcased here. The quality of Flash 2 Image gens are generally pretty meh.
>You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
You can do that with diffusion, too. Just lock the parameters in ComfyUi.
Yeah I wasn’t very imaginative in my examples, with 4o you can also perform transformations like “rotate the camera 10 degrees to the left” which would be hard without a specialized model. Basically you can run arbitrary functions on the exact image contents but in latent space.
It's doable with diffusion, too.
I'm incredibly deep in the image / video / diffusion / comfy space. I've read the papers, written controlnets, modified architectures, pretrained, finetuned, etc. All that to say that I've been playing with 4o for the past day, and my opinions on the space have changed dramatically.
4o is a game changer. It's clearly imperfect, but its operating modalities are clearly superior to everything else we have seen.
Have you seen (or better yet, played with) the whiteboard examples? Or the examples of it taking characters out of reflections and manipulating them? The prompt adherence, text layout, and composing capabilities are unreal to the point this looks like it completely obsoletes inpainting and outpainting.
I'm beginning to think this even obsoletes ComfyUI and the whole space of open source tools once the model improves. Natural language might be able to accomplish everything outside of fine adjustments, but if you can also supply the model with reference images and have it understand them, then it can do basically everything. I haven't bumped into anything that makes me question this yet.
They just need to bump the speed and the quality a little. They're back at the top of image gen again.
I'm hoping the Chinese or another US company releases an open model capable of these behaviors. Because otherwise OpenAI is going to take this ball and run far ahead with it.
Yeah if we get an open model that one could apply a LoRA (or similarly cheap finetuning) to, then even problems like reproducing identity would (most likely) be solved, as they were for diffusion models. The coherence not just to the prompt but to any potential input image(s) is way beyond what I've seen in diffusion models.
I do think they run a "traditional" upscaler on the transformer output since it seems to sometimes have errors similar to upscalers (misinterpreted pixels), so probably the current decoded resolution is quite low and hopefully future models like GPT-5 will improve on this.
That's very interesting. I would have assumed that 4o is internally using a single seed for the entire conversation, or something analogous to that, to control randomness across image generation requests. Can you share the technical name for this reasoning process so I could look up research about it?
multimodal chain of thought / generation of thought
Nobody has really decided on a name.
Also, chain of thought is somewhat different from chain-of-thought reasoning, so maybe throw in "multimodal chain-of-thought reasoning" as well.
Is it able to break the usual failure modes of these models, that all clocks are at 10 min past two, or they can't produce images of people drawing with the left hand?
In my tests no, that's still not possible with the model unfortunately, but it feels like you have way more control with prompting over any previous model (stable diffusion/midjourney).
> Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM
With current GPU technology, this system would need its own Dyson sphere.
> writing the code to reproduce it
I'm super excited for all the free money and data our new AI written apps will be giving away.
Hmmm, I wanted to do that tic tac toe example, and it failed to create a 3x3 grid, instead creating a 5x5 (?) grid with two first moves marked.
https://chatgpt.com/share/67e32d47-eac0-8011-9118-51b81756ec...
Your images say "Created with DALL-E", so you have not tried out the new model yet. I think they are gradually rolling it out.
Click the "..." menu and make sure you disable images (w/ DALL-E); apparently the native one just works. If you enable the images tool, it switches to DALL-E, lol.
Tried it myself on the new model, worked out pretty well:
https://chatgpt.com/share/67e34558-5244-8004-933a-23896c738b...
I tried to play it, and while the conversation is right the image is just all wrong
I might just be a grumpy old man, but it really bugs me when the AI confidently says, "Here is your image, If you have any other requests, just let me know!".
For a start, the image is wrong, and I also know I can make more requests, because that's what tools are for. It's like a passive-aggressive suggestion that I made the AI go out of its way to do me a favor.
Wrt reasoning I’ll believe it when I see it. I just tried several variants of “Generate an image of a chess board in which white has played three great moves and black has played two bad moves.” Results are totally nonsensical as always.
Ran through some of my relatively complex prompts combined with using pure text prompts as the de-facto means of making adjustments to the images (in contrast to using something like img2img / inpainting / etc.)
https://mordenstar.com/blog/chatgpt-4o-images
It's definitely impressive though once again fell flat on the ability to render a 9-pointed star.
Didn't work for me on the first prompting (got a 10-pointed one), but after sending [this is 10 points, make it 9] it did render a 9-pointed one too
Have you had any luck with engineering/schematic/wireframe diagrams such as [1]?
[1] https://techcrunch.com/wp-content/uploads/2024/03/pasted-ima...
Great question. I haven't tested the creation of such an image from scratch, but I did add an adjustment test against that specific text-heavy diagram and I'd say it passed with "flying colors". (pun intended).
Here's what it created based on a text description of your schematic
https://i.imgur.com/sGfdtWo.png
Armless Venus with bread is true art
Fantastic prompts!
I’ve just tried it and oh wow it’s really good. I managed to create a birthday invitation card for my daughter in basically 1-shot, it nailed exactly the elements and style I wanted. Then I asked to retain everything but tweak the text to add more details about the date, venue etc. And it did. I’m in shock. Previous models would not be even halfway there.
share prompt minus identifying details?
> Draw a birthday invitation for a 4 year old girl [name here]. It should be whimsical, look like its hand-drawn with little drawings on the sides of stuff like dinosaurs, flowers, hearts, cats. The background should be light and the foreground elements should be red, pink, orange and blue.
Then I asked for some changes:
> That's almost perfect! Retain this style and the elements, but adjust the text to read:
> [refined text]
> And then below it should add the location and date details:
> [location details]
If anyone is interested in an example output for this exact initial prompt: https://x.com/0xmetaschool/status/1904804277341839847
Just did the same type of prompt for my son's birthday. I got all the classic errors. The first attempt looked good, but had 2 duplicate lines for date and time, and "Roarrr!" (dino theme) had a blurred-out "a".
pointed these issues out to give it a second go and got something way worse. This still feels like little more than a fun toy.
that's lovely thank you. i am not very artistic so having stuff like this to crib is very helpful.
> Introducing 4o Image Generation: [...] our most advanced image generator yet
Then google:
> Gemini 2.5: Our most intelligent AI model
> Introducing Gemini 2.0 | Our most capable AI model yet
I could go on forever. I hope this trend dies and Apple starts using something effective so all the other companies can start copying a new lexicon.
We're in the middle of a massive and unprecedented boom in AI capabilities. It is hard to be upset about this phrasing - it is literally true and extremely accurate.
If that's so then there's no need to be hyperbolic about it. Why would they publish a model that is not their most advanced model?
Most things aren't in a massive boom and most people aren't that involved in AI. This is a rare example of great communication in marketing - they're telling people who might not be across this field what is going on.
> Why would they publish a model that is not their most advanced model?
I dunno, I'm not sitting in the OpenAI meetings. That is why they need to tell us what they are doing - it is easy to imagine them releasing something that isn't their best model ever and so they clarify that this is, in fact, the new hotness.
(Shrug) It's common for less-than-foundation-level models to be released every so often. This is done in order to provide new options, features, pricing, service levels, APIs or whatever that aren't yet incorporated into the main model, or that are never intended to be.
Just a consequence of how much time and money it takes to train a new foundation model. It's not going to happen every other week. When it does, it is reasonable to announce it with "Announcing our most powerful model yet."
o3-mini wasn't so much a "most advanced" model as it was incredibly affordable for the IQ it presented at the time. Sometimes it's about efficiency, not being on the frontier.
They aren't being hyperbolic, they are accurately describing the reason you would use the new product.
And no, not all models are intended to push the frontier in terms of benchmark performance, some are just fast and cheap.
This is my latest and most advanced comment yet.
Has post-Jobs Apple ever come up with anything that would warrant this hope?
Every iPhone is their best iPhone yet
Even the 18 Pro Max Ultra with Apple Intelligence?
Obligatory Jobs monologue on marketing people:
https://www.youtube.com/watch?v=P4VBqTViEx4
Only the September ones. ;)
Not wrong though
It kind of is, the iPhone 16e isn’t the best even though it’s the latest, right? Or are we rating best by price/performance, not pure performance (I don’t even know if the 16e would be best there)?
Did Apple claim it’s the best phone yet? They’d probably only reserve that for the Pro.
No, but the user I (indirectly) replied to did:
> Every iPhone is their best iPhone yet
Apple isn't really the best software company and though they were early to digital assistants with Siri, it seems like they've let it languish. It's almost comical how bad Siri is given the capabilities of modern AI. That being said, Android doesn't really have a great builtin solution for this either.
Apple is more of a hardware company. Still, Cook does have a few big wins under his belt: M-series ARM chips on Macs, Airpods, Apple watch, Apple pay.
Apple silicon chips
No, but I think they stopped with "our most" (since all other brainless corps adopted it) and just connect adjectives with dots.
Hotwheels: Fast. Furious. Spectacular.
Maybe people also caught up to the fact that the "our most X product" for Apple usually means someone else already did X a long time ago and Apple is merely jumping on the wagon.
When you keep improving, it's always going to be the best or most: https://www.youtube.com/watch?v=bPkso_6n0vs
Every step of gradient descent is the best model yet!
Not if you do gradient descent with momentum.
Maybe it’s not useless. 1) it’s only comparing it to their own products and 2) it’s useful to know that the product is the current best in their offering as opposed to a new product that might offer new functionality but isn’t actually their most advanced.
Which is especially relevant when it's not obvious which product is the latest and best just looking at the names. Lots of tech naming fails this test from Xbox (Series X vs S) to OpenAI model names (4o vs o1-pro).
Here they claim 4o is their most capable image generator which is useful info. Especially when multiple models in their dropdown list will generate images for you.
This actually makes sense because the versioning is so confusing they could be releasing a lesser/lightweight model for all we know.
What's the problem?
It's a nitpick about the repetitive phrasing for announcements
<Product name>: Our most <superlative> <thing> yet|ever.
Speaking as someone who'd love to not speak that way in my own marketing - it's an unfortunate necessity in a world where people will give you literal milliseconds of their time. Marketing isn't there to tell you about the thing, it's there to get you to want to know more about the thing.
A term for people giving only milliseconds of their attention is: uninterested people. If I’m not looking for a project planner, or interested in the space, there’s no wording that can make me stay on an announcement for one. If I am, you can be sure I’m going to read the whole feature page.
Idealistic and wrong, marketing does work in a lot of cases and that's why everybody does it
No, everybody uses marketing because it's a conventional bet. It has proven in many cases to not be effective, but people aren't willing to risk getting fired because they suggested going against the grain.
I hate modern marketing trends.
This one isn't even my biggest gripe. If I could eliminate any word from the English language forever, it would be "effortlessly".
Idk, right now I think I'd eliminate "blazingly fast" from software engineering vocabulary.
I think Electron is giving you your wish.
If you could _effortlessly_ eliminate any word you mean?
Modern? Everything has been 'new and improved' since the 60's
https://www.youtube.com/watch?v=CUPDRnUWeBA
Maybe they used AI to come up with the tag line.
OpenAI's livestream of GPT-4o Image Generation shows that it is slowwwwwwwwww (maybe 30 seconds per image, which Sam Altman had to spin as "it's slow, but the generated images are worth it"). Instead of using a diffusion approach, it appears to be generating the image tokens and decoding them akin to the original DALL-E (https://openai.com/index/dall-e/), which allows for streaming partial generations from top to bottom. In contrast, Google's Gemini can generate images and make edits in seconds.
No API yet, and given the slowness I imagine it will cost much more than the $0.03+/image of competitors.
As a user, images feel slightly slower but comparable to the previous generation. Given the significant quality improvement, it's a fair trade-off. Overall, it feels snappy, and the value justifies a higher price.
[flagged]
I just gave quick feedback on the new release. How should I be writing it?
If anything, your feedback is of low value.
Maybe this is the dialup of the era.
Ha. That's a good analogy.
When I first read the parent comment, I thought, maybe this is a long-term architecture concern...
But your message reminded me that we've been here before.
Especially with the slow loading effect it has.
LLMs are autoregressive, so they can't be (multi-modality) integrated with diffusion image models, only with autoregressive image models (which generate an image via image tokens). Historically those had lower image fidelity than diffusion models. OpenAI now seems to have solved this problem somehow. More than that, they appear far ahead of any available diffusion model, including Midjourney and Imagen 3.
Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.
Edit: Apparently Gemini also has an experimental native image generation ability.
Gemini added their multimodal Flash model to Google AI Studio some time ago. It does not use Imagen via a tool; it uses native capabilities to manipulate images, and it's free to try.
Your understanding seems outdated; I think people are referring to Gemini's native image generation.
Is this the same for their gemini-2.0-flash-exp-image-generation model?
No that seems to be indeed a native part of the multimodal Gemini model. I didn't know this existed, it's not available in the normal Gemini interface.
This is a pretty good example of the current state of Google LLMs:
The (no longer, I guess) industry-leading features people actually want are hidden away in some obscure “AI studio” with horrible usability, while the headline Gemini app still often refuses to do anything useful for me. (Disclaimer: I last checked a couple of months ago, after several more months of mild amusement/great frustration.)
hey at least now they bought ai.dev and redirected it to their bad ux
That's pretty disappointing, it has been out for a while, and we still get top comments like (https://news.ycombinator.com/item?id=43475043) where people clearly think native image generation capability is new. Where do you usually get your updates from for this kind of thing?
Meta has experimented with a hybrid mode, where the LLM uses autoregressive mode for text, but within a set of delimiters will switch to diffusion mode to generate images. In principle it's the best of both worlds.
> so they can't be integrated
That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
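In other words, the diffusion model can be wrapped inside the autoregressive loop. A rough sketch of that idea follows, with every function standing in for whatever LLM, diffusion, and vision-encoder backends you happen to have; no specific vendor API is implied.

```python
# Sketch of "diffusion model in the middle" of an autoregressive pipeline.
# All callables are placeholders, not a real API.
def respond_with_image(llm_generate, diffusion_generate, encode_image, conversation):
    """Autoregressively produce text up to the image, generate the image with diffusion,
    then feed a representation of that image back into the autoregressive context."""
    # 1. The LLM writes text until it decides an image is needed, emitting an image prompt.
    text, image_prompt = llm_generate(conversation)
    # 2. A diffusion model turns that prompt into pixels.
    image = diffusion_generate(image_prompt)
    # 3. The image is encoded (e.g. by a vision encoder) and appended to the context,
    #    so later autoregressive steps can condition on it.
    conversation = conversation + [text, encode_image(image)]
    follow_up, _ = llm_generate(conversation)
    return image, follow_up
```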
I expect the Chinese to have an open source answer for this soon.
They haven't been focusing attention on images because the most used image models have been open source. Now they might have a target to beat.
ByteDance has been working on autoregressive image generation for a while (see VAR, NeurIPS 2024 best paper). Traditionally they weren't in the open-source gang though.
The VAR paper is very impressive. I wonder if OpenAI did something similar. But the main contribution in the new GPT-4o feature doesn't seem to be just image quality (which VAR seems to focus on), but also massively enhanced prompt understanding.
If you look at the examples given, this is the first time I've felt like AI generated images have passed the uncanny valley.
The results are ground breaking in my opinion. How much longer until an AI can generate 30 successive images together and make an ultra realistic movie?
One day you’ll just give it a script and get a movie out
What a fucking brilliant comment. Reading this site is such a waste of time
a premise*
i find this “slow” complaint (/observation— i dont view this comment as a complaint, to be clear) to be quite confusing. slow… compared to what, exactly? you know what is slow? having to prompt and reprompt 15 times to get the stupid model to spell a word correctly and it not only refuses, but is also insistent that it has corrected the error this time. and afaict this is the exact kind of issue this change should address substantially.
im not going to get super hyperbolic and histrionic about “entitlement” and stuff like that, but… literally this technology did not exist until like two years ago, and yet i hear this all the time. “oh this codegen is pretty accurate but it’s slow”, “oh this model is faster and cheaper (oh yeah by the way the results are bad, but hey it’s the cheapest so it’s better)”. like, are we collectively forgetting that the whole point of any of this is correctness and accuracy? am i off-base here?
the value to me of a demonstrably wrong chat completion is essentially zero, and the value of a correct one that anticipates things i hadn’t considered myself is nearly infinite. or, at least, worth much, much more than they are charging, and even _could_ reasonably charge. it’s like people collectively grouse about low quality ai-generated junk out of one side of their mouths, and then complain about how expensive the slop is out of the other side.
hand this tech to someone from 2020 and i guarantee you the last thing you’d hear is that it’s too slow. and how could it be? yeah, everyone should find the best deals / price-value frontier tradeoff for their use case, but, like… what? we are all collectively devaluing that which we lament is being devalued by ai by setting such low standards: ourselves. the crazy thing is that the quickly-generated slop is so bad as to be practically useless, and yet it serves as the basis of comparison for… anything at all. it feels like that “web-scale /dev/null” meme all over again, but for all of human cognition.
> it appears to be generating the image tokens and decoding them akin to the original DALL-E
The animation is a lie. The new 4o with "native" image generating capabilities is a multi-modal model that is connected to a diffusion model. It's not generating images one token at a time, it's calling out to a multi-stage diffusion model that has upscalers.
You can ask 4o about this yourself, it seems to have a strong understanding of how the process works.
Would it seem otherwise if it was a lie?
There are many clues to indicate that the animation is a lie. For example, it clearly upscales the image using an external tool after the first image renders. As another example, if you ask the model about the tokens inside of its own context, it can't see any pixel tokens.
A model may not have many facts about itself, but it can definitely see what is inside of its own context, and what it sees is a call to an image generation tool.
Finally, and most convincingly, I can't find a single official source where OpenAI claims that the image is being generated pixel-by-pixel inside of the context window.
Sorry but I think you may be mistaken if your only source is ChatGPT. It's not aware of its own creation processes beyond what is included in its system prompt.
i mean on free chat an image took maybe 2 seconds?
I’ll just be happy with not everything having that over-saturated CG/cartoon style that you can't prompt your way out of.
I was relying on that to determine if images were AI though
Frustratingly, the DALL-E API actually has an option for this: you can switch the style from "vivid" to "natural".
This option is not exposed in ChatGPT, it only uses vivid.
Is that an artifact of the training data? Where are all these original images with that cartoony look that it was trained on?
A large part of deviantart.com would fit that description. There are also a lot of cartoony or CG images in communities dedicated to fanart. Another component in there is probably the overly polished and clean look of stock images, like the front page results of shutterstock.
"Typical" AI images are this blend of the popular image styles of the internet. You always have a bit of digital drawing + cartoon image + oversaturated stock image + 3d render mixed in. Models trained on just one of these work quite well, but for a generalist model this blend of styles is an issue
> There are also a lot of cartoony or CG images in communities dedicated to fanart.
Asian artists don't color this way though; those neon oversaturated colors are a Western style.
(This is one of the easiest ways to tell a fake-anime western TV show, the colors are bad. The other way is that action scenes don't have any impact because they aren't any good at planning them.)
Wild speculation: video game engines. You want your model to understand what a car looks like from all angles, but it’s expensive to get photos of real cars from all angles, so instead you render a car model in UE5, generating hundreds of pictures of it, from many different angles, in many different colors and styles.
I've heard this is downstream of human feedback. If you ask someone which picture is better, they'll tend to pick the more saturated option. If you're doing post-training with humans, you'll bake that bias into your model.
Ever since Midjourney popularized it, image generation models are often posttrained on more "aesthetic" subsets of images to give them a more fantasy look. It also help obscure some of the imperfections of the AI.
.. either that or they are padding out their training data with scads of relatively inexpensive to produce 3d rendered images</speculation>
It's largely an artifact of classifier-free guidance used in diffusion models. It makes the image generation more closely follow the prompt but also makes everything look more saturated and extreme.
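For reference, classifier-free guidance amounts to extrapolating from the unconditional prediction toward the conditional one; the larger the guidance scale, the more closely the image follows the prompt, and the more "vivid" it tends to look. A one-function sketch:

```python
# Standard classifier-free guidance combination of noise predictions.
import torch

def cfg_noise_prediction(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                         guidance_scale: float) -> torch.Tensor:
    # guidance_scale = 1.0 reproduces the plain conditional model;
    # larger values follow the prompt more closely but tend to oversaturate.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```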
you really have to NOT try to end up with that result in MJ.
Is there any way to see whether a given prompt was serviced by 4o or Dall-E?
Currently, my prompts seem to be going to the latter still, based on e.g. my source image being very obviously looped through a verbal image description and back to an image, compared to gemini-2.0-flash-exp-image-generation. A friend with a Plus plan has been getting responses from either.
The long-term plan seems to be to move to 4o completely and move Dall-E to its own tab, though, so maybe that problem will resolve itself before too long.
4o generates top down (picture goes from mostly blurry to clear starting from the top). If it's not generating like that for you then you don't have it yet.
That's useful, thank you! But it also highlights my point: Why do I have to observe minor details about how the result is being presented to me to know which model was used?
I get the intent to abstract it all behind a chat interface, but this seems a bit too much.
Oh I agree 100%. Open AI roll outs leave much to be desired. Sometimes there isn't even a clear difference like there is for this.
I mean, on the webpage the DALL-E one has a bubble under it that says "generated with DALL-E".
If you don't have access to it on ChatGPT yet, you can try Sora, which already has access for me.
I've generated (and downloaded) a couple of images. All filenames start with `DALL·E`, so I guess that's a safe way to tell how the images were generated.
My images say "Created with DALLE" below them, and a little info icon tells me they are rolling out the new one soon.
Don't enable images on the chat model if you're using the site; just leave it all disabled and ask for an image. If you enable DALL-E, it switches to DALL-E, is what I've seen.
the native just.. works
It's incredible that this took 316 days to be released since it was initially announced. I do appreciate the emphasis in the presentation on how this can be useful beyond just being a cool/fun toy, as it seems most image generation tools have functioned.
Was anyone else surprised how slow the images were to generate in the livestream? This seems notably slower than DALLE.
I've never minded that an image might take 10-30 seconds to generate. The fact that people do is crazy to me. A professional artist would take days, and cost $100s for the same asset.
I ran Stable Diffusion for a couple of years (maybe? time really hasn't made sense since 2020) on my dual-3090 rendering server. I built the server originally for crypto heating my office in my 1820s colonial in upstate NY. Then, when I was planning to go back to college (got accepted into a university in England), I switched its focus to Blender/UE4 (then 5), then eventually to AI image gen. So I've never minded 20 seconds for an image. If I needed dozens of options to pick the best, I was going to click start and grab a cup of coffee, come back and maybe it was done. Even if it took 2 hours, it is still faster than when I used to have to commission art for a project.
I grew out of Stable Diffusion, though, because the learning curve beyond grabbing a decent checkpoint and clicking start was actually really high (especially compared to LLMs that seemed to "just work"). After going through failed training after failed fine-tuning using tutorials that were a couple days out of date, I eventually said, fuck it, I'm paying for this instead.
All that to say - if you are using GenAI commercially, even if an image or a block of code took 30 minutes, it's still WAY cheaper than a human. That said, eventually a professional will be involved, and all the AI slop you generated will be redone, which will still cost a lot, but you get to skip the back and forth figuring out style/etc.
The new model in the drop down says something like "4o Create Image (Updated)". It is truly incredible. Far better than any other image generator as far as understanding and following complex prompts.
I was blown away when they showed this many months ago, and found it strange that more people weren't talking about it.
This is much more precise than the Gemini one that just came out recently.
> found it strange that more people weren't talking about it.
Some simply dislike everything OpenAI. Just like everything Musk or Trump.
First AI image generator to pass the uncanny valley test? Seems like it. This is the biggest leap in image generation quality I've ever seen.
How much longer until an AI that can generate 30 frames with this quality and make a movie?
About 1.5 years ago, I thought AI would eventually allow anyone with an idea to make a Hollywood quality movie. Seems like we're not too far off. Maybe 2-3 more years?
>First AI image generator to pass the uncanny valley test?
Other image generators I've used lately often produced pretty good images of humans, as well [0]. It was DALLE that consistently generated incredibly awful images. Glad they're finally fixing it. I think what most AI image generators lack the most is good instruction following.
[0] YandexArt for the first prompt from the post: https://imgur.com/a/VvNbL7d The woman looks okay, but the text is garbled, and it didn't fully follow the instruction.
Do you have another example from YandexArt?
https://images.ctfassets.net/kftzwdyauwt9/7M8kf5SPYHBW2X9N46...
OpenAI's human faces look *almost* real.
>OpenAI's human faces look almost real.
Not sure, I tried a few generations, and it still produces those weird deformed faces, just like the previous generation: https://imgur.com/a/iKGboDH Yeah, sometimes it looks okay.
YandexArt for comparison: https://imgur.com/a/K13QJgU
I can tell YandexArt's human faces are AI right away. They have the "shine" on their faces. They look too flawless.
Ideogram 2.0 and Recraft also create images that looks very much real.
For drawings, NovelAI's models are way beyond the uncanny valley now.
My experience with these announcements is that they're cherry picking the best results from a maybe several hundred or a thousand prompts.
I'm not saying that it's not true, it's just "wait and see" before you take their word as gold.
I think MS's claim on their quantum computing breakthrough is the latest form of this.
> My experience with these announcements is that they're cherry picking the best results from a maybe several hundred or a thousand prompt
Just tried it; prompt adherence and quality are... exactly what they said. It's extremely impressive.
The examples they show have little captions that say "best of #", like "best of 8" or "best of 4". Hopefully that truly represents the odds of generating the level of quality shown.
Some of the prompts are pretty long. I'm curious how many iterations it took to get to that prompt before they took the best of 8.
I'm not doubting it's an improvement, because it looks like it is.
I guess here's an example of a prompt I would like to see:
A flying spaghetti monster with a metal colander on its head flying above New York City, saving the world from a very, very evil Pope.
I'm not anti/pro spaghetti monster or Catholicism. But I can visualize clearly in my head what that prompt might look like.
Here you go: https://chatgpt.com/share/67e3d3dc-b234-8004-b992-b559fc5038...
Have you tried it? It's crazy good.
Can it be tried? ChatGPT still uses DALL-E for me.
No offense but after years of vaporware and announcements that seemed more plausible than implausible, I'll remain skeptical.
I will also not give them my email address just to try it out.
Why are you making blanket statements on things that you haven't even tried? This is leaps and bounds better than before.
No offense but do you believe it when microsoft announces they have solved quantum computing?
And to prove it they only need your email address, birth date, credit card number, and rights to first born child?
I don't believe it when Microsoft announces it, but when two separate trustworthy-looking hn accounts tell me something is crazy good that seems like valuable information to me.
No offense but you are really obnoxious.
None taken. I never said I wasn't.
why not use a fake email address?
it’s rolling out to users on all tiers, so no need to wait. I tried it and saw outputs from many others. it’s good. very good
Chat GPT requires logging in with an email. I hesitated on that.
That's why I prefer to wait.
You can create e-mail addresses for single use, even temporary ones.
I got the occasional A/B test with a new image generator while playing with Dall-E during a one month test of Plus. It was always clear which one was the new model because every aspect was so much better. I assume that model and the model they announced are the same.
You should try it out yourself.
This is really impressive, but the "Best of 8" tag on a lot of them really makes me want to see how cherry-picked they are. My three free images had two impressive outputs and one failure.
The high five looks extremely unnatural. Their wrists are aligned, but their fingers aren't, somehow?
If that's best of 8, I'd love to see the outtakes.
Agreed. It seems totally unnatural that a couple of nerds high-five awkwardly.
Not awkward. Anatomically uncanny and physically impossible.
While drawing hands is difficult (because the surface morphs in a variety of ways), the shapes and relative proportions are quite simple. That’s how you can have tools like Metahuman[0]
[0]: https://www.unrealengine.com/en-US/metahuman
The whiteboard image is insane. Even if it took more than 8 to find it, it's really impressive.
To think that a few years ago we had dreamy pictures with eyes everywhere. And not long ago we were always identifying the AI images by the 6 fingered people.
I wonder how well the physics is modeled internally. E.g. if you prompt it to model some difficult ray tracing scenario (a box with a separating wall and a light in one of the chambers which leaks through to the other chamber etc)?
Or if you have a reflective chrome ball in your scene, how well does it understand that the image reflected must be an exact projection of the visible environment?
I remember literally just two or three years back getting good text was INSANE. We were all amazed when SD started making pretty good text.
Am I dumb, or every time they release something I can never find out how to actually use it and forget about it? Take this for instance: I wanted to try out their Newton example ("an infographic explaining Newton's prism experiment in great detail"), but it generated a very bad result. Maybe it's because I'm not using the right model? Every release of theirs is not really a release, it's like a trailer. Right?
This is hilarious. I'm also confused about whether they released it or not because the results are underwhelming.
EDIT: Ok it works in Sora, and my jaw dropped
You're not dumb. They do this for nearly every single major release. I can't really understand why considering it generates negative sentiment about the release, but it's something to be expected from OpenAI at this point.
This is what's so wild about Anthropic. When they release it seems like it's rolled out to all users, and API customers immediately. OpenAI has MONTHS between annoucement and roll out, or if they do it's usually just influencers who get an "early look". It's pretty frustrating.
It's very impressive. It feels like the text is a bit of a hack where they're somehow rendering the text separately and interpolating it into the image. Not always, I got it to render calligraphy with flourishes, but only for a handful of words.
For example, I asked it to render a few lines of text on a medieval scroll, and it basically looked like a picture of a gothic font written onto a background image of a scroll
This doesn't explain the whiteboard example at the link. The handwritten text on the whiteboard does not look like a font at all.
You could have a model that receives the generated raw text and then is trained to display it in whatever style. Whether it looks like a font or not is irrelevant.
Visual internet content is completely over. Pack it up
For starters, this completely blocks generation of anything remotely related to copy-protected IPs, which may actually be a saving grace for some creatives. There's a lot of demand for fanart of existing characters, so until this type of model can be run locally, the legal blocks in place actually give artists some space to play in where they don't have to compete with this. At least for a short while.
Fan-art is still illegal, especially since a lot of fan artists are doing it commercially nowadays via commissions and Patreon. It's just that companies have stopped bothering to sue for it because individual artists are too small to bother with, and it's bad PR. (Nintendo did take down a super popular Pokemon porn comic, though.)
So it's ironic in this sense, that OpenAI blocking generation of copyrighted characters means that it's more in compliance with copyright laws than most fan artists out there, in this context. If you consider AI training to be transformative enough to be permissible, then they are more copyright-respecting in general.
Source: https://lawsoup.org/legal-guides/copyright-protecting-creati...
>For starters, this completely blocks generation of anything remotely related to copy-protected IPs
It did Dragon Ball Z here:
https://old.reddit.com/r/ChatGPT/comments/1jjtcn9/the_new_im...
Rick and Morty:
https://old.reddit.com/r/ChatGPT/comments/1jjtcn9/the_new_im...
South Park:
https://old.reddit.com/r/ChatGPT/comments/1jjyn5q/openais_ne...
Despite likely being trained on and stealing from copy-protected IPs? Not sure if they've changed their approach to training data.
I’ve had it do Tintin and the Simpsons in the last hour, so no, it doesn’t
So I spent a good few hours investigating the current state of the art a few weeks ago. I would like to generate a collection of images for the art in a video game.
It is incredibly difficult to develop an art style, then get the model to generate a collection of different images in that unique art style. I couldn't work out how to do it.
I also couldn't work out how to illustrate the same characters or objects in different contexts.
AI seems great for one off images you don't care much about, but when you need images to communicate specific things, I think we are still a long way away.
Short answer: the model is good at consistency. You can use it to generate a set of style reference images, then use those as references for all your subsequent generations. Generating in the same chat might also help it have further consistency between images.
Sorry, that wasn't a question; I was saying the models were not good at consistency in my evaluation.
In terms of prompt adherence and consistency, the current state of art just changed dramatically today and you're in the very thread about the change.
Your evaluation, done a few weeks ago, isn't relevant anymore.
I don't see any evidence of that, and in fact, in the video shows the style moving all over the place.
I look forward to giving it a try, but I don't have high hopes.
Even with custom LoRAs, ControlNets, etc. we're still a pretty long way from being able to one-click generate thematically consistent images, especially in the context of a video game where you really need the ability to generate seamless tiles, animation-based spritesheets, etc.
I didn’t mean art. I meant visual internet content of all kinds. Influencers promoting products, models, the “guy talking to a camera” genre, photos of landscapes, interviews, well-designed ads, anything that comes up on your instagram explore page; anything that has taken over feeds due to the trust coming from a human being behind it will become indistinguishable from slop. It’s not quite there yet but it’s close and undeniably coming soon
https://chatgpt.com/share/67e39ffa-3a98-8011-ab79-fe3ac76632...
Asking it to draw the Balkans map in Tolkien style: this is actually really impressive. The geography is more or less completely correct; borders and country locations are wrong, but it feels like something I could get it to fix.
Strange, I'm getting
> I wasn't able to generate the map because the request didn't follow content policy guidelines. Let me know if you'd like me to adjust the request or suggest an alternative way to achieve a similar result.
Are you in the US?
...why are we living in such a retarded sci-fi age
No, I'm in Croatia. Just tried again and it's working https://chatgpt.com/share/67e3d18a-75e0-8011-ba67-fdcd13aa7f...
Strange, I can see your second image but not the first.
Weird, I can't see the image at all.
Edit: Eventually it showed up
The character consistency and UI capabilities seem like they open up a lot of new use cases.
[flagged]
Well I definitely wouldn't say it's vital for humanity. Has anyone actually said that?
Character consistency means that these models could now theoretically illustrate books, as one example.
Generating UIs seems like it would be very helpful for any app design or prototyping.
For creating believable fake images...
We're largely past the days of 7 fingered hands - text remains one of the tell-tale signs.
Never heard about professional photographers, stock photography, graphic artists, etc.?
People didn’t care about cars before they were invented either
[flagged]
I work on a product for generating interactive fanfiction using an LLM, and I've put a lot of work into post-training to improve writing quality to match or exceed typical human levels.
I'm excited about this for adding images to those interactive stories.
It has nothing to do with circumventing the cost of artists or writers: regardless of cost, no one can put out a story and then rewrite it based on whatever idea pops into every reader's mind for their own personal main character.
It's a novel experience that only a "writer" that scales by paying for an inanimate object to crunch numbers can enable.
Similarly no artist can put out a piece of art for that story and then go and put out new art bespoke to every reader's newly written story.
-
I think there's this weird obsession with framing these tools about being built to just replace current people doing similar things. Just speaking objectively: the market for replacing "cheeky expensive artists" would not justify building these tools.
The most interesting applications of this technology being able to do things that are simply not possible today even if you have all the money in the world.
And for the record, I'll be ecstatic for the day an AI can reach my level of competency in building software. I've been doing it since I was a child because I love it, it's the one skill I've ever been paid for, and I'd still be over the moon because it'd let me explore so many more ideas than I alone can ever hope to build.
> That is a great right, as long as it's not programmers.
You realize that almost weekly we have new AI models coming out that are better and better at programming? It just happened that the image generation is an easier problem than programming. But make no mistake, AI is coming for us too.
That's the price of automating everything.
I try this on every new generation:
Generate a photo of a lake taken by a mobile phone camera. No hands or phones in the photo, just the lake.
The hand holding a phone is always there :D
Not when I tried https://i.imgur.com/1cPaXn5.png
Few image generation AI tools understand negatives. If you tell it "not" to do something, that something will always appear somehow.
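For context, open diffusion pipelines don't try to parse negation out of natural language at all; exclusions go into a separate negative prompt. A minimal sketch with the diffusers library and a public SD 1.5 checkpoint (the prompts and file name are placeholders, and this says nothing about how the hosted models handle it):

  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")

  # Writing "no hands or phones" inside the prompt tends to inject those very
  # concepts; the pipeline instead takes exclusions as a separate negative prompt.
  image = pipe(
      "a photo of a lake taken with a mobile phone camera",
      negative_prompt="hands, phone, person",
  ).images[0]
  image.save("lake.png")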
ditto for me. prompted to remove hands, but no success.
ChatGPT Pro tip: In addition to video generation, you can use this new image gen functionality in Sora and apply all of your custom templates to it! I generated this template (using my Sora Preset Generator, which I think is public) to test reasoning and coherency within the image:
Theme: Educational Scientific Visualization – Ultra Realistic Cutaways
Color: Naturalistic palettes that reflect real-world materials (e.g., rocky grays, soil browns, fiery reds, translucent biological tones) with high contrast between layers for clarity
Camera: High-resolution macro and sectional views using a tilt-shift camera for extreme detail; fixed side angles or dynamic isometric perspective to maximize spatial understanding
Film Stock: Hyper-realistic digital rendering with photogrammetry textures and 8K fidelity, simulating studio-grade scientific documentation
Lighting: Studio-quality three-point lighting with soft shadows and controlled specular highlights to reveal texture and depth without visual noise
Vibe: Immersive and precise, evoking awe and fascination with the inner workings of complex systems; blends realism with didactic clarity
Content Transformation: The input is transformed into a hyper-detailed, realistically textured cutaway model of a physical or biological structure—faithful to material properties and scale—enhanced for educational use with visual emphasis on internal mechanics, fluid systems, and spatial orientation
Examples:
1. A photorealistic geological cutaway of Earth showing crust, tectonic plates, mantle convection currents, and the liquid iron core with temperature gradients and seismic wave paths.
2. An ultra-detailed anatomical cross-section of the human torso revealing realistic organs, vasculature, muscular layers, and tissue textures in lifelike coloration.
3. A high-resolution cutaway of a jet engine mid-operation, displaying fuel flow, turbine rotation, air compression zones, and combustion chamber intricacies.
4. A hyper-realistic underground slice of a city showing subway lines, sewage systems, electrical conduits, geological strata, and building foundations.
5. A realistic cutaway of a honeybee hive with detailed comb structures, developing larvae, worker bee behavior zones, and active pollen storage processes.
Anyone else frightened by this? Seeing used to mean believing, and now that isn't the case anymore...
In the short term, yes. Over the long run, I think it's good that we move away from the "seeing is believing" model, since that was already abused by bad actors and propaganda. Hopefully there's not too much chaos until we find another solution.
No, people have been making photorealistic and convincing images in photoshop for ages.
This specifically? No. We’ve been on this path a while now.
The general idea of indistinguishable real/fake images; yeah
Look closer at the fingers. These models still don’t have a firm handle on them. The right elbow on the second picture also doesn’t quite look anatomically possible.
> AI is bad and unconvincing
> if it's not unconvincing, it's soulless (only because I was told in advance that it's AI)
> if it's not soulless then it's using too much energy
I’m not sure what your point is. This subthread is about whether AI-generated pictures can be distinguished from real photographs. For the pictures in the article, which are already cherry-picked (“best of 8”), the answer is yes. Therefore I don’t quite share the worries of GP.
Nah, I'll maybe start taking them seriously when they can draw someone grating cheese, but holding the cheese and the grater as if they were playing violin.
Not really, but only because literally every person I've met that spends a lot of time on TikTok starts spouting unhinged nonsense at me.
You don't even need deepfakes. https://www.newsweek.com/doug-mastriano-pennsylvania-senator...
The disaster scenario is already here.
it's fine, there will be new jobs.
Yes, but not in time to save the people who were in the old jobs. Plus retraining.
Over 10 years it might even out, if you're lucky (historically it's taken much longer), but 10 years is a long time to wait in your career.
Like rebuilding society from the ashes
The problem isn't really jobs, imo. I think we'd need to rework the system itself, because working shouldn't be necessary anymore. Time for some communism.
Will be interesting to see how this ranks against Google Imagen and Reve. https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Ima...
Is anyone else getting wild rejections on content policy since this morning? I spent about 20 minutes trying to get it to turn my zoo photos into cartoons and could not get a single animal picture past the content moderation....
Even when I told it to transform it into a text description, then draw that text description, my earlier attempt at a cat picture meant that the description was too close to a banned image...
I can't help but feel like OpenAI and Grok are at unhelpful polar opposites when it comes to moderation.
This works great for many purposes.
One area where it does not work well at all is modifying photographs of people's faces.* Completely fumbles if you take a selfie and ask it to modify your shirt, for example.
* = unless the people are in the training set
> We’re aware of a bug where the model struggles with maintaining consistency of edits to faces from user uploads but expect this to be fixed within the week.
Sounds like it may be a safety thing that's still getting figured out
Thanks, I had not seen that caveat!
It just doesn't have that kind of image editing capability. Maybe people just assume it does because Google's similar model has it. But did OpenAI claim it could edit images?
Yes it does, and that's one of the most important parts of it being multi-modal: just like it can make targeted edits at a piece of text, it can now make similarly nuanced edits to an image. The character consistency and restyling they mention are all rooted in the same concepts.
That's to be expected, no? It's a usian product so it will be a disappointment in all areas where things could get lewd.
What is usian? Never heard of that.
US-ian, as in from the United States.
So should we be using Eusians for citizens of the Estados Unidos Mexicanos?
Why?
The Americas are quite a bit larger than the USA, so I disagree with 'american' being a word for people and things from mainland USA. Usian seems like a reasonable derivative of USA and US, similar to how mexican follows from Mexico and Estados Unidos Mexicanos.
Are these models also based on copyrighted material? Can anyone briefly explain whether the datasets are scraped from the web or limited to CC-licensed / clearly public images?
Something crazy: the old model couldn't draw a 5.25" drive. I tried this myself at the time.
https://news.ycombinator.com/item?id=42628742
The new one can.
https://chatgpt.com/share/67e36dee-6694-8010-b337-04f37eeb5c...
Is it live yet? Have been trying it out and am still getting poor results on text generation.
I don't think it's available to everyone yet on 4o. Just like you I am getting the same "cartoony" styling and poor text generation.
Might take a day or two before it's available in general.
So far it seems to be the same for me.
It seems like an odd way to name/announce it: there's nothing obvious to distinguish it from what was already there (i.e. 4o making images), so I have no idea whether there is a UI change to look for, or whether I should just keep trying stuff until it seems better.
If only OpenAI would dogfood their own product and use ChatGPT to make different choices with marketing that are less confusing than whoever's driving that bus now.
This is OpenAI's bread and butter - announce something as though it's being launched and then proceed to slowly roll it out after a couple of days.
Truly infuriating, especially when it's something like this that makes it tough to tell if the feature is even enabled.
They're also copying the Apple and Google style of refusing to show which version of the product you're using.
You're supposed to generate images, stupid /s
I enjoy trying to break these models. I come up with prompts that are uncommon but valid. I want to see how well they handle data not in their training set. For image generation I like to use “ Generate an image of a woman on vacation in the Caribbean, lying down on the beach without sunglasses, her eyes open.”
I think the biggest problem I still see is the models awareness of the images it generated itself.
The glaring issue for the older image generators is how it would proudly proclaim to have presented an image with a description that has almost no relation to the image it actually provided.
I'm not sure if this update improves on this aspect. It may create the illusion of awareness of the picture by having better prompt adherence.
Here's an example of iterative editing with the new model: https://chatgpt.com/share/67e30f62-12f0-800f-b1d7-b3a9c61e99...
It's much better than prior models, but still generates hands with too many fingers, bodies with too many arms, etc.
For some reason, I can't see the images in that chat, whether I'm signed in or in incognito mode.
I see errors like this in the console:
  ewwsdwx05evtcc3e.js:96 Error: Could not fetch file with ID file_0000000028185230aa1870740fa3887b?shared_conversation_id=67e30f62-12f0-800f-b1d7-b3a9c61e99d6 from file service
      at iehdyv0kxtwne4ww.js:1:671
      at async w (iehdyv0kxtwne4ww.js:1:600)
      at async queryFn (iehdyv0kxtwne4ww.js:1:458)
  Caused by: ClientRequestMismatchedAuthError: No access token when trying to use AuthHeader
You know the images themselves don’t get shared in links like that, right? (It even tells you so when you make the link.)
I created a shared link just now, was not presented with any such warning, and have the same problem with the image not showing up:
https://chatgpt.com/share/67e319dd-bd08-8013-8f9b-6f5140137f...
Interesting. I see this: https://imgur.com/a/QNWeEoZ
Aha! I see different messages in the Android app vs. web app.
In the web app I see:
Your name, custom instructions, and any messages you add after sharing stay private. Learn more
I see it normally on PC, maybe not working well on app?
The image shows for me.
Looks about what you'd get with FLUX and attaching some language model to enhance your prompt with eg more text
Flux 1.1 Pro has good prompt adherence, but some of these (admittedly cherry-picked) GPT-4o generated image demos are beyond what you would get with Flux without a lot of iteration, particularly the large paragraphs of text.
I'm excited to see what a Flux 2 can do if it can actually use a modern text encoder.
Structural editing and control nets are much more powerful than text prompting alone.
The image generators used by creatives will not be text-first.
"Dragon with brown leathery scales with an elephant texture and 10% reflectivity positioned three degrees under the mountain, which is approximately 250 meters taller than the next peak, ..." is not how you design.
Creative work is not 100% dice rolling in a crude and inadequate language. Encoding spatial and qualitative details is impossible. "A picture is worth a thousand words" is an understatement.
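For anyone who hasn't used it, this is roughly what structural conditioning looks like in an open pipeline. A minimal sketch assuming the diffusers library with the public SD 1.5 and Canny ControlNet checkpoints; the edge map, prompt, and file names are placeholders:

  import torch
  from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
  from diffusers.utils import load_image

  controlnet = ControlNetModel.from_pretrained(
      "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
  )
  pipe = StableDiffusionControlNetPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
  ).to("cuda")

  # The composition is pinned by the edge map, not described in words;
  # the text prompt only fills in style and content.
  edges = load_image("layout_canny.png")  # pre-computed Canny edge map of the desired layout
  image = pipe(
      "a dragon with leathery scales perched below a mountain peak",
      image=edges,
      num_inference_steps=30,
  ).images[0]
  image.save("controlled.png")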
It can do in-context learning from images you upload. So you can just upload a depth map or mark up an image with the locations of edits you want and it should be able to handle that. I guess my point is that since its the same model that understands how to see images and how to generate them you aren't restricted from interacting with it via text only.
Prompt adherence and additional tricks such as ControlNet/ComfyUI pipelines are not mutually exclusive. Both are very important to get good image generation results.
It is when it's kept behind an API. You cannot use Controlnet/ComfyUI and especially not the best stuff like regional prompting with this model. You can't do it with Gemini, and that's by design because otherwise coomers are going to generate 999999 anime waifus like they do on Civit.ai.
That just elicits a cheeky refusal I'm afraid:
"""
That's a fun idea—but generating an image with 999,999 anime waifus in it isn't technically possible due to visual and processing limits. But we can get creative.
Want me to generate:
1. A massive crowd of anime waifus (like a big collage or crowd scene)?
2. A stylized representation of “999999 anime waifus” (maybe with a few in focus and the rest as silhouettes or a sea of colors)?
3. A single waifu with a visual reference to the number 999999 (like a title, emblem, or digital counter in the background)?
Let me know your vibe—epic, funny, serious, chaotic?
"""
Yeah, but then it no longer replaces human artists.
Controlnet has been the obvious future of image-generation for a while now.
> Yeah, but then it no longer replaces human artists.
Automation tools are always more powerful as a force multiplier for skilled users than a complete replacement. (Which is still a replacement on any given task scope, since it reduces the number of human labor hours — and, given any elapsed time constraints, human laborers — needed.)
We're not trying to replace human artists. We're trying to make them more efficient.
We might find that the entire "studio system" is a gross inefficiency and that individual artists and directors can self-publish like on Steam or YouTube.
Yeah, it’s just ComfyUI replacing Photoshop
Exactly. OpenAI isn't going to win image and video.
Sora is one of the worst video generators. The Chinese have really taken the lead in video with Kling, Hailuo, and the open source Wan and Hunyuan.
Wan with LoRAs will enable real creative work. Motion control, character consistency. There's no place for an OpenAI Sora type product other than as a cheap LLM add-in.
Flux doesn't do text that well.
The real test for image generators is the image->text->image conversion. In other words it should be able to describe an image with words and then use the words to recreate the original image with a high accuracy. The text representation of the image doesn't have to be English. It can be a program, e.g. a shader, that draws the image. I believe in 5-10 years it will be possible to give this tool a picture of rainforest, tell it to write a shader that draws this forest, and tell it to add Avatar-style flying rocks. Instead of these silly benchmarks, we'll read headlines like "GenAI 5.1 creates a 3D animation of a photograph of the Niagara falls in 3 seconds, less than 4KB of code that runs at 60fps".
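One way to turn that round trip into a number today: caption the image, regenerate from the caption, and score how close the reconstruction lands to the original. A rough sketch, where describe() and generate() stand in for whatever captioner and generator are under test, and CLIP image embeddings (via transformers) supply the similarity score:

  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  def round_trip_score(original: Image.Image, describe, generate) -> float:
      caption = describe(original)        # image -> text (model under test)
      reconstruction = generate(caption)  # text -> image (model under test)
      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
      inputs = proc(images=[original, reconstruction], return_tensors="pt")
      with torch.no_grad():
          feats = model.get_image_features(**inputs)
      feats = feats / feats.norm(dim=-1, keepdim=True)
      return float(feats[0] @ feats[1])   # cosine similarity; higher = better round trip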
Why is that "the real test for image generators"? Most image generators don't inherently include image->text functionality at all, so this seems more a test of multimodal models that include both t2i and i2t functionality. Even then, I don't think humans would generally pass this test well (unless the human doing the description task was explicitly told that the purpose was reproduction, but that's not the usual purpose of either human or image-to-text model descriptions).
One very neat thing the interwebs are talking about is the ghiblification of family pictures. It’s actually pretty cute: https://x.com/grantslatton/status/1904631016356274286
In the coming days, people will anime-fy all sorts of images, for example historical ones: https://x.com/keysmashbandit/status/1904764224636592188
Really liked the fact that the team shared all the shortcomings of the model in the post. Sometimes product teams just highlight the best results and aren't forthcoming about areas that need improvement. Kudos to the OpenAI team on that.
> ChatGPT’s new image generation in GPT‑4o rolls out starting today to Plus, Pro, Team, and Free users as the default image generator in ChatGPT, with access coming soon to Enterprise and Edu. For those who hold a special place in their hearts for DALL·E, it can still be accessed through a dedicated DALL·E GPT.
> Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks.
That's it, folks. Tens of thousands of so-called "AI" image generator startups have been obliterated, taking digital artists with them; all reduced to near zero.
Now you have a widely accessible meme generator with the name "ChatGPT".
The last task is for an open weight model that competes against this and is faster and all for free.
> Tens of thousands of so-called "AI" image generator startups have been obliterated, taking digital artists with them; all reduced to near zero. Now you have a widely accessible meme generator with the name "ChatGPT".
ChatGPT has already had that via DALL-E. If it didn't kill those startups when that happened, this doesn't fundamentally change anything. Now it's got a new image gen model which — like DALL-E 3 when it came out — is competitive with or ahead of other SotA base models using just text prompts (the simplest generation workflow), but is both more expensive and less adaptable to more involved workflows than the tools anyone beyond a casual user (whether local or hosted) is using. This is station-keeping for OpenAI, not a meaningful change in the landscape.
There are several examples here, especially in the videos, that no existing image gen model can do; they would require tedious workflows and/or training regimens to replicate, maybe.
It's not 'just' a new model ala Imagen 3. This is 'what if GPT could transform images nearly as well as text?' and that opens up a lot of possibilities. It's definitely a meaningful change.
Yep. The coherence and text quality are insanely good. Keen to play with it to find its "mangled hands" style deficiencies, because of course they cherry-picked the best examples.
Has the meaning of the words "available today" changed since I learned them?
They're doing "vibe writing", where you just use words to mean random things regardless of whether they actually mean that.
Just curious, where do you see the words "available today"?
It is now available. It is still the same day of this release.
It's the next day and I am still only getting DALL-E images...
For those who are still getting the old DALL-E images inside ChatGPT, you can access the new model on Sora: https://sora.com/explore/images
I wanted to use this to generate funny images of myself. Recently I was playing around with Gemini Image Generation to dress myself up as different things. Gemini Image Generation is surprisingly good, although the image quality quickly degrades as you add more changes. Nothing harmful, just silly things like dressing me up as a wizard or other typical RPG roles.
Trying out 4o image generation... It doesn't seem to support this use case at all? I gave it an image of myself and asked it to turn me into a wizard, and it generated something that doesn't look like me in the slightest. On a second attempt, I asked it to add a wizard hat and it just used Python to add a triangle in the middle of my image. I looked at the examples and saw they had a direct image modification where they say "Give this cat a detective hat and a monocle", so I tried that with my own image, "Give this human a detective hat and a monocle", and it just gave me this error:
> I wasn't able to generate the modified image because the request didn't follow our content policy. However, I can try another approach—either by applying a filter to stylize the image or guiding you on how to edit it using software like Photoshop or GIMP. Let me know what you'd like to do!
Overall, a very disappointing experience. As another point of comparison, Grok also added image generation capabilities and while the ability to edit existing images is a bit limited and janky, it still manages to overlay the requested transformation on top of the existing image.
It's not actually out for everyone yet. You can tell by the generation style. 4o generates top down (picture goes from mostly blurry to clear starting from the top).
To quote myself from a comment on sora:
Iterations are the missing link. With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.
I'm looking forward to trying this out and seeing if I was right. Unfortunately it's not yet available for me.
You can do that with Gemini's image model, flash 2.0 (image generation) exp.[1] It's not perfect but it does mostly maintain likeness between generations.
[1]https://aistudio.google.com/prompts/new_chat
Whisk I think is possibly the best at it. No idea what it uses under the hood though.
https://labs.google/fx/tools/whisk
DALLE-3 with ChatGPT has been able to approximate this for a while now by internally locking the seed down as you make adjustments. It's not perfect by any means but can be more convenient than manual inpainting.
Ditto Instruct Pix2Pix https://www.timothybrooks.com/instruct-pix2pix
Reading other comments in other threads on HN has left me with the impression that iterative improvement within a single chat is not a good idea.
For example, https://news.ycombinator.com/item?id=43388114
You're right. I'm actually doing this quite often when coding: starting with a few iterative prompts to get a general outline of what I want, and when that's ok, copying the outline to a new chat and fleshing out the details. But that's still iterative work; I'm just throwing away the intermediate results that I think sometimes confuse the LLM.
Like its predecessor, this has most of its utility within the first response, and after that the quality rapidly degrades.
I think it's too biased towards reusing heuristics discovered in the first response, applying the same level of compute to subsequent requests.
It makes me kind of want to rewrite an interface that builds appropriate context and starts new chats for every request issued.
Still can't show me a clock that isn't 10:10.
Otherwise impressive.
Wow this works really well at editing existing photos.
I created an app to generate image prompts specifically for 4o. Geared towards business and marketing. Any feedback is welcome. https://imageprompts.app/
Is there a technical paper about the model architecture? The high resolution points to diffusion-style generation rather than purely token-based?
The fact that it nailed the awkward engineer high five (image 2) is pretty impressive as someone who only gives awkward high fives.
One of the fingers is the wrong way around… it’s a big improvement but it’s easy to find major problems, and these are the best of 8 images and presumably cherry picked.
> we see the photographer's reflection
Am I the only one immediately looking past the amazing text generation, the excellent direction following, the wonderful reflection, and screaming inside my head, "That's not how reflection works!"
I know it's super nitpicky when it's so obviously a leap forward on multiple other metrics, but still, that reflection just ain't right.
Could you explain more? I'm having trouble seeing anything weird in the reflection.
Edit: are we talking about the first or second image? I meant to say the image with only the woman seems normal. Image with the two people does seem a bit odd.
The first image, with the photographer holding the phone reflected in the white board.
Angle of incidence = angle of reflection. That means that the only way to see yourself in a reflective surface is by looking directly at it. Note this refers to looking at your eyes -- you can look down at a mirror to see your feet because your feet aren't where your eyes are.
You can google "mirror selfie" to see endless examples of this. Now look for one where the camera isn't pointing directly at the mirror.
From the way the white board is angled, it's clear the phone isn't facing it directly. And yet the reflection of the phone/photographer is near-center in frame. If you face a mirror and angle to the left the way the image is, your reflection won't be centered, it'll be off to the right, where your eyes can see it because you have a very wide field of view, but a phone would not.
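A quick numerical sanity check of that argument, using the standard reflection formula r = d - 2(d·n)n; the vectors here are made up purely for illustration:

  import numpy as np

  def reflect(d, n):
      """Reflect incident direction d about the unit surface normal n (law of reflection)."""
      n = n / np.linalg.norm(n)
      return d - 2 * np.dot(d, n) * n

  # Camera axis vs. a whiteboard angled roughly 45 degrees away from it:
  incident = np.array([1.0, 0.0, 0.0])
  normal = np.array([0.7, 0.7, 0.0])
  print(reflect(incident, normal))  # ~[0, -1, 0]: the reflected ray heads off to the side,
                                    # so the camera should not see itself centered in the board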
Haha, it's almost like early-2000s game logic: the reflection is just the opposite of the viewpoint, then reversed.
In games they did it by creating a duplicate then reversing it, I wonder if this is the same idea.
To avoid confusion, why not always use a general AI model upfront, then depending on the user's prompt, redirect it to a specific model?
The models are noticeably different — for example, o1 and o3 have reasoning, and some users (eg. me) want to tell the model when to use reasoning, and when not.
As to why they don't automatically detect when reasoning could be appropriate and then switch to o3, I don't know, but I'd assume it's about cost (and for most users the output quality is negligible). 4o can do everything, it's just not great at "logic".
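For what it's worth, a router like that is easy to sketch on top of the API. This is only an illustration using the OpenAI Python SDK; the model names and the one-word classifier prompt are arbitrary choices, not how ChatGPT actually routes:

  from openai import OpenAI

  client = OpenAI()

  ROUTER_PROMPT = (
      "Reply with exactly one word, REASONING or CHAT, depending on whether the "
      "following request needs multi-step reasoning."
  )

  def route(user_message: str) -> str:
      # Cheap classification pass first, then dispatch to the appropriate tier.
      verdict = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "system", "content": ROUTER_PROMPT},
                    {"role": "user", "content": user_message}],
      ).choices[0].message.content.strip()
      target = "o3-mini" if verdict == "REASONING" else "gpt-4o"
      return client.chat.completions.create(
          model=target,
          messages=[{"role": "user", "content": user_message}],
      ).choices[0].message.content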
They still all have a somewhat cold and sterile look to them. That's probably the 1% the next decade will be spent working out.
Edit: Please ignore. They hadn't rolled the new model out to my account yet. The announcement blog post is a bit misleading saying you can try it today.
--
Comparison with Leonardo.Ai.
ChatGPT: https://chatgpt.com/share/67e2fb21-a06c-8008-b297-07681dddee...
ChatGPT again (direct one shot): https://chatgpt.com/share/67e2fc44-ecc8-8008-a40f-e1368d306e...
ChatGPT again (using word "photorealistic instead of "photo"): https://chatgpt.com/share/67e2fce4-369c-8008-b69e-c2cbe0dd61...
Leonardo.Ai Phoenix 1.0 model: https://cdn.leonardo.ai/users/1f263899-3b36-4336-b2a5-d8bc25...
Is the "2D animation style" part you put at the beginning and then changed an attempt to see how well the AI responds to gaslighting?
My bad, I was trying the conversational aspect, but that's not an apples-to-apples comparison. I have put a direct one-shot example in the original post as well.
In my test a few months ago, I found that just starting a new prompt would not clear GPT's memory of what I had asked for in previous conversations. You might be stuck with 2D animation style for a while. :)
The ChatGPT examples don't look like the new Image Gen model yet. The text on the dog collar isn't very good.
Apparently it rolls out today to Plus (which I have). I followed the "Try in ChatGPT" link at the top of the post
On mine I tried it "natively" and in DALL-E mode and the results were basically identical, I think they haven't actually rolled it out to everyone yet.
It's rolling out to everyone starting today but i'm not sure if everyone has it yet. Does it generate top down for you (picture goes from mostly blurry to clear starting from the top) like in their presentation ?
No it didn't generate like that. Thanks for clarifying. I have updated my original post.
What did the prompt look like for Leonard.ai?
I'm curious if you said 2d animation style for both or just for chatgpt.
Edit: Your second version of chatgpt doesn't say photorealistic. Can you share the Leonard.ai prompt?
Added photorealistic, which made it worse.
Leonardo prompt: A golden cocker spaniel with floppy ears and a collar that says "Sunny" on it
Model: Phoenix 1.0 Style: Pro color photography
Saying "pro color photography" to ChatGPT doesn't get it any better either unfortunately: https://chatgpt.com/share/67e2fd91-8d24-8008-b144-92c832ed0b...
Yeah, it's just not good enough. The big labs are way behind what the image-focused labs are putting out. Flux and Midjourney are running laps around these guys.
Flux, most definitely.
Midjourney hasn't been SOTA for nearly a year now. It struggles to follow even marginally complex prompts from an adherence perspective.
In all fairness you _did_ say 2D animation style
True. I had that conversation before deciding to compare to others. I have updated the post with other fairer examples. Nowhere near Leonardo Phoenix or Flux for this simple image at least.
It does extremely well at creating images of copyrighted characters. DALL-E couldn't generate images of Miffy; this one can. Same for "Kikker en vriendjes", a Dutch children's book. There seems to be no copyright protection at all?
For the first time ever, it feels like it listens and actually tries to follow what I say. I managed to get a good photo of a dog on the beach with shoes, from a side angle, by consistently prompting it and making small changes from one image to another until I got my intended effect.
The pre-recorded short videos are a much better form of presentation than live-streamed announcements!
It’s pretty good, the interesting thing is when it fails it seems to often be able to reason about what went wrong. So when we get CoT scaffolding for this it’ll be incredibly competent.
So did they deprecate the ability to use DALL-E 3 to generate images? I asked the legacy ChatGPT 4 model to generate an image and it used the new 4o style image generator.
I think you can still use it on Bing.
Just curious if it works for creating a comic strip? I.e. will it maintain the consistency of the characters? I watched a video somewhere they demo'ed it creating comic panels, but I want to create the panels one by one.
I believe so! Since it is good at consistency and can be fed reference images, you can generate character references and feed those, along with the previous panels, to the model, working one panel at a time.
> I wasn’t able to generate the image because the combination of abstract elements and stylistic blending [...] may have triggered content filters related to ambiguous or intense visuals.
nah. i pass and stick with midjourney.
Interesting that in the second image the text on the whiteboard changes (top left)
It seems this is because the string "autoregressive prior" should appear on the right hand side as well, but in the second image it's hidden from view, and this has confused it to place it on the left hand side instead?
It also misses the arrow between "[diffusion]" and "pixels" in the first image.
So what's the lore on why this took over a _year_ to launch from the first announcement? It's fairly clear that their hand was forced by Google quietly releasing this exact feature a few weeks back, though.
The easy infographic generation scares me on the implications for society.
> All generated images come with C2PA metadata
How easy is this to remove? Is it just like exif data that can be easily stripped out, or is it baked in more permanently somehow
Yeah it's just metadata.
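The C2PA manifest lives in the file's metadata blocks, so anything that rewrites only the pixel data drops it. A minimal sketch with Pillow (file names are placeholders; this says nothing about any invisible watermark a provider might also embed in the pixels themselves):

  from PIL import Image

  src = Image.open("generated.png")
  # Copy only the pixels into a fresh image; EXIF/XMP blocks and C2PA manifests
  # are not carried over unless you explicitly re-attach them on save.
  clean = Image.new(src.mode, src.size)
  clean.paste(src)
  clean.save("stripped.png")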
I would love to see advancement in the pixel art space, specifying 64x64 pixels and attempting to make game-ready pixel art and even animations, or even taking a reference image and creating a 64x64 version
I tried a few of the prompts and the results I see are far worse than the examples provided. Seems like there will be some room for artists yet in this brave new world.
It hasn't actually rolled out to everyone yet in chat. You'll have to try it on sora to be sure.
I am blown away by the hyperrealistic renderings, especially of humans. It is getting to the point where I can no longer distinguish the AI ones.
Can you specify the output dimensions?
EDIT: Seems not, "The smallest image size I can generate is 1024x1024. Would you like me to proceed with that, or would you like a different approach?"
I suspect you can prompt aspect ratios.
fwiw i've added a summary of this discussion here (https://extraakt.com/extraakts/gpt-4o-image-generation-discu...) to keep track of the main points
Can anyone tell me when this will be available in the API? Or is it already available?
I couldn't find anything on the pricing page.
The page says in the following week, which is disappointing. It's likely we will see OpenAI favor their own product first more and more, an inversion of their more developer-oriented start.
Can someone explain what is going with 4o and Anime with Ghibli Style? Why is it suddenly all over x/twitter?
It isn't Ghibli style in particular; 4o image gen is just much better at maintaining any particular art style. The Ghibli ones stand out because one tweet blew up and people followed along.
That makes sense. Although the previous model's image gen wasn't bad with the Ghibli style either. I guess "maintaining a particular art style" is the point here. Thank you.
Someone posted a mildly amusing tweet that blew up and went really viral. Now lots of other people are copying the idea.
This dynamic happens on Twitter every day. Tomorrow it'll be a different craze.
It produces amazing results for me! But the wow effect would have been greater if they had released it a few months ago.
My version of the full glass of wine challenge is "clock face with 13 hour divisions". Nothing I've tried has been able to do it yet.
The garbled text on these things always just makes them basically useless, especially since it often adds text without being told to, like previous models.
you are being served the old model
A real improvement, but it still drew me a door with a handle where there should be one and an extra knob on the side where the hinges are.
saw this thread on X. here are some incredible use cases of 4o image generation: https://x.com/0xmetaschool/status/1904804251148443873
So Google released Gemini 2.5 and one hour later OpenAI comes with this. It’s almost childish at this point.
It's the other way around: OpenAI was going to announce it, so Google upstaged them.
Tried it, the "compise armporressed" and "Pros: made bord reqotons" didn't impress me in the slightest.
Are you sure you were even using the model from the post?
Pressed the "Try in ChatGPT", pasted the first prompt, became thoroughly unimpressed.
Update: Seems like the update didn't roll to my account when I tested. I can see it behaves differently now and it's as promised. It's very good.
Thankfully. It was outrageous how inferior DALL-E 3 was to any other image generation system.
The best thing about this is how the still of the livestream at the bottom is the most uncanny valley image.
Could they have switched to *both* image and text generation via diffusion, without tokens?
Still can't generate an analog clock face with a given time.
Can it draw the notorious glass of wine filled to the brim yet?
It can indeed
It is amazing how far text generation in images has come over the past 1-2 years
The reflections in the whiteboard are all off. Do they address this?
It bothers me to see links to content that requires a login. I don't expect openai or anyone else to give their services away for free. But I feel like "news" posts that require one to setup an account with a vendor are bad faith.
If the subject matter is paywalled, I feel that the post should include some explanation of what is newsworthy behind the link.
The linked post is not paywalled, did you click something else?
Thank you for the accurate correction. My whining was a bit unmerited. The link goes to a page that largely provides exactly what I asked for. It just starts out with an invitation to try it yourself. That invitation leads you to an app that requires a login. It was unfair of me to be triggered by that invitation.
After that invitation there are several examples that boil down to: "Hey look. Our AI can generate deep fakes." Impressive examples.
Not a criticism, but it stands out how all the researchers or employees in these videos are non-native English speakers (i.e. not American). Nothing wrong with that, on the contrary; it just seems odd that the only American is Altman. Same thing with the last videos from Zuck, if I recall correctly. Especially in this Trump era of MAGA.
someone try to make an open source version of it. I need to know the inner workings of this cool thing.
This technology should have never existed. Thank you OpenAI for being a contributing factor to destroying politics in future.
And I hope that people who worked on this know this. They are pure evil.
Attention to every detail, even the awkward nerd high-five.
The question I have is: when do we get an open-source version of this form of image generation? Will we see diffusion models moving into this space?
Still seems to have problems with transparent backgrounds.
That's expected with any image generating models because they aren't trained with an alpha channel.
It's more pragmatic to pipeline the results to a background removal model.
EDIT: It appears GPT-4o is different, as there is a video demo dedicated to transparency.
There's an entire video in the post dedicated to how well it does transparency: https://openai.com/index/introducing-4o-image-generation/?vi...
I suspect we're getting a flood of comments from people who are using DALL-E.
The video was helpful. I started with the prompt "Generate a transparent image. "
And that created the isolated image on a transparent background.
Thank-you.
Huh, I missed that. I'm skeptical of the results in practice, though.
This one however explicitly advertises good transparency support.
There's a mod for stable diffusion webui forge/automatic1111/ComfyUI which enables this for all diffusion models (except these closed source ones).
SD extensions like rembg are post-processing effects - with their video transparency demo I'd be curious if 4o actually did training with an alpha channel.
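That matting pass is a one-liner with the rembg package, for reference. A minimal sketch of the post-processing step (file names are placeholders):

  from PIL import Image
  from rembg import remove  # pip install rembg

  src = Image.open("generated.png")  # model output with the background baked into the pixels
  cut = remove(src)                  # returns an RGBA image with an estimated alpha matte
  cut.save("subject_transparent.png")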
> we’ve built our most advanced image generator yet into GPT‑4o. The result—image generation that is not only beautiful, but useful.
Sorry, but how are these useful? None of the examples demonstrate any use beyond being cool to look at.
The article vaguely mentions 'providing inspiration' as possible definition of 'useful'. I suppose.
I wish AI companies would release new things once a year, like at CES or how Apple does it. This constant stream of releases and announcements feels like it's just for attention.
Apple held three big keynotes in 2024 plus multiple product announcements via press releases:
May 7, 2024 - The “Let Loose” event, focusing on new iPads, including the iPad Pro with the M4 chip and the iPad Air with the M2 chip, along with the Apple Pencil Pro.
June 10, 2024 - The Worldwide Developers Conference (WWDC) keynote, where Apple introduced iOS 18, macOS Sequoia, and other software updates, including Apple Intelligence.
September 9, 2024 - The “It’s Glowtime” event, where Apple unveiled the iPhone 16 series, Apple Watch Series 10, and AirPods 4.
Via Press releases: MacBook Air with M3 on March 4, the iPad mini on October 15, and various M4-series Macs (MacBook Pro, iMac, and Mac mini) in late October.
I really hadn't noticed all of those! I'm mostly interested in Macs, so I probably subconsciously filter out the other announcements. I guess I haven't developed that level of 'ignorance' towards AI yet.
Attention is all they need.
Maybe they need to Reform their strategy.
The periodic table poster under "High binding problems" is billed as evidence of model limitations, but I wonder if it just suggests that 4o is a fan of "Look Around You".
Why won't they add benchmarks against o1?
literally spent all day playing with this until I ran out of image gen capacity a lil while ago.
so much fun.
Still failing the wine glass test,
https://imgur.com/a/aS8e0UY
Worked for me with a clarification. Looks pretty great, actually. https://imgur.com/a/V8eQWi6
It was easy to fix though, I just said "all the way full" and it got it on the next try. Which makes sense, a full pour is actually "overfull" given normal standards.
that's probably dalle
This, to me, is the clearest example of why there is no actual physical-simulation understanding of the world in these models.
that "best of 8" is doing a lot of work. i put in the same input and the image is awful.
Everyone should try running their prompts and see how over hyped this is. The results I get are terrible comparatively.
I don't think the new model is rolled out to all users yet.
This is literally the biggest leap in image generation in a long time.
What is the api price?
This is incredibly impressive, but it's still theft of assets.
Where are the lawyers where you need them?
well it failed on me, after many tries:
...Once the wait time is up, I can generate the corrected version with exactly eight characters: five mice, one elephant, one polar bear, and one giraffe in a green turtleneck. Let me know if you'd like me to try again later!
[dead]
[flagged]
[flagged]
[flagged]
Whelp. That's terrifying.
They say it must be an important OpenAI announcement when they bring out the twink.
LPT: while the benchmarks don't show it, GPT-4 > 4o. It amazes me that people use 4o at all. But hey, it's the brand name and it's free.
Of course 4.5 is best, but it's slow and I'm afraid I'm going to hit limits.
OpenAI themselves discourage using GPT-4 outside of legacy applications, in favor of GPT-4o (they are shutting down the large-output gpt-4-32k variants in a few months). GPT-4 is also an order of magnitude more expensive and slower.
I think both of these points are what sow doubt in some people in the first place because both could be true if GPT-4 was just less profitable to run, not if it was worse in quality. Of course it is actually worse in quality than 4o by any reasonable metric... but I guess not everyone sees it that way.
Similar to regular LLM plagiarism, it's pretty obvious that visual artefacts like the loadout screen for the RPG cat (video game heading), which is inspired by Diablo, aren't unique at all and are just the result of other people's efforts and livelihoods.
Garbage compared to Midjourney. I don't even know why you'd market this. It takes a minute or more, and the results are what I'd say Midjourney looked like 1.5 years ago.
Prompt adherence is far, far ahead of midjourney. FAR.
I don't have an hour to work an image. It's slow as hell.
Did they time it with the Gemini 2.5 launch? https://news.ycombinator.com/item?id=43473489
Was it public information when Google was going to launch their new models? Interesting timing.
"Interesting timing" It's like the 4th time by my counting they've done this
OpenAI was started with the express goal of undermining Google's potential lead in AI. The fact that they time launches to Google launches to me indicates they still see this as a meaningful risk. And with this launch in particular I find their fears more well-founded than ever.