M4v3R 3 hours ago

I think they've buried the lede with their image editing capabilities, which seem to be very good! OpenAI's model will change the whole image while editing, messing up details in unrelated areas. This one seems to perfectly preserve the parts of the image unrelated to your query and selectively apply the edits, which is very impressive! The only downside is the output resolution (the resulting image is 1184px wide even though the input image was much larger).

For a quick test I've uploaded a photo of my home office and asked the following prompt: "Retouch this photo to fix the gray panels at the bottom that are slightly ripped, make them look brand new"

Input image (rescaled): https://i.imgur.com/t0WCKAu.jpeg

Output image: https://i.imgur.com/xb99lmC.png

I think it did a fantastic job. The output image quality is ever so slightly worse than the original but that's something they'll improve with time I'm sure.

  • sync 2 hours ago

    FYI, your Input and output URLs are the same (I thought I was crazy for a sec trying to spot the differences)

    • M4v3R 2 hours ago

      whoops, sorry about that, fixed

  • bakkoting an hour ago

    Kontext is probably better at this specific task, if that's what Mistral is using. Certainly faster and cheaper. But:

    OpenAI just yesterday added the ability to do higher fidelity image edits with their model [1], though I'm not sure if the functionality is only in the API or if their chat UI will make use of this feature too. Same prompt and input image: [2]

    [1] https://x.com/OpenAIDevs/status/1945538534884135132

    [2] https://i.imgur.com/w5Q0UQm.png

  • joshcartme 38 minutes ago

    Wow, that really is amazing!

    I couldn't help but notice that you can still see the shadows of the rips in the fixed version. I wonder how hard it would be to get those fixed as well.

  • pablonaj 2 hours ago

    They are using Flux Kontext from Black Forest Labs, fantastic model.

    • koakuma-chan an hour ago

      So Mistral is just hosting a Flux model?

      • Squarex an hour ago

        Yes, but it's great that they are both made by european companies.

tdhz77 4 hours ago

I’m struggling with MRF: Model Release Fatigue. It’s the syndrome of constantly context-switching between new large models: Claude 4, GPT, Llama, Gemini 2.5, pro/mini variants, Mistral.

I fire up the IDE, switch the model, and think, oh great, this is better. Then I switch to something that worked before and, man, this sucks now.

Context-switching LLMs: Model Release Fatigue.

  • reilly3000 3 hours ago

    Not to invalidate your feelings of fatigue, but I’m sure glad that there are a lot of choices in the marketplace, and that they are innovating at a decent clip. If you’re committed to always be using the best of all options you’re in for a wild ride, but it beats stagnation and monopoly.

    • ivape 3 hours ago

      We’re also headed into a world where there will be very few open weight models coming out (Meta going closed source, not releasing Behemoth). This era of constant model releases may be over before it even started. Gratitude definitely needs to be echoed.

      • randomNumber7 2 hours ago

        I don't agree with that. I didn't expect we'd ever get open-weight models close to the current state of the art, yet China delivered some real burners.

      • echelon 2 hours ago

        If China stays open, then the rest of the world will build on open. I'm frankly shocked that a domestic player isn't doing this.

        Fine tuning will work for niche business use cases better than promises of AGI.

        • kakapo5672 an hour ago

          It's curious that China is carrying the open banner nowadays. Why is that?

          One theory is that they believe the real endpoint value will be embodied AIs (i.e. robots), where they think they'll hold a long-term competitive advantage. The models themselves will become commoditized, under the pressure of the open-source models.

        • seszett 2 hours ago

          > If China stays open, then the rest of the world will build on open

          I was listening to a Taiwanese news channel earlier today and although I wasn't paying much attention, I remember hearing about how Chinese AIs are biased towards Chinese political ideas and that some programme to create a more Taiwanese-aligned AI was being put in place.

          I wouldn't be surprised if, for this reason alone, at least a few different open models kept being released: even if they don't directly bring in money, several actors care more about spreading or defending their ideas, and AIs are perfect for that.

  • bee_rider 3 hours ago

    A major reason I haven’t really tried any of these things (despite thinking they are vaguely neat). I think I will wait until… 2026, second half, most likely. At least I’ll check if we have local models and hardware that can run them nicely, by then.

    Hats off to the folks who have decided to deal with the nascent versions though.

    • Uehreka an hour ago

      When ChatGPT, then Llama, then Alpaca came out in rapid succession, I decided to hold off a year before diving in. This was definitely the right choice at the time, it’s becoming less-the-right-choice all the time.

      In particular it’s important to get past the whole need-to-self-host thing. Like, I used to be holding out for when this stuff would plateau, but that keeps not happening, and the things we’re starting to be able to build in 2025 now that we have fairly capable models like Claude 4 are super exciting.

      If you just want locally runnable commodity “boring technology that just works” stuff, sure, cool, keep waiting. If you’re interested in hacking on interesting new technology (glances at the title of the site) now is an excellent time to do so.

    • Nezteb 3 hours ago

      Depending on the definition of "nicely", FWIW I currently run an Ollama server [1] + Qwen Coder models [2] with decent success compared to the big hosted models. Granted, I don't utilize most "agentic" features and still mostly use chat-based interactions.

      The server is basically just my Windows gaming PC, and the client is my editor on a macOS laptop.

      Most of this effort is so that I can prepare for the arrival of that mythical second half of 2026!

      [1] https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...

      [2] https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22...
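For anyone wanting to replicate a setup like this, here is a minimal sketch of a client on the laptop talking to a remote Ollama server over the LAN. The IP address and model tag are placeholders, not part of the comment above; per the Ollama FAQ linked at [1], the server side needs OLLAMA_HOST=0.0.0.0 so it listens beyond localhost.

```python
import json
import urllib.request

# Hypothetical LAN address of the machine running `ollama serve`
# (port 11434 is Ollama's default).
OLLAMA_URL = "http://192.168.1.50:11434"

def build_chat_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/chat endpoint, non-streaming."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Sends one chat turn and returns the assistant's reply text."""
    data = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage, assuming a server is reachable at OLLAMA_URL and the model is pulled:
#   chat("qwen2.5-coder:14b", "Explain what this code does.")
```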

      • QRY 2 hours ago

        Thanks for sharing your setup! I'm also very interested in running AI locally. In which contexts are you experiencing decent success? eg debugging, boilerplate, or some other task?

        • bogzz 2 hours ago

          I'm running Qwen via Ollama on my M4 Max 14-inch with the OpenWebUI interface; it's silly easy to set up.

          Not useful though, I just like the idea of having so much compressed knowledge on my machine in just 20 GB. In fact I disabled all Siri features because they're dogshit.

      • Kostic 2 hours ago

        Agentic editing is really nice. If on VSCode, Cline works well with Ollama.

    • randomNumber7 2 hours ago

      It is completely unreasonable to buy the hardware to run a local model and only use it 1% of the time. It will be unreasonable in 2026 and probably very long after that.

      Maybe something like a collective that buys the GPUs together and then uses them without leaking data could work.

    • nosianu 3 hours ago

      I have a modified tiered approach, that I adopted without consciously thinking hard about it.

      I use AI mostly for problems on my fringes. Things like manipulating some Excel table somebody sent me with invoice data from one of our suppliers, plus some moderately complex question that they (pure business) don't know how to handle, where simple formulas would not be sufficient and I would have to start learning Power Query. I can tell the AI exactly what I want in human language and don't have to learn a system that I only use because people here use it to fill holes not yet served by "real" software (databases, automated EDI data exchange, and code that automates the business processes). It works great, and it saves me hours on fringe tasks that people outsource to me but that I don't really want to deal with too much either.

      For example, I also don't check various vendors and models against one another. I still stick to whatever the default is from the first vendor I signed up with, and so far it worked well enough. If I were to spend time checking vendors and models, the knowledge would be outdated far too quickly for my taste.

      On the other hand, I don't use it for my core tasks yet. Too much movement in this space, I would have to invest many hours in how to integrate this new stuff when the "old" software approach is more than sufficient, still more reliable, and vastly more economical (once implemented).

      Same for coding. I ask AI on the fringes where I don't know enough, but in the core that I'm sufficiently proficient with I wait for a more stable AI world.

      I don't solve complex sciency problems, I move business data around. Many suppliers, many customers, different countries, various EDI formats; everybody has slightly different data, naming, and procedures. For example, I have to deal with one vendor wanting some share of pre-payment early in the year, which I have to apply to thousands of invoices over the year, while tracking when we have to pay hundreds or thousands of other invoices, all with different payment conditions and timings. If I were to ask the AI, I would have to be so super specific I may as well write the code.

      But I love AI on the not-yet-automated edges. I'm starting to show others how they can ask an AI, and many are surprised how easy it is - when you have the right task and know exactly what you have and what you want. My last colleague-convert was someone already past retirement age (still working on the business side). I think this is a good time to gradually teach regular employees some small use cases to get them interested, rather than some big top-down approach that mostly creates more work and leaves many people rightly questioning what the point is.

      About politically-touched questions, like whether I should use an EU-made AI like the one this topic is about or one from a US vendor that already dominates much of the software world, I don't care at this point, because I'm not yet creating any significant dependencies. I am glad to see it happening though (as an EU country citizen).

      • bee_rider 2 hours ago

        > About politically-touched questions, like whether I should use an EU-made AI like the one this topic is about or one from a US vendor that already dominates much of the software world, I don't care at this point, because I'm not yet creating any significant dependencies. I am glad to see it happening though (as an EU country citizen).

        Another nice thing about waiting a bit—one can see how much (if any) the EU models get from paying the “do things somewhat ethically” price. I suspect it won’t be much of a penalty.

  • sva_ 2 hours ago

    All the competition is great for me. I'm using premium models all the time and have barely spent a few euros on them, as there are always offers that are almost free if you look around.

  • emilsedgh 4 hours ago

    Why do you even follow? Just stick to one that works well for you?

    • barbazoo 3 hours ago

      Totally, though I feel like you do have to pay some attention. For example, in the context I'm working in, for the last while Gemini was our gold standard for code generation, whereas today Claude subjectively produces the better results. Sure, you can stick to what worked, but then you're missing the opportunity to be more productive or less busy, whichever one you choose.

      • exe34 2 hours ago

        I remember the days when I was looking for the perfect note-taking system/setup - I never achieved anything with it, I was too busy figuring out the best way to take notes.

        • barbazoo 2 hours ago

          Once we find the best way though...

    • tartoran 2 hours ago

      FOMO may be one of the reasons amongst others.

  • zamadatix 2 hours ago

    Much like with new computer hardware, announcements are constant but they rarely entice me to drop one thing and switch to another. If an average user picked a top-3 option last year and stuck with it through now, they didn't really miss out on all that much, even if their particular choice wasn't the absolute latest and greatest the entire time.

    • wahnfrieden 2 hours ago

      Sticking with one year old models would mean no o3 which is a huge loss for dev work

  • didibear 4 hours ago

    I believe the performance of previous versions gets worse because providers reallocate resources to newer versions, and also because of the training data cut-off in previous years. This is what happened between Claude Sonnet 3.5 and 3.7.

    Personally I only use Claude/Anthropic and ignore other providers because I understand it more. It's smart enough; I rarely need the latest and greatest.

  • vouaobrasil 2 hours ago

    An alternative: don't use LLMs. Focus on the enjoyment of coding, not on becoming more efficient. Because the lion's share of the gains from increased efficiency are mainly going to the CEOs.

    • freedomben an hour ago

      This might be good short-term advice, but in the medium and long term I think devs who don't use any AI will start to be much slower at delivery than devs who do. I'm already seeing it IRL (and I'm not a fan of AI coding, so this sucks for me)

      • ivape an hour ago

        Slower in initial delivery, maybe, but maintenance and debugging of production applications usually requires intimate knowledge of the code base. The amount of code AI writes will require AI itself to manage, since no human would inundate themselves with that much code. Will it be faster even so? We simply won't know, because those vibe-coded apps have just entered production. The horror stories can't be written yet because the horror is ongoing.

        I’m big on AI, but vibe coding is such a fuck around and find out situation.

    • wahnfrieden an hour ago

      This is HN, we are not all wage workers here

      For wage workers, not learning the latest productivity tools will result in job loss. By the time it is expected of your role, if you have not learned already, you won't be given the leniency to catch up on company time. There is no impactful resistance to this through individual protest, only by organizing your peers in industry

  • mrcwinn 3 hours ago

    What a luxury!

    One way to avoid this: stick with one LLM and bet on the company behind it (meaning, over time, they’ll always have the best offering). I’ve bet on OpenAI. Others can make different conclusions.

    • tenuousemphasis 3 hours ago

      When the medicine is worse than the disease...

  • criemen 3 hours ago

    I totally get it. Due to my work, I mostly keep up with new model releases, but the pace is not sustainable for individuals, or the industry. I'm hoping that model releases (and the entire development speed of the field) will slow down over time, as LLMs mature and most low-hanging fruits in model training have been picked. Are we there yet? Surely not.

  • sunaookami 3 hours ago

    You only need Claude and GPT. Everything else is not worth your time.

Aissen 3 hours ago

The Voxtral release seemed interesting, because it brought back competitive open-source audio transcription. I wonder if it was necessary to have an LLM backbone (vs a pure-function model), but the approach is interesting.

  • nomad_horse 3 hours ago

    > brought back competitive open source audio transcription

    Bear in mind that there are a lot of very strong _open_ STT models that Mistral's press release didn't bother to compare to, giving the impression they are the best new open thing since Whisper. Here is an open benchmark: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard . The strongest model Mistral compared to is Scribe, ranked 10th there.

    This benchmark is for English, but many of those models are multilingual (eg https://huggingface.co/nvidia/canary-1b-flash )

    • espadrine 2 hours ago

      The best model there is 2.5B parameters. I can believe that a model 10x bigger is somewhat better.

      One element of comparison is OpenAI Whisper v3, which achieves 7.44 WER on the ASR leaderboard, and shows up as ~8.3 WER on FLEURS in the Voxtral announcement[0]. If FLEURS has +1 WER on average compared to ASR, it would imply that Voxtral does have a lead on ASR.

      [0]: https://mistral.ai/news/voxtral
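The cross-dataset reasoning above can be made explicit with a quick back-of-the-envelope calculation. The assumption (not established by either benchmark) is that the FLEURS-vs-ASR-leaderboard offset measured on Whisper holds for other models too:

```python
# Whisper large-v3 appears in both benchmarks, so it anchors the comparison.
whisper_wer_asr = 7.44     # Open ASR leaderboard figure (from the comment)
whisper_wer_fleurs = 8.3   # FLEURS English figure, per the Voxtral announcement

# Assumed dataset offset: FLEURS scores run about this much higher than ASR.
offset = whisper_wer_fleurs - whisper_wer_asr  # ~0.86

def estimate_asr_wer(fleurs_wer: float) -> float:
    """Project a FLEURS WER onto the ASR leaderboard, assuming a constant offset."""
    return fleurs_wer - offset

# e.g. a model scoring 7.0 WER on FLEURS would project to ~6.14 on the ASR
# leaderboard under this assumption - which is exactly the kind of
# cross-dataset arithmetic the reply below objects to.
```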

      • nomad_horse 2 hours ago

        There are larger models in there, an 8B and a 6B. By this logic they should be above the 2B model, yet we don't see that. That's why we have open standard benchmarks: to measure this directly, not to hypothesize from model sizes or do cross-dataset arithmetic.

        Also note that Voxtral's capacity is not necessarily all devoted to speech, since it "Retains the text understanding capabilities of its language model backbone"

behnamoh 3 hours ago

At this point, the entire AI industry seems to just copy OpenAI for the most part. I cannot help but notice that we have the same services just offered by different companies. The amount of innovation in this release is not that high, actually.

  • klntsky 3 hours ago

    They are not the same service. There is A LOT of difference between offerings if you actually use the models for daily tasks like coding.

  • mirekrusin 2 hours ago

    The whole world is now building stuff on top of an `f(input: string): string` function - of course the offerings are going to be similar.
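A minimal sketch of that point: once every vendor hides behind the same string-to-string signature, swapping one for another is trivial, which is why the products look alike. The names below are illustrative, not any real client API:

```python
from typing import Callable

# The entire interface, as the comment puts it: f(input: string): string
LLM = Callable[[str], str]

def with_system_prompt(model: LLM, system: str) -> LLM:
    """Wraps any provider behind the same signature; swapping vendors is a one-liner."""
    return lambda prompt: model(f"{system}\n\n{prompt}")

# Stand-in "model" for illustration; a real vendor client call would go here.
echo: LLM = lambda prompt: prompt.upper()

assistant = with_system_prompt(echo, "Be terse.")
print(assistant("hello"))  # -> BE TERSE.\n\nHELLO
```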

  • cubefox 3 hours ago

    > At this point, the entire AI industry seems to just copy OpenAI for the most part

    Well, OpenAI copied the Deep Research feature from Google. They even used the same name (as does Mistral).

    • cowpig an hour ago

      Weird that you're being downvoted for stating a fact.

      All of the major labs are innovating and copying one another.

      Anthropic has all of the other labs trying to come up with an "agentic" protocol of their own. They also seem to be way ahead on interpretability research.

      DeepSeek came up with multi-head latent attention and published an open-source model that's huge and SOTA.

      DeepMind's way ahead on world models.

      ...

  • scotty79 3 hours ago

    That's what healthy competition in a free market looks like. Things like Apple that "stay innovative" for decades are an aberration caused by monopolistic gatekeeping.

    • behnamoh 3 hours ago

      > Things like Apple are an aberration.

      This used to be a good example of innovation that is hard to copy. But it doesn't apply anymore for two reasons:

      1. Apple went from being an agile, pro-developer, creative company to an Oracle-style, old-guard, cash-cow company; not much innovation is happening at Apple anymore.

      2. To their surprise, much of what they call "innovative" is actually pretty easy to replicate on other platforms. It took 4 hours for Flutter folks to re-create Liquid Glass...

      • overfeed an hour ago

        > This used to be a good example of innovation that is hard to copy.

        Steve Jobs did say they "patented the hell out of [the iPhone]" and went about saber-rattling. Then came the patent wars, which proved that Apple also relies on innovation by others and that patent workarounds would still result in competitive products, and things calmed down afterwards.

    • croes 3 hours ago

      They often copied others, but because Apple is more popular they got the fame for "their" innovation.

  • croes 3 hours ago

    It’s basically the same technology everywhere. Maybe a difference in training data and computing power.

jddj 4 hours ago

The examples aren't great. The personal planning one, for example, answers the prompt better without deep research than with it (with deep research it answers only the visas point).

bangaladore 2 hours ago

If you haven't tried OpenAI's deep research feature, you're missing out. I'm not sure of any good alternatives; I've tried Google's, and I'm not impressed.

There is a lot of value in, say, engineers doing tradeoff studies using these tools as a huge head start.

  • crmd 2 hours ago

    It’s been invaluable to me for market research related to starting a business. It’s like having a bright early career new hire research assistant/product manager “on staff” to collaborate with.

  • ripley12 an hour ago

    Anthropic's Research is pretty good; I'd say on par with OpenAI's.

    Agreed about Google, accuracy is a little better on the paid version but the reports are still frustrating to read through. They're incredibly verbose, like an undergrad padding a report to get to a certain word count.

    • the_duke an hour ago

      That's Gemini Pro now in general. The initial preview was pretty good, but the newer iterations are incredibly verbose.

      "Be terse" is a mandatory part of the prompt now.

      Either it's to increase token counts so they can charge more, or to show better usage-growth metrics internally or to shareholders, or just some odd effect of fine-tuning / the system prompt... who knows.

  • freedomben an hour ago

    I've gotten pretty different results from OpenAI and Gemini, though it's hard to say one is better/worse than the other. Just different

  • ankit219 2 hours ago

    Try the one from Kimi K2 as well. I was surprised how good it turned out to be.

  • criemen an hour ago

    Perplexity's isn't bad? Though I lack the OpenAI subscription to compare.

chickenzzzzu an hour ago

Back in my day, La Chat was a rapper, not a wrapper.

  • dust42 an hour ago

    Actually, Le Chat is French for "the (male) cat". 'Le Chat' is also a well-known laundry detergent (of German origin, from the Henkel company). The headline 'Le Chat takes a deep dive' thus reads as 'the cat takes a deep dive'. As there is a collaboration with the (German) Black Forest Labs, this is all pretty funny for a French-speaking person. 'La Chatte' is the female cat, and also colloquial for female private parts.

BrunoWinck 2 hours ago

I needed that. Now I have it :)

htrp 4 hours ago

Is anyone doing online reviews of model performance? (I know Artificial Analysis does some work on infrastructure and has an intelligence index)

  • reckless 3 hours ago

    The aggregate picture only tells you so much.

    Sites like simonwillison.net/2025/jul/ and channels like https://www.youtube.com/@aiexplained-official also cover new model releases pretty quickly for some "out of the box thinking/reasoning" evaluations.

    For me and my usage I can really only tell if I start using the new model for tasks I actually use them for.

    My personal benchmark andrew.ginns.uk/merbench has full code and data on GitHub if you want a starting point!