I get so confused on this. I play around, test, and mess with LLMs all the time and they are miraculous. Just amazing, doing things we dreamed about for decades. I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out. It talks to me like a person. It generates really cool images. It helps me write code. And just tons of other stuff that astounds me.
And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly? This is the most amazing technology I've experienced as a 50+ year old nerd that has been sitting deep in tech for basically my whole life. This is the stuff of science fiction, and while there totally are limitations, the speed at which it is progressing is insane. And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
The technology is not just less than superintelligence, for many applications it is less than prior forms of intelligence like traditional search and Stack Exchange, which were easily accessible 3 years ago and are in the process of being displaced by LLMs. I find that outcome unimpressive.
And this Tweeter's complaints do not sound like a demand for superintelligence. They sound like a demand for something far more basic than the hype has been promising for years now.
- "They continue to fabricate links, references, and quotes, like they did from day one."
- "I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error." (Why have these companies not manually engineered out a problem like this by now? Just do a check to make sure links are real. That's pretty unimpressive to me.)
- "They reference a scientific publication, I look it up, it doesn't exist."
- "I have tried Gemini, and actually it was even worse in that it frequently refuses to even search for a source and instead gives me instructions for how to do it myself."
- "I also use them for quick estimates for orders of magnitude and they get them wrong all the time. "
- "Yesterday I uploaded a paper to GPT to ask it to write a summary and it told me the paper is from 2023, when the header of the PDF clearly says it's from 2025. "
A municipality in Norway used an LLM to create a report about the school structure in the municipality (how many schools are there, how many should there be, where should they be, how big should they be, pros and cons of different size schools and classes etc etc). Turns out the LLM invented scientific papers to use as references and the whole report is complete and utter garbage based on hallucinations.
I agree. I use LLMs heavily for gruntwork development tasks (porting shell scripts to Ansible is an example of something I just applied them to). For these purposes, it works well. LLMs excel in situations where you need repetitive, simple adjustments on a large scale, i.e. swap every postgres insert query with the corresponding mysql insert query.
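To give a concrete (simplified) flavour of that kind of task, it's rewrites roughly like the sketch below; this is only an illustrative Python sketch, real query translation has far more edge cases than a one-line regex:

    import re

    def pg_upsert_to_mysql(query: str) -> str:
        # Rewrite a Postgres-style upsert clause into the MySQL equivalent.
        # Only handles the simple literal-value case; EXCLUDED.* references,
        # quoting differences, etc. would need more work.
        return re.sub(
            r"ON CONFLICT \([^)]*\) DO UPDATE SET",
            "ON DUPLICATE KEY UPDATE",
            query,
            flags=re.IGNORECASE,
        )

    print(pg_upsert_to_mysql(
        "INSERT INTO users (id, name) VALUES (1, 'alice') "
        "ON CONFLICT (id) DO UPDATE SET name = 'alice'"
    ))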
A lot of the "LLMs are worthless" talk I see tends to follow this pattern:
1. Someone gets an idea, like feeding papers into an LLM, and asks it to do something beyond its scope and proper use-case.
2. The LLM, predictably, fails.
3. Users declare not that they misused the tool, but that the tool itself is fundamentally corrupted.
In my mind it's no different from the steamroller being invented, and people remarking on how well it flattens asphalt. Then a vocal group tries to use this flattening device to iron clothing in bulk, and declares steamrollers useless when it fails at this task.
>swap every postgres insert query with the corresponding mysql insert query.
If the data and relationships in those insert queries matter, at some unknown future date you may find yourself cursing your choice to use an LLM for this task. On the other hand you might not ever find out and just experience a faint sense of unease as to why your customers have quietly dropped your product.
I’ve already seen people completely mess things up. It’s hilarious. Someone who thinks they’re in “founder mode” and a “software engineer” because ChatGPT or Cursor vomited out 800 lines of Python code.
The vileness of hoping people suffer aside, anyone who doesn’t have adequate testing in place is going to fail regardless of whether bad code is written by LLMs or Real True Super Developers.
What vileness? These are people who are gleefully sidestepping things they don't understand and putting tech debt onto others.
I'd say maybe up to 5-10 years ago, there was an attitude of learning something to gain mastery of it.
Today, it seems like people want to skip levels which eventually leads to catastrophic failure. Might as well accelerate it so we can all collectively snap out of it.
The mentality you're replying to confuses me. Yes, people can mess things up pretty badly with AI. But I genuinely don't understand why the assumption that anyone using AI is also not doing basic testing, or code review.
Right, which is why you go back and validate code. I'm not sure why the automatic assumption that implementing AI in a workflow means you blindly accept the outputs. You run the tool, you validate the output, and you correct the output. This has been the process with every new engineering tool. I'm not sure why people assume first that AI is different, and second that people who use it are all operating like the lowest common denominator AI slop-shop.
In this analogy, are all the steamroller manufacturers loudly proclaiming how it 10xes the process of bulk ironing clothes?
And is a credulous executive class en masse buying into that steamroller industry marketing and the demos of a cadre of influencer vibe ironers who’ve never had to think about the longer-term impacts of steamrolling clothes?
Thank you for mentioning that! What a great example of something an LLM can do pretty well that would otherwise take a lot of time looking up Ansible docs to figure out the best way to do things. I'm guessing the outputs aren't as good as what someone really familiar with Ansible could do, but it's a great place to start! It's such a good idea that it seems obvious in hindsight now :-)
Exactly, yeah. And once you look over the Ansible, it's a good place to start and expand. I'll often have it emit helm charts for me as templates; then, after the tedious setup of the helm chart is done, the rest of it is me manually doing the complex parts and customizing in depth.
Plus, it's a generic question; "give a helm chart for velero that does x y and z" is as proprietary as me doing a Google search for the same, so you're not giving proprietary source code to OpenAI/wherever so that's one fewer thing to worry about.
Yeah, I tend to agree. The main reason that I use AI for this sort of stuff is it also gives me something complete that I can then ask questions about, and refine myself. Rather than the fragmented documentation style "this specific line does this" without putting it in the context of the whole picture of a completed sample.
I'm not sure if it's a facet of my ADHD, or mild dyslexia, but I find reading documentation very hard. It's actually a wonder I've managed to learn as much as I have, given how hard it is for me to parse large amounts of text on a screen.
Having the ability to interact with a conversational type documentation system, then bullshit check it against the docs after is a game changer for me.
That's another thing! People are all "just read the documentation". The documentation goes on and on about irrelevant details. How do people not see the difference between "do x with library" -> "code that does x", and having to read a bunch of documentation to produce a snippet of code that does the same x?
I'm not sure I follow what you mean, but in general yes. I do find "just read the docs" to be a way to excuse not helping team members. Often docs are not great, and tribal knowledge is needed. If you're in a situation where you're either working on your own and have no access to that, or in a situation where you're limited by the team member's willingness to share, then AI is an OK alternative within limits.
Then there's also the issue that examples in documentation are often very contrived, and sometimes more confusing. So there's value in "work up this to do such and such an operation" sometimes. Then you can interrogate the functionality better.
No, it says that people dislike liars. If you are known for making up things constantly, you might have a harder time gaining trust, even if you're right this time.
1. LLMs have been massively overhyped, including by some of the major players.
2. LLMs have significant problems and limitations.
3. LLMs can do some incredibly impressive things and can be profoundly useful for some applications.
I would go so far as to say that #2 and #3 are hardly even debatable at this point. Everyone acknowledges #2, and the only people I see denying #3 are people who either haven't investigated or are so annoyed by #1 that they're willing to sacrifice their credibility as an intellectually honest observer.
#3 can be true and yet not be enough to make your case. Many failed technologies achieved impressive engineering milestones. Even the harshest critic could probably brainstorm some niche applications for a hallucination machine or whatever.
It only makes it worthless for implementations where you require data. There's a universe of LLM use cases that aren't asking ChatGPT to write a report or using it as a Google replacement.
The problem is that, yes, LLMs are great when you're working on some regular thing for the first time. You can get started at a speed never before seen in the tech world.
But as soon as your use case goes beyond that, LLMs are almost useless.
The main complaint is that, yes, it's extremely helpful in that specific subset of problems, but it's not actually pushing human knowledge forward. Nothing novel is being created with it.
It has created this illusion of being extremely helpful when in reality it is a shallow kind of help.
> If it makes data up, then it is worthless for all implementations.
Not true. It's only worthless for the things you can't easily verify. If you have a test for a function and ask an LLM to generate the function, it's very easy to say whether it succeeded or not.
In some cases, just being able to generate the function with the right types will mostly mean the LLM's solution is correct. Want a `List(Maybe a) -> Maybe(List(a))`? There's a very good chance an LLM will either write the right function or fail the type check.
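As a rough Python rendering of that signature (just a sketch to make the point concrete; the function name and the use of Optional are mine, not anything an LLM produced):

    from typing import Optional, TypeVar

    T = TypeVar("T")

    def sequence(xs: list[Optional[T]]) -> Optional[list[T]]:
        # List(Maybe a) -> Maybe(List(a)): succeed only if every element is present.
        out: list[T] = []
        for x in xs:
            if x is None:
                return None
            out.append(x)
        return out

    assert sequence([1, 2, 3]) == [1, 2, 3]
    assert sequence([1, None, 3]) is None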
In a research context, it provides pointers, and keywords for further investigation. In a report-writing context it provides textual content.
Neither of these, nor the thousand other uses, is worthless. It's when you expect a working and complete work product that it's (subjectively, maybe) worthless, but frankly, aiming for that with current-gen technology is a fool's errand.
It says that people need training on what the appropriate use-cases for LLMs are.
This is not the type of report I'd use an LLM to generate. I'd use a database or spreadsheet.
Blindly using and trusting LLMs is a massive minefield that users really don't take seriously. These mistakes are amusing, but eventually someone is going to use an LLM for something important and hallucinations are going to be deadly. Imagine a pilot or pharmacist using an LLM to make decisions.
Some information needs to come from authoritative sources in an unmodified format.
It mostly says that one of the seriously difficult challenges with LLMs is a meta-challenge:
* LLMs are dangerously useless for certain domains.
* ... but can be quite useful for others.
* The real problem is: They make it real tricky to tell, because most of all they are trained to sound professional and authoritative. They hallucinate papers because that's what authoritative answers look like.
That already means I think LLMs are far less useful than they appear to be. It doesn't matter how amazing a technology is: If it has failure modes and it is very difficult to know what they are, it's dangerous technology no matter how awesome it is when it is working well. It's far simpler to deal with tech that has failure modes but you know about them / once things start failing it's easy to notice.
Add to it the incessant hype, and, oh boy. I am not at all surprised that LLMs have a ridiculously wide range as to detractors/supporters. Supporters of it hype the everloving fuck out of it, and that hype can easily seem justified due to how LLMs can produce conversational, authoritative sounding answers that are explicitly designed to make your human brain go: Wow, this is a great answer!
... but experts read it and can see the problems there. Lots of tech suffers from this. As a random example, plenty of highly upvoted, apparently fantastically written Stack Overflow answers have problems: it's a great answer... for 10 years ago; it's a bad idea today because the answer has been obsoleted.
But between the fact that it's overhyped and the fact that it's particularly hard to determine whether an LLM answer is hallucinated drivel, it's logical to me that experts are hyperbolic when highlighting the problems. That's a natural reaction when you have a thing that SEEMS amazing but actually isn't.
You, and the OP, are being unfair in your replies. Obviously it's not worthless for all applications, but when LLMs obviously fail in disastrous ways in some important areas, you can't refute that by going "actually it gives me coding advice and generates images".
That's nice and impressive, but there are still important issues and shortcomings. Obligatory, semi-related xkcd: https://xkcd.com/937/
All of these anecdotal stories about "LLM" failures need to go into more detail about what model, prompt, and scaffolding was used. It makes a huge difference. Were they using Deep Research, which searches for relevant articles and brings facts from them into the report? Or did they type a few sentences into ChatGPT Free and blindly take it on faith?
LLMs are _tools_, not oracles. They require thought and skill to use, and not every LLM is fungible with every other one, just like flathead, Phillips, and hex-head screwdrivers aren't freely interchangeable.
If any non-trivial ask of an LLM also requires the prompts/scaffolding to be listed, and independently verified, along with its output, their utility is severely diminished. They should be saving time not giving us extra homework.
That isn't what I'm saying. I'm saying you can't make a blanket statement that LLMs in general aren't fit for some particular task. There are certainly tasks where no LLM is competent, but for others, some LLMs might be suitable while others are not. At least some level of detail beyond "they used an LLM" is required to know whether a) there was user error involved, or b) an inappropriate tool was chosen.
Are they? Every foundation model release includes benchmarks with different levels of performance in different task domains. I don't think I've seen any model advertised by its creating org as either perfect or even equally competent across all domains.
The secondary market snake oil salesmen <cough>Manus</cough>? That's another matter entirely and a very high degree of skepticism for their claims is certainly warranted. But that's not different than many other huckster-saturated domains.
People like Zuckerberg go around claiming most of their code will be written by AI starting sometime this year. Other companies are hearing that and using it as a reason (or false cover) for layoffs. The reality is LLMs still have a way to go before replacing experienced devs, and even when they start getting there, there will be a period of time where we're learning what we can and can't trust them with and how to use them effectively and responsibly. Feels like at least a few years from now, but the marketing says it's now.
In many, many cases those problems are resolved by improvements to the model. The point is that making a big deal about LLM fuck ups in 3 year old models that don't reproduce in new ones is a complete waste of time and just spreads FUD.
Did you read the original tweet? She mentions the models and gives high level versions of her prompts. I'm not sure what "scaffolding" is.
You're right that they're tools, but I think the complaint here is that they're bad tools, much worse than they are hyped to be, to the point that they actually make you less efficient because you have to do more legwork to verify what they're saying. And I'm not sure that "prompt training," which is what I think you're suggesting, is an answer.
I had several bad experiences lately. With Claude 3.7 I asked how to restore a running database in AWS to a snapshot (RDS, if anyone cares). It basically said "Sure, just go to the db in the AWS console and select 'Restore from snapshot' in the actions menu." There was no such button. I later read AWS docs that said you cannot restore a running database to a snapshot, you have to create a new one.
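For reference, what the docs describe maps to creating a new instance from the snapshot rather than rolling back in place; in boto3 terms it's roughly the call below (identifiers are placeholders, and a real restore needs more parameters such as subnet group and security groups):

    import boto3

    rds = boto3.client("rds")

    # RDS snapshot restores always create a *new* instance; you can't roll back
    # a running instance in place, which is what the LLM's answer implied.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="mydb-restored",            # new instance name
        DBSnapshotIdentifier="mydb-snapshot-2025-01-01", # existing snapshot
    )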
I'm not sure that any amount of prompting will make me feel confident that it's finally not making stuff up.
I was responding to the "they used an LLM" story about the Norwegian school report, not the original tweet. The original tweet has a great level of detail.
I agree that hallucination is still a problem, albeit a lot less of one than it was in the recent past. If you're using LLMs for tasks where you are not directly providing it the context it needs, or where it doesn't have solid tooling to find and incorporate that context itself, that risk is increased.
Why do you think these details are important? The entire point of these tools is that I am supposed to be able to trust what they say. The hard work is precisely to be able to spot which things are true and false. If I could do that I wouldn't need an assistant.
> The entire point of these tools is that I am supposed to be able to trust what they say
Hard disagree, and I feel like this assumption might be at the root of why some people seem so down on LLMs.
They’re a tool. When they’re useful to me, they’re so useful they save me hours (sometimes days) and allow me to do things I couldn’t otherwise, and when they’re not they’re not.
It never takes me very long to figure out which scenario I’m in, but I 100% understand and accept that figuring that out is on me and part of the deal!
Sure, if you think you can “vibe code” (or “vibe founder”) your way to massive success by getting LLMs to do stuff you’re clueless about without any way to check, you’re going to have a bad time, but the fact they can’t (so far) do that doesn’t make them worthless.
Sounds like a user problem, though. When used properly as a tool they are incredible. When you give up 100% trust to them to be perfect it’s you that is making the mistake.
Well yeah, it's fancy autocomplete. And it's extremely amazing what 'fancy autocomplete' is able to do, but making the decision to use an LLM for the type of project you described is effectively just magical thinking. That isn't an indictment against LLM, but rather the person who chose the wrong tool for the job.
Some of the more modern tools do exactly that. If you upload a CSV to Claude, it will not (or at least not anymore) try to process the whole thing. It will read the header, and then ask you what you want. It will then write the appropriate Javascript code and run it to process the data and figure out the stats/whatever you asked it for.
I recently did this with a (pretty large) exported CSV of calories/exercise data from MyFitnessPal and asked it to evaluate it against my goals/past bloodwork etc (which I have in a "Claude Project" so that it has access to all that information + info I had it condense and add to the project context from previous convos).
It wrote a script to extract out extremely relevant metrics (like ratio of macronutrients on a daily basis for example), then ran it and proceeded to talk about the result, correlating it with past context.
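The script was along these lines (a from-memory sketch; the column names here are guesses, not the exact MyFitnessPal headers it used):

    import csv
    from collections import defaultdict

    # Calories per gram for each macronutrient column in the export.
    CAL_PER_GRAM = {"Protein (g)": 4, "Carbohydrates (g)": 4, "Fat (g)": 9}

    def daily_macro_ratios(path: str) -> dict[str, dict[str, float]]:
        # Sum calories per macro per day, then express each macro as a share
        # of that day's total calories.
        totals = defaultdict(lambda: defaultdict(float))
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for col, kcal in CAL_PER_GRAM.items():
                    totals[row["Date"]][col] += float(row.get(col) or 0) * kcal
        return {
            day: {col: cals / sum(macros.values()) for col, cals in macros.items()}
            for day, macros in totals.items()
            if sum(macros.values()) > 0
        }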
Use the tools properly and you will get the desired results.
Often they will do exactly that; currently their reasoning isn't the best, so you may have to coax it to take the best path. It's also making judgement calls as it writes the code, so that's worth checking too. No different to a senior instructing an intern.
"Even a journey of 1,000 miles begins with the first step. Unless you're an AI hyper then taking the first step is the entire journey - how dare you move the goalposts"
"They continue to fabricate links, references, and quotes, like they did from day one." - "I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error."
Why have these companies not manually engineered out a problem like this by now? Just do a check to make sure links are real. That's pretty unimpressive to me.
There are no fabricated links, references, or quotes in OpenAI's GPT-4.5 + Deep Research.
It's unfortunate the cost of a Deep Research bespoke white paper is so high. That mode is phenomenal for pre-work domain research. You get an analyst's two-week writeup in under 20 minutes, for the low cost of $200/month (though I've seen estimates that such a white paper costs OpenAI over USD 3,000 to produce, which explains the monthly limits).
You still need to be a domain expert to make use of this, just as you need to be to make use of an analyst. Both the analyst and Deep Research can generate flawed writeups with similar misunderstandings: mis-synthesis, misapplication, or omission of something essential.
Neither analyst nor LLM is a substitute for mastery.
How do people in the future become domain experts capable of properly making use of it if they are not the analyst spending two weeks on the write-up today?
My complaint with Deep Research LLMs is that they don't go deeper than 2 pages of SERPs. I want them to dig up obscure stuff, not list cursorily relevant peripheral directions. They just seem to do breadth-first rather than depth-first search.
This assessment is incomplete. Large language models are both less and more than these traditional tools. They have not subsumed them, and all can sit together in separate tabs of a single browser window. They are another resource, and when the conditions are right, which is often the case in my experience, they are a startlingly effective tool for navigating the information landscape. The criticism of Gemini is a fair one, and I encountered it yesterday, but perhaps with 50% less entitlement. But Gemini also helped me translate obscure termios APIs to Python from C source code I provided. The equivalent using search and/or Stack Overflow would have required multiple piecemeal searches without guarantees -- and definitely would have taken much more time.
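For a sense of how mechanical that translation is, the usual C pattern of tcgetattr/cfmakeraw/tcsetattr maps onto Python's standard termios and tty modules roughly as below (an illustrative sketch, not the code Gemini actually produced):

    import sys, termios, tty

    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)          # equivalent of tcgetattr(fd, &saved)
    try:
        tty.setraw(fd)                     # roughly cfmakeraw + tcsetattr
        ch = sys.stdin.read(1)             # read a single keypress in raw mode
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)  # restore the terminal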
The 404 links are hilarious, like you can't even parse the output and retry until it returns a link that doesn't 404? Even ignoring the billions in valuation, this is so bad for a $20 sub.
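Even a dumb post-hoc filter like this sketch would catch the dead links before a user ever sees them (the requests library and the five-second timeout are just illustrative choices, not anyone's actual product code):

    import requests

    def link_is_live(url: str, timeout: float = 5.0) -> bool:
        # Treat anything that resolves with a non-error status as "real".
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            return resp.status_code < 400
        except requests.RequestException:
            return False

    def filter_dead_links(urls: list[str]) -> list[str]:
        # Drop (or flag for regeneration) every cited URL that 404s.
        return [u for u in urls if link_is_live(u)]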
The tweeter’s complaints sound like a user problem. LLMs are tools. How you use them, when you use them, and what you expect out of them should be based on the fact that they are tools.
I’m sorry but the experience of coding with an LLM is about ten billion times better than googling and stack overflowing every single problem I come across. I’ve stack overflowed maybe like two things in the past half year and I’m so glad to not have to routinely use what is now a very broken search engine and web ecosystem.
How did you measure and compare googling/stack overflow to coding with an LLM? How did you get to the very impressive number ten billion times better?! Can you share your methodology? How have you defined better?
That's part of it. The other part is Google sacrificing product quality for excessive monetization. An example would be YouTube search - first three results are relevant, next 12 results are irrelevant "people also watched", then back to relevant results. Another example would be searching for an item to buy and getting relevant results in the images tab of google, but not the shopping tab.
It’s broken because Google has spent 20+ years promoting garbage content in a self-serving way. No one was able to compete unless they played by Google’s rules, and so all we have left is blog spam and regular spam.
I didn't notice that example. I doubt top-tier models have issues with that. I was more referencing Sabine's mentions of hallucinated citations and papers, which is an issue I also had 2 years ago but is probably solved by Deep Research at this point. She just has massive skill issues and doesn't know what she's doing.
>What are the use cases where the expected performance is high?
o1-pro is probably at top tier human level performance on most small coding tasks and definitely at answering STEM questions. o3 is even better but not released outside of it powering Deep Research.
> This is just not a use case where the expected performance on these tasks is high.
Yet the hucksters hyping AI are falling all over themselves saying AI can do all this stuff. This is where the centi-billion dollar valuations are coming from. It's been years and these super hyped AIs still suck at basic tasks.
When pre-AI shit Google gave wrong answers, it at least linked to the source of the wrong answers. LLMs just output something that looks like a link and call it a day.
<<After glowing reviews, I spent $200 to try it out for my research. It hallucinated 8 of 10 references on a couple of different engineering topics. For topics that are well established (literature search), it is useful, although o3-mini-high with web search worked even better for me. For truly frontier stuff, it is still a waste of time.>>
<<I've had the hallucination problem too, which renders it less than useful on any complex research project as far as I'm concerned.>>
These quotes are from the link you posted. There are a lot more.
The whole point is that an LLM is not a search engine and obviously anyone who treats it as one is going to be unsatisfied. It's just not a sensible comparison. You should compare working with an LLM to working with an old "state of the art" language tool like Python NLTK -- or, indeed, specifying a problem in Python versus specifying it in the form of a prompt -- to understand the unbridgeable gap between what we have today and what seemed to be the best even a few years ago. I understand when a popular science author or my relatives haven't understood this several years after mass access to LLMs, but I admit to being surprised when software developers have not.
Hosted and free or subscription-based DeepResearch like tools that integrate LLMs with search functionality (the whole domain of "RAG" or "Retrieval Augmented Generation") will be elementary for a long time yet simply because the cost of the average query starts to go up exponentially and there isn't that much money in it yet. Many people have and will continue to build their own research tools where they can determine how much compute time and API access cost they're willing to spend on a given query. OCR remains a hard problem, let alone appropriately chunking potentially hundreds of long documents into context length and synthesizing the outputs of potentially thousands of LLM outputs into a single response.
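Even the naive version of that chunking step already forces choices about overlap and context budget; a toy sketch (the word-based window sizes are arbitrary; real pipelines chunk on tokens and document structure):

    def chunk_words(text: str, max_words: int = 800, overlap: int = 100) -> list[str]:
        # Split a long document into overlapping word windows that fit a
        # model's context budget; the overlap avoids cutting ideas at boundaries.
        words = text.split()
        step = max_words - overlap
        chunks = []
        for start in range(0, max(len(words) - overlap, 1), step):
            chunks.append(" ".join(words[start:start + max_words]))
        return chunks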
Certainly. I agree of course as to the problem of hype and I'm aware of how many people use LLMs today. I tried to emphasize in my earlier post that I can understand why someone like Sabine has the opinion she does -- I'm more confused how there's still similar positions to be found among software developers, evidenced often within Hacker News threads like the one we're in. I don't intend that to refer to you, who clearly has more than a passing knowledge of LLM internals, but more to the original commenter I was responding to.
More than marketing, I think from my experience it's chat with little control over context as the primary interface of most non-engineers with LLMs that leads to (mis)expectations of the tool in front of them. Having so little control over what is actually being input to the model makes it difficult to learn to treat a prompt as something more like a program.
It's mostly because of how they were initially marketed. In an effort to drive hype 'we' were promised the world. Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence? In reality Bard, let alone whatever early version he was using, is about as sentient as my left asscheek.
OpenAI did similar things by focusing to the point of absurdity on 'safety' for what was basically a natural language search engine that has a habit of inventing nonsensical stuff. But on that same note (and also as you alluded to), I do agree that LLMs have a lot of use as natural language search engines in spite of their proclivity to hallucinate. Being able to describe, e.g., a function call (or some esoteric piece of history) and then often get the precise term/event that I'm looking for is just incredibly useful.
But LLMs obviously are not sentient, are not setting us on the path to AGI, or any other such nonsense. They're arguably what search engines should have been 10 or 15 years ago, but anti-competitive monopolization of the industry meant that search engine technology progress basically stalled out, if not regressed for the sake of ads (and individual 'entrepreneurs' becoming better at SEO), about the time Google fully established itself.
> Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence?
I presume you are referring to this Google engineer, who was sacked for making the claim. Hardly an example of AI companies overhyping the tech; precisely the opposite, in fact.
https://www.bbc.co.uk/news/technology-62275326
It seems to be a common human hallucination to imagine that large organisations are conspiring against us.
Corporations are motivated by profit, not doing what's best for humanity. If you need an example of "large organizations conspiring against us," I can give you twenty.
I agree that sometimes organisations conspire against people. My point was, in case it wasn't apparent, the irony that somenameforme was talking about how LLMs were of little use because they hallucinate, whilst apparently hallucinating a conspiracy by AI companies to overhype the technology.
I wasn't making a political point. You see similar evidence-free allegations against international organisations and national government bodies.
> Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence?
That's not what happened. Google stomped hard on Lemoine, saying clearly that he was wrong about LaMDA being sentient ... and then they fired him for leaking the transcripts.
Your whole argument here is based on false information and faulty logic.
Were you perchance noting that according to some people «LLMs ... can hallucinate and create illogical outputs» (you also specified «useless», but that must be a further subset and will hardly create a «litter[ing]» here), but also that some people use «false information and faulty logic»?
Noting that people are imperfect is not a justification for the weaknesses in LLMs. Since around late 2022 some people started stating LLMs are "smart like their cousin", to which the answer remains "we hope that your cousin has a proportionate employment".
If you built a crane that only lifts 15kg, it's no justification that "many people lift 10". The purpose of the crane is to lift as needed, with abundance for safety.
If we build cranes, it is because people are not sufficient: the relative weakness of people is, far from a consolation of weak cranes, the very reason why we want strong cranes. Similarly for intelligence and other qualities.
People are known to use «false information and faulty logic»: but they are not being called "adequate".
> angry at
There's a subculture around here that thinks it normal to downvote without any rebuttal - equivalent to "sneering and leaving" (quite impolite), almost all times it leaves us without a clue about what could be the point of disapproval.
I think you're missing the point. He's pointing out what the atmosphere was/is around LLMs in these discussions, and how that impacts stories like with Lemoine.
I mean, you're right that he's silly and Google didn't want to be part of it, but it was (and is?) taken seriously that: LLMs are nascent AGI, companies are pouring money to get there first, we might be a year or two away. Take these as true, it's at least possible that Google might have something chained up in their basement.
In retrospect, Google dismissed him because he was acting in a strange and destructive way. At the time, it could be spun as just further evidence: they're silencing him because he's right. Could it have created such hysteria and silliness if the environment hadn't been so poisoned by the talk of imminent AGI/sentience?
Which comment claimed that LLMs were marketed as super-intelligence? I'm looking up the chain and I can't see it.
I don't think they were, but I think it's pretty clear they were marketed as being the imminent path to super-intelligence, or something like it. OpenAI were saying GPT-(n-1) is as intelligent as a high school student, GPT-(n) is a university student, GPT-(n+1) will be.. something.
> OpenAI did similar things by focusing to the point of absurdity on 'safety' for what was basically a natural language search engine that has a habit of inventing nonsensical stuff.
The focus on safety, and the concept of "AI", preexisted the product. An LLM was just the thing they eventually made; it wasn't the thing they were hoping to make. They applied their existing beliefs to it anyway.
I am worried about them as a substitute for search engines.
My reasoning is that classic google web-scraping and SEO, as shitty as it may be, is 'open-source' (or at least, 'open-citation') in nature - you can 'inspect the sh*t it's built from'.
Whereas LLMs, to me, seem like a Chinese - or Western - totalitarian political system's wet dream: 'we can set up an inscrutable source of "truth" for the people to use, with the _truths_ we intend them to receive'.
We already saw how weird and unsane this was, when they were configured to be woke under the previous regime. Imagine it being configured for 'the other post-truth' is a nightmare.
> Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence?
No, first time I hear about it. I guess the secret to happiness is not following leaks. I had very low expectations before trying LLMs and I’m extremely impressed now.
Not following leaks, or just the news, not living in the real world, not caring about the consequences of reality: anybody can think he's """happy""" with psychedelia and with just living in a private world. But it is the same kind of "happy" that comes with "just smile".
If you did not get information that there are severe pitfalls - which is by the way so unrelated to the "it's sentient thing", as we are talking about the faults in the products, not the faults in human fools -, you are supposed to see them from your own judgement.
They have their value in analyzing huge amounts of data, for example scientific papers or raw observations, but the popular public ones are mostly trained on stolen/pirated texts off the internet and from social media clouds the companies control. So this means: bullshit in -> bullshit out. I don't need machines for that; the regular human bullshitters do this job just fine.
> the popular public ones are mostly trained on stolen/pirated texts off the internet
You mean like actual literature, textbooks and scientific papers? You can't get them in bulk without pirating. Thank intellectual property laws.
> from social media clouds the companies control
I.e. conversations of real people about matters of real life.
But if it satisfies your elitist, ivory-towerish vision of "healthy information diet" for LLMs, then consider that e.g. Twitter is where, until now, you'd get most updates from the best minds in several scientific fields. Or that besides r/All, the Reddit dataset also contains r/AskHistorians and other subreddits where actual experts answer questions and give first-hand accounts of things.
The actually important bit though, is that LLM training manages to extract value from both the "bullshit" and whatever you'd call "not bullshit", as the model has to learn to work with natural language just as much as it has to learn hard facts or scientific theories.
Yes, I find the biggest issue in discussing the present state of AI with people outside the field, whether technical or not, is that "machine learning" had only just entered popular understanding: i.e. everyone seems ready today to talk about the limits of training a machine learning model on X limited data set, unable to extrapolate beyond it. The difference between "learning the best binary classifier on a labelled training set" and "exploring the set of all possible programs representable by a deep neural network of whatever architecture to find that which best generates all digitally recorded traces of human beings throughout history" is very far from intuitive to even specialists. I think Ilya's old public discussions of this question are the most insightful for a popular audience, explaining how and why a world model and not simply a Markov chain is necessary to solve the seemingly trivial problem of "predicting the next word in a sequence."
Nobody promised the world. The marketing underpromised and LLMs overdelivered. Safety worries didn't come from marketing, it came from people who were studying this as a mostly theoretical worry for the next 50+ years, only to see major milestones crossed a decade or more before they expected.
Did many people overhype LLMs? Yes, like with everything else (transhumanist ideas, quantum physics). It helps being more picky who one listens to, and whether they're just painting pretty pictures with words, or actually have something resembling a rational argument in there.
Folks really over index when an LLM is very good for their use case. And most of the folks here are coders, at which they're already good and getting better.
For some tasks they're still next to useless, and people who do those tasks understandably don't get the hype.
Tell a lab biologist or chemist to use an LLM to help them with their work and they'll get very little useful out of it.
Ask an attorney to use it and it's going to miss things that are blindingly obvious to the attorney.
Ask a professional researcher to use it and it won't come up with good sources.
For me, I've had a lot of those really frustrating experiences where I'm having difficulty on a topic and it gives me utter incorrect junk because there just isn't a lot already published about that data.
I've fed it tricky programming tasks and gotten back code that doesn't work, and that I can't debug because I have no idea what it's trying to do, or I'm not familiar with the libraries it used.
It sounds like you're trying to use these LLMs as oracles, which is going to cause you a lot of frustration. I've found almost all of them now excel at imitating a junior dev or a drunk PhD student. For example, the other day I was looking at acoustic sensor data and I ran it down the trail of "what are some ways to look for repeating patterns like xyz", and 10 minutes later I had a mostly working proof of concept for a 2nd-order spectrogram that reasonably dealt with spectral leakage, and a half-working mel-spectrum fingerprint idea. Those are all things I was thinking about myself, so I was able to guide it to a mostly working prototype in very little time. But doing it myself from zero would've taken at least a couple of hours.
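The core of that proof of concept is only a few lines once you know what to ask for; roughly something like this sketch (reconstructed from memory, with made-up parameter values): a windowed spectrogram to tame leakage, then a second FFT along time to expose repetition rates.

    import numpy as np
    from scipy.signal import spectrogram

    def second_order_spectrogram(x: np.ndarray, fs: float):
        # First-order spectrogram with a Hann window to reduce spectral leakage.
        f, t, S = spectrogram(x, fs=fs, window="hann", nperseg=1024, noverlap=512)
        envelope = np.log1p(S)  # compress dynamic range
        # Second FFT along the time axis of each frequency bin: peaks here
        # correspond to repeating patterns in the original signal.
        rep = np.abs(np.fft.rfft(envelope, axis=1))
        rep_rate = np.fft.rfftfreq(envelope.shape[1], d=t[1] - t[0])
        return f, rep_rate, rep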
But truthfully 90% of work related programming is not problem solving, it's implementing business logic. And dealing with poor, ever changing customer specs. Which an llm will not help with.
> But truthfully 90% of work related programming is not problem solving, it's implementing business logic. And dealing with poor, ever changing customer specs. Which an llm will not help with.
Au contraire, these are exactly things LLMs are super helpful at - most of business logic in any company is just doing the same thing every other company is doing; there's not that many unique challenges in day-to-day programming (or business in general). And then, more than half of the work of "implementing business logic" is feeding data in and out, presenting it to the user, and a bunch of other things that boil down to gluing together preexisting components and frameworks - again, a kind of work that LLMs are quite a big time-saver for, if you use them right.
Strongly in agreement. I've tried them and mostly come away unimpressed. If you work in a field where you have to get things right, and it's more work to double check and then fix everything done by the LLM, they're worse than useless. Sure, I've seen a few cases where they have value, but they're not much of my job. Cool is not the same as valuable.
If you think "it can't quite do what I need, I'll wait a little longer until it can" you may still be waiting 50 years from now.
> If you work in a field where you have to get things right, and it's more work to double check and then fix everything done by the LLM, they're worse than useless.
Most programmers understand reading code is often harder than writing it. Especially when someone else wrote the code. I'm a bit amused by the cognitive dissonance of programmers understanding that and then praising code handed to them by an LLM.
It's not that LLMs are useless for programming (or other technical tasks) but they're very junior practitioners. Even when they get "smarter" with reasoning or more parameters their nature of confabulation means they can't be fully trusted in the way their proponents suggest we trust them.
It's not that people don't make mistakes but they often make reasonable mistakes. LLMs make unreasonable mistakes at random. There's no way to predict the distribution of their mistakes. I can learn a human junior developer sucks at memory management or something. I can ask them to improve areas they're weak in and check those areas of their work in more detail.
I have to spend a lot of time reviewing all output from LLMs because there's rarely rhyme or reason to their errors. They save me a bunch of typing but replace a lot of my savings with reviews and debugging.
My view is that it will be some time before they can, as well, because of the success in the software domain - not because LLMs aren't capable as a tech but because data owners and practitioners in other domains will resist the change. From the SWE experience, news reports, financial magazines, etc., many are preparing accordingly, even if it is a subconscious thing. People don't like change, and don't want to be threatened when it is them at risk - no one wants what happened to artists and now SWEs to happen to their profession. They are happy for other professions to "democratize/commoditize" as long as it isn't them - after all, this increases their purchasing power. Don't open source knowledge/products, don't let AI near your vertical domain, continue to command a premium for as long as you can - I've heard variations of this in many AI conversations. Much easier in oligopoly and monopoly-like domains and/or domains where knowledge was known to be a moat even when mixed with software, since you have more trust that competitors won't do the same.
For many industries/people work is a means to earn, not something to be passionate in for its own sake. Its a means to provide for other things in life you are actually passionate about (e.g. family, lifestyle, etc). In the end AI may get your job eventually but if it gets you much later vs other industries/domains you win from a capital perspective as other goods get cheaper and you still command your pre-AI scarcity premium. This makes it easier for them to acquire more assets from the early disrupted industries and shield them from eventual AI taking over.
I'm seeing this directly in software. Less new frameworks/libraries/etc outside the AI domain being published IMO, more apprehension from companies to open source their work and/or expose what they do, etc. Attracting talent is also no longer as strong of a reason to showcase what you do to prospective employees - economic conditions and/or AI make that less necessary as well.
I frequently see news stories where attorneys get in trouble for using LLMs because they cite hallucinated case law. If they didn't get caught, that would look the same as using them "productively".
Asking the LLM for relevant case law and checking it up - productive use of LLM. Asking the LLM to write your argument for you and not checking it up - unproductive use of LLM. It's the same as with programming.
>Asking the LLM for relevant case law and checking it up - productive use of LLM
That's a terrible use for an LLM. There are several deterministic search engines attorneys use to find relevant case law, where you don't have to check to see if the cases actually exist after it produces results. Plus, the actual text of the case is usually very important, and isn't available if you're using an LLM.
Which isn't to say they're not useful for attorneys. I've had success getting them to do some secretarial and administrative things. But for the core of what attorneys do, they're not great.
For law firms creating their own repositories of case law, having LLMs search via summaries, and then dive into the selected cases to extract pertinent information seems like an obvious great use case to build a solution using LLMs.
The orchestration of LLMs that will be reading transcripts, reading emails, reading case law, and preparing briefs with sources is unavoidable in the next 3 years. I don’t doubt multiple industry-specialized solutions are already under development.
Just asking chatGPT to make your case for you is missing the opportunity.
If anyone is unable to get Claude 3.7 or Gemini 2.5 to accelerate their development work, I have to doubt their sentience at this point. (Or, more likely, doubt that they’re actively testing these things regularly.)
Law firms don't create their own repos of case law. They use a database like westlaw or lexis. LLMs "preparing briefs with sources" would be a disaster and wholly misunderstands what legal writing entails.
I find it very useful to review the output and consider its suggestions.
I don’t trust it blindly, and I often don’t use most of what it suggests; but I do apply critical thinking to evaluate what might be useful.
The simplest example is using it as a reverse dictionary. If I know there’s a word for a concept, I’ll ask an LLM. When I read the response, I either recognize the word or verify it using a regular dictionary.
I think a lot of the contention in these discussions is because people are using it for different purposes: it's unreliable for some purposes and it is excellent at others.
> Asking the LLM for relevant case law and checking it up - productive use of LLM.
Only if you're okay with it missing stuff. If I hired a lawyer, and they used a magic robot rather than doing proper research, and thus missed relevant information, and this later came to light, I'd be going after them for malpractice, tbh.
Surely this was meant ironically, right? You must've heard of at least one of the many cases involving lawyers doing precisely what you described and ending up presenting made up legal cases in court. Guess how that worked out for them.
The uses that they cited to me were "additional pair of eyes in reviewing contracts," and, "deep research to get started on providing a detailed overview of a legal topic."
Honestly it's worse than this. A good lab biologist/chemist will try to use it, understand that it's useless, and stop using it. A bad lab biologist/chemist will try to use it, think that it's useful, and then it will make them useless by giving them wrong information. So it's not just that people over-index when it is useful, they also over-index when it's actively harmful but they think it's useful.
You think good biologists never need to summarize work into digestible language, or fill out multiple huge, redundant grant applications with the same info, or reformat data, or check that a writeup accurately reflects data?
I’m not a biologist (good or bad) but the scientists I know (who I think are good) often complain that most of the work is drudgery unrelated to the science they love.
Sure, lots of drudgery, but none of your examples are things that you could trust an LLM to do correctly when correctness counts. And correctness always counts in science.
Edit to add: and regardless, I'm less interested in the "LLM's aren't ever useful to science" part of the point. The point that actual LLM usage in science will mostly be for cases where they seem useful but actually introduce subtle problems is much more important. I have observed this happening with trainees.
The problem Sabine tries to communicate is that reality is different from what the cash-heads behind main commercial models are trying to portray. They push the narrative that they’ve created something akin to human cognition, when in reality, they’ve just optimised prediction algorithms on an unprecedented scale. They are trying to say that they created Intelligence, which is the ability to acquire and apply knowledge and skills, but we all know the only real Intelligence they are creating is the collection of information of military or political value.
The technology is indeed amazing and very amusing, but like all the good things in the hands of corporate overlords, it will be slowly turning into profit-milking abomination.
> They push the narrative that they’ve created something akin to human cognition
This is your interpretation of what these companies are saying. I'd love to see if some company specifically claimed anything like that.
Out of the last 100 years, how many inventions have been made that could fill any human with awe like LLMs do right now? How many things from today, when brought back into 2010, would make the person using them feel like they're being tricked or pranked? We already take them for granted even though they've only been around for less than half a decade.
LLMs aren't a catch-all solution to the world's problems, or something that is going to help us in every facet of our lives, or an accelerator for every industry that exists out there. But at no point in history could you talk to your phone about general topics, get information, practice language skills, build an assistant that teaches your kid about the basics of science, use something to accelerate your work in many different ways, etc...
Looking at LLMs shouldn't be boolean; it shouldn't be a choice between "they're the best thing ever invented" and "they're useless". But it seems like everyone presents the issue in this manner, and Sabine is part of that problem.
No major company directly states "We have created human-like intelligence," but they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
>I'd love to see if some company specifically claimed anything like that.
1. Microsoft researchers: Sparks of Artificial General Intelligence: Early experiments with GPT-4 - https://arxiv.org/abs/2303.12712
2. "GPT-4 is not AGI, but it does exhibit more general intelligence than previous models." - Sam Altman
3. Musk has claimed that AI is on the path to "understanding the universe." His branding of Tesla's self-driving AI as "Full Self-Driving" (FSD) also misleadingly suggests a level of autonomous reasoning that doesn't exist.
4. Meta's AI chief scientist, Yann LeCun, has repeatedly said they are working on giving AI "common sense" and "world models" similar to how humans think.
>Out of the last 100 years, how many inventions have been made that could fill any human with awe like LLMs do right now?
ELIZA is an early natural language processing computer program developed from 1964 to 1967
ELIZA's creator, Weizenbaum, intended the program as a method to explore communication between humans and machines. He was surprised and shocked that some people, including Weizenbaum's secretary, attributed human-like feelings to the computer program. 60 years ago.
So as you can see, us humans are not too hard to fool with this.
ELIZA was not a natural language processor, and the fact that some people were easily fooled by a program that produced canned responses based on keywords in the text but was presented as a psychotherapist is not relevant to the issue here--it's the fallacy of affirming the consequent.
Also,
"4. Meta's AI chief scientist, Yann LeCun, has repeatedly said they are working on giving AI "common sense" and "world models" similar to how humans think."
completely misses the mark. That LLMs don't do this is a criticism from old-school AI researchers like Gary Marcus; LeCun is saying that they are addressing the criticism by developing the sorts of technology that Marcus says are necessary.
> they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
As do all companies in the world. If you want to buy a hammer, the company will sell it as the best hammer in the world. It's the norm.
I don't know exactly what your point is with ELIZA?
> So as you can see, us humans are not too hard to fool with this.
I mean ok? How is that related to having a 30 minute conversation with ChatGPT where it teaches you a language? Or Claude outputting an entire application in a single go? Or having them guide you through fixing your fridge by uploading the instructions? Or using NotebookLM to help you digest a scientific paper?
I'm not saying LLMs are not impressive or useful; I'm pointing out that corporations behind commercial AI models are capitalising on our emotional response to natural language prediction. This phenomenon isn't new; Weizenbaum observed it 60 years ago, even with the simplest of algorithms like ELIZA.
Your example actually highlights this well. AI excels at language, so it’s naturally strong in teaching (especially for language learning ;)). But coding is different. It’s not just about syntax; it requires problem-solving, debugging, and system design — areas where AI struggles because it lacks true reasoning.
There’s no denying that when AI helps you achieve or learn something new, it’s a fascinating moment — proof that we’re living in 2025, not 1967. But the more commercialised it gets, the more mythical and misleading the narrative becomes
> system design — areas where AI struggles because it lacks true reasoning.
Others addressed code, but with system design specifically - this is more of an engineering field now, in that there's established patterns, a set of components at various levels of abstraction, and a fuck ton of material about how to do it, including but not limited to everything FAANG publishes as preparatory material for their System Design interviews. At this point in time, we have both a good theoretical framework and a large collection of "design patterns" solving common problems. The need for advanced reasoning is limited, and almost no one is facing unique problems here.
I've tested it recently, and suffice it to say, Claude 3.7 Sonnet can design systems just fine - in fact much better than I'd expect a random senior engineer to. Having the breadth of knowledge and being really good at fitting patterns is a big advantage it has over people.
> They push the narrative that they’ve created something akin to human cognition
I am saying they're not doing that, they're doing sales and marketing and it's you that interprets this as possible/true. In my analogy if the company said it's a hammer that can do anything, you wouldn't use it to debug elixir. You understand what hammers are for and you realize the scope is different. Same here. It's a tool that has its uses and limits.
> Your example actually highlights this well. AI excels at language, so it’s naturally strong in teaching (especially for language learning ;)). But coding is different. It’s not just about syntax; it requires problem-solving, debugging, and system design — areas where AI struggles because it lacks true reasoning.
I disagree since I use it daily and Claude is really good at coding. It's saving me a lot of time. It's not gonna build a new Waymo but I don't expect it to. But this is besides the point. In the original tweet what Sabine is implying is that it's useless and OpenAI should be worth less than a shoe factory. When in fact this is a very poor approach to look at LLMs and their value and both sides of the spectrum are problematic (those that say it's a catch all AGI and those that say hurr it couldn't solve P versus NP it's trash).
I think one difference between a hammer and an LLM is that hammers have existed since forever, so common sense is assumed to be there as to what their purpose is. For LLMs though, people are still discovering on a daily basis to what extent they can usefully apply them, so it's much easier to take such promises made by companies out of context if you are not knowledgeable/educated on LLMs and their limitations.
Person you replied to:
they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
Your response:
As do all companies in the world. If you want to buy a hammer, the company will sell it as the best hammer in the world. It's the norm.
As a programmer (and GOFAI buff) for 60 years who was initially highly critical of the notion of LLMs being able to write code because they have no mental states, I have been amazed by the latest incarnations being able to write complex functioning code in many cases. There are, however, specific ways that not being reasoners is evident ... e.g., they tend to overengineer because they fail to understand that many situations aren't possible. I recently had an example where one node in a tree was being merged into another, resulting in the child list of the absorbed node being added to the child list of the kept node. Without explicit guidance, the LLM didn't "understand" (that is, its response did not reflect) that a child node can only have one parent so collisions weren't possible.
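To make that concrete, here is a minimal sketch in Python of the situation (illustrative names only, not the actual codebase). The over-engineered version checks each incoming child for membership in the kept node's child list; recognizing the single-parent invariant makes that check dead code.

    # Minimal sketch with hypothetical names: each node has exactly one
    # parent, so the child lists of two distinct nodes are disjoint.
    class Node:
        def __init__(self, name):
            self.name = name
            self.parent = None
            self.children = []

        def add_child(self, child):
            child.parent = self
            self.children.append(child)

    def merge_into(kept, absorbed):
        """Merge `absorbed` into `kept`, reparenting its children."""
        for child in absorbed.children:
            # No membership check needed: a node has only one parent, so no
            # child of `absorbed` can already be in `kept.children`.
            child.parent = kept
            kept.children.append(child)
        absorbed.children = []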
> proof that we’re living in 2025, not 1967. But the more commercialised it gets, the more mythical and misleading the narrative becomes
You seem to be living in 2024, or 2023. People generally have far more pragmatic expectations these days, and the companies are doing a lot less overselling ... in part because it's harder to come up with hype that exceeds the actual performance of these systems.
How many examples of CEOs writing shit like that can you name? I can name more than one. Elon had been saying that camera-driven Level 5 autonomous driving would be ready in 2021. Did you believe him?
Elon? Never did, and for the record, I also never really understood his fanboys. I never even bought a Tesla. And no, besides these two guys, I don't really remember many other CEOs making such revolutionary statements. That is usually the case when people understand their technology and are not ready to bullshit. There is one small distinction though: at least the self-driving car hype was believable because it seemed almost like a finite-search problem - along the lines of, how hard could it be to process X input signals from lidars and image frames and marry it to an advanced variation of what is basically a PID controller. And at least there is a defined use-case. With genAI, we have no idea what the problem definition and even the problem space is, and the main use-case that the companies seem to be pushing down our throats (aside from code assistants) is "summarising your email" and chatting with your smartphone, for lonely people. Ew, thanks, but no thanks.
No mate, not everyone is trying hard to prove some guy on the Internet wrong. I do remember these two but to be honest, they were not on top of my mind in this context, probably because it's a different example - or what are you trying to say? That the people running AI companies should go to jail for deceiving their investors? This is different to Theranos. Holmes actively marketed and PRESENTED a "device" which did not exist as specified (they relied on 3rd party labs doing their tests in the background). For all that we know, OpenAI and their ilk are not doing that really. So you're on thin ice here. Amazon came close though, with their failed Amazon Go experiment, but they only invested their own money, so no damage was done to anyone. In either case your example is showing what? That lying is normal in the business world and should be done by the CEOs as part of their job description? That they should or should not go to jail for it? I am really missing your point here, no offence.
> In either case your example is showing what? That lying is normal in the business world and should be done by the CEOs as part of their job description? That they should or should not go to jail for it? I am really missing your point here, no offence.
If you run through the message chain you'll see first that the comment OP is claiming companies market LLMs as AGI, and then the next guy quotes Altman's tweet to support it. I am saying companies don't claim LLMs are AGI and that CEOs are doing CEO things; my examples are Elon (who didn't go to jail, btw) and the other two who did.
> For all that we know, OpenAI and their ilk are not doing that really.
I think you completely missed the point. Altman is definitely engaging in 'creative' messaging, as are other GenAI CEOs. But unlike Holmes and others, they are careful to wrap it in conditionals and future tense and this vague corporate speak about how something "feels" like this and that, not that it definitely is this or that. Most of us dislike the fact that they are indeed implying this stuff is almost AGI, just around the corner, just a few more years, just a few more hundred billion dollars wasted in datacenters. When we can see on a day-to-day basis that their tools are just advanced text generators. Anyone who finds them 'mindblowing' clearly does not have a complex enough use case.
I think you are missing the point. I never said it's the same nor is that what I am arguing.
> Anyone who finds them 'mindblowing' clearly does not have a complex enough use case.
What is the point of LLMs? If their only point is complex use cases then they're useless; let's throw them away. If their point/scope/application is wider and they're doing something for a non-negligible percentage of people, then who are you to gauge whether or not they deserve to be mindblowing to someone, regardless of their use case?
What is the point of LLMs? It seems nobody really knows, including the people selling them. They are a solution in search of a problem. But if you figure it out in the meanwhile, make sure to let everyone know. Personally I'd be happy with just having back Google as it was between roughly 2006-2019 (RIP) in the place of the overly verbose statistical parrots.
> Out of the last 100 years how many inventions have been made that could make any human awe like llms do right now?
Lots e.g. vacuum cleaners.
> But at no point in history could you talk to your phone
You could always "talk" to your phone just like you could "talk" to a parrot or a dog. What does that even mean?
If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag for it to feel like a conversation, and often it doesn't reply with anything related.
> If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag for it to feel like a conversation, and often it doesn't reply with anything related.
I don't believe this one bit. But keep on trucking.
Of course they aren't "real" conversations but I can dialog with LLMs as a means of clarifying my prompts. The comment about parrots and dogs is made in bad faith.
By your own admission, those are not dialogues, but merely query optimisations in an advanced query language. Like how you would tune an SQL query until you get the data you are expecting to see. That's what it is for the LLMs.
> The comment about parrots and dogs is made in bad faith
Not necessarily. (Some aphonic, adactyl downvoters seem to have possibly tried to nudge you into noticing that your idea above is against some entailed spirit of the guidelines.)
The poster may have meant that for the use natural to him, he feels in the results the same utility of discussing with a good animal. "Clarifying one's prompts" may be effective in some cases, but it's probably not what others seek. It is possible that many want the good old combination of "informative" and "insightful": in practice there may be issues with both.
> "Clarifying one's prompts" may be effective in some cases but it's probably not what others seek
It's not even that. Can the LLM run away, stop the conversation, or even say no? It's like your boss "talking" to you about the task and not giving you a chance to respond. Is that a talk? It's one-way.
E.g. ask the LLM who invented Wikipedia. It will respond with "facts". If I ask a friend, the reply might be "look it up yourself". That is a real conversation. Until then.
Even parrots and dogs can respond with something other than a forced reply shaped exactly how you need it.
> This is your interpretation of what these companies are saying. I'd love to see if some company specifically said anything like that?
What is the layman to make of the claim that we now have “reasoning” models? Certainly sounds like a claim of human-like cognition, even though the reality is different.
Studies have shown that corvids are capable of reasoning. Does that sound like a claim of human level cognition?
I think you’re going too far in imagining what one group of people will make of what another group of people is saying, without actually putting yourself in either group.
Much as I agree with the point about overhyping from companies, I'd be more sympathetic to this point of view if she acknowledged the merits of the technology.
Yes, it hallucinates and if you replace your brain with one of these things, you won't last too long. However, it can do things which, in the hands of someone experienced, are very empowering. And it doesn't take an expert to see the potential.
As it stands, it sounds like a case of "it's great in practice but the important question is how good it is in theory."
I use LLMs. They're somewhat useful if you're working on a non-niche problem. They're also useful instead of search engines, but that's because search has been enshittified more than because an LLM is better.
However 90% of the marketing material about them is simply disgusting. The bigwigs sound like they're spreading a new religion, and most enthusiasts sound like they're new converts to some sect.
If you're marketing it as a tool, fine. If you're marketing it as the third and fourth coming of $DEITY, get lost.
> I use LLMs. They're somewhat useful if you're working on a non-niche problem. They're also useful instead of search engines...
The problem for me is that I could use that type of assistance precisely when I hit that "niche problem" zone. Non-niche problems are usually already solved.
Like search. Popular search engines like Google and Bing are mostly garbage because they keep trying to shove gen AI in my face with made up answers. I have no such problems with my SearxNG instance.
> I could use that type of assistance precisely when I hit that "niche problem" zone
Tough luck. On the other hand, we're still justified in asking for money to do the niche problems with our fleshy brains, right? In spite of the likes of Altman saying every week that we'll be obsoleted in 5 years by his products. Like ... cold fusion? Always 5 years away?
[I have more hope for cold fusion than these "AIs" though.]
> Popular search engines like Google and Bing are mostly garbage because they keep trying to shove gen AI in my face with made up answers.
No, they became garbage significantly before "AI". Google, at least, has gradually reduced the number of results returned and expanded the search scope to the point that you want a reminder of the i2c API syntax on a Raspberry Pi and they return 20 beginner tutorial results that show you how to unpack the damn thing and do the first login instead.
I completely agree about the marketing material. I'm not sure about 90% but that's not something I have a strong opinion on. The stream from the bigwigs is the same song being played in a different tune and I'm inoculated to it.
I'm not marketing it. I'm not a marketer. I'm a developer trying to create an informed opinion on its utility and the marketing speak you criticize is far away from the truth.
The problem is this notion that it's just complete bullshit. The way it's worded irks me. "I genuinely don't understand...". It's quite easy to see the utility, and acknowledging that doesn't, in any way, detract from valid criticisms of the technology and the people who peddle it.
Exactly. It’s so strange to read so many comments that boil down to “because some marketing people are over-promising, I will retaliate by choosing to believe false things”
But it’s not the marketers building the products. This is like saying “because the car salesman lied about this Prius’ gas mileage, I’ll retaliate by refusing to believe hybrids are any better than pure ICE cars and will buy a pickup”.
It hurts nobody but the person choosing ignorance.
I hate to bring an ad hominem into this, but Sabine is a YouTube influencer now. That's her current career. So I'd assume this Tweet storm is also pushing a narrative on its own, because that's part of doing the work she chose to do to earn a living.
While true, I think this is more likely a question of framing or anchoring — I am continuously impressed and surprised by how good AI is, but I recognise all the criticisms she's making here. They're amazing, but at the same time they make very weird mistakes.
They actually remind me of myself, as I experience being a native English speaker now living in Berlin and attempting to use a language I mainly learned as an adult.
I can often appear competent in my use of the language, but then I'll do something stupid like asking someone in the office if we have a "Gabelstapler" I can borrow — Gabelstapler is "forklift truck", I meant to ask for a stapler, which is "Tacker" or "Hefter", and I somehow managed to make this mistake directly after carefully looking up the word. (Even this is a big improvement for me, as I started off like Officer Crabtree from Allo' Allo').
What you have done there is to discount statements that may build up a narrative - and still may remain fair... On which basis? Possibly they do not match your own narrative?
LLMs seem akin to parts of human cognition, maybe the initial fast-thinking bit when ideas pop up in a second or two. But any human writing a review with links to sources would look them up and check that they are the right ones, matching the initial idea. Current LLMs don't seem to do that, at least the ones Sabine complains about.
Akin to human cognition but still a few bricks short of a load, as it were.
You lay the rhetoric on so thick (“cash heads”, “pushing the narrative”, “corporate overlords”, “profit-making abomination”) that it’s hard to understand your actual claim.
Are you trying to say that LLMs are useful now but you think that will stop being the case at some point in the future?
Look man, and I'm saying this not to you but to everyone who is in this boat; you've got to understand that after a while, the novelty wears off. We get it. It's miraculous that some gigabytes of matrices can possibly interpret and generate text, images, and sound. It's fascinating, it really is. Sometimes, it's borderline terrifying.
But, if you spend too much time fawning over how impressive these things are, you might forget that something being impressive doesn't translate into something being useful.
Well, are they useful? ... Yeah, of course LLMs are useful, but we need to remain somewhat grounded in reality. How useful are LLMs? Well, they can dump out a boilerplate React frontend to a CRUD API, so I can imagine it could very well be harmful to a lot of software jobs, but I hope it doesn't bruise too many egos to point out that dumping out yet another UI that does the same thing we've done 1,000,000 times before isn't exactly novel. So it's useful for some software engineering tasks. Can it debug a complex crash? So far I'm around zero for ten, and believe me, I'm trying. From Claude 3.7 to Gemini 2.5, Cursor to Claude Code, it's really hard to get these things to work through a problem the way anyone above the junior dev level can. Almost invariably, they just keep digging themselves deeper until they eventually give up and try to null out the code so that the buggy code path doesn't execute.
So when Sabine says they're useless for interpreting scientific publications, I have zero trouble believing that. Scoring high on some shitty benchmarks whose solutions are in the training set is not akin to generalized knowledge. And these huge context windows sound impressive, but dump a moderately large document into them and it's often a challenge to get them to actually pay attention to the details that matter. The best shot you have by far is if the document you need it to reference definitely was already in the training data.
It is very cool and even useful to some degree what LLMs can do, but just scoring a few more points on some benchmarks is simply not going to fix the problems current AI architecture has. There is only one Internet, and we literally lit it on fire to try to make these models score a few more points. The sooner the market catches up to the fact that they ran out of Internet to scrape and we're still nowhere near the singularity, the better.
100% this. I think we should start producing independent evaluations of these tools for their usefulness, not for whatever made-up or convoluted evaluation index OpenAI, Google or Anthropic throw at us.
Hardly. I have pretty much been using LLMs at least weekly (most of the time daily) since GPT-3.5. I am still amazed. It's really, really hard for me not to be bullish.
It kinda reminds me of the days I learned the Unix-like command line. At least once a week, I shouted to myself: "What? There is a one-liner that does that? People use awk/sed/xargs this way??" That's how I feel about LLMs so far.
I tried LLMs for generating shell snippets. Mixed bag for me. They seem to have a hard time producing portable awk/sed commands. They also really overcomplicate things; you really don't need to break out awk for most simple file renaming tasks. For lesser-used utilities, all bets are off.
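For the renaming case, a couple of lines of stdlib Python already cover it (a minimal sketch, assuming the goal is just swapping a file extension):

    # Minimal sketch: rename every *.jpeg in the current directory to *.jpg.
    from pathlib import Path

    for path in Path(".").glob("*.jpeg"):
        path.rename(path.with_suffix(".jpg"))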
Yesterday Gemini 2.5 Pro suggested running "ps aux | grep filename.exe" to find a Wine process (pgrep is the much better way to go for that, but it's still wrong here) and get the PID, then pass that into "winedbg --attach" which is wrong in two different ways, because there is no --attach argument and the PID you pass into winedbg needs to be the Win32 one not the UNIX one. Not an impressive showing. (I already knew how to do all of this, but I was curious if it had any insights I didn't.)
For people with less experience I can see how getting e.g. tailored FFmpeg commands generated is immensely useful. On the other hand, I spent a decent amount of effort learning how to use a lot of these tools and for most of the ways I use them it would be horrific overkill to ask an LLM for something that I don't even need to look anything up to write myself.
Will people in the future simply not learn to write CLI commands? Very possible. However, I've come to a different, related conclusion: I think that these areas where LLMs really succeed in are examples of areas where we're doing a lot of needless work and requiring too much arcane knowledge. This counts for CLI usage and web development for sure. What we actually want to do should be significantly less complex to do. The LLM actually sort of solves this problem to the extent that it works, but it's a horrible kludge solution. Literally converting video files and performing basic operations on them should not require Googling reference material and Q&A websites for fifteen minutes. We've built a vastly overly complicated computing environment and there is a real chance that the primary user of many of the interfaces will eventually not even be humans. If the interface for the computer becomes the LLM, it's mostly going to be wasted if we keep using the same crappy underlying interfaces that got us into the "how do I extract tar file" problem in the first place.
They really don’t. People say this all the time, but you give any project a little time and it evolves into a special unique snowflake every single time.
That’s why every low code solution and boilerplate generator for the last 30 years failed to deliver on the promises they made.
I agree some will evolve into more, but lots of them won't. That's why Shopify, WordPress and others exist - most commercial websites are just online business cards or small shops. Designers and devs are hired to work on them all the time.
If you’re hiring a dev to work on your Shopify site, it’s most likely because you want to do something non-standard. By the time the dev gets done with it, it will be a special unique snowflake.
If your site has users, it will evolve. I’ve seen users take what was a simple trucking job posting form and repurpose an unused “trailer type” field to track the status of the job req.
Every single app that starts out as a low code/no code solution given enough time and users will evolve beyond that low code solution. They may keep using it, but they’ll move beyond being able to maintain it exclusively through a low code interface.
And most software engineering principles are about how to deal with this evolution.
- Architecture (making it easy to adjust part of the codebase and understanding it)
- Testing (making sure the current version works and future version won't break it)
- Requirements (describing the current version and the planned changes)
- ...
If a project was just a clone, I'm sure people would just buy the existing version and be done with it. And sometimes they do; then a unique requirement comes along and the whole process comes back into play.
If your job can be hollowed out into >90% entering prompts into AI text editors, you won't have to worry about continuing to be paid to do it every day for very long.
> Well, are they useful? ... Yeah, of course LLMs are useful, but we need to remain somewhat grounded in reality. How useful are LLMs?
They are useful enough that they can passably replace (much more expensive) humans in a lot of noncritical jobs, thus being a tangible tool for securing enterprise bottom lines.
From what I've seen in my own job and from observing what my wife does (she's been working with the things on very LLM-centric processes and products in a variety of roles for about three years), not a lot of people are able to use them to get even a small productivity boost. Anyone less than very capable trying to use them just makes a ton more work for someone more expensive than they are.
They're still useful, but they're not going to make cheap employees wildly more productive, and outside maybe a rare, perfect niche, they're not going to increase expensive employees' productivity so much that you can lay off a bunch of the cheap ones. Like, they're not even close to that, and haven't really been getting much closer despite improvements.
>they can dump out a boilerplate react frontend to a CRUD API
This is so clearly biased that it borders on parody. You can only get out what you put in.
The real use case of current LLMs is that any project that would previously have required collaboration can now be done solo with a much faster turnaround. Of course in 20 years when compute finally catches up they will just be super intelligent AGI
I have Cursor running on my machine right now. I am even paying for it. This is in part because no matter what happens, people keep professing, basically every single time a new model is released, that it has finally happened: programmers are finally obsolete.
Despite the ridiculous hype, though, I have found that these things have crossed into usefulness. I imagine for people with less experience, these tools are a godsend, enabling them to do things they definitely couldn't do on their own before. Cool.
Beyond that? I definitely struggle to find things I can do with these tools that I couldn't do better without. The main advantage so far is that these tools can do these things very fast and relatively cheaply. Personally, I would love to have a tool that I can describe what I want in detailed but plain English and have it be done. It would probably ruin my career, but it would be amazing for building software. It'd be like having an army of developers on your desktop computer.
But, alas, a lot of the cool shit I'd love to do with LLMs doesn't seem to pan out. They're really good at TypeScript and web stuff, but their proficiency definitely tapers off as you veer out. It seems to work best when you can find tasks that basically amount to translation, like converting between programming languages in a fuzzy way (e.g. trying to translate idioms). What's troubling me the most is that they can generate shitloads of code but basically can't really debug the code they write beyond the most entry-level problem-solving. Reverse engineering also seems like an amazing use case, but the implementations I've seen so far definitely are not scratching the itch.
> Of course in 20 years when compute finally catches up they will just be super intelligent AGI
I am betting against this. Not the "20 years" part, it could be months for all we know; but the "compute finally catches up" part. Our brains don't burn kilowatts of power to do what they do, yet given basically unbounded time and compute, current AI architectures are simply unable to do things that humans can, and there aren't many benchmarks that are demonstrating how absolutely cataclysmically wide the gap is.
I'm certain there's nothing magical about the meat brain, as much as that is existentially challenging. I'm not sure that this follows through to the idea that you could replicate it on a cluster of graphics cards, but I'm also not personally betting against that idea, either. On the other hand, getting the absurd results we have gotten out of AI models today didn't involve modest increases. It involved explosive investment in every dimension. You can only explode those dimensions out so far before you start to run up against the limitations of... well, physics.
Maybe understanding what LLMs are fundamentally doing to replicate what looks to us like intelligence will help us understand the true nature of the brain or of human intelligence, hell if I know, but what I feel most strongly about is this: I do not believe LLMs are replicating some portion of human intelligence. They are very obviously neither a subset nor a superset, nor particularly close to either. They are some weird entity that overlaps in ways we don't fully comprehend yet.
I see a difference between seeing them as valuable in their current state vs being "bullish about LLMs" in the stock market sense.
The big problem with being bullish in the stock market sense is that OpenAI isn't selling the LLMs that currently exist to their investors, they're selling AGI. Their pitch to investors is more or less this:
> If we accomplish our goal we (and you) will have infinite money. So the expected value of any investment in our technology is infinite dollars. No, you don't need to ask what the odds are of us accomplishing our goal, because any percent times infinity is infinity.
Since OpenAI and all the founders riding on their coat tails are selling AGI, you see a natural backlash against LLMs that points out that they are not AGI and show no signs of asymptotically approaching AGI—they're asymptotically approaching something that will be amazing and transformative in ways that are not immediately clear, but what is clear to those who are watching closely is that they're not approaching Altman's promises.
The AI bubble will burst, and it's going to be painful. I agree with the author that that is inevitable, and it's shocking how few people see it. But also, we're getting a lot of cool tech out of it and plenty of it is being released into the open and heavily commoditized, so that's great!
I think that people who don't believe LLMs to be AGI are not very good at Venn diagrams. Because they certainly are artificial, general, and intelligent according to any dictionary.
Good grief. You are deeply confused and/or deeply literal. That's not the accepted definition of AGI in any sense. One does not evaluate each word as an isolated component when testing the truth of a statement built from an open compound word. Does your "living room" have organs?
It is that, or you can't recognize a tongue-in-cheek comment on goalpost shifting. The wiki page you linked has the original definition of the term from 1997; dig it up. Better yet, look at the history of that page in the Wayback Machine and see with your own eyes how the ChatGPT release changed it.
For reference, 1997 original: By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed.
2014 wiki requirements: reason, use strategy, solve puzzles, and make judgments under uncertainty; represent knowledge, including commonsense knowledge; plan; learn; communicate in natural language; and integrate all these skills towards common goals.
No, it's really not. Joining words into a compound word enables the new compound to take on new meaning and evolve on its own, and if it becomes widely used as a compound it always does so. The term you're looking for if you care to google it is an "open compound noun".
A dog in the sun may be hot, but that doesn't make it a hot dog.
You can use a towel to dry your hair, but that doesn't make the towel a hair dryer.
Putting coffee on a dining room table doesn't turn it into a coffee table.
Spreading Elmer's glue on your teeth doesn't make it tooth paste.
The White House is, in fact, a white house, but my neighbor's white house is not The White House.
I could go on, but I think the above is a sufficient selection to show that language does not, in fact, work that way. You can't decompose a compound noun into its component morphemes and expect to be able to derive the compound's meaning from them.
You wrote so much while failing to read even this little:
> in most cases
What do you think will happen if we start comparing the lengths of the list ["hot dog", ...] and the list ["blue bird", "aeroplane", "sunny March day", ...]?
No, I read that, and it's wrong. Can you point me to a single compound noun that works that way?
A bluebird is a specific species. A blue parrot is not a bluebird.
An aeroplane is a vehicle that flies through the air at high speeds, but if you broke it down into morphemes and tried to reason it out that way you could easily argue that a two-dimensional flat surface that extends infinitely in all directions and intersects the air should count.
Sunny March day isn't a compound noun, it's a noun phrase.
Can you point me to a single compound noun (that is, a two-or-more-part word that is widely used enough to earn a definition in a dictionary, like AGI) that can be subjected to the kind of breaking apart into morphemes that you're doing without yielding obviously nonsensical re-interpretations?
I feel like LLMs are the same as the leap from "world before web search" to "world after web search." Yeah, in Google, you get crap links for sure, and you have to wade through salesy links and random blogs. But in the pre-web-search world, your options were generally "ask a friend who seems smart" or "go to the library for quite a while," AND BOTH OF THOSE OPTIONS HAD PLENTY OF ISSUES. I found a random part in an old arduino kit I bought years ago, and GPT-4o correctly identified it and explained exactly how to hook it up and code for it to me. That is frickin awesome, and it saves me a ton of time and leads me to reuse the part. I used DeepResearch to research car options that fit my exact needs, and it was 100% spot on - multiple people have suggested models that DeepResearch did not identify that would be a fit, but every time I dig in, I find that DeepResearch was right and the alternative actually had some dealbreaker I had specified. Etc., etc.
In the 90s, Robert Metcalfe infamously wrote "Almost all of the many predictions now being made about 1996 hinge on the Internet’s continuing exponential growth. But I predict the Internet, which only just recently got this section here in InfoWorld, will soon go spectacularly supernova and in 1996 catastrophically collapse." I feel like we are just hearing LLM versions of this quote over and over now, but they will prove to be equally accurate.
Generic. For the Internet, more complex questions would have been "What are the potential benefits, what the potential risks, what will grow faster" etc. The problem is not the growth but what that growth means. For LLMs, the big clear question is "will they stop just being LLMs, and when will they". Progress is seen, but we seek a revolution.
It would be fine if it were sold that way, but there is so much hype. We're told that it's going to replace all of us and put us all out of our jobs. They set the expectations so high. Like remember OpenAI showing a video of it doing your taxes for you? Predictions that super-intelligent AI is going to radically transform society faster than we can keep up? I think that's where most of the backlash is coming from.
> We're told that it's going to replace all of us and put us all out of our jobs.
I think this is the source of a lot of the hype. There are people salivating at the thought of no longer needing to employ the peasant class. They want it so badly that they'll say anything to get more investment in LLMs even if it might only ever allow them to fire a fraction of their workers, and even if their products and services suffer because the output they get with "AI" is worse than what the humans they throw away were providing.
They know they're overselling it, but they're also still on their knees praying that by some miracle their LLMs trained on the collective wisdom of facebook and youtube comments will one day gain actual intelligence and they can stop paying human workers.
In the meantime, they'll shove "AI" into everything they can think of for testing and refinement. They'll make us beta test it for them. They don't really care if their AI makes your customer service experience go to shit. They don't care if their AI screws up your bill. They don't care if their AI rejects your claims or you get denied services you've been paying for and are entitled to. They don't care if their AI unfairly denies you parole or mistakenly makes you the suspect of a crime. They don't care if Dr. Sbaitso 2.0 misdiagnoses you. Your suffering is worth it to them as long as they can cut their headcount by any amount and can keep feeding the AI more and more information because just maybe with enough data one day their greatest dream will become reality, and even if that never happens a lot of people are currently making massive amounts of money selling that lie.
The problem is that the bubble will burst eventually. The more time goes by and AI doesn't live up to the hype the harder that hype becomes to sell. Especially when by shoving AI into everything they're exposing a lot of hugely embarrassing shortcomings. Repeating "AI will happen in just 10 more years" gives people a lot of time to make money and cash out though.
On the plus side, we do get some cool toys to play with and the dream of replacing humans has sparked more interest in robotics so it's not all bad.
Yeah, it won't do your taxes for you, but it can sure help you do them yourself. Probably won't put you out of your job either, but it might help you accomplish more. Of course, one result of people accomplishing more in less time is that you need fewer people to do the same amount of work - so some jobs could be lost. But it's also possible that for the most part instead, more will be accomplished overall.
People frame that like it's something we gain, efficiency, as if before we were wasting time by thinking for ourselves. I get that they can do certain things better, I'm not sure that delegating to them is free of charge. We're paying something, losing something. Probably learning and fulfillment. We become increasingly dependent on machines to do anything.
Something important happened when we turned the tables around, I don't feel it gets the credit it should. It used to be humans telling machines what to do. Now we're doing the opposite.
And it might even be right and not get you in legal trouble! Not that you'd know (until audit day) unless you went back and did them as a verification though.
Except now, you can hire a competent professional accountant and discover on audit day that they got taken over by private equity, replaced 90% of the professionals doing work with AI and made a lot of money before the consequences become apparent.
Yes, but you're going to pay through the nose for the "wouldn't have to worry about legal trouble at all" (part of what you're paying for with professional services is a degree of protection from their fuckups).
So going back to apples-and-apples comparison, i.e. assuming that "spend a lot of money to get it done for you" is not on the table, I'd trust current SOTA LLM to do a typical person's taxes better than they themselves would.
I pay my accountant 500 USD to file my taxes. I don't consider that "through the nose" relative to my inflated tech salary.
If a person is making a smaller income their tax situation is probably very simple, and can be handled by automated tools like TurboTax (as the sibling comment suggests).
I don't see a lot of value add from LLMs in this particular context. It's a situation where small mistakes can result in legal trouble or thousands of dollars of losses.
I'm on a financial forum where people often ask tax questions, generally _fairly_ simple questions. An obnoxious recent trend on many forums, including this one, is idiots feeding questions into a magic robot and posting what it says as a response. Now, ChatGPT may be very good at, er, something, I dunno, I am assured that it has _some_ use by the evangelists, but it is not good at tax, and if people follow many of the answers it gives then they are likely to get in trouble.
If a trillion-parameter model can't handle your taxes, that to me says more about the tax code than the AI code.
People who paste undisclosed AI slop in forums deserve their own place in hell, no argument there. But what are some good examples of simple tax questions where current models are dangerously wrong? If it's not a private forum, can you post any links to those questions?
So, a super-basic one I saw recently, in relation to Irish tax. In Ireland, ETFs are taxed differently to normal stocks (most ETFs available here are accumulating, they internally re-invest dividends; this is uncommon for US ETFs for tax reasons). Normal stocks have gains taxed under the capital gains regime (33% on gains when you sell). ETFs are different; they're taxed 40% on gains when you sell, and they are subject to 'deemed disposal'; every 8 years, you are taxed _as if you had sold and re-bought_. The ostensible reason for this is to offset the benefit from untaxed compounding of dividends.
Anyway, the magic robot 'knew' all that. Where it slipped up was in actually _working_ with it. Someone asked for a comparison of taxation on a 20 year investment in individual stocks vs ETFs, assuming re-investment of dividends and the same overall growth rate. The machine happily generated a comparison showing individual stocks doing massively better... On closer inspection, it was comparing growth for 20 years for the individual stocks to growth of 8 years for the ETFs. (It also got the marginal income tax rate wrong.)
But the nonsense it spat out _looked_ authoritative on first glance, and it was a couple of replies before it was pointed out that it was completely wrong. The problem isn't that the machine doesn't know the rules; insofar as it 'knows' anything, it knows the rules. But it certainly can't reliably apply them.
(I'd post a link, but they deleted it after it was pointed out that it was nonsense.)
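For what it's worth, the apples-to-apples comparison the machine failed to produce is easy enough to sketch in code. The numbers below are illustrative assumptions only (a 10,000 lump sum, 7% annual growth), offset credits and other subtleties are ignored, and the rates are the 33% CGT and 40% exit tax with 8-year deemed disposal described above; this is not tax advice.

    # Rough illustration: same 20-year horizon for both, dividends reinvested.
    # Stocks: 33% CGT on the gain, paid once at sale in year 20.
    # ETF: 40% exit tax, with deemed disposal every 8 years (tax paid out of
    # the pot at years 8 and 16, basis reset; offset credits ignored).

    def stocks_after(principal, rate, years, cgt=0.33):
        value = principal * (1 + rate) ** years
        return value - cgt * (value - principal)

    def etf_after(principal, rate, years, exit_tax=0.40, dd_every=8):
        value = basis = principal
        for year in range(1, years + 1):
            value *= 1 + rate
            if year % dd_every == 0:               # deemed disposal event
                value -= exit_tax * (value - basis)
                basis = value
        return value - exit_tax * (value - basis)  # actual sale at the end

    print(f"stocks after 20y: {stocks_after(10_000, 0.07, 20):,.0f}")
    print(f"ETF after 20y:    {etf_after(10_000, 0.07, 20):,.0f}")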
Interesting, thanks. That doesn't seem like an entirely simple question, but it does demonstrate that the model is still not great at recognizing when it is out of its league and should either hedge its answer, refuse altogether, or delegate to an appropriate external tool.
This failure seems similar to a case that someone brought up earlier ( https://news.ycombinator.com/item?id=43466531 ). While better than expected at computation, the transformer model ultimately overestimates its own ability, running afoul of Dunning-Kruger much like humans tend to.
Replying here due to rate-limiting:
One interesting thing is that when one model fails spectacularly like that, its competitors often do not. If you were to cut/paste the same prompt and feed it to o1-pro, Claude 3.7, and Gemini 2.5, it's possible that they would all get it wrong (after all, I doubt they saw a lot of Irish tax law during training.) But if they do, they will very likely make different errors.
Unfortunately it doesn't sound like that experiment can be run now, but I've run similar tests often enough to tell me that wrong answers or faulty reasoning are more likely model-specific shortcomings rather than technology-specific shortcomings.
That's why I get triggered when people speak authoritatively on here about what AI models "can't do" or "will never be able to do." These people have almost always, almost without exception, been proven dead wrong in the past, but that never seems to bother them.
It's the sort of mistake that it's hard to imagine a human making, is the thing. Many humans might have trouble compounding at all, but the 20 year/8 year confusion just wouldn't happen. And I think it is on the simple side of tax questions (in particular all the _rules_ involved are simple, well-defined, and involve no ambiguity or opinion; you certainly can't say that of all tax rules). Tax gets _complicated_.
This reminds me of the early days of Google, when people who knew how to phrase a query got dramatically better results than those who basically just entered what they were looking for as if asking a human.
And indeed, phrasing your prompts is important here too, but I mean more that by having a bit of an understanding of how it works and how it differs from a human, you can avoid getting sucked in by most of these gaps in its abilities, while benefitting from what it's good at. I would ask it the question about the capital gains rules (and would verify the response probably with a link I'd ask it to provide), but I definitely wouldn't expect it to correctly provide a comparison like that. (I might still ask, but would expect to have to check its work.)
Forget OpenAI ChatGPT doing your taxes for you. Now Gemini will write up your sales slides about Gouda cheese, stating wrongly in the process that gouda makes up about 50% of all cheese consumption worldwide :) These use-cases are getting more useful by the day ;)
I mean, it's been like 3 years. 3 years after the web came out was barely anything. 3 years after the first GPU was cool, but not that cool. The past three years in LLMs? Insane.
Things could stall out and we'll have bumps and delays ... I hope. If this thing progresses at the same pace, or speeds up, well ... reality will change.
Or not. Even as they are, we can build some cool stuff with them.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly?
The trouble is that, while incredibly amazing, mind blowing technology, it falls down flat often enough that it is a big gamble to use. It is never clear, at least to me, what it is good at and what it isn't good at. Many things I assume it will struggle with, it jumps in with ease, and vice versa.
As the failures mount, I admittedly do find it becoming harder and harder to compel myself to see if it will work for my next task. It very well might succeed, but by the time I go to all the trouble to find out it often feels that I may as well just do it the old fashioned way.
If I'm not alone, that could be a big challenge in seeing long-term commercial success. Especially given that commercial success for LLMs is currently defined as 'take over the world' and not 'sustain mom and pop'.
> the speed at which it is progressing is insane.
But the same goes for the users! As a result, the failure rate appears to be closer to a constant. Until we reach the end of human achievement, where humans can no longer think of new ways to use LLMs, that is unlikely to change.
It's becoming clear to me that some people just have vastly different uses and use cases than I do. Summarizing a deep, cutting-edge physics paper is, I'm sure, vastly different from summarizing a web page while I'm browsing HN, or writing a Python plugin for Icinga to monitor a web endpoint that spits out JSON.
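(For reference, the Icinga plugin I mean is roughly this sort of thing; a minimal sketch with a made-up endpoint and field name, using the usual Nagios/Icinga exit-code convention.)

    #!/usr/bin/env python3
    # Minimal sketch of a Nagios/Icinga-style check: fetch a JSON endpoint
    # and map its "status" field (hypothetical) to the conventional exit
    # codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    import json
    import sys
    import urllib.request

    URL = "http://localhost:8080/health"  # made-up endpoint

    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            data = json.load(resp)
    except Exception as exc:
        print(f"UNKNOWN - could not fetch {URL}: {exc}")
        sys.exit(3)

    status = str(data.get("status", "")).lower()
    code = {"ok": 0, "warning": 1, "critical": 2}.get(status, 3)
    labels = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]
    print(f"{labels[code]} - endpoint reported status={status!r}")
    sys.exit(code)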
The author says they use several LLMs every day and they always produce incorrect results. That "feels" weird, because it seems like you'd develop an intuition fairly quickly for the kinds of questions you'd ask that LLMs can and can't answer. If I want something with links to back up what is being said, I know I should ask Perplexity or maybe just ask a long-form prompt-like question of Google or Kagi. If I want a Python or bash program I'm probably going to ask ChatGPT or Gemini. If I want to work on some code I want to be in Cursor and am probably using Claude. For general life questions, I've been asking Claude and ChatGPT.
Running into the same issue with LLMs over and over for years, with all due respect, seems like the "doing the same thing and expecting different results" situation.
This is so true. I really hope she joins this conversation so we can have a productive discussion and understand what she's actually hoping to achieve.
The two sides are never going to understand each other because I suspect we work on entirely different things and have radically different workflows. I suspect that hackernews gets more use out of LLMs in general than the average programmer because they are far more likely to be at a web startup and more likely to actually be bottlenecked on how fast you can physically put more code in the file and ship sooner.
If you work on stuff that is at all niche (as in, stack overflow was probably not going to have the answer you needed even before LLMs became popular), then it's not surprising when LLMs can't help because they've not been trained.
For people that were already going fast and needed or wanted to put out more code more quickly, I'm sure LLMs will speed them up even more.
For those of us working on niche stuff, we weren't going fast in the first place or being judged on how quickly we ship in all likelihood. So LLMs (even if they were trained on our stuff) aren't going to be able to speed us up because the bottleneck has never been about not being able to write enough code fast enough. There are architectural and environmental and testing related bottlenecks that LLMs don't get rid of.
That's a good point, I've personally not got much use out of LLMs (I use them to generate fantasy names for my D&D campaign, but find they fall down for anything complex) - but I've also never got much use out of StackOverflow either.
I don't think I'm working on anything particularly niche, but nor is it cookie-cutter generic either, and that could be enough to drastically reduce their utility.
Because it has a sample of our collective human knowledge and language big enough to trick our brains into believing that.
As a parallel thought, it reminds me of a trick Derren Brown did. He picked every horse correctly across 6 races. The person he was picking for was obviously stunned, as were the audience watching it.
The reality of course is just that people couldn't comprehend that he just had to go to extreme and tedious lengths to make this happen. They started with 7000 people and filmed every one like it was going to be the "one" and then the probability pyramid just dropped people out. It was such a vast undertaking of time and effort that we're biased towards believing there must be something really happening here.
LLMs currently are a natural language interface to a Microsoft Encarta like system that is so unbelievably detailed and all encompassing that we risk accepting that there's something more going on there. There isn't.
Again, it's not intelligence. It's a mirror that condenses our own intelligence and reflects back to us using probabilities at a scale that tricks us into the notion there is something more than just a big index and clever search interface.
There is no meaningful interpretation of the word intelligence that applies, psychologically or philosophically, to what is going on. Machine Learning is far more apt and far less misleading.
I saw the transition from ML to AI happen in academic papers and then pitch decks in real time. It was to refill the well when investors were losing faith that ML could deliver on the promises. It was not progress driven.
this doesn't make any more sense than calling LLMs "intelligence". There is no "our intelligence" beyond a concept or an idea that you or someone else may have about the collective, which is an abstraction.
What we do each have is our own intelligence, and that intelligence is, and likely always will be, no matter how science progresses, ineffable. So my point is you can't say your made-up/ill-defined concept is any more real than any other made-up/ill-defined concept.
It really depends on the task. Like Sabine, I’m operating on the very frontier of a scientific domain that is extremely niche. Every single LLM out there is worse than useless in this domain. It spits out incomprehensible garbage.
But ask it to solve some leet code and it’s brilliant.
The question I ask afterwards, then, is: is solving some leet code brilliant? Is designing a simple inventory system brilliant if they've all been accomplished already? My answer tends towards no, since they still make mistakes in the process, and it keeps newer developers from learning.
I should start collecting examples, if only for threads like this. Recently I tried to llm a tsserver plugin that treats lines ending with "//del" as empty. You can only imagine all the sneaky failures in the chat and the total uselessness of these results.
Anything that does not appear literally millions (billions?) of times in the training set is doomed to be fantasized about by an LLM. In various ways, tones, etc. After many such threads I came to the conclusion that people who find it mostly useful are simply treading water, as they probably have done for most of their career. Their average product is a React form with a CRUD endpoint, and excitement about it. I can't explain their success reports otherwise, because it rarely works on anything beyond that.
Welcome to the new digital divide people, and the start of a new level of "inequality" in this world. This thread is proof that we've diverged and there is a huge subset of people that will not have their minds changed easily.
Hallucinating incorrect information is worse than useless. It is actively harmful.
I wonder how much this affects our fundraising, for example. No VC understands the science here, so they turn to advisors (which is great!) or to LLMs… which has us starting off on the wrong foot.
I work in a field that is not even close to a scientific niche - software reverse engineering - and LLMs will happily lie to me all the time, for every question I have. I find them useful for generating some initial boilerplate but... that's it. AI autocompletion saved me an order of magnitude more time, and nobody is hyped about it.
Sabine is Lex Fridman for women. Stay in your lane of quantum physics and stop trying to opine on LLMs. I'm tired of seeing the huge amount of FUD from her.
Two things can be true: e.g., that LLMs are incredible tech we only dreamed of having, and that they’re so flawed that they’re hard to put to productive use.
I just tried to use the latest Gemini release to help me figure out how to do some very basic Google Cloud setup. I thought my own ignorance in this area was to blame for the 30 minutes I spent trying to follow its instructions - only to discover that Gemini had wildly hallucinated key parts of the plan. And that’s Google’s own flagship model!
I think it’s pretty telling that companies are still struggling to find product-market fit in most fields outside of code completion.
> Wah, it can't write code like a Senior engineer with 20 years of experience!
No, that's not my problem with it. My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.
Sure, sometimes it produces useful code. And often, it'll simply call the "doTheHardPart()" method. I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over.
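For reference, what I was asking for is tiny; the distinguishing move of a selection sort is selecting the minimum of the unsorted suffix, rather than repeatedly swapping adjacent pairs the way bubble sort does. A minimal sketch:

    def selection_sort(items):
        # In-place selection sort: find the minimum of the unsorted suffix
        # and swap it to the front of that suffix. (Bubble sort, by contrast,
        # swaps adjacent out-of-order pairs on every pass.)
        for i in range(len(items) - 1):
            smallest = i
            for j in range(i + 1, len(items)):
                if items[j] < items[smallest]:
                    smallest = j
            items[i], items[smallest] = items[smallest], items[i]
        return items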
Outside of programming, this is much worse. I've both seen online and heard people quote LLM output as if it were authoritative. That, to me, is the bigger danger of LLMs to society. People just don't understand that LLMs aren't high-powered attorneys or world-renowned doctors. And, unfortunately, the incorrect perception of LLMs is being hyped both by LLM companies and by "journalists" who are all too ready to simply run with and discuss the press releases from said LLM companies.
> What's worse, people are treating them as authoritative. … I've both seen online and heard people quote LLM output as if it were authoritative.
That's not an LLM problem. But it is indeed quite bothersome. Don't tell me what ChatGPT told you. Tell me what you know. Maybe you got it from ChatGPT and verified it. Great. But my jaw kind of drops when people cite an LLM and just assume it's correct.
It might not be an LLM problem, but it’s an AI-as-product problem. I feel like every major player’s gamble is that they can cement distinct branding and model capabilities (as perceived by the public) faster than the gradual calcification of public AI perception catches up with model improvements - every time a consumer gets burned by AI output in even small ways, the “AI version of Siri/Alexa only being used for music and timers” problem looms a tiny, tiny bit larger.
These are not similar. Wikipedia says the same thing to everybody, and when what it says is wrong, anybody can correct it, and they do. Consequently it's always been fairly reliable.
Lies and mistakes persist on Wikipedia for many years. They just need to sound truthy so they don't jump out at Wikipedia power users who aren't familiar with the subject. I've been keeping tabs on one for about five years, and it's several years older than that, which I won't correct because I am IP range banned and I don't feel like making an account and dealing with any basement-dwelling power editor NEETs who read Wikipedia rules and processes for fun. I know I'm not the only one to have noticed, because this glaring error isn't in a particularly obscure niche; it's in the article for a certain notorious defense initiative which has been in the news lately, so this error has plenty of eyes on it.
In fact, the error might even be a good thing; it reminds attentive readers that Wikipedia is an unreliable source and you always have to check if citations actually say the thing which is being said in the sentence they're attached to.
That's true too, but the bigger difference from my point of view is that factual errors in Wikipedia are relatively uncommon, while, in the LLM output I've been able to generate, factual errors vastly outnumber correct facts. LLMs are fantastic at creativity and language translation but terrible at saying true things instead of false things.
Comments like these honestly make me much more concerned than LLM hallucinations. There have been numerous times when I've tracked down the source for a claim, only to find that the source was saying something different, or that the source was completely unreliable (sometimes on the crackpot level).
Currently, there's a much greater understanding that LLMs are unreliable. Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.
The big problem is that people in general are terrible at exercising critical thinking when they're presented with information. It's probably less of an issue with LLMs at the moment, since they're new technology and a certain amount of skepticism gets applied to their output. But the issue is that once people have gotten more used to them, they'll turn off their critical thinking in the same manner that they turn it off when absorbing information from other sources that they're used to.
Wikipedia is fairly reliable if our standard isn't a platonic ideal of truth but real-world comparators. Reminds me of Kant's famous line. "From the crooked timber of humankind, nothing entirely straight can be made".
The sell of Wikipedia was never "we'll think so you don't have to", it was never going to disarm you of your skepticism and critical thought, and you can actually check the sources. LLMs are sold as "replace knowledge work(ers)", you cannot check their sources, and the only way you can check their work is by going to something like Wikipedia. They're just fundamentally different things.
> The sell of Wikipedia was never "we'll think so you don't have to", it was never going to disarm you of your skepticism and critical thought, and you can actually check the sources.
You can check them, but Wikipedia doesn't care what they say. When I checked a citation on the French Toast page, and noted that the source said the opposite of what Wikipedia did by annotating that citation with [failed verification], an editor showed up to remove that annotation and scold me that the only thing that mattered was whether the source existed, not what it might or might not say.
I feel like I hear a lot of criticism about Wikipedia editors, but isn't Wikipedia overall pretty good? I'm not gonna defend every editor action or whatever, but I think the product stands for itself.
Wikipedia is overall pretty good, but it sometimes contains erroneous information. LLMs are overall pretty good, but they sometimes contain erroneous information.
The weird part is when people get really concerned that someone might treat the former as a reliable source, but then turn around and argue that people should treat the latter as a reliable source.
I had a moment of pique where I was just gonna copy-paste my reply to this rehash of your original point, which is non-responsive to what I wrote, but I've restrained myself. Instead, I will link to the Wikipedia article for Equivocation [0] and ChatGPT's answer to "are wikipedia and LLMs alike?"
Wikipedia occasionally has errors, which are usually minor. The LLMs I've tried occasionally get things right, but mostly emit limitless streams of plausible-sounding lies. Your comment paints them as much more similar than they are.
In my experience, it's really common for wikipedia to have errors, but it's true that they tend to be minor. And yes, LLMs mostly just produce crazy gibberish. They're clearly worse than wikipedia. But I don't think wikipedia is meeting a standard it should be proud of.
> Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.
One of these things is not like the others! Almost always, when I see somebody claiming Wikipedia is wrong about something, it's because they're some kind of crackpot. I find errors in Wikipedia several times a year; probably the majority of my contribution history to Wikipedia https://en.wikipedia.org/wiki/Special:Contributions/Kragen consists of me correcting errors in it. Occasionally my correction is incorrect, so someone corrects my correction. This happens several times a decade.
By contrast, I find many YouTube videos and studies from advocacy groups to be full of errors, and there is no mechanism for even the authors themselves to correct them, much less for someone else to do so. (I don't know enough about posts on AskHistorians to comment intelligently, but I assume that if there's a major factual error, the top-voted comments will tell you so—unlike YouTube or advocacy-group studies—but minor errors will generally remain uncorrected; and that generally only a single person's expertise is applied to getting the post right.)
But none of these are in the same league as LLM output, which in my experience usually contains more falsehoods than facts.
> Currently, there's a much greater understanding that LLM's are unreliable.
Wikipedia being world-editable and thus unreliable has been beaten into everyone's minds for decades.
LLMs just popped into existence a few years ago, backed by much hype and marketing about "intelligence". No, normal people you find on the street do not in fact understand that they are unreliable. Watch some less computer literate people interact with ChatGPT - it's terrifying. They trust every word!
If you read a non-fiction book on any topic, you can probably assume that half of the information in it is just extrapolated from the author's experience.
Even scientific articles are full of inaccurate statements; the only thing you can somewhat trust are the narrow questions answered by the data, which is usually a small effect that may or may not be reproducible...
No, different media are different—or, better said, different institutions are different, and different media can support different institutions.
Nonfiction books and scientific papers generally only have one person, or at best a dozen or so (with rare exceptions like CERN papers), giving attention to their correctness. Email messages and YouTube videos generally only have one. This limits the expertise that can be brought to bear on them. Books can be corrected in later printings, an advantage not enjoyed by the other three. Email messages and YouTube videos are usually displayed together with replies, but usually comments pointing out errors in YouTube videos get drowned in worthless me-too noise.
But popular Wikipedia articles are routinely corrected by hundreds or thousands of people, all of whom must come to a rough consensus on what is true before the paragraph stabilizes.
Consequently, although you can easily find errors in Wikipedia, they are much less common in these other media.
Yes, though by different degrees. I wouldn't take any claim I read on Wikipedia, got from an LLM, saw in a AskHistorians or Hacker News reply, etc., as fact, and I would never use any of those as a source to back up or prove something I was saying.
Newspaper articles? It really depends. I wouldn't take paraphrased quotes or "sources say" as fact.
But as you move to generally more reliable sources, you also have to be aware that they can mislead in different ways, such as constructing the information in a particular way to push a particular narrative, or leaving out inconvenient facts.
And that is still accurate today. Information always contains a bias from the narrator's perspective. Having multiple sources allows one to triangulate the accuracy of information. Making people use one source of information would allow the business to control the entire narrative. It's just more of a business around people and sentiments than being bullish on science.
And they were right, right? They recognized it had structural faults that made it possible for bad data to seep in. The same is true for LLMs: they have structural faults.
So what is your point? You seem to have placed assumptions there. And broad ones, so that differences between the two things, and complexities, the important details, do not appear.
It is, if the purpose of LLMs was to be AI. "Large language model" as a choir of pseudorandom millions converged into a voice - that was achieved, but it is by definition out of the professional realm. If it is to be taken as "artificial intelligence", then it has to have competitive intelligence.
> But my jaw kind of drops when people cite an LLM and just assume it’s correct.
Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor, so is it totally their fault?
They've heard about the uncountable sums of money spent on creating such software, why would they assume it was anything short of advertised?
> Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor
Why does this imply that they’re always correct? I’m always genuinely confused when people pretend like hallucinations are some secret that AI companies are hiding. Literally every chat interface says something like “LLMs are not always accurate”.
> Literally every chat interface says something like “LLMs are not always accurate”.
In small, de-emphasized text, relegated to the far corner of the screen. Yet, none of the TV advertisements I've seen have spent any significant fraction of the ad warning about these dangers. Every ad I've seen presents someone asking a question to the LLM, getting an answer and immediately trusting it.
So, yes, they all have some light-grey 12px disclaimer somewhere. Surprisingly, that disclaimer does not carry nearly the same weight as the rest of the industry's combined marketing efforts.
> In small, de-emphasized text, relegated to the far corner of the screen.
I just opened ChatGPT.com and typed in the question “When was Mr T born?”.
When I got the answer there were these things on screen:
- A menu trigger in the top-left.
- Log in / Sign up in the top right
- The discussion, in the centre.
- A T&Cs disclaimer at the bottom.
- An input box at the bottom.
- “ChatGPT can make mistakes. Check important info.” directly underneath the input box.
I dislike the fact that it’s low contrast, but it’s not in a far corner, it’s immediately below the primary input. There’s a grand total of six things on screen, two of which are tucked away in a corner.
This is a very minimal UI, and they put the warning message right where people interact with it. It’s not lost in a corner of a busy interface somewhere.
Maybe it's just down to different screen sizes, but when I open a new chat in ChatGPT, the prompt is in the center of the screen, and the disclaimer is quite a distance away at the very bottom of the screen.
Though, my real point is we need to weigh that disclaimer, against the combined messaging and marketing efforts of the AI industry. No TV ad gives me that disclaimer.
Then we can look at people's behavior. Look at the (surprisingly numerous) cases of lawyers getting taken to the woodshed by a judge for submitting filings to a court with ChatGPT-introduced fake citations! Or someone like Ana Navarro confidently repeating an incorrect fact, and when people pushed back, saying "take it up with chat GPT" (https://x.com/ananavarro/status/1864049783637217423).
I just don't think the average person who isn't following this closely understands the disclaimer. Hell, they probably don't even really read it, because most people skip over de-emphasized text in most UIs.
So, in my opinion, whether it's right next to the text-box or not, the disclaimer simply cannot carry the same amount of cultural impact as the "other side of the ledger" that are making wild, unfounded claims to the public.
Unfortunately they are trained first and foremost as plausibility engines. The central dogma is that plausibility will (with continuing progress & scale) converge towards correctness, or "faithfulness" as it's sometimes called in the literature.
This remains very far from proven.
The null hypothesis that would be necessary to reject, therefore, is a most unfortunate one, viz. that by training for plausibility we are creating the world's most convincing bullshit machines.
> plausibility [would] converge towards correctness
That is the most horribly dangerous idea, as we demand that the agent guesses not, even - and especially - when the agent is a champion at guessing - we demand that the agent checks.
If G guesses from the multiplication table with remarkable success, we more strongly demand that G computes its output accurately instead.
Oracles that, out of extraordinary average accuracy, people may forget are not computers, are dangerous.
One man's "plausibility" is another person's "barely reasoned bullshit". I think you're being generous, because LLMs explicitly don't deal in facts, they deal in making stuff up that is vaguely reminiscent of fact. Only a few companies are even trying to make reasoning (as in axioms-cum-deductions, i.e., logic per se) a core part of the models, and they're really struggling to hand-engineer the topology and methodology necessary for that to work roughly as facsimile of technical reasoning.
I’m not really being generous. I merely think if I’m gonna condemn something as high-profile snake oil for the tragically gullible, it’s helpful to have a solid basis for doing so. And it’s also important to allow oneself to be wrong about something, however remote the possibility may currently seem, and preferably without having to revise one’s principles to recognise it.
As a sort of related anecdote... if you remember the days before Google, people sitting around your dinner table arguing about stuff used to spew all sorts of bullshit, then drop that they have a degree from XYZ university, and win the argument... When Google/Wikipedia came around, it turned out those people were in fact just spewing bullshit. I'm sure there was some damage, but it feels like a similar thing. Our "bullshit radar" seems to be able to adapt to these sorts of things.
Well, conspiracy theories are thriving in this day and age, even with access to technology and information at one's fingertips. Add to that a US administration now effectively spewing bullshit every few minutes.
The best example of this was an argument I had a little while ago where I was talking about self driving, and I mentioned that I have a hard time trusting any system relying only on cameras, to which I was told that I didn't understand how machine learning works, that obviously they were correct and I was wrong, and that every car would be self driving within 5 years. All of these things could easily be verified independently.
Suffice it to say that I am not sure the "bullshit radar" is that adaptive...
Mind you, this is not limited to the particular issue at hand, but I think those situations need to be highlighted, because we get fooled easily by authoritative delivery...
Language models are closing the gaps that still remain at an amazing rate. There are still a few gaps, but consider what has happened just in the last year, and extrapolate 2-3 years out....
Some people trust Alex Jones, while the vast majority realize that he just fabricates untruths constantly. Far fewer people realize that LLMs do the same.
People know that computers are deterministic, but most don't realize that determinism and accuracy are orthogonal. Most non-IT people give computers authoritative deference they do not deserve. This has been a huge issue with things like Shot Spotter, facial recognition, etc.
I think you are discounting the fact that you can weed out people who make a habit of that, but you can't do that with LLMs if they are all doing that.
One thing I see a lot on X is people asking Grok what movie or show a scene is from.
LLMs must be really, really bad at this because not only is it never right, it actually just makes something up that doesn't exist. Every, single, time.
I really wish it would just say "I'm not good at this, so I do not know."
When your model of the world is built on the relative probabilities of the next opaque, apparently arbitrary number in the context of prior opaque, apparently arbitrary numbers, it must be nearly impossible to tell the difference between “there are several plausible ways to proceed, many of which the user will find useful or informative, and I should pick one” and “I don’t know”. Attempting to adjust to allow for the latter probably tends to make the things output “I don’t know” all the time, even when the output they’d have otherwise produced would have been good.
I thought about this of course, and I think a reasonable 'hack' for now is to more or less hardcode things that your LLM sucks at, and override it to say it doesn't know. Because continually failing at basic tasks is bad for confidence in said product.
I mean, it basically does the same thing if you ask it to do anything racist or offensive, so that override ability is obviously there.
So if it identifies the request as identifying a movie scene, just say 'I don't know', for example.
Hardcode by whom? Who do we trust with this task to do it correctly? Another LLM that suffers from the same fundamental flaw or by a low paid digital worker in a developing country? Because that's the current solution. And who's gonna pay for all that once the dumb investment money runs out, who's gonna stick around after the hype?
By the LLM team (Grok team, in this case). I don't mean for the LLM to be sentient enough to know it doesn't know the answer, I mean for the LLM to identify what is being asked of it, and checking to see if that's something on the 'blacklist of actions I cannot do yet', said list maintained by humans, before replying.
No different than when asking ChatGPT to generate images or videos or whatever before it could, it would just tell you it was unable to.
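To make that concrete, here is a minimal sketch of the kind of human-maintained gate being described. Everything in it (the category names, the classify_request helper, the refusal wording) is hypothetical illustration, not any vendor's actual pipeline:

```python
# Hypothetical sketch of a human-maintained "can't do this yet" gate in front of a model.
# The categories, the classifier, and the refusal text are made up for illustration.

KNOWN_BAD_CATEGORIES = {
    "identify_movie_scene",    # e.g. "what movie/show is this scene from?"
    "identify_song_from_clip",
}

def classify_request(user_message: str) -> str:
    """Cheap stand-in classifier; in practice this could be rules or a small model."""
    text = user_message.lower()
    if "what movie" in text or "what show" in text:
        return "identify_movie_scene"
    return "general"

def answer(user_message: str, call_model) -> str:
    """Refuse up front for request types humans have flagged as unreliable."""
    if classify_request(user_message) in KNOWN_BAD_CATEGORIES:
        return "I'm not good at this kind of request, so I don't know."
    return call_model(user_message)
```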
> It's impossible to predict with certainty who will be the U.S. President in 2046. The political landscape can change significantly over time, and many factors, including elections, candidates, and events, will influence the outcome. The next U.S. presidential election will take place in 2028, so it would be difficult to know for sure who will hold office nearly two decades from now.
I can do this because it is in fact the most likely thing to continue with, word by word.
But the most likely way to continue a paper is not to say "I don't know" at the end. It is actually to provide sources, which it proceeds to do wrongly.
>> We need an AI technology that can output "don't know" when appropriate. How's that coming along?
Heh. Easiest answer in the world. To be able to say "don't know", one has first to be able to "know". And we ain't there yet, not by a long shot. Not even within a million miles of it.
Needs meta-annotation of certainty on all nodes and tokens that accumulates while reasoning. Also gives the ability to train in beliefs, as in overriding any uncertainty. Right now we are in the pure-belief phase. AI is its own god right now: pure blissful belief without the sin of doubt.
Sure we have. We don't have a perfect solution but it's miles better than what we have for LLMs.
If a lawyer consistently makes stuff up on legal filings, in the worst cases they can lose their license (though they'll most likely end up getting fines).
If a doctor really sucks, they become uninsurable and ultimately could lose their medical license.
Devs that don't double-check their work will cause havoc with the product, and not only will they earn low opinions from their colleagues, they could face termination.
This exists: each next token has a probability assigned to it. High probability means "it knows"; if there are two or more tokens of similar probability, or the probability of the top token is low in general, then you can be less confident about that datum.
Of course there are areas where there's more than one possible answer but both possibilities are very consistent. I feel LLMs (ChatGPT) handle this fine.
Also, can we stop pretending with the generic name for ChatGPT? It's like calling Viagra "sildenafil" instead of Viagra. Cut it out: there's the real deal and there are imitations.
> low in general, then you are less confident about that datum
It’s very rarely clear or explicit enough when that’s the case. Which makes sense considering that the LLMs themselves do not know the actual probabilities
Maybe this wasn't clear, but the probabilities are a low-level variable that may not be exposed in the UI; it IS exposed through the API as logprobs in the ChatGPT API. And of course, if you have the weights themselves, as with a Llama model, you have even deeper access to this p variable.
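For what it's worth, a minimal sketch of reading those per-token log-probabilities through the OpenAI Python SDK, assuming the current Chat Completions interface (the model name is just an example):

```python
import math
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "When was Mr T born?"}],
    logprobs=True,
    top_logprobs=3,
)

# Each generated token comes back with its log-probability and the runner-up tokens,
# which is the raw confidence signal described above.
for tok in resp.choices[0].logprobs.content:
    p = math.exp(tok.logprob)
    alternatives = [(alt.token, round(math.exp(alt.logprob), 3)) for alt in tok.top_logprobs]
    print(f"{tok.token!r}: p={p:.2f}, alternatives={alternatives}")
```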
> it IS exposed through the API as logprobs in the ChatGPT API
Sure but they often are not necessarily easily interpretable or reliable.
You can use it to compare a model’s confidence of several different answers to the same question but anything else gets complicated and not necessarily that useful.
This is very subjective, but I feel they are all imitators of ChatGPT. I also contend that the ChatGPT API (and UI) will become, or has already become, a de facto standard in the same manner that Intel's 8086 instruction set evolved into x86.
How many companies train on data that contains "I don't know" responses? Have you ever talked with a toddler / young child? You need to explicitly teach children not to bullshit. At least I needed to teach mine.
Never mind toddlers, have you ever hired people? A far smaller proportion of professional adults will say “I don’t know” than a lot of people here seem to believe.
No I call judgement a logical process of assessment.
You have an amount of material that speaks of the endeavours in some sport of some "Michael Jordan", the logic in the system decides that if a "Michael Jordan" in context can be construed to be "that" "Michael Jordan" then there will be sound probabilities he is a sportsman; you have very little material about a "John R. Brickabracker", the logic in the system decides that the material is insufficient to take a good guess.
Then I expect your personal fortunes are tied up in hyping the "generative AI are just like people!" meme. Your comment is wholly detached from the reality of using LLMs. I do not expect we'll be able to meet eye-to-eye on the topic.
would you rather the LLM make up something that sounds right when it doesn't know, or would you like it to claim "i don't know" for tasks it actually can figure out? because presumably both happen at some rate, and if it hallucinates an answer i can at least check what that answer is or accept it with a grain of salt.
nobody freaks out when humans make mistakes, but we assume our nascent AIs, being machines, should always function correctly all the time
> would you rather the LLM make up something that sounds right when it doesn't know, or would you like it to claim "i don't know" for tasks it actually can figure out?
And that's part of the problem - you're thinking of it like a hammer when it's not a hammer. It's asking someone at a bar a question. You'll often get an answer - but even if they respond confidently that doesn't make it correct. The problem is people assuming things are fact because "someone at a bar told them." That's not much better than, "it must be true I saw it on TV".
It's a different type of tool - a person has to treat it that way.
Asking a question is very contextual. I don't ask a lawyer about house engineering problems, nor my doctor how to bake a cake. That means if I'm asking someone at a bar, I'm already prepared to deal with the fact that the person is maybe drunk and probably won't know... And more often than not, I won't even ask the question unless there's a dire need, because it's the most inefficient way to get an informed answer.
I wouldn't bat an eye if people were taking code suggestions, then reviewing and editing them to make them correct. But from what I see, it's pretty much a direct push to production if they got it to compile, which is different from correct.
3rd Order Ignorance (3OI)—Lack of Process.
I have 3OI when I don't know a suitably efficient way to find out I don't know that I don't know something. This is lack of process, and it presents me with a major problem: If I have 3OI, I don't know of a way to find out there are things I don't know that I don't know.
— not from an LLM
My process: use llms and see what I can do with them while taking their Output with a grain of salt.
But the issue of the structural fault remains. To state the phenomenon (hallucination) is not "superficial", as the root does not add value in the context.
Symptom: "Response was, 'Use the `solvetheproblem` command'". // Cause: "It has no method to know that there is no `solvetheproblem` command". // Alarm: "It is suggested that it is trying to guess a plausible world through lacking wisdom and data". // Fault: "It should have a database of what seems to be states of facts, and it should have built the ability to predict the world more faithfully to facts".
But this was true before LLMs. People would and still do take any old thing from an internet search and treat it as true. There is a known, difficult-to-remedy failure to properly adjudicate information and source quality, and you can find it discussed in research prior to the computer age. It is a user problem more than a system problem.

In my experience, with the way I interact with LLMs, they are more likely to give me useful output than not, and this is borne out by mainstream, non-edge-case, peer-reviewed academic work. Useful does not necessarily equal 100% correct, just as a Google search does not. I judge and vet all information, whether it comes from an LLM, a search, a book, a paper, or wherever.

We can build a straw person who "always" takes LLM output as true and uses it as-is, but those are the same people who use most information tools poorly, be they internet search, dictionaries, or even looking in their own files for their own work or sent mail (I say this as an IT professional who has seen the worker types from the pre-internet days through now). In any case, we use automobiles despite others misusing them. But only the foolish among us completely take our hands off the wheel for any supposed "self-driving" features. While we must prevent and decry the misuse by fools, we cannot let their ignorance hold us back. Let's let their ignorance help improve the tools, since it helps identify more undesirable scenarios.
My company just broadly adopted AI. It’s not a tech company and usually late to the game when it comes to tech adoption.
I’m counting down the days until some AI hallucination makes its way all the way to the C-suite. People will get way too comfortable with AI and don’t understand just how wrong it can be.
Some assumption will come from AI, no one will check it and it’ll become a basic business input. Then suddenly one day someone smart will say “thats not true” and someone will trace it back to AI. I know it.
I assume at that point in time there will be some general directive on using AI and not assuming it’s correct. And then AI will slowly go out of favor.
People fabricate a lot too. Yesterday I spent far less time fixing issues in the far more complex and larger changes Claude Code managed to churn out than on what the junior developer I worked with needed. Sometimes it's the reverse. But with my time factored in, working with Claude Code is generally more productive for me than working with a junior. The only reason I still work with a junior dev is as an investment into teaching him.
Every developer I've ever worked with has gotten things wrong. Whether you call that hallucinating or not is irrelevant. What matters is the effort it takes to fix.
On the logically practical point I agree with you (what counts in the end, in the specific process you mention, is the gain-versus-loss game), but my point was that if your assistant is structurally delirious, you will have to expect a big chunk of the "loss" to be structural.
> It turns out that, in Claude, refusal to answer is the default behavior
I.e., boxes that incline to different approaches to heuristic will behave differently and offer different value (to be further assessed within a framework of complexity, e.g. "be creative but strict" etc.)
And my direct experience is that I often spend less time directing, reviewing and fixing code written by Claude Code at this point than I do for a junior irrespective of that loss. If anything, Claude Code "knows" my code bases better. The rest, then, to me at least is moot.
Claude is substantially cheaper for me, per reviewed, fixed change committed. More importantly to me, it demands less of my limited time per reviewed, fixed change committed.
Having a junior dev working with me at this point wouldn't be worth it to me if it wasn't for the training aspect: We still need pipelines of people who will learn to use the AI models, and who will learn to do the things it can't do well.
But my point was: it's good that Claude has become a rightful legend in the realm of coding, but before that and regardless, a candidate who told you "that class will have a .SolveAnyProblem() method: I want to believe" presents a handicap. As you said, no assistant has proven to be perfect, but assistants who attempt to mix coding sessions with creative fiction writing raise alarms.
> My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.
The same is true about the internet, and people even used to use these arguments to try to dissuade people from getting their information online (back when Wikipedia was considered a running joke, and journalists mocked blogs). But today it would be considered silly to dissuade someone from using the internet just because the information there is extremely unreliable.
Many programmers will say Stack Overflow is invaluable, but it's also unreliable. The answer is to use it as a tool and a jumping off point to help you solve your problem, not to assume that its authoritative.
The strange thing to me these days is the number of people who will talk about the problems with misinformation coming from LLMs, but then who seem to uncritically believe all sorts of other misinformation they encounter online, in the media, or through friends.
Yes, you need to verify the information you're getting, and this applies to far more than just LLMs.
Shades of grey fallacy. You have way more context clues about the information on the internet than you do with an LLM. In fact, with an LLM you have zero(-ish?).
I can peruse your previous posts to see how truthful you are, I can tell if your post has been down/upvoted, I can read responses to your post to see if you've been called out on anything, etc.
This applies tenfold in real life where over time you get to build comprehensive mental models of other people.
I have decided it must be attached to a sort of superiority complex. These types of people believe they are capable of deciphering fact from fiction but the general population isn’t so LLMs scare them because someone might hear something wrong and believe it. It almost seems delusional. You have to be incredibly self aggrandizing in your mind to think this way. If LLMs were actually causing “a problem” then there would be countless examples of humans making critical mistakes because of bad LLM responses, and that is decidedly not happening. Instead we’re just having fun ghiblifying the last 20 years of the internet.
Regardless of anything else it’s extremely too early to make such claims. We have to wait until people start allowing “AI agents” to make autonomous blackbox decision with minimal supervision since nobody has any clue what’s happening.
Even if we tone down the sci-fi dystopia angle, not that many people really use LLMs in non-superficial ways yet. What I’m most afraid of would be the next generation growing up without the ability to critically synthesize information on their own.
Most people - the vast majority of people - cannot critically synthesize information on their own.
But the implication of what you are saying is that academic rigour is going to be ditched overnight because of LLMs.
That’s a little bit odd. Has the scientific community ever thrown up its collective hands and said “ok, there are easier ways to do things now, we can take the rest of the decade off, phew what a relief!”
> what you are saying is that academic rigour is going to be ditched overnight
Not across all level and certainly not overnight. But a lot of children entering the pipeline might end up having a very different experience than anyone else before LLMs (unless they are very lucky to be in an environment that provides them better opportunities).
> cannot critically synthesize information on their own.
That’s true, but if even fewer people try to do that, or even know where to start, it will get even worse.
> I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm
Happened to me as well. I wanted it to quickly write an algorithm for the standard deviation over a stream of data, which is a textbook algorithm. It got it almost right, but messed up the final formula, and the code gave wrong answers. Weird, considering correct code for that problem exists on Wikipedia.
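For reference, the textbook single-pass version is Welford's online algorithm (the one on Wikipedia's "Algorithms for calculating variance" page). This is my own sketch of it, not the code the model produced:

```python
import math

def streaming_stddev(stream):
    """Welford's online algorithm: one pass over the data, numerically stable."""
    count, mean, m2 = 0, 0.0, 0.0  # m2 accumulates squared deviations from the running mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        return float("nan")
    return math.sqrt(m2 / (count - 1))  # sample standard deviation

# streaming_stddev([2, 4, 4, 4, 5, 5, 7, 9]) ≈ 2.14
```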
I don't understand the point of that share. There are likely thousands of implementations of selection sort on the internet and so being able to recreate one isn't impressive in the slightest.
And all the models are identical in not being able to discern what is real or something it just made up.
No? I mean if they refused that would actually be a reasonably good outcome. The real problem is if they generally can write selection sorts but occasionally go haywire due to additional context and start hallucinating.
Because, to be blunt, I think this is total bullshit if you're using a decent model:
"I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over."
I was part of preparing an offer a few weeks ago. The customer prepared a lot of documents for us, maybe 100 pages in total. The boss insisted on using ChatGPT to summarize this stuff and reading only the summary. I did a longer, slower reading and caught some topics ChatGPT outright dropped. Our offer was based on the summary, and it fell through because we missed those nuances.
But hey, the boss did not have to read as much as before...
I wonder if the exact phrasing has varied from the source, but even then if "consultation partners" is doing the heavy lifting there. If it was something like "useful consultation partners", I can absolutely see value as an extra opinion that is easy to override. "Oh yeah, I hadn't thought about that option - I'll look into it further."
I imagine we're talking about it as an extra resource rather than trusting it as final in a life or death decision.
> I imagine we're talking about it as an extra resource rather than trusting it
> as final in a life or death decision.
I'd like to think so. Trust is also one of those non-concrete terms that have different meanings to different people. I'd like to think that doctors use their own judgement to include the output from their trained models, I just wonder how long it is till they become the default judgement when humans get lazy.
I think that's a fair assessment on trust as a term, and incorporating via personal judgement. If this was any public story, I'd also factor in breathless reporting about new tech.
Black-box decisions I absolutely have a problem with. But an extra resource considered by people with an understanding of risks is fine by me. Like I've said in other comments, I understand what it is and isn't good at, and have a great time using ChatGPT for feedback or planning or extrapolating or brainstorming. I automatically filter out the "Good point! This is a fantastic idea..." response it inevitably starts with...
Because LLMs, with something like a 20% hallucination rate, are more reliable than overworked, tired doctors who can spend only an ounce of their brainpower on the patient they’re currently helping?
In fact, the phenomenon of pseudo-intelligence scares those who were hoping to get tools that limited the original problem, as opposed to potentially boosting it.
The claim seems plausible because it doesn't say there was any formal evaluation, just that some doctors (who may or may not understand how LLMs work) hold an opinion.
> What's worse, people are treating them as authoritative.
So what? People are wrong all the time. What happens when people are wrong? Things go wrong. What happens then? People learn that the way they got their information wasn't robust enough and they'll adapt to be more careful in the future.
This is the way it has always worked. But people are "worried" about LLMs... Because they're new. Don't worry, it's just another tool in the box, people are perfectly capable of being wrong without LLMs.
Being wrong when you are building a grocery management app is one thing, being wrong when building a bridge is another.
For those sensitive use cases, it is imperative we create regulation, like every other technology that came before it, to minimize the inherent risks.
In an unrelated example, I saw someone saying recently they don't like a new version of an LLM because it no longer has "cool" conversations with them, so take that as you will from a psychological perspective.
I have a hard time taking that kind of worry seriously. In ten years, how many bridges will have collapsed because of LLMs? How many people will have died? Meanwhile, how many will have died from fentanyl or cars or air pollution or smoking? Why do people care so much about the hypothetical bad effects of new technology and so little about the things we already know are harmful?
Humans bullshit and hallucinate and claim authority without citation or knowledge. They will believe all manner of things. They frequently misunderstand.
The LLM doesn’t need to be perfect. Just needs to beat a typical human.
LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.
And many, many companies are proposing and implementing uses for LLM's to intentionally obscure that accountability.
If a person makes up something, innocently or maliciously, and someone believes it and ends up getting harmed, that person can have some liability for the harm.
If an LLM hallucinates something that someone believes and ends up getting harmed by, there's no accountability. And it seems that AI companies are pushing for laws and regulations that further protect them from this liability.
These models can be useful tools, but the targets these AI companies are shooting for are going to be actively harmful in an economy that insists you do something productive for the continued right to exist.
This is correct. On top of that, the failure modes of AI systems are unpredictable and incomprehensible. Present-day AI systems can fail on, or be fooled by, inputs in surprising ways that no human would.
1. To make those harmed whole. On this, you have a good point. The desire of AI firms or those using AI to be indemnified from the harms their use of AI causes is a problem as they will harm people. But it isn't relevant to the question of whether LLMs are useful or whether they beat a human.
2. To incentivize the human to behave properly. This is moot with LLMs. There is no laziness or competing incentive for them.
That’s not a positive at all, the complete opposite. It’s not about laziness but being able to somewhat accurately estimate and balance risk/benefit ratio.
The fact that making a wrong decision would have significant costs for you and other people should have a significant influence on decision making.
That reads as "people shouldn't trust what AI tells them", which is in opposition to what companies want to use AI for.
An airline tried to blame its chatbot for inaccurate advice it gave (whether a discount could be claimed after a flight). Tribunal said no, its chatbot was not a separate legal entity.
Yeah. Where I live, we are always reminded that our conversations with insurance provider personnel over phone are recorded and can be referenced while making a claim.
Imagine a chatbot making false promises to prospective customers. Your claim gets denied, you fight it out only to learn their ToS absolves them of "AI hallucinations".
> LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.
On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.
My problem isn't that humans are doing similar things to LLMs, my problem is that humans can understand consequences of bullshitting at the wrong time. LLMs, on the other hand, operate purely on bullshitting. Sometimes they are right, sometimes they are wrong. But what they'll never do or tell you is "how confident am I that this answer is right". They leave the hard work of calling out the bullshit on the human.
There's a level of social trust that exists which LLMs don't follow. I can trust when my doctor says "you have a cold" that I probably have a cold. They've seen it a million times before and they are pretty good at diagnosing that problem. I can also know that doctor is probably bullshitting me if they start giving me advice for my legal problems, because it's unlikely you are going to find a doctor/lawyer.
> Just needs to beat a typical human.
My issue is we can't even measure accurately how good humans are at their jobs. You now want to trust that the metrics and benchmarks used to judge LLMs are actually good measures? So many LLM advocates pretend that you can objectively measure goodness in subjective fields just by writing some unit tests. It's literally the "Oh look, I have an Oracle Java certificate" or "AWS Solutions Architect" method of determining competence.
And so many of these tests aren't being written by experts. Perhaps the coding tests, but the legal tests? Medical tests?
The problem is LLM companies are bullshiting society on how competently they can measure LLM competence.
> On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.
Some humans can, certainly. Humans as a race? Maybe, ish.
Well there are still millions that can. There is a handful of competitive LLMs and their output given the same inputs are near identical in relative terms (compared to humans).
Your second point directly contradicts your first point.
In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims against lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.
As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.
No external help, just conversations with ChatGPT and some Googling.
Obviously LLMs have issues, but if we're now in the "Beginners can program their own custom apps" phase of the cycle, the potential is huge.
> As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.
This is actually an interesting one - I’ve seen a case where some copy/pasted PDF saving code caused hundreds of thousands of subtly corrupted PDFs (invoices, reports, etc.) over the span of years. It was a mistake that would be very easy for an LLM to make, but I sure wouldn’t want to rely on chatgpt to fix all of those PDFs and the production code relying on them.
Well humans are not a monolithic hive mind that all behave exactly the same as an “average” lawyer, doctor etc. that provides very obvious and very significant advantages.
> days to go from a cold start with zero dev experience
>> In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims against lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.
This paragraph makes little sense. A negligence claim is based on a deviation from some reasonable standard, which is essentially a proxy for the level of care/service that most practitioners would apply in a given situation. If doctors were as regularly incompetent as you are trying to argue then the standard for negligence would be lower because the overall standard in the industry would reflect such incompetence. So the existence of negligence claims actually tells us little about how good a doctor is individually or how good doctors are as a group, just that there is a standard that their performance can be measured against.
I think most people would agree with you that medical negligence claims are a huge problem, but I think that most of those people would say the problem is that so many of these claims are frivolous rather than meritorious, resulting in doctors paying more for malpractice insurance than necessary and also resulting in doctors asking for unnecessarily burdensome additional testing with little diagnostic value so that they don’t get sued.
It's fine if it isn't perfect, provided whoever is spitting out answers assumes liability when the robot is wrong. But what people want is for the robot to answer questions and for there to be no liability, when it is well known that the robot can be wildly inaccurate sometimes. They want the illusion of value without the liability of the known deficiencies.
If LLM output is like a magic 8 ball you shake, that is not very valuable unless it is workload management for a human who will validate the fitness of the output.
I never ask a typical human for help with my work, why should that be my benchmark for using an information tool? Afaik, most people do not write about what they don't know, and if one made a habit of it, they would be found and filtered out of authoritative sources of information.
OK, but people are building deterministic software _on top of them_. It's like saying "it's OK, people make mistakes, but let's build infrastructure on some brain in a vat". It's just inherently not at the point where you can make it the foundation of anything but a pet project that helps you slop out code, or whatever visual or textual project you have.
It's one of those "the quantity is so fascinating, let's ignore how we got here in the first place" things.
You’re moving the goalposts. LLMs are masquerading as superb reference tools and as sources of expertise on all things, not as mere “typical humans.” If they were presented accurately as being about as fallible as a typical human, typical humans (users) wouldn’t be nearly as trusting or excited about using them, and they wouldn’t seem nearly as futuristic.
> I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out.
If you're lucky, it figures it out. If you aren't, it makes stuff up in a way that seems almost purposefully calculated to fool you into assuming that it's figured everything out. That's the real problem with LLMs: they fundamentally cannot be trusted because they're just glorified autocomplete; they don't come with any inbuilt sense of when they might be getting things wrong.
I see this complaint a lot, and frankly, it just doesn't matter.
What matters is speeding up how fast I can find information. Not only will LLMs sometimes answer my obscure questions perfectly themselves, but they also help point me to the jargon I need to find that information online. In many areas this has been hugely valuable to me.
Sometimes you do just have to cut your losses. I've given up on asking LLMs for help with Zig, for example. It is just too obscure a language I guess, because the hallucination rate is too high to be useful. But for webdev, Python, matplotlib, or bash help? It is invaluable to me, even though it makes mistakes every now and then.
We're talking about getting work done here, not some purity dance about how you find your information the "right way" by looking in books in libraries or something. Or wait, do you use the internet? How very impure of you. You should know, people post misinformation on there!
> Yeah but if your accountant bullshits when doing your taxes, you can sue them.
What is the point of limiting delegation to such an extreme dichotomy, as opposed to getting more things done?
The vast majority of useful things we delegate, or do for others ourselves, are not as well specified, or as legally liable for any imperfections, as an accountant doing accounting.
Let's try it this way: give me one or two prompts that you personally have had trouble with, in terms of hallucinated output and lack of awareness of potential errors or ambiguity. I have paid accounts on all the major models except Grok, and I often find it interesting to probe the boundaries where good responses give way to bad ones, and to see how they get better (or worse) between generations.
Sounds like your experiences, along with zozbot234's, are different enough from mine that they are worth repeating and understanding. I'll report back with the results I see on the current models.
I am so confused too. I hold these beliefs at the same time, and I don't feel they contradict each other, but apparently for many people some of them do:
- LLMs are a miraculous technology that are capable of tasks far beyond what we believed would be achievable with AI/ML in the near future. Playing with them makes me constantly feel like "this is like sci-fi, this shouldn't be possible with 2025's technology".
- LLMs are fairly clueless at many tasks that are easy enough for humans, and they are nowhere near AGI. It's also unclear whether they scale up towards that goal. They are also worse programmers than people make them out to be. (At least I'm not happy with their results.)
- Achieving AGI doesn't seem impossibly unlikely any more, and doing so is likely to be an existentially disastrous event for humanity, and the worst fodder for my nightmares. (Also in the sense of an existential doomsday scenario, but even just the thought of becoming... irrelevant is depressing.)
Having one of these beliefs makes me the "AI hyper" stereotype, another makes me the "AI naysayer" stereotype and yet another makes me the "AI doomer" stereotype. So I guess I'm all of those!
I guess Sabine's beef with LLMs is that they are hyped as a legit "human-level assistant" kind of thing by the business people, which they clearly aren't yet. Maybe I've just managed to... manage my expectations?
That's on her then for fully believing what marketing and business execs are 'telling her' about LLMs. Does she get upset when she buys a coke around Christmas and her life doesn't become all warm and fuzzy with friendliness and cheer all around?
Seems like she's been given a drill with a flathead bit, and she just complains for months on end that it often fails (she didn't charge the drill) or gives her useless results (she's driving Phillips-head screws). How about figuring out what works and what doesn't, and adjusting your use of the tool accordingly? If she is a painter, don't blame the drill for messing up her painting.
I kinda agree. But she seems smart and knowledgeable. It's kinda disappointing, like... She should know better. I guess it's the Gell-Mann amnesia effect once again.
> but even just the tought of becoming... irrelevant is depressing
In my opinion, there can exist no AI, person, tool, ultra-sentient omniscient being, etc. that would ever render you irrelevant. Your existence, experiences, and perception of reality are all literally irreplaceable, and (again, just my opinion) inherently meaningful. I don't think anyone's value comes from their ability to perform any particular feat to any particular degree of skill. I only say this because I had similar feelings of anxiety when considering the idea of becoming "irrelevant", and I've seen many others say similar things, but I think that fear is largely a product of misunderstanding what makes our lives meaningful.
Back when handwriting recognition was a new thing, I was greatly impressed by how good it was. This was primarily because, being an engineer, I knew how difficult the problem is to solve. 90% recognition seemed really good to me.
When I tried to use the technology, that 90% meant 1 out of every 10 things I wrote was incorrect. If it had been a keyboard, I would have thrown it in the trash. That is where my Palm ended up.
People expect their technology to do things better, not almost as well as a human. Waymo with LIDAR hasn't killed people. Tesla, with cameras only, has done so multiple times. I will ride in a Waymo, never in a Tesla self-driving car.
Anyone who doesn't understand this either isn't required to use the utility it provides or has no idea how to prompt it correctly. My wife is a bookkeeper. There are some tasks that are a pain in the ass without writing some custom code. In her case, we just saved her about 2 hours by asking Claude to do it. It wrote the code, applied the code to a CSV we uploaded, and gave us exactly what we needed in 2 minutes.
>Anyone who doesn't understand this either isn't required to use to utility it provides or has no idea how to prompt it correctly.
Almost every counter-criticism of LLMs boils down to
1. you're holding it wrong
2. Well, I use it $DAYJOB and it works great for me! (And $DAYJOB is software engineering).
I'm glad your wife was able to save 2 hours of work, but forgive me if that doesn't translate to the trillion dollar valuation OpenAI is claiming. It's strange you don't see the inherent irony in your post. Instead of your wife just directly uploading the dataset and a prompt, she first has to prompt it to write code. There are clear limitations and it looks like LLMs are stuck at some sort of wall.
When computers/internet first came about, there were (and still are!) people who would struggle with basic tasks. Without knowing the specific task you are trying to do, it's hard to judge whether it's a problem with the model or with you.
I would also say that prompting isn't as simple as made out to be. It is a skill in itself and requires you to be a good communicator. In fact, I would say there is a reasonable chance that even if we end up with AGI level models, a good chunk of people will not be able to use it effectively because they can't communicate requirements clearly.
So it's a natural language interface, except it can only be useful if we stick to a subset of natural language. Then we're stuck trying to reverse engineer a non documented, non deterministic API. One that will keep changing under whatever you build that uses it. That is a pretty horrid value proposition.
Short of it being able to mind read, you need to communicate with it in some way. No different from the real world where you'll have a harder time getting things done if you don't know how to effectively communicate. I imagine for a lot of popular use-cases, we'll build a simpler UX for people to click and tap before it gets sent to a model.
Boiling down to a couple cases would be more useful if you actually tried to disprove those cases or explain why they're not good enough.
> It's strange you don't see the inherent irony in your post. Instead of your wife just directly uploading the dataset and a prompt, she first has to prompt it to write code. There are clear limitations and it looks like LLMs are stuck at some sort of wall.
What's ironic about that? That's such a tiny imperfection. If that's anything near the biggest flaw then things look amazing. (Not that I think it is, but I'm not here to talk about my opinion, I'm here to talk about your irony claim.)
>Boiling down to a couple cases would be more useful if you actually tried to disprove those cases or explain why they're not good enough.
This reply is 4 comments deep into such cases, and the OP is about a well educated person who describes their difficulties.
>What's ironic about that? That's such a tiny imperfection.
I'd argue it's not tiny - it highlights the limitations of LLMs. LLMs excel at writing basic code but seem to struggle, or are untrustworthy, outside of those tasks.
Imagine generalizing his case: his wife goes to work and tells other bookkeepers "ChatGPClaudeSeek is amazing, it saved 2 hours for me". A coworker, married to a lawyer instead of a software engineer, hears this, tries it for himself, and comes up short. Returning to work the next day and talking about his experience, he is told: "oh, you weren't holding it right; ChatGPClaudeSeek can't do the work for you, you have to ask it to write code that you must then run". Turns out he needs an expert to hold it properly, and from the coworker's point of view he would probably need to hire an expert to help automate the task, which will likely only be marginally less expensive than it was 5 years ago.
From where I stand, things don't look amazing; at least as amazing as the fundraisers have claimed. I agree that LLMs are awesome tools - but I'm evaluating from a point of a potential future where OpenAI is worth a trillion dollars and is replacing every job. You call it a tiny imperfection, but that comes across as myopic to me - large swaths of industries can't effectively use LLMs! How is that tiny?
> Turns out he needs an expert to hold it properly and from the coworker's point of view he would probably need to hire an expert to help automate the task, which will likely only be marginally less expensive than it was 5 years ago.
The LLM wrote the code, then used the code itself, without needing a coder around. So the only negative was needing to ask it specifically to use code, right? In that case, with code being the thing it's good at, "tell the LLM to make and use code" is going to be in the basic tutorials. It doesn't need an expert. It really is about "holding it right" in a non-mocking way, the kind of instructions you expect to go through for using a new tool.
If you can go through a one hour or less training course while only half paying attention, and immediately save two hours on your first use, that's a great return on the time investment.
I'm in the same boat and I think it boils down to this: some people are actually quite passive, while others are more active in their use of technology.
It'd take more time for me to flesh this out than I want to give, but the basic idea is that I am not just sitting there "expecting things". I've been puzzled too at why so many people don't seem to get it or are as frustrated as this lady, and in my observation this is their common element. It just looks very passive to me, the way they seem to use the machines and expect a result to be "given" to them.
PS. It reminds me very strongly of how our parents' generation uses computers. Like the whole way of thinking is different, I cannot even understand why they would act certain ways or be afraid of acting in other ways, it's like they use a different compass or have a very different (and wrong) model in their head of how this thing in front of them works.
It's more like duct-taping a VR headset to your head, calibrating your environment to a bunch of cardboard boxes and walls, and calling it a holodeck. It actually kinda works until you push at it too hard.
It reminds me a lot of when I first started playing No Man's Sky (the video game). Billions of galaxies! Exotic, one of a kind life forms on every planet! Endless possibilities! I poured hundreds of hours into the game! But, despite all the variety and possibilities, the patterns emerge, and every 'new' planet just feels like a first-person fractal viewer. Pretty, sometimes kinda nifty, but eventually very boring and repetitive. The illusion wore off, and I couldn't really enjoy it anymore.
I have played with a LOT of models over the years. They can be neat, interesting, and kinda cool at times, but the patterns of output and mistakes shatter the illusion that I'm talking to anything but a rather expensive auto-complete.
It's definitely a tech that's here to stay, unlike blockchain/NFTs.
But I share the confusion about why people are still bullish on it.
The current valuation exists because the market thinks these models can write code like a senior engineer and will reach AGI, because that's how they're marketed by the LLM providers.
I'm not even certain if they'll be ubiquitous after the venture capital investments are gone and the service needs to actually be priced without losing money, because they're (at least currently) mostly pretty expensive to run.
There seems to be a widely held misconception that company valuations have any basis in the underlying fundamentals of what the companies do. This is not and has not been the case for several years. The US stock market’s darlings are Kardashians, they are valuable for being valuable the way the Kardashians are famous for being famous.
In markets, perception is reality, and the perception is that these companies are innovative. That’s it.
NFTs are still a great tool if you want a bunch of unique tokens as part of a blockchain app. ERC-721 has proven to be a capable standard in a variety of projects. What it isn't, and never will be, is an amazing investment opportunity, or a method to collect cool rare apes and go to yacht parties.
LLMs will settle in and have their place too, just not in the forefront of every investors mind.
I am more than happy to pay for access to LLMs, and models continue to get smaller and cheaper. I would be very surprised if they are not far more widely used in 5 or 10 years time than they are today.
None of that means that the current companies will be profitable or that their valuations are anywhere close to justified though. The future could easily be "Open-weight models are moderately useful for some niches, no-name cloud providers charge slightly higher than the cost of electricity to use them at low profit margins".
Dot-com boom/bubble all over again. A whole bunch of the current leaders will go bust. A new generation of companies will take over, actually focused on specific customer problems and growing out of profitable niches.
The technology is useful, for some people, in some situations. It will get more useful for more people in more situations as it improves.
Current valuations are too high (Gartner hype cycle), after they collapse valuations will be too low (again, hype cycle), then it'll settle down and the real work happens.
The existing tech giants will just hoover up all the niche LLM shops once the valuations deflate somewhat.
There's only a negligible chance any one of these shops stays truly independent, unless propped up by a state-level actor (China/EU).
You might have some consulting/service companies that will promise to tailor big models to your specific needs, but they will be valued accordingly (nowhere near billions).
That's been the 'endgame' of technology improvements since the industrial revolution - there are many industries that mechanized, replaced nearly their entire human workforce, and were never terribly profitable. Consider farming - in developed countries, they really did replace like 98% of the workforce with machines. For every farm that did so, so did all of their competitors, and the increased productivity caused the price of their crops to fall. Cheap food for everyone, but no windfall for farmers.
If machines can easily replace all of your workers, that means other people's machines can also replace your workers.
I think it will go in the opposite direction. Very massive closed-weight models that are truly miraculous and magical. But that would be sad because of all the prompt pre-processing that will prevent you from doing much of what you'd really want to do with such an intelligent machine.
I expect it to eventually be a duopoly like android and iOS. At world scale, it might divide us in a way that politics and nationalities never did. Humans will fall into one of two AI tribes.
Except that we've seen that bigger models don't really scale well in accuracy/intelligence; just look at GPT-4.5. Intelligence scales roughly logarithmically with parameter count; the extra parameters are mostly good for baking in more knowledge so you don't need to RAG everything.
Additionally, you can feed a reasoning model's thinking to a non-reasoning model to improve its output, so I wouldn't be surprised if the common pattern became routing hard queries to reasoning models to solve at a high level, then handing the solution plan to a smaller on-device model for faster inference.
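To sketch what that routing might look like in practice (purely illustrative: the model names, the local endpoint, and the difficulty heuristic are placeholders I'm making up, not anyone's real product):

    # Minimal sketch of "reasoning model writes the plan, small model executes it".
    # Model names, the local endpoint, and the difficulty check are placeholders.
    from openai import OpenAI

    cloud = OpenAI()  # hosted reasoning model (assumed)
    local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # assumed on-device server

    def looks_hard(query: str) -> bool:
        # Toy stand-in for a real difficulty classifier.
        return len(query) > 400 or "prove" in query.lower()

    def answer(query: str) -> str:
        if looks_hard(query):
            # Ask the big model only for a terse solution plan.
            plan = cloud.chat.completions.create(
                model="big-reasoning-model",  # placeholder
                messages=[{"role": "user",
                           "content": f"Write a short numbered plan for solving:\n{query}"}],
            ).choices[0].message.content
            prompt = f"Follow this plan to answer the question.\n\nPlan:\n{plan}\n\nQuestion:\n{query}"
        else:
            prompt = query
        # The small on-device model does the final, cheaper generation.
        return local.chat.completions.create(
            model="small-local-model",  # placeholder
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

The interesting part is that the expensive model only ever emits a short plan, so the bulk of the tokens come out of the cheap one.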
Exactly. If some company ever does come up with an AI that is truly miraculous and magical the very last thing they'll do is let people like you and me play with it at any price. At best, we'd get some locked down and crippled interface to heavily monitored pre-approved/censored output. My guess is that the miracle isn't going to happen.
If I'm wrong though and some digital alchemy finally manages to turn our facebook comments into a super-intelligence we'll only have a few years of an increasingly hellish dystopia before the machines do the smart thing and humanity gets what we deserve.
By the time the capital runs out, I suspect we'll be able to get open models at the level of current frontier and companies will buy a server ready to run it for internal use and reasonable pricing. It will be useful but a complete commodity.
I know folks now who are selling, basically, RAG on Llamas, "in a box". Seems like a bunch of mid-level managers at SMEs are ready to burn budget on hype (to me). Gotta get something deployed during the hype cycle for the quarterly bonus.
I think we can already get open-weight frontier class models today. I've run Deepseek R1 at home, and it's every bit as good as any of the ChatGPT models I can use at work.
Which companies? Google and Microsoft are only up a little over the past several years, and I doubt much of their valuation is coming from LLM hype. Most of the discussions about x.com say it's worth substantially less than some years ago.
I feel like a lot of people mean that OpenAI is burning through venture capital money. It's debatable, but it's a huge jump to go from that to thinking it's going to crash the stock market (OpenAI isn't even publicly traded).
The "Magnificent Seven" stocks (Apple, Amazon, Alphabet, Meta, Microsoft, Nvidia, and Tesla) were collectively up >60% last year and are now 30% of the entire S&P500. They are all heavily invested in AI products.
I just checked the first two, Apple and Amazon, and they're trading 28% and 23% higher than they were 3 years ago. Annualized returns from the S&P 500 have been a little over 10%. Some of that comes from dividends, but Apple and Amazon give out extremely little in the way of dividends.
I'm not going to check all of the companies, but at least looking at the first two, I'm not really seeing anything out of the ordinary.
Currently, Nvidia enjoys a ton of the value capture from the LLM hype. But that's a weird state of affairs and once LLM deployments are less dependent on Nvidia hardware, the value capture will likely move to software companies. Or the LLM hype will reduce to the point that there isn't a ton of value to capture here anymore. This tech may just get commoditized.
Nvidia is trading below its historical PE from pre-AI times at this point. This is just on confirmed revenue, and its profitability keeps increasing. NVIDIA is undervalued right now
Sure, as long as it keeps selling $130B worth of GPUs each year. Which is entirely predicated on the capital investment in Machine Learning attracting revenue streams that are still imaginary at this point.
> None of that means that the current companies will be profitable ... The future could easily be "Open-weight models are moderately useful for some niches, no-name cloud providers charge slightly higher than the cost of electricity to use them at low profit margins".
They just need to stay a bit ahead of the open source releases, which is basically the status quo. The leading AI firms have a lot of accumulated know-how wrt. building new models and training them, that the average "no-name cloud" vendor doesn't.
> They just need to stay a bit ahead of the open source releases, which is basically the status quo
No. OpenAI alone needs approximately $5B of additional cash each and every year.
I think Claude is useful. But if they charged enough money to be cashflow positive, it's not obvious enough people would think so. Let alone enough money to generate returns to their investors.
I don't doubt the first part, but how true is the second?
Is there a shortage of React apps out there that companies are desperate for?
I'm not having a go at you--this is a genuine inquiry.
How many average people are feeling like they're missing some software that they're able to prompt into existence?
I think if anything, the last few years have revealed the opposite: that there's a huge surplus of people in the greater software business relative to demand when money isn't cheap.
I think anyone in the "average" range of skill looking for a job can attest to the difficulties in finding a new/any job.
I think there is plenty of demand for software but not enough economic incentive to fulfill every single demand. Even for the software that is being worked on, we are constantly prioritizing between the features we need or want, deciding whether to write our own vs modifying something open source etc etc. You can also look at stuff like electron apps which is a hack to reduce programmer dev time and time to market for cross platform apps. Ideally, you should be writing highly performant native apps for each.
IMO if coding models get good enough to replace devs, we will see an explosion of software before it flattens out.
We're several years in now, and have lots of A/B comparisons to study across orgs that allowed and prohibited AI assistants. Is one of those groups running away with massive productivity gains?
Because I don't think anybody's noticed that yet. We see layoffs that make sense on their own after a boom, and that cut across AI-friendly and -unfriendly orgs. But we don't seem to see anybody suddenly breaking out with 2x or 5x or 10x productivity gains on actual deliverables. In contrast, the enshittening just seems to be continuing as it has for years, and the pace of new products and features is holding steady. No?
> We're several years in now, and have lots of A/B comparisons to study across orgs that allowed and prohibited AI assistants. Is one of those groups running away with massive productivity gains?
You mean... two years in? Where was the internet 2 years into it?
You’re not making the argument you think you’re making when you ask “Where was the [I]nternet 2 years into it?”
You may be intending to refer to 1971 (about two years after the creation of ARPANet) but really the more accurate comparison would be to 1995 (about two years since ISPs started offering SLIP/PPP dialup to the general public for $50/month or less).
And I think the comparison to 1995, the year of the Netscape IPO and URLs starting to appear in commercials and on packaging for consumer products, is apt: LLMs have been a research technology for a while, it’s their availability to the general public that’s new in the last couple of years. Yet while the scale of hype is comparable, the products aren’t: LLMs still don’t do anything remotely like what their boosters claim, and have done nothing to justify the insane amounts of money being poured into them. With the Internet, however, there were already plenty of retailers starting to make real money doing electronic commerce by 1995, not just by providing infrastructure and related services.
It’s worth really paying attention to Ed Zitron’s arguments here: The numbers in the real world just don’t support the continued amount of investment in LLMs. They’re a perfectly fine area of advanced research but they’re not a product, much less a world-changing one, and they won’t be any time soon due to their inherent limitations.
They're not a product? Isn't Cursor on the leaderboard for fastest to $100m ARR? What about plain usage, or dependency? College kids are using Chrome extensions that direct their searches to ChatGPT by default. I think your comparison to internet uptake is a bit weak, and then you've ended by basically saying too much money is being thrown at this stuff, which is quite disconnected from the start of your argument.
I think it's pretty fair to say that they have close to doubled my productivity as a programmer. My girlfriend uses ChatGPT daily for her work, which is not "tech" at all. It's fair to be skeptical of exactly how far they can go but a claim like this is pretty wild.
Both your and her usage is currently being subsidized by venture capital money.
It remains to be seen how viable this casual usage actually is once this money dries up and you actually need to pay per prompt.
We'll just have to see where the pricing will eventually settle, before that we're all just speculating.
> And I think the comparison to 1995, the year of the Netscape IPO and URLs starting to appear in commercials and on packaging for consumer products, is apt
My grandfather didn’t care about these and you don’t care about LLMs, we get it
> They’re a perfectly fine area of advanced research but they’re not a product
No, it lets good engineers parallelize work. I can be adding a route to the backend while Cline with Sonnet 3.7 adds a button to the frontend. Boilerplate work that would take 20-30 minutes is handled by a coding agent. With Claude writing some of the backend routes with supervision, you've got a very efficient workflow. I do something like this daily in an 80k LOC codebase.
I look forward to good standard integrations to assign a ticket to an agent and let it go through CI and come up for a preview deploy & PR. I think there are lots of smaller issues that could be raised and sorted without much intervention.
Even if the VC-backed companies jacked up their prices, the models that I can run on my own laptop for "free" now are magical compared to the state of the art from 2 years ago. Ubiquity may come from everyone running these on their own hardware.
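For what it's worth, the plumbing for that is already mundane. Tools like Ollama or llama.cpp's server expose an OpenAI-compatible endpoint on localhost, so something like this sketch is all it takes (the model name and port depend entirely on your setup; shown here is Ollama's default):

    # Talking to a locally served open-weight model through an
    # OpenAI-compatible endpoint (Ollama's default port shown).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

    reply = client.chat.completions.create(
        model="llama3",  # whatever model you've pulled locally
        messages=[{"role": "user", "content": "Summarize RAII in two sentences."}],
    )
    print(reply.choices[0].message.content)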
Takes like yours seem crazy to me given the pace of things. We can argue all day about whether people are "too bullish" or about the literal market size of enterprise AI, but truly, absolutely no one knows how good these things will get or what problems they'll overcome in the next 5 years. Saying "I am confused on why people are still bullish" implicitly builds in some huge assumptions about the near future.
Most “AI” companies are simply wrapping the ChatGPT API in some form. You can tell from the job posts.
They aren’t building anything themselves. I find this to be disingenuous at best, and a sign to me of a bubble.
I also think that re-branding Machine Learning as AI is disingenuous.
These technologies of course have their use cases and excel at some things, but this isn’t the ushering in of actual, sapient intelligence, which for the majority of the term’s existence was the de facto agreed standard for “AI”. This technology lacks the actual markers of what is generally accepted as intelligence to begin with.
Remember the quote that IBM thought there would be a total market for maybe 10 or 15 computers in the entire world? They were giant, and expensive, and very limited in application.
A popular myth; it seems to be made up from a far less interesting statement about a single specific model of computer during a 1953 stockholder meeting:
> IBM had developed a paper plan for such a machine and took this paper plan across the country to some 20 concerns that we thought could use such a machine. I would like to tell you that the [IBM 701] machine rents for between $12,000 and $18,000 a month, so it was not the type of thing that could be sold from place to place. But, as a result of our trip, on which we expected to get orders for five machines, we came home with orders for 18.
And that might have been true for a period of time. Advancements made it so they could become smaller and more efficient, and opened up a new market.
LLMs today feel like the former, but are being marketed as the latter. Fully believe that advancements will make them better, but in their current state they're being touted for their possibilities, not their actual capabilities.
I'm for using AI now as the tool they are, but AI is a while off taking senior development jobs. So when I see them being hyped for doing that it just feels like a hype bubble.
Tesla is valued based on the hope that it'll be the first to full self-driving cars. I don't think stock markets need to make sense; you invest in things that, if they pan out, could have huge growth. That's why LLMs are being invested in: the alternatives will make you some ROI, but if LLMs do break through to major disruption in even a handful of large markets, your ROI will be huge.
That's not really true. The entertainment value alone is already causing OpenAI to rate-limit its systems, they're buying up significant amounts of NVIDIA's capacity, and NVIDIA itself is buying up significant portions of the entire world's chip-making capacity. Even if limited to entertainment, the value is immense, apparently.
That's a funny comparison. I can and do use cryptocurrency to pay for web hosting, a VPN, and a few other things, as it's become something of a native currency of the internet. I love LLMs too, but I agree with the parent comment that it's inevitable they'll be replaced with something better, while Bitcoin seems to be sticking around for the long, long term.
In my office most people use chatGPT or a similar LLM every day. I don't know a single coworker that's ever used a cryptocurrency. One guy has bought some crypto stocks.
> The current valuation exists because the market thinks these models can write code like a senior engineer and will reach AGI, because that's how they're marketed by the LLM providers.
No it's not. If it was valued for that it'd be at least 10X what it is now.
While it could be said that LLMs are in the 'peak of inflated expectations', blockchain is definitely still in the 'trough of disillusionment'. Even if there was a way for blockchain to affordably facilitate everyday transactions without destroying the planet and somehow sideloading into government acceptance, it's not clear that there would be anything novel enough to motivate people to use it vs a bank - beyond a complete collapse of the banking system.
Blockchain is here to stay; this is way past the point of "believing in the tech". Recently a wss:// order-book exchange (Hyperliquid) crossed $1T in traded volume, and they started in 2023.
Blockchains are becoming real-time data structures where everyone has admin-level read-only access to everything.
HN doesn't like blockchain. They had the chance to get in very early and now they're salty. I first heard about bitcoin on HN, before Silk Road made headlines.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly?
IMO there are two distinct reasons for this:
1. You've got the Sam Altmans of the world claiming that LLMs are, or nearly are, AGI and that ASI is right around the corner. It's obvious this isn't true even if LLMs are still incredibly powerful and useful. But Sam doing the whole "is it AGI?" dance gets old really quick.
2. LLMs are an existential threat to basically every knowledge worker job on the planet. People's natural response to threats is to become defensive.
I’m not sure how anyone can claim number 2 is true, unless it’s someone who is a programmer doing mostly grunt code and thinks every knowledge worker job is similar.
Just off the top of my head there are plenty of knowledge worker jobs where the knowledge isn’t public, nor really in written form anywhere. There just simply wouldn’t be anything for AI to train on.
> LLMs are an existential threat to basically every knowledge worker job on the planet.
Given the typical problems of LLMs, they are not. You still need them to check the results. It's like FSD: impressive when it works, bad when it doesn't, and scary because you never know beforehand when it's failing.
Yeah, the vast majority of what I spend my time on in a day isn’t something an LLM can help with.
My wife and I both work on and with LLMs and they seem to be, like… 5-10% productivity boosters on a good day. I’m not sure they’re even that good averaged over a year. And they don’t seem to be getting a lot better in ways that change that. Also, they’re that good if you’re good at using them and I can tell you most people really, really are not.
I remember when it was possible to be “good at Google”. It was probably a similar productivity boost. I was good at Google. Most (like, over 95% of) people were not, and didn’t seem to be able to get there, and… also remained entirely employable despite that.
how much time do I need to devote to see anything but garbage?
For reference, I program systems code in C/C++ in a large, proprietary codebase.
My experiences with OpenAI (a year ago or more), and more recently Cursor, Grok-v3, and Deepseek-r1, were all failures. The latter two started out OK and got worse over time.
What I haven't done is ask "AI" to whip up a more standard application. I have some ideas (an ncurses frontend to p4, written in Python, similar to tig, for instance), but haven't gotten around to it.
I want this stuff to work, but so far it hasn't. Now I don't think "programming" a computer in English is a very good idea anyway, but I want a competent AI assistant to pair program with. To the degree that people are getting results, to me it seems they are leveraging very high-level APIs/libraries of code which are not written by AI and solving well-solved, "common" problems (simple games, simple web or phone apps). Sort of like how people gloss over the heavy lifting done by language itself when they praise the results from LLMs in other fields.
I know it eventually will work. I just don't know when. I also get annoyed by the hype of folks who think they can become software engineers because they can talk to an LLM. Most of my job isn't programming. Most of my job is thinking about what the solution should be, talking to other people like me in meetings, understanding what customers really want beyond what they are saying, and tracking what I'm doing in various forms (which is something I really do want AI to help me with).
Vibe coding is aptly named because it's sort of the VB6 of the modern era. Holy cow! I wrote a Windows GUI App!!! It's letting non-programmers and semi-programmers (the "I write glue code in Python to munge data and API ins/outs" crowd) create usable things. Cool! So did spreadsheets. So did HyperCard. Andrej tweeting that he made a phone app was kinda cool but also kinda sad. If this is what the hundreds of billions spent on AI (and my bank account thanks you for that) delivers, then the bubble is going to pop soon.
I think there is a big problem of expectations. People are told that it is great for software development, so they try to use it on big existing software projects, and it sucks.
Usually that's because of context: LLMs are not very good at understanding a very large amount of context, but if you don't give LLMs enough context, they can't magically figure it out on their own. This relegates AI to only really being useful for pretty self-contained examples where the amount of code is small, and you can provide all the context it needs to do its job in a relatively small amount of text (a few thousand words or lines of code at most).
That's why I think LLMs are only useful right now in real-world software development for things like one-off functions, new prototypes, writing small scripts, or automating lots of manual changes you have to do. For example, I love using o3-mini-high to take existing tests that I have and modifying them to make a new test case. Often this involves lots of tiny changes that are annoying to write, and o3-mini-high can make those changes pretty reliably. You just give it a TODO list of changes, and it goes ahead and does it. But I'm not asking these models how to implement a new feature in our codebase.
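Concretely, the test-modification loop I described is something like this tiny sketch (the model name, file path, and TODO items are placeholders for illustration, not a real workflow I'm quoting):

    # Rough sketch of "existing test + TODO list in, modified test out".
    # Model name, path, and TODO items are placeholders.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    existing_test = Path("tests/test_export.py").read_text()

    todo = """
    1. Copy test_export_csv into a new test_export_json.
    2. Change the format argument from "csv" to "json".
    3. Assert the output parses with json.loads instead of the csv reader.
    """

    prompt = (
        "Here is an existing test:\n\n" + existing_test +
        "\n\nApply these changes and return only the new test function:\n" + todo
    )

    new_test = client.chat.completions.create(
        model="o3-mini",  # placeholder for whichever model you prefer
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    print(new_test)  # review before pasting it in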
I think this is why a lot of software developers have a bad view of AI. It's just not very good at the core software development work right now, but it's good enough at prototypes to make people freak out about how software development is going to be replaced.
That's not to mention that often when people first try to use LLMs for coding, they don't give the LLMs enough context or instructions to do well. Sometimes I will spend 2-3 minutes writing a prompt, but I often see other people putting the bare minimum effort into it, and then being surprised when it doesn't work very well.
Serious question as someone who has also tried these things out and not found them very useful in the context of working on a large, complex codebase in not python not javascript: when I imagine the amount of time it would take me to select some test cases, copy and paste them, and then think of a todo list or prompt to generate another case, even assuming the output is perfect, I feel like I’m getting close to the amount of time and mental effort it would take me to just write the test. In a way, having to ask in english for what I want in code for me adds an additional barrier: rather than just doing the thing I have to also think of a promptable description. Is that not a problem? Is it just fast enough that it doesn’t matter? What’s the deal?
I mean, for me personally, I am writing out the English TODO list while I am figuring out exactly what changes I need to make. So the thinking and the prompt-writing happen in the same block of time.
And in terms of time saved, if I am just changing string constants, it’s not going to help much. But if I’m restructuring the test to verify things in a different way, then it is helpful. For example, recently I was writing tests for the JSON output of a program, using jq. In this case, it’s pretty easy to describe the tests I want to make in English, but translating that to jq commands is annoying and a bit tedious. But o3-mini-high can do it for me from the English very well.
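To give a flavour, the sort of test I mean boils down to something like this sketch (the program name, the JSON shape, and the jq filters are all invented for illustration):

    # Illustrative only: the program, the JSON shape, and the jq filters are invented.
    # "jq -e" exits non-zero when the filter's last output is false or null,
    # which turns a jq expression into a pass/fail check.
    import subprocess

    # Run the (hypothetical) program under test and capture its JSON output.
    output = subprocess.run(
        ["mytool", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout

    def jq_check(filter_expr: str) -> None:
        # Raises CalledProcessError if the jq expression isn't truthy.
        subprocess.run(["jq", "-e", filter_expr], input=output,
                       capture_output=True, text=True, check=True)

    jq_check('.items | length == 3')
    jq_check('all(.items[]; has("id") and has("name"))')
    jq_check('.metadata.version == "2.0"')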
Annoying to do myself, but easy to describe, is the sweet spot. It is definitely not universally useful, but when it is useful it can save me 5 minutes of tedium here or there, which is quite helpful. I think for a lot of this, you just have to learn over time what works and what doesn't.
Thanks for the reply, that makes sense. jq syntax is one of those things that I’m just familiar enough with to remember what’s possible but not how to do it, so I could definitely see an LLM being useful for that.
Maybe one of my problems is that I tend to jump into writing simple code or tests without actually having the end point clearly in mind. Often that works out pretty well. When it doesn’t, I’ll take a step back and think things through. But when I’m in the midst of it, it feels like being interrupted almost, to go figure out how to say what I want in English.
Will definitely keep toying with it to see where I can find some utility.
That definitely makes a lot of sense. I think if you are coding in a flow state on something, and LLMs interrupt that, then you should avoid them for those cases.
The areas that I've found LLMs work well for are usually small simple tasks I have to do where I would end up Googling something or looking at docs anyway. LLMs have just replaced many of these types of tasks for me. But I continue to learn new areas where they work well, or exceptions where they fail. And new models make it a moving target too.
> I think if you are coding in a flow state on something, and LLMs interrupt that, then you should avoid them for those cases.
Maybe that's why I don't like them. I'm always in a flow state, or reading docs and/or a lot of code to understand something. By the time I'm typing, I already know what exactly to write, and thanks to my vim-fu (and emacs-fu), getting it done is a breeze. Then comes the edit-compile-run, or edit-test cycle, and by then it's mostly tweaks.
I get why someone would generate boilerplate, but most of the time, I don't want the complete version from the get go. Because later changes are more costly, especially if I'm not fully sure of the design. So I want something minimal that's working, then go work on things that are dependent, then get back when I'm sure of what the interface should be. I like working iteratively which then means small edits (unless refactoring). Not battling with a big dump of code for a whole day to get it working.
Yeah, I think it matters a lot what type of work you do. I have to jump between projects a lot that are all in different languages with a lot of codebases I'm not deeply familiar with. So for me, LLMs are really useful to get up-to-speed on the knowledge I need to work on new projects.
If I've got a clear idea of what I want to write, there's no way I'm touching an LLM. I'm just going to smash out the code for exactly what I need. However, often I don't get that luxury as I'll need to learn different file system APIs, different sets of commands, new jargon, different standard libraries for the new languages, new technologies, etc...
It does an OK job with C#, but it's generally outdated code, e.g. [required] as an annotation rather than as a keyword. Plus it generates some unnecessary constructors occasionally.
Mostly I use it for stupid template stuff, for which it isn't bad. It's not the second coming, but it definitely speeds you up.
> Most of my job is thinking about what the solution should be, talking to other people like me in meetings, understanding what customers really want beyond what they are saying, and tracking what I'm doing in various forms
None of this is particularly unique to software engineering. So if someone can already do this and add the missing component with some future LLM why shouldn’t they think they can become a software engineer?
Yeah I mean, if you can reason about, say, how an automobile engine works, then you can reason about how a modern computer works too, right? If you can discuss the tradeoffs in various engine design parameters then surely you understand Amdahl's law, caching strategies of a modern CPU, execution pipelining, etc... We just need to give those auto guys an LLM and then they can do systems software engineering, right?
Did you catch the sarcasm there?
Are you a manager by any chance? The non-coding parts of my job largely require domain experience. How does an LLM provide you with that?
If your mind has trouble expanding outside the domain of "use this well known tool to do something that has already been done" then no amount of improvements will free you from your belief that chatbots are glorified autocomplete.
I hear you, I'm tired of getting people that don't care to care. It's the people that should know how cool this stuff is and don't - they frustrate me!
You people frustrate me because you don't listen when I say I've tried to use AI to help with my job and it fails horribly in every way. I see that it is useful to you, and that's great, but that doesn't make it useful for everybody... I don't understand why you must have everyone agree with you, and why it "tires" you out to hear other people's contradicting opinions. It feels like a religion.
I mean, it is trivial to show that it can do things literally impossible even 5 years ago. And you don't acknowledge that fact, and that's what drives me crazy.
It's like showing someone from 1980 a modern smart phone and them saying, yeah but it can't read my mind.
I'm not trying to pick on you or anything, but at the top of the thread you said "I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out" and now you're saying "it is trivial to show that it can do things literally impossible even 5 years ago"
This leads me to believe that the issue is not that LLM skeptics refuse to see, but that you are simply unaware of what is possible without them--because that sort of fuzzy search was SOTA for information retrieval and commonplace about 15 years ago (it was one of the early accomplishments of the "big data/data science" era), long before LLMs and deepnets were the new hotness.
This is the problem I have with the current crop of AI tools: what works isn't new and what's new isn't good.
It's also a red flag to hear "it is trivial to show that it can do things literally impossible even 5 years ago" 10 comments deep without anybody doing exactly that...
Are people really this hung up on the term “AI”? Who cares? The fact that this is a shockingly useful piece of technology has nothing to do with what it’s called.
Because the AI term makes people anthropomorphize those tools.
They "hallucinate", they "know", they "think".
They're just the result of matrix calculus, and your own pattern recognition capacities fool you into thinking there is intelligence there. There isn't. They don't hallucinate; their output is wrong.
The worst example of anthropomorphism I've seen was the blog post from a researcher working on adversarial prompting. The tool spewing "help me" words made them think they were hurting a living organism: https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-...
Speaking with AI proponents feels like speaking with cryptocurrencies proponents: the more you learn about how things work, the more you understand they don't and just live in lalaland.
If you lived before the invention of cars, and if when they were invented, marketers all said "these will be able to fly soon" (which of course, we know now wouldn't have been true), you would be underwhelmed? You wouldn't think it was extremely transformative technology?
Where does the premise come from that "artificial intelligence" is supposed to be infallible and superhuman? I think 20th century science fiction did a good job of establishing the premise that artificial intelligence will be sometimes useful but will often fail in bizarre ways that seem interesting to humans. Misunderstanding orders, applying orders literally in a way humans never would, or just flat out going haywire. Asimov's stories, HAL 9000, countless others. These were the popular media tropes about artificial intelligence, and the "real deal" seems to line up with them remarkably well!
When businessmen sell me "artificial intelligence", I come prepared for lots of fuckery.
Have you considered that the problems you encounter in daily life just happen to be more present in the training data than problems other users encounter?
Stitching together well-known web technologies and protocols in well-known patterns, probably a good success rate.
Solving issues in legacy codebases using proprietary technologies and protocols, and non-standard patterns. Probably not such a good success rate.
I think you would benefit from a personalized approach. If you like, send me a Loom or similar of you attempting to complete one software task with AI, that fails as you said, and I'll give you my feedback. Email in profile.
Far from just programming too. They're useful for so many things. I use it for quickly coming up with shell scripts (or even complex piped commands (or if I'm being honest even simple commands since it's easier than skimming the man page)). But I also use it to bounce ideas off of when negotiating contracts. Or to give me a spoiler-free reminder of a plot point I'd forgotten in a book or TV series. Or to explain legal or taxation issues (which I of course verify, but it points me in the right direction). Or any number of other things.
As the parent says, while far from perfect, they're an incredible aid in so many areas. When used well, they help you produce not just faster but also better results. The only trick really is that you need to treat it as a (very knowledgeable but overconfident) collaborator rather than an oracle.
I love using it to boilerplate code for a new API I want to integrate. Much better than having to manually search. In the near future, not knowing how to effectively use AI to enhance productivity will be a disadvantage to potential employers.
I use ChatGPT all the time. I really like it. It's not perfect; the way I've described it (and I doubt that I'm unique in this) is that it's like having a really smart and eager intern at your disposal.
I say "intern" in the sense that its error-prone and kind of inexperienced, but also generally useful. I can ask it to automatically create a lot of the bootstrapping or tedious code that I always dread writing so that I can focus on the fun stuff, which is often the stuff that's pawned off onto interns and junior-level engineers. I think for the most part, when you treat it like that, it lives up to and sometimes even surpasses expectations.
I mean, I can't speak for everyone, but whenever I begin a new project, a large percentage of the first ~3 hours is simply copying and pasting and editing from documentation, either an API I have to call or some bootstrapping code from a framework or just some cruft to make built-in libraries work how you want. I hate doing all that, it actively makes me not want to start a new project. Being able to get ChatGPT to give me stuff that I need to actually get started on my project has made coding a lot more fun for me again. At this point, you can take my LLM from my cold dead hands.
I do think it will keep getting better, but I'm also at a point where even if it never improves I will still keep using it.
I guess it totally depends on the specific use cases.
Just yesterday, I asked Gemini 2.5 for how to test a Laravel artisan command that internally calls another artisan command using the Pest testing framework.
Gemini showed me a method that looked perfect.
But when I tried to execute it, I got an error that the method doesn't exist.
So I asked Gemini about this, and it told me: "Ah, apologies! That's a classic sign that you might be using a version of Laravel older than 5.7."
So I told it that I am actually using the latest release (12.3), to which it responded with: "You're right to question things when you get an error! However, let's clarify the Laravel version first.
As of today, March 27, 2025, the latest stable version of Laravel is Laravel 11, which was released in March 2024. Laravel 12 has not been released yet (it's expected roughly in Q1 2026 based on the usual schedule).
Could you please double-check the exact Laravel version you are using?"
So it did not believe me and I had to convince it first that I was using a real version. This went on for a while, with Gemini not only hallucinating stuff, but also being very persistent and difficult to convince of anything else.
Well, in the end it was still certain that this method should exist, even though it could not provide any evidence for it, and my searching through the internet and the Git history of the related packages also turned up nothing.
So I gave up and tried it with Claude 3.7, which also could not provide any working solution.
In the end, I found an entirely different solution for my problem, but that wasn't based on anything the AIs told me, but just my own thinking and talking to other software developers.
I would not go so far as to call these AIs useless. In software development they can help with simple stuff and boilerplate code, and I found them a lot more helpful in creative work. This is basically the opposite of what I would have expected 5 years ago ^^
But for any important tasks, these LLMs are still far too unreliable.
They often feel like they have a lot of knowledge, but no wisdom.
They don't know how to apply their knowledge well, and they often basically brute-force it with a mix of strange creativity and statistical models that are apparently based on a vast amount of internet content, a big part of which is troll content and satire.
My issue with higher ups pushing LLMs is that what slows me down at work is not having to write the code. I can write the code. If all I had to do was sit down and write code, then I would be incredibly productive because I'm a good programmer.
But instead, my productivity is hampered by issues with org communication, structure, siloed knowledge, lack of documentation, tech debt, and stale repos.
I have for years tried to provide feedback and get leadership to do something about these issues, but they do nothing and instead ask "How have you used AI to improve your productivity?"
I've had the same experience as you, and also rather recently. I had to learn two lessons: first, what I could trust it with (as with Wikipedia when it was new), and second, what makes sense to ask it (as with YouTube when it was new). Once I got that down, it is one fabulous tool to have on my belt, among many other tools.
Thing is, the LLMs that I use are all freeware, and they run on my gaming PC. Two to six tokens per second are alright honestly. I have enough other things to take care of in the meantime. Other tools to work with.
I don't see the billion dollar business. And even if that existed, the means of production would be firmly in the hands of the people, as long as they play video games. So, have we all tripled our salaries?
If we haven't, is that because knowledge work is a limited space that we are competing in, and LLMs are an equalizer because we all have them? Because I was taught that knowledge work was infinite. And the new tool should allow us to create more, and better, and more thoroughly. And that should get us all paid better.
There are basically 3 categories of LLM users (very roughly).
1. People creating or dealing with imprecise information. People doing SEO spam, people dealing with SEO spam, almost all creative arts people, people writing corporatese- or legalese- documents or mails, etc. For these tasks LLMs are god-like.
2. People dealing with precise information and/or facts. For these people LLMs are no better than a parrot.
3. A subset of 2: programmers. Because of the huge amount of stolen training data, plus almost perfect proofing software in the form of compilers, static analyzers, etc., LLMs are more or less usable for this case; the more data was used, the better (JS is the best, as I understand it).
This is why people's reaction is so polarizing. Their results differ.
Depends on your use case. If you don't need them to be the source of truth, then they work great, but if you do, the experience sucks because they're so unreliable.
The problems start when people start hyperventilating because they think since LLMs can generate tests for a function for you, that they'll be replacing engineers soon. They're only suitable for generating output that you can easily verify to be correct.
LLM training is designed to distill a massive corpus of facts, in the form of token sequences, into a much, much smaller bundle of information that encodes (somehow!) the deep structure of those facts minus their particulars.
They’re not search engines, they’re abstract pattern matchers.
I asked Grok to describe a picture I took of me and my kid at Hilton Head island. Based on the plant life it guesses it was a southeast barrier island in Georgia or the Carolinas. It guessed my age and my son’s age. LLMs are completely insane tech for a 90s kid. The first fundamental advance in tech I’ve seen in my lifetime—like what it must’ve been like for people who used a telephone for the first time, or watched a television.
Flat TVs, digital audio players (the iPod), the smartphone, laptops, smartwatches... You have a very selective definition of advances in tech. Compare today (minus LLMs) with any movie depicting life in the nineties and you can see how much tech has developed.
The crisis in programming hasn’t been writing code. It has been developing languages and tools so that we can write less of it that is easy to verify as correct. These tools generate more code. More than you can read and more than you will want to before you get bored and decide to trust the output. It is trained on the most average code available that could be sucked up and ripped off the Internet. It will regurgitate the most subtle errors that humans are not good at finding. It only saves you time if you don’t bother reading and understanding what it outputs.
I don’t want to think about the potential. It may never materialize. And much of what was promised even a few years ago hasn’t come to fruition. It’s always a few years away. Always another funding round.
Instead we have massive amounts of new demand for liquid methane, infrastructure struggling to keep up, billions of gallons of fresh water wasted, all so that rich kids can vibe code their way to easy money and realize three months later they’ve been hacked and they don’t know what to do. The context window has been lost and they ran out of API credits. Welcome to the future.
Yeah, basically this. If I look at how it helps me as an individual, I can totally see how AI can sometimes be useful. If I look at the societal effect of AI, it becomes apparent that AI is just a net negative. Some examples:
- AI is great for disinformation
- AI is great at generating porn of women without their consent.
- Open source projects massively struggle as AI scrapers DDOS them.
- AI uses massive amounts of energy and water; most importantly, the expectation is that energy usage will rise drastically in a world where we need to lower it. If Sam Altman gets his way, we're toast.
- AI makes us intellectually lazy and worse thinkers. We were already learning less and less in school because of our impoverished attention span. This is even worse now with AI.
- AI makes us even more dependent on cloud vendors and third-parties, further creating a fragile supply chain.
Like, AI ostensibly empowers us as individuals, but in reality I think it's a disservice; the ones it truly empowers are the tech giants, as citizens become dumber and ever more dependent on them while those giants amass more and more power.
I can't believe I had to dig this deep to find this comment.
I have yet to see an AI-generated image that was "really cool".
AI images and videos strike me as the coffee pods of the digital world -- we're just absolutely littering the internet with garbage. And as a bonus, it's also environmentally devastating to the real world!
I live nearby a landfill, and go there often to get rid of yard waste, construction materials, etc. The sheer volume of perfectly serviceable stuff people are throwing out in my relatively small city (<200k) is infuriating and depressing. I think if more people visited their local landfills, they might get a better sense for just how much stuff humans consume and dispose. I hope people are noticing just how much more full of trash the internet has become in the last few years. It seems like it, but then I read this thread full of people that are still hyped about it all and I wonder.
This isn't even to mention the generated text... it's all just so inane and I just don't get it. I've tried a few times to ask for relatively simple code and the results have been laughable.
If you ask for obscure things, how do you know you are getting right answers? In my experience, unless the thing you are looking for is easily found with a Google search, LLMs have no hope of getting it correct. For me this is mostly trying to code against an obscure API that isn't well documented, where the little documentation there is is spread across multiple wikis. And the LLMs keep hallucinating functions that simply do not exist.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly
Hossenfelder is a scientist. There's a certain level of rigour that she needs to do her job, which is where current LLMs often fall down. Arguably it's not accelerating her work to have to check every single thing the LLM says.
I use them everyday and they save me so much time and enable me to do things that I wouldn't be able to do otherwise just due to the amount of time it would take.
I think some people just aren't using them correctly or don't understand their limitations.
They are especially helpful for getting me over thought paralysis when starting a new project.
It is an amazing technology and like crypto/blockchain it is nerdy to understand how it works and play with it. I think there are two things at stake here:
1. Some people are just uncomfortable with it because it “could” replace their jobs.
2. Some people are warning that the ecosystem bubble is significantly out of proportion. They are right, and having the whole stock market, companies, and the US economy attached to LLMs is just downright irresponsible.
> Some people are just uncomfortable with it because it “could” replace their jobs.
What jobs are seriously at risk of being totally replaced by LLMs? Even in things like copywriting and natural language translation, which are something of a natural "best case" for the underlying tech, their output is quite subpar compared to the average human's.
They can do fun and interesting stuff, but we keep hearing how they’re going to replace human workers, and too many people in positions of power not only believe they are capable of this, but are taking steps to replace people with LLMs.
But while they are fun to play with, anything that requires a real answer, but can’t be directly and immediately checked, like customer support, scientific research, teaching, legal advice, identifying humans, correctly summarizing text - LLMs are very bad at these things, make up answers, mix contexts inappropriately, and more.
I’m not sure how you can have played with LLMs so much and missed this. I hope you don’t trust what they say about recipes or how to handle legal problems or how to clean things or how to treat disease or any fact-checking whatsoever.
>I’m not sure how you can have played with LLMs so much and missed this. I hope you don’t trust what they say about recipes or how to handle legal problems or how to clean things or how to treat disease or any fact-checking whatsoever.
This is like a GPT-3.5-level criticism. o1-pro is probably better at pure fact retrieval than most PhDs in any given field. I challenge you to try it.
The main issue is that if you ask most LLMs to do something they aren't good at, they don't say "Sorry, I'm not sure how to do that yet"; they say "Sure, absolutely! Here you go:" and proceed to make things up, provide numbers or code that don't actually add up, and make up references and sources.
To someone who doesn't actually check or have the knowledge or experience to check the output, it sounds like they've been given a real, useful answer.
When you tell the LLM that the API it tried to call doesn't exist it says "Oh, you're right, sorry about that! Here's a corrected version that should work!" and of course that one probably doesn't work either.
Yes. One of my early observations about LLMs was that we've now produced software that regularly lies to us. It seems to be a quite intractable problem. Also, since there's no real visibility as to how an LLM reaches a conclusion, there's no way to validate anything.
One takeaway from this is that labelling LLMs as "intelligent" is a total misnomer. They're more like super parrots.
For software development, there's also the problem of how up to date they are. If they could learn on the fly (or be constantly updated) that would help.
They are amazing in some ways, but they've been over-hyped tremendously.
The frustration of using an LLM is greater than the frustration of doing it myself. If it's going to be a tool, it needs to work. Otherwise, it's just a research toy.
I agree, they are an amazing piece of technology, but the investment and hype don't match the reality. This might age like milk, but I don't think OpenAI is going to make it. They burnt $9B to lose $5B in 2024, trying to raise money like their life depends on it... because their life depends on it. From what I can tell, none of the AI model producers are profiting from their model usage at this point, except maybe DeepSeek. This will be a market, and they are useful, astonishingly impressive even, but IMO they are either going to become waaayy more expensive to use, and/or the market and investment will greatly shrink to something sustainable.
> can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out. It talks to me like a person. It generates really cool images. It helps me write code. And just tons of other stuff that astounds me.
It is an impressive technology, but is it US$244.22bn [1] impressive (I know this stat is supposed to account for computer vision as well, but seeing as how LLMs are now a big chunk of that, I think it's a safe assumption)? It's projected to grow to over US$1tr by 2031. That's higher than the market size of commercial aviation at its peak [2]. I'm sorry, but I can't agree that a cool chatbot is roughly as important as flying.
When I saw GPT-3 in action in 2023, I couldn’t believe my eyes. I thought I was being tricked somehow. I’d seen ads for “AI-powered” services and it was always the same unimpressive stuff. Then I saw GPT-3 and within minutes I knew it was completely different. It was the first thing I’d ever seen that felt like AI.
That was only a few years ago. Now I can run something on my 8GB MacBook Air that blows GPT-3 out of the water. It’s just baffling to me when people say LLM’s are useless or unimpressive. I use them constantly and I can still hardly believe they exist!!
LLMs are better at formally verifiable tasks like coding, also coding makes more money on a pure demand basis so development for it gets more resources. In descriptive science fields, it's not great because science fields don't generate a lot of text compared to other things, so the training data is dwarfed by the huge corpus of general internet text. The software industry created the internet and loves using it, so they have published a lot more text in comparison. It can be really bad in bio for example.
Is your testing adversarial or merely anecdotal curiosity? If you don't actively look for it why would you expect to find it?
It's bad technology because it wastes a lot of labor, electricity, and bandwidth in a struggle to achieve what most human beings can with minimal effort. It's also a blatant thief of copyrighted materials.
If you want to like it, guess what, you'll find a way to like it. If you try to view it from another person's use case you might see why they don't like it.
I think it’s more that the people who are boosting LLMs are claiming that perfect super intelligence is right around the corner.
Personally, I look back at how many years ago it was that we were seeing claims that truck drivers were all going to lose their jobs and society would tear itself apart over it within the next few years… and yet here we still are.
You no longer have the console as the primary interface, but a GUI, which 99.9+% of computer users control via a mouse.
You no longer have the screen as the primary interface, but an AUI (audio user interface), which 99.9+% of computer users control via a headset, earbuds, or a microphone and speaker pair.
You mostly speak and listen to other humans, and if you're not reading something they've written, you could have it read to you in order to detach from the screen or paper.
You'll talk with your computer while in the car, while walking, or while sitting in the office.
An LLM makes the computer understand you, and it allows you to understand the computer.
Even if you use smart glasses, you'll mostly talk to the computer generating the displayed results, and it will probably also talk to you, adding information to the displayed results. It's LLMs that enable this.
Just don't focus too much on whether the LLM knows how high Mount Kilimanjaro is; its knowledge of that fact is simply a hint that it can properly handle language.
Still, it's remarkable how useful they are at analyzing things.
LLMs have a bright future ahead, or whatever technology succeeds them.
I don’t even dispute that they might get useful at some point, but when I point the mouse at a button and click it, that usually results in a reliable action.
When I use the LLM (I have so far tried: Claude, ChatGPT, DeepSeek, Mistral) it does something but that something usually isn’t what I want (~the linked tweet).
Prompting, studying and understanding the result and then cleaning up the mess for the low price of an expensive monthly sub leaves me with worse results than if I did the thing myself, usually takes longer and often leaves me with subtle bugs I’m genuinely afraid of growing into exploitable vulnerabilities.
Using it strictly as a rubber duck is neat but also largely pointless.
Since other people are getting something out of the tech, I’ll just assume that the hammer doesn’t fit my nails.
These are the beginnings and it will only improve. The premise is "I genuinely don't understand why some people are still bullish about LLMs", which I just can't share.
When the mouse and GUI was invented nobody needed to say "just wait a couple years for it to improve and you'll understand why it's useful, until then please give me money". The benefits are immediately obvious and improve the experience for practically every computer user.
LLMs are very useful for some (mostly linguistic) tasks, but the areas where they're actually reliable enough to provide more value than just doing it yourself are narrow. But companies really need this tech to be profitable and so they try to make people use it for as many things as possible and shove it in everyone's face[0] in hopes that someone finds a use-case where the benefits are indeed immediately obvious and revolutionary.
[0] For example my dad's new Android phone by default opens a Gemini AI assistant when you hold the power button and it took me minutes of googling to figure out how to make it turn off the damn thing. Whoever at Google thought that this would make people like AI more is in the wrong profession.
It's like a mouse that some variable proportion of the time pretends it's moved the cursor and clicked a button, but actually it hasn't and you have to put a lot of work in to find out whether it did or didn't do what you expected.
It used to be annoying enough just having to clean the trackball, but at least you knew when it wasn't working.
This is true. But it needs to be more than a toy if it is to be economically viable.
So far the industrial applications haven't been that promising. Code writing and documentation are probably the strongest candidates, but even there it's not like it can replace a human or even substantially increase their productivity.
I'm completely with you. The technology is absolutely fascinating in its own right.
That said, I do experience frustrations:
- Getting enraged when it messes up perfectly good code it wrote just 10 minutes ago
- Constantly reminding it we're NOT using jest to write tests
- Discovering it's created duplicate utilities in different folders
There's definitely a lot of hand-holding required, and I've encountered limitations I initially overlooked in my optimism.
But here's what makes it worthwhile: LLMs have significantly eased my imposter syndrome when it comes to coding. I feel much more confident tackling tasks that would have filled me with dread a year ago.
I honestly don't understand how everyone isn't completely blown away by how cool this technology is. I haven't felt this level of excitement about a new technology since I discovered I could build my own Flash movies.
It depends. For small tasks like summarization or self-contained code snippets, it’s really good—like figuring out how to inspect a binary executable on Linux, or designing a ranking algorithm for different search patterns. If you only want average performance or don’t care much about the details, it can produce reasonable results without much oversight.
But for larger tasks—say, around 2,000 lines of code—it often fails in a lot of small ways. It tends to generate a lot of dead code after multiple iterations, and might repeatedly fail on issues you thought were easy to fix. Mentally, it can get exhausting, and you might end up rewriting most of it yourself. I think people are just tired of how much we expect LLMs to deliver, only for them to fail us in unexpected ways. The LLM is good, but we really need to push to understand its limitations.
I think its perception of usefulness depends on how often you ask/google questions. If you are constantly wondering about X thing, LLMs are amazing - especially compared to previous alternatives like googling or asking on Reddit.
If you don’t constantly look for information, they might be less useful.
I'm a senior engineer with 20 years of experience and mostly find all of the AI bs of the last couple of years to be occasionally helpful for general stuff but absolutely incompetent when I need help with mildly complicated tasks.
I did have a eureka moment the other day with deepseek and a very obscure bug I was trying to tackle. One API query was having a very weird, unrelated side effect. I loaded up cursor with a very extensive prompt and it actually figured out the call path I hadn't been able to track down.
Today, I had a very simple task that eventually took me only half an hour to track down manually. But I started with cursor using very similar context as in the first example. It just kept repeatedly dreaming up non-existent files in the PR and making suggestions to fix code that doesn't exist.
So what's the worth to my company of my very expensive time? Should I spend 10,20,50 percent of my time trying to get answers from a chatbot, or should I just use my 20 years of experience to get the job done?
I’ve been playing with Gemini 2.5 Pro, throwing all kinds of problems at it that will help me with personal productivity, and it’s mostly one-shotting them. I’m still in disbelief tbh.
A lot of people who don’t understand how to use LLM effectively will be at an economic disadvantage.
Can you give some examples? Do you mean things like "How do I control my crippling anxiety", things like "What highways would be best to take to Chicago", things like "Write me a Python library to parse the file format in this hex dump", or things like "What should I make for dinner"?
As a 50+ nerd, for decades I carried the idea that we could just build a sufficiently large neural net, throw some data at it, and have it somehow be usefully intelligent. So it's kind of showing strong signs of something I've been waiting for.
In the 70's I read in some science book for kids about how one day we will likely be able to use light emitting diodes for illumination instead of light bulbs, and this "cold light" will save us lots of energy. I waited that one out too; it turned out to be true.
Same as reading books, Internet, Wikipedia, working towards/keeping your health and fitness, etc...
The quote about books being a mirror reflecting genius or idiocy seems to apply.
I see LLMs as a kind of hyper-keyboard: speeding up typing AND structuring content, completing thoughts, and inspiring ideas.
Unlike a regular keyboard, an LLM transforms input contextually. One no longer merely types but orchestrates concepts and modulates language, almost like music.
Yet mastery is key. Just as a pianist turns keystrokes into a symphony through skill, a true virtuoso wields LLMs not as a crutch but as an amplifier of thought.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly
More like we note the frequency with which these tools produce shallow bordering on useless responses, note the frequency with which they produce outright bullshit, and conclude their output should not be taken seriously. This smells like the fervor around ELIZA, but with several multinational marketing campaigns behind it pushing.
I’m reminded of how I always think current cutting edge good examples of CG in movies looks so real and then, consistently, when I watch it again in 10 years it always looks distractingly shitty.
I honestly believe the GP comment demonstrates a level of gullibility that AI hypesters are exploiting.
Generative LLMs are text[1] generators, statistical machines extruding plausible text.
To the extent that a human believes it to be credible, the output exhibits all the hallmarks of a confidence game. Once you know the trick, after Toto pulls back the curtain[2], it's not all that impressive.
1. I'm aware that LLMs can generate images and video as well. The point applies.
Perhaps you have already paid off your mortgage and saved up a million dollars for retirement? And you're not threatened by dismissal or salary reduction because supposedly "AI will replace everyone."
By the way, you don't need to be a 50+ year old nerd. Nerds are a special culture-pen where smart straight-A students from schools are placed so they can work, increase stakeholder revenues, and not even accidentally be able to do anything truly worthwhile that could redistribute wealth in society.
The speed at which anything progresses is impressive if you're not paying attention while other people toil away on it for decades, until one day you finally look up and say, "Wow, the speed at which this thing progressed is insane!"
I remember seeing an AI lab in the late 1980's and thinking "that's never going to work" but here we are, 40 years later. It's finally working.
Yeah, like I. I. Rabi said in regard to people no longer being amazed by the achievements of physics, "What more do you want, mermaids?"
Anyone who remembers further back than a decade or so remembers when the height of AI research was chess programs that could beat grandmasters. Yes, LLMs aren't C3PO or the like, but they are certainly more like that than anything we could imagine just a few years ago.
I'm glad I'm not the only person in awe of LLMs. It feels like it came straight out of a science fiction novel. What does it take to impress people nowadays?
I feel like if teleportation was invented tomorrow, people would complain that it can't transport large objects so it's useless.
“The growth of the Internet will slow drastically, as the flaw in ‘Metcalfe’s law' becomes apparent: most people have nothing to say to each other! By 2005, it will become clear that the Internet’s impact on the economy has been no greater than the fax machine’s”
Yeah the amount of "piffle work" that LLMs save me is astounding. Sure, I can look up fifty different numbers and copy them into excel. Or I can just tell an LLM "make a chart comparing measurements XYZ across devices ABC" and I'm looking at the info right there.
Because we're being told it is a perfect super intelligence, that it is going to replace senior engineers. The hype cycle is real, and worse than blockchain ever was. I'm sure LLMs will be able to code a full enterprise app about the same time moon coin replaces $USD.
Probably because you don't have the same use-case as them... doing "code" is an "easy" use-case, but pondering on a humanities subject is much harder... you cannot "learn the structure" of humanities, you have to know the facts... and LLMs are bad at that
I often ask "So you say LLMs are worthless because you can't blindly trust the first thing they say? Do you blindly trust the first google search result? Do you trust every single thing your family members tell you?" It reminds me of my high school teachers saying Wikipedia can't be trusted.
I wholeheartedly agree with you and it’s funny reading the replies to your comment.
Basically people just doubling down on everything you just described. I can’t quite put a finger on it, but it has a tinge of insecurity or something like that; I hope that's not the case and I'm just misinterpreting.
Indeed, it is the stuff of science fiction, and then you get an "akshually, it's just statistics" comment. I feel people projecting their fears, because deep down, they're simply afraid.
I like LLMs for what they are. Classifiers. I don’t trust them as search engines because of hallucinations. I use them to get a bearing on a subject but then I’ll turn to Google to do the real research.
Because the marketers oversold it. That is why you are seeing a pushback. I also outright rejected them because 1) they were sold and marketed as end all be all replacements for human thought, and 2) they promised to replace only the parts of my job that I enjoy. Billboards were up in San Francisco telling my "bosses" that I was soon to be replaced, and the loudest and earliest voices told me that the craft I love is dead. Imagine Nascar drivers excitedly discussing how cool it was they wouldn't have to turn left anymore - made me wonder why everyone else was here.
It was, more or less, the same narrative arc as Bitcoin, and was (is) headed for a crash.
That said, I've spent a few weeks with augment, and it is revelatory, certainly. All the marketing - aimed at a suite I have no interest in - managed to convince me it was something it wasn't. It isn't a replacement, any more than a power drill is a replacement for a carpenter.
What it is, is very helpful. "The world's most fully functioning scaffolding script", an upgrade from copilot's "the world's most fully functioning tab-completer". I appreciate its usefulness as a force multiplier, but I am already finding corners and places where I'd just prefer to do it myself. And this is before we get into the craft of it all - I am not excited by the pitch "worse code, faster", but the utility is undeniable in this capitalistic hell planet, and I'm not a huge fan of writing SQL queries anyway, so here we are!
It's like computer graphics and VR: Amazing advances over the years, very impressive, fun, cool, and by no means a temporary fad...
... But I do not believe we're on the cusp of a Lawnmower-Man future where someone's Metaverse eats all the retail-conference-halls and movie-theaters and retail-stores across the entire globe in an unbridled orgy of mind-shattering investor returns.
Similarly, LLMs are neat and have some sane uses, but the fervor about how we're about to invent the Omnimind and usher in the singularity and take over the (economic) world? Nah.
Exploring topics in a shallow fashion is fine with LLMs; doing anything deep is just too unreliable due to hallucination. All the models I've talked to desperately want to give a positive answer, and thus will often just lie.
Today's models are far from autonomous thinking machines; believing otherwise is a cognitive bias among the masses. A model is just a giant calculator: it predicts "the most probable next word" from a sea of all combinations of next words.
What you're impressed with is 40% human skill in creating an LLM, 0.5% value created by the model, and 59.5% the skills of all the people it ate and whose livelihoods it is now trying to destroy.
I don't see it as a bigger leap than the internet itself. I recall needing books on my desk or a road trip to the local bookshop to find out coding answers. Stack Overflow beats AI most days, but the LLMs are another nice tool.
As others have pointed out already, the hype about writing code like a senior engineer, or in general acting as a competent assistant, is what created the expectation in the first place. They keep over-promising but under-delivering. Who is the guy talking about AGI most of the time? Could it be the top executive of one of the largest gen AI companies, do you think? I won't deny it occasionally has a certain 'Star Trek computer' flair to it, but most of the time it feels like having a heavily degraded version of Rain Man: he may count your cards perfectly one moment, then get stuck trying to untie his shoes. I stopped counting how many times it produced just outright wrong outputs, to the point of suggesting literally the opposite of what one is asking of it. I would not mind it so much if they were being advertised for what they are, not for what they could potentially be if only another half a trillion dollars were invested in data centers. It is not going to happen with this technology; the issue is structural, not resource-related.
I go back and forth. I share your amazement. I used Gemini Deep Research the other day and was blown away. It claimed to go read 20 websites; it showed its "thinking" and steps, and its conclusions at each step. Then it wrote a large summary (several pages).
On the other hand, I saw github recently added Copilot as a code reviewer. For fun I let it review my latest pull request. I hated its suggestions but could imagine a not too distant future where I'm required by upper management to satisfy the LLM before I'm allowed to commit. Similarly, I've asked ChatGPT questions and it's been programmed to only give answers that Silicon Valley workers have declared "correct".
The thing I always find frustrating about the naysayers is that they seem to think how it works today is the end of it. For instance, I recently listened to an episode of EconTalk interviewing someone on AI and education. She lives in the UK and used Tesla FSD as an example of how bad AI is. Yet I live in California and see Waymo mostly working today and lots of people using it. I believe she wouldn't have used the Tesla FSD example, and would possibly have changed her world view at least a little, if she'd updated on seeing self-driving work.
> And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
Except this isn't true. The code quality varies dramatically depending on what you're doing, the length of the chat/context, etc. It's an incredible productivity booster, but even earlier today, I wasted time debugging hallucinated code because the LLM mixed up methods in a library.
The problem isn't so much that it's not an amazing technology, it's how it's being sold. The people who stand to benefit are speaking as though they've invented a god and are scaring the crap out of people making them think everyone will be techno-serfs in a few years. That's incredibly careless, especially when as a technical person, you understand how the underlying system works and know, definitively, that these things aren't "intelligent" the way they're being sold.
Like the startups of the 2010s, everyone is rushing, lying, and huffing hopium deluding themselves that we're minutes away from the singularity.
Really? I just get garbage. Both Claude and CoPilot kept insisting that it was ok to use react hooks outside of function components. There have been many other situations where it gave me some code and even after refining the prompt it just gave me wrong or non working code. I’m not expecting perfection, but at least don’t waste my time with hallucinations or just flat out stuff that doesn’t work.
You forget the large group of people who proudly declare they're inventing AGI and that they can make everyone lose their jobs and starve. The complaints are aimed at them, not at you.
Keep in mind it understands nothing. The notion that LLMs understand anything is fundamentally flawed, as they do not demonstrate any markers of understanding.
The fact that you don't know what "Markov chain" means and get angry at others over it pisses me off.
Both are Markov chains; that you used to erroneously think a Markov chain is a way to make a chatbot rather than a general mathematical process is on you, not them.
Not one of them has managed to generate a successful promise-based implementation of reCAPTCHA v2 in JavaScript from scratch (https://developers.google.com/recaptcha/docs/loading), even though there are a million-plus references for this.
I mean. How would you feel if you coded a menu in Python with certain choices but when you used it the choices were never the same or in the same order, sometimes there were fake choices, sometimes they are improperly labelled and sometimes the menu just completely fails to open. And you as a coder and you as a user have absolutely no control over any of those issues. Then, when you go online to complain people say useful stuff like "Isn't it amazing that it does anything at all!? Give us a break, we're working on it bro."
That's how I see LLMs and the hype surrounding them.
a lot of it is just plain denial. a certain subgenre of person will forever attack everything AI does because they feel threatened by it and a certain other subgenre of person will just copy this behaviour and parrot opinions for upvotes/likes/retweets.
For me, LLMs are a bit like being shown a talking dog with the education and knowledge of a first grader: a talking dog is amazing in itself, and a truly impressive technical feat; that said, you wouldn't make the dog file your taxes or represent you in court.
To quote Joel Spolsky, "When you’re working on a really, really good team with great programmers, everybody else’s code, frankly, is bug-infested garbage, and nobody else knows how to ship on time.", and that's the state we end up if we believe in the hype and use LLMs willy-nilly.
That's why people are annoyed: not because LLMs cannot code like a senior engineer, but because a lot of content marketing and company valuation depends on making people believe that's the case.
I blame overpromised expectations from startups and public companies, screaming about AGI and superintelligence.
Truly amazing technology which is very good at generating and correcting text is marketed as a senior developer, a talented artist, and a black box that has the solution to all your problems. This impression shatters on the first blatant mistake, e.g. counting elephant legs: https://news.ycombinator.com/item?id=38766512
For me, I think they're valuable but also overhyped. They're not at the point they're replacing entire dev teams like some articles point out. In addition, they are amazingly accurate sometimes and amazingly misleading other times. I've noticed some ardent advocates ignore the latter.
It's incredibly frustrating when people think they're a miracle tool and blindly copy/paste output without doing any kind of verification. This is especially frustrating when someone who's supposed to be a professional in the field is doing it (copy-pasting non-working AI-generated code and putting it up for review).
That said, on one hand they multiply productivity and useful information; on the other hand, they kill productivity and spread misinformation. So I still see them as useful, but not a miracle.
I'll keep bringing up this example whenever people dismiss LLMs.
I can ask Claude the most inane programming question and get an answer. If I were to do that on StackOverflow, I'd get downvoted, rude comments, and my question closed for being off-topic. I don't have to be super knowledgeable about the thing I'm asking about with Claude (or any LLM for that matter).
Even if you ignore the rudeness and elitism of power-users of certain platforms, there's no more waiting for someone to respond to your esoteric questions. Even if the LLM spews bullshit, you can ask it clarifying questions or rephrase until you see something that makes sense.
I love LLMs, I don't care what people say. Even when I'm just spitballing ideas[1], the output is great.
It's the classic HN-like anti-anything bubble we see with Javascript frameworks. Hundreds of thousands of people are productive with them and enjoy them. They created entire industries and job fields. The same is happening with LLMs, but the usual counter-culture dev crowd is denying it while it's happening right before their eyes. I too use LLMs every day. I never click a link it gives me and find that it doesn't exist. When I want to take my mind off of things, I just talk with GPT.
You're being disingenuous. The tweet was talking about asserting the existence of fake articles, claiming that a paper was written in one year while summarizing a paper that explicitly says it was written in another, and severe hallucinations. Nowhere does she even imply that she's looking for superintelligence.
What I find interesting is that my experience has been 100% the opposite. I’ve been using ChatGPT, Claude, and Gemini for almost a year (well only the ChatGPT for a year since the rest are more recent.) I’ve been using them to help build circuits and write code. They are almost always wrong with circuit design, and create code that doesn’t work north of 80% of the time. My patience has dropped off to the point where I only experiment with LLM a few times a week because they are so bad. Yes it is miraculous that we can have a conversation, but it means nothing if the output is always wrong.
But I will admit the dora muckbang feet shit is fucking insane. And that just flat out scares the pants off me.
>They are almost always wrong with circuit design, and create code that doesn’t work north of 80% of the time.
Sorry but this is a total skill issue lol. 80% code failure rate is just total nonsense. I don't think 1% of the code I've gotten from LLMs has failed to execute correctly.
LLMs can't be trusted. They are like an overconfident idiot who is pretending quite impressively, but if you check the result there's just a bit too much bullshit in it. So there's practically zero gain in using LLMs except WHEN you actually need a text that's nice, eloquent bullshit.
Almost every time I've tried using LLMs I've fallen into the pattern of calling out, correcting, and arguing with the LLM, which is of course silly in itself, because they don't learn; they don't really "get it" when they are wrong. Unlike arguing with a human, there's no benefit to it.
This is the place where tech shiny meets actual use cases, and users aren’t really good at articulating their problems.
Its also a slow burn issue - you have to use it for a while for what is obvious to users, to become obvious to people who are tech first.
The primary issue is the hype and forecasted capabilities vs actual use cases. People want something they can trust as much as an authority, not as much as a consultant.
If I were to put it in a single sentence? These are primarily narrative tools, being sold as factual /scientific tools.
When this is pointed out, the conversation often shifts to “well people aren’t that great either”. This takes us back to how these tools are positioned and sold. They are being touted as replacements to people in the future. When this claim is pressed, we get to the start of this conversation.
Frankly, people on HN aren’t pessimistic enough about what is coming down the pipe. I’ve started looking at how to work in 0 Truth scenarios, not even 0 trust. This is a view held by everyone I have spoken to in fraud, misinformation, online safety.
There’s a recent paper which showed that GAI tools improved the profitability of Phishing attempts by something like 50x in some categories, and made previously loss making (in $/hour terms) targets, profitable. Schneier was one of the authors.
A few days ago I found out someone I know who works in finance, had been deepfaked and their voice/image used to hawk stock tips. People were coming to their office to sue them.
I love tech, but this is the dystopia part of cyberpunk being built. These are narrative tools, good enough to make people think they are experts.
The thing LLMs are really really good at, is sounding authoritative.
If you ask it random things the output looks amazing, yes. At least at first glance. That's what they do. It's indeed magical, a true marvel that should make you go: Woooow, this is amazing tech: Coming across as convincing, even if based on hallucinations, is in itself a neat trick!
But is it actually useful? The things they come up with are untrustworthy and on the whole far less good than previously available systems. In many ways, insidiously worse: It's much harder to identify bad information than it was before.
It's almost like we designed a system to pass Turing tests with flying colours but forgot that usefulness is what we actually wanted, not authoritative, human-sounding bullshit.
I don't think the LLM naysayers are 'unimpressed', or that they demand perfection. I think they are trying to make statements aimed at balancing things:
Both the LLMs themselves, and the humans parroting the hype, are severely overstating the quality of what such systems produce. Hence, and this is a natural phenomenon you can observe in all walks of life, the more skeptical folks tend to swing the pendulum the other way, and thus it may come across to you as them being overly skeptical instead.
I totally agree, and this community is far from the worst. In trans communities there's incredible hostility towards LLMs - even local ones. "You're ripping off artists", "A pissing contest for tech bros", etc.
I'm trans, and I don't disagree that this technology has aspects that are problematic. But for me at least, LLMs have been a massive equalizer in the context of a highly contentious divorce where the reality is that my lawyer will not move a finger to defend me. And he's lawyer #5 - the others were some combination of worse, less empathetic, and more expensive. I have to follow up a query several times to get a minimally helpful answer - it feels like constant friction.
ChatGPT was a total game-changer for me. I told it my ex was using our children to create pressure - feeding it snippets of chat transcripts. ChatGPT suggested this might be indicative of coercive control abuse. It sounded very relevant (my ex even admitted one time, in a rare candid moment, that she feels a need to control everyone around her), so I googled the term - essentially all the components were there except physical violence (with two notable exceptions).
Once I figured that out, I asked it to tell me about laws related to controlling relationships - and it suggested laws directly addressing it (in the UK and Australia), plus the closest laws in Germany (Nötigung, Nachstellung, violations of dignity, etc., translating them to English - my best language). Once you name specific laws broken and provide a rationale for why there's a Tatbestand (i.e. the criterion for a violation is fulfilled), your lawyer has no option but to take you more seriously. Otherwise he could face a malpractice suit.
Sadly, even after naming specific law violations and pointing to email and chat evidence, my lawyer persists in dragging his feet - so much so that the last legal letter he sent wasn't drafted by him - it was ChatGPT. I told my lawyer: read, correct, and send to X. All he did was to delete a paragraph and alter one or two words. And the letter worked.
Without ChatGPT, I would be even more helpless and screwed than I am. It's far from clear I will get justice in a German court, but at least ChatGPT gives me hope, a legal strategy. Lastly - and this is a godsend for a victim of coercive control - it doesn't degrade you. Lawyers do. It completely changed the dynamics of my divorce (4 years, still no end in sight; I lost my custody rights, then visitation rights, was subjected to confrontational and gaslighting tactics by around a dozen social workers - my ex is a social worker - and then I literally lost my hair: telogen effluvium, tinea capitis, alopecia areata... if it's stress-related, I've had it), and it gave me confidence when confronting my father and brother about their family violence.
It's been the ONLY reliable help, frankly, so much so I'm crying as I write this. For minorities that face discrimination, ChatGPT is literally a lifeline - and that's more true the more vulnerable you are.
I agree. I recently asked if a certain GPU would fit in a certain computer... And it understood that "fit" could mean physically inside but could also mean that the interface is compatible, and answered both.
It did. It mentioned PCIe connectors, what connects to what, and said this computer has a motherboard with such and such PCIe, the card needs such and such, so it's compatible. Regarding physical size, it said that it depends on the size of the case (implying that it understood that the size of the card is known but the size of the computer isn't known to it).
It's quite insulting that you just assume I don't know how to read specs. You're either assuming based on nothing, or you're inferring from my comment in which case I worry for your reading comprehension. At no point did I say I didn't know how to find the answer or indeed that I didn't know the answer.
TBH, they produce trash results for almost any question I might want to ask them. This is consistently the case. I must use them differently than other people.
LLMs produce midwit answers. If you are an expert in your domain, the results are kind of what you would expect for someone who isn’t an expert. That is occasionally useful but if I wanted a mediocre solution in software I’d use the average library. No LLM I have ever used has delivered an expert answer in software. And that is where all the value is.
I worked in AI for a long time, I like the idea. But LLMs are seemingly incapable of replacing anything of value currently.
The elephant in the room is that there is no training data for the valuable skills. If you have to rely on training data to be useful, LLMs will be of limited use.
Here’s when we can start getting excited about LLMs: when they start making new and valid scientific discoveries that can radically change our world.
When an AI can say “Here’s how you make better, smaller, more powerful batteries, follow these plans”, then we will have a reason to worship AI.
When AI can bring us wonders like room-temperature superconductors, fast interstellar travel, anti-gravity tech, and solutions to world hunger and energy consumption, then it will have fulfilled the promise of what AI could do for humanity.
Until then, LLMs are just fancy search and natural language processors. Puppets with strings. It’s about as impressive as Google was when it first came out.
My experience (almost exclusively Claude), has just been so different that I don't know what to say. Some of the examples are the kinds of things I explicitly wouldn't expect LLMs to be particularly good at so I wouldn't use them for, and others, she says that it just doesn't work for her, and that experience is just so different than mine that I don't know how to respond.
I think that there are two kinds of people who use AI: people who are looking for the ways in which AIs fail (of which there are still many) and people who are looking for the ways in which AIs succeed (of which there are also many).
A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air.
Not every use case is like this, but there are many.
-edit- Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
> Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air. Not every use case is like this, but there are many.
The problem is that I feel I am constantly being bombarded by people bullish on AI saying "look how great this is" but when I try to do the exact same things they are doing, it doesn't work very well for me
Of course I am skeptical of positive claims as a result.
I don't know what you are doing or why it's failed. Maybe my primary use cases really are in the top whatever percentile for AI usefulness, but it doesn't feel like it. All I know is that frontier models have already been good enough for more than a year to increase my productivity by a fair bit.
Your use case is in fact in the top whatever percentile for AI usefulness. Short simple scripting that won't have to be relied on due to never being widely deployed. No large codebase it has to comb through, no need for thorough maintenance and update management, no need for efficient (and potentially rare) solutions.
The only use case that would beat yours is the type of office worker that cannot write professional sounding emails but has to send them out regularly manually.
I fully believe it's far better at the kind of coding/scripting that I do than the kind that real SWEs do. If for no other reason than the coding itself that I do is far far simpler and easier, so of course it's going to do better at it. However, I don't really believe that coding is the only use case. I think that there are a whole universe of other use cases that probably also get a lot of value from LLMs.
I think that HN has a lot of people who are working on large software projects that are incredibly complex and have a huge numbers of interdependencies etc., and LLMs aren't quite to the point that they can very usefully contribute to that except around the edges.
But I don't think that generalizing from that failure is very useful either. Most things humans do aren't that hard. There is a reason that SWE is one of the best paid jobs in the country.
Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.
Real programming is on a totally different scale than what you're describing.
I think that's true for most jobs. Superficially, an AI looks like it can do a good job.
But LLMs:
1. Hallucinate all the time. If they were human we'd call them compulsive liars
2. They are consistently inconsistent, so are useless for automation
3. Are only good at anything they can copy from their data set. They can't create, only regurgitate other people's work
4. AI influencing hasn't happened yet, but will very soon start making AI LLMs useless, much like SEO has ruined search. You can bet there are a load of people already seeding the internet with a load of advertising and misinformation aimed solely at AIs and AI reinforcement
> Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.
For what it's worth, I mostly work on projects in the 100-200 files range, at 20-40k LoC. When using proper tooling with appropriate models, it boosts my productivity by at least 2x (being conservative). I've experimented with this by going a few days without using them, then using them again.
Definitely far from the massive codebases many on here work on, small beans by HN standards. But also decidedly not just writing one-off scripts.
> Real programming is on a totally different scale than what you're describing.
How "real" are we talking?
When I think of "real programming" I think of flight control software for commercial airplanes and, I can assure you, 1 month != 5,000 LoC in that space.
And... I know people who now use AI to write their professional-sounding emails, and they often don't sound as professional as they think they do. It can be easy to just skim what an AI generates and think it's okay to send if you aren't careful, but the people you send those emails to actually have to read what was written and attempt to understand it, and doing that makes you notice things that a brief skim doesn't catch.
It's actually extremely irritating that I'm only half talking to the person when I email with these people.
It's kinda like machine translated novels. You have to really be passionate about the novel to endure these kinds of translations. That's when you realize how much work novel translators do to get a coherent result.
Especially jarring when you have read translations that put thought into them. I noticed this in xianxia, i.e. Chinese power-fantasy novels, where the selection of what to translate and what to transliterate can have a huge impact. Editorial work also becomes important if something in an earlier chapter needs to be changed based on later information.
I literally had a developer of an open source package I’m working with tell me “yeah that’s a known problem, I gave up on trying to fix it. You should just ask ChatGPT to fix it, I bet it will immediately know the answer.”
Annoying response of course. But I’d never used an LLM to debug before, so I figured I’d give it a try.
First: it regurgitated a bunch of documentation and basic debugging tips, which might have actually been helpful if I had just encountered this problem and had put no thought into debugging it yet. In reality, I had already spent hours on the problem. So not helpful
Second: I provided some further info on environment variables I thought might be the problem. It latched on to that. “Yes that’s your problem! These environment variables are (causing the problem) because (reasons that don’t make sense). Delete them and that should fix things.” I deleted them. It changed nothing.
Third: It hallucinated a magic numpy function that would solve my problem. I informed it this function did not exist, and it wrote me a flowery apology.
Clearly AI coding works great for some people, but this was purely an infuriating distraction. Not only did it not solve my problem, it wasted my time and energy, and threw tons of useless and irrelevant information at me. Bad experience.
The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.
If I give it all my information and add "I think the problem might be X, but I'm not sure", the LLM always agrees that the problem is X and will reinterpret everything else I've said to 'prove' me right.
Then the conversation is forever poisoned and I have to restart an entirely new chat from scratch.
98% of the utility I've found in LLMs is getting it to generate something nearly correct, but which contains just enough information for me to go and Google the actual answer. Not a single one of the LLMs I've tried have been any practical use editing or debugging code. All I've ever managed is to get it to point me towards a real solution, none of them have been able to actually independently solve any kind of problem without spending the same amount of time and effort to do it myself.
> The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.
I'm seeing this sentiment a lot in these comments, and frankly it shows that very few here have actually gone and tried the variety of models available. Which is totally fine, I'm sure they have better stuff to do, you don't have to keep up with this week's hottest release.
To be concrete - the symptom you're talking about is very typical of Claude (or earlier GPT models). o3-mini is much less likely to do this.
Secondly, prompting absolutely goes a huge way to avoiding that issue. Like you're saying - if you're not sure, don't give hints, keep it open-minded. Or validate the hint before starting, in a separate conversation.
I literally got this problem earlier today on ChatGPT, which claims to be based on o4-mini. So no, does not sound like it's just a problem with Claude or older GPTs.
And on "prompting", I think this is a point of friction between LLM boosters and haters. To the uninitiated, most AI hype sounds like "it's amazing magic!! just ask it to do whatever you want and it works!!" When they try it and it's less than magic, hearing "you're prompting it wrong" seems more like a circular justification of a cult follower than advice.
I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively. I buy that. But some more specific advice would be helpful. Cause as is, it sounds more like "LLMs are magic!! didn't work for you? oh, you must be holding it wrong, cause I know they infallibly work magic".
> I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively
I don't buy this at all.
At best "learning to prompt" is just hitting the slot machine over and over until you get something close to what you want, which is not a skill. This is what I see when people "have a conversation with the LLM"
At worst you are a victim of sunk cost fallacy, believing that because you spent time on a thing that you have developed a skill for this thing that really has no skill involved. As a result you are deluding yourself into thinking that the output is better.. not because it actually is, but because you spent time on it so it must be
On the other hand, when it works it's darn near magic.
I spent like a week trying to figure out why a livecd image I was working on wasn't initializing devices correctly. Read the docs, read source code, tried strace, looked at the logs, found forums of people with the same problem but no solution, you know the drill. In desperation I asked ChatGPT. ChatGPT said "Use udevadm trigger". I did. Things started working.
For some problems it's just very hard to express them in a googleable form, especially if you're doing something weird almost nobody else does.
i started (re)using AI recently. it/i mostly failed until i decided on a rule.
if it's "dumb and annoying" i ask the AI, else i do it myself.
since that AI has been saving me a lot of time on dumb and annoying things.
also a few models are pretty good for basic physics/modeling stuff (get basic formulas, fetching constants, do some calculations). these are also pretty useful. i recently used it for ventilation/co2 related stuff in my room and the calculations matched observed values pretty well, then it pumped me a broken desmos syntax formula, and i fixed that by hand and we were good to go!
---
(dumb and annoying thing -> time-consuming to generate with no "deep thought" involved, easy to check)
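(For reference: a minimal sketch of the kind of steady-state CO2 estimate described above, with purely illustrative numbers, roughly 18 L/h of exhaled CO2 per adult at rest and 36 m3/h of fresh air. These are assumptions for the sketch, not the commenter's actual values; checking against a real CO2 sensor is still how you'd validate it.)

    # Rough steady-state CO2 estimate for one occupied room.
    # All numbers below are illustrative assumptions, not measured values.
    OUTDOOR_PPM = 420        # typical outdoor CO2 concentration, ppm
    GEN_L_PER_H = 18         # CO2 exhaled by one adult at rest, litres per hour
    VENT_M3_PER_H = 36       # fresh-air ventilation rate, cubic metres per hour

    vent_l_per_h = VENT_M3_PER_H * 1000                 # m^3/h -> L/h
    rise_ppm = GEN_L_PER_H / vent_l_per_h * 1_000_000   # volume fraction -> ppm
    print(f"steady state roughly {OUTDOOR_PPM + rise_ppm:.0f} ppm")  # ~920 ppm here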
> For some problems it's just very hard to express them in a googleable form
I had an issue where my Mac would report that my tethered iPhone's batteries were running low when the battery was in fact fine. I had tried googling an answer, and found many similar-but-not-quite-the-same questions and answers. None of the suggestions fixed the issue.
I then asked my question to the 'MacOS Guru' model for ChatGPT, and one of the suggestions worked. I feel like I learned something about ChatGPT vs Google from this - the ability of an LLM to match my 'plain English question without a precise match for the technical terms' is obviously superior to a search engine. I think Google etc. try synonyms for words in the query, but to me it's clear this isn't enough.
Google isn't the same for everyone. Your results could be very different from mine. They're probably not quite the same as months ago either.
I may also have accidentally made it harder by using the wrong word somewhere. A good part of the difficulty of googling for a vague problem is figuring out how to even word it properly.
Also of course it's much easier now that I tracked down what the actual problem was and can express it better. I'm pretty sure I wasn't googling for "devices not initializing" at the time.
But this is where I think LLMs offer a genuine improvement -- being able to deal with vagueness better. Google works best if you know the right words, and sometimes you don't.
This morning I was using an LLM to develop some SQL queries against a database it had never seen before. I gave it a starting point, and outlined what I wanted to do. It proposed a solution, which was a bit wrong, mostly because I hadn't given it the full schema to work with. Small nudges and corrections, and we had something that worked. From there, I iterated and added more features to the outputs.
At many points, the code would have an error; to deal with this, I just supply the error message, as-is to the LLM, and it proposes a fix. Sometimes the fix works, and sometimes I have to intervene to push the fix in the right direction. It's OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three.
A key part of the workflow, imo, was that we were working in the medium of the actual code. If the code is broken, we get an error, and can iterate. Asking for opinions doesn't really help...
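To make that loop concrete, here is a minimal sketch of the paste-the-error-back workflow, assuming a hypothetical ask_llm() helper standing in for whatever chat interface is used (the helper name, prompts, and round limit are all illustrative, not any particular vendor's SDK), with SQLite as a stand-in database:

    import sqlite3

    def ask_llm(prompt: str) -> str:
        """Hypothetical helper: send the prompt to a chat model and return its SQL."""
        raise NotImplementedError

    def iterate_query(task: str, schema_sql: str, max_rounds: int = 5) -> str:
        conn = sqlite3.connect(":memory:")
        conn.executescript(schema_sql)  # give the model-written SQL something real to run against
        prompt = f"Schema:\n{schema_sql}\n\nWrite a SQL query to: {task}"
        for _ in range(max_rounds):
            sql = ask_llm(prompt)
            try:
                conn.execute(sql)       # run the candidate query
                return sql              # it executed; a human still reviews the result
            except sqlite3.Error as err:
                # feed the error message back as-is, exactly as described above
                prompt = f"That query failed with: {err}\nPlease fix this query:\n{sql}"
        raise RuntimeError("no working query after several rounds")

Executing without an error only proves the query runs, not that it answers the right question, so the human sanity check on the output still matters.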
I often wonder if people who report that LLMs are useless for code haven't cracked the fact that you need to have a conversation with it - expecting a perfect result after your first prompt is setting it up for failure; the real test is if you can get to a working solution after iterating with it for a few rounds.
As someone who has finally found a way to increase productivity by adding some AI, my lesson has sort of been the opposite. If the initial response after you've provided the relevant context isn't obviously useful: give up. Maybe start over with slightly different context. A conversation after a bad result won't provide any signal you can do anything with, there is no understanding you can help improve.
It will happily spin forever responding in whatever tone is most directly relevant to your last message: provide an error and it will suggest you change something (it may even be correct every once in a while!), suggest a change and it'll tell you you're obviously right, suggest the opposite and you will be right again, ask if you've hit a dead end and yeah, here's why. You will not learn anything or get anywhere.
A conversation will only be useful if the response you got just needs tweaks. If you can't tell what it needs feel free to let it spin a few times, but expect to be disappointed. Use it for code you can fully test without much effort, actual test code often works well. Then a brief conversation will be useful.
Because once you get good at using LLMs you can write it with 5 rounds with an LLM in way less time than it would have taken you to type out the whole thing yourself, even if you got it exactly right first time coding it by hand.
Most of the code in there is directly copied and pasted in from https://claude.ai or https://chatgpt.com - often using Claude Artifacts to try it out first.
Some changes are made in VS Code using GitHub Copilot
If you do a basic query to GPT-4o every ten seconds it uses a blistering... hundred watts or so. More for long inputs, less when you're not using it that rapidly.
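For the curious, that figure is just unit conversion on the numbers above (a rough sketch, not a measured per-query figure): a sustained hundred watts at one query every ten seconds works out to roughly 0.28 Wh per query.

    # Arithmetic behind "one query every ten seconds ~ a hundred watts or so".
    AVERAGE_WATTS = 100      # sustained draw from the estimate above
    SECONDS_BETWEEN = 10     # one query every ten seconds

    joules_per_query = AVERAGE_WATTS * SECONDS_BETWEEN   # 1000 J per query
    wh_per_query = joules_per_query / 3600               # 1 Wh = 3600 J
    print(f"implied energy per query: about {wh_per_query:.2f} Wh")  # ~0.28 Wh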
I know. That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do. (One recent example: https://simonwillison.net/2024/Sep/10/software-misadventures... )
(I get boosts from LLMs to a bunch of activities too, like researching and planning, but those are less obvious than the coding acceleration.)
> That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do
This explains it then. You aren't a software developer
You get a productivity boost from LLMs when writing code because it's not something you actually do very much
That makes sense
I write code for probably between 50-80% of any given week, which is pretty typical for any software dev I've ever worked with at any company I've ever worked at
So we're not really the same. It's no wonder LLMs help you, you code so little that you're constantly rusty
I very much doubt you spend 80% of your working time actively typing code into a computer.
My other activities include:
- Researching code. This is a LOT of my time - reading my own code, reading other code, reading through documentation, searching for useful libraries to use, evaluating if those libraries are any good.
- Exploratory coding in things like Jupyter notebooks, Firefox developer tools etc. I guess you could call this "coding time", but I don't consider it part of that 10% I mentioned earlier.
- Talking to people about the code I'm about to write (or the code I've just written).
- Filing issues, or updating issues with comments.
- Writing documentation for my code.
- Straight up thinking about code. I do a lot of that while walking the dog.
- Staying up-to-date on what's new in my industry.
- Arguing with people about whether or not LLMs are useful on Hacker News.
You must not be learning very many new things then if you can't see a benefit to using an LLM. Sure, for the normal crud day-to-day type stuff, there is no need for an LLM. But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.
Sure, it often spits out incomplete, non-ideal, or plain wrong answers, but that's where having SWE experience comes into play to recognize it.
> But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.
In the middle of this thought, you changed the context from "learning new things" to "not being faster than an LLM"
It's easy to guess why. When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything
But yes, you're right. I don't learn new things from scratch very often, because I'm not changing contexts that frequently.
I want to be someone who had 10 years of experience in my domain, not 1 year of experience repeated 10 times, which means I cannot be starting over with new frameworks, new languages and such over and over
Exactly! I learn all kinds of things besides coding-related things, so I don't see how it's any different. ChatGPT 4o does an especially good job of walking thru the generated code to explain what it is doing. And, you can always ask for further clarification. If a coder is generating code but not learning anything, they are either doing something very mundane or they are being lazy and just copy/pasting without any thought--which is also a little dangerous, honestly.
It really depends on what you're trying to achieve.
I was trying to prototype a system and created a one-pager describing the main features, objectives, and restrictions. This took me about 45 minutes.
Then I feed it into Claude and asked to develop said system. It spent the next 15 minutes outputting file after file.
Then I ran "npm install" followed by "npm run" and got a "fully" (API was mocked) functional, mobile-friendly, and well documented system in just an hour of my time.
It'd have taken me an entire day of work to reach the same point.
Yeah nah. The endless loop of useless suggestions or ”solutions” is very easily achievable and common, at least in my use cases, no matter how much you iterate with it. Iterating gets counter-productive pretty fast, imo. (Using 4o).
When I use Claude to iterate/troubleshoot I do it in a project and in multiple chats. So if I test something and it throws an error or gives an unexpected result I’ll start a new chat to deal with that problem, correct the code, update that in the project, then go back to my main thread and say “I’ve updated this” and provide it the file, “now let’s do this”. When I started doing this it massively reduced the LLM getting lost or going off on weird quests. Iteration in side chats, regroup in the main thread. And then possibly another overarching “this is what I want to achieve” thread where I update it on the progress and ask what we should do next.
I have been thinking about this a lot recently. I have a colleague who simply can’t use LLMs for this reason - he expects them to work like a logical and precise machine, and finds interacting with them frustrating, weird and uncomfortable.
However, he has a very black and white approach to things and he also finds interacting with a lot of humans frustrating, weird and uncomfortable.
The more conversations I see about LLMs the more I’m beginning to feel that “LLM-whispering” is a soft skill that some people find very natural and can excel at, while others find it completely foreign, confusing and frustrating.
It really requires self-discipline to ignore the enthusiasm of the LLM as a signal for whether you are moving in the direction of a solution. I blame myself for lazy prompting, but have a hard time not just jumping in with a quick project, hoping the LLM can get somewhere with it, and not attempt things that are impossible, etc.
> OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three
If you have any reasonable understanding of SQL, I guarantee you could brush up on it and write it yourself in less than a couple of hours unless you're trying to do something very complex
Sure, I could do that. But I would learn where to put my join statements relative to the where statements, and then forget it again in a month because I have lots of other things that I actually need to know on a daily basis. I can easily outsource the boilerplate to the LLM and get to a reasonable starting place for free.
Think of it as managing cognitive load. Wandering off to relearn SQL boilerplate is a distraction from my medium-term goal.
edit: I also believe I'm less likely to get a really dumb 'gotcha' if I start from the LLM rather than cobbling together knowledge from some random docs.
If you don’t take care to understand what the LLM outputs, how can you be confident that it works in the general case, edge cases and all? Most of the time that I spend as a software engineer is reasoning about the code and its logic to convince myself it will do the right thing in all states and for all inputs. That’s not something that can be offloaded to an LLM. In the SQL case, that means actually understanding the semantics and nuances of the specific SQL dialect.
Obviously to a mega super genius like yourself an LLM is useless. But perhaps you can consider that others may actually benefit from LLMs, even if you’re way too talented to ever see a benefit?
You might also consider that you may be over-indexing on your own capabilities rather than evaluating the LLM’s capabilities.
Lets say an llm is only 25% as good as you but is 10% the cost. Surely you’d acknowledge there may be tasks that are better outsourced to the llm than to you, strictly from an ROI perspective?
It seems like your claim is that since you’re better than LLMs, LLMs are useless. But I think you need to consider the broader market for LLMs, even if you aren’t the target customer.
Knowing SQL isn't being a "mega super genius" or "way too talented". SQL is flawed, but being hard to learn is not among its flaws. It's designed for untalented COBOL mainframe programmers, on the theory that Codd's relational algebra and relational calculus would be too hard for them and would prevent the adoption of relational databases.
However, whether SQL is "trivial to write by hand" very much depends on exactly what you are trying to do with it.
That makes sense, and from what I’ve heard this sort of simple quick prototyping is where LLM coding works well. The problem with my case was I’m working with multiple large code bases, and couldn’t pinpoint the problem to a specific line, or even file. So I wasn’t gonna just copy multiple git repos into the chat
(The details: I was working with running a Bayesian sampler across multiple compute nodes with MPI. There seemed to be a pathological interaction between the code and MPI where things looked like they were working, but never actually progressed.)
I wonder if it breaks like this: people who don't know how to code find LLMs very helpful and don't realize where they are wrong. People who do know immediately see all the things they get wrong and they just give up and say "I'll do it myself".
This is exactly my experience, every time! If I offer it the slightest bit of context it will say 'Ah! I understand now! Yes, that is your problem, …' and proceed to spit out some non-existent function, sometimes the same one it has just suggested a few prompts ago which we already decided doesn't exist/work. And it just goes on and on giving me 'solutions' until I finally realise it doesn't have the answer (which it will never admit unless you specifically ask it to – forever looking to please) and give up.
I’ve followed your blog for a while, and I have been meaning to unsubscribe because the deluge of AI content is not what I’m looking for.
I read the linked article when it was posted, and I suspect a few things that are skewing your own view of the general applicability of LLMs for programming. One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
I think it’s great that it’s a technology you’re passionate about and that it’s useful for you, but my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful. And that’s okay, it doesn’t have to be all things to all people. But it’s not fair to say that we’re just holding it wrong.
"my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful."
It's possible that changed this week with Gemini 2.5 Pro, which is equivalent to Claude 3.7 Sonnet in terms of code quality but has a 1 million token context (with excellent scores on long-context benchmarks) and an increased output limit too.
I've been dumping hundreds of thousands of tokens of codebase into it and getting very impressive results.
See this is one of the things that’s frustrating about the whole endeavor. I give it an honest go, it’s not very good, but I’m constantly exhorted to try again because maybe now that Model X 7.5qrz has been released, it’ll be really different this time!
It’s exhausting. At this point I’m mostly just waiting for it to stabilize and plateau, at which point it’ll feel more worth the effort to figure out whether it’s now finally useful for me.
Not going to disagree that it's exhausting! I've been trying to stay on top of new developments for the past 2.5 years and there are so many days when I'll joke "oh, great, it's another two new models day".
Just on Tuesday this week we got the first widely available high quality multi-modal image output model (GPT-4o images) and a new best-overall model (Gemini 2.5) within hours of each other. https://simonwillison.net/2025/Mar/25/
> One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
Take a look at the 2024 StackOverflow survey.
70% of professional developer respondents had only done extensive work over the last year in one of a short list of mainstream languages.
LLMs are of course very strong in all of these. 70% of developers only code in languages LLMs are very strong at.
If anything, for the developer population at large, this number is even higher than 70%. The survey respondents are overwhelmingly American (where the dev landscape is more diverse), and self-select to those who use niche stuff and want to let the world know.
A similar argument can be made for median codebase size, in terms of LOC written every year. A few days ago he also gave Gemini 2.5 Pro a whole codebase (at ~300k tokens) and it performed well. Even in huge codebases, if any kind of separation of concerns is involved, that's enough to give all the context relevant to the part of the code you're working on. [1]
What’s 300k tokens in terms of lines of code? Most codebases I’ve worked on professionally have easily eclipsed 100k lines, not including comments and whitespace.
But really that’s the vision of actual utility that I imagined when this stuff first started coming out and that I’d still love to see: something that integrates with your editor, trains on your giant legacy codebase, and can actually be useful answering questions about it and maybe suggesting code. Seems like we might get there eventually, but I haven’t seen that we’re there yet.
We hit "can actually be useful answering questions about it" within the last ~6 months with the introduction of "reasoning" models with 100,000+ token contest limits (and the aforementioned Gemini 1 million/2 million models).
The "reasoning" thing is important because it gives models the ability to follow execution flow and answer complex questions that down many different files and classes. I'm finding it incredible for debugging, eg: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8...
I built a files-to-prompt tool to help dump entire codebases into the larger models and I use it to answer complex questions about code (including other people's projects written in languages I don't know) several times a week. There's a bunch of examples of that here: https://simonwillison.net/search/?q=Files-to-prompt&sort=dat...
After more than a few years working on a codebase? Quite a lot. I know which interfaces I need and from where, what the general areas of the codebase are, and how they fit together, even if I don’t remember every detail of every file.
> But it’s not fair to say that we’re just holding it wrong.
<troll>Have you considered that asking it to solve problems in areas it's bad at solving problems is you holding it wrong?</troll>
But, actually seriously, yeah, I've been massively underwhelmed with the LLM performance I've seen, and just flabbergasted with the subset of programmer/sysadmin coworkers who ask it questions and take those answers as gospel. It's especially frustrating when it's a question about something that I'm very knowledgeable about, and I can't convince them that the answer they got is garbage because they refuse to so much as glance at supporting documentation.
LLMs need to stay bad. What is going to happen if we have another few GPT-3.5 to Gemini 2.5 sized steps? You're telling people who need to keep the juicy SWE gravy train running for another 20 years to recognize that the threat is indeed very real. The writing is on the wall and no one here (here on HN especially) is going to celebrate those pointing to it.
I don't think people really realize the danger of mass unemployment
Go look up what happens in history when tons of people are unemployed at the same time with no hope of getting work. What happens when the unemployed masses become desperate?
Naw I'm sure it will be fine, this time will be different
Just wanted to chime in and say how appreciative I’ve been about all your replies here, and overall content on AI. Your takes are super reasonable and well thought out.
I see people say, "Look how great this is," and show me an example, and the example they show me is just not great. We're literally looking at the same thing, and they're excited that this LLM can do a college grad's job to the level of a third grader, and I'm just not excited about that.
What changed my point of view regarding LLMs was when I realized how crucial context is in increasing output quality.
Treat the AI as a freelancer working on your project. How would you ask a freelancer to create a Kanban system for you? By simply asking "Create a Kanban system", or by providing them a 2-3 pages document describing features, guidelines, restrictions, requirements, dependencies, design ethos, etc?
Which approach will get you closer to your objective?
The same applies to LLM (when it comes to code generation). When well instructed, it can quickly generate a lot of working code, and apply the necessary fixes/changes you request inside that same context window.
It still can't generate senior-level code, but it saves hours when doing grunt work or prototyping ideas.
"Oh, but the code isn't perfect".
Nor is the code of the average jr dev, but their code still makes it to production in thousands of companies around the world.
They're sophisticated tools, as much as any other software.
About 2 weeks ago I started on a streaming markdown parser for the terminal because none really existed. I've switched to human coding now but the first version was basically all llm prompting and a bunch of the code is still llm generated (maybe 80%). It's a parser, those are hard. There's stacks, states, lookaheads, look behinds, feature flags, color spaces, support for things like links and syntax highlighting... all forward streaming. Not easy
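To make the "forward streaming" constraint concrete, here is a toy sketch of the kind of state handling involved; it is purely illustrative and not the parser described above, which handles far more than bold text.

    def stream_bold(chunks):
        """Convert **bold** spans to ANSI codes as text streams in, one chunk
        at a time, with only a single buffered '*' carried between characters."""
        bold = False          # are we currently inside a **...** span?
        pending_star = False  # saw one '*', waiting to see if a second follows
        for chunk in chunks:
            out = []
            for ch in chunk:
                if ch == "*":
                    if pending_star:      # '**' seen: toggle bold on/off
                        bold = not bold
                        out.append("\x1b[1m" if bold else "\x1b[22m")
                        pending_star = False
                    else:
                        pending_star = True
                else:
                    if pending_star:      # the lone '*' was literal text after all
                        out.append("*")
                        pending_star = False
                    out.append(ch)
            yield "".join(out)

    # for piece in stream_bold(["Hello **wor", "ld** and *stars*"]):
    #     print(piece, end="")

Note that the state survives across chunk boundaries, which is the whole point: the real thing juggles many more of these per-character states at once.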
> LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Exactly this.
I once had a function that would generate several .csv reports. I wanted these reports to then be uploaded to s3://my_bucket/reports/{timestamp}/.csv
I asked ChatGPT "Write a function that moves all .csv files in the current directory to an old_reports directory, calls a create_reports function, then uploads all the csv files in the current directory to s3://my_bucket/reports/{timestamp}/.csv with the timestamp in YYYY-MM-DD format"
And it created the code perfectly. I knew what the correct code would look like, I just couldn't be fucked to look up the exact calls to boto3, whether moving files was os.move or os.rename or something from shutil, and the exact way to format a datetime object.
It created the code far faster than I would have.
Like, I certainly wouldn't use it to write a whole app, or even a whole class, but individual blocks like this, it's great.
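For reference, the block being described probably looks roughly like this. This is a sketch only: the bucket name and create_reports come from the description above, and everything else is a guess at one reasonable implementation, not the actual generated code.

    import shutil
    from datetime import datetime
    from pathlib import Path

    import boto3

    def refresh_reports(bucket: str = "my_bucket") -> None:
        """Archive existing CSVs, regenerate the reports, then upload the new
        ones under s3://<bucket>/reports/<YYYY-MM-DD>/."""
        cwd = Path(".")
        old_dir = cwd / "old_reports"
        old_dir.mkdir(exist_ok=True)

        # Move the current CSVs out of the way (shutil.move also works across
        # filesystems, unlike a bare os.rename).
        for csv_file in cwd.glob("*.csv"):
            shutil.move(str(csv_file), str(old_dir / csv_file.name))

        create_reports()  # assumed to exist in the caller's codebase

        # Upload the freshly generated CSVs under a dated prefix.
        timestamp = datetime.now().strftime("%Y-%m-%d")
        s3 = boto3.client("s3")
        for csv_file in cwd.glob("*.csv"):
            s3.upload_file(str(csv_file), bucket, f"reports/{timestamp}/{csv_file.name}")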
I have been saying this about llms for a while - if you know what you want, how to ask for it, and what the correct output will look like, LLMs are fantastic (at least Claude Sonnet is). And I mean that seriously, they are a highly effective tool for productive development for senior developers.
I use it to produce whole classes, large sql queries, terraform scripts, etc etc. I then look over that output, iterate on it, adjust it to my needs. It's never exactly right at first, but that's fine - neither is code I write from scratch. It's still a massive time saver.
> they are a highly effective tool for productive development for senior developers
I think this is the most important bit many people miss. It is advertised as an autonomous software developer, or something that can take a junior to senior levels, but that's just advertising.
It is actually most useful for senior developers, as it does the grunt work for them, while grunt work is actually useful work for a junior developer as a learning tool.
Precisely -- you have to be experienced in your field to use these tools effectively.
These are power tools for the mind. We've been working with the equivalent of hand tools, now something new came along. And yeah, a hole hawg will throw you clear off a ladder if you're not careful -- does that mean you're going to bore 6" holes in concrete ceilings by hand? Think not.
> It is advertised as an autonomous software developer
By a few currently niche VC players, I guess. I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
> I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
Are you sure about that? [1]:
> "I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code," Amodei said at a Council of Foreign Relations event on Monday.
"How to ask for it" is the most important part. As soon as you realize that you have to provide the AI with CONTEXT and clear instructions (you know, like a top-notch story card on a scrum board), the quality and assertiveness of the results increase a LOT.
Yes, it WON'T produce senior-level code for complex tasks, but it's great at tackling junior- to mid-level code generation/refactoring, with minor adjustments (just like a code review).
So, it's basically the same thing as having a freelancer jr dev at your disposal, but it can generate working code in 5 min instead of 5 hours.
I've had so many cases exactly like your example here. If you build up an intuition that knows that e.g. Claude 3.7 Sonnet can write code that uses boto3, and boto3 hasn't had any breaking changes that would affect S3 usage in the past ~24 months, you can jump straight into a prompt for this kind of task.
It doesn't just save me a ton of time, it results in me building automations that I normally wouldn't have taken on at all because the time spent fiddling with os.move/boto3/etc wouldn't have been worthwhile compared to other things on my plate.
I think you have an interesting point of view and I enjoy reading your comments, but it sounds a little absurd and circular to discount people's negativity about LLMs simply because it's their fault for using an LLM for something it's not good at. I don't believe in the strawman characterization of people giving LLMs incredibly complex problems and being unreasonably judgemental about the unsatisfactory results. I work with LLMs every day. Companies pay me good money to implement reliable solutions that use these models and it's a struggle. Currently I'm working with Claude 3.5 to analyze customer support chats. Just as many times as it makes impressive, nuanced judgments it fails to correctly make simple trivial judgements. Just as many times as it follows my prompt to a tee, it also forgets or ignores important parts of my prompt. So the problem for me is it's incredibly difficult to know when it'll succeed and when it'll fail for a given input. Am I unreasonable for having these frustrations? Am I unreasonable for doubting the efficacy of LLMs to address problems that many believe are already solved? Can you understand my frustration to see people characterize me as such because ChatGPT made a really cool image for them once?
It's a weird circle with these things. If you _can't_ do the task you are using the LLM for, you probably shouldn't.
But if you can do the task well enough to at least recognize likely-to-be-correct output, then you can get a lot done in less time than you would do it without their assistance.
Is that worth the second order effects we're seeing? I'm not convinced, but it's definitely changed the way we do work.
I think this points to much of the disagreement over LLMs. They can be great at one-off scripts and other similar tasks like prototypes. Some folks who do a lot of that kind of work find the tools genuinely amazing. Other software engineers do almost none of that and instead spend their coding time immersed in large messy code bases, with convoluted business logic. Looping an LLM into that kind of work can easily be net negative.
Maybe they are just lazy around tooling. Cursor with Claude works well for project sizes much larger than I expected but it takes a little set up. There is a chasm between engineers who use tools well and who do not.
I don't really agree with framing it as lazy. Adding more tools and steps to your workflow isn't free, and the cost/benefit of each tool will be different for everyone. I've lost count of how many times someone has evangelized a software tool to me, LLM or not. Once in a while they turn out to be useful and I incorporate them into my regular workflow, but far more often I don't. This could be for any number of reasons: it doesn't fit my workflow well, or I already have a better way of doing whatever it does, or the tool adds more friction than it removes.
I'm sure spending more time fiddling with the setup of LLM tools can yield better results, but that doesn't mean that it will be worth it for everyone. In my experience LLMs fail often enough at modestly complex problems that they are more hassle than benefit for a lot of the work I do. I'll still use them for simple tasks, like if I need some standard code in a language I'm not too familiar with. At the same time, I'm not at all surprised that others have a different experience and find them useful for larger projects they work on.
I'm tired of people bashing LLMs. AI is so useful in my daily work that I can't understand where these people are coming from. Well, whatever...
As you said, these are examples of things I wouldn't expect LLMs to be good at, from people who dismiss the scenarios LLMs are great at. I don't want to convince anyone, to be honest - I just want to say they are incredibly useful for me and a huge time saver. If people don't want to use LLMs, it's fine for me as I'll have an edge over them in the market. Thanks for the cash, I guess.
I'll give you a simple and silly example which could give you additional ideas. LLMs can be great for checking whether people can understand something.
One day I came up with a joke and wondered whether people would "get it". I told the joke to ChatGPT and asked it to explain it back to me. ChatGPT did a great job and nailed what's supposedly funny about the joke. I used it in an email so I have no idea whether anyone found it funny, but at least I know it wasn't too obscure. If an AI can understand a joke, there's a good chance people will understand it too.
This might not be super useful but demonstrates that LLMs aren't only about generating text for copy-and-paste or retrieving information. It's "someone" you can bounce ideas off of and ask for opinions, and that's how I use it most frequently.
Every time someone brings up "code that doesn't need to deal with edge cases" I like to point out that such code is not likely to be used for anything that matters.
Oh, but it is. I can have code that does something nice to have and doesn't need to be 100% correct, etc. For example, I want a background for my playful webpage. Maybe a WebGL shader. It might not be exactly what I asked for, but I can have it up and running in a few minutes. Or some non-critical internal tools - like a scraper for lunch menus from restaurants around the office. Or a simple parking spot sharing app. Or any kind of prototype, which in some companies are being created all the time. There are so many use cases that are forgiving regarding correctness and are much more sensitive to development effort.
There is a cost burden to not being 100% correct when it comes to programming. You simply have chosen to ignore that burden, but it still exists for others. Whether it's for example a percent of your users now getting stalled pages due to the webgl shader, or your lunch scraper ddosing local restaurants. They aren't actually forgiving regarding correctness.
Which is fine for actual testing you're doing internally, since that cost burden is then remedied by you fixing those issues. However, no feature is as free as you're making it sound, not even the "nice to have" additions that seem so insignificant.
I never said it's free. (But aiming for 100% correctness is also very, very expensive.) I'm talking about trading correctness, readability, security and maybe other things for other metrics. What I said is just that not every project that has value should be optimized for the same metrics. Bank or medical software needs to be as close to 100% correct as possible. Some tool I'm creating for my team to simplify a process does not necessarily need to be. I would not mind my WebGL shader possibly causing problems for some users. It would get reported and fixed. Or not. It's my call what I spend my effort on.
Of course the tradeoffs should be well considered. That's why it may get out of hand really badly if software is created (or vibe coded) by people with little understanding of these metrics and tradeoffs. I'm absolutely not advocating for that.
I’m always amazed in these discussions how many people apparently have jobs doing a bunch of stuff that either doesn’t need to be correct or is simple enough that it doesn’t require any significant amount of external context.
The point is more that everyone seems to acknowledge that a) output is spotty, and b) it’s difficult to provide enough context to work on anything that’s not fairly self-contained. And yet we also constantly have people saying that they’re using AI for some ridiculous percentage of their actual job output. So, I’m just curious how one reconciles those two things.
Either most people’s jobs consist of a lot more small, self-contained mini-projects than my jobs generally have, or people’s jobs are more accepting of incorrect output than I’m used to, or people are overstating their use of the tool.
Automating the easy 80% sounds useful, but in practice I'm not convinced that's all that helpful. Reading and putting together code you didn't write is hard enough to begin with.
The things I'm wary of are pitfalls that are often only in the command/function docs. Kinda like rsync with how it handles terminating slashes at the end of the path. Which is why I always took a moment to read them.
Not GP, but more often than not I reach out to tools I already know (sed,awk,python) or read the docs which don't take that much time if you know how to get to the sections you need.
I write code like that all the time. It's used for very specific use cases, only by myself or something I've also written. It's not exposed to random end users or inputs.
> Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
I’ve never seen it from my students. Why do you think this? It’s trivial to pick a real book/article. No student is generating fake material whole cloth and fake references to match. Even if they could, why would they risk it?
TBD whether that makes the effort to spot-check their references greater (does it actually say what the student - explicitly or implicitly - claims it does?), or less (proving the non-existence of an obscure reference is proving a negative).
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air.
Perfectly put, IMO.
I know arguments from authority aren't primary, but I think this point highlights some important context: Dr. Hossenfelder has gained international renown by publishing clickbait-y YouTube videos that ostensibly debunk scientific and technological advances of all kinds. She's clearly educated and thoughtful (not to mention otherwise gainfully employed), but her whole public persona kinda relies on assuming the exclusively-critical standpoint you mention.
I doubt she necessarily feels indebted to her large audience expecting this take (it's not new...), but that certainly does seem like a hard cognitive habit to break.
More often than not, when I inquire deeper, I find their prompting isn't very good at all.
"Garbage in, garbage out" as the law says.
Of course, it took a lot of trial and error for me to get to my current level of effectiveness with LLMs. It's probably our responsibility to teach those who are willing.
It seems hard to be bullish on LLMs as a generally useful tool if the solution to problems people have is "use trial and error to improve how you write your prompts, no, it's not obvious how to do so, yes, it depends heavily on the exact model you use."
A Mitre Saw is an amazing thing to have in a woodshop, but if you don't learn how to use it you're probably going to cut off a finger.
The problem is that LLMs are power tools that are sold as being so easy to use that you don't need to invest any effort in learning them at all. That's extremely misleading.
> Are legally liable for defects in design or manufacture that cause injury, death, or property damage
Except when you use them for purposes other than declared by them - then it's on you. Similarly, you get plenty of warnings about limitation and suitability of LLMs from the major vendors, including even warnings directly in the UI. The limitations of LLMs are common knowledge. Like almost everyone, you ignore them, but then consequences are on you too.
> Provide manuals that instruct the operator how to effectively and safely use the power tool
LLMs come with manuals much, much more extensive than any power tool ever (or at least since 1960s or such, as back then hardware was user-serviceable and manuals weren't just generic boilerplate).
As for:
> Know how they work
That is a real difference between power tool manufacturers and LLM vendors, but then if you switch to comparing against pharmaceutical industry, then they don't know how most of their products work either. So it's not a requirement for useful products that we benefit from having available.
Using LLMs to write SQL is a fascinating case because there are so many traps you could fall into that aren't really the fault of the LLM.
My favorite example: you ask the LLM for "most recent restaurant opened in California", give it a schema and it tries "select * from restaurants where state = 'California' order by open_date desc" - but that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
There are tricks that can help here - I've tried sending the LLM an example row from each table, or you can set up a proper loop where the LLM gets to see the results and iterate on them - but it reflects the fact that interacting with databases can easily go wrong no matter how "smart" the model you are using is.
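A rough sketch of the "example row" trick described here; the restaurants table and the state/open_date columns come from the example above, while the sqlite setup and prompt wording are just illustrative.

    import sqlite3

    # Build a prompt that includes both the schema and one real row, so the
    # model can see that the state column holds two-letter codes like 'CA'.
    # The database file and table are stand-ins for the example discussed above.
    conn = sqlite3.connect("restaurants.db")
    schema = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'restaurants'"
    ).fetchone()[0]
    sample_row = conn.execute("SELECT * FROM restaurants LIMIT 1").fetchone()

    prompt = (
        "Write a SQL query for: the most recent restaurant opened in California.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Example row:\n{sample_row}\n"
    )
    # With the example row visible, the model is far more likely to filter on
    # state = 'CA' rather than on the literal string 'California'.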
> that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
As you’ve identified, rather than just giving it the schema, you give it the schema and some data when you tell it what you want.
A human might make exactly the same error - based on misassumption - and would then look at the data to see why it was failing.
If we expect an LLM to magically realise that, when you ask it to find something based on an identifier you tell it is ‘California’, the query should actually be based on ‘CA’ rather than what you told it, then its failure to do so is not really the fault of the LLM.
Agreed. If one compares ChatGPT to, say, the Cline IDE plugin backed by Claude 3.7, they might well be blown away by how far behind ChatGPT seems. A lot of the difference has to do with prompting, for sure -- Cline helps there by generating prompts from your IDE and project context automatically.
Every once in a while I send a query off to ChatGPT and I'm often disappointed and jam on the "this was hallucinated" feedback button (or whatever it is called). I have better luck with Claude's chat interface but nowhere near the quality of response that I get with Cline driving.
I want to sit next to you and stop you every time you use your LLM and say, “Let me just carefully check this output.” I bet you wouldn’t like that. But when I want to do high quality work, I MUST take that time and carefully review and test.
What I am seeing is fanboys who offer me examples of things working well that fail any close scrutiny— with the occasional example that comes out actually working well.
I agree that for prototyping unimportant code LLMs do work well. I definitely get to unimportant point B from point A much more quickly when trying to write something unfamiliar.
What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time? Nobody knows! A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result. If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.
> What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time?
Benchmarks could track that too - I don't know if they do, but that information should actually be available and easy to get.
When models are scored on e.g. "pass@10", i.e. passing the challenge within 10 attempts, and the benchmark is rerun periodically, that literally produces the information you're asking for: how frequently a given model fails at a particular task.
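In the simplest form that's just re-running the same task many times and counting, something like this sketch (run_task and the example task are hypothetical placeholders, not any real benchmark's API):

    # run_task is a hypothetical callable that asks the model to attempt one
    # task and returns True if the attempt passed verification, False otherwise.
    def estimate_failure_rate(run_task, task: str, attempts: int = 50) -> float:
        """Re-run a single task many times and report the observed failure fraction."""
        failures = sum(1 for _ in range(attempts) if not run_task(task))
        return failures / attempts

    # e.g. estimate_failure_rate(run_task, "port this shell script to Ansible")
    # might report 0.02 for one model and 0.2 for another.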
> A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result.
For many tasks, validating a solution is order of magnitudes easier and cheaper than finding the solution in the first place. For those tasks, LLMs are very useful.
> If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if it's output is correct? No LLM proponent seems to want to answer this question.
How can you be sure whether a human you're asking isn't hallucinating/guessing the answer, or straight up bullshitting you? Apply the same approach to LLMs as you apply to navigating this problem with humans - for example, don't ask it to solve high-consequence problems in areas where you can't evaluate proposed solutions quickly.
I think part of it is that, from eons of experience, we have a pretty good handle on what kinds of mistakes humans make and how. If you hire a competent accountant, he might make a mistake like entering an expense under the wrong category. And since he's watching for mistakes like that, he can double-check (and so can you) without literally checking all his work. He's not going to "hallucinate" an expense that you never gave him, or put something in a category he just made up.
I asked Gemini for the lyrics to a song that I knew was on all the lyrics sites. To make a long story short, it gave me the wrong lyrics three times, apparently making up new ones the last two times. Someone here said LLMs may not be allowed to look at those sites for copyright reasons, which is fair enough; but then it should have just said so, not "pretended" it was giving me the right answer.
I have a python script that processes a CSV file every day, using DictReader. This morning it failed, because the people making the CSV changed it to add four extra lines above the header line, so DictReader was getting its headers from the wrong line. I did a search and found the fix on Stack Overflow, no big deal, and it had the upvotes to suggest I could trust the answer. I'm sure an LLM could have told me the answer, but then I would have needed to do the search anyway to confirm it--or simply implemented it, and if it worked, assume it would keep working and not cause other problems.
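Presumably the fix was something along these lines; the filename is a placeholder, the "four extra lines" comes from the anecdote, and this is only a guess at the Stack Overflow answer rather than a quote of it.

    import csv

    # The CSV export now carries four junk lines before the header row, so skip
    # them before handing the file object to DictReader.
    with open("daily_export.csv", newline="") as f:
        for _ in range(4):
            next(f)  # consume the extra lines so DictReader reads the real header
        reader = csv.DictReader(f)
        rows = list(reader)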
That was just a two-line fix, easy enough to try out and see if it worked - and guess what, it worked. I can't imagine implementing a 100-line fix and assuming the best.
It seems to me that some people are saying, "It gives me the right thing X% of the time, which saves me enough developer time (mine or someone else's) that it's worth the other (100-X)% of the time when it gives me garbage that takes extra time to fix." And that may be a fair trade for some folks. I just haven't found situations where it is for me.
Even better than the whole 'unknown, fluctuating and non-deterministic rates of failure' problem is the whole 'agentic' shtick. People proposing to chain together these fluctuating plausibility engines should study probability theory a bit more deeply to understand just what they are in for with these Rube Goldberg machines of text continuation.
I think it’s very odd that you think that people using LLMs regularly aren’t carefully checking the outputs. Why do you think that people using LLMs don’t care about their work?
> invented references that just don't exist"...all I can say is "press X to doubt
This doesn’t include lying and cheating, which LLMs can’t do.
On the other hand, AI is being used to solve problems that are already solved. I recently got an ad for process-modeling software where one claim was that you don't always need to start from the ground up, but can tell the AI "give me the customer order process" and start from that point. That is basically what templates are for, with much less energy consumption.
I've noticed there seems to be a gatekeeping archetype that operates as a hard cynic to nearly everything, so that when they finally judge something positively they get heaps of attention.
It doesn't always correlate with narcissism, but it happens much more than chance.
>A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
Yes, somewhat. It's good for powershell/bash/cmd scripts and configs, though early models would hallucinate PowerShell cmdlets especially.
One thing I think is clear is society is now using a lot of words to describe things when the words being used are completely devoid of the necessary context. It's like calling a powder you've added to water "juice" and also freshly-squeezed fruit just picked perfectly ripe off a tree "juice". A word stretched like that becomes nearly devoid of meaning.
"I write code all day with LLMs, it's amazing!" is in the exact same category. The code you (general you, I'm not picking on you in particular) write using LLMs, and the code I write apart from LLMs: they are not the same. They are categorically different artifacts.
All fun and games until your AI-generated script deletes the production database. I think that's the point: the tolerance for faults in academic and financial settings is too low for LLMs to be useful.
The point is that given the current valuations, being good at a bunch of narrow use cases is just not good enough. It needs to be able to replace humans in every role where the primary output is text or speech to meet expectations.
I don't think that "replacing humans in every role" is the line for "being bullish on AI models". I think they could stop development exactly where they are, and they would still make pretty dramatic improvements to productivity in a lot of places. For me at least, their value already exceeds the $20/month I'm paying, and I'm pretty sure that way more than covers inference costs.
> I think they could stop development exactly where they are, and they would still make pretty dramatic improvements to productivity in a lot of places.
Yup. Not to mention, we don't even have time to figure out how to effectively work with one generation of models before the next generation of models gets released and raises the bar. If development stopped right now, I'd still expect LLMs to get better for years, as people slowly figure out how to use them well.
Completely agree. As is, Cursor and ChatGPT and even Bing Image Create (for free generation of shoddy ideas, styles, concepts, etc) are very useful to me. In fact, it would suit me if everything stalled at this point rather than improve to the point that everyone can catch up in how they use AI.
The most interesting thing about this post is how it reinforces how terrible the usability of LLMs still is today:
"I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error. I Google for the alleged quote, it doesn't exist. They reference a scientific publication, I look it up, it doesn't exist."
To experienced LLM users that's not surprising at all - providing citations, sources for quotes, useful URLs are all things that they are demonstrably terrible at.
But it's a computer! Telling people "this advanced computer system cannot reliably look up facts" goes against everything computers have been good at for the last 40+ years.
One of the things that’s hard about these discussions is that behind them is an obscene amount of money and hype. She’s not responding to realists like you. She’s responding to the bulls. The people saying these tools will be able to run the world by the end of this year, maybe next.
And that’s honestly unfair to you since you do awesome realistic and level headed work with LLM.
But I think it’s important when having discussions to understand the context within which they are occurring.
Without the bulls she might very well be saying what you are in your last paragraph. But because of the bulls the conversation becomes this insane stratified nonsense.
Possibly a reaction to Bill Gates' recent statements that it will begin replacing doctors and teachers. It's ridiculous to say LLMs are incredibly useful and valuable. It's highly dubious to think they can be trusted with actual critical tasks without careful supervision.
I think it's ridiculous to say LLMs are NOT "incredibly useful and valuable", but I 100% agree that it's "highly dubious to think they can be trusted with actual critical tasks without careful supervision".
It's honestly so scary, because Sam Altman and his ilk would gladly replace all teachers with LLMs right now, because it makes their lines go up; it doesn't matter to them that it would result in a generation of dumb people in like 10 years. Honestly, it would just create more LLM users for them to sell to, so it's a win-win I guess, but it completely fucks up our world.
Teachers are there to observe and manage behavior, resolve conflict, identify psychological risks and get in front of fixing them, set and maintain a positive tone (“setting the weather”), lift pupils up to that tone, and to summarize, assess and report on progress.
They are also there to grind through papers, tests, lesson plans, reports, marking, and letter writing. All of that will get easier with machine assistance.
Teaching is one of the most human-nature centric jobs in the world and will be the last to go. If AI can help focus the role of teacher more on using expert people skills and less on drudgery it will hopefully even improve the prospects of teaching as a career, not eliminate it.
This isn't really a problem in tool-assisted LLMs.
Use google AI studio with search grounding. Provides correct links and citations every time. Other companies have similar search modes, but you have to enable those settings if you want good results.
How would that ever work? The only thing you can do is continue to refine high quality data sets to train on. The rate of hallucination only trends downwards on the high end models as they improve in various ways.
I become more and more convinced with each of these tweets/blogs/threads that using LLMs well is a skill set akin to using Search well.
It’s been a common mantra - at least in my bubble of technologists - that a good majority of the software engineering skill set is knowing how to search well. Knowing when search is the right tool, how to format a query, how to peruse the results and find the useful ones, what results indicate a bad query you should adjust… these all sort of become second nature the longer you’ve been using Search, but I also have noticed them as an obvious difference between people that are tech-adept vs not.
LLMs seem to have a very similar usability pattern. They’re not always the right tool, and are crippled by bad prompting. Even with good prompting, you need to know how to notice good results vs bad, how to cherry-pick and refine the useful bits, and have a sense for when to start over with a fresh prompt. And none of this is really _hard_ - just like Search, none of us need to go take a course on prompting - IMO folks just need to engage with LLMs as a non-perfect tool they are learning how to wield.
The fact that we have to learn a tool doesn’t make it a bad one. The fact that a tool doesn’t always get it 100% on the first try doesn’t make it useless. I strip a lot of screws with my screwdriver, but I don’t blame the screwdriver.
I don't know if she is a fraud, but she has definitely greatly amplified Rage Bait Farming and talking about things that are far outside of her domain of expertise as if she were an expert.
In no way am I credentialing her - lots of people can make astute observations about things they weren't trained in - but she has mastered both sounding authoritative and, at the same time, presenting things to get the most engagement possible.
I've frequently heard that once you get sucked into the YouTube algorithm, you have to keep making content to maintain rankings.
This trap reminds me of the Perry Bible Fellowship comic "Catch Phrase" which has been removed for being too dark but can still be found with a search.
Thanks for sharing this. I was heavily involved in graduate physics when I was in school, and was very worried about what direction she'd take after the first big viral vid "telling her story." I wasn't sure it was well understood, or even understood at all, how blinkered her...viewpoint?...was.
LLMs function as a new kind of search engine, one that is amazingly useful because it can surface things that traditional search could never dream of. Don't know the name of a concept, just describe it vaguely and the LLM will pull out the term. Are you not sure what kind of information even goes into a cover letter or what's customary to talk about? Ask an LLM to write you one, it will be bland and generic sure but that's not the point because you now know the "shape" of what they're supposed to look like and that's great for getting unblocked. Have you stumbled across a passage of text that's almost English but you're not really sure what to look up to decipher it? Paste it into the LLM and it will tell you that it's "Early Modern English" which you can look up to confirm and get a dictionary for.
Broader than that, it’s critical thinking skills. Using search and LLMs requires analyzing the results and being able to separate what is accurate and useful from what isn’t.
From my experience this is less an application of critical skills and more a domain knowledge check. If you know enough about the subject to have accumulated heuristics for correctness and intuition for "lgtm" in the specific context, then it's not very difficult or intellectually demanding to apply them.
If you don't have that experience in this domain, you will spend approximately as much effort validating output as you would have creating it yourself, but the process is less demanding of your critical skills.
No, it is critical thinking skills, because the LLMs can teach you the domain, but you have to then understand what they are saying enough to tell if they are bsing you.
> If you don't have that experience in this domain, you will spend approximately as much effort validating output as you would have creating it yourself
Not true.
LLMs are amazing tutors. You have to use outside information, they test you, you test them, but they aren't pathologically wrong in the way that they are trying to do a Gaussian magic smoke psyop against you.
Knowledge certainly helps, but I’m talking about something more fundamental: your bullshit detector.
Even when you lack subject matter expertise about something, there are certain universal red flags that skeptics key in on. One of the biggest ones is: “There’s no such thing as a free lunch” and its corollary: “If it sounds too good to be true, it probably is.”
I'm not so sure about that. I was really Anti llm in the previous generation of LLMs (GPT3.5/4) but never stopped trying them out. I just found the results to be subpar.
Since reasoning models came about I've been significantly more bullish on them purely because they are less bad. They are still not amazing but they are at a poiny where I feel like including them in my workflow isn't an impediment.
They can now reliably complete a subset of tasks without me needing to rewrite large chunks of it myself.
They are still pretty terrible at edge cases (uncommon patterns/libraries etc.), but when on the beaten path they can actually pretty decently improve productivity. I still don't think 10x (well, today was the first time I felt a 10x improvement, but I was moving frontend code from a custom framework to React - more tedium than anything else, and the AI did a spectacular job).
You're using them wrong. Everyone is, though, so I can't fault you specifically. A chatbot is about the worst possible application of these technologies.
Of late, deaf tech forums have been taken over by language model debates over which works best for speech transcription. (Multimodal language models are the state of the art in machine transcription. Everyone seems to forget that when complaining they can't cite sources for scientific papers yet.) The debates are at the point where it's become annoying how much space the topic has taken over, just like it has here on HN.
But then I remember, oh yeah, there was no such thing as live machine transcription ten years ago. And now there is. And it's going to continue to get better. It's already good enough to be very useful in many situations. I have elsewhere complained about the faults of AI models for machine transcription - in particular when they make mistakes they tend to hallucinate something that is superficially grammatical and coherent instead - but for a single phrase in an audio transcription sporadically that's sometimes tolerable. In many cases you still want a human transcriber but the cost of that means that the amount of transcription needed can never be satisfied.
It's a revolutionary technology. I think in a few years I'm going have glasses that continuously narrate the sounds around me and transcribe speech and it's going to be so good I can probably "pass" as a hearing person in some contexts. It's hard not to get a bit giddy and carried away sometimes.
> You're using them wrong. Everyone is, though, so I can't fault you specifically.
If everyone is using them wrong, I would argue that says something more about them than the users. Chat-based interfaces are the thing that kicked LLMs into the mainstream consciousness and started the cycle/trajectory we’re on now. If this is the wrong use case, everything the author said is still true.
There are still applications made better by LLMs, but they are a far cry from AGI/ASI in terms of being all-knowing problem solvers that don’t make mistakes. Language tasks like transcription and translation are valuable, but by no stretch do they account for the billions of dollars of spend on these platforms, I would argue.
LLM providers actually have an incentive not to write literature on how to use LLMs optimally, as that causes friction which means less engagement/money spent on the provider. There's also the typical tin-foil-hat explanation of "it's bad so you'll keep retrying to get the LLM to work, which means more money for us."
Isn't this more a product of the hype though? At worst you're describing a product marketing mistake, not some fundamental shortcoming of the tech. As you say "chat" isn't a use case, it's a language-based interface. The use case is language prediction, not an encyclopedic storage and recall of facts and specific quotes. If you are trying to get specific facts out of an LLM, you'd better be using it as an interface that accesses some other persistent knowledge store, which has been incorporated into all the major 'chat' products by now.
Surely you're not saying everyone is using them wrong. Let's say only 99% of them are using LLMs wrong, and the remaining 1% creates $100B of economic value. That's $100B of upside.
Yes the costs of training AI models these days are really high too, but now we're just making a quantitative argument, not a qualitative one.
The fact that we've discovered a near-magical tech that everyone wants to experiment with in various contexts, is evidence that the tech is probably going somewhere.
Historically speaking, I don't think any scientific invention or technology has been adopted and experimented with so quickly and on such a massive scale as LLMs.
It's crazy that people like you dismiss the tech simply because people want to experiment with it. It's like some of you are against scientific experimentation for some reason.
I think all the technology is already in place. There are already smart glasses with tiny text displays. Also smartphones have more than enough processing capacity to handle live speech transcription.
Thru the 90s and 00s and well into the 10s I generally dismissed speech recognition as useless to me, personally.
I have a minor speech impediment because of the hearing loss. They never worked for me very well. I don't speak like a standard American - I have a regional accent and I have a speech impediment. Modern speech recognition doesn't seem to have a problem with that anymore.
IBM's ViaVoice from 1997 in particular was a major step. It was really impressive in a lot of ways but the accuracy rate was like 90 - 95% which in practice means editing major errors with almost every sentence. And that was for people who could speak clearly. It never worked for me very well.
You also needed to speak in an unnatural way [pause] comma [pause] and it would not be fair to say that it transcribed truly natural speech [pause] full stop
Such voice recognition systems before about 2016 also required training on the specific speaker. You would read many pages of text to the recognition engine to tune it to you specifically.
It could not just be pointed at the soundtrack to an old 1980s TV show then produce a time-sync'd set of captions accurate enough to enjoy the show. But that can be done now.
If there's one common thread across LLM criticisms, it's that they're not perfect.
These critics don't seem to have learned the lesson that the perfect is the enemy of the good.
I use ChatGPT all the time for academic research. Does it fabricate references? Absolutely, maybe about a third of the time. But has it pointed me to important research papers I might never have found otherwise? Absolutely.
The rate of inaccuracies and falsehoods doesn't matter. What matters is, is it saving you time and increasing your productivity. Verifying the accuracy of its statements is easy. While finding the knowledge it spits out in the first place is hard. The net balance is a huge positive.
People are bullish on LLM's because they can save you days' worth of work, like every day. My research productivity has gone way up with ChatGPT -- asking it to explain ideas, related concepts, relevant papers, and so forth. It's amazing.
> Verifying the accuracy of its statements is easy.
For single statements, sometimes, but not always. For all of the many statements, no. Having the human attention and discipline to mindfully verify every single one without fail? Impossible.
Every software product/process that assumes the user has superhuman vigilance is doomed to fail badly.
> Automation centaurs are great: they relieve humans of drudgework and let them focus on the creative and satisfying parts of their jobs. That's how AI-assisted coding is pitched [...]
> But a hallucinating AI is a terrible co-pilot. It's just good enough to get the job done much of the time, but it also sneakily inserts booby-traps that are statistically guaranteed to look as plausible as the good code (that's what a next-word-guessing program does: guesses the statistically most likely word).
> This turns AI-"assisted" coders into reverse centaurs. The AI can churn out code at superhuman speed, and you, the human in the loop, must maintain perfect vigilance and attention as you review that code, spotting the cleverly disguised hooks for malicious code that the AI can't be prevented from inserting into its code. As qntm writes, "code review [is] difficult relative to writing new code":
> Having the human attention and discipline to mindfully verify every single one without fail? Impossible.
I mean, how do you live life?
The people you talk to in your life say factually wrong things all the time.
How do you deal with it?
With common sense, a decent bullshit detector, and a healthy level of skepticism.
LLM's aren't calculators. You're not supposed to rely on them to give perfect answers. That would be crazy.
And I don't need to verify "every single statement". I just need to verify whichever part I need to use for something else. I can run the code it produces to see if it works. I can look up the reference to see if it exists. I can Google the particular fact to see if it's real. It's really very little effort. And the verification is orders of magnitude easier and faster than coming up with the information in the first place. Which is what makes LLM's so incredibly helpful.
> I just need to verify whichever part I need to use for something else. I can run the code it produces to see if it works. I can look up the reference to see if it exists. I can Google the particular fact to see if it's real. It's really very little effort. And the verification is orders of magnitude easier and faster than coming up with the information in the first place. Which is what makes LLM's so incredibly helpful.
Well put.
Especially this:
> I can run the code it produces to see if it works.
You can get it to generate tests (and easy ways for you to verify correctness).
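For concreteness, the verify-by-test loop can be as small as this. A minimal sketch in Python; chunk_list is a hypothetical LLM-generated function, not something from this thread:

    # Illustrative only: pretend chunk_list() came back from a prompt like
    # "split a list into n roughly equal contiguous parts".
    def chunk_list(items, n):
        """Split items into n contiguous chunks of near-equal size."""
        k, r = divmod(len(items), n)
        chunks, start = [], 0
        for i in range(n):
            end = start + k + (1 if i < r else 0)
            chunks.append(items[start:end])
            start = end
        return chunks

    def test_chunk_list():
        data = list(range(10))
        chunks = chunk_list(data, 4)
        assert len(chunks) == 4                     # right number of parts
        assert sum(chunks, []) == data              # nothing lost or reordered
        assert max(map(len, chunks)) - min(map(len, chunks)) <= 1  # balanced

    if __name__ == "__main__":
        test_chunk_list()
        print("ok")

If the generated code fails a test like this, you know immediately, which is the whole point of running it rather than trusting it.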
It's really funny how most anecdotes and comments about the utility and value of interacting with LLM's can be applied to anecdotes and comments about human beings themselves.
The majority of people haven't realized yet that consciousness is assumed by our society, and that we, in fact, don't know what it is or whether we have it - let alone whether we can ascribe it to another entity.
> Does it fabricate references? Absolutely, maybe about a third of the time
And you don't have concerns about that? What kind of damage is that doing to our society, long term, if we have a system that _everyone_ uses and it's just accepted that a third of the time it is just making shit up?
No, I don't. Because I know it does and it's incredibly easy to type something into Google Scholar and see if a reference exists.
Like, I can ask a friend and they'll mistakenly make up a reference. "Yeah, didn't so-and-so write a paper on that? Oh they didn't? Oh never mind, I must have been thinking of something else." Does that mean I should never ask my friend about anything ever again?
Nobody should be using these as sources of infallible truth. That's a bonkers attitude. We should be using them as insanely knowledgeable tutors who are sometimes wrong. Ask and then verify.
No, that doesn't mean you should never ask your friend things again if they make that mistake. But, if 30% of all their references are made up then you might start to question everything your friend says. And looking up references to every claim you're reading is not a productive use of time.
If my friend has a million times more knowledge than the average human being, then I'm willing to put up with a 30% error rate on references.
And I'm talking about references when doing deep academic research. Looking them up is absolutely a productive use of time -- I'm asking for the references so I can read them. I'm not asking for them for fun.
Remember, it's hundreds of times easier to verify information than it is to find it in the first place. That's the basic principle of what makes LLM's so incredibly valuable.
But how can you be sure the info is correct if it made up the reference? Where did it pull the info from? What good is a friend who's just bullshitting their way through every conversation, hoping you won't notice?
A third of the time is an insane number. If 30% of the code I wrote contained non-existent headers, I would have been fired long ago.
A person who's bullshitting their way doesn't get a 70% accuracy. For yes/no questions they'll get 50%. For open ended questions they'll be lucky to get 1%.
You're really underestimating the difficulty of getting 70% accuracy for general open-ended questions.
And while you might think you're better than 70%, I'm pretty sure that if you didn't run your code through compilers and linters, and test it at least a couple of times, your code wouldn't get anywhere near 70% correct.
Maybe I'm getting old, but sometimes it feels like everybody is young now and has only lived in a world where they can look up anything at a moment's notice, and now thinks whatever they look up is infallible.
Having lived a decent chunk of my life pre-internet, or at least fast and available internet, looking back at those days you realize just how often people were wrong about things. Old wives tales, made up statistics, imagined scenarios, people really do seem to confabulate a lot of information.
> And you don't have concerns about that? What kind of damage is that doing to our society, long term, if we have a system that _everyone_ uses and it's just accepted that a third of the time it is just making shit up?
Main problem with our society is that two thirds of what _everyone_ says is made up shit / motivated reasoning. The random errors LLMs make are relatively benign, because there is no motivation behind them. They are just noise. Look through them.
You are not a trusted authority relied on by millions and expected to make decisions for them, and you can choose not to say things you aren't sure you actually know.
Could it end up being a net benefit? Will the realistic-sounding but incorrect facts generated by AI make people engage with arguments more critically, and be less likely to believe random statements they're given?
Now, I don't know, or even think it is likely that this will happen, but I find it an interesting thought experiment.
That's hilarious; I had no idea it was that bad. And for every conscientious researcher who actually runs down all the references to separate the 2/3 good from the 1/3 bad, how many will just paste them in, adding to the already sky-high pile of garbage out there?
LLMs will spit out responses with zero backing with 100% conviction. People see citations and assume it's correct. We're conditioned for it thanks to... everything ever in history. Rarely do I need to check a wikipedia entry's source.
So why do people not understand this: it is absolutely going to pour jet fuel on misinformation in the world. And we as a society are allowed to hold a higher bar for what we'll accept being shoved down our throats by corporate overlords who want their VC payout.
The solution is to set expectations, not to throw away one of the most valuable tools ever created.
If you read a supermarket tabloid, do you think the stories about aliens are true? No, because you've been taught that tabloids are sensationalist. When you listen to campaign ads, do you think they're true? When you ask a buddy about geography halfway across the world, do you assume every answer they give is right?
It's just about having realistic expectations. And people tend to learn those fast.
> Rarely do I need to check a wikipedia entry's source.
I suggest you start. Wikipedia is full of citations that don't back up the text of the article. And that's when there are even citations to begin with. I can't count the number of times I've wanted to verify something on Wikipedia, and there either wasn't a citation, or there was one related to the topic but that didn't have anything related to the specific assertion being made.
I think many people are just not really good at dealing with "imperfect" tools. Different tools have different success probabilities; let's call that probability p. People typically use tools with p=100%, or at least very close to it. But an LLM is a tool that is far from that, so making use of it takes a different approach.
Imagine there is a probabilistic oracle that can answer any yes/no question correctly with probability p. If p=100% or p=0%, it is obviously very useful (at p=0% you just invert every answer). If p=50% it is absolutely worthless. In between, such an oracle can still be exploited in other ways - for example by repeating independent queries and taking a majority vote - and it remains a useful thing.
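A small simulation makes the point, under the (big) assumption that repeated queries fail independently: any p meaningfully above 50% can be amplified by majority voting, while p=50% cannot. This is a sketch, not anything from the comment above:

    import random

    def oracle(truth: bool, p: float) -> bool:
        """Answer correctly with probability p, otherwise flip the answer."""
        return truth if random.random() < p else not truth

    def majority_vote(truth: bool, p: float, n: int) -> bool:
        """Ask the oracle n times (n odd) and return the majority answer."""
        yes = sum(oracle(truth, p) for _ in range(n))
        return yes > n // 2

    def accuracy(p: float, n: int, trials: int = 100_000) -> float:
        truth = True
        return sum(majority_vote(truth, p, n) for _ in range(trials)) / trials

    if __name__ == "__main__":
        for n in (1, 5, 21):
            print(f"p=0.70, {n:>2} votes -> {accuracy(0.70, n):.3f}")
        # p=0.50 never improves, no matter how many votes you take
        print(f"p=0.50, 21 votes -> {accuracy(0.50, 21):.3f}")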
One of the magic things about engineering is that I can make usefulness out of unreliability. Voltage can fluctuate and I can transmit 1s and 0s, lines can fizz, machines can die, and I can reliably send video from one end to the other.
Unreliability is something we live in. It is the world. Controlling error, increasing signal over noise, extracting energy from the fluctuations. This is life, man. This is what we are.
I can use LLMs very effectively. I can use search engines very effectively. I can use computers.
Many others can’t. Imagine the sheer fortune to be born in the era where I was meant to be: tools transformative and powerful in my hands; useless in others’.
Your point reminded me of Terence Tao's point that AI has a "plausibility problem". When it can't be accurate, it still disguises itself as accurate.
Its true success rate is by no means 100%, and sometimes is 0%, but it always tries to make you feel confident.
I’ve had to catch myself surrendering too much judgment to it. I worry a high school kid learning to write will have fewer qualms about surrendering judgment.
A scientific instrument that is unreliably accurate is useless. Imagine a kitchen scale that was off by +/- 50% every 3rd time you used it. Or maybe every 5th time. Or every 2nd.
So we're trying to use tools like this to help solve deeper problems, and they aren't up to the task. We're still at the point where we need to start over and get better tools. Sharpening a bronze knife will never make it as sharp, or keep an edge as long, as a steel knife. Same basic elements, very different material.
A bad analogy doesn't make a good argument. The best analogy for LLMs is probably a librarian on LSD in a giant library. They will point you in a direction if you have a question. Sometimes they will pull up the exact page you need, sometimes they will lead you somewhere completely wrong and confidently hand you a fantasy novel, trying to convince you it's a real science book.
It's completely up to your ability to both find what you need without them and verify the information they give you to evaluate their usefulness. If you put that on a matrix, this makes them useful in the quadrant of information that is both hard to find, but very easy to verify. Which at least in my daily work is a reasonable amount.
I think people confuse the power of the technology with the very real bubble we’re living in.
There’s no question that we’re in a bubble which will eventually subside, probably in a “dot com” bust kind of way.
But let me tell you…last month I sent several hundred million requests to AI, as a single developer, and got exactly what I needed.
Three things are happening at once in this industry…
(1) executives are overpromising a literal unicorn with AGI, which is totally unnecessary for the ongoing viability of LLM's and is pumping the bubble.
(2) the technology is improving and delivery costs are changing as we figure out what works and who will pay.
(3) the industry’s instincts are developing, so it’s common for people to think “AI” can do something it absolutely cannot do today.
But again…as one guy, for a few thousand dollars, I sent hundreds of millions of requests to AI that are generating a lot of value for me and my team.
Our instincts have a long way to go before we’ve collectively internalized the fact that one person can do that.
That's exactly what happened – I called the OpenAI API, using custom application code running on a server, a few hundred million times.
It is trivial for a server to send/receive 150 requests per second to the API.
This is what I mean by instincts...we're used to thinking of developers-pressing-keys as a fundamental bottleneck, and it still is to a point. But as soon as the tracks are laid for the AI to "work", things go from speed-of-human-thought to speed-of-light.
A lot of people are feeding all the email and slack messages for entire companies through AI to classify sentiment (positive, negative, neutral etc), or summarize it for natural language search using a specific dictionary. You can process each message multiple ways for all sorts of things, or classify images. There's a lot of uses for the smaller cheaper faster llms
In general terms, we had to call the OpenAI API X00,000,000 times for a large-scale data processing task. We ended up with about 2,000,000 records in a database, using data created, classified, and cleaned by the AI.
There were multiple steps involved, so each individual record was the result of many round trips between the AI and the server, and not all calls are 1-to-1 with a record.
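For what it's worth, the plumbing for that kind of throughput is not exotic. A minimal sketch using the official openai Python SDK; the model name, prompts, and concurrency cap are placeholders rather than details from this comment, and a real pipeline would add retries and backoff:

    import asyncio
    from openai import AsyncOpenAI  # pip install openai

    client = AsyncOpenAI()   # reads OPENAI_API_KEY from the environment
    MAX_IN_FLIGHT = 150      # placeholder concurrency cap
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def classify(record: str) -> str:
        """One round trip: ask the model to classify/clean a single record."""
        async with sem:
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[
                    {"role": "system", "content": "Classify this record; reply with one label."},
                    {"role": "user", "content": record},
                ],
            )
            return resp.choices[0].message.content

    async def main(records: list[str]) -> list[str]:
        return await asyncio.gather(*(classify(r) for r in records))

    if __name__ == "__main__":
        labels = asyncio.run(main(["example record 1", "example record 2"]))
        print(labels)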
None of this is rocket science, and I think any average developer could pull off a similar task given enough time...but I was the only developer involved in the process.
The end product is being sold to companies who benefit from the data we produced, hence "value for me and the team."
The real point is that generative AI can, under the right circumstances, create absurd amounts of "productivity" that wouldn't have been possible otherwise.
My experience is starkly different. Today I used LLMs to:
1. Write python code for a new type of loss function I was considering
2. Perform lots of annoying CSV munging ("split this CSV into 4 equal parts", "convert paths in this column into absolute paths", "combine these and then split into 4 distinct subsets based on this field.." - they're great for that)
3. Expedite some basic shell operations like "generate softlinks for 100 randomly selected files in this directory"
4. Generate some summary plots of the data in the files I was working with
5. Not to mention extensive use in Cursor & GH Copilot
The tool (Claude 3.7 mostly, integrated with my shell so it can execute shell commands and run python locally) worked great in all cases. Yes, I could've done most of it myself, but I personally hate CSV munging and bulk file manipulations, and it's super nice to delegate that stuff to an LLM agent (a rough sketch of that kind of gruntwork follows below).
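To show the flavor of what gets generated for chores like items 2 and 3, here is a minimal Python sketch; the file and directory names are made up, not from the comment above:

    import csv
    import os
    import random
    from pathlib import Path

    def split_csv(path: str, parts: int = 4) -> None:
        """Write part_0.csv .. part_3.csv, each with the header plus ~1/4 of the rows."""
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        size = -(-len(body) // parts)  # ceiling division
        for i in range(parts):
            with open(f"part_{i}.csv", "w", newline="") as out:
                w = csv.writer(out)
                w.writerow(header)
                w.writerows(body[i * size:(i + 1) * size])

    def link_random_files(src_dir: str, dst_dir: str, n: int = 100) -> None:
        """Create symlinks in dst_dir to n randomly chosen files from src_dir."""
        files = [p for p in Path(src_dir).iterdir() if p.is_file()]
        Path(dst_dir).mkdir(exist_ok=True)
        for p in random.sample(files, min(n, len(files))):
            os.symlink(p.resolve(), Path(dst_dir) / p.name)

    if __name__ == "__main__":
        split_csv("data.csv")                     # hypothetical input file
        link_random_files("all_files", "sample")  # hypothetical directories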
These seem like fine use cases: trivial boilerplate stuff you’d otherwise have to search for and then munge to fit your exact need. An LLM can often do both steps for you. If it doesn’t work, you’ll know immediately and you can probably figure out whether it’s a quick fix or if the LLM is completely off-base.
> When something was impossible only 3 years ago, barely worked 2 years ago, but works well now
What exactly are you referring to? What are you saying works well now that did not work years ago? Claude as a milestone of code writing?
Also, in that case: if the current apparent successes are coming from a realm of tentative responses, we would need proof that the unreliable has become reliable. The observer will say, "they were tentative before, they often look tentative now - why should we think they will cross the threshold into a radical change?"
I hacked something together a while back - a hotkey toggles between standard terminal mode and LLM mode. LLM mode interacts with Claude, and has functions / tool calls to run shell commands, python code, web search, clipboard, and a few other things. For routine data science tasks it's been super useful. Claude 3.7 was a big step forward because it will often examine files before it begins manipulating them and double-checks that things were done correctly afterwards (without prompting!). For me this works a lot better than other shell-integration solutions like Warp
I’ve been using Claude a lot lately, and I must say I very much disagree.
For example, the other day I was chatting with it about the health risks associated with my high consumption of farmed salmon. It then generated a small program to simulate the accumulation of PCBs in my body. I could review the program, ask questions about the assumptions, etc. It all seemed very reasonable. It called it a toxicokinetic analysis.
It then struck me how immensely valuable this is to a curious and inquisitive mind. This is essentially my gold standard of intelligence: take a complex question and break it down in a logical way, explaining every step of the reasoning process to me, and be willing to revise the analysis if I point out errors / weaknesses.
Now try that with your doctor. ;)
Can it make mistakes? Sure, but so can your doctor. The main difference is that here the responsibility is clearly on you. If you do not feel comfortable reviewing the reasoning then you shouldn’t trust it.
With an LLM it should be "Don't trust, verify" - but it isn't that hard to verify LLM claims, just ask it for original sources.
Compare to ye olde scientific calculators of the 90s: they were allowed in tests because even though they could solve equations, they couldn't show the work. And showing the work was 90% of the score. At best you could use one to verify your solution.
But then tech progressed and now calculators can solve equations step by step -> banned from tests at school.
I don’t mind to be honest. I don’t expect more intelligence than that from my doctor either. I want them to identify the relevant science and regurgitate / apply it.
>It's not intelligence mate, it's just copying an existing program.
Isn't the 'intelligence' part, the bit that gets a previously-constructed 'thing', and makes it work in 'situation'.
Pretty sure that's how humans work, too.
So many people put up expectations about models just to knock them down. There are infinite reasons to critique them.
Please dispense with anyone's "expectations" when critiquing things! (Expectations are not a fault or property of the object of the expectations.)
Today's models (1) do things that are unprecedented. Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, in real time (and faster if we want), is beyond any other artifact in existence. Including humans.
They are (2) progressing quickly. AI has been an active field (even through its famous "winters") for several decades, and they have never moved forward this fast.
Finally and most importantly (3), many people, including myself, continue to find serious new uses for them in daily work, that no other tech or sea of human assistants could replace cost effectively.
The only way I can make sense out of anyone's disappointment is to assume they simply haven't found the right way to use them for themselves. Or are unable to fathom that what is not useful for them is useful for others.
They are incredibly flexible tools, which means a lot of value, idiosyncratic to each user, only gets discovered over time with use and exploration.
That they have many limits isn't surprising. What doesn't? Who doesn't? Zeus help us the day AI doesn't have obvious limits to complain about.
> Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, is beyond any other artifact in existence
Very well said. That’s perhaps the area where I have found LLMs most useful lately. For several years, I have been trying to find a solution to a complex and unique problem involving the laws of two countries, financial issues, and my particular individual situation. No amount of Googling could find an answer, and I was unable to find a professional consultant whose expertise spans the various domains. I explained the problem in detail to OpenAI’s Deep Research, and six minutes later it produced a 20-page report—with references that all checked out—clearly explaining my possible options, the arguments for and against each, and why one of those options was probably best. It probably saved me thousands of dollars.
Are they progressing quickly? Or was there a step-function leap about 2 years ago, and incremental improvements since then?
I tried using AI coding assistants. My longest stint was 4 months with Copilot. It sucked. At its best, it does the same job as IntelliSense but slower. Other times it insisted on trying to autofill 25 lines of nonsense I didn't ask for. All the time I saved using Copilot was lost debugging the garbage Copilot wrote.
Perplexity was nice to bounce plot ideas off of for a game I'm working on... until I kept asking for more and found that it'll only generate the same ~20ish ideas over and over, rephrased every time, and half the ideas are stupid.
The only use case that continues to pique my interest is Notion's AI summary tool. That seems like a genuinely useful application, though it remains to be seen if these sorts of "sidecar" services will justify their energy costs anytime soon.
Now, I ask: if these aren't the "right" use cases for LLMs, then what is, and why do these companies keep putting out products that aren't the "right" use case?
Have you tried it recently? o3-mini-high is really impressive. If you ease into talking to it about your intent and outline the possible edge and corner cases, it will write nuanced Rust code 1000 lines at a time, no problem.
My anecdotal experience is similar. For any important or hard technical questions relevant to anything I do, the LLM results are consistently trash. And if you are an expert in the domain you can’t not notice this.
On the other hand, for trivial technical problems with well known solutions, LLMs are great. But those are in many senses the low value problems; you can throw human bodies against that question cheaply. And honestly, before Google results became total rubbish, you could just Google it.
I try to use LLMs for various purposes. In almost all cases where I bother to use them, which are usually subject matters I care about, the results are poorer than I can quickly produce myself because I care enough to be semi-competent at it.
I can sort of understand the kinds of roles that LLMs might replace in the next few years, but there are many roles where it isn’t even close. They are useless in domains with minimal training data.
>For any important or hard technical questions relevant to anything I do, the LLM results are consistently trash. And if you are an expert in the domain you can’t not notice this.
This is also my experience. My day job isn't programming, but when I can feed an LLM secretarial work, or simple coding prompts to automate some work, it does great and saves me time.
Most of my day is spent getting into the details of things for which there's no real precedent, or if there is, it hasn't been widely published on. LLMs are frustratingly useless for these problems.
Because it’s not a scientific research tool, it’s a most likely next text generator. It doesn’t keep a database of ingested information with source URLs. There are plenty of scientific research tools but something that just outputs text based on your input is no good for it.
I’m sure that in the future there will be a really good search tool that utilises an LLM but for now a plain model just isn’t designed for that. There are a ton of other uses for them, so I don’t think that we should discount them entirely based on their ability to output citations.
I think that to understand the diversity of opinions, we have to recognize a few different categories of users:
Category 1: people who don't like to admit that anything trendy can also be good at what it does.
Category 2: people who don't like to admit that anything made by for-profit tech companies can also be good at what it does.
Category 3: people who don't like to admit that anything can write code better than them.
Category 4: people who don't like to admit that anything which may put out of work people who didn't deserve to be put out of work, and who already earn less than the people creating the thing, can also be good at what it does
Category 5: people who aren't using LLMs for things they are good at
Category 6: people who can't bring themselves to communicate with AIs with any degree of humility
Category 7: people to whom none of the above applies
I have a bunch of friends who don't get along well with each other, but I tend to get along with all of them. I believe this is about focusing on the good in people, and being able to ignore the bad. I think it's the same with tools. To me AI is an out-of-this-world, OP tool. Is it perfect? No. But it's amazing! The good I get out of it far surpasses its mistakes. Almost like people. People "hallucinate" and say wrong things all the time. But that doesn't make them useless or bad. So, whoever is having issues with AIs is probably having an issue dealing with people as well :) Learn how to deal with people, and learn how to deal with AI -- the single biggest skill you'll need in the 21st century.
Not trying to dismiss your anecdote, as both can very much coexist separately, and I actually think your analogy is spot on for LLMs! But it did make me think of something.
I once was in an environment where A got along with everyone, and B was hated by everyone else except for A. This wasn't because A saw qualities in B that no one else recognized; it was just that A was oblivious to/wasn't personally affected by all the valid reasons why everyone else disliked B. A to an extent thought of themselves as being able to see the good in B, but in reality they simply lacked the understanding of the effects of B's behavior on others.
Agree. Had the same thought when I saw her complaint about getting the publication year of the article wrong. If she had a grad student who could read, understand and summarize the article well, but inadvertently said it was from 2023 instead of 2025, (hopefully) she wouldn’t call that grad student unintelligent.
In general, LLMs have made many areas worse. Now you see people writing content using LLMs without understanding the content itself. It becomes really annoying, especially if you don't know this, ask "did you perhaps write this using an LLM?", and get a "yes" answer.
In programming circles it's also annoying when you try to help and you get fed garbage outputted by LLMs.
I believe models for generating visuals (image, video, and sound generation) are much more interesting, as that's an area where errors don't matter as much. Though the ethics of how these models have been trained is another matter.
The equivalent trope of this as recent as 5 years back would have been the lazy junior engineer copying code from Stackoverflow without fully grokking it.
I feel humans should be held to account for the work they produce irrespective of the tools they used to produce it.
The junior engineer who copied code he didn't understand from Stackoverflow should face the consequences as much as the engineer who used LLM generated code without understanding it.
I think the disconnect is that people who produce serious, researched and referenced work tend to believe that most work is like that. It is not. The majority of content created by humans is not referenced, it's not deeply researched, and a lot of it isn't even read, at least not closely. It sends a message just by existing in a particular place at a particular time. And its creation gainfully employs millions of people, which of course costs millions of dollars. That's why, warts and all, people are bullish on LLMs.
They are bullish for the right reasons. Those reasons are not the same ones you think of. They are betting that humans would keep getting addicted to more and more technological crutches and assistants as they keep inviting more workload onto their minds and bodies. There is no going back with this trend.
Why do we burden ourselves with such expectations? Look at cities like Dallas. It is designed for cars, not for humans walking. The buildings are far apart, workplaces are far away from homes, and everything looks like it was designed for some King Kong-like creatures.
The burden of expectations on humans is driven by technology. Technology makes you work harder than before. It didn't make your life easier. Check how hectic life has become for you now versus a laid-back village peasant a century back.
The bullishness on LLMs is a bet on this trend of self-inflicted human agony and dependency on tech. Man is going back to the cradle, and LLMs supply the milk bottle.
Write code to pull down a significant amount of public data using an open API. (That took about 30 seconds - I just gave it the swagger file and said “here’s what I want”)
Get the data (an hour or so), clean the data (barely any time, gave it some samples, it wrote the code), used the cleaned data to query another API, combined the data sources, pulled down a bunch of PDFs relating to the data, had the AI write code to use tesseract to extract data from the PDFs, and used that to build a dashboard. That’s a mini product for my users.
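The PDF-to-data step in a workflow like that is mostly glue code. A rough sketch assuming pdf2image and pytesseract (both need system binaries installed: Poppler and Tesseract); the file name is illustrative:

    from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)
    import pytesseract                       # pip install pytesseract (needs tesseract)

    def ocr_pdf(path: str) -> str:
        """Render each PDF page to an image and OCR it, returning the combined text."""
        pages = convert_from_path(path, dpi=300)
        return "\n".join(pytesseract.image_to_string(page) for page in pages)

    if __name__ == "__main__":
        text = ocr_pdf("report.pdf")  # hypothetical file
        print(text[:500])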
I also had a play with Mistral’s OCR and have tested a few things using that against the data. When I was out walking my dogs I thought about that more, and have come up with a nice workflow for a problem I had, which I’ll test in more detail next week.
That was all while doing an entirely different series of tasks, on calls, in meetings. I literally checked the progress a few times and wrote a new prompt or copy/pasted some stuff in from dev tools.
For the calls I was on, I took the recordings of those calls, passed them into my local instance of Whisper, fed the transcript into Claude with a prompt I use to extract action points, pasted those into a Google doc, and circulated them.
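That transcript-to-action-points pass is roughly two calls. A sketch assuming the open-source whisper package and the anthropic SDK; the model names, prompt wording, and file name are placeholders, not the commenter's actual setup:

    import whisper      # pip install openai-whisper (needs ffmpeg)
    import anthropic    # pip install anthropic

    def transcribe(audio_path: str) -> str:
        """Run local speech-to-text on a call recording."""
        model = whisper.load_model("base")  # placeholder model size
        return model.transcribe(audio_path)["text"]

    def action_points(transcript: str) -> str:
        """Ask Claude to pull action points out of the transcript."""
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        msg = client.messages.create(
            model="claude-3-7-sonnet-latest",  # placeholder model name
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": "List the action points from this call transcript, "
                           "with owners where stated:\n\n" + transcript,
            }],
        )
        return msg.content[0].text

    if __name__ == "__main__":
        print(action_points(transcribe("call.mp3")))  # hypothetical recording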
One of the calls was an interview with an expert. The transcript + another prompt has given me the basis for an article (bulleted narrative + key quotes) - I will refine that tomorrow, and write the article, using a detailed prompt based on my own writing style and tone.
I needed to gather data for a project I’m involved in, so had Claude write a handful of scrapers for me (HTML source > here is what I need).
I downloaded two podcasts I need to listen to - but only need to listen to five minutes of each - and fed them into whisper then found the exact bits I needed and read the extracts rather than listening to tedious podcast waffle.
I turned an article I’d written into an audio file using elevenlabs, as a test for something a client asked me about earlier this week.
I achieved about three times as much today as I would have done a year ago. And finished work at 3pm.
So yeah, I don’t understand why people are so bullish about LLMs. Who knows?
Yeah, they are not “reading recycled LLM content”, no. The dashboard in question presents data from PDFs. They are very happy with being able to explore that data.
So much about this seems inauthentic. The post itself. The experience. The content produced. I wouldn’t like to be on the other end of the production of this content.
This just sounds like a normal day for someone who does research and analysis in 2025.
Where do you think expert analysis comes from?
Talk to experts, gather data, synthesize, output. Researchers have been doing this for a long time. There's a lot of grunt work LLM's can really help with, like writing scripts to collect data from webpages.
However, as this thread demonstrates repeatedly, using LLMs effectively is about knowing what questions to ask, and what to put into the LLM alongside the questions.
The people who pay me to do what I do could do it themselves, but they choose to pay me to do it for them because I have knowledge they don’t have, I can join the dots between things that they can’t, and I have access to people they don’t have access to.
AI won’t change any of that - but it allows me to do a lot more work a lot more quickly, with more impact.
So yeah, at the point that there’s an AI model that can find and select the relevant datasets, and can tell the user what questions to ask - when often they don’t know the questions they need to have answered, then yes, I’ll be out of a job.
But more likely I’ll have built that tool for my particular niche. Which is more and more what I’m doing.
AI gives me the agency to rapidly test and prototype ideas and double down on the things that work really well, and refine the things that don’t work so brilliantly.
Well the API calls worked perfectly. The LLM didn’t misinterpret that.
The data extraction via tesseract worked too.
The whisper transcript was pretty good. Not perfect, but when you do this daily you are easily able to work around things.
The summaries of the calls were very useful. I could easily verify those because I was on the calls.
The interview - again, the transcript is great. The bulleted narrative was guided - again - by me having been on the call. I verify the quotes against the transcript, and the audio if I’ve got any doubts.
Scrapers - again, they worked fine. The LLM didn’t misinterpret anything.
Podcasts - as before. Easy.
Article to voice - what’s to misinterpret?
Your criticism sounds like a lot of waffle with no understanding of how to use these tools.
Firstly I am not summarising the podcast, simply using whisper to make a transcript.
Even if I was, because I do this multiple times a day and have been for quite some time, I know how to check for errors.
One part of that is a “fact check” built into the prompt, another part is feeding the results of that prompt back into the API with a second prompt and the source material and asking it to verify that the output of the first prompt is accurate.
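Stripped down, that second verification pass might look something like the following; the prompts and model name are assumptions, and the real workflow described above presumably has more structure:

    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-3-7-sonnet-latest"  # placeholder

    def ask(prompt: str) -> str:
        msg = client.messages.create(
            model=MODEL,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

    def summarize_with_check(source: str) -> tuple[str, str]:
        """First pass produces the summary; second pass audits it against the source."""
        summary = ask("Summarize the following, citing only facts present in it:\n\n" + source)
        audit = ask(
            "Here is a source document and a summary of it. List every claim in the "
            "summary that is not supported by the source, or reply 'OK' if none.\n\n"
            "SOURCE:\n" + source + "\n\nSUMMARY:\n" + summary
        )
        return summary, audit

    if __name__ == "__main__":
        summary, audit = summarize_with_check("...source material goes here...")
        print(summary, "\n---\n", audit)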
However the level of hallucination has dropped massively over time, and when you’re using LLMs all the time you quickly become attuned to what’s likely to cause them and how to mitigate them.
I don’t mean this in an unpleasant way, but this question - and many of the other comments responding to my initial description of how I use LLMs - feels like the sort of thing people with only hand-wavey experience of LLMs would think, having played with the free version of ChatGPT back in the day.
Claude 3.7 is far removed from ChatGPT at launch, and even now ChatGPT feels like a consumer-facing product while Claude 3.7 feels like a professional tool.
And when you couple that with detailed tried and tested prompts via the api in a multistage process, it is incredibly powerful.
Did you also do that while mewing and listening to an AI abridged audiobook version of the laws of power in chinese? Don't forget your morning ice face dunks.
Yep, that's exactly how Microsoft is operating as a company recently: Copilot is a hammer—it needs to be used EVERYWHERE.
Forget bug fixes and new feature rollouts, every department and product team at Microsoft needs to add Copilot. Microsoft customers MUST jump on the AI-bandwagon!
>> I am neither bullish or bearish. LLM is a tool...It's a hammer
If someone says, "This new type of hammer will increase productivity in the construction industry by 25%", it's something else in addition to being a tool. It's either a lie, or it's an incredible advance in technology.
It's a lie. Those constructed things would become increasingly sub-par, requiring even more maintenance from workers who no longer exist because of the 25% efficiency gain. There is no win here. It's a shortcut for some people that will cost other people more time. It's burden-shifting to the others on the team, in the industry, or in the near future. It's weak sauce from weak workers.
The claims are a lie. The tool is still useful. Even if the hammer doesn't increase productivity by 25%, if it feels more comfortable to hold and I can use it for longer, I'm going to be happy with it, independent from any marketing.
I feel like we'll laugh at posts like this in 5 years. It's not inaccurate in any way, it just misses the wood for the trees. Any new technology is always worse in some ways. Smart phones still have much worse battery life and are harder to type on than Blackberries. But imagine not understanding why people are bullish about Smartphones.
It's 100x easier to see how LLM's change everything. It takes very little vision to see what an advancement they are. I don't understand how you can NOT be bullish about LLM's (whether you happen to like them or not is a different question).
Think about the early days of digital photography. When digital cameras first emerged, expert critics from the photograph field were quick to point out issues like low resolution, significant noise, and poor color reproduction—imperfections that many felt made them inferior to film. Yet, those early digital cameras represented a breakthrough: they enabled immediate image review, easy sharing, and rapid technological improvements that soon eclipsed film in many areas. Just as people eventually recognized that the early “flaws” of digital photography were a natural part of a revolutionary leap forward, so too should we view the occasional hallucinations in modern LLMs as a byproduct of rapidly evolving technology rather than a fundamental flaw.
Or how about computer graphics? Early efforts to move 3D graphics hardware into the PC realm were met with extreme skepticism by my colleagues who were “computer graphics researchers” armed with the latest Silicon Graphics hardware. One researcher I was doing some work with in the mid-nineties remarked about PC graphics at the time: “It doesn’t even have a frame buffer. Look how terrible the refresh rate is. It flickers in a nauseating way.” Etc.
It’s interesting how people who are actual experts in a field where there is a major disruption going on often take a negative view of the remarkable new innovation simply because it isn’t perfect yet. One day, they all end up eating their words. I don’t think it’s any different with LLMs. The progress is nothing short of astonishing, yet very smart people continue to complain about this one issue of hallucination as if it’s the “missing framebuffer” of 1990s PC graphics…
I think there are multiple conversations happening that are trying to converge into one.
On one hand, LLMs are overhyped and not delivering on promises made by their biggest advocates.
On the other hand, any other type of technology (not so overhyped) would be massively celebrated in significantly improving a subset of niche problems.
It’s worth acknowledging that LLMs do solve a good set of problems well, while also being overhyped as a silver bullet by folks who are generally really excited about its potential.
Reality is that none of us knows what the future holds, or whether LLMs will have enough breakthroughs to solve more problems than they do today, but what they do solve today is still very impressive as is.
Yes, exactly. There is a bell curve of hype, where some people think autoregressive decoders will lead us to AGI if we just give it the right prompt or perhaps a trillion dollars of compute. And there are others who haven’t even heard of ChatGPT. Depending on which slice of the population you’re interacting with, it’s either under or over hyped.
I've been saying the same thing, though in less detail. AI is so driven by hype at the moment that it's unavoidable that it's going to collapse at some point. I'm not saying the current crop of AI is useless; there are plenty of useful applications, but it's clear lots of people expect more from it than it's capable of, and everybody is investing in it just because everybody else is.
But even if it does work, you still need to doublecheck everything it does.
Anyway, my RPG group is going to try roleplaying with AI generated content (not yet as GM). We'll see how it goes.
> Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
> This includes e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations/directions to explore, you impute the coherent personality to the LLM. And you inject very few bits of steering by doing so, but those bits are load-bearing. If left to their own devices, LLMs won't pick those obviously correct ideas any more often than chance.
I wrote an AI assistant which generates working spreadsheets with formulas and working presentations with neatly laid out elements and styles. It's a huge productivity gain relative to starting from a blank page.
I think LLMs work best when they are used as a "creative" tool. They're good for the brainstorming part of a task, not for the finishing touches.
They are too unreliable to be put in front of your users. People don't want to talk to unpredictable chatbots. Yes, they can be useful in customer service chats because you can put them on rails and map natural language to predetermined actions. But generally speaking I think LLMs are most effective when used _by_ someone who's piloting them instead of wrapped in a service offered _to_ someone.
I do think we've squeezed 90%+ of what we could from current models. Throwing more dollars of compute at training or inference won't make much difference. The next "GPT moment" will come from some sufficiently novel approach.
I don't trust Sabine Hossenfelder on this. Even as a computer scientist/programmer with just some old experience from my master's courses in AI and ML, I know much more than she does about how these things work.
She has become more of an influencer than a scientist. And there is nothing wrong with that, as long as she doesn't pose as an authority on subjects she doesn't have a clue about. It's OK to have an opinion as an outsider, but it's not OK to pretend you are right and that you are an expert on every scientific or technical subject you happen to want to tweet about.
Some web development advice I remember from long ago is to do most of the styling after the functionality is implemented. If it looks done, people will think it is done.
LLMs did the "styling" first. They generate high-quality language output of the sort most of us would take as a sign of high intelligence and education in a human. A human who can write well can probably reason well, and probably even has some knowledge of facts they write confidently about.
When digital cameras first appeared, their initial generations produced low-resolution, poor-quality images, leading many to dismiss them as passing gimmicks. This skepticism caused prominent companies, notably Kodak, to overlook the significance of digital photography entirely. Today, however, film photography is largely reserved for niche professionals and specialized use-cases.
New technologies typically require multiple generations of refinement—iterations that optimize hardware, software, cost-efficiency, and performance—to reach mainstream adoption. Similarly, AI, Large Language Models (LLMs), and Machine Learning (ML) technologies are poised to become permanent fixtures across industries, influencing everything from automotive systems and robotics to software automation, content creation, document review, and broader business operations.
Considering the immense volume of new information generated and delivered to us constantly, it becomes evident that we will increasingly depend on automated systems to effectively process and analyze this data. Current challenges—such as inaccuracies and fabrications in AI-generated content—parallel the early imperfections of digital photography. These issues, while significant today, represent evolutionary hurdles rather than permanent limitations, suggesting that patience and continuous improvement will ultimately transform these AI systems into indispensable tools.
I'm also not bullish on this. In the sense that I don't think LLMs are going to get 10x better, but they are useful for what they can do already.
If I see what Copilot suggests most of the time, I would be very uncomfortable using it for vibe coding though. I think it's going to be... entertaining watching this trend take off. I don't really fear I'm going to lose my job soon.
I'm skeptical that you can build a business on a calculator that's wrong 10% of the time when you're using it 24/7. You're gonna need a human who can do the math.
Like others here, I use it to code (no longer a professional engineer, but keep side projects).
As soon as LLMs were introduced into the IDE, it began to feel like LLM autocomplete was almost reading my mind. With some context built up over a few hundred lines of initial architecture, autocomplete now sees around the same corners I am. It’s more than just “solve this contrived puzzle” or “write snake”. It combines the subject-matter use case (informed by variable and type naming) with the underlying architecture, and sometimes produces really breathtaking and productive results. Like I said, it took some time, but when it happened, it was pretty shocking.
Walk into any coffee shop or office and I can guarantee that you'll see several people actively typing into ChatGPT or Claude. If it was so useless, four years on, why would people be bothering with it?
I don't think you can even be bullish or bearish about this tech. It's here and it's changing pretty much every sector you can think of. It would be like saying you're not bullish about the Internet.
I honestly can't imagine life without one of these tools. I have a subscription to pretty much all of them because I get so excited to try out new models.
She's a scientist. Most of the people on here are writing software which is essentially reinventing the wheel over and over. Of course you have a different experience of LLMs.
I don't know.. I've maintained skepticism, but recently AI has enabled solutions for client problems that would have been intractable with conventional coding.
A team was migrating a years-old Excel-based workflow where no fewer than 3 spreadsheets contained thousands of call notes, often with multiple notes stuffed into the same column, separated inconsistently by a shorthand date and the initials of whoever was on the call. Sometimes with text arrows or other meta descriptions like (all calls after 3/5 were handled by Tim). They want to move all of this into structured Jira tickets and child tickets.
Joining the mess of freeform, redundant, and sometimes self contradicting data into JSON lines, and feeding it into AI with a big explicit prompt containing example conversions and corrections for possible pitfalls has resulted in almost magically good output. I added a 'notes' field to the output and instructed the model to call out anything unusual and it caught lots of date typos by context, ambiguously attributed notes, and more.
It would have been a man-month or so of soul-drowningly tedious and error-prone intern-level work, but instead it was 40 minutes and $15 of Gemini usage.
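The shape of that pipeline, very roughly, assuming the google-generativeai SDK; the field names, example note, prompt, and model are placeholders rather than details from this comment, and real code would need to handle non-JSON replies and retries:

    import json
    import google.generativeai as genai  # pip install google-generativeai

    genai.configure(api_key="...")                     # set your key
    model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model

    PROMPT = """Convert this free-form call note into JSON with keys
    date, author_initials, summary, and notes (anything unusual: typos you
    corrected by context, ambiguous attribution, etc.). Example:
    input: "3/5 JB - client wants revised quote (all calls after 3/5 handled by Tim)"
    output: {"date": "2024-03-05", "author_initials": "JB",
             "summary": "Client wants revised quote",
             "notes": "Later calls reassigned to Tim"}
    """

    def convert_note(raw_note: str) -> dict:
        resp = model.generate_content(PROMPT + "\ninput: " + raw_note + "\noutput:")
        return json.loads(resp.text)

    if __name__ == "__main__":
        with open("notes.jsonl") as f:                       # hypothetical export
            for line in f:
                print(convert_note(json.loads(line)["note"]))  # "note" is a made-up field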
So, even if it's not a galaxy brained super intelligence yet, it is a massive change to be able to automate what was once exclusively 'people' work.
I think there is a lot of denial going around right now.
The present path of AI is nothing short of revolutionary. A lot of jobs and industries are going to suffer a major upheaval, and a lot of people are just living in some wishful-thinking moment where it will all go away.
I see people complaining it gives them bad results. Sure it does, but so does every other source of information we take in. It's our job to check it ourselves. Still, the amount of time it saves me, even if I have to correct it, is huge.
I can give an example that has nothing to do with work. I was searching for the smallest miniATX computer cases that would accept at least 3 HDDs (3.5"). The amount of time LLMs saved me is staggering.
Sure, there was one wrong result in the mix, and sure, I had to double check all the cases myself, but, just not having to go through dozens of cases, find the dimensions, calculate the volume, check the HDDs in difficult to read (and sometimes obtain) pages, saved days of work - yes I had done a similar search completely manually about 5 years ago.
This is a personal example, I also have others at work.
The author of the tweet is a physicist. Her work is at the boundary of human knowledge. LLM's are useless in this domain, at least when applied directly.
When I use LLM’s to explore applications of cutting edge nonlinear optics, I too am appalled about the quality of the output. When I use an LLM to implement a React program, something that has been done hundreds of times before by others, I find it performs well.
It's a common and fair criticism. LLM-based products promise to save time, but for many complex, day-to-day tasks - adding a feature to a 50M LOC codebase, writing a magazine-quality article, properly summarizing 5 SEC filings - they often don't. They require careful re-validation, and once you find a few obvious issues, trust erodes and the whole thing gets tossed.
This isn't a technology problem, it's a product problem - and one that may not be solvable with better models alone.
Another issue: people communicate uncertainty naturally. We say "maybe", "it seems", "I'm not sure, but...". LLMs suppress that entirely, for structural reasons. The output sounds confident and polished, which warps perception - especially when the content is wrong.
We've had the opposite experience, especially with o3-mini using Deep Research for market research & topic deep-dive tasks. The sources that are pulled have never been 404 for us, and typically have been highly relevant to the search prompt. It's been a huge time-saver. We are just scratching the surface of how good these LLMs will become at research tasks.
I used to be skeptical of the hype but it's hard to deny that they are incredible tools. For coding, they save me a few hours a week and this is just the beginning. A few months ago I would use them to generate simple pieces of code, but now they can handle refactoring across several files. Even if LLMs don't get smarter, the tooling around them will improve and they'll be even more useful. We'll also learn to use them better.
Also my gf who's not particularly tech savvy relies heavily on ChatGPT for her work. It's very useful for a variety of text (translation, summaries, answering some emails).
Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
> Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
Yeah. That was my first thought. There’s probably orders of magnitude more training data for software engineering than for theoretical physics (her field). Also, how much of software engineering is truly novel? Probably someone else has already come up with a decent solution to your problem, it’s “just” a matter of finding it.
Should we listen to Sabine in this case? Isn't this another manifestation of a generally intelligent person, who happens to be an expert in her field weighing in on something she's not an expert on, thinking her expertise transfers?
This is the most common 'smart person' fallacy out there.
As for my 2 cents, LLMs can do sequence modeling and prediction tasks, so as long as a problem can be reduced to sequence modeling (which is a lot of them!), they can do the job.
This is like saying that the Fourier Transform is played out because you can only do so much with manipulating signal frequencies.
Well, she's an expert in the field she's asking the LLM about, and she judges the responses to be nonsense. Would a non-expert be able to tell? This is mostly mirroring my experience with it, which is a straightforward report of "it doesn't work for me, even though I keep trying because everyone else is hyped about it and apparently it's getting better all the time".
LLMs are an incredible technological breakthrough which even their creators are using completely incorrectly - in my view, half because they want to make a lot of money, and half because they are enchanted by the allure of their own achivement and are desperate to generalize it. The ability to generate human-style language, art, and other media dynamically on demand, based on a prompt communicated in natural language, is an astonishing feat. It's immensely useful on its own.
But even its creators, who acknowledge it is not AGI, are trying to use it as if it were. They want to sell you LLMs as "AI" writ large, that is, they want you to use it as your research assistant, your secretary, your lawyer, your doctor, and so on and so forth. LLMs on their own simply cannot do those tasks. They are great for other uses: troubleshooting, assisting with creativity and ideation, prototyping concepts of the same, and correlating lots of information, so long as a human then verifies the results.
LLMs right now are flour, sugar, and salt, mixed in a bowl and sold as a cake. Because they have no reasoning capability, only rote generation via prediction, LLMs cannot process contextual information in the way required for them to be trustworthy or reliable for the tasks people are trying to use them for. No amount of creative prompting can resolve this totally. (I'll note that I just read the recent Anthropic paper, which uses terms like "AI biology" and "concept" to imply that the AI has reasoning capacity - but I think these are misused terms. An LLM's "concept" of something has no referent in the real world, only a set of weights linking it to other related concepts.)
What LLMs need is some sort of intelligent data store, tuned for their intended purpose, that can generate programmatic answers for the LLMs to decipher and present. Even then, their tendency to hallucinate makes things tough - they might imagine the user requested something they didn't, for instance. I don't have a clear solution to this problem. I suspect whoever does will have solved a much bigger, more complex than the already massive one that LLMs have solved, and if they are able to do so, will have brought us much much closer to AGI.
I am tired of seeing every company under the sun claim otherwise to make a buck.
I love this. The more people that say "I don't get it" or "it's a stochastic parrot", the more time I get to build products rapidly without the competition that there would be if everyone was effectively using AI. Effectively is the key.
It's cliche at this point to say "you're using it wrong" but damn... it really is a thing. It's kind of like how some people can find something online in one Google query and others somehow manage to phrase things just wrong enough that they struggle. It really is two worlds. I can have AI pump out 100k tokens with a nearly 0% error rate, meanwhile my friends with equally high engineering skill struggle to get AI to edit 2 classes in their codebase.
There are a lot of critical skills and a lot of fluff out there. I think the fluff confuses things further. The variety of models and model versions confuses things EVEN MORE! When someone says "I tried LLMs and they failed at task xyz" ... what version was it? How long was the session? How did they prompt it? Did they provide sufficient context around what they wanted performed or answered? Did they have the LLM use tools if that is appropriate (web/deepresearch)?
It's never a like-for-like comparison. Today's cutting-edge models are nothing like even 6-months ago.
Honestly, with models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3-mini-high, I'm not sure how people fail so hard at prompting and getting quality answers. The models practically predict your thoughts.
Maybe that's the problem, poor specifications in (prompt), expecting magic that conforms to their every specification (out).
I genuinely don't understand why some people are still pessimistic about LLMs.
Great points. I think much of the pessimism is based on fear of inadequacy. Also the fact that these things bring up truly base-level epistemological quandaries that question human perception and reality fundamentally. The average Joe doesn't want to think about how we don't know whether consciousness is a real thing, let alone determine whether the robot has it.
We are going through a societal change. There will always be people who reject AI no matter the capabilities. I'm at the point where if ANYTHING tells me that it's conscious... I just have to believe it and act according to my own morals.
I am making an effort to use LLMs at work, but in my workflow it's basically just a fancy auto complete. Having a more AI centric workflow could be interesting, but I haven't thought of a good way to rig that up. I'm also not really itching for something to do my puzzles for me. They're what gets me out of bed in the morning.
I haven't tried using LLMs for much else, but I am curious as long as I can run it on my own hardware.
I also totally get having a problem with the massive environmental impact of the technology. That's not AI's fault per se, but it's a valid objection.
I think LLMs have value, but what I'm really looking forward to is the day when everyone can just quietly use (or not use) LLMs and move on with their lives. It's like that one friend who started a new diet and can't shut up about it every time you see them, except instead of that one friend it's seemingly the majority of participants in tech forums. It's getting so old.
The author mentioned Gemini sometimes refusing to do something.
I’ve recently been using Gemini (mostly 2.0 flash) a lot and I’ve noticed it sometimes will challenge me to try doing something by myself. Maybe it’s something in my system prompt or the way I worded the request itself. I am a long time user of 4o so it felt annoying at first.
Since my purpose was to learn how to do something, being open minded I tried to comply with the request, and I can say that… it's been a really great experience in terms of retention of knowledge. Even if I'm making mistakes, Gemini will point them out and explain them nicely.
I think the author is being overly pessimistic with this. The positives of an LLM agent outweigh the negatives when used with a Human-in-the-loop.
For people interested in understanding the possibilities of LLMs for use in a specific domain, see The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee (Microsoft Research VP), Isaac Kohane (Harvard Biomedical Informatics MD) et al. It is an easy read showing the authors' systematic experiments with using the OpenAI models via the ChatGPT interface for the medical/healthcare domain.
People have different opinions about this, but I think one problem is that there are really several different questions here.
One is - Google, Facebook, OpenAI, Anthropic, Deepseek etc. have put a lot of capital expenditure into training frontier large language models, and are continuing to do so. There is a current bet that growing the size of LLMs, with more or maybe even synthetic data, with some minor breakthroughs (nothing as big as the Alexnet deep learning breakthrough, or transformers), will have a payoff for at least the leading frontier model. Similar to Moore's law for ICs, the bet is that more data and more parameters will yield a more powerful LLM - without that much more innovation needed. So the question for this is whether the capital expenditure for this bet will pay off.
Then there's the question of how useful current LLMs are, whether we expect to see breakthroughs at the level of Alexnet or transformers in the coming decades, whether non-LLM neural networks will become useful - text-to-image, image-to-text, text-to-video, video-to-text, image-to-video, text-to-audio and so on.
So there's the business side question, of whether the bet that spending a lot of capital expenditure training a frontier model will be worth it for the winner in the next few years - with the method being an increase in data, perhaps synthetic data, and increasing the parameter numbers - without much major innovation expected. Then there's every other question around this. All questions may seem important but the first one is what seems important to business, and is connected to a lot of the capital spending being done on all of this.
I think she is right. And it may have some very real consequence. The latest Gen Alpha are AI native. They have been using AI for one or two years now. And as they grow up their knowledge will definitely be built on top of AI. This leads to a few fundamental problems.
1. AI invents false information that then gets built into their foundational knowledge.
2. There is a lot less problem solving for them once they are used to AI.
I think the field of education needs to take AI, or the current LLM chatbots, seriously and start asking and planning how to react to it. We have already witnessed Gen Z, in the era of Google, thinking they know everything, and if not, that they can just google it. Thinking "they know it all", only to be battered in the real world.
Can our brains recall precise citations to tens of papers we read a while ago? For the vast majority, no. LLMs function somewhat similarly to our brains in many ways, as opposed to classical computers.
Their strengths and flaws differ from our brains, to be sure, but some of these flaws are being mitigated and improved on by the month. Similarly, unaided humans cannot operate successfully in many situations. We build tools, teams, and institutions to help us deal with them.
> LLMs function somewhat similarly to our brains in many ways,
Including the arrogance to confidently deliver a wrong answer. Which is the opposite of the reasons we use computers in the first place. Why this is worth billions of dollars is utterly beyond me.
> unaided humans cannot operate successfully in many situations
Absolute nonsense driven by a total lack of historical perspective or knowledge.
> We build tools, teams, and institutions to help us deal with them.
And when they lie to us we immediately correct that problem or disband them recognizing that they are more trouble than they could be worth.
>> unaided humans cannot operate successfully in many situations
> Absolute nonsense driven by a total lack of historical perspective or knowledge.
An LLM can give you a list of examples:
Historical Examples:
- During historical epidemics, structured record-keeping and statistical analysis (such as John Snow’s cholera maps in 1854) significantly improved outcomes.
- Development of physics, architecture, and engineering depended heavily on tools such as abacus, logarithmic tables, calculators, slide rules, to supplement human cognitive limitations.
- Astronomical calculations in ancient civilizations (Babylonian, Greek, Mayan) depended heavily on abacuses, tables, and other computational tools.
- The pyramids in ancient Egypt required extensive use of tools, mathematics, coordinated human labor, and sophisticated organization.
An LLM can do some pretty interesting things, but the actual applicability is narrow. It seems to me that you have to know a fair amount about what you're asking it to do.
For example, last week I dusted off my very rusty coding skills to whip up a quick and dirty Python utility to automate something I'd done by hand a few too many times.
My first draft of the script worked, but was ugly and lacked any trace of good programming practices; it was basically a dumb batch file, but in Python. Because it worked part of me didn't care.
I knew what I should have done -- decompose it into a few generic functions; drive it from an intelligent data structure; etc -- but I don't code all the time anymore, and I never coded much in Python, so I lack the grasp of Python syntax and conventions to refactor it well ON MY OWN. Stumbling through with online references was intellectually interesting, but I also have a whole job to do and lack the time to devote to that. And as I said, it worked as it was.
But I couldn't let it go, and then had the idea "hey, what if I ask ChatGPT to refactor this for me?" It was very short (< 200 lines), so it was easy to paste into the Chat buffer.
Here's where the story got interesting. YES, the first pass of its refactor was better, but in order to get it to where I wanted it, I had to coach the LLM. It took a couple passes through before it had made the changes I wanted while still retaining all the logic I had in it, and I had to explicitly tell it "hey, wouldn't it be better to use a data structure here?" or "you lost this feature; please re-add it" and whatnot.
In the end, I got the script refactored the way I wanted it, but in order to get there I had to understand exactly what I wanted in the first place. A person trying to do the same thing without that understanding wouldn't magically get a well-built Python script.
While I am amazed at the technology, at the same time I hate it. First, 90% of people misinterpret it and cite its output as fact. Second, it needs too much energy. Third, energy consumption is rarely mentioned in nerdy discussions about LLMs.
'But you're such a killjoy.'
Yes, it is an evil technology in its current shape. So we should focus on fixing it, instead of making it worse.
Note that there is a difference between being "bullish" (i.e. an upward-trending market) and being "useful". I think there is general value in LLMs for semantic search and information extraction, but not as an exclusive path to AGI, which is what some of the market expects to justify its overinflated valuations.
I'd say the author herself tells us more than the headline or the first sentence does. If you have recently scrolled through Sabine's posts on Twitter, or her clickbaity thumbnails, facial expressions and headlines on YouTube [1], you would see that she is all-in for clicks. She often takes a popular belief, then negates it, and throws around counter-examples of why X or Y has absolutely failed. It's a repeating pattern to gain popularity, and it seems to work not only on Twitter and YouTube, but even here on Hacker News, given the massive number of upvotes her post has.
She's a physicist. LLMs are not for creating new information. They're for efficiently delivering established information. I use it to quickly inform me about business decisions all the time, because a thousand answers about those questions already exist. She's just using it for the wrong thing.
I think LLM skeptics and cheerleaders all have a point but I lean toward skepticism. And that's because though LLMs are easy to pick up, they're impossible to "master" and are extremely finicky. The tech is really fun to tinker with and capable of producing some truly awesome results in the hands of a committed practitioner. That puts it in the category of specialized tools. Despite this fact, LLMs are hyped and valued by the markets like they are breakthrough consumer products. My own experience plus the vague/underwhelming adoption and revenue numbers reported by the major vendors tell me that something's not quite right in this area of the industry.
I understand how developers can come to this conclusion if they're only using local models that can run on consumer GPUs since there's a time cost to prompting and the output is fairly low quality with a higher probability of errors and hallucinations.
But I don't understand how you can come to this conclusion when using SOTA models like Claude Sonnet 3.7. Its responses have always been useful to me, and when it doesn't get it right the first time you can keep prompting it with clarifications and error responses. On the rare occasion it's unable to get it right, I'm still left with a bulk of useful code that I can manually fix and refactor.
Either way, my interactions with Sonnet are always beneficial. Maybe it's a prompt issue? I only ask it to perform small, specific, deterministic tasks and provide the necessary context (with examples when possible) to achieve them.
I don't vibe code or unleash an LLM on an entire code base since the context is not large enough and I don't want it to refactor/break working code.
I genuinely don't understand why some people are so critical of LLMs. This is new tech, we don't really understand the emergent effects of attention and transformers within these LLMs at all. It is very possible that, with some further theoretical development, LLMs which are currently just 'regurgitating and hallucinating' can be made to be significantly more performant indeed. In fact, reasoning models - when combined with whatever Google is doing with the 1M+ ctxt windows - are much closer to that than people who were using LLMs expected.
The tech isn't there yet, clearly. And stock valuations are way too high across the board. But LLMs as a tech != the stock valuations of the companies. And LLMs as a tech are here to stay, improve, and integrate into everyday life more and more - with massive impacts on education (particularly K-12) as models get better at thinking and explaining concepts, for example.
The key for LLM productivity, it seems to me, is grounding. Let me give you my last example, from something I've been working on.
I just updated my company's commercial PPT. ChatGPT helped me with:
- Deep Research for great examples of and references for such presentations.
- Restructure my argument and slides according to some articles I found in the previous step and thought were pretty good.
- Come up with copy for each slide.
- Iterate new ideas as I was progressing.
Now, without proper context and grounding, LLMs wouldn't be so helpful at this task, because they don't know my company, clients, product and strategy, and would be generic at best. The key: I provided it with my support portal documentation and a brain dump I recorded to text on ChatGPT with key strategic information about my company. Those are two bits of info I keep always around, so ChatGPT can help me with many tasks in the company.
From that grounding to the final PPT, it's pretty much a trivial and boring transformation task that would have cost me many, many hours to do.
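Not the commenter's actual setup, but a minimal sketch of the grounding idea: prepend the company docs and the strategy brain dump to the task before asking for slide copy. The file names and the llm_complete() helper are hypothetical placeholders for whatever model and API you actually use.

```python
# Minimal sketch of the grounding step: prepend company context to the task.
# File names and llm_complete() are hypothetical placeholders.
from pathlib import Path


def build_grounded_prompt(task: str) -> str:
    support_docs = Path("support_portal_docs.md").read_text()    # product/support documentation
    strategy_dump = Path("strategy_braindump.md").read_text()    # transcribed strategy brain dump
    return (
        "You are helping prepare a commercial presentation for my company.\n\n"
        f"Company documentation:\n{support_docs}\n\n"
        f"Strategy notes:\n{strategy_dump}\n\n"
        f"Task: {task}"
    )


def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM API call here")


if __name__ == "__main__":
    prompt = build_grounded_prompt("Draft slide copy for our key differentiators.")
    print(prompt[:500])  # sanity-check the assembled context before sending it to llm_complete()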
The way I see it, it's less about the technicalities of accuracy and more about the long term human and societal problems it presents when widely adopted.
On one hand, every new technology that comes about unregulated creates a set of ethical and, in this particular case, existential issues.
- What will happen to our jobs?
- Who is held accountable when a car navigation system designed with an LLM goes haywire and causes an accident?
- What will happen with education if we kill all entry level jobs and make technical skills redundant?
In a sense they're not new concerns in science, we research things to make life easier, but as technology advances, critical thinking takes a hit.
So yeah, I would say people are still right to be wary and "bearish" on LLMs, as that's the normal behaviour toward disruptive technology, and it will help us create adequate regulations to safeguard the future.
Simple example: My company is using Gemini to analyze every 13F filing and find correlations between all S&P500 companies the minute new earnings are released. We profited millions off of this in the last six months or so. Replicating this work alone without AI would require hiring dozens of people. How can I not be bullish on LLMs? This is only one of many things we are doing with it.
I do not understand how you can be bearish on LLMs. Data analysis, data entry, agents controlling browsers, browsing the web, doing marketing, doing much of customer support, writing BS React code for a promo that will be obsolete in 3 months anyway.
The possibilities are endless, and almost every week, there is a new breakthrough.
That being said, OpenAI has no moat, and there definitely is a bubble. I'm not bullish on AI stocks. I'm bullish on the tech.
LLMs as a tool that you use and check can be useful, especially for code.
However, I think that putting some LLM in your customer-facing app/SaaS/game is like using the "I'm feeling lucky" button when Google introduced it. It only works for trivial things that you might not even have needed a search engine for (finding the address of a website you had already visited). But since it's so cheap to implement and feels like it's doing the work of humans for a tiny fraction of the cost, companies won't care about the customers and will implement it anyway. So it'll probably flood any system that can get away with basic mistakes, hopefully not systems where human lives are at stake.
LLMs are like any tool, you get what you put in. If you are frustrated with the results, maybe you need to think about what you're doing.
300/5290 functions decompiled and analyzed in less than three hours off of a huge codebase. By next weekend, a binary whose source code had been lost will have tests running on a platform it wasn't designed for.
Sabine has moved well into the deep end and has all sorts of bizarre anti-science content at this point. I had been getting more and more uncomfortable with some of her videos and couldn't really put it into words until one of the guys who blew up debunking flat-earther stuff, Professor Dave, put out a few videos on what she's been doing.
She's a YouTuber who is happy to take whatever position she thinks will get her the most views and ad revenue, all while crying woe-is-me about how the scientific establishment shuns her.
As a former theoretical high energy physicist (who believes the field does have problems): yes, and there's no colleague of mine that I know of who gives a shit about what she says. She's just a YouTuber to us. As a general rule of thumb, you can assume any “celebrity” “scientist” constantly feuding in public is full of shit.
Sabine is on my blocklist as she very often puts out really ignorant and short-sighted perspectives.
LLMs are the most impactful technology we've had since the internet; that is why people are bullish on them. Anyone who fails to see that probably can't tie their own shoes without a "peer-reviewed" mechanism, lol.
To each their own. I give o3 with deep research my notes and ask it for a high-level design document, then feed that to Claude and get a skeleton of a multi-service system, then build out functionality in each service with subsequent Claude requests.
Sure, it does middle-of-the-road stuff, but it comments the code well, I can manually tweak things at the various levels of granularity to guide it along, and the design doc is on par with something a senior principal would produce.
I do in a week what a team of four would take a month and a half to do. It's insane.
Sure, don't be bullish. I'm frantically piecing together enough hardware to run a decent sized LLM at home.
This user just doesn't understand how to use an LLM properly.
The best solution to hallucination and inaccuracy is to give the LLM mechanisms for looking up the information it lacks. Tools, MCP, RAG, etc are crucial for use cases where you are looking for factual responses.
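For what it's worth, the lookup idea doesn't require a full framework. Here is a minimal, hedged sketch of the pattern: the model asks for a search instead of answering from memory, and then answers from the returned text. The llm_complete() and search_docs() functions are hypothetical stand-ins for a real LLM API and a real retrieval backend.

```python
# Minimal sketch of "give the model a lookup tool": the model requests a
# search instead of answering from memory, then answers from the results.
# llm_complete() and search_docs() are hypothetical stand-ins.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")


def search_docs(query: str) -> str:
    raise NotImplementedError("plug in your search / RAG backend here")


def answer_with_lookup(question: str, max_rounds: int = 3) -> str:
    transcript = (
        "Answer the question. If you need facts you are not sure about, reply "
        "with exactly 'LOOKUP: <query>' and wait for the results.\n"
        f"Question: {question}\n"
    )
    for _ in range(max_rounds):
        reply = llm_complete(transcript)
        if reply.startswith("LOOKUP:"):
            results = search_docs(reply[len("LOOKUP:"):].strip())
            transcript += f"{reply}\nSearch results:\n{results}\n"
            continue
        return reply  # answer grounded in whatever was retrieved above
    return "Could not produce a grounded answer within the round limit."
```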
I'm not bullish, because many of these models rest on the consumption of copyrighted material or information that wasn't intended for mass consumption in this way.
Also, I like to think for myself. Writing code and thinking through what I am writing often exposes edge cases that I wouldn’t otherwise realize.
I was previously a non believer in LLMs, but I've come around to accepting them. Gemini has saved me so much time and energy it's actually insane. I wouldn't be able to do my work to my satisfaction (which is highly technical) without its support.
One is a boss's view, looking for an AI to replace his employees someday. I think that is a dead end. It is just getting better at sounding sophisticated: increasingly impressive, but it won't work.
One is the worker's view, looking at AI to be a powerful tool that can leverage one's productivity. I think that is looking promising.
I don't really care for the chatbot to give me accurate sources. I care about an AI that can suggest likely places to look for sources, and I'll build the toolchain to look up and verify the sources.
A bunch of comments here seem to be missing a point : the author is (at least was ?) a scientist.
Her primary professional interest is the truth, not the statistically plausible.
Her point is that using an LLM to generate truth is pointless, and that people should stop advertising LLMs as "intelligent", since, to a scientist, being "intelligent" and being "dead wrong" are polar opposites.
Other use cases have feedback loops - it does not matter so much if Claude spits out wrong code, provided you have a compiler and automated tests.
Scientists _are_ acting as compilers to check truth. And they rely on truths compiled by other scientists, just like your programs rely on code written by other people.
What if I told you that, from now on, any third-party library you call will _statistically_ work 76% of the time, and I have no clue what it does the remaining X% of the time? (I don't know what X is, I haven't asked ChatGPT yet.)
In the meantime, I have yet to see a headline "AI X discovered life-changing new Y on its own" (the closest thing I know of is AlphaFold, which I both know is apparently "changing the world of scientists", and yet feel has "not changed the world of your average Joe, so far" - emphasis on the "so far"); but I've already seen at least one headline of "dumb mistake made because an AI hallucinated".
I suppose we have to hope the trend will reverse at some point? Hope, on a Friday...
I don’t think it’s fair to say LLMs are “statistically plausible” text generators. It ignores the emergent world model that enables them to generalise outside of their training set. LLMs aren’t a lookup table, and they’re not n-grams.
I pay our lawyer quite a lot and he also makes mistakes. What have you: typos, somewhat imprecise language in contracts. But we work to fix it and everything's OK.
AI coding overall still seems to be underrated by the average developer.
They try to add a new feature or change some behavior in a large existing codebase and it does something dumb and they write it off as a waste of time for that use case. And that's understandable. But if they had tweaked the prompt just a bit it actually might've done it flawlessly.
It requires patience and learning the best way to guide it and iterate with it when it does something silly.
Although you undoubtedly will lose some time re-attempting prompts and fixing mistakes and poor design choices, on net I believe the frontier models can currently make development much more productive in almost any codebase.
"Yes, I have tried Gemini, and actually it was even worse in that it frequently refuses to even search for a source and instead gives me instructions for how to do it myself. Stopped using it for that reason."
Thank you Sabine. Every time I have mentioned that Gemini is the worst, and not even worthy of consideration, I have been bombarded with downvotes and told I am using it wrong.
I feel like at this point when people make some claim about LLMs they need to actually include the model they are using. So many "LLMs can do / can't do X" claims, without reference to the model, which I think is relevant.
I am hoping that the LLM approach will face increasingly diminished returns however. So I am biased toward Sabine's griping. I don't want LLM to go all the way to "AGI".
Well you see, there are lots of people that are dependent on bullshitting their way through life, and don't actually contribute anything new or novel to the world. For these people, LLMs are great because they can generate more bullshit than they ever could before. For those that are actually trying to do new and interesting things, well, an LLM is only as good as what it has seen before, and you're doing something new and exciting. Congratulations, you beat the machine, until they steal your work and feed it to the LLM.
Ah hah, this is what I think. LLMs are the ultimate bullshitting machines! When you need to produce some BS fast, an LLM is definitely the right tool for that job.
I hope she is aware of the limited context window and ability to retrieve older tokens from conversations.
I have used LLMs for the exact same purposes she has, summarizing chapters or whole books and finding the source of a quote, both with success.
I think the key to a successful output lies in the way you prompt it.
Hallucinations should be expected though. As we all hopefully know, LLMs are more of an autocomplete than an intelligence; we should stick to that mindset.
I can't believe they're complaining about LLMs not being able to do unit conversion. Of all things, this is the least interesting and last task I would ever ask an LLM to do.
People who don't work in tech have no idea how hard it is to do certain things at scale. Skilled tech people are severely underappreciated.
From a sub-tweet:
>> no LLM should ever output a url that gives a 404 error. How hard can it be?
As a developer, I'm just imagining a server having to call up all the URLs to check that they still exist (and the extra costs/latency incurred there)... And if any URLs are missing, getting the AI to re-generate a different variant of the response, until you find one which does not contain the missing links.
And no, you can't do it from the client side either... It would just be confusing if you removed invalid URLs from the middle of the AI's sentence without re-generating the sentence.
You almost need to get the LLM to engineer/pre-process its own prompts in a way which guesses what the user is thinking in order to produce great responses...
Worse than that though... A fundamental problem of 'prompt engineering' is that people (especially non-tech people) often don't actually fully understand what they're asking. Contradictions in requirements are extremely common. When building software especially, people often have a vague idea of what they want... They strongly believe that they have a perfectly clear idea but once you scope out the feature in detail, mapping out complex UX interactions, they start to see all these necessary tradeoffs and limitations rise to the surface and suddenly they realize that they were asking for something they don't want.
It's hard to understand your own needs precisely; even harder to communicate them.
I think it could be possible if you're using thinking tokens or via function calls - decide all URLs that will be mentioned while thinking (don't display this to the user), delay for a few seconds to test they exist, push that into the prompt history and then output what the user sees. But it's a bit of an edge case really and would probably annoy users with the extra latency.
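Something like the following verify-then-regenerate loop, as a rough sketch only (llm_complete() is a hypothetical stand-in for whatever model call you use; the regex and retry policy are illustrative choices):

```python
# Rough sketch of verify-then-regenerate: extract URLs from a drafted answer,
# HEAD-check them, and re-prompt if any are dead. llm_complete() is a
# hypothetical stand-in for the actual model call.
import re

import requests

URL_RE = re.compile(r"https?://[^\s)\]>]+")


def dead_links(text: str, timeout: float = 5.0) -> list[str]:
    dead = []
    for url in URL_RE.findall(text):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                dead.append(url)
        except requests.RequestException:
            dead.append(url)
    return dead


def answer_with_checked_links(prompt: str, llm_complete, max_retries: int = 2) -> str:
    draft = llm_complete(prompt)
    for _ in range(max_retries):
        bad = dead_links(draft)
        if not bad:
            return draft
        draft = llm_complete(
            f"{prompt}\n\nYour previous answer cited URLs that do not resolve: "
            f"{', '.join(bad)}. Rewrite the answer without them."
        )
    return draft  # may still contain unverified links; this is the latency trade-off
```

The extra HEAD requests are exactly the latency cost described above, so in practice you would probably only run this when the user explicitly asks for sources.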
It's all done quite easily when that's the priority, but it's not, because the priority is hype to keep the cycle going and mint a few more ultra-wealthy assholes. Nothing about fixing 404s would require resources unavailable to these mega-corporations, and you and others carrying their water is part of why they don't.
Why do people who don't like using LLMs keep insisting they are useless for the rest of us? If you don't like to use them, then simply don't use them.
I use them almost daily in my job and get tremendous use out of them. I guess you could accuse me of lying, but what do I stand to gain from that?
I've also seen people claim that only people who don't know how to code, or people doing super simple, done-a-million-times apps, can get value out of LLMs. I don't believe that applies to my situation, but even if it did, so what? I do real work for a real company delivering real value, and the LLM delivers value to me. It's really as simple as that.
Even outside of work. Two personal examples, out of dozens this week:
1. Asked ChatGPT for a table showing monthly daily max temp, rainfall in mm and number of rain days for Vietnam, Cambodia and Thailand, colour coded based on the temperatures. Then to suggest two times of year, and a route direction, to hit the best conditions on a multi-week trip.
It took a couple of seconds, and it helpfully split Vietnam at Hanoi and HCM given their weather differences.
2. I'm trying to work out how I will build a chicken orchard - post material, spacing, mesh, etc. I asked ChatGPT for a comparison table of steel posts versus timber, and then to cost it out with varying scales of post spacing. Plus pros and cons of each, and likely effort to build. Again, it took a few seconds, including browsing local stores for indicative pricing.
On top of that, I've been even more impressed by a first week testing Cursor.
Yeah, it did. And uncovered a local timber supply place I wasn't aware of, with decent pricing. I'm not expecting a final quote, but quick ways to compare and consider variations. I've used it in the same way to compare cladding costs/work for lining a building.
I don't think it's a matter of liking or not. The use cases just differ considerably, and the tools are not as useful or applicable across all of them. The OP's use case is probably one of the worst possible for LLMs right now, imo...
The internet is full of people spouting their opinions and implying other opinions are wrong. Both sides of the AI/LLM (and all in-between) are well represented.
I don't actually like LLMs that much or find them useful, but its clear there are some things they are good at and some things they are bad at, and this post is all about things they are weak at.
LLMs are a little bit magical but they are still a square peg. The fact they don't fit in a round hole is uninteresting. The interesting thing to debate is how useful they are at the things they are good at, not at the things they are bad at.
I haven't met any developers in real life who are hyped about LLMs. The only people who seem excited are managers and others who don't really know how to program.
> By my personal estimate currently GPT 4o DeepResearch is the best one.
If the o3-based, three-month-old strongest model is the best one, it's proof that there were quite significant improvements in the last 2 years.
I can't name any other technology that improved as much in 2 years.
o1 and o1 pro helped me with filing tax returns and answered questions that (probably quite bad) tax accountants (and less capable models) weren't able to (of course I read the referenced laws, I don't trust the output either).
I don't mean to be disparaging to the original author, but I genuinely think a good litmus test today for the calibre of a human researcher's intelligence is what they are able to get out of the current state-of-the-art AI with all of its faults. Contrasting any of Terence Tao's comments over the last year with the comments above is telling. Perhaps it seems unfair to contrast such a celebrated mathematician with a popular science author, but in fact one would a priori expect that AI ought to be less helpful for the most talented mind in a field. Yet we seem to find exactly the opposite: this cottage industry of "academics" who, now more than a year since LLMs entered popular consciousness, seem to do nothing but decry "AI hype" - while both everyday people and seemingly the most talented researchers continue to pursue what interests them at a higher level.
If I were to be cynical, I think we've seen over the last decade the descent of most of academia, humanities as much as natural sciences, to a rather poor state, drawing entirely on a self-contained loop of references without much use or interest. Especially in the natural sciences, one can today with little effort obtain an infinitely more insightful and, yes, accurate synthesis of the present state of a field from an LLM than 99% of popular science authors.
The keyword in title is "bullish". It's about the future.
Specifically I think it's about the potential of the transformer architecture & the idea that scaling is all that's needed to get to AGI (however you define AGI).
> Companies will keep pumping up LLMs until the day a newcomer puts forward a different type of AI model that will swiftly outperform them.
Upon asking ChatGPT multiple times, it just wouldn't tell me why Vlad the Impaler has this name. It will refuse to say anything cruel, I guess, but it's very frustrating when history is not represented truthfully.
When I'm asking about topics unknown to me, I don't know what it's hiding from me, and that's awful.
This is interesting. I have replaced Google search with ChatGPT and Meta's AI. They more than deliver. Thinking about my use cases, I use Google to recollect things or to give me a starting point for further research. LLMs are great for that, so I am never going back to Google. However, I am curious about the cases where the OP sees this great gap and these failures.
This article seems to focus on the shortcomings of LLMs being wrong, but fails to consider the value LLMs provide and just how large of an opportunity that value presents.
If you look at any company on earth, especially large ones, they all share the same line item as their biggest expense: labor. Any technology that can reduce that cost represents an overwhelmingly huge opportunity.
OP highlights application problems, and RAG specifically. But that is not an LLM problem.
Chat is such a “leaky” abstraction for LLMs
I think most people share the same negative experience because they only interact with LLMs through the chat UIs from OpenAI and Anthropic. The real magic moment for me was still the autocompletion moment from GitHub Copilot.
What is a good tutorial / training, to learn about LLMs from scratch ?
I guess that’s the main problem, even more so for non-developers and -tech people. The learning curve is too steep, and people don’t know where to start.
Funny how most of the counter comments here used the form "my experience is different/it's amazing!" and then listed activities that are completely different from what Sabine listed :)
We really should stop reinforcing our echo chambers and learn from other people. And sometimes be cool in the face of criticism.
She's a scientist. In that area LLMs are quite useful in my opinion and part of my daily workflow. Quick scripts that use APIs to get data, cleaning the data and converting it. Quickly working with Polars data frames. Dumb daily stuff like "take this data from my CSV file and turn it into a LaTeX table" (something like the sketch below)... but most importantly freeing up time from tedious administrative tasks (let's not go into detail here).
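For that CSV-to-LaTeX kind of chore, the whole script is a few lines. A minimal sketch with pandas (the comment mentions Polars, but the idea is the same; the file and column names are invented for illustration):

```python
# Minimal sketch: CSV in, LaTeX table out. Uses pandas; file and column
# names are invented for illustration.
import pandas as pd

df = pd.read_csv("measurements.csv")      # e.g. columns: sample, value, error
df = df.round(3)                          # light cleanup before display
print(df.to_latex(index=False, caption="Measurement summary", label="tab:measurements"))
```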
Also great for brainstorming and quick drafting grant proposals. Anything prototyping and quickly glued together I'll go for LLMs (or LLM agents). They are no substitute for your own brain though.
I'm also curious about the hallucinated sources. I've recently read some papers on using LLM agents to conduct structured literature reviews, and they do it quite well and fairly reproducibly. I'm quite willing to build some LLM agents to reproduce my literature review process in the near future since it's fairly algorithmic. Check for surveys and reviews on the topic, scan for interesting papers within, check sources of sources, go through A-tier conference proceedings for the last X years and find relevant papers. Rinse, repeat.
I'm mostly bullish because of LLM-agents, not because of using stock models with the default chat interface.
LLM interactions are more a reflection of the user than of the technology itself. I'm starting to believe that people who think LLMs are terrible are suboptimal communicators. Because their input is bad, the output is too.
My experience mirrors hers. Asking questions is worthless because the answers are either 404 links, telling me how to use a search engine, or just flat out wrong and the code generated compiles maybe one time out of ten and when it does the implementation is usually poor.
When I evaluate them against areas where I possess professional expertise, I become convinced LLMs produce the Gell-Mann amnesia effect for any area I don't know.
I'm just glad I have something that can write logic in any unix shell script for me. It's the right combination of thing I don't have to do often, thing that doesn't fit my mental model, thing that works differently based on the platform, and thing I just can't be bothered to master.
I wonder when we can get an LLM which might be more "stupid" but knows what it does not know rather than hallucinating... Though perhaps by then it would not be an LLM but entirely different tech :)
To me, the bullishness stems from the observation that the current generation of LLMs can plausibly lead to a route to human-level intelligence / AGI, because they have this causal behavior: (memory/context + inputs) --> outputs. Vaguely similar to humans.
It feels like OP is not using the LLM tools correctly. Yes they hallucinate, but I've found they rarely do on the first run. It's only when you insist that they start doing it.
Experts will be in denial of LLMs for a long time, while the non-experts will swiftly use it to bridge their own knowledge gap. This is the use case for LLMs, maybe more so than 100% correctness.
Well strap in, because I only see this getting worse (or better, depending on your outlook). Faster chips, more power, more algorithm research and more data. I don't know what's coming exactly, but changes are coming fast.
If you have tried to use LLMs and find them useless or without value, you should seriously consider learning how to correctly use them and doing more research. It is literally a skill issue and I promise you that you are in the process of being left behind. In the coming years Human + AI cooperatives are going to far surpass you in terms of efficiency and output. You are handicapping yourself by not becoming good at using them. These things can deliver massive value NOW. You are going to be a grumbling gray beard losing your job to 22 year old zoomers who spend 10 hours a day talking to LLMs.
Weigh the three possibilities however you want, but I think in scenario 3 you are coping hard and the grey beard skills aren't nearly as valuable as you think. Is this not already the case? Grey beards already have well-known problems with age discrimination despite having such a unique and hard-to-earn skillset, no?
LLMs are not difficult to use and learn though, if so called "greybeard" skills are valueless (or close to) then knowing how to use an LLM certainly won't be valuable either!
Just like in the 2000-2010's knowing how to effectively Google things (while undoubtedly a skill) wasn't what made someone economically valuable.
>LLMs are not difficult to use and learn though, if so called "greybeard" skills are valueless (or close to) then knowing how to use an LLM certainly won't be valuable either!
I mean I challenge you to show you're as skilled at using LLMs as Janus or Pliny the Liberator.
I feel that sentiment on LLMs has almost approached a left/right style ideological divide, where both sides seem to have a totally different interpretation of what they are seeing and what is important.
I'm genuinely just unimpressed by computers and space craft. Sometimes they just won't boot up, or blow up. I also just am unimpressed by wireless 4G/5G internet,... the printing press. (sarcasm)
LLMs are not a miracle, they are a type of tool. The hype I am angry about is the “black magic? Well, clearly that can solve every problem” mode of thought.
A better model would be good for the stock market. The S&P 500 is shovel town. What would be bad is if AI fizzles out and doesn't deliver on hype valuations.
I realized it would be a bubble this Xmas when I ran across a neighbor of mine walking their dog. I told them what I did and they immediately asked if I worked with AI; they were looking for AI programmers. No regard for details, or what field, just get me AI programmers.
He's a person with money and he wants AI programmers. I bet there are millions like him.
Don't get me wrong though, I do believe in a future with LLMs. But I believe they will become more and more specialized for specific tasks. The more general an AI is, the more it's likely to fail.
It's more or less the same for every hype technology.
Most people only glance over the executive slides version of the description of the technology (which is understandable because they have different priorities).
Others have already told them that this is the "new thing" and they need it.
I use them every day and they work great. I even made a command (using Claude; actually Claude made everything in that script) that calls Gemini from the terminal so that I can ask shell-related questions directly there, just by doing: ai "how can I convert a webp to a png". The system prompt asks it to be brief, to use markdown (it displays nicely), notes that most questions are related to Linux, and provides information about my OS (uname -a). The last code block is also copied to the clipboard, which is super useful. I imagine there are plenty of similar utilities online.
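A rough Python take on that kind of helper, for illustration only. The commenter's version is a shell script; the google-generativeai usage, the "gemini-2.0-flash" model name, the GEMINI_API_KEY variable, the regex, and the xclip call are my assumptions, not their actual code.

```python
# Illustrative "ai" terminal helper. Assumes the google-generativeai package,
# a GEMINI_API_KEY env var, and xclip being installed.
import os
import re
import subprocess
import sys

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Tell the model about the OS, as the commenter describes.
uname = subprocess.run(["uname", "-a"], capture_output=True, text=True).stdout.strip()
system = (
    "Be brief. Answer in markdown. Most questions are about the Linux shell. "
    f"User's OS: {uname}"
)

model = genai.GenerativeModel("gemini-2.0-flash", system_instruction=system)
answer = model.generate_content(" ".join(sys.argv[1:])).text
print(answer)

# Copy the last fenced code block to the clipboard, as the commenter describes.
fence = "`" * 3
blocks = re.findall(fence + r"[^\n]*\n(.*?)" + fence, answer, flags=re.S)
if blocks:
    subprocess.run(["xclip", "-selection", "clipboard"], input=blocks[-1], text=True)
```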
I wonder if this site isn't impressed because there are a lot of 1% coders here that don't understand what most people do at work. It's mostly administrative. Take this spreadsheet, combine with this one, stuff it into power BI, email it to Debbie so she can spell check it and send to the client. Y'all forget there are companies that actually make things that don't have insane valuations like your bullshit apps do. A company that scopes municipal sewer pipes can't afford a $500k/yr dev, so there's a few $60k/yr office workers that fiddle with reports and spreadsheets all day. It's literally a whole department in some cases. Those are the jobs that are about to be replaced and there are a lot of those jobs out there.
I think it's because it's mostly old people which tends to come with being uninspired and jaded from previous hypes. There's always been the same attitude towards cryptocurrency. Young people are drawn to exciting new things that really break open their existing worldviews. Bitcoin was pioneered by teenagers and young men.
Also, people in general quickly adapt. LLMs are absolute sci-fi magic but you forget that easily. Here's a comedian's view on that phenomenon https://www.youtube.com/watch?v=nUBtKNzoKZ4
I think of LLMs like smart but unreliable humans. You don't want to use them for anything that you need to have right. I would never have one write anything that I don't subsequently go over with a fine-toothed comb.
With that said, I find that they are very helpful for a lot of tasks, and improve my productivity in many ways. The types of things that I do are coding and a small amount of writing that is often opinion-based. I will admit that I am somewhat of a hacker, and more broad than deep. I find that LLMs tend to be good at extending my depth a little bit.
From what I can tell, Sabine Hossenfelder is an expert in physics, and I would guess that she already is pretty deep in the areas that she works in. LLMs are probably somewhat less useful at this type of deep, fact-based work, particularly because of the issue where LLMs don't have access to paywalled journal articles. They are also less likely to find something that she doesn't know (unlike with my use cases, where they are very likely to find things that I don't know).
What I have been hearing recently is that it will take a long time before LLMs are better than humans at everything. However, they are already better than many, many humans at a lot of things.
1. Any low-hanging fruit that could easily be solved by an LLM probably would have been solved by someone already using standard methods.
2. Humans and LLMs have to spend some particular amount of energy to solve problems. Now, there are efficiencies that can lower/raise that amount of energy, but at the end of the day TANSTAAFL. Humans spend this in a lifetime of learning and eating, and LLMs spend this in GPU time and power. Even when AI gets to human level it's never going to abstract this cost away; energy still needs to be spent to learn.
So many people have shown me these stupid-ass AI summaries for random things, and if you have even a basic understanding of the relevant issue then the answers just seem bizarre. This feels like cheating on my homework and not understanding, like using Photomath on homework does.
My take is that if you expect current LLMs to be some near perfect, near-AGI models, then you're going to be sorely disappointed.
If that disappoints you to such a degree that you simply won't use them, you might find yourself in a position some years ahead - could be 1...could be 2...could be 5...could be 10 - who knows, but when the time comes, you might just be outdated and replaced yourself.
When you closely follow the incremental improvements of tech, you don't really fall for the same hype hysteria. If you on the other hand only look into it when big breakthroughs are made, you'll get caught in the hype and FOMO.
And even if you don't want to explicitly use the tools, at least try to keep some surface-level attention to the progress and improvements.
I honestly believe that there are many, many senior engineers / scientists out there that currently just scoff at these models, and view them as some sort of toy tech that is completely overblown and overhyped. They simply refuse to use the tools. They'll point to some specific time a LLM didn't deliver, roll their eyes, and call it useless.
Then when these tools progress, and finally meet their standards, they will panic and scramble to get into the loop. Meanwhile their non-tech bosses and executives will see the tech as some magic that can be used to reduce headcount.
I have a very specific esoteric question like: "What material is both electrically conductive and good at blocking sound?" I could type this into google and sift through the titles and short descriptions of websites and eventually maybe find an answer, or I can put the question to the LLM and instantly get an answer that I can then research further to confirm.
This is significantly faster, more informative, more efficient, and a rewarding experience.
As others have said, its a tool. A tool is as good as how you use it. If you expect to build a house by yelling at your tools I wouldn't be bullish either.
I mean, the first link I got when I pasted that in is probably the Stack Exchange thread you would use to research further, along with other sources, which do seem relevant to the query.
I don't see how an LLM is significantly faster or more informative, since you still have to do the legwork to validate the answer. I guess if you're google-phobic (which a lot of people seem to be, especially on HN) then I can see how it's more rewarding to put it off until later in the process.
You can boil it down to this: what's easier for most people, looking through websites and search engine results for answers, or asking in plain language? The answer is pretty obvious.
The validity of the answers is not 1:1 with its potential profitability.
Like James Baldwin said "people love answers, but hate questions."
Getting an answer faster is exponentially better than getting the more precise, more right, more nuanced answer, for most people, every time. Doing the due diligence is smart, but it's also after the fact.
It's just an example and you can "well actually" until the cows come home, but you're missing the point. I'm sure there are things you have found hard to google. Also, you are most likely (as you're commenting on Hacker News) not a good representation of the majority of the world.
I don't know if you have noticed, but Google is clearly using LLM technology in conjunction with its search results, so the assumption that they are just using traditional tech and not LLMs to inform or modify the result set is, I think, unlikely.
The first stackexchange link I see answers the question of thermal conductivity, not electrical. Google is convinced I didn’t actually mean electrical. Forcing it to include electrical brings up nothing of use.
The Google AI summary suggests MLV which is wrong.
ChatGPT suggests using copper which is also wrong.
A material that is both electrically conductive and good at blocking sound is:
Lead (Pb)
• Electrical conductivity: Lead is a metal, so it conducts electricity, although it’s not the most conductive (lower than copper or silver).
• Sound blocking: Lead is excellent at blocking sound due to its high density and mass, which help attenuate airborne sound effectively.
Other options depending on application:
Composite materials:
• Metal-rubber composites or metal-polymer composites can be engineered to conduct electricity (via embedded conductive metal layers or fillers) and block sound (due to the damping properties of the polymer/rubber layer).
Graphene or carbon-filled rubber:
• Electrically conductive due to graphene/carbon content.
• Sound damping from rubber base.
• Used in some specialized industrial or automotive applications.
Let me know if you need it optimized for a specific use case (e.g., lightweight, flexible, non-toxic).
I have a prompt personalization that says I'm a scientist / engineer. Perhaps that's why it gave me a better answer. If you consider the multitude of contexts you could ask this question in, it makes sense to give it a little personal background.
Use LLMs for what they are good at. Most of my prompts start with “What are my options for … ?” And they excel at that, particularly the recent reasoning models with the ability to search the web. They can help expand your horizons and analyze pros/cons from many angles.
Just today, I was thinking of making changes to my home theater audio setup and there are many ways to go about that, not to mention lots of competing products, so I asked ChatGPT for options and gave it a few requirements. I said I want 5.1 surround sound, I like the quality and simplicity of Sonos, but I want separate front left and right speakers instead of “virtual” speakers from a soundbar. I waited years thinking Sonos would add that ability, but they never did. I said I’d prefer to use the TV as the hub and do audio through eARC to minimize gaming latency and because the TV has enough inputs anyway, so I really don’t need a full blown AV receiver. Basically just a DAC/preamp that can handle HDMI eARC input and all of the channels.
It proceeded to tell me that audio-only eARC receivers that support surround sound don’t really exist as an off-the-shelf product. I thought, “What? That can’t be right, this seems like an obvious product. I can’t be the first one to have thought of this.” Turns out it was right, there are some stereo DAC/preamps that have an eARC input and I could maybe cobble together one as a DIY project, but nothing exactly like what I wanted. Interesting!
ChatGPT suggested that it’s probably because by the time a manufacturer fully implements eARC and all of the format decoding, they might as well just throw in a few video inputs for flexibility and mass-market appeal, plus one less SKU to deal with. And that kind of makes sense, though it adds excess buttons and bothers me from a complexity standpoint.
It then suggested WISA as a possible solution, which I had never heard of, and as a music producer I pay a lot of attention to speaker technology, so that was interesting to me. I’m generally pretty skeptical of wireless audio, as it’s rarely done well, and expensive when it is done well. But WISA seems like a genuine alternative to an AV receiver for someone who only wants it to do audio. I’m probably going to go with the more traditional approach, but it was fun learning about new tech in a brainstorming discussion. Google struggles with these sorts of broad research queries in my experience. I may or may not have found out about it if I had posted on Reddit, depending on whether someone knowledgeable happened to see my post. But the LLM is faster and knows quite a bit about many subjects.
I also can’t remember the last time it hallucinated when having a discussion like this. Whereas, when I ask it to write code, it still hallucinates and makes plenty of mistakes.
I mean the answer is simple, money. There's a bajillion dollars getting shoved into this crap, and most of the bulls are themselves pushing some AI thing. Look at YC, pretty much everything they're funding has some mention of AI. It's a massive bubble with people trying to cash in on the hype. Plus the managerial class being the scum they are, they're bullish because they keep getting sold on the idea that they can replace swathes of workers.
This is just the beginning - and it need not be nation states. Imagine instead of Russian disinfo, it is oil companies doing the same thing with positive takes on oil, climate change is a hoax, etc. Or <insert religion> pushing a narrative against a group they are against.
I have been using Claude this week for the first time on a _slightly_ bigger SwiftUI project than the few lines of bash or SQL I used it for before. I have never used Swift before, but I am amazed how much Claude could do. It feels to me as if we are at the point where anyone can now generate small tools with low effort for themselves. Maybe not production ready, but good enough to use yourself. It feels like it should be good enough to empower the average user to break out of having to rely on pre-made apps to do small things. Kind of like bash for the average Joe.
What worked:
- generated a mostly working PoC with minimal input and a hallucinated UI layout, color scheme, etc. This is amazing because it did not bombard me with detailed questions; it just carried on and provided me with a baseline that I could then fine-tune
- it corrected build issues by me simply copy pasting the errors from Xcode
- got APIs working
- added debug code when it could not fix an issue after a few rounds
- resolved an API issue after I pointed it to a typescript SDK to the API (I literally gave a link to the file and told it, try to use this to work out where the problem is)
- it produces code very fast
What is not working great yet:
- it started off with one large file and crashed soon after because it hit a timeout when regenerating the file. I needed to ask it to split the file into a typical project structure
- some logic I asked it to implement explicitly got changed at some point during an unrelated task. To prevent this in future I asked it to mark this code part as important and to only change it at explicit request. I don't know yet how long this code will stay protected for
- by the time enough context got built up, usage warnings popped up in Claude
- only so many files are supported atm
So my takeaway is that it is very good at translating, i.e. API docs into code, errors into fixes. There is also a fine line between providing enough context and running out of tokens.
I am planning to continue my project to see how far I can push it. As I am getting close to the limit of the token size now, I am thinking of structuring my app in a Claude friendly way:
- clear internal APIs. Kind of like header files so that I can tell Claude what functions it can use without allowing it to change them or needing to tokenize the full source code
- adversarial testing. I don't have tests yet, but I am thinking of asking one dedicated instance of Claude to generate tests. I will use other Claude instances for coding and provide them with failing test outputs, like I do now with build errors, and I hope it will fix itself similarly (roughly like the sketch below).
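A rough sketch of that adversarial-test loop, purely illustrative: ask_claude() is a hypothetical stand-in for however you call the model, and the xcodebuild invocation is a placeholder test command to adapt to your project.

```python
# Illustrative sketch of the adversarial-test loop: run the test suite, feed
# failures back to a coding model, repeat until green or out of rounds.
import subprocess

TEST_COMMAND = ["xcodebuild", "test", "-scheme", "MyApp"]  # placeholder; adjust to your project


def ask_claude(prompt: str) -> str:
    raise NotImplementedError("plug in your Claude (or other LLM) call here")


def run_tests() -> tuple[bool, str]:
    result = subprocess.run(TEST_COMMAND, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def fix_until_green(max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        ok, output = run_tests()
        if ok:
            return True
        suggestion = ask_claude(
            "These tests are failing. Propose a fix to the implementation, "
            "not to the tests:\n" + output
        )
        print(suggestion)  # in practice: review and apply the suggested change, then loop
    return False
```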
Perhaps attitudes to this new phenomenon are correlated with propensity to skepticism in general.
I will cite myself as Exhibit A. I am the sort of person who takes almost nothing at face value. To me, physiotherapy, and oenology, and musicology, and bed marketing, and mineral-water benefits, and very many other such things, are all obviously pseudoscience, worthy of no more attention than horoscopes. If I saw a ghost I would assume it was a hallucination caused by something I ate.
So it seems like no coincidence that I reflexively ignore the AI babble at the top of search results. After all, an LLM is a language-rehashing machine which (as we all know by now) does not understand facts. That's terribly relevant.
I remember reading, a couple of years back, about some Very Serious Person (i.e. a credible voice, I believe some kind of scientist) who, after a three-hour conversation with ChatGPT, had become convinced that the thing was conscious. Rarely have I rolled my eyes so hard. It occurred to me then that skepticism must be (even) less common a mindset than I assumed.
Not related to the post, but you don't believe in physiotherapy? You know, the "help grandma go from a wheelchair to a walker after a fall through controlled exercise" people. You don't mean chiropractors, right? Terms do get weird country to country, so maybe in the US it's more woo-adjacent.
I genuinely don’t understand why everyone is so hyper polarized on LLMs. I find them to be fantastically useful tools in certain situations, and definitely one of the most impressive technological developments in decades. It isn’t some silver-bullet, solve all issues solution. It can be wrong and you can’t simply take things it outputs for granted. But that does not make it anywhere near useless or any less impressive. This is especially true considering how truly young the technology is. It literally didn’t exist 10 years ago, and the iteration that thrust them into the public is less than 3 years old and has already advanced remarkably. I find the idea that they are useless snake-oil to be just as deluded of a take as the people claiming we have solved AGI.
Regardless of whether she's right or not, isn't she the same person who recently said that Europe needs to invest tens of billions in foundation models as soon as possible?
As far as I understand LLMs, what is being asked is unfortunately close to impossible with an LLM.
Also, I find it disingenuous that apologists are saying things close to "you are using it wrong", when LLM-based AI is advertised as something that should be trusted more and more (because it is more accurate, based on some arbitrary metrics) and that might save some time (on some undescribed task).
Of course, in that use case most would say to use your judgement to verify whatever is generated, but for the generation that is using LLMs as a source of knowledge (like some people use Wikipedia or Stack Overflow as a source of truth) it will be difficult to verify, when LLM-generated content is all they have ever known as a source of knowledge.
Not everyone is as impressed as us by new tech, especially when it's kinda buggy.
The article makes a lot of good points. I get a lot of slop responses to both coding and non-coding prompts, but I've also gotten some really really good responses, especially code completion from Copilot. Even today, ChatGPT saved me a ton of Google searches.
I'm going to continue using it and taking every response with a grain of salt. It can only get better and better.
Never underestimate the momentum of a poorly understood idea that appears as magic to the average person. Once the money starts flowing, that momentum will increase until the idea hits a brick wall that its creators can't gaslight people about.
I hope that realization happens before "vibe coding" is accepted as standard practice by software teams (especially when you consider the poor quality of software before the LLM era). If not, it's only a matter of time before we refer to the internet as "something we used to enjoy."
I watched this creator on YT and she is dissatisfied with LLMs all the time. I just blocked her so I don't see her videos, because they give off nothing but a bad aura. Wishing her good luck finding a topic she will be happy to discuss.
I like them, but I still feel like I am employing a bunch of overconfident, narcissistic engineers, and if I ask them to do something, I am never really comfortable that the result is correct.
What I want is a workforce where I can pass off a request and go home confident that it was correctly implemented.
Written by some grifter who knows their game is up due to generative AI.
These, along with the whiny creatives who think progress should be halted so that they can continue to be a bottleneck, are the only people moaning about AI.
It's weird that LLM bulls do not directly engage with the criticism, which I would characterize as a fairly standard criticism of hallucination. Hallucinations are a problem. I don't know who Sabine is but I don't have to.
Same with the top reply from Teortaxes containing zero relevant information, which Twitter in its infinite wisdom has decided is the “most relevant” reply. (The second “most relevant” reply is some ad for some bs crypto newsletter.)
It's weird that you'd guess (wrong) that I'm an "LLM bull", then start generalizing the actions of this supposed group. Hallucinations are a problem, no news value there. Sabine seems not to be able to get value out of LLMs, and concludes that they aren't useful. The logic is not exactly impressive.
It's a tool. It can be useful, it doesn't always work. Some people claim it's better than it is, some people claim it's worse. This isn't exactly rocket science.
Regardless of the tweet in question, Sabine is a grifter. Her novel takes on academia being some kind of conspiracy of people milking the system, and on physicists not being interested in making new discoveries, are nonsensical and only serve to increase her own profile. Look at this video of her trying to convince the world she received an email that apparently proves all her points correct. My BS detector tells me she wrote that email herself, but you be the judge: https://www.youtube.com/watch?v=shFUDPqVmTg
I get so confused on this. I play around, test, and mess with LLMs all the time and they are miraculous. Just amazing, doing things we dreamed about for decades. I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out. It talks to me like a person. It generates really cool images. It helps me write code. And just tons of other stuff that astounds me.
And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly? This is the most amazing technology I've experienced as a 50+ year old nerd that has been sitting deep in tech for basically my whole life. This is the stuff of science fiction, and while there totally are limitations, the speed at which it is progressing is insane. And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
Crazy.
The technology is not just less than superintelligence, for many applications it is less than prior forms of intelligence like traditional search and Stack Exchange, which were easily accessible 3 years ago and are in the process of being displaced by LLMs. I find that outcome unimpressive.
And this Tweeter's complaints do not sound like a demand for superintelligence. They sound like a demand for something far more basic than the hype has been promising for years now. - "They continue to fabricate links, references, and quotes, like they did from day one." - "I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error." (Why have these companies not manually engineered out a problem like this by now? Just do a check to make sure links are real. That's pretty unimpressive to me.) - "They reference a scientific publication, I look it up, it doesn't exist." - "I have tried Gemini, and actually it was even worse in that it frequently refuses to even search for a source and instead gives me instructions for how to do it myself." - "I also use them for quick estimates for orders of magnitude and they get them wrong all the time. " - "Yesterday I uploaded a paper to GPT to ask it to write a summary and it told me the paper is from 2023, when the header of the PDF clearly says it's from 2025. "
A municipality in Norway used LLM to create a report about the school structure in the municipality (how many schools are there, how many should there be, where should they be, how big should they be, pros and cons of different size schools and classes etc etc). Turns out the LLM invented scientific papers to use as references and the whole report is complete and utter garbage based on hallucinations.
And that says… what? The entire LLM technology is worthless for all applications, from all implementations?
A company I worked for spent millions on a customer service solution that never worked. I wouldn’t say that contracted software is useless.
I agree. I use LLMs heavily for gruntwork development tasks (porting shell scripts to Ansible is an example of something I just applied them to). For these purposes, it works well. LLMs excel in situations where you need repetitive, simple adjustments on a large scale. IE: swap every postgres insert query, with the corresponding mysql insert query.
A lot of the "LLMs are worthless" talk I see tends to follow this pattern:
1. Someone gets an idea, like feeding papers into an LLM, and asks it to do something beyond its scope and proper use-case.
2. The LLM, predictably, fails.
3. Users declare not that they misused the tool, but that the tool itself is fundamentally corrupted.
It in my mind is no different to the steam roller being invented, and people remaking how well it flattens asphalt. Then a vocal group trying to use this flattening device to iron clothing in bulk, and declaring steamrollers useless when it fails at this task.
>swap every postgres insert query, with the corresponding mysql insert query.
If the data and relationships in those insert queries matter, at some unknown future date you may find yourself cursing your choice to use an LLM for this task. On the other hand you might not ever find out and just experience a faint sense of unease as to why your customers have quietly dropped your product.
I hope people do this and royally mess shit up.
Maybe then they’ll snap out of it.
I’ve already seen people completely mess things up. It’s hilarious. Someone who thinks they’re in “founder mode” and a “software engineer” because chatgpt or their cursor vomited out 800 lines of python code.
The vileness of hoping people suffer aside, anyone who doesn’t have adequate testing in place is going to fail regardless of whether bad code is written by LLMs or Real True Super Developers.
What vileness? These are people who are gleefully sidestepping things they don't understand and putting tech debt onto others.
I'd say maybe up to 5-10 years ago, there was an attitude of learning something to gain mastery of it.
Today, it seems like people want to skip levels which eventually leads to catastrophic failure. Might as well accelerate it so we can all collectively snap out of it.
The mentality you're replying to confuses me. Yes, people can mess things up pretty badly with AI. But I genuinely don't understand why the assumption that anyone using AI is also not doing basic testing, or code review.
Probably better to have AI help you write a script to translate postgres statements to mysql
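A minimal sketch of what such a script might look like, assuming only trivial INSERT statements (quoted identifiers, boolean literals, a RETURNING clause); anything fancier deserves a real SQL parser:

```python
import re

def pg_insert_to_mysql(stmt: str) -> str:
    """Translate a simple Postgres INSERT into MySQL syntax (toy example, not production-ready)."""
    out = re.sub(r'"([A-Za-z_][A-Za-z0-9_]*)"', r'`\1`', stmt)             # "ident" -> `ident`
    out = re.sub(r'\bTRUE\b', '1', out, flags=re.IGNORECASE)               # boolean literals
    out = re.sub(r'\bFALSE\b', '0', out, flags=re.IGNORECASE)
    out = re.sub(r'\s+RETURNING\s+[^;]*;', ';', out, flags=re.IGNORECASE)  # MySQL has no RETURNING
    return out

print(pg_insert_to_mysql('INSERT INTO "users" (name, active) VALUES (\'Ada\', TRUE) RETURNING id;'))
# INSERT INTO `users` (name, active) VALUES ('Ada', 1);
```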
Right, which is why you go back and validate code. I'm not sure why the automatic assumption that implementing AI in a workflow means you blindly accept the outputs. You run the tool, you validate the output, and you correct the output. This has been the process with every new engineering tool. I'm not sure why people assume first that AI is different, and second that people who use it are all operating like the lowest common denominator AI slop-shop.
In this analogy, are all the steamroller manufacturers loudly proclaiming how well it 10xes the process of bulk ironing clothes?
And is a credulous executive class en masse buying into that steam roller industry marketing and the demos of a cadre of influencer vibe ironers who’ve never had to think about the longer term impacts of steam rolling clothes?
> porting shell scripts to Ansible
Thank you for mentioning that! What a great example of something an LLM can pretty well do that otherwise can take a lot of time looking up Ansible docs to figure out the best way to do things. I'm guessing the outputs aren't as good as someone real familiar with Ansible could do, but it's a great place to start! It's such a good idea that it seems obvious in hindsight now :-)
Exactly, yeah. And once you look over the Ansible, it's a good place to start and expand. I'll often have it emit helm charts for me as templates, then after the tedious setup of the helm chart is done, the rest of it is me manually doing the complex parts, and customizing in depth.
Plus, it's a generic question; "give a helm chart for velero that does x y and z" is as proprietary as me doing a Google search for the same, so you're not giving proprietary source code to OpenAI/wherever so that's one fewer thing to worry about.
Yeah, I tend to agree. The main reason that I use AI for this sort of stuff is it also gives me something complete that I can then ask questions about, and refine myself. Rather than the fragmented documentation style "this specific line does this" without putting it in the context of the whole picture of a completed sample.
I'm not sure if it's a facet of my ADHD, or mild dyslexia, but I find reading documentation very hard. It's actually a wonder I've managed to learn as much as I have, given how hard it is for me to parse large amounts of text on a screen.
Having the ability to interact with a conversational type documentation system, then bullshit check it against the docs after is a game changer for me.
that's another thing! people are all "just read the documentation". the documentation goes on and on about irrelevant details, how do people not see the difference between "do x with library" -> "code that does x", and having to read a bunch of documentation to make a snippet of code that does the same x?
I'm not sure I follow what you mean, but in general yes. I do find "just read the docs" to be a way to excuse not helping team members. Often docs are not great, and tribal knowledge is needed. If you're in a situation where you're either working on your own and have no access to that, or in a situation where you're limited by the team member's willingness to share, then AI is an OK alternative within limits.
Then there's also the issue that examples in documentation are often very contrived, and sometimes more confusing. So there's value in "work up this to do such and such an operation" sometimes. Then you can interrogate the functionality better.
No, it says that people dislike liars. If you are known for making up things constantly, you might have a harder time gaining trust, even if you're right this time.
All of these things can be true at the same time:
1. LLMs have been massively overhyped, including by some of the major players.
2. LLMs have significant problems and limitations.
3. LLMs can do some incredibly impressive things and can be profoundly useful for some applications.
I would go so far as to say that #2 and #3 are hardly even debatable at this point. Everyone acknowledges #2, and the only people I see denying #3 are people who either haven't investigated or are so annoyed by #1 that they're willing to sacrifice their credibility as an intellectually honest observer.
#3 can be true and yet not be enough to make your case. Many failed technologies achieved impressive engineering milestones. Even the harshest critic could probably brainstorm some niche applications for a hallucination machine or whatever.
And yet we keep electing them to public office.
If it makes data up, then it is worthless for all implementations. I'd rather it said I don't have info on this question.
It only makes it worthless for implementations where you require data. There's a universe of LLM use cases that aren't asking ChatGPT to write a report or using it as a Google replacement.
The problem is that, yes, LLMs are great when working on some regular thing for the first time. You can get started at a speed never before seen in the tech world.
But as soon as your use case goes beyond that, LLMs are almost useless.
The main complaint is that, yes, it's extremely helpful in that specific subset of problems, but it's not actually pushing human knowledge forward. Nothing novel is being created with it.
It has created this illusion of being extremely helpful when in reality it is a shallow kind of help.
> If it makes data up, then it is worthless for all implementations.
Not true. It's only worthless for the things you can't easily verify. If you have a test for a function and ask an LLM to generate the function, it's very easy to say whether it succeeded or not.
In some cases, just being able to generate the function with the right types will mostly mean the LLM's solution is correct. Want a `List(Maybe a) -> Maybe(List(a))`? There's a very good chance a LLM will either write the right function or fail the type check.
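To make that concrete: the `List(Maybe a) -> Maybe(List(a))` example is essentially Haskell's `sequence`; here is roughly the same idea in Python, where a couple of asserts are enough to tell whether an LLM-generated version is correct:

```python
from typing import Optional, TypeVar

T = TypeVar("T")

def sequence(xs: list[Optional[T]]) -> Optional[list[T]]:
    """Return the unwrapped list, or None if any element is missing."""
    out: list[T] = []
    for x in xs:
        if x is None:
            return None
        out.append(x)
    return out

# Cheap to verify, regardless of who (or what) wrote the function:
assert sequence([1, 2, 3]) == [1, 2, 3]
assert sequence([1, None, 3]) is None
assert sequence([]) == []
```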
> all implementations
Are you speaking for yourself or everyone?
Does “it” apply to Homo sapiens as well?
Except value isn't polarised like that.
In a research context, it provides pointers and keywords for further investigation. In a report-writing context it provides textual content.
Neither of these, nor the thousand other uses, is worthless. It's when you expect a working and complete work product that it's (subjectively, maybe) worthless, but frankly aiming for that with current-gen technology is a fool's errand.
It says that people need training on what the appropriate use-cases for LLMs are.
This is not the type of report I'd use an LLM to generate. I'd use a database or spreadsheet.
Blindly using and trusting LLMs is a massive minefield that users really don't take seriously. These mistakes are amusing, but eventually someone is going to use an LLM for something important and hallucinations are going to be deadly. Imagine a pilot or pharmacist using an LLM to make decisions.
Some information needs to come from authoritative sources in an unmodified format.
It says we don't have a lower bound on the effectiveness.
It's (currently) like an ad saying "this product can improve your stuff up to 300%"
It mostly says that one of the seriously difficult challenges with LLMs is a meta-challenge:
* LLMs are dangerously useless for certain domains.
* ... but can be quite useful for others.
* The real problem is: They make it real tricky to tell, because most of all they are trained to sound professional and authoritative. They hallucinate papers because that's what authoritative answers look like.
That already means I think LLMs are far less useful than they appear to be. It doesn't matter how amazing a technology is: If it has failure modes and it is very difficult to know what they are, it's dangerous technology no matter how awesome it is when it is working well. It's far simpler to deal with tech that has failure modes but you know about them / once things start failing it's easy to notice.
Add to it the incessant hype, and, oh boy. I am not at all surprised that LLMs have a ridiculously wide range as to detractors/supporters. Supporters of it hype the everloving fuck out of it, and that hype can easily seem justified due to how LLMs can produce conversational, authoritative sounding answers that are explicitly designed to make your human brain go: Wow, this is a great answer!
... but experts read it and can see the problems there. Which lots of tech suffers from. As a random example, plenty of highly upvoted, apparently fantastically written Stack Overflow answers have problems: it's a great answer... for 10 years ago; it is a bad idea today because the answer has been obsoleted.
But between the fact that it's overhyped and particularly complex to determine an LLM answer is hallucinated drivel, it's logical to me that experts are hyperbolic when highlighting the problems. That's a natural reaction when you have a thing that SEEMS amazing but actually isn't.
> Stack Overflow answers have problems. For example, it's a great answer... for 10 years ago
To be fair, that's a huge problem with stack overflow and its culture. A better version of stack overflow wouldn't have that particular issue.
You, and the OP, are being unfair in your replies. Obviously it's not worthless for all applications, but when LLMs obviously fail in disastrous ways in some important areas, you can't refute that by going "actually it gives me coding advice and generates images".
That's nice and impressive, but there are still important issues and shortcomings. Obligatory, semirelated xkcd: https://xkcd.com/937/
> And that says… what? The entire LLM technology is worthless for all applications, from all implementations?
You're the first in the thread to have brought that up; there are far more charitable ways to have interpreted the post you're replying to.
That software just didn't work that way. I don't think it tried to convince the users that they were wrong by spouting nonsense that seems legitimate.
All of these anecdotal stories about "LLM" failures need to go into more detail about what model, prompt, and scaffolding was used. It makes a huge difference. Were they using Deep Research, which searches for relevant articles and brings facts from them into the report? Or did they type a few sentences into ChatGPT Free and blindly take it on faith?
LLMs are _tools_, not oracles. They require thought and skill to use, and not every LLM is fungible with every other one, just like flathead, Phillips, and hex-head screwdrivers aren't freely interchangeable.
If any non-trivial ask of an LLM also requires the prompts/scaffolding to be listed, and independently verified, along with its output, their utility is severely diminished. They should be saving time not giving us extra homework.
Far better to just get these problems resolved.
That isn't what I'm saying. I'm saying you can't make a blanket statement that LLMs in general aren't fit for some particular task. There are certainly tasks where no LLM is competent, but for others, some LLMs might be suitable while others are not. At least some level of detail beyond "they used an LLM" is required to know whether a) there was user error involved, or b) an inappropriate tool was chosen.
then they shouldn't market it as one-size fits all
Are they? Every foundation model release includes benchmarks with different levels of performance in different task domains. I don't think I've seen any model advertised by its creating org as either perfect or even equally competent across all domains.
The secondary market snake oil salesmen <cough>Manus</cough>? That's another matter entirely and a very high degree of skepticism for their claims is certainly warranted. But that's not different than many other huckster-saturated domains.
People like Zuckerberg go around claiming most of their code will be written by AI starting sometime this year. Other companies are hearing that and using it as a reason(or false cover) for layoffs. The reality is LLMs still have a way to go before replacing experienced devs and even when they start getting there there will be a period of time where we’re learning what we can and can’t trust them with and how to use them effectively and responsibly. Feels like at least a few years from now, but the marketing says it’s now.
In many, many cases those problems are resolved by improvements to the model. The point is that making a big deal about LLM fuck ups in 3 year old models that don't reproduce in new ones is a complete waste of time and just spreads FUD.
Did you read the original tweet? She mentions the models and gives high level versions of her prompts. I'm not sure what "scaffolding" is.
You're right that they're tools, but I think the complaint here is that they're bad tools, much worse than they are hyped to be, to the point that they actually make you less efficient because you have to do more legwork to verify what they're saying. And I'm not sure that "prompt training," which is what I think you're suggesting, is an answer.
I had several bad experiences lately. With Claude 3.7 I asked how to restore a running database in AWS to a snapshot (RDS, if anyone cares). It basically said "Sure, just go to the db in the AWS console and select 'Restore from snapshot' in the actions menu." There was no such button. I later read AWS docs that said you cannot restore a running database to a snapshot, you have to create a new one.
I'm not sure that any amount of prompting will make me feel confident that it's finally not making stuff up.
I was responding to the "they used an LLM" story about the Norwegian school report, not the original tweet. The original tweet has a great level of detail.
I agree that hallucination is still a problem, albeit a lot less of one than it was in the recent past. If you're using LLMs for tasks where you are not directly providing it the context it needs, or where it doesn't have solid tooling to find and incorporate that context itself, that risk is increased.
Why do you think these details are important? The entire point of these tools is that I am supposed to be able to trust what they say. The hard work is precisely to be able to spot which things are true and false. If I could do that I wouldn't need an assistant.
> The entire point of these tools is that I am supposed to be able to trust what they say
Hard disagree, and I feel like this assumption might be at the root of why some people seem so down on LLMs.
They’re a tool. When they’re useful to me, they’re so useful they save me hours (sometimes days) and allow me to do things I couldn’t otherwise, and when they’re not they’re not.
It never takes me very long to figure out which scenario I’m in, but I 100% understand and accept that figuring that out is on me and part of the deal!
Sure, if you think you can "vibe code" (or "vibe founder") your way to massive success by getting LLMs to do stuff you're clueless about, without any way to check, you're going to have a bad time; but the fact they can't (so far) do that doesn't make them worthless.
Because then I can know whether the hallucinations they encountered are a little surprising, or not surprising at all.
Because it's the difference between a fleshy hallucination and something that might related to reality.
> Why do you think these details are important?
It's https://en.wikipedia.org/wiki/Sealioning
Sounds like a user problem, though. When used properly as a tool they are incredible. When you give up 100% trust to them to be perfect it’s you that is making the mistake.
Well yeah, it's fancy autocomplete. And it's extremely amazing what 'fancy autocomplete' is able to do, but making the decision to use an LLM for the type of project you described is effectively just magical thinking. That isn't an indictment against LLM, but rather the person who chose the wrong tool for the job.
This is more a lack of understanding of it's limitations, it'd be different if they asked for it to write a python script to collate the data.
If the LLM is intelligent, why can’t it figure out that writing a script would be the best way to solve the problem?
Some of the more modern tools do exactly that. If you upload a CSV to Claude, it will not (or at least not anymore) try to process the whole thing. It will read the header, and then ask you what you want. It will then write the appropriate Javascript code and run it to process the data and figure out the stats/whatever you asked it for.
I recently did this with a (pretty large) exported CSV of calories/exercise data from MyFitnessPal and asked it to evaluate it against my goals/past bloodwork etc (which I have in a "Claude Project" so that it has access to all that information + info I had it condense and add to the project context from previous convos).
It wrote a script to extract out extremely relevant metrics (like ratio of macronutrients on a daily basis for example), then ran it and proceeded to talk about the result, correlating it with past context.
Use the tools properly and you will get the desired results.
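The kind of script it produces tends to look something like the sketch below (the column names here are my guesses, not the real MyFitnessPal export format): read the header first, then aggregate per-day macronutrient ratios.

```python
import csv
from collections import defaultdict

PATH = "myfitnesspal_export.csv"  # hypothetical file name

with open(PATH, newline="") as f:
    reader = csv.DictReader(f)
    print("Columns:", reader.fieldnames)  # the "read the header before doing anything" step
    daily = defaultdict(lambda: {"protein": 0.0, "carbs": 0.0, "fat": 0.0})
    for row in reader:
        day = daily[row["Date"]]
        day["protein"] += float(row.get("Protein (g)") or 0)
        day["carbs"] += float(row.get("Carbohydrates (g)") or 0)
        day["fat"] += float(row.get("Fat (g)") or 0)

for date, macros in sorted(daily.items()):
    total = sum(macros.values()) or 1.0
    print(date, {k: round(v / total, 2) for k, v in macros.items()})
```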
ChatGPT has been able to do exactly that (using its Code Interpreter tool) for two years now. Gemini and Claude have similar features.
Often they will do exactly that; currently their reasoning isn't the best, so you may have to coax them to take the best path. They're also making a judgement call when writing the code, so that's worth checking too. No different to a senior instructing an intern.
Ah, it's like communism, then (to its diehards). It cannot fail, it can only be failed.
Please explain how what I am saying is wrong?
This is an odd non-sequitur.
So they used the model as a database? It should be immediately obvious to anyone that this won't work.
"an old poorly implemented model can't do item X well therefore the technology is garbage"
Likely the most accurate measure of progress would be watching detractors' goalposts move over time.
"Even a journey of 1,000 miles begins with the first step. Unless you're an AI hyper then taking the first step is the entire journey - how dare you move the goalposts"
"They continue to fabricate links, references, and quotes, like they did from day one." - "I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error."
Why have these companies not manually engineered out a problem like this by now? Just do a check to make sure links are real. That's pretty unimpressive to me.
There are no fabricated links, references, or quotes, in OpenAI's GPT 4.5 + Deep Research.
It's unfortunate the cost of a Deep Research bespoke white paper is so high. That mode is phenomenal for pre-work domain research. You get an analyst's two-week writeup in under 20 minutes, for the low cost of $200/month (though I've seen estimates that such a white paper costs OpenAI over USD 3000 to produce for you, which explains the monthly limits).
You still need to be a domain expert to make use of this, just as you need to be to make use of an analyst. Both the analyst and Deep Research can generate flawed writeups with similar misunderstandings: mis-synthesis, misapplication, or omission of something essential.
Neither analyst nor LLM is a substitute for mastery.
While I agree, it doesn't stop business folks pushing for its use in areas where it is inappropriate. That is, at least for me, part of the skepticism.
How do people in the future become domain experts capable of properly making use of it if they are not the analyst spending two weeks on the write-up today?
My complaint with Deep Research LLMs is that they don't go deeper than 2 pages of SERPs. I want them to dig into obscure stuff, not list cursorily relevant peripheral directions. They just seem to do breadth-first rather than depth-first search.
This assessment is incomplete. Large language models are both less and more than these traditional tools. They have not subsumed them, and all can sit together in separate tabs of a single browser window. They are another resource, and when the conditions are right, which is often the case in my experience, they are a startlingly effective tool for navigating the information landscape. The criticism of Gemini is a fair one, and I encountered it yesterday, though perhaps with 50% less entitlement. But Gemini also helped me translate obscure termios APIs to Python from C source code I provided. The equivalent using search and/or Stack Overflow would have required multiple piecemeal searches without guarantees -- and definitely would have taken much more time.
The 404 links are hilarious, like you can't even parse the output and retry until it returns a link that doesn't 404? Even ignoring the billions in valuation, this is so bad for a $20 sub.
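Even a crude post-processing pass would catch most of these. A sketch of the kind of check meant here, using only the Python standard library (when and how to retry on the model side is a separate question):

```python
import urllib.request
import urllib.error

def link_is_live(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers a HEAD request with a non-error status."""
    req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": "link-checker"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def usable_citations(urls: list[str]) -> list[str]:
    """Keep only links that resolve; anything else should trigger a retry or an honest 'no source found'."""
    return [u for u in urls if link_is_live(u)]
```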
The tweeter's complaints sound like a user problem. LLMs are tools. How you use them, when you use them, and what you expect out of them should be based on the fact that they are tools.
I’m sorry but the experience of coding with an LLM is about ten billion times better than googling and stack overflowing every single problem I come across. I’ve stack overflowed maybe like two things in the past half year and I’m so glad to not have to routinely use what is now a very broken search engine and web ecosystem.
How did you measure and compare googling/stack overflow to coding with an LLM? How did you get to the very impressive number ten billion times better?! Can you share your methodology? How have you defined better?
I take calipers to my boss’s forehead veins and see how pissed he is routinely throughout the day
It's broken now. It was fine 5 years ago.
The search ecosystem is broken now because google is focused on LLMs
That's part of it. The other part is Google sacrificing product quality for excessive monetization. An example would be YouTube search - first three results are relevant, next 12 results are irrelevant "people also watched", then back to relevant results. Another example would be searching for an item to buy and getting relevant results in the images tab of google, but not the shopping tab.
It’s broken bc google has spent 20+ years promoting garbage content in a self-serving way. No one was able to compete unless they played by googles rules, and so all we have left is blog spam and regular spam
I thought summarizing papers/stories/emails/meetings was one of the touted use cases of LLMs?
What are the use cases where the expected performance is high?
I didn't notice that example. I doubt top-tier models have issues with that. I was more referencing Sabine's mentions of hallucinated citations and papers, which is an issue I also had 2 years ago but is probably solved by Deep Research at this point. She just has massive skill issues and doesn't know what she's doing.
>What are the use cases where the expected performance is high?
https://openai.com/index/introducing-chatgpt-pro/
o1-pro is probably at top tier human level performance on most small coding tasks and definitely at answering STEM questions. o3 is even better but not released outside of it powering Deep Research.
https://codeforces.com/blog/entry/137543 o3 is top 200 on Codeforces for example.
> This is just not a use case where the expected performance on these tasks is high.
Yet the hucksters hyping AI are falling all over themselves saying AI can do all this stuff. This is where the centi-billion dollar valuations are coming from. It's been years and these super hyped AIs still suck at basic tasks.
When pre-AI shit Google gave wrong answers it at least linked to the source of the wrong answers. LLMs just output something that looks like a link and calls it a day.
To be fair the newest tools like Deep Research are actually quite good and hallucination is essentially not a real problem for them.
https://marginalrevolution.com/marginalrevolution/2025/02/de...
<<After glowing reviews, I spent $200 to try it out for my research. It hallucinated 8 of 10 references on a couple of different engineeribg topics. For topics that are well established (literature search), it is useful, although o3-mini-high with web search worked even better for me. For truly frontier stuff, it is still a waste of time.>>
<<I've had the hallucination problem too, which renders it less than useful on any complex research project as far as I'm concerned.>>
These quotes are from the link you posted. There are a lot more.
I think Sabine is just wrong in this case. I don't think Deep Research can even hallucinate links in this way at all.
The whole point is that an LLM is not a search engine and obviously anyone who treats it as one is going to be unsatisfied. It's just not a sensible comparison. You should compare working with an LLM to working with an old "state of the art" language tool like Python NLTK -- or, indeed, specifying a problem in Python versus specifying it in the form of a prompt -- to understand the unbridgeable gap between what we have today and what seemed to be the best even a few years ago. I understand when a popular science author or my relatives haven't understood this several years after mass access to LLMs, but I admit to being surprised when software developers have not.
Hosted and free or subscription-based DeepResearch like tools that integrate LLMs with search functionality (the whole domain of "RAG" or "Retrieval Augmented Generation") will be elementary for a long time yet simply because the cost of the average query starts to go up exponentially and there isn't that much money in it yet. Many people have and will continue to build their own research tools where they can determine how much compute time and API access cost they're willing to spend on a given query. OCR remains a hard problem, let alone appropriately chunking potentially hundreds of long documents into context length and synthesizing the outputs of potentially thousands of LLM outputs into a single response.
To be fair, a few (one?) years ago LLMs were touted, or marketed, as a "search killer", and a lot of people do use them in that fashion.
A lot of people need to improve their critical thinking skills to deconstruct the marketing hype, and then choose the right tool for the job.
sure. isn't that effectively what Sabine is doing though? She just doesn't have as compelling a use in the cases where LLMs are strong.
Certainly. I agree of course as to the problem of hype and I'm aware of how many people use LLMs today. I tried to emphasize in my earlier post that I can understand why someone like Sabine has the opinion she does -- I'm more confused how there's still similar positions to be found among software developers, evidenced often within Hacker News threads like the one we're in. I don't intend that to refer to you, who clearly has more than a passing knowledge of LLM internals, but more to the original commenter I was responding to.
More than marketing, I think from my experience it's chat with little control over context as the primary interface of most non-engineers with LLMs that leads to (mis)expectations of the tool in front of them. Having so little control over what is actually being input to the model makes it difficult to learn to treat a prompt as something more like a program.
It's mostly because of how they were initially marketed. In an effort to drive hype 'we' were promised the world. Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence? In reality Bard, let alone whatever early version he was using, is about as sentient as my left asscheek.
OpenAI did similar things by focusing to the point of absurdity on 'safety' for what was basically a natural language search engine that has a habit of inventing nonsensical stuff. But on that same note (and also as you alluded to) - I do agree that LLMs have a lot of use as natural language search engines in spite of their proclivity to hallucinate. Being able to describe a e.g. function call (or some esoteric piece of history) by description and then often get the precise term/event that I'm looking for is just incredibly useful.
But LLMs obviously are not sentient, are not setting us on the path to AGI, or any other such nonsense. They're arguably what search engines should have been 10 or 15 years ago, but anti-competitive monopolization of the industry meant that search engine technology progress basically stalled out, if not regressed for the sake of ads (and individual 'entrepreneurs' becoming better at SEO), about the time Google fully established itself.
> Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence?
I presume you are referring to this Google engineer, who was sacked for making the claim. Hardly an example of AI companies overhyping the tech; precisely the opposite, in fact. https://www.bbc.co.uk/news/technology-62275326
It seems to be a common human hallucination to imagine that large organisations are conspiring against us.
Corporations are motivated by profit, not doing what's best for humanity. If you need an example of "large organizations conspiring against us," I can give you twenty.
I agree that sometimes organisations conspire against people. My point was, in case it wasn't apparent, the irony that somenameforme was talking about how LLMs were of little use because they hallucinate, whilst apparently hallucinating a conspiracy by AI companies to overhype the technology.
I wasn't making a political point. You see similar evidence-free allegations against international organisations and national government bodies.
There is no difference between an organization and a conspiracy. Organizing to do something is the same as conspiring to do something.
That leaves the question of whether the organization is commensal, symbiotic or predatory towards any given "us".
> Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence?
That's not what happened. Google stomped hard on Lemoine, saying clearly that he was wrong about LaMDA being sentient ... and then they fired him for leaking the transcripts.
Your whole argument here is based on false information and faulty logic.
Which is pretty ironic in a thread littered with people asserting LLMs are useless because they can hallucinate and create illogical outputs.
So in your reading, people who consider the weaknesses of LLMs a definite problem should call humans generally "adequate".
Were people capable of lifting concrete pillars, cranes would not be sought.
Edit: #@!! snipers. Speak up.
Can’t tell what you’re trying to say or who you’re angry at. Rephrase please?
Were you perchance noting that according to some people «LLMs ... can hallucinate and create illogical outputs» (you also specified «useless», but that must be a further subset and will hardly create a «litter[ing]» here), but also that some people use «false information and faulty logic»?
Noting that people are imperfect is not a justification for the weaknesses in LLMs. Since around late 2022 some people started stating LLMs are "smart like their cousin", to which the answer remains "we hope that your cousin has a proportionate employment".
If you built a crane that only lifts 15kg, it's no justification that "many people lift 10". The purpose of the crane is to lift as needed, with abundance for safety.
If we build cranes, it is because people are not sufficient: the relative weakness of people is, far from a consolation of weak cranes, the very reason why we want strong cranes. Similarly for intelligence and other qualities.
People are known to use «false information and faulty logic»: but they are not being called "adequate".
> angry at
There's a subculture around here that thinks it normal to downvote without any rebuttal - equivalent to "sneering and leaving" (quite impolite), almost all times it leaves us without a clue about what could be the point of disapproval.
I think you're missing the point. He's pointing out what the atmosphere was/is around LLMs in these discussions, and how that impacts stories like with Lemoine.
I mean, you're right that he's silly and Google didn't want to be part of it, but it was (and is?) taken seriously that: LLMs are nascent AGI, companies are pouring money to get there first, we might be a year or two away. Take these as true, it's at least possible that Google might have something chained up in their basement.
In retrospect, Google dismissed him because he was acting in a strange and destructive way. At the time, it could be spun as just further evidence: they're silencing him because he's right. Could it have created such hysteria and silliness if the environment hadn't been so poisoned by the talk of imminent AGI/sentience?
Sure, but does any of that support the claim that LLMs were marketed as superintelligence?
Which comment claimed that LLMs were marketed as super-intelligence? I'm looking up the chain and I can't see it.
I don't think they were, but I think it's pretty clear they were marketed as being the imminent path to super-intelligence, or something like it. OpenAI were saying GPT-(n-1) is as intelligent as a high school student, GPT-(n) is a university student, GPT-(n+1) will be.. something.
> OpenAI did similar things by focusing to the point of absurdity on 'safety' for what was basically a natural language search engine that has a habit of inventing nonsensical stuff.
The focus on safety, and the concept of "AI", preexisted the product. An LLM was just the thing they eventually made; it wasn't the thing they were hoping to make. They applied their existing beliefs to it anyway.
I am worried about them as a substitute for search engines. My reasoning is that classic Google web-scraping and SEO, as shitty as it may be, is 'open-source' (or at least 'open-citation') in nature - you can 'inspect the sh*t it's built from'. Whereas LLMs, to me, seem like a Chinese - or Western - totalitarian political system's wet dream: 'we can set up an inscrutable source of "truth" for the people to use, with the _truths_ we intend them to receive'. We already saw how weird and unsane this was when they were configured to be woke under the previous regime. Imagining it configured for 'the other post-truth' is a nightmare.
> Remember the "leaks" from Google about an engineer trying to get the word out that they had created a sentient intelligence?
No, first time I hear about it. I guess the secret to happiness is not following leaks. I had very low expectations before trying LLMs and I’m extremely impressed now.
This was three years ago:
https://www.theguardian.com/technology/2022/jun/12/google-en...
He was fired and a casual browse of his blog makes it quite clear that he was a few fries short of a Happy Meal all along.
> """happiness"""
Not following leaks, or just the news, not living in the real world, not caring about the consequences of reality: anybody can think he's """happy""" with psychedelia and with just living in a private world. But it is the same kind of "happy" that comes with "just smile".
If you did not get the information that there are severe pitfalls - which is, by the way, quite unrelated to the "it's sentient" thing, as we are talking about faults in the products, not faults in human fools - you are supposed to see them through your own judgement.
They have their value in analyzing huge amounts of data, for example scientific papers or raw observations, but the popular public ones are mostly trained on stolen/pirated texts off the internet and from the social media clouds the companies control. So this means: bullshit in -> bullshit out. I don't need machines for that; the regular human bullshitters do this job just fine.
> the popular public ones are mostly trained on stolen/pirated texts offthe internet
You mean like actual literature, textbooks and scientific papers? You can't get them in bulk without pirating. Thank intellectual property laws.
> from social media clouds the companies control
I.e. conversations of real people about matters of real life.
But if it satisfies your elitist, ivory-towerish vision of "healthy information diet" for LLMs, then consider that e.g. Twitter is where, until now, you'd get most updates from the best minds in several scientific fields. Or that besides r/All, the Reddit dataset also contains r/AskHistorians and other subreddits where actual experts answer questions and give first-hand accounts of things.
The actually important bit though, is that LLM training manages to extract value from both the "bullshit" and whatever you'd call "not bullshit", as the model has to learn to work with natural language just as much as it has to learn hard facts or scientific theories.
Yes, I find the biggest issue in discussing the present state of AI with people outside the field, whether technical or not, is that "machine learning" had only just entered popular understanding: i.e. everyone seems ready today to talk about the limits of training a machine learning model on X limited data set, unable to extrapolate beyond it. The difference between "learning the best binary classifier on a labelled training set" and "exploring the set of all possible programs representable by a deep neural network of whatever architecture to find that which best generates all digitally recorded traces of human beings throughout history" is very far from intuitive to even specialists. I think Ilya's old public discussions of this question are the most insightful for a popular audience, explaining how and why a world model and not simply a Markov chain is necessary to solve the seemingly trivial problem of "predicting the next word in a sequence."
A lot of irony in that comment.
Nobody promised the world. The marketing underpromised and LLMs overdelivered. Safety worries didn't come from marketing, it came from people who were studying this as a mostly theoretical worry for the next 50+ years, only to see major milestones crossed a decade or more before they expected.
Did many people overhype LLMs? Yes, like with everything else (transhumanist ideas, quantum physics). It helps being more picky who one listens to, and whether they're just painting pretty pictures with words, or actually have something resembling a rational argument in there.
Bro AGI as a marketing term is too stale already.
We are now at Artificial SUPER Intelligence.
I’m waiting for Artificial Pro Max Super Duper Intelligence.
You wait, someday they'll come out with something they start calling AI 2.0
Folks really over index when an LLM is very good for their use case. And most of the folks here are coders, at which they're already good and getting better.
For some tasks they're still next to useless, and people who do those tasks understandably don't get the hype.
Tell a lab biologist or chemist to use an LLM to help them with their work and they'll get very little useful out of it.
Ask an attorney to use it and it's going to miss things that are blindingly obvious to the attorney.
Ask a professional researcher to use it and it won't come up with good sources.
For me, I've had a lot of those really frustrating experiences where I'm having difficulty on a topic and it gives me utterly incorrect junk, because there just isn't a lot already published about it.
I've fed it tricky programming tasks and gotten back code that doesn't work, and that I can't debug because I have no idea what it's trying to do, or I'm not familiar with the libraries it used.
It sounds like you're trying to use these llms as oracles, which is going to cause you a lot of frustration. I've found almost all of them now excel at imitating a junior dev or a drunk PhD student. For example the other day I was looking at acoustic sensor data and I ran it down the trail of "what are some ways to look for repeating patterns like xyz" and 10 minutes later I had a mostly working proof of concept for a 2nd order spectrogram that reasonably dealt with spectral leakage and a half working mel spectrum fingerprint idea. Those are all things I was thinking about myself, so I was able to guide it to a mostly working prototype in very little time. But doing it myself from zero would've taken at least a couple of hours.
But truthfully 90% of work related programming is not problem solving, it's implementing business logic. And dealing with poor, ever changing customer specs. Which an llm will not help with.
> But truthfully 90% of work related programming is not problem solving, it's implementing business logic. And dealing with poor, ever changing customer specs. Which an llm will not help with.
Au contraire, these are exactly things LLMs are super helpful at - most of business logic in any company is just doing the same thing every other company is doing; there's not that many unique challenges in day-to-day programming (or business in general). And then, more than half of the work of "implementing business logic" is feeding data in and out, presenting it to the user, and a bunch of other things that boil down to gluing together preexisting components and frameworks - again, a kind of work that LLMs are quite a big time-saver for, if you use them right.
Strongly in agreement. I've tried them and mostly come away unimpressed. If you work in a field where you have to get things right, and it's more work to double check and then fix everything done by the LLM, they're worse than useless. Sure, I've seen a few cases where they have value, but they're not much of my job. Cool is not the same as valuable.
If you think "it can't quite do what I need, I'll wait a little longer until it can" you may still be waiting 50 years from now.
> If you work in a field where you have to get things right, and it's more work to double check and then fix everything done by the LLM, they're worse than useless.
Most programmers understand reading code is often harder than writing it. Especially when someone else wrote the code. I'm a bit amused by the cognitive dissonance of programmers understanding that and then praising code handed to them by an LLM.
It's not that LLMs are useless for programming (or other technical tasks) but they're very junior practitioners. Even when they get "smarter" with reasoning or more parameters their nature of confabulation means they can't be fully trusted in the way their proponents suggest we trust them.
It's not that people don't make mistakes but they often make reasonable mistakes. LLMs make unreasonable mistakes at random. There's no way to predict the distribution of their mistakes. I can learn a human junior developer sucks at memory management or something. I can ask them to improve areas they're weak in and check those areas of their work in more detail.
I have to spend a lot of time reviewing all output from LLMs because there's rarely rhyme or reason to their errors. They save me a bunch of typing but replace a lot of my savings with reviews and debugging.
My view is that it will be some time before they can as well, precisely because of the success in the software domain - not because LLMs aren't capable as a tech, but because data owners and practitioners in other domains will resist the change. From the SWE experience, news reports, financial magazines, etc., many are preparing accordingly, even if it is a subconscious thing. People don't like change and don't want to be threatened when it is them at risk - no one wants what happened to artists, and now SWEs, to happen to their profession. They are happy for other professions to "democratize/commoditize" as long as it isn't them - after all, this increases their purchasing power. Don't open-source knowledge/products, don't let AI near your vertical domain, continue to command a premium for as long as you can - I've heard variations of this in many AI conversations. This is much easier in oligopoly- and monopoly-like domains, and/or domains where knowledge was known to be a moat even when mixed with software, as you have more trust that competitors won't do the same.
For many industries/people, work is a means to earn, not something to be passionate about for its own sake. It's a means to provide for the other things in life you are actually passionate about (e.g. family, lifestyle). In the end AI may get your job eventually, but if it gets to you much later than other industries/domains, you win from a capital perspective: other goods get cheaper while you still command your pre-AI scarcity premium. That makes it easier to acquire assets from the early-disrupted industries and shields you from AI eventually taking over.
I'm seeing this directly in software: fewer new frameworks/libraries/etc. outside the AI domain being published, in my opinion, and more apprehension from companies about open-sourcing their work and/or exposing what they do. Attracting talent is also no longer as strong a reason to showcase what you do to prospective employees - economic conditions and/or AI make that less necessary as well.
I know at least two attorneys who use LLMs productively.
As with all LLM usage right now, it's a tool and not fit for every purpose. But it has legit uses for some attorney tasks.
I frequently see news stories where attorneys get in trouble for using LLMs, because they cite hallucinated case law (e.g.). If they didn't get caught, that would look the same as using them "productively".
Asking the LLM for relevant case law and checking it up - productive use of LLM. Asking the LLM to write your argument for you and not checking it up - unproductive use of LLM. It's the same as with programming.
>Asking the LLM for relevant case law and checking it up - productive use of LLM
That's a terrible use for an LLM. There are several deterministic search engines attorneys use to find relevant case law, where you don't have to check to see if the cases actually exist after it produces results. Plus, the actual text of the case is usually very important, and isn't available if you're using an LLM.
Which isn't to say they're not useful for attorneys. I've had success getting them to do some secretarial and administrative things. But for the core of what attorneys do, they're not great.
For law firms creating their own repositories of case law, having LLMs search via summaries and then dive into the selected cases to extract pertinent information seems like an obviously great use case for an LLM-based solution.
The orchestration of LLMs that read transcripts, read emails, read case law, and prepare briefs with sources is unavoidable in the next 3 years. I don't doubt multiple industry-specialized solutions are already under development.
Just asking chatGPT to make your case for you is missing the opportunity.
If anyone is unable to get Claude 3.7 or Gemini 2.5 to accelerate their development work, I have to doubt their sentience at this point. (Or, more likely, doubt that they're actively testing these things regularly.)
Law firms don't create their own repos of case law. They use a database like westlaw or lexis. LLMs "preparing briefs with sources" would be a disaster and wholly misunderstands what legal writing entails.
This lawyer uses LLMs for everything. Correspondence, document review, drafting demands, drafting pleadings, discovery requests, discovery responses, golden rule letters, motions, responses to motions, deposition outlines, depo prep, voir dire, opening, direct, cross, closings.
I find it very useful to review the output and consider its suggestions.
I don’t trust it blindly, and I often don’t use most of what it suggests; but I do apply critical thinking to evaluate what might be useful.
The simplest example is using it as a reverse dictionary. If I know there’s a word for a concept, I’ll ask an LLM. When I read the response, I either recognize the word or verify it using a regular dictionary.
I think a lot of the contention in these discussions is because people are using it for different purposes: it's unreliable for some purposes and it is excellent at others.
> For law firms creating their own repositories of case law
Wait, why would a law firm create its own repository of case law? It's not like it has access to secret case law that other lawfirms do not.
> Asking the LLM for relevant case law and checking it up - productive use of LLM.
Only if you're okay with it missing stuff. If I hired a lawyer, and they used a magic robot rather than doing proper research, and thus missed relevant information, and this later came to light, I'd be going after them for malpractice, tbh.
Ironically both Westlaw and Lexis have integrated AI into their offerings so it's magic robots all the way down.
https://legal.thomsonreuters.com/en/products/westlaw-edge https://www.lexisnexis.com/en-us/products/protege.page
Surely this was meant ironically, right? You must've heard of at least one of the many cases involving lawyers doing precisely what you described and ending up presenting made up legal cases in court. Guess how that worked out for them.
The uses that they cited to me were "additional pair of eyes in reviewing contracts," and, "deep research to get started on providing a detailed overview of a legal topic."
The problem is that it's marketed as a general tool that, at least in the future, will work near-perfectly if we just provide enough data and computing power.
> For some tasks they're still next to useless, and people who do those tasks understandably don't get the hype.
This is because programmers talk on the forums that programmers scrape to get data to train the models.
Honestly it's worse than this. A good lab biologist/chemist will try to use it, understand that it's useless, and stop using it. A bad lab biologist/chemist will try to use it, think that it's useful, and then it will make them useless by giving them wrong information. So it's not just that people over-index when it is useful, they also over-index when it's actively harmful but they think it's useful.
You think good biologists never need to summarize work into digestible language, or fill out multiple huge, redundant grant applications with the same info, or reformat data, or check that a writeup accurately reflects the data?
I’m not a biologist (good or bad) but the scientists I know (who I think are good) often complain that most of the work is drudgery unrelated to the science they love.
Sure, lots of drudgery, but none of your examples are things that you could trust an LLM to do correctly when correctness counts. And correctness always counts in science.
Edit to add: and regardless, I'm less interested in the "LLMs aren't ever useful to science" part of the point. The point that actual LLM usage in science will mostly be for cases where they seem useful but actually introduce subtle problems is much more important. I have observed this happening with trainees.
I have also seen trainees introduce subtle problems when they think they know more than they do.
This attorney uses it all day every day.
[dead]
The problem Sabine tries to communicate is that reality is different from what the cash-heads behind the main commercial models are trying to portray. They push the narrative that they’ve created something akin to human cognition, when in reality they’ve just optimised prediction algorithms on an unprecedented scale. They claim to have created Intelligence - the ability to acquire and apply knowledge and skills - when we all know the only real Intelligence they are producing is the collection of information of military or political value.
The technology is indeed amazing and very amusing, but like all the good things in the hands of corporate overlords, it will be slowly turning into profit-milking abomination.
> They push the narrative that they’ve created something akin to human cognition
This is your interpretation of what these companies are saying. I'd love to see whether any company has specifically said anything like that.
Out of the last 100 years how many inventions have been made that could make any human awe like llms do right now? How many things from today, when brought back into 2010, would make the person using them feel like they're being tricked or pranked? We already take them for granted even though they've been around for less than half a decade.
LLMs aren't a catch-all solution to the world's problems, or something that is going to help us in every facet of our lives, or an accelerator for every industry that exists out there. But at no point in history could you talk to your phone about general topics, get information, practice language skills, build an assistant that teaches your kid about the basics of science, use something to accelerate your work in many different ways, etc.
Looking at LLMs shouldn't be boolean; it shouldn't be a choice between 'they're the best thing ever invented' and 'they're useless'. But it seems like everyone presents the issue in this manner, and Sabine is part of that problem.
No major company directly states "We have created human-like intelligence," but they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
>I'd love to see whether any company has specifically said anything like that.
1. Microsoft researchers: Sparks of Artificial General Intelligence: Early experiments with GPT-4 - https://arxiv.org/abs/2303.12712
2. "GPT-4 is not AGI, but it does exhibit more general intelligence than previous models." - Sam Altman
3. Musk has claimed that AI is on the path to "understanding the universe." His branding of Tesla's self-driving AI as "Full Self-Driving" (FSD) also misleadingly suggests a level of autonomous reasoning that doesn't exist.
4. Meta's AI chief scientist, Yann LeCun, has repeatedly said they are working on giving AI "common sense" and "world models" similar to how humans think.
>Out of the last 100 years how many inventions have been made that could make any human awe like llms do right now?
ELIZA is an early natural language processing computer program developed from 1964 to 1967
ELIZA's creator, Weizenbaum, intended the program as a method to explore communication between humans and machines. He was surprised and shocked that some people, including his own secretary, attributed human-like feelings to the program. That was 60 years ago.
So as you can see, us humans are not too hard to fool with this.
ELIZA was not a natural language processor, and the fact that some people were easily fooled by a program that produced canned responses based on keywords in the text but was presented as a psychotherapist is not relevant to the issue here--it's a fallacy of affirmation of the consequent.
Also,
"4. Meta's AI chief scientist, Yann LeCun, has repeatedly said they are working on giving AI "common sense" and "world models" similar to how humans think."
completely misses the mark. That LLMs don't do this is a criticism from old-school AI researchers like Gary Marcus; LeCun is saying that they are addressing the criticism by developing the sorts of technology that Marcus says are necessary.
> they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
As do all companies in the world. If you want to buy a hammer, the company will sell it as the best hammer in the world. It's the norm.
I don't know exactly what your point is with ELIZA?
> So as you can see, us humans are not too hard to fool with this.
I mean ok? How is that related to having a 30 minute conversation with ChatGPT where it teaches you a language? Or Claude outputting an entire application in a single go? Or having them guide you through fixing your fridge by uploading the instructions? Or using NotebookLM to help you digest a scientific paper?
I'm not saying LLMs are not impressive or useful - I'm pointing out that the corporations behind commercial AI models are capitalising on our emotional response to natural language prediction. This phenomenon isn't new - Weizenbaum observed it 60 years ago, even with the simplest of algorithms like ELIZA.
Your example actually highlights this well. AI excels at language, so it’s naturally strong in teaching (especially for language learning ;)). But coding is different. It’s not just about syntax; it requires problem-solving, debugging, and system design — areas where AI struggles because it lacks true reasoning.
There’s no denying that when AI helps you achieve or learn something new, it’s a fascinating moment — proof that we’re living in 2025, not 1967. But the more commercialised it gets, the more mythical and misleading the narrative becomes
> system design — areas where AI struggles because it lacks true reasoning.
Others addressed code, but with system design specifically - this is more of an engineering field now, in that there are established patterns, a set of components at various levels of abstraction, and a fuck ton of material about how to do it, including but not limited to everything FAANG publishes as preparatory material for their System Design interviews. At this point in time, we have both a good theoretical framework and a large collection of "design patterns" solving common problems. The need for advanced reasoning is limited, and almost no one is facing unique problems here.
I've tested it recently, and suffice it to say, Claude 3.7 Sonnet can design systems just fine - in fact much better than I'd expect a random senior engineer to. Having the breadth of knowledge and being really good at fitting patterns is a big advantage it has over people.
You originally said
> They push the narrative that they’ve created something akin to human cognition
I am saying they're not doing that, they're doing sales and marketing and it's you that interprets this as possible/true. In my analogy if the company said it's a hammer that can do anything, you wouldn't use it to debug elixir. You understand what hammers are for and you realize the scope is different. Same here. It's a tool that has its uses and limits.
> Your example actually highlights this well. AI excels at language, so it’s naturally strong in teaching (especially for language learning ;)). But coding is different. It’s not just about syntax; it requires problem-solving, debugging, and system design — areas where AI struggles because it lacks true reasoning.
I disagree, since I use it daily and Claude is really good at coding. It's saving me a lot of time. It's not gonna build a new Waymo, but I don't expect it to. But this is beside the point. In the original tweet, what Sabine is implying is that it's useless and OpenAI should be worth less than a shoe factory. In fact this is a very poor way to look at LLMs and their value, and both ends of the spectrum are problematic (those who say it's a catch-all AGI and those who say "hurr, it couldn't solve P versus NP, it's trash").
I think one difference between a hammer and an LLM is that hammers have existed since forever, so common sense is assumed to be there as to what their purpose is. For LLMs though, people are still discovering on a daily basis to what extent they can usefully apply them, so it's much easier to take such promises made by companies out of context if you are not knowledgeable/educated on LLMs and their limitations.
>they're doing sales and marketing and it's you that interprets this as possible/true.
You've moved the goalpost from "they're not saying it" to "they're saying, but you're not supposed to believe it."
The companies are not doing it. This is what I am saying.
You admitted earlier that they are:
Person you replied to: they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
Your response: As do all companies in the world. If you want to buy a hammer, the company will sell it as the best hammer in the world. It's the norm.
As a programmer (and GOFAI buff) for 60 years who was initially highly critical of the notion of LLMs being able to write code because they have no mental states, I have been amazed by the latest incarnations being able to write complex functioning code in many cases. There are, however, specific ways that not being reasoners is evident ... e.g., they tend to overengineer because they fail to understand that many situations aren't possible. I recently had an example where one node in a tree was being merged into another, resulting in the child list of the absorbed node being added to the child list of the kept node. Without explicit guidance, the LLM didn't "understand" (that is, its response did not reflect) that a child node can only have one parent so collisions weren't possible.
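To make the overengineering concrete, here is roughly the shape of that merge as a minimal sketch (a hypothetical Node class, not the actual codebase); the comment marks the invariant the model failed to rely on:

    # Minimal sketch of the merge described above (hypothetical Node class).
    class Node:
        def __init__(self, value):
            self.value = value
            self.parent = None
            self.children = []

    def merge_into(kept, absorbed):
        """Absorb one node into another, re-parenting its children."""
        for child in absorbed.children:
            # Each child has exactly one parent, so it cannot already be in
            # kept.children -- no collision/deduplication check is needed,
            # which is the invariant the model missed.
            child.parent = kept
            kept.children.append(child)
        absorbed.children = []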
> proof that we’re living in 2025, not 1967. But the more commercialised it gets, the more mythical and misleading the narrative becomes
You seem to be living in 2024, or 2023. People generally have far more pragmatic expectations these days, and the companies are doing a lot less overselling ... in part because it's harder to come up with hype that exceeds the actual performance of these systems.
[dead]
How about Sam Altman literally saying on twitter "We know how to build AGI now"? That close enough?
“We know how to build something” is pretty different from “our in-market products are something”
How many examples of CEOs writing shit like that can you name? I can name more than one. Elon had been saying that camera-driven level 5 autonomous driving would be ready in 2021. Did you believe him?
You went from "they're not saying it" to "and you believe them when they say it??" Pretty quickly
I said the company is not saying this and not using it for marketing - and this stays true. CEOs hyping their stock is par for the course.
A CEO is the most visible representative of a company.
A statement on their personal Twitter might not be "the company's" statement, but who cares?
Sam Altman's social media IS OpenAI marketing.
If it didn't officially come from the marketing department it's only sparkling overhype right?
Elon? Never did, and for the record, I also never really understood his fanboys. I never even bought a Tesla. And no, besides these two guys, I don't really remember many other CEOs making such revolutionary statements - which is usually what happens when people understand their technology and are not ready to bullshit. There is one small differentiation though: at least the self-driving car hype was believable, because it seemed almost like a finite search problem - along the lines of "how hard could it be to process X input signals from lidars and image frames and marry them to an advanced variation of what is basically a PID controller?" And at least there is a defined use case. With genAI, we have no idea what the problem definition or even the problem space is, and the main use case the companies seem to be pushing down our throats (aside from code assistants) is "summarising your email" and chatting with your smartphone, for lonely people. Ew, thanks, but no thanks.
I mean you really don't know multiple CEOs in jail that hyped their stock to the moon? Theranos? Nikola?
That's reallyyyy trying hard to minimise the capability of LLMs and their potentials that we're still discovering. But you do you I guess.
No mate, not everyone is trying hard to prove some guy on the Internet wrong. I do remember these two but to be honest, they were not on top of my mind in this context, probably because it's a different example - or what are you trying to say? That the people running AI companies should go to jail for deceiving their investors? This is different to Theranos. Holmes actively marketed and PRESENTED a "device" which did not exist as specified (they relied on 3rd party labs doing their tests in the background). For all that we know, OpenAI and their ilk are not doing that really. So you're on thin ice here. Amazon came close though, with their failed Amazon Go experiment, but they only invested their own money, so no damage was done to anyone. In either case your example is showing what? That lying is normal in the business world and should be done by the CEOs as part of their job description? That they should or should not go to jail for it? I am really missing your point here, no offence.
No offense taken
> In either case your example is showing what? That lying is normal in the business world and should be done by the CEOs as part of their job description? That they should or should not go to jail for it? I am really missing your point here, no offence.
If you run through the message chain, you'll see first that the comment OP is claiming companies market LLMs as AGI, and then the next guy quotes Altman's tweet to support it. I am saying companies don't claim LLMs are AGI and that CEOs are doing CEO things; my examples are Elon (who didn't go to jail, btw) and the other two that did.
> For all that we know, OpenAI and their ilk are not doing that really.
I am on the same page here.
CEOs represent their companies. "The company didn't say it, the CEO did" is a nonsensical distinction.
I think you completely missed the point. Altman is definitely engaging in 'creative' messaging, as do other GenAI CEOs. But unlike Holmes and others, they are careful to wrap it in conditionals, future tense, and vague corporate speak about how something "feels" like this or that, not that it definitely is this or that. Most of us dislike the fact that they are indeed implying this stuff is almost AGI - just around the corner, just a few more years, just a few more hundred billion dollars wasted in datacenters - when we can see on a day-to-day basis that their tools are just advanced text generators. Anyone who finds them 'mindblowing' clearly does not have a complex enough use case.
I think you are missing the point. I never said it's the same nor is that what I am arguing.
> Anyone who finds them 'mindblowing' clearly does not have a complex enough use case.
What is the point of LLMs? If their only point is complex use cases then they're useless; let's throw them away. If their point/scope/application is wider and they're doing something for a non-negligible percentage of people, then who are you to gauge whether or not they deserve to be mind-blowing to someone, regardless of their use case?
What is the point of LLMs? It seems nobody really knows, including the people selling them. They are a solution in search of a problem. But if you figure it out in the meantime, make sure to let everyone know. Personally I'd be happy just having back Google as it was between roughly 2006 and 2019 (RIP) in place of the overly verbose statistical parrots.
> Out of the last 100 years how many inventions have been made that could make any human awe like llms do right now?
Lots e.g. vacuum cleaners.
> But at no point in history could you talk to your phone
You could always "talk" to your phone just like you could "talk" to a parrot or a dog. What does that even mean?
If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag to feel like a conversation, and it often doesn't reply with anything related.
Right on the money. Plus vacuum cleaners are actually useful and predictable in their inputs and outputs :)
Sure, a vacuum cleaner is the same.
> If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag to feel like a conversation, and it often doesn't reply with anything related.
I don't believe this one bit. But keep on trucking.
> Sure, a vacuum cleaner is the same.
> I don't believe this one bit. But keep on trucking.
You sure? Isn't that contradictory? It can't be the same if you don't believe it...
Did you need an /s to understand sarcasm?
Of course they aren't "real" conversations but I can dialog with LLMs as a means of clarifying my prompts. The comment about parrots and dogs is made in bad faith.
By your own admission, those are not dialogues, but merely query optimisations in an advanced query language. Like how you would tune an SQL query until you get the data you are expecting to see. That's what it is for the LLMs.
> The comment about parrots and dogs is made in bad faith
Not necessarily. (Some aphonic, adactyl downvoters seem to have possibly tried to nudge you into noticing that your idea above goes against some entailed spirit of the guidelines.)
The poster may have meant that, for the use natural to him, he feels the results have the same utility as discussing with a good animal. "Clarifying one's prompts" may be effective in some cases, but it's probably not what others seek. It is possible that many want the good old combination of "informative" and "insightful": in practice there may be issues with both.
> "Clarifying one's prompts" may be effective in some cases but it's probably not what others seek
It's not even that. Can the LLM run away, stop the conversation, or even say no? It's like your boss "talking" to you about a task and not giving you a chance to respond. Is that a talk? It's one-way.
E.g. ask the LLM who invented Wikipedia. It will respond with "facts". If I ask a friend, the reply might be "look it up yourself". That's a real conversation. Until then, it isn't one.
Even parrots and dogs can respond differently than a forced reply exactly how you need it.
True - but LLMs can do this.
A German Onion-like magazine has a wrapper around ChatGPT, called „DeppGPT“ (IdiotGPT), that behaves like that - likely implemented with a decent prompt.
> If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag to feel like a conversation
Imagine the LLM is halfway through its journey to the Moon, and mentally correct for ~1.5 seconds of light lag.
> and it often doesn't reply with anything related.
Use a better microphone, or stop mumbling.
> This is your interpretation of what these companies are saying. I'd love to see whether any company has specifically said anything like that.
What is the layman to make of the claim that we now have “reasoning” models? Certainly sounds like a claim of human-like cognition, even though the reality is different.
Studies have shown that corvids are capable of reasoning. Does that sound like a claim of human level cognition?
I think you’re going too far in imagining what one group of people will make of what another group of people is saying, without actually putting yourself in either group.
Much as I agree with the point about overhyping from companies, I'd be more sympathetic to this point of view if she acknowledged the merits of the technology.
Yes, it hallucinates and if you replace your brain with one of these things, you won't last too long. However, it can do things which, in the hands of someone experienced, are very empowering. And it doesn't take an expert to see the potential.
As it stands, it sounds like a case of "it's great in practice but the important question is how good it is in theory."
If it works for you...
I use LLMs. They're somewhat useful if you're on a non-niche problem. They're also useful instead of search engines, but that's because search has been enshittified more than because an LLM is better.
However 90% of the marketing material about them is simply disgusting. The bigwigs sound like they're spreading a new religion, and most enthusiasts sound like they're new converts to some sect.
If you're marketing it as a tool, fine. If you're marketing it as the third and fourth coming of $DEITY, get lost.
> I use LLMs. They're somewhat useful if you're on a non-niche problem. They're also useful instead of search engines...
The problem for me is that I could use that type of assistance precisely when I hit that "niche problem" zone. Non-niche problems are usually already solved.
Like search. Popular search engines like Google and Bing are mostly garbage because they keep trying to shove gen AI in my face with made up answers. I have no such problems with my SearxNG instance.
> I could use that type of assistance precisely when I hit that "niche problem" zone
Tough luck. On the other hand, we're still justified in asking for money to do the niche problems with our fleshy brains, right? In spite of the likes of Altman saying every week that we'll be obsoleted in 5 years by his products. Like ... cold fusion? Always 5 years away?
[I have more hope for cold fusion than these "AIs" though.]
> Popular search engines like Google and Bing are mostly garbage because they keep trying to shove gen AI in my face with made up answers.
No, they became garbage significantly before "AI". Google at least has gradually reduced the number of results returned and expanded the search scope to the point that when you want a reminder of the I2C API syntax on a Raspberry Pi, it returns 20 beginner-tutorial results that show you how to unpack the damn thing and do the first login instead.
I completely agree about the marketing material. I'm not sure about 90% but that's not something I have a strong opinion on. The stream from the bigwigs is the same song being played in a different tune and I'm inoculated to it.
I'm not marketing it. I'm not a marketer. I'm a developer trying to create an informed opinion on its utility and the marketing speak you criticize is far away from the truth.
The problem is this notion that it's just complete bullshit. The way it's worded irks me: "I genuinely don't understand...". It's quite easy to see the utility, and acknowledging that doesn't in any way detract from valid criticisms of the technology and the people who peddle it.
Exactly. It’s so strange to read so many comments that boil down to “because some marketing people are over-promising, I will retaliate by choosing to believe false things”
Don't forget that before "AI" there was crypto. Same attitude among their promoters.
So someone who isn't already invested can genuinely draw the conclusion they're similar and not worth the time.
Edit: oh wait
>because some marketing people are over-promising
all "AI" marketing people that I've seen. Ok 98%. And all my LLM info is from what gets posted on HN.
> I will retaliate by choosing to believe false things
"I will retaliate by cataloguing them as pathological liars and not waste my time with them any more".
But it’s not the marketers building the products. This is like saying “because the car salesman lied about this Prius’ gas mileage, I’ll retaliate by refusing to believe hybrids are any better than pure ICE cars and will buy a pickup”.
It hurts nobody but the person choosing ignorance.
No, I'm afraid it's more like 90% of the Prius owners who post something about their car are posting fake gas mileages.
And 100% of the marketers of course.
I disagree, but even conceding the point — why would that make you choose to believe falsehoods just to stick it to the liars?
I hate to bring an ad hominem into this, but Sabine is a YouTube influencer now. That's her current career. So I'd assume this Tweet storm is also pushing a narrative on its own, because that's part of doing the work she chose to do to earn a living.
Pinch of salt & all.
While true, I think this is more likely a question of framing or anchoring — I am continuously impressed and surprised by how good AI is, but I recognise all the criticisms she's making here. They're amazing, but at the same time they make very weird mistakes.
They actually remind me of myself, as I experience being a native English speaker now living in Berlin and attempting to use a language I mainly learned as an adult.
I can often appear competent in my use of the language, but then I'll do something stupid like asking someone in the office if we have a "Gabelstapler" I can borrow — Gabelstapler is "forklift truck", I meant to ask for a stapler, which is "Tacker" or "Hefter", and I somehow managed to make this mistake directly after carefully looking up the word. (Even this is a big improvement for me, as I started off like Officer Crabtree from Allo' Allo').
What you have done there is to discount statements that may build up a narrative - and still may remain fair... On what basis? Possibly because they do not match your own narrative?
LLMs seem akin to parts of human cognition, maybe the initial fast-thinking bit when ideas pop up in a second or two. But any human writing a review with links to sources would look them up and check that they are the right ones, matching the initial idea. Current LLMs don't seem to do that, at least the ones Sabine complains about.
Akin to human cognition but still a few bricks short of a load, as it were.
You lay the rhetoric on so thick (“cash-heads”, “pushing the narrative”, “corporate overlords”, “profit-milking abomination”) that it’s hard to understand your actual claim.
Are you trying to say that LLMs are useful now but you think that will stop being the case at some point in the future?
If it's just cash-heads pushing a narrative, where do Bengio and Hinton fit? Stuart Russell? Doug Hofstadter?
I mean fine, argue that they're mistaken to be concerned, if that's your belief. But dismissing it all as obvious shilling is not that argument.
Look man, and I'm saying this not to you but to everyone who is in this boat; you've got to understand that after a while, the novelty wears off. We get it. It's miraculous that some gigabytes of matrices can possibly interpret and generate text, images, and sound. It's fascinating, it really is. Sometimes, it's borderline terrifying.
But, if you spend too much time fawning over how impressive these things are, you might forget that something being impressive doesn't translate into something being useful.
Well, are they useful? ... Yeah, of course LLMs are useful, but we need to remain somewhat grounded in reality. How useful are LLMs? Well, they can dump out a boilerplate React frontend to a CRUD API, so I can imagine it could very well be harmful to a lot of software jobs, but I hope it doesn't bruise too many egos to point out that dumping out yet another UI that does the same thing we've done 1,000,000 times before isn't exactly novel. So it's useful for some software engineering tasks. Can it debug a complex crash? So far I'm around zero for ten and, believe me, I'm trying. From Claude 3.7 to Gemini 2.5, Cursor to Claude Code, it's really hard to get these things to work through a problem the way anyone above the junior dev level can. Almost invariably, they just keep digging themselves deeper until they eventually give up and try to null out the code so that the buggy code path doesn't execute.
So when Sabine says they're useless for interpreting scientific publications, I have zero trouble believing that. Scoring high on some shitty benchmarks whose solutions are in the training set is not akin to generalized knowledge. And these huge context windows sound impressive, but dump a moderately large document into them and it's often a challenge to get them to actually pay attention to the details that matter. The best shot you have by far is if the document you need it to reference definitely was already in the training data.
It is very cool and even useful to some degree what LLMs can do, but just scoring a few more points on some benchmarks is simply not going to fix the problems current AI architecture has. There is only one Internet, and we literally lit it on fire to try to make these models score a few more points. The sooner the market catches up to the fact that they ran out of Internet to scrape and we're still nowhere near the singularity, the better.
100% this. I think we should start producing independent evaluations of these tools for their usefulness, not for whatever made-up or convoluted evaluation index OpenAI, Google, or Anthropic throw at us.
> the novelty wears off.
Hardly. I have been using LLMs at least weekly (most of the time daily) since GPT-3.5. I am still amazed. It's really, really hard for me not to be bullish.
It kinda reminds me of the days when I learned the Unix-like command line. At least once a week, I shouted to myself: "What? There is a one-liner that does that? People use awk/sed/xargs this way??" That's how I feel about LLMs so far.
I tried LLMs for generating shell snippets. Mixed bag for me. They seem to have a hard time producing portable awk/sed commands. They also really overcomplicate things; you don't need to break out awk for most simple file-renaming tasks. With lesser-used utilities, all bets are off.
Yesterday Gemini 2.5 Pro suggested running "ps aux | grep filename.exe" to find a Wine process (pgrep is the much better way to go for that, but it's still wrong here) and get the PID, then pass that into "winedbg --attach" which is wrong in two different ways, because there is no --attach argument and the PID you pass into winedbg needs to be the Win32 one not the UNIX one. Not an impressive showing. (I already knew how to do all of this, but I was curious if it had any insights I didn't.)
For people with less experience I can see how getting e.g. tailored FFmpeg commands generated is immensely useful. On the other hand, I spent a decent amount of effort learning how to use a lot of these tools and for most of the ways I use them it would be horrific overkill to ask an LLM for something that I don't even need to look anything up to write myself.
Will people in the future simply not learn to write CLI commands? Very possible. However, I've come to a different, related conclusion: I think the areas where LLMs really succeed are examples of areas where we're doing a lot of needless work and requiring too much arcane knowledge. This goes for CLI usage and web development for sure. What we actually want to do should be significantly less complex to do. The LLM sort of solves this problem, to the extent that it works, but it's a horrible kludge of a solution. Literally, converting video files and performing basic operations on them should not require fifteen minutes of Googling reference material and Q&A websites.
We've built a vastly over-complicated computing environment, and there is a real chance that the primary user of many of these interfaces will eventually not even be human. If the interface to the computer becomes the LLM, that's mostly going to be wasted if we keep using the same crappy underlying interfaces that got us into the "how do I extract tar file" problem in the first place.
> dumping out yet another UI that does the same thing we've done 1,000,000 times before isn't exactly novel
And yet, that's exactly what people get paid to do every day. And if it saves them time, they won't exactly get bored of that feature.
They really don’t. People say this all the time, but you give any project a little time and it evolves into a special unique snowflake every single time.
That’s why every low code solution and boilerplate generator for the last 30 years failed to deliver on the promises they made.
I agree some will evolve into more, but lots of them won't. That's why Shopify, WordPress and others exist - most commercial websites are just online business cards or small shops. Designers and devs are hired to work on them all the time.
If you’re hiring a dev to work on your Shopify site, it’s most likely because you want to do something non-standard. By the time the dev gets done with it, it will be a special unique snowflake.
If your site has users, it will evolve. I’ve seen users take what was a simple trucking job posting form and repurpose an unused “trailer type” field to track the status of the job req.
Every single app that starts out as a low code/no code solution given enough time and users will evolve beyond that low code solution. They may keep using it, but they’ll move beyond being able to maintain it exclusively through a low code interface.
And most software engineering principles are about dealing with this evolution.
- Architecture (making it easy to adjust parts of the codebase and to understand it)
- Testing (making sure the current version works and future versions won't break it)
- Requirements (describing the current version and the planned changes)
- ...
If a project were just a clone, I'm sure people would just buy the existing version and be done with it. And sometimes they do; then a unique requirement comes along and the whole process comes back into play.
If your website is so basic that you can just take a template and put your specific details into it, what exactly do you need an LLM for?
If your job can be hollowed out into >90% entering prompts into AI text editors, you won't have to worry about continuing to be paid to do it every day for very long.
> Well, are they useful? ... Yeah, of course LLMs are useful, but we need to remain somewhat grounded in reality. How useful are LLMs?
They are useful enough that they can passably replace (much more expensive) humans in a lot of noncritical jobs, thus being a tangible tool for securing enterprise bottom lines.
Which jobs? I haven't seen LLMs successfully replace more expensive humans in noncritical roles
From what I've seen in my own job and observing what my wife does (she's been working with the things on very LLM-centric processes and products in a variety of roles for about three years) not a lot of people are able to use them to even get a small productivity boost. Anyone less than very-capable trying to use them just makes a ton more work for someone more expensive than they are.
They're still useful, but they're not going to make cheap employees wildly more productive, and outside maybe a rare, perfect niche, they're not going to increase expensive employees' productivity so much that you can lay off a bunch of the cheap ones. Like, they're not even close to that, and haven't really been getting much closer despite improvements.
>they can dump out a boilerplate react frontend to a CRUD API
This is so clearly biased that it borders on parody. You can only get out what you put in. The real use case of current LLMs is that any project that would previously have required collaboration can now be done solo with a much faster turnaround. Of course in 20 years when compute finally catches up they will just be super intelligent AGI
I have Cursor running on my machine right now. I am even paying for it. This is in part because no matter what happens, people keep professing, basically every single time a new model is released, that it has finally happened: programmers are finally obsolete.
Despite the ridiculous hype, though, I have found that these things have crossed into usefulness. I imagine for people with less experience, these tools are a godsend, enabling them to do things they definitely couldn't do on their own before. Cool.
Beyond that? I definitely struggle to find things I can do with these tools that I couldn't do better without. The main advantage so far is that these tools can do these things very fast and relatively cheaply. Personally, I would love to have a tool that I can describe what I want in detailed but plain English and have it be done. It would probably ruin my career, but it would be amazing for building software. It'd be like having an army of developers on your desktop computer.
But, alas, a lot of the cool shit I'd love to do with LLMs doesn't seem to pan out. They're really good at TypeScript and web stuff, but their proficiency definitely tapers off as you veer out. It seems to work best when you can find tasks that basically amount to translation, like converting between programming languages in a fuzzy way (e.g. trying to translate idioms). What's troubling me the most is that they can generate shitloads of code but basically can't really debug the code they write beyond the most entry-level problem-solving. Reverse engineering also seems like an amazing use case, but the implementations I've seen so far definitely are not scratching the itch.
> Of course in 20 years when compute finally catches up they will just be super intelligent AGI
I am betting against this. Not the "20 years" part, it could be months for all we know; but the "compute finally catches up" part. Our brains don't burn kilowatts of power to do what they do, yet given basically unbounded time and compute, current AI architectures are simply unable to do things that humans can, and there aren't many benchmarks that are demonstrating how absolutely cataclysmically wide the gap is.
I'm certain there's nothing magical about the meat brain, as much as that is existentially challenging. I'm not sure that this follows through to the idea that you could replicate it on a cluster of graphics cards, but I'm also not personally betting against that idea, either. On the other hand, getting the absurd results we have gotten out of AI models today didn't involve modest increases. It involved explosive investment in every dimension. You can only explode those dimensions out so far before you start to run up against the limitations of... well, physics.
Maybe understanding what LLMs are fundamentally doing to replicate what looks to us like intelligence will help us understand the true nature of the brain or of human intelligence, hell if I know, but what I feel most strongly about is this: I do not believe LLMs are replicating some portion of human intelligence. They are very obviously neither a subset nor a superset, nor particularly close to either. They are some weird entity that overlaps with it in ways we don't fully comprehend yet.
Complete hyperbole.
I see a difference between seeing them as valuable in their current state vs being "bullish about LLMs" in the stock market sense.
The big problem with being bullish in the stock market sense is that OpenAI isn't selling the LLMs that currently exist to their investors, they're selling AGI. Their pitch to investors is more or less this:
> If we accomplish our goal we (and you) will have infinite money. So the expected value of any investment in our technology is infinite dollars. No, you don't need to ask what the odds are of us accomplishing our goal, because any percent times infinity is infinity.
Since OpenAI and all the founders riding on their coat tails are selling AGI, you see a natural backlash against LLMs that points out that they are not AGI and show no signs of asymptotically approaching AGI—they're asymptotically approaching something that will be amazing and transformative in ways that are not immediately clear, but what is clear to those who are watching closely is that they're not approaching Altman's promises.
The AI bubble will burst, and it's going to be painful. I agree with the author that that is inevitable, and it's shocking how few people see it. But also, we're getting a lot of cool tech out of it and plenty of it is being released into the open and heavily commoditized, so that's great!
I think that people who don't believe LLMs to be AGI are not very good at Venn diagrams. Because they certainly are artificial, general, and intelligent according to any dictionary.
Good grief. You are deeply confused and/or deeply literal. That's not the accepted definition of AGI in any sense. One does not evaluate each word as an isolated component when testing the truth of a statement involving an open compound word. Does your "living room" have organs?
For your edification:
https://en.wikipedia.org/wiki/Artificial_general_intelligenc...
Either that, or you can't recognize a tongue-in-cheek comment on goalpost shifting. The wiki page you linked has the original definition of the term from 1997 - dig it up. Better yet, look at the history of that page in the Wayback Machine and see with your own eyes how the release of ChatGPT changed it.
For reference, 1997 original: By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed.
2014 wiki requirements: reason, use strategy, solve puzzles, and make judgments under uncertainty; represent knowledge, including commonsense knowledge; plan; learn; communicate in natural language; and integrate all these skills towards common goals.
By this argument the DEA is the FBI because they are also a federal bureau that does investigations.
I'm dead right now.
You're funny, but "the DEA" is "an FBI", not "the FBI", as you said yourself.
That’s not how language works.
That's precisely how language works in most cases, and there does not seem to be any reason to think it is different in this one.
Although if you look at the trajectory of goalposts since https://web.archive.org/web/20140327014303/https://en.wikipe... , I could concede the point, that people discarded the obvious definition in the light of recent events.
No, it's really not. Joining words into a compound word enables the new compound to take on new meaning and evolve on its own, and if it becomes widely used as a compound it always does so. The term you're looking for if you care to google it is an "open compound noun".
A dog in the sun may be hot, but that doesn't make it a hot dog.
You can use a towel to dry your hair, but that doesn't make the towel a hair dryer.
Putting coffee on a dining room table doesn't turn it into a coffee table.
Spreading Elmer's glue on your teeth doesn't make it tooth paste.
The White House is, in fact, a white house, but my neighbor's white house is not The White House.
I could go on, but I think the above is a sufficient selection to show that language does not, in fact, work that way. You can't decompose a compound noun into its component morphemes and expect to be able to derive the compound's meaning from them.
You wrote so much while failing to read so little:
> in most cases
What do you think will happen if we will start comparing the lengths of the list ["hot dog", ...] and the list ["blue bird", "aeroplane", "sunny March day", ...]?
No, I read that, and it's wrong. Can you point me to a single compound noun that works that way?
A bluebird is a specific species. A blue parrot is not a bluebird.
An aeroplane is a vehicle that flies through the air at high speeds, but if you broke it down into morphemes and tried to reason it out that way you could easily argue that a two-dimensional flat surface that extends infinitely in all directions and intersects the air should count.
Sunny March day isn't a compound noun, it's a noun phrase.
Can you point me to a single compound noun (that is, a two-or-more-part word that is widely used enough to earn a definition in a dictionary, like AGI) that can be subjected to the kind of breaking apart into morphemes that you're doing without yielding obviously nonsensical re-interpretations?
Once a group of words becomes a compound noun, you can’t necessarily look at the individual component words to derive the definition.
A paste made of a ground up tooth is clearly tooth paste because it is both a tooth and paste.
The problem is that a lot of established words have a meaning different from the mere sum of their parts.
For example, “homophobia” literally means “same-fear” - are homophobes afraid of sameness? Do they have an unusual need for variety and novelty?
Then explain:
Firetruck — It's not a truck that is on fire nor is it a truck that delivers fire.
Butterfly — Not a fly made of butter.
Starfish — Not a fish, not a star.
Pineapple — Neither a pine nor an apple.
Guinea pig — A rodent, not a pig.
Koala bear — A marsupial, not a bear.
Silverfish — An insect, not a fish.
Oh, the irony of using the verb "believe" in the same sentence with Venn diagrams... :)
I feel like LLMs are the same as the leap from "world before web search" to "world after web search." Yeah, in Google, you get crap links for sure, and you have to wade through salesy links and random blogs. But in the pre-web-search world, your options were generally "ask a friend who seems smart" or "go to the library for quite a while," AND BOTH OF THOSE OPTIONS HAD PLENTY OF ISSUES. I found a random part in an old Arduino kit I bought years ago, and GPT-4o correctly identified it and explained exactly how to hook it up and code for it to me. That is frickin awesome, and it saves me a ton of time and leads me to reuse the part. I used DeepResearch to research car options that fit my exact needs, and it was 100% spot on - multiple people have suggested models that DeepResearch did not identify that would be a fit, but every time I dig in, I find that DeepResearch was right and the alternative actually had some dealbreaker I had specified. Etc., etc.
In the 90s, Robert Metcalfe infamously wrote "Almost all of the many predictions now being made about 1996 hinge on the Internet’s continuing exponential growth. But I predict the Internet, which only just recently got this section here in InfoWorld, will soon go spectacularly supernova and in 1996 catastrophically collapse." I feel like we are just hearing LLM versions of this quote over and over now, but they will prove to be equally accurate.
> versions of this quote
Generic. For the Internet, more complex questions would have been "What are the potential benefits, what the potential risks, what will grow faster" etc. The problem is not the growth but what that growth means. For LLMs, the big clear question is "will they stop just being LLMs, and when will they". Progress is seen, but we seek a revolution.
It would be fine if it were sold that way, but there is so much hype. We're told that it's going to replace all of us and put us all out of our jobs. They set the expectations so high. Like remember OpenAI showing a video of it doing your taxes for you? Predictions that super-intelligent AI is going to radically transform society faster than we can keep up? I think that's where most of the backlash is coming from.
> We're told that it's going to replace all of us and put us all out of our jobs.
I think this is the source of a lot of the hype. There are people salivating at the thought of no longer needing to employ the peasant class. They want it so badly that they'll say anything to get more investment in LLMs even if it might only ever allow them to fire a fraction of their workers, and even if their products and services suffer because the output they get with "AI" is worse than what the humans they throw away were providing.
They know they're overselling it, but they're also still on their knees praying that by some miracle their LLMs trained on the collective wisdom of facebook and youtube comments will one day gain actual intelligence and they can stop paying human workers.
In the meantime, they'll shove "AI" into everything they can think of for testing and refinement. They'll make us beta test it for them. They don't really care if their AI makes your customer service experience go to shit. They don't care if their AI screws up your bill. They don't care if their AI rejects your claims or you get denied services you've been paying for and are entitled to. They don't care if their AI unfairly denies you parole or mistakenly makes you the suspect of a crime. They don't care if Dr. Sbaitso 2.0 misdiagnoses you. Your suffering is worth it to them as long as they can cut their headcount by any amount and can keep feeding the AI more and more information because just maybe with enough data one day their greatest dream will become reality, and even if that never happens a lot of people are currently making massive amounts of money selling that lie.
The problem is that the bubble will burst eventually. The more time goes by and AI doesn't live up to the hype the harder that hype becomes to sell. Especially when by shoving AI into everything they're exposing a lot of hugely embarrassing shortcomings. Repeating "AI will happen in just 10 more years" gives people a lot of time to make money and cash out though.
On the plus side, we do get some cool toys to play with and the dream of replacing humans has sparked more interest in robotics so it's not all bad.
Yeah, it won't do your taxes for you, but it can sure help you do them yourself. Probably won't put you out of your job either, but it might help you accomplish more. Of course, one result of people accomplishing more in less time is that you need fewer people to do the same amount of work - so some jobs could be lost. But it's also possible that, instead, more will simply be accomplished overall.
People frame that like it's something we gain, efficiency, as if before we were wasting time by thinking for ourselves. I get that they can do certain things better, I'm not sure that delegating to them is free of charge. We're paying something, losing something. Probably learning and fulfillment. We become increasingly dependent on machines to do anything.
Something important happened when we turned the tables around, I don't feel it gets the credit it should. It used to be humans telling machines what to do. Now we're doing the opposite.
Yes. Use an LLM with any regularity to write code or anything else and the effect becomes perceptible. It discourages effortful thought.
Sometimes the means are just as important as the ends, if not more so.
If it had access to my books, a current-gen frontier LLM most certainly could do my taxes. I don't understand that entire line of reasoning.
And it might even be right and not get you in legal trouble! Not that you'd know (until audit day) unless you went back and did them as a verification though.
And if I did my own taxes, I wouldn't know if I were in legal trouble until audit day as well!
If you hired a competent professional accountant, in all but the most contrived scenarios you wouldn't have to worry about legal trouble at all.
Except now, you can hire a competent professional accountant and discover on audit day that they got taken over by private equity, replaced 90% of the professionals doing work with AI and made a lot of money before the consequences become apparent.
This is such a fun timeline
Yes, but you're going to pay through the nose for the "wouldn't have to worry about legal trouble at all" (part of what you're paying for with professional services is a degree of protection from their fuckups).
So going back to apples-and-apples comparison, i.e. assuming that "spend a lot of money to get it done for you" is not on the table, I'd trust current SOTA LLM to do a typical person's taxes better than they themselves would.
I pay my accountant 500 USD to file my taxes. I don't consider that "through the nose" relative to my inflated tech salary.
If a person is making a smaller income their tax situation is probably very simple, and can be handled by automated tools like TurboTax (as the sibling comment suggests).
I don't see a lot of value add from LLMs in this particular context. It's a situation where small mistakes can result in legal trouble or thousands of dollars of losses.
TurboTax handles it fine with what are effectively a bunch of defined if-else statements.
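For illustration, this is roughly the kind of deterministic, rule-based logic meant here - a toy bracket calculator with made-up brackets and rates, not the real tax code or TurboTax's actual implementation:

    # Toy sketch of rule-based tax logic; brackets and rates are invented
    # for illustration and are not real tax figures.
    def toy_tax(income):
        brackets = [                 # (upper bound, marginal rate)
            (10_000, 0.10),
            (40_000, 0.20),
            (float("inf"), 0.30),
        ]
        tax, lower = 0.0, 0.0
        for upper, rate in brackets:
            if income > lower:
                tax += (min(income, upper) - lower) * rate
                lower = upper
            else:
                break
        return tax

    print(toy_tax(55_000))  # 0.1*10k + 0.2*30k + 0.3*15k = 11500.0

Nothing in that needs a language model; it's a fixed decision table, which is the point.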
I'm on a financial forum where people often ask tax questions, generally _fairly_ simple questions. An obnoxious recent trend on many forums, including this one, is idiots feeding questions into a magic robot and posting what it says as a response. Now, ChatGPT may be very good at, er, something, I dunno, I am assured that it has _some_ use by the evangelists, but it is not good at tax, and if people follow many of the answers it gives then they are likely to get in trouble.
If a trillion-parameter model can't handle your taxes, that to me says more about the tax code than the AI code.
People who paste undisclosed AI slop in forums deserve their own place in hell, no argument there. But what are some good examples of simple tax questions where current models are dangerously wrong? If it's not a private forum, can you post any links to those questions?
So, a super-basic one I saw recently, in relation to Irish tax. In Ireland, ETFs are taxed differently to normal stocks (most ETFs available here are accumulating, they internally re-invest dividends; this is uncommon for US ETFs for tax reasons). Normal stocks have gains taxed under the capital gains regime (33% on gains when you sell). ETFs are different; they're taxed 40% on gains when you sell, and they are subject to 'deemed disposal'; every 8 years, you are taxed _as if you had sold and re-bought_. The ostensible reason for this is to offset the benefit from untaxed compounding of dividends.
Anyway, the magic robot 'knew' all that. Where it slipped up was in actually _working_ with it. Someone asked for a comparison of taxation on a 20 year investment in individual stocks vs ETFs, assuming re-investment of dividends and the same overall growth rate. The machine happily generated a comparison showing individual stocks doing massively better... On closer inspection, it was comparing growth for 20 years for the individual stocks to growth of 8 years for the ETFs. (It also got the marginal income tax rate wrong.)
But the nonsense it spat out _looked_ authoritative on first glance, and it was a couple of replies before it was pointed out that it was completely wrong. The problem isn't that the machine doesn't know the rules; insofar as it 'knows' anything, it knows the rules. But it certainly can't reliably apply them.
(I'd post a link, but they deleted it after it was pointed out that it was nonsense.)
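For what it's worth, a back-of-the-envelope sketch of the comparison the poster asked for is only a few lines of Python. The 7% growth rate, the 10,000 EUR lump sum, and the simplification that deemed-disposal tax is paid out of the investment (ignoring real-world credits and offsets) are assumptions for illustration, not tax advice; the point is simply that both investments should compound for the full 20 years, which is exactly what the model failed to do.

    # Rough sketch, not tax advice: 33% CGT on stocks vs 40% exit tax with
    # deemed disposal every 8 years on ETFs, as described above.
    def stocks_after_cgt(principal, rate, years, cgt=0.33):
        value = principal * (1 + rate) ** years
        return value - cgt * (value - principal)

    def etf_after_exit_tax(principal, rate, years, exit_tax=0.40, dd_every=8):
        value, basis = principal, principal
        for year in range(1, years + 1):
            value *= 1 + rate
            if year % dd_every == 0 and year != years:          # deemed disposal
                value -= exit_tax * max(value - basis, 0)       # simplification: tax paid from the pot
                basis = value                                   # and the basis reset
        return value - exit_tax * max(value - basis, 0)         # final sale

    p, r, yrs = 10_000, 0.07, 20
    print(f"stocks after CGT  : {stocks_after_cgt(p, r, yrs):,.0f}")
    print(f"ETF after exit tax: {etf_after_exit_tax(p, r, yrs):,.0f}")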
Interesting, thanks. That doesn't seem like an entirely simple question, but it does demonstrate that the model is still not great at recognizing when it is out of its league and should either hedge its answer, refuse altogether, or delegate to an appropriate external tool.
This failure seems similar to a case that someone brought up earlier ( https://news.ycombinator.com/item?id=43466531 ). While better than expected at computation, the transformer model ultimately overestimates its own ability, running afoul of Dunning-Kruger much like humans tend to.
Replying here due to rate-limiting:
One interesting thing is that when one model fails spectacularly like that, its competitors often do not. If you were to cut/paste the same prompt and feed it to o1-pro, Claude 3.7, and Gemini 2.5, it's possible that they would all get it wrong (after all, I doubt they saw a lot of Irish tax law during training.) But if they do, they will very likely make different errors.
Unfortunately it doesn't sound like that experiment can be run now, but I've run similar tests often enough to tell me that wrong answers or faulty reasoning are more likely model-specific shortcomings rather than technology-specific shortcomings.
That's why I get triggered when people speak authoritatively on here about what AI models "can't do" or "will never be able to do." These people have almost always, almost without exception, been proven dead wrong in the past, but that never seems to bother them.
It's the sort of mistake that it's hard to imagine a human making, is the thing. Many humans might have trouble compounding at all, but the 20 year/8 year confusion just wouldn't happen. And I think it is on the simple side of tax questions (in particular all the _rules_ involved are simple, well-defined, and involve no ambiguity or opinion; you certainly can't say that of all tax rules). Tax gets _complicated_.
This reminds me of the early days of Google, when people who knew how to phrase a query got dramatically better results than those who basically just entered what they were looking for as if asking a human. And indeed, phrasing your prompts is important here too, but I mean more that by having a bit of an understanding of how it works and how it differs from a human, you can avoid getting sucked in by most of these gaps in its abilities, while benefitting from what it's good at. I would ask it the question about the capital gains rules (and would verify the response probably with a link I'd ask it to provide), but I definitely wouldn't expect it to correctly provide a comparison like that. (I might still ask, but would expect to have to check its work.)
Forget OpenAI ChatGPT doing your taxes for you. Now Gemini will write up your sales slides about Gouda cheese, stating wrongly in the process that gouda makes up about 50% of all cheese consumption worldwide :) These use-cases are getting more useful by the day ;)
I mean, it's been like 3 years. 3 years after the web came out was barely anything. 3 years after the first GPU was cool, but not that cool. The past three years in LLMs? Insane.
Things could stall out and we'll have bumps and delays ... I hope. If this thing progresses at the same pace, or speeds up, well ... reality will change.
Or not. Even as they are, we can build some cool stuff with them.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly?
The trouble is that, while incredibly amazing, mind blowing technology, it falls down flat often enough that it is a big gamble to use. It is never clear, at least to me, what it is good at and what it isn't good at. Many things I assume it will struggle with, it jumps in with ease, and vice versa.
As the failures mount, I admittedly do find it becoming harder and harder to compel myself to see if it will work for my next task. It very well might succeed, but by the time I go to all the trouble to find out it often feels that I may as well just do it the old fashioned way.
If I'm not alone, that could be a big challenge in seeing long-term commercial success. Especially given that commercial success for LLMs is currently defined as 'take over the world' and not 'sustain mom and pop'.
> the speed at which it is progressing is insane.
But same goes for the users! As a result the failure rate appears to be closer to a constant. Until we reach the end of human achievement, where the humans can no longer think of new ways to use LLMs, that is unlikely to change.
It's becoming clear to me that some people just have vastly different uses and use cases than I do. Summarizing a deep, cutting edge physics paper is I'm sure vastly different than summarizing a web page while I'm browsing HN, or writing a Python plugin for Icinga to monitor a web endpoint that spits out JSON.
The author says they use several LLMs every day and they always produce incorrect results. That "feels" weird, because it seems like you'd develop an intuition fairly quickly for the kinds of questions you'd ask that LLMs can and can't answer. If I want something with links to back up what is being said, I know I should ask Perplexity or maybe just ask a long-form prompt-like question of Google or Kagi. If I want a Python or bash program I'm probably going to ask ChatGPT or Gemini. If I want to work on some code I want to be in Cursor and am probably using Claude. For general life questions, I've been asking Claude and ChatGPT.
Running into the same issue with LLMs over and over for years, with all due respect, seems like the "doing the same thing and expecting different results" situation.
This is so true. I really hope she joins this conversation so we can have a productive discussion and understand what she's actually hoping to achieve.
The two sides are never going to understand each other because I suspect we work on entirely different things and have radically different workflows. I suspect that hackernews gets more use out of LLMs in general than the average programmer because they are far more likely to be at a web startup and more likely to actually be bottlenecked on how fast you can physically put more code in the file and ship sooner.
If you work on stuff that is at all niche (as in, stack overflow was probably not going to have the answer you needed even before LLMs became popular), then it's not surprising when LLMs can't help because they've not been trained.
For people that were already going fast and needed or wanted to put out more code more quickly, I'm sure LLMs will speed them up even more.
For those of us working on niche stuff, we weren't going fast in the first place or being judged on how quickly we ship in all likelihood. So LLMs (even if they were trained on our stuff) aren't going to be able to speed us up because the bottleneck has never been about not being able to write enough code fast enough. There are architectural and environmental and testing related bottlenecks that LLMs don't get rid of.
That's a good point, I've personally not got much use out of LLMs (I use them to generate fantasy names for my D&D campaign, but find they fall down for anything complex) - but I've also never got much use out of StackOverflow either.
I don't think I'm working on anything particularly niche, but nor is it cookie-cutter generic either, and that could be enough to drastically reduce their utility.
"It talks to me like a person"
Because it has a sample size of our collective human knowledge and language big enough to trick our brains into believing that.
As a parallel thought, it reminds me of a trick Derren Brown did. He picked every horse correctly across 6 races. The person he was picking for was obviously stunned, as was the audience watching it.
The reality of course is just that people couldn't comprehend that he just had to go to extreme and tedious lengths to make this happen. They started with 7000 people and filmed every one like it was going to be the "one" and then the probability pyramid just dropped people out. It was such a vast undertaking of time and effort that we're biased towards believing there must be something really happening here.
LLMs currently are a natural language interface to a Microsoft Encarta like system that is so unbelievably detailed and all encompassing that we risk accepting that there's something more going on there. There isn't.
> Because it has a sample size of our collective human knowledge and language big enough to trick our brains into believing that.
Yes, it's artificial intelligence. It's not the real thing, it's artificial.
Again, it's not intelligence. It's a mirror that condenses our own intelligence and reflects back to us using probabilities at a scale that tricks us into the notion there is something more than just a big index and clever search interface.
There is no meaningful interpretation of the word intelligence that applies, psychologically or philosophically, to what is going on. Machine Learning is far more apt and far less misleading.
I saw the transition from ML to AI happen in academic papers and then pitch decks in real time. It was to refill the well when investors were losing faith that ML could deliver on the promises. It was not progress driven.
> our own intelligence
this doesn't make any more sense than calling LLMs "intelligence". There is no "our intelligence" beyond a concept or an idea that you or someone else may have about the collective, which is an abstraction.
What we do each have is our own intelligence, and that intelligence is, and likely always will be, no matter how science progresses, ineffable. So my point is you can't say your made-up/ill-defined concept is any realer than any other made-up/ill-defined concept.
It really depends on the task. Like Sabine, I’m operating on the very frontier of a scientific domain that is extremely niche. Every single LLM out there is worse than useless in this domain. It spits out incomprehensible garbage.
But ask it to solve some leet code and it’s brilliant.
The question I ask afterwards then is: is solving some leet code brilliant? Is designing a simple inventory system brilliant if they've all been accomplished already? My answer tends towards no, since they still make mistakes in the process, and it keeps newer developers from learning.
At non-extremely niche tasks they fail as well.
I should start collecting examples, if only for threads like this. Recently I tried to llm a tsserver plugin that treats lines ending with "//del" as empty. You can only imagine all the sneaky failures in the chat and the total uselessness of these results.
Anything that is not literally millions (billions?) of times in the training set is doomed to be fantasized about by an LLM. In various ways, tones, etc. After many such threads I came to the conclusion that people who find it mostly useful are simply treading water as they probably have done most of their career. Their average product is a react form with a crud endpoint and excitement about it. I can't explain their success reports otherwise, because it rarely works on anything beyond that.
LLMs are basically a search engine for Stack Overflow and Github that doesn't suck as bad as Google does.
If your job is copy-pasting from Stack Overflow then LLMs are an upgrade.
Welcome to the new digital divide people, and the start of a new level of "inequality" in this world. This thread is proof that we've diverged and there is a huge subset of people that will not have their minds changed easily.
Surely you understand why an LLM that has no knowledge of your niche wouldn't be useful right?
Hallucinating incorrect information is worse than useless. It is actively harmful.
I wonder how much this affects our fundraising, for example. No VC understands the science here, so they turn to advisors (which is great!) or to LLMs… which has us starting off on the wrong foot.
Good thing humans never make mistakes.
Good scientists and engineers know how to say “I don’t know.”
I work in a field that is not even close to a scientific niche - software reverse engineering - and LLMs will happily lie to me all the time, for every question I have. I find it useful to generate some initial boilerplate but... that's it. AI autocompletion saved me an order of magnitude more time, and nobody is hyped about it.
How many actual humans are useful in your niche scientific domain?
And how many actual humans, with a fair bit of training, can become a little bit less than useless?
I mean, my parents used to have this dog that would just look at you like "go get you own damn ball, stupid human" if you threw a ball around him.
--edit--
and, yes, the dog also made grammatical mistakes.
Sabine is Lex Fridman for women. Stay in your lane about quantum physics and stop trying to opine on LLMs. I’m tired of seeing the huge amount of FUD from her.
What she is saying is correct about the utility of LLMs in scientific research though.
When your user says that your product doesn’t work for them, saying they’re using it wrong is not an excuse.
I think Lex Fridman is far, far worse
Two things can be true: e.g., that LLMs are incredible tech we only dreamed of having, and that they’re so flawed that they’re hard to put to productive use.
I just tried to use the latest Gemini release to help me figure out how to do some very basic Google Cloud setup. I thought my own ignorance in this area was to blame for the 30 minutes I spent trying to follow its instructions - only to discover that Gemini had wildly hallucinated key parts of the plan. And that’s Google’s own flagship model!
I think it’s pretty telling that companies are still struggling to find product-market fit in most fields outside of code completion.
> Wah, it can't write code like a Senior engineer with 20 years of experience!
No, that's not my problem with it. My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.
Sure, sometimes it produces useful code. And often, it'll simply call the "doTheHardPart()" method. I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over.
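(For reference, selection sort is short enough to eyeball, which is what makes the bubble-sort substitution so jarring. A minimal Python version, just to make the distinction concrete:)

    def selection_sort(a):
        # Selection sort: find the minimum of the unsorted tail and swap it
        # into place -- one swap per outer pass, unlike bubble sort, which
        # repeatedly swaps adjacent out-of-order pairs.
        for i in range(len(a) - 1):
            min_idx = i
            for j in range(i + 1, len(a)):
                if a[j] < a[min_idx]:
                    min_idx = j
            a[i], a[min_idx] = a[min_idx], a[i]
        return a

    print(selection_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]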
Outside of programming, this is much worse. I've both seen online and heard people quote LLM output as if it were authoritative. That to me is the bigger danger of LLMs to society. People just don't understand that LLMs aren't high powered attorneys, or world renown doctors. And, unfortunately, the incorrect perception of LLMs is being hyped both by LLM companies and by "journalists" who are all to ready to simply run with and discuss the press releases from said LLM companies.
> What's worse, people are treating them as authoritative. … I've both seen online and heard people quote LLM output as if it were authoritative.
That's not an LLM problem. But indeed quite bothersome. Don't tell me what ChatGPT told you. Tell me what you know. Maybe you got it from ChatGPT and verified it. Great. But my jaw kind of drops when people cite an LLM and just assume it’s correct.
It might not be an LLM problem, but it’s an AI-as-product problem. I feel like every major player’s gamble is that they can cement distinct branding and model capabilities (as perceived by the public) faster than the gradual calcification of public AI perception catches up with model improvements - every time a consumer gets burned by AI output in even small ways, the “AI version of Siri/Alexa only being used for music and timers” problem looms a tiny, tiny bit larger.
Why is that a problem tho?
Branding for current products has this property today - for example, Apple products are seen as being used by creatives and such.
> when people cite an LLM and just assume it’s correct.
people used to say the exact same thing with wikipedia back when it first started.
These are not similar. Wikipedia says the same thing to everybody, and when what it says is wrong, anybody can correct it, and they do. Consequently it's always been fairly reliable.
Lies and mistakes persist on Wikipedia for many years. They just need to sound truthy so they don't jump out to Wikipedia power users who aren't familiar with the subject. I've been keeping tabs on one for about five years, and it's several years older than that, which I won't correct because I am IP range banned and I don't feel like making an account and dealing with any basement dwelling power editor NEETs who read Wikipedia rules and processes for fun. I know I'm not the only one to have noticed, because this glaring error isn't in a particularly obscure niche; it's in the article for a certain notorious defense initiative which has been in the news lately, so this error has plenty of eyes on it.
In fact, the error might even be a good thing; it reminds attentive readers that Wikipedia is an unreliable source and you always have to check if citations actually say the thing which is being said in the sentence they're attached to.
Maybe you're just wrong about it.
Citation Needed - you can track down WHY it's reliable too if the stakes are high enough or the data seems iffy.
That's true too, but the bigger difference from my point of view is that factual errors in Wikipedia are relatively uncommon, while, in the LLM output I've been able to generate, factual errors vastly outnumber correct facts. LLMs are fantastic at creativity and language translation but terrible at saying true things instead of false things.
> Consequently it's always been fairly reliable.
Comments like these honestly make me much more concerned than LLM hallucinations. There have been numerous times when I've tracked down the source for a claim, only to find that the source was saying something different, or that the source was completely unreliable (sometimes on the crackpot level).
Currently, there's a much greater understanding that LLM's are unreliable. Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.
The big problem is that people in general are terrible at exercising critical thinking when they're presented with information. It's probably less of an issue with LLMs at the moment, since they're new technology and a certain amount of skepticism gets applied to their output. But the issue is that once people have gotten more used to them, they'll turn off their critical thinking in the same manner that they turn it off when absorbing information from other sources that they're used to.
Wikipedia is fairly reliable if our standard isn't a platonic ideal of truth but real-world comparators. Reminds me of Kant's famous line. "From the crooked timber of humankind, nothing entirely straight can be made".
See the Wikipedia page on the subject :)
https://en.m.wikipedia.org/wiki/Reliability_of_Wikipedia
The sell of Wikipedia was never "we'll think so you don't have to", it was never going to disarm you of your skepticism and critical thought, and you can actually check the sources. LLMs are sold as "replace knowledge work(ers)", you cannot check their sources, and the only way you can check their work is by going to something like Wikipedia. They're just fundamentally different things.
> The sell of Wikipedia was never "we'll think so you don't have to", it was never going to disarm you of your skepticism and critical thought, and you can actually check the sources.
You can check them, but Wikipedia doesn't care what they say. When I checked a citation on the French Toast page, and noted that the source said the opposite of what Wikipedia did by annotating that citation with [failed verification], an editor showed up to remove that annotation and scold me that the only thing that mattered was whether the source existed, not what it might or might not say.
I feel like I hear a lot of criticism about Wikipedia editors, but isn't Wikipedia overall pretty good? I'm not gonna defend every editor action or whatever, but I think the product speaks for itself.
Wikipedia is overall pretty good, but it sometimes contains erroneous information. LLMs are overall pretty good, but they sometimes contain erroneous information.
The weird part is when people get really concerned that someone might treat the former as a reliable source, but then turn around and argue that people should treat the latter as a reliable source.
I had a moment of pique where I was just gonna copy paste my reply to this rehash of your original point that is non-responsive to what I wrote, but I've found myself. Instead, I will link to the Wikipedia article for Equivocation [0] and ChatGPT's answer to "are wikipedia and LLMs alike?"
[0]: https://en.wikipedia.org/wiki/Equivocation
[1]: https://chatgpt.com/share/67e6adf3-3598-8003-8ccd-68564b7194...
Wikipedia occasionally has errors, which are usually minor. The LLMs I've tried occasionally get things right, but mostly emit limitless streams of plausible-sounding lies. Your comment paints them as much more similar than they are.
In my experience, it's really common for wikipedia to have errors, but it's true that they tend to be minor. And yes, LLMs mostly just produce crazy gibberish. They're clearly worse than wikipedia. But I don't think wikipedia is meeting a standard it should be proud of.
Yes, I agree. What kind of institution do you think could do better?
It's scored better than other encyclopedias it has been compared against, which is something.
Wikipedia is one of the better sources out there for topics that are not seen as political.
For politically loaded topics, though, Wikipedia has become increasingly biased towards one side over the past 10-15 years.
source: the other side (conveniently works in any direction)
> Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.
One of these things is not like the others! Almost always, when I see somebody claiming Wikipedia is wrong about something, it's because they're some kind of crackpot. I find errors in Wikipedia several times a year; probably the majority of my contribution history to Wikipedia https://en.wikipedia.org/wiki/Special:Contributions/Kragen consists of me correcting errors in it. Occasionally my correction is incorrect, so someone corrects my correction. This happens several times a decade.
By contrast, I find many YouTube videos and studies from advocacy groups to be full of errors, and there is no mechanism for even the authors themselves to correct them, much less for someone else to do so. (I don't know enough about posts on AskHistorians to comment intelligently, but I assume that if there's a major factual error, the top-voted comments will tell you so—unlike YouTube or advocacy-group studies—but minor errors will generally remain uncorrected; and that generally only a single person's expertise is applied to getting the post right.)
But none of these are in the same league as LLM output, which in my experience usually contains more falsehoods than facts.
> Currently, there's a much greater understanding that LLM's are unreliable.
Wikipedia being world-editable and thus unreliable has been beaten into everyone's minds for decades.
LLMs just popped into existence a few years ago, backed by much hype and marketing about "intelligence". No, normal people you find on the street do not in fact understand that they are unreliable. Watch some less computer literate people interact with ChatGPT - it's terrifying. They trust every word!
Look at the comments here. No one is claiming that LLMs are reliable, while numerous people are claiming that Wikipedia is reliable.
Isn't that the issue with basically any medium?
If you read a non-fiction book on any topic, you can probably assume that half of the information in it is just extrapolated from the author's experience.
Even scientific articles are full of inaccurate statements, the only thing you can somewhat trust are the narrow questions answered by the data, which is usually a small effect that may or may not be reproducible...
No, different media are different—or, better said, different institutions are different, and different media can support different institutions.
Nonfiction books and scientific papers generally only have one person, or at best a dozen or so (with rare exceptions like CERN papers), giving attention to their correctness. Email messages and YouTube videos generally only have one. This limits the expertise that can be brought to bear on them. Books can be corrected in later printings, an advantage not enjoyed by the other three. Email messages and YouTube videos are usually displayed together with replies, but usually comments pointing out errors in YouTube videos get drowned in worthless me-too noise.
But popular Wikipedia articles are routinely corrected by hundreds or thousands of people, all of whom must come to a rough consensus on what is true before the paragraph stabilizes.
Consequently, although you can easily find errors in Wikipedia, they are much less common in these other media.
Yes, though by different degrees. I wouldn't take any claim I read on Wikipedia, got from an LLM, saw in a AskHistorians or Hacker News reply, etc., as fact, and I would never use any of those as a source to back up or prove something I was saying.
Newspaper articles? It really depends. I wouldn't take paraphrased quotes or "sources say" as fact.
But as you move to generally more reliable sources, you also have to be aware that they can mislead in different ways, such as constructing the information in a particular way to push a particular narrative, or leaving out inconvenient facts.
And that is still accurate today. Information always contains a bias from the narrator's perspective. Having multiple sources allows one to triangulate the accuracy of information. Making people use one source of information would allow the business to control the entire narrative. It's just more of a business around people and sentiments than being bullish on science.
Correct; Wikipedia is still not authoritative.
Wikipedia will cite and often broadly source. Wikipedia has an auditable decision trail for content conflicts.
It behaves more like an accountable mediator of authority.
Perhaps LLMs offering those (among other) features would be reasonably matched in an authoritativeness comparison.
Authority, yes, accountable, not so much.
Basically at the level of other publishers, meaning they can be as biased as MSNBC or Fox News, depending on who controls them.
And they were right, right? They recognized it had structural faults that made it possible for bad data to seep in. The same is valid for LLMs: they have structural faults.
So what is your point? You seem to have placed assumptions there. And broad ones, so that differences between the two things, and complexities, the important details, do not appear.
> Thats not an LLM problem
It is, if the purpose of LLMs was to be AI. "Large language model" as a choir of pseudorandom millions converged into a voice - that was achieved, but it is by definition out of the professional realm. If it is to be taken as "artificial intelligence", then it has to have competitive intelligence.
> But my jaw kind of drops when people cite an LLM and just assume it’s correct.
Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor, so is it totally their fault?
They've heard about the uncountable sums of money spent on creating such software, why would they assume it was anything short of advertised?
> Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor
Why does this imply that they’re always correct? I’m always genuinely confused when people pretend like hallucinations are some secret that AI companies are hiding. Literally every chat interface says something like “LLMs are not always accurate”.
> Literally every chat interface says something like “LLMs are not always accurate”.
In small, de-emphasized text, relegated to the far corner of the screen. Yet, none of the TV advertisements I've seen have spent any significant fraction of the ad warning about these dangers. Every ad I've seen presents someone asking a question to the LLM, getting an answer and immediately trusting it.
So, yes, they all have some light-grey 12px disclaimer somewhere. Surprisingly, that disclaimer does not carry nearly the same weight as the rest of the industry's combined marketing efforts.
> In small, de-emphasized text, relegated to the far corner of the screen.
I just opened ChatGPT.com and typed in the question “When was Mr T born?”.
When I got the answer there were these things on screen:
- A menu trigger in the top-left.
- Log in / Sign up in the top right
- The discussion, in the centre.
- A T&Cs disclaimer at the bottom.
- An input box at the bottom.
- “ChatGPT can make mistakes. Check important info.” directly underneath the input box.
I dislike the fact that it’s low contrast, but it’s not in a far corner, it’s immediately below the primary input. There’s a grand total of six things on screen, two of which are tucked away in a corner.
This is a very minimal UI, and they put the warning message right where people interact with it. It’s not lost in a corner of a busy interface somewhere.
Maybe it's just down to different screen sizes, but when I open a new chat in chat GPT, the prompt is in the center of the screen, and the disclaimer is quite a distance away at the very bottom of the screen.
Though, my real point is we need to weigh that disclaimer, against the combined messaging and marketing efforts of the AI industry. No TV ad gives me that disclaimer.
Here's an Apple Intelligence ad: https://www.youtube.com/watch?v=A0BXZhdDqZM. No disclaimer.
Here's a Meta AI ad: https://www.youtube.com/watch?v=2clcDZ-oapU. No disclaimer.
Then we can look at people's behavior. Look at the (surprisingly numerous) cases of lawyers getting taken to the woodshed by a judge for submitting filings to a court with ChatGPT-introduced fake citations! Or someone like Ana Navarro confidently repeating an incorrect fact, and when people pushed back saying "take it up with chat GPT" (https://x.com/ananavarro/status/1864049783637217423).
I just don't think the average person who isn't following this closely understands the disclaimer. Hell, they probably don't even really read it, because most people skip over reading most de-emphasized text in most UIs.
So, in my opinion, whether it's right next to the text-box or not, the disclaimer simply cannot carry the same amount of cultural impact as the "other side of the ledger" that are making wild, unfounded claims to the public.
I remember when Google results called out the ads as distinct from the search results.
That was necessary to build trust until they had enough power to convert that trust into money and power.
Speculating that they may at some point in the future remove that message does not mean that it is not there now. This was the point being made:
> Literally every chat interface says something like “LLMs are not always accurate”.
> Surprisingly, that disclaimer does not carry nearly the same weight as the rest of the industry's combined marketing efforts.
Thank you.
I’ve come to believe a more depressing take: they _want_ to believe it, and therefore do.
No disclaimer is gonna change that.
Is it worse or better than quoting a Twitter comment and taking that as if it were authoritative?
It remains proportional to the earned prestige of the speaker.
> inbuilt into the models of all LLMs is that they'll fabricate a lot.
Still the elephant in the room. We need an AI technology that can output "don't know" when appropriate. How's that coming along?
Unfortunately they are trained first and foremost as plausibility engines. The central dogma is that plausibility will (with continuing progress & scale) converge towards correctness, or "faithfulness" as it's sometimes called in the literature.
This remains very far from proven.
The null hypothesis that would be necessary to reject, therefore, is a most unfortunate one, viz. that by training for plausibility we are creating the world's most convincing bullshit machines.
> plausibility [would] converge towards correctness
That is the most horribly dangerous idea, as we demand that the agent not guess, even (and especially) when the agent is a champion at guessing; we demand that the agent check.
If G guesses from the multiplication table with remarkable success, we more strongly demand that G computes its output accurately instead.
Oracles so accurate on average that people may forget they are not computers are dangerous.
One man's "plausibility" is another person's "barely reasoned bullshit". I think you're being generous, because LLMs explicitly don't deal in facts, they deal in making stuff up that is vaguely reminiscent of fact. Only a few companies are even trying to make reasoning (as in axioms-cum-deductions, i.e., logic per se) a core part of the models, and they're really struggling to hand-engineer the topology and methodology necessary for that to work roughly as facsimile of technical reasoning.
I’m not really being generous. I merely think if I’m gonna condemn something as high-profile snake oil for the tragically gullible, it’s helpful to have a solid basis for doing so. And it’s also important to allow oneself to be wrong about something, however remote the possibility may currently seem, and preferably without having to revise one’s principles to recognise it.
As a sort of related anecdote... if you remember the days before google, people sitting around your dinner table arguing about stuff used to spew all sorts of bullshit then drop that they have a degree from XYZ university and they won the argument... when google/wikipedia came around turned out that those people were in fact just spewing bullshit. I'm sure there was some damage but it feels like a similar thing. Our "bullshit-radar" seems to be able to adapt to these sorts of things.
Well, conspiracy theories are thriving in this day and age, with access to technology and information at one's fingertips. Add to that the US administration now effectively spewing bullshit every few minutes.
The best example of this was an argument I had a little while ago where I was talking about self driving and I was mentioning that I have a hard time trusting any system relying only on cameras, to which I was being told that I didn't understand how machine learning works and obviously they were correct and I was wrong and every car would be self driving within 5 years. All of these things could easily be verified independently.
Suffice to say that I am not sure that the "bullshit-radar" is that adaptive...
Mind you, this is not limited to the particular issue at hand but I think those situations needs to be highlighted, because we get fooled easily by authoritative delivery...
Language models are closing the gaps that still remain at an amazing rate. There are still a few gaps, but if we consider what has happened just in the last year, and extrapolated 2-3 years out....
If the training data had a lot of humans saying "I don't know", then the LLM's would too.
Humans don't and LLM's are essentially trained to resemble most humans.
I have seen many people not saying "don't know" when appropriate. If you believe whomever without some double-checking you will have (bad) surprises.
To make another parallel: that's why we have automated testing in software (long before LLMs). Because you can't trust without checking.
And what's your opinion of people that never say "don't know"?
Unless you are in sales or marketing, getting caught lying is really detrimental to your career.
Or politics, where confidently lying is an essential skill.
Is it? Seems like a lost art these days. Modern politicians hardly put in any effort making up convincing and subtle lies anymore..
and it still works. And problem seems to be related to the unconditional trust of LLM output.
Too many "don't know"s when they should know, and they come across as stupid.
Too few "don't know"s and they end up being wrong: an idiot.
There is lying and there is being incompetent.
Then the optimal strategy is to never say "don't know" and never get caught lying.
Seems to work for many people. I suspect my career has been hampered by a higher-than-average willingness to say "I don't know"...
Some people trust Alex Jones, while the vast majority realize that he just fabricates untruths constantly. Far fewer people realize that LLMs do the same.
People know that computers are deterministic, but most don't realize that determinism and accuracy are orthogonal. Most non-IT people give computers authoritative deference they do not deserve. This has been a huge issue with things like Shot Spotter, facial recognition, etc.
I think you are discounting the fact that you can weed out people who make a habit of that, but you can't do that with LLMs if they are all doing that.
That's what I really want.
One thing I see a lot on X is people asking Grok what movie or show a scene is from.
LLMs must be really, really bad at this because not only is it never right, it actually just makes something up that doesn't exist. Every, single, time.
I really wish it would just say "I'm not good at this, so I do not know."
When your model of the world is build on the relative probabilities of the next opaque apparently-arbitrary number in context of prior opaque apparently-arbitrary numbers, it must be nearly impossible to tell the difference between “there are several plausible ways to proceed, many of which the user will find useful or informative, and I should pick one” and “I don’t know”. Attempting to adjust to allow for the latter probably tends to make the things output “I don’t know” all the time, even when the output they’d have otherwise produced would have been good.
I thought about this of course, and I think a reasonable 'hack' for now is to more or less hardcode things that your LLM sucks at, and override it to say it doesn't know. Because continually failing at basic tasks is bad for confidence in said product.
I mean, it basically does the same thing if you ask it to do anything racist or offensive, so that override ability is obviously there.
So if it identifies the request as identifying a movie scene, just say 'I don't know', for example.
Hardcode by whom? Who do we trust with this task to do it correctly? Another LLM that suffers from the same fundamental flaw or by a low paid digital worker in a developing country? Because that's the current solution. And who's gonna pay for all that once the dumb investment money runs out, who's gonna stick around after the hype?
By the LLM team (Grok team, in this case). I don't mean for the LLM to be sentient enough to know it doesn't know the answer, I mean for the LLM to identify what is being asked of it, and checking to see if that's something on the 'blacklist of actions I cannot do yet', said list maintained by humans, before replying.
No different than when asking ChatGPT to generate images or videos or whatever before it could, it would just tell you it was unable to.
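A minimal sketch of that routing idea, purely hypothetical: classify_request and call_model are stand-in callables, not real APIs, and the blacklist categories are invented for illustration.

    # Hypothetical "blacklist before answering" router.
    BLACKLIST = {"identify_movie_scene", "read_exact_pixel_text"}

    def answer(prompt, classify_request, call_model):
        category = classify_request(prompt)   # e.g. a cheap classifier model or simple rules
        if category in BLACKLIST:             # human-maintained list of known-bad tasks
            return "I'm not good at this kind of request, so I don't know."
        return call_model(prompt)

    # Toy usage with stand-in callables.
    print(answer(
        "What movie is this scene from?",
        classify_request=lambda p: "identify_movie_scene" if "scene" in p.lower() else "other",
        call_model=lambda p: "(model answer here)",
    ))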
> It's impossible to predict with certainty who will be the U.S. President in 2046. The political landscape can change significantly over time, and many factors, including elections, candidates, and events, will influence the outcome. The next U.S. presidential election will take place in 2028, so it would be difficult to know for sure who will hold office nearly two decades from now.
So it can say “I don’t know”
It can do this because it is in fact the most likely thing to continue with, word by word.
But the most likely way to continue a paper is not to say at the end „I don't know“. It is actually to provide sources, which it proceeds to do wrongly.
>> We need an AI technology that can output "don't know" when appropriate. How's that coming along?
Heh. Easiest answer in the world. To be able to say "don't know", one has first to be able to "know". And we ain't there yet, by large. Not even flying by a million miles of it.
Needs meta-annotation of certainty on all nodes and tokens that accumulates while reasoning. Also gives the ability to train in beliefs, as in overriding any uncertainty. Right now we are in the pure belief phase. AI is its own god right now, pure blissful belief without the sin of doubt.
Not sure. We haven't figured it out for humans yet.
Sure we have. We don't have a perfect solution but it's miles better than what we have for LLMs.
If a lawyer consistently makes stuff up on legal filings, in the worst cases they can lose their license (though they'll most likely end up getting fines).
If a doctor really sucks, they become uninsurable and ultimately could lose their medical license.
Devs that don't double check their work will cause havoc with the product and, not only will they earn low opinions from their colleagues, they could face termination.
Again, not perfect, but also not unfigured out.
By the same measure, if an LLM really sucks we stop using it, same solution
Many people haven’t gotten the message yet, it seems.
Sure. I don't use GPT-3.5-Turbo for similar reasons. I fired it.
I always see this, and I always answer the same.
This exists, each next token has a probability assigned to it. High probability means "it knows"; if there are two or more tokens of similar probability, or the probability of the top token is low in general, then you are less confident about that datum.
Of course there's areas where there's more than one possible answer, but both possibilities are very consistent. I feel LLMs (chatgpt) do this fine.
Also can we stop pretending with the generic name for ChatGPT? It's like calling Viagra sildenafil instead of viagra, cut it out, there's the real deal and there's imitations.
> low in general, then you are less confident about that datum
It’s very rarely clear or explicit enough when that’s the case. Which makes sense considering that the LLMs themselves do not know the actual probabilities
Maybe this wasn't clear, but the probabilities are a low-level variable that may not be exposed in the UI, it IS exposed through API as logprobs in the ChatGPT api. And of course if you have binary access like with a Llama LLM you may have even deeper access to this p variable
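A rough sketch of what that looks like through the OpenAI Python SDK, assuming the chat completions logprobs option; the model name and the 0.6 "low confidence" threshold are arbitrary placeholders, and the exact response fields may change over time.

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What year was the transistor invented?"}],
        logprobs=True,
        top_logprobs=3,
    )

    # Each generated token comes back with its log-probability; low values are
    # a crude per-token confidence signal.
    for tok in resp.choices[0].logprobs.content:
        p = math.exp(tok.logprob)
        flag = "  <-- low confidence" if p < 0.6 else ""
        print(f"{tok.token!r}: {p:.2f}{flag}")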
> it IS exposed through API as logprobs in the ChatGPT api
Sure but they often are not necessarily easily interpretable or reliable.
You can use it to compare a model’s confidence of several different answers to the same question but anything else gets complicated and not necessarily that useful.
>can we stop pretending with the generic name for ChatGPT?
What? I use several LLM's, including ChatGPT, every day. It's not like they have it all cornered..
This is very subjective, but I feel they are all imitators of ChatGPT. I also contend that the ChatGPT API (and UI) will or has become a de facto standard in the same manner that Intel's 8086 instruction set evolved into x86
How many companies train on data that contains 'I don't know' responses? Have you ever talked with a toddler / young child? You need to explicitly teach children not to bullshit. At least I needed to teach mine.
Never mind toddlers, have you ever hired people? A far smaller proportion of professional adults will say “I don’t know” than a lot of people here seem to believe.
I never thought about this but I have experienced this with my children.
> train on data that contains 'i don't know' responses
The "dunno" must not be hardcoded in the data, it must be an output of judgement.
Judgement is what we call a system trained on good data.
No I call judgement a logical process of assessment.
You have an amount of material that speaks of the endeavours in some sport of some "Michael Jordan", the logic in the system decides that if a "Michael Jordan" in context can be construed to be "that" "Michael Jordan" then there will be sound probabilities he is a sportsman; you have very little material about a "John R. Brickabracker", the logic in the system decides that the material is insufficient to take a good guess.
AI is not a toddler. It's not human. It fails in ways that are not well understood and sometimes in an unpredictable manner.
Actually it fails exactly how I would expect something trained purely on knowledge and not on morals to fail.
Then I expect your personal fortunes are tied up in hyping the "generative AI are just like people!" meme. Your comment is wholly detached from the reality of using LLMs. I do not expect we'll be able to meet eye-to-eye on the topic.
> How's that coming along?
It isn't. LLMs are autocomplete with a huge context. It doesn't know anything.
I’d love a confidence percentage accompanying every answer.
So we just need to solve the halting problem in NLP?
would you rather the LLM make up something that sounds right when it doesn't know, or would you like it to claim "i don't know" for tasks it actually can figure out? because presumably both happen at some rate, and if it hallucinates an answer i can at least check what that answer is or accept it with a grain of salt.
nobody freaks out when humans make mistakes, but we assume our nascent AIs, being machines, should always function correctly all the time
> would you rather the LLM make up something that sounds right when it doesn't know, or would you like it to claim "i don't know" for tasks it actually can figure out?
The latter option every single time
> but we assume our nascent AIs, being machines, should always function correctly all the time
A tool that does not function is a defective tool. When I issue a command, it better does it correctly or it will be replaced.
And that's part of the problem - you're thinking of it like a hammer when it's not a hammer. It's asking someone at a bar a question. You'll often get an answer - but even if they respond confidently that doesn't make it correct. The problem is people assuming things are fact because "someone at a bar told them." That's not much better than, "it must be true I saw it on TV".
It's a different type of tool - a person has to treat it that way.
Asking a question is very contextual. I don't ask a lawyer about house engineering problems, nor my doctor how to bake a cake. That means if I'm asking someone at a bar, I'm already prepared to deal with the fact that the person is maybe drunk and probably won't know... And more often than not, I won't even ask the question unless in dire need, because it's the most inefficient way to get an informed answer.
I wouldn't bat an eye if people were taking code suggestions, then reviewing and editing them to make them correct. But from what I see, it's pretty much a direct push to production if they got it to compile, which is different from correct.
It would be nice to have some kind of "confidence level" annotation.
The superficial view: „they hallucinate“
The underlying cause: 3rd order ignorance:
3rd Order Ignorance (3OI)—Lack of Process. I have 3OI when I don't know a suitably efficient way to find out I don't know that I don't know something. This is lack of process, and it presents me with a major problem: If I have 3OI, I don't know of a way to find out there are things I don't know that I don't know.
—- not from an llm
My process: use llms and see what I can do with them while taking their Output with a grain of salt.
But the issue of the structural fault remains. Stating the phenomenon (hallucination) is not "superficial", as naming the root cause does not add value in this context.
Symptom: "Response was, 'Use the `solvetheproblem` command'". // Cause: "It has no method to know that there is no `solvetheproblem` command". // Alarm: "It is suggested that it is trying to guess a plausible world through lacking wisdom and data". // Fault: "It should have a database of what seems to be states of facts, and it should have built the ability to predict the world more faithfully to facts".
But this was true before LLMs. People would and still do take any old thing from an internet search and treat it as true. There is a known, difficult-to-remedy failure to properly adjudicate information and source quality, and you can find it discussed in research prior to the computer age. It is a user problem more than a system problem.
In my experience, with the way I interact with LLMs, they are more likely to give me useful output than not, and this is borne out by mainstream non-edge-case academic peer-reviewed work. Useful does not necessarily equal 100% correct, just as a Google search does not. I judge and vet all information, whether from an LLM, search, book, paper, or wherever.
We can build a straw person who "always" takes LLM output as true and uses it as-is, but those are the same people who use most information tools poorly, be they internet search, dictionaries, or even looking in their own files for their own work or sent mail (I say this as an IT professional who has seen the worker types from the pre-internet days through now). In any case, we use automobiles despite others misusing them. But only the foolish among us completely take our hands off the wheel for any supposed "self-driving" features. While we must prevent and decry the misuse by fools, we cannot let their ignorance hold us back. Let's let their ignorance help make tools, as they help identify more undesirable scenarios.
My company just broadly adopted AI. It’s not a tech company and usually late to the game when it comes to tech adoption.
I’m counting down the days when some AI hallucination makes its way all the way to the C-suite. People will get way too comfortable with AI and don’t understand just how wrong it can be.
Some assumption will come from AI, no one will check it and it’ll become a basic business input. Then suddenly one day someone smart will say “thats not true” and someone will trace it back to AI. I know it.
I assume at that point in time there will be some general directive on using AI and not assuming it’s correct. And then AI will slowly go out of favor.
People fabricate a lot too. Yesterday I spent far less time fixing issues in the far more complex and larger changes Claude Code managed to churn out than what the junior developer I worked with needed. Sometimes it's the reverse. But with my time factored in, working with Claude Code is generally more productive for me than working with a junior. The only reason I still work with a junior dev is as an investment into teaching him.
Claude is cheaper, faster, produces better code.
You are mixing a point and the issue, largely heterogeneous: Claude being a champion in producing good code vs LLMs in general being delirious.
If your junior developer is just "junior", that is one matter; if your junior developer hallucinates documentation details, that's different.
Every developer I've ever worked with has gotten things wrong. Whether you call that hallucinating or not is irrelevant. What matters is the effort it takes to fix.
On the logically practical point I agree with you (what counts in the end in the specific process you mention is the gain vs loss game), but my point was that if your assistant is structurally delirious you will have to expect a big chunk of the "loss" as structural.
--
Edit: new information may contribute to even this exchange, see https://www.anthropic.com/research/tracing-thoughts-language...
> It turns out that, in Claude, refusal to answer is the default behavior
I.e., boxes that incline to different approaches to heuristic will behave differently and offer different value (to be further assessed within a framework of complexity, e.g. "be creative but strict" etc.)
And my direct experience is that I often spend less time directing, reviewing and fixing code written by Claude Code at this point than I do for a junior irrespective of that loss. If anything, Claude Code "knows" my code bases better. The rest, then, to me at least is moot.
Claude is substantially cheaper for me, per reviewed, fixed change committed. More importantly to me, it demands less of my limited time per reviewed, fixed change committed.
Having a junior dev working with me at this point wouldn't be worth it to me if it wasn't for the training aspect: We still need pipelines of people who will learn to use the AI models, and who will learn to do the things it can't do well.
> irrespective of that loss
Just to be clear, since that expression may reveal a misunderstanding, I meant the sophisticated version of
But my point was: it's good that Claude has become a rightful legend in the realm of coding, but before and regardless, a candidate that told you "that class will have a .SolveAnyProblem() method: I want to believe" presents an handicap. As you said no assistant revealed to be perfect, but assistants who attempt mixing coding sessions and creative fiction writing raise alarms.Have you talked to a non-artificial intelligence lately? I’ve got some news for you…
This is the problem of the internet writ large.
The solution is to be selective and careful like always
> My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.
The same is true about the internet, and people even used to use these arguments to try to dissuade people from getting their information online (back when Wikipedia was considered a running joke, and journalists mocked blogs). But today it would be considered silly to dissuade someone from using the internet just because the information there is extremely unreliable.
Many programmers will say Stack Overflow is invaluable, but it's also unreliable. The answer is to use it as a tool and a jumping off point to help you solve your problem, not to assume that its authoritative.
The strange thing to me these days is the number of people who will talk about the problems with misinformation coming from LLMs, but then who seem to uncritically believe all sorts of other misinformation they encounter online, in the media, or through friends.
Yes, you need to verify the information you're getting, and this applies to far more than just LLMs.
Shades of grey fallacy. You have way more context clues about the information on the internet than you do with an LLM. In fact, with an LLM you have zero(-ish?).
I can peruse your previous posts to see how truthful you are, I can tell if your post has been down/upvoted, I can read responses to your post to see if you've been called out on anything, etc.
This applies tenfold in real life where over time you get to build comprehensive mental models of other people.
I have decided it must be attached to a sort of superiority complex. These types of people believe they are capable of deciphering fact from fiction but the general population isn’t so LLMs scare them because someone might hear something wrong and believe it. It almost seems delusional. You have to be incredibly self aggrandizing in your mind to think this way. If LLMs were actually causing “a problem” then there would be countless examples of humans making critical mistakes because of bad LLM responses, and that is decidedly not happening. Instead we’re just having fun ghiblifying the last 20 years of the internet.
> that is decidedly not happening
Regardless of anything else it’s extremely too early to make such claims. We have to wait until people start allowing “AI agents” to make autonomous blackbox decision with minimal supervision since nobody has any clue what’s happening.
Even if we tone down the SciFi dystopia angle, not that many people really use LLMs in non-superficial ways yet. What I’m most afraid of is the next generation growing up without the ability to critically synthesize information on their own.
Most people - the vast majority of people - cannot critically synthesize information on their own.
But the implication of what you are saying is that academic rigour is going to be ditched overnight because of LLMs.
That’s a little bit odd. Has the scientific community ever thrown up its collective hands and said “ok, there are easier ways to do things now, we can take the rest of the decade off, phew what a relief!”
> what you are saying is that academic rigour is going to be ditched overnight
Not across all level and certainly not overnight. But a lot of children entering the pipeline might end up having a very different experience than anyone else before LLMs (unless they are very lucky to be in an environment that provides them better opportunities).
> cannot critically synthesize information on their own.
That's true, but if even fewer people try to do that, or even know where to start, it will get even worse.
No matter how things will evolve, that Ghiblification is something we will look back to in twenty years and say: "Remember how cool that was?"
People need time to adapt to LLMs' capacity to spit out nonsense. It'll take time, but I'm sure they will.
> What's worse, people are treating them as authoritative
Because in people's experience, LLMs are often correct.
You are right that LLMs are not authoritative, but people trust them exactly because they often do produce correct answers.
> I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm
Happened to me as well. I wanted it to quickly write an algorithm for standard deviation over a stream of data, which is a textbook algorithm. It got it almost right, but messed up the final formula, and the code gave wrong answers. Weird, considering a correct implementation of exactly that exists on Wikipedia.
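(For reference, the textbook single-pass method here is Welford's online algorithm. Below is a minimal Python sketch of it; the class and method names are mine, and the choice between sample and population standard deviation is an assumption the comment above doesn't specify. The "final formula" the model reportedly botched corresponds to the last divide-and-sqrt step.)

    import math

    class RunningStd:
        """Welford's online algorithm: one pass over the stream, numerically stable."""
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # sum of squared deviations from the running mean

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)  # note: uses the *updated* mean

        def std(self, sample=True):
            if self.n < 2:
                return 0.0
            return math.sqrt(self.m2 / (self.n - 1 if sample else self.n))

    rs = RunningStd()
    for x in [2, 4, 4, 4, 5, 5, 7, 9]:
        rs.update(x)
    print(rs.std(sample=False))  # 2.0, the population std of this classic example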
It's always perplexing when people talk about LLMs as "it", as if there's only one model out there, and they're all equally accurate.
FWIW, here's 4o writing a selection sort: https://chatgpt.com/share/67e60f66-aacc-800c-9e1d-303982f54d...
I don't understand the point of that share. There are likely thousands of implementations of selection sort on the internet and so being able to recreate one isn't impressive in the slightest.
And all the models are identical in not being able to discern what is real or something it just made up.
I guess the point is that op was pretty adamant that LLMs refuse to write selection sorts?
No? I mean if they refused that would actually be a reasonably good outcome. The real problem is if they generally can write selection sorts but occasionally go haywire due to additional context and start hallucinating.
I mean asking a straightforward question like: https://chatgpt.com/share/67e60f66-aacc-800c-9e1d-303982f54d... is entirely pointless as a test
Because, to be blunt, I think this is total bullshit if you're using a decent model:
"I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over."
I was part of preparing an offer a few weeks ago. The customer prepared a lot of documents for us - maybe 100 pages in total. The boss insisted on using ChatGPT to summarize this stuff and reading only the summary. I did a longer, slower reading and caught some topics ChatGPT outright dropped. Our offer was based on the summary - and fell through because we missed these nuances. But hey, the boss did not read as much as previously...
I saw someone saying 80% of doctors believe that LLM's are trustworthy consultation partners.
Code created by LLMs doesn't compile, hallucinated APIs, invalid syntax and completely broken logic - why would you trust it with someone's life?!
I wonder if the exact phrasing has varied from the source, but even then if "consultation partners" is doing the heavy lifting there. If it was something like "useful consultation partners", I can absolutely see value as an extra opinion that is easy to override. "Oh yeah, I hadn't thought about that option - I'll look into it further."
I imagine we're talking about it as an extra resource rather than trusting it as final in a life or death decision.
> I imagine we're talking about it as an extra resource rather than trusting it as final in a life or death decision.
I'd like to think so. Trust is also one of those non-concrete terms that have different meanings to different people. I'd like to think that doctors use their own judgement to include the output from their trained models, I just wonder how long it is till they become the default judgement when humans get lazy.
I think that's a fair assessment on trust as a term, and incorporating via personal judgement. If this was any public story, I'd also factor in breathless reporting about new tech.
Black-box decisions I absolutely have a problem with. But an extra resource considered by people with an understanding of risks is fine by me. Like I've said in other comments, I understand what it is and isn't good at, and have a great time using ChatGPT for feedback or planning or extrapolating or brainstorming. I automatically filter out the "Good point! This is a fantastic idea..." response it inevitably starts with...
I'll see if I can dig it up; it was from a real-life meeting whose printed notes I tossed a while back in disgust.
Because LLM’s, with like 20% hallucination rate, are more reliable than overworked, tired doctors that can spend only one ounce of their brainpower on the patient they’re currently helping?
Yeah, I'm gonna need really strong evidence for that claim before I entrust my life to an AI.
Apologies, but have you noticed that if your entrusted (the "doctor") trusted the unentrustable (the "LLM"), then your entrusted is not trustworthy?
yes, I have noticed.. and I am concerned.
"Quis custodiet ipsos custodes". The old problem.
In fact, the phenomenon of pseudo-intelligence scares those who were hoping for tools that would limit the original problem, as opposed to potentially amplifying it.
>I saw someone saying 80% of doctors believe that LLM's are trustworthy consultation partners.
See, now that is something I don't know why I should trust: a random person on the internet citing a statistic that they saw someone else mention.
The claim seems plausible because it doesn't say there was any formal evaluation, just that some doctors (who may or may not understand how LLMs work) hold an opinion.
I wish I could cite the actual study, but my feeble mind only remembers the anger I felt at the statistic.
Unlike the LLM, I'm willing to be truthful about my memory.
Luckily, being programmers, we can fix things like syntax errors.
> I saw someone saying
The irony...
> What's worse, people are treating them as authoritative.
So what? People are wrong all the time. What happens when people are wrong? Things go wrong. What happens then? People learn that the way they got their information wasn't robust enough and they'll adapt to be more careful in the future.
This is the way it has always worked. But people are "worried" about LLMs... Because they're new. Don't worry, it's just another tool in the box, people are perfectly capable of being wrong without LLMs.
Being wrong when you are building a grocery management app is one thing, being wrong when building a bridge is another.
For those sensitive use cases, it is imperative we create regulation, like every other technology that came before it, to minimize the inherent risks.
In an unrelated example, I saw someone saying recently they don't like a new version of an LLM because it no longer has "cool" conversations with them, so take that as you will from a psychological perspective.
I have a hard time taking that kind of worry seriously. In ten years, how many bridges will have collapsed because of LLMs? How many people will have died? Meanwhile, how many will have died from fentanyl or cars or air pollution or smoking? Why do people care so much about the hypothetical bad effects of new technology and so little about the things we already know are harmful?
A tool is good but lots of people are stupid and misuse it… That’s just life buddy.
Humans bullshit and hallucinate and claim authority without citation or knowledge. They will believe all manner of things. They frequently misunderstand.
The LLM doesn’t need to be perfect. Just needs to beat a typical human.
LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.
LLM's can't be held accountable.
And many, many companies are proposing and implementing uses for LLM's to intentionally obscure that accountability.
If a person makes up something, innocently or maliciously, and someone believes it and ends up getting harmed, that person can have some liability for the harm.
If an LLM hallucinates something that someone believes, and they end up getting harmed, there's no accountability. And it seems that AI companies are pushing for laws and regulations that further protect them from this liability.
These models can be useful tools, but the targets these AI companies are shooting for are going to be actively harmful in an economy that insists you do something productive for the continued right to exist.
This is correct. On top of that, the failure modes of AI systems are unpredictable and incomprehensible. Present-day AI systems can fail on, or be fooled by, inputs in surprising ways that no human would.
Accountability exists for two reasons:
1. To make those harmed whole. On this, you have a good point. The desire of AI firms or those using AI to be indemnified from the harms their use of AI causes is a problem as they will harm people. But it isn't relevant to the question of whether LLMs are useful or whether they beat a human.
2. To incentivize the human to behave properly. This is moot with LLMs. There is no laziness or competing incentive for them.
> This is moot
That’s not a positive at all, the complete opposite. It’s not about laziness but being able to somewhat accurately estimate and balance risk/benefit ratio.
The fact that making a wrong decision would have significant costs for you and other people should have a significant influence on decision making.
> This is moot with LLMs. There is no laziness or competing incentive for them.
The incentives for the LLM are dictated by the company, at the moment it only seems to be 'whatever ensures we continue to get sales'.
[flagged]
That reads as "people shouldn't trust what AI tells them", which is in opposition to what companies want to use AI for.
An airline tried to blame its chatbot for inaccurate advice it gave (whether a discount could be claimed after a flight). Tribunal said no, its chatbot was not a separate legal entity.
https://www.bbc.com/travel/article/20240222-air-canada-chatb...
Yeah. Where I live, we are always reminded that our conversations with insurance provider personnel over phone are recorded and can be referenced while making a claim.
Imagine a chatbot making false promises to prospective customers. Your claim gets denied, you fight it out only to learn their ToS absolves them of "AI hallucinations".
> LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.
On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.
My problem isn't that humans are doing similar things to LLMs; my problem is that humans can understand the consequences of bullshitting at the wrong time. LLMs, on the other hand, operate purely on bullshitting. Sometimes they are right, sometimes they are wrong. But what they'll never tell you is "how confident am I that this answer is right". They leave the hard work of calling out the bullshit to the human.
There's a level of social trust that exists which LLMs don't follow. I can trust when my doctor says "you have a cold" that I probably have a cold. They've seen it a million times before and they are pretty good at diagnosing that problem. I can also know that doctor is probably bullshitting me if they start giving me advice for my legal problems, because it's unlikely you are going to find a doctor/lawyer.
> Just needs to beat a typical human.
My issue is we can't even measure accurately how good humans are at their jobs. You now want to trust that the metrics and benchmarks used to judge LLMs are actually good measures? So many LLM advocates pretend you can objectively measure goodness in subjective fields just by writing some unit tests. It's literally the "Oh look, I have an Oracle Java certificate" or "AWS Solutions Architect" method of determining competence.
And so many of these tests aren't being written by experts. Perhaps the coding tests, but the legal tests? Medical tests?
The problem is LLM companies are bullshitting society on how competently they can measure LLM competence.
> On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.
Some humans can, certainly. Humans as a race? Maybe, ish.
Well there are still millions that can. There is a handful of competitive LLMs and their output given the same inputs are near identical in relative terms (compared to humans).
"On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something."
You can do the same with an LLM; I gaslight ChatGPT all the time so it doesn't hallucinate.
Your second point directly contradicts your first point.
In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims against lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.
As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.
No external help, just conversations with ChatGPT and some Googling.
Obviously LLMs have issues, but if we're now in the "Beginners can program their own custom apps" phase of the cycle, the potential is huge.
> As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.
This is actually an interesting one - I’ve seen a case where some copy/pasted PDF saving code caused hundreds of thousands of subtly corrupted PDFs (invoices, reports, etc.) over the span of years. It was a mistake that would be very easy for an LLM to make, but I sure wouldn’t want to rely on chatgpt to fix all of those PDFs and the production code relying on them.
Well humans are not a monolithic hive mind that all behave exactly the same as an “average” lawyer, doctor etc. that provides very obvious and very significant advantages.
> days to go from a cold start with zero dev experience
How is that relevant?
>> In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims against lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.
This paragraph makes little sense. A negligence claim is based on a deviation from some reasonable standard, which is essentially a proxy for the level of care/service that most practitioners would apply in a given situation. If doctors were as regularly incompetent as you are trying to argue then the standard for negligence would be lower because the overall standard in the industry would reflect such incompetence. So the existence of negligence claims actually tells us little about how good a doctor is individually or how good doctors are as a group, just that there is a standard that their performance can be measured against.
I think most people would agree with you that medical negligence claims are a huge problem, but I think that most of those people would say the problem is that so many of these claims are frivolous rather than meritorious, resulting in doctors paying more for malpractice insurance than necessary and also resulting in doctors asking for unnecessarily burdensome additional testing with little diagnostic value so that they don’t get sued.
I won’t defend lawyers. They’re generally scum.
It's fine if it isn't perfect if whoever is spitting out answers assumes liability when the robot is wrong. But what people want is for the robot to answer questions and for there to be no liability, when it is well known that the robot can be wildly inaccurate sometimes. They want the illusion of value without the liability of the known deficiencies.
If LLM output is like a magic 8 ball you shake, that is not very valuable unless it is workload management for a human who will validate the fitness of the output.
I never ask a typical human for help with my work, why should that be my benchmark for using an information tool? Afaik, most people do not write about what they don't know, and if one made a habit of it, they would be found and filtered out of authoritative sources of information.
ok, but people are building determinative software _on top of them_. It's like saying "it's ok, people make mistakes, but lets build infrastructure on some brain in a vat". It's just inherently not at the point that you can make it the foundation of anything but a pet that helps you slop out code, or whatever visual or textual project you have.
It's one of those "the quantity is so fascinating, let's ignore how we got here in the first place" situations.
You’re moving the goalposts. LLMs are masquerading as superb reference tools and as sources of expertise on all things, not as mere “typical humans.” If they were presented accurately as being about as fallible as a typical human, typical humans (users) wouldn’t be nearly as trusting or excited about using them, and they wouldn’t seem nearly as futuristic.
> I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out.
If you're lucky it figures it out. If you aren't, it makes stuff up in a way that seems almost purposefully calculated to fool you into assuming that it's figured everything out. That's the real problem with LLM's: they fundamentally cannot be trusted because they're just a glorified autocomplete; they don't come with any inbuilt sense of when they might be getting things wrong.
I see this complaint a lot, and frankly, it just doesn't matter.
What matters is speeding up how fast I can find information. Not only will LLMs sometimes answer my obscure questions perfectly themselves, but they also help to point me to the jargon I need to use to find that information online. In many areas this has been hugely valuable to me.
Sometimes you do just have to cut your losses. I've given up on asking LLMs for help with Zig, for example. It is just too obscure a language I guess, because the hallucination rate is too high to be useful. But for webdev, Python, matplotlib, or bash help? It is invaluable to me, even though it makes mistakes every now and then.
[dead]
We're talking about getting work done here, not some purity dance about how you find your information the "right way" by looking in books in libraries or something. Or wait, do you use the internet? How very impure of you. You should know, people post misinformation on there!
Humans also get things wrong a surprisingly large percentage of the time.
Yeah but if your accountant bullshits when doing your taxes, you can sue them.
> Yeah but if your accountant bullshits when doing your taxes, you can sue them.
What is the point of limiting delegation to such an extreme dichotomy, as opposed to getting more things done?
The vast majority of useful things we delegate, or do for others ourselves, are not as well specified, or as legally liable for any imperfections, as an accountant doing accounting.
Because we’ve already automated everything else. What’s left is work that’s done by specialists.
The entire pitch behind AI is that it can automate these jobs. If you can’t trust it, then AI is useless. Excluding art theft obviously.
Good luck with that. People get sent to prison for decades for other people's bullshit all the time!
I will take an LLM over a real person anytime! At least it does not get b*hurt when I double-check!
> they don't come with any inbuilt sense of when they might be getting things wrong
Spend some time with current reasoning models. Your experience is obsolete if you still hold this belief.
Can you be more specific than "current reasoning models"? Maybe I missed it, but I have not yet seen any that would not hallucinate wildly.
Let's try it this way: give me one or two prompts that you personally have had trouble with, in terms of hallucinated output and lack of awareness of potential errors or ambiguity. I have paid accounts on all the major models except Grok, and I often find it interesting to probe the boundaries where good responses give way to bad ones, and to see how they get better (or worse) between generations.
Sounds like your experiences, along with zozbot234's, are different enough from mine that they are worth repeating and understanding. I'll report back with the results I see on the current models.
I am so confused too. I hold these beliefs at the same time, and I don't feel they contradict each other, but apparently for many people some of these do:
- LLMs are a miraculous technology that are capable of tasks far beyond what we believed would be achievable with AI/ML in the near future. Playing with them makes me constantly feel like "this is like sci-fi, this shouldn't be possible with 2025's technology".
- LLMs are fairly clueless at many tasks that are easy enough for humans, and they are nowhere near AGI. It's also unclear whether they scale up towards that goal. They are also worse programmers than people make them out to be. (At least I'm not happy with their results.)
- Achieving AGI doesn't seem impossibly unlikely any more, and doing so is likely to be an existentially disastrous event for humanity, and the worst fodder for my nightmares. (Also in the sense of an existential doomsday scenario, but even just the thought of becoming... irrelevant is depressing.)
Having one of these beliefs makes me the "AI hyper" stereotype, another makes me the "AI naysayer" stereotype and yet another makes me the "AI doomer" stereotype. So I guess I'm all of those!
I guess Sabine's beef with LLMs is that they are hyped by the business people as a legit "human-level assistant" kind of thing, which they clearly aren't yet. Maybe I've just managed to... manage my expectations?
That's on her then for fully believing what marketing and business execs are 'telling her' about LLMs. Does she get upset when she buys a coke around Christmas and her life doesn't become all warm and fuzzy with friendliness and cheer all around?
Seems like she was given a drill with a flathead bit, and she just complains for months on end that it often fails (she didn't charge the drill) or gives her useless results (she uses it on Phillips heads). How about figuring out what works and what doesn't, and adjusting your use of the tool accordingly? If she is a painter, don't blame the drill for messing up her painting.
I kinda agree. But she seems smart and knowledgeable. It's kinda disappointing, like... She should know better. I guess it's the Gell-Mann amnesia effect once again.
> but even just the tought of becoming... irrelevant is depressing
In my opinion, there can exist no AI, person, tool, ultra-sentient omniscient being, etc. that would ever render you irrelevant. Your existence, experiences, and perception of reality are all literally irreplaceable, and (again, just my opinion) inherently meaningful. I don't think anyone's value comes from their ability to perform any particular feat to any particular degree of skill. I only say this because I had similar feelings of anxiety when considering the idea of becoming "irrelevant", and I've seen many others say similar things, but I think that fear is largely a product of misunderstanding what makes our lives meaningful.
Thank you, you really are a sweetheart. And correct. But it's not easy to combat the anxiety.
Back when handwriting recognition was a new thing, I was greatly impressed by how good it was. This was primarily because, being an engineer, I knew how difficult the problem is to solve. 90% recognition seemed really good to me.
When I tried to use the technology, that 90% meant 1 out of every 10 things I wrote was incorrect. If it had been a keyboard I would have thrown it in the trash. That is where my Palm ended up.
People expect their technology to do things better not almost as well as a human. Waymo with LIDAR hasn't killed people. Tesla, with camera only, has done so multiple times. I will ride in a Waymo never in a Tesla self driving car.
Anyone who doesn't understand this either isn't required to use the utility it provides or has no idea how to prompt it correctly. My wife is a bookkeeper. There are some tasks that are a pain in the ass without writing some custom code. In her case, we just saved her about 2 hours by asking Claude to do it. It wrote the code, applied the code to a CSV we uploaded and gave us exactly what we needed in 2 minutes.
>Anyone who doesn't understand this either isn't required to use to utility it provides or has no idea how to prompt it correctly.
Almost every counter-criticism of LLMs boils down to:
1. you're holding it wrong
2. Well, I use it $DAYJOB and it works great for me! (And $DAYJOB is software engineering).
I'm glad your wife was able to save 2 hours of work, but forgive me if that doesn't translate to the trillion dollar valuation OpenAI is claiming. It's strange you don't see the inherent irony in your post. Instead of your wife just directly uploading the dataset and a prompt, she first has to prompt it to write code. There are clear limitations and it looks like LLMs are stuck at some sort of wall.
> 1. you're holding it wrong
When computers and the internet first came about, there were (and still are!) people who would struggle with basic tasks. Without knowing the specific task you are trying to do, it's hard to judge whether it's a problem with the model or with you.
I would also say that prompting isn't as simple as made out to be. It is a skill in itself and requires you to be a good communicator. In fact, I would say there is a reasonable chance that even if we end up with AGI level models, a good chunk of people will not be able to use it effectively because they can't communicate requirements clearly.
So it's a natural-language interface, except it can only be useful if we stick to a subset of natural language. Then we're stuck trying to reverse engineer an undocumented, non-deterministic API. One that will keep changing under whatever you build that uses it. That is a pretty horrid value proposition.
Short of it being able to mind read, you need to communicate with it in some way. No different from the real world where you'll have a harder time getting things done if you don't know how to effectively communicate. I imagine for a lot of popular use-cases, we'll build a simpler UX for people to click and tap before it gets sent to a model.
I'd rather run commands and write code than try to reverse engineer a non-deterministic, undocumented, ever-changing API.
Boiling down to a couple cases would be more useful if you actually tried to disprove those cases or explain why they're not good enough.
> It's strange you don't see the inherent irony in your post. Instead of your wife just directly uploading the dataset and a prompt, she first has to prompt it to write code. There are clear limitations and it looks like LLMs are stuck at some sort of wall.
What's ironic about that? That's such a tiny imperfection. If that's anything near the biggest flaw then things look amazing. (Not that I think it is, but I'm not here to talk about my opinion, I'm here to talk about your irony claim.)
>Boiling down to a couple cases would be more useful if you actually tried to disprove those cases or explain why they're not good enough.
This reply is 4 comments deep into such cases, and the OP is about a well educated person who describes their difficulties.
>What's ironic about that? That's such a tiny imperfection.
I'd argue it's not tiny - it highlights the limitations of LLMs. LLMs excel at writing basic code but seem to struggle, or are untrustworthy, outside of those tasks.
Imagine generalizing his case: his wife goes to work and tells other bookkeepers "ChatGPClaudeSeek is amazing, it saved 2 hours for me". A coworker, married to a lawyer, instead of a software engineer, hearing this tries it for himself, and comes up short. Returning to work the next day and talking about his experience is told - "oh you weren't holding it right, ChatGPClaudeSeek can't do the work for you, you have to ask it to write code, that you must then run". Turns out he needs an expert to hold it properly and from the coworker's point of view he would probably need to hire an expert to help automate the task, which will likely only be marginally less expensive than it was 5 years ago.
From where I stand, things don't look amazing; at least as amazing as the fundraisers have claimed. I agree that LLMs are awesome tools - but I'm evaluating from a point of a potential future where OpenAI is worth a trillion dollars and is replacing every job. You call it a tiny imperfection, but that comes across as myopic to me - large swaths of industries can't effectively use LLMs! How is that tiny?
> Turns out he needs an expert to hold it properly and from the coworker's point of view he would probably need to hire an expert to help automate the task, which will likely only be marginally less expensive than it was 5 years ago.
The LLM wrote the code, then used the code itself, without needing a coder around. So the only negative was needing to ask it specifically to use code, right? In that case, with code being the thing it's good at, "tell the LLM to make and use code" is going to be in the basic tutorials. It doesn't need an expert. It really is about "holding it right" in a non-mocking way, the kind of instructions you expect to go through for using a new tool.
If you can go through a one hour or less training course while only half paying attention, and immediately save two hours on your first use, that's a great return on the time investment.
Honest question, how do you validate it?
I'm in the same boat and I think it boils down to this: some people are actually quite passive, while others are more active in their use of technology.
It'd take more time for me to flesh this out than I want to give, but the basic idea is that I am not just sitting there "expecting things". I've been puzzled too at why so many people don't seem to get it or are so frustrated like this lady, and in my observation this is their common element. It just looks very passive to me, the way they seem to use the machines and expect a result to be "given" to them.
PS. It reminds me very strongly of how our parents' generation uses computers. Like the whole way of thinking is different; I cannot even understand why they would act certain ways or be afraid of acting in other ways. It's like they use a different compass or have a very different (and wrong) model in their head of how this thing in front of them works.
It's more like duct-taping a VR headset to your head, calibrating your environment to a bunch of cardboard boxes and walls, and calling it a holodeck. It actually kinda works until you push at it too hard.
It reminds me a lot of when I first started playing No Man's Sky (the video game). Billions of galaxies! Exotic, one of a kind life forms on every planet! Endless possibilities! I poured hundreds of hours into the game! But, despite all the variety and possibilities, the patterns emerge, and every 'new' planet just feels like a first-person fractal viewer. Pretty, sometimes kinda nifty, but eventually very boring and repetitive. The illusion wore off, and I couldn't really enjoy it anymore.
I have played with a LOT of models over the years. They can be neat, interesting, and kinda cool at times, but the patterns of output and mistakes shatters the illusion that I'm talking to anything but a rather expensive auto-complete.
It's definitely a tech that's here to stay, unlike blockchain/NFTs.
But I mirror the confusion about why people are still bullish on it. The current valuation exists because the market thinks it can write code like a senior engineer and is on the path to AGI, because that's how it's marketed by the LLM providers.
I'm not even certain if they'll be ubiquitous after the venture capital investments are gone and the service needs to actually be priced without losing money, because they're (at least currently) mostly pretty expensive to run.
There seems to be a widely held misconception that company valuations have any basis in the underlying fundamentals of what the companies do. This is not and has not been the case for several years. The US stock market’s darlings are Kardashians, they are valuable for being valuable the way the Kardashians are famous for being famous.
In markets, perception is reality, and the perception is that these companies are innovative. That’s it.
You seem to be under the misconception that this is a new phenomenon.
“In the short run, the market is a voting machine but in the long run, it is a weighing machine.”
- Benjamin Graham, 1949
Until there is a correction, which inevitably does happen.
NFTs are the perfect example.
NFT is still a great tool if you want a bunch of unique tokens as part of a blockchain app. ERC-721 was proven a capable protocol in a variety of projects. What it isn't, and never will be, is an amazing investment opportunity, or a method to collect cool rare apes and go to yacht parties.
LLMs will settle in and have their place too, just not in the forefront of every investors mind.
I am more than happy to pay for access to LLMs, and models continue to get smaller and cheaper. I would be very surprised if they are not far more widely used in 5 or 10 years time than they are today.
None of that means that the current companies will be profitable or that their valuations are anywhere close to justified though. The future could easily be "Open-weight models are moderately useful for some niches, no-name cloud providers charge slightly higher than the cost of electricity to use them at low profit margins".
Dot-com boom/bubble all over again. A whole bunch of the current leaders will go bust. A new generation of companies will take over, actually focused on specific customer problems and growing out of profitable niches.
The technology is useful, for some people, in some situations. It will get more useful for more people in more situations as it improves.
Current valuations are too high (Gartner hype cycle), after they collapse valuations will be too low (again, hype cycle), then it'll settle down and the real work happens.
The existing tech giants will just hoover up all the niche LLM shops once the valuations deflate somewhat.
There's almost a negligible chance any one of these shops stays truly independent, unless propped up by a state-level actor (China/EU)
You might have some consulting/service companies that will promise to tailor big models to your specific needs, but they will be valued accordingly (nowhere near billions).
"The technology is useful, for some people, in some situations"
The endgame of these AI companies is to create intelligence equal to a human's. Imagine not paying a 23k workforce - that's a lot of money to be made.
That's been the 'endgame' of technology improvements since the industrial revolution - there are many industries that mechanized, replaced nearly their entire human workforce, and were never terribly profitable. Consider farming - in developed countries, they really did replace like 98% of the workforce with machines. For every farm that did so, so did all of their competitors, and the increased productivity caused the price of their crops to fall. Cheap food for everyone, but no windfall for farmers.
If machines can easily replace all of your workers, that means other people's machines can also replace your workers.
I think it will go in the opposite direction. Very massive closed-weight models that are truly miraculous and magical. But that would be sad because of all the prompt pre-processing that will prevent you from doing much of what you'd really want to do with such an intelligent machine.
I expect it to eventually be a duopoly like android and iOS. At world scale, it might divide us in a way that politics and nationalities never did. Humans will fall into one of two AI tribes.
Except that we've seen that bigger models don't really scale in accuracy/intelligence well, just look at GPT4.5. Intelligence scales logarithmically with parameter count, the extra parameters are mostly good for baking in more knowledge so you don't need to RAG everything.
Additionally, you can use reasoning model thinking with non-reasoning models to improve output, so I wouldn't be surprised if the common pattern was routing hard queries to reasoning models to solve at a high level, then routing the solution plan to a smaller on device model for faster inference.
Exactly. If some company ever does come up with an AI that is truly miraculous and magical the very last thing they'll do is let people like you and me play with it at any price. At best, we'd get some locked down and crippled interface to heavily monitored pre-approved/censored output. My guess is that the miracle isn't going to happen.
If I'm wrong though and some digital alchemy finally manages to turn our facebook comments into a super-intelligence we'll only have a few years of an increasingly hellish dystopia before the machines do the smart thing and humanity gets what we deserve.
By the time the capital runs out, I suspect we'll be able to get open models at the level of current frontier and companies will buy a server ready to run it for internal use and reasonable pricing. It will be useful but a complete commodity.
I know folks now who are selling, basically, RAG on llamas, "in a box". Seems (to me) like a bunch of mid-levels at SMEs are ready to burn budget on hype. Gotta get something deployed during the hype cycle for the quarterly bonus.
I think we can already get open-weight frontier class models today. I've run Deepseek R1 at home, and it's every bit as good as any of the ChatGPT models I can use at work.
Which companies? Google and Microsoft are only up a little over the past several years, and I doubt much of their valuation is coming from LLM hype. Most of the discussions about x.com say it's worth substantially less than some years ago.
I feel like a lot of people mean that OpenAI is burning through venture capital money. It's debatable, but it's a huge jump to go from that to thinking it's going to crash the stock market (OpenAI isn't even publicly traded).
The "Magnificent Seven" stocks (Apple, Amazon, Alphabet, Meta, Microsoft, Nvidia, and Tesla) were collectively up >60% last year and are now 30% of the entire S&P500. They are all heavily invested in AI products.
I just checked the first two, Apple and Amazon, and they're trading 28% and 23% higher than they were 3 years ago. Annualized returns from the SP 500 have been a little over 10%. Some of that comes from dividends, but Apple and Amazon give out extremely little in the way of dividends.
I'm not going to check all of the companies, but at least looking at the first two, I'm not really seeing anything out of the ordinary.
Currently, Nvidia enjoys a ton of the value capture from the LLM hype. But that's a weird state of affairs and once LLM deployments are less dependent on Nvidia hardware, the value capture will likely move to software companies. Or the LLM hype will reduce to the point that there isn't a ton of value to capture here anymore. This tech may just get commoditized.
Nvidia is trading below its historical P/E from pre-AI times at this point. This is just on confirmed revenue, and its profitability keeps increasing. Nvidia is undervalued right now.
Sure, as long as it keep selling $130B worth of GPUs each year. Which is entirely predicated on the capital investment in Machine Learning attracting revenue streams that are still imaginary at this point.
> None of that means that the current companies will be profitable ... The future could easily be "Open-weight models are moderately useful for some niches, no-name cloud providers charge slightly higher than the cost of electricity to use them at low profit margins".
They just need to stay a bit ahead of the open source releases, which is basically the status quo. The leading AI firms have a lot of accumulated know-how wrt. building new models and training them, that the average "no-name cloud" vendor doesn't.
> They just need to stay a bit ahead of the open source releases, which is basically the status quo
No, OpenAI alone additionally need approximately $5B of additional cash each and every year.
I think Claude is useful. But if they charged enough money to be cashflow positive, it's not obvious enough people would think so. Let alone enough money to generate returns to their investors.
The big boys can also get away with stealing all the copyrighted material ever created by human beings.
How far back do you think copyright should extend? Is it perpetual, forever?
How short should it be? Two years? Two months?
I wasn’t the one outraged at the “theft” of ancient works.
Congress keeps extending it and I do not approve of that. I think 50 years is a reasonable length of time.
It makes engineers a hell of a lot more efficient and opens up software to a whole new class of people. There is plenty of data to back this up.
Based on that alone it’s worth quite a lot.
I don't doubt the first part, but how true is the second?
Is there a shortage of React apps out there that companies are desperate for?
I'm not having a go at you--this is a genuine inquiry.
How many average people are feeling like they're missing some software that they're able to prompt into existence?
I think if anything, the last few years have revealed the opposite, that there's a large/huge surplus of people in the greater software business at large that don't meet the demand when money isn't cheap.
I think anyone in the "average" range of skill looking for a job can attest to the difficulties in finding a new/any job.
I think there is plenty of demand for software but not enough economic incentive to fulfill every single demand. Even for the software that is being worked on, we are constantly prioritizing between the features we need or want, deciding whether to write our own vs modifying something open source etc etc. You can also look at stuff like electron apps which is a hack to reduce programmer dev time and time to market for cross platform apps. Ideally, you should be writing highly performant native apps for each.
IMO if coding models get good enough to replace devs, we will see an explosion of software before it flattens out.
"Is there a shortage of React apps out there that companies are desperate for?"
Tech is always in shortage because it's a kitchen-sink problem; corporations want to reduce headcount so it saves money in the long run.
Are you sure about that?
We're several years in now, and have lots of A:B comparisons to study across orgs that allowed and prohibited AI assistants. Is one of those groups running away with massive productivity gains?
Because I don't think anybody's noticed that yet. We see layoffs that makes sense on their own after a boom, and cut across AI-friendly and -unfriendly orgs. But we don't seem to see anybody suddenly breaking out with 2x or 5x or 10x productivity gains on actual deliverables. In contrast, the enshittening just seems to be continuing as it has for years and the pace of new products and features is holding steady. No?
> We're several years in now, and have lots of A:B comparisons to study across orgs that allowed and prohibited AI assistants. Is one of those groups running away with massive productivity gains?
You mean... two years in? Where was the internet 2 years into it?
You're not making the argument you think you're making when you ask "Where was the [I]nternet 2 years into it?"
You may be intending to refer to 1971 (about two years after the creation of ARPANet) but really the more accurate comparison would be to 1995 (about two years since ISPs started offering SLIP/PPP dialup to the general public for $50/month or less).
And I think the comparison to 1995, the year of the Netscape IPO and URLs starting to appear in commercials and on packaging for consumer products, is apt: LLMs have been a research technology for a while; it's their availability to the general public that's new in the last couple of years. Yet while the scale of hype is comparable, the products aren't: LLMs still don't do anything remotely like what their boosters claim, and have done nothing to justify the insane amounts of money being poured into them. With the Internet, however, there were already plenty of retailers starting to make real money doing electronic commerce by 1995, not just by providing infrastructure and related services.
It’s worth really paying attention to Ed Zitron’s arguments here: The numbers in the real world just don’t support the continued amount of investment in LLMs. They’re a perfectly fine area of advanced research but they’re not a product, much less a world-changing one, and they won’t be any time soon due to their inherent limitations.
They're not a product? Isn't Cursor on the leaderboard for fastest to $100m ARR? What about just plain usage or dependency? College kids are using Chrome extensions that direct their searches to ChatGPT by default. I think your connection to internet uptake is a bit weak, and then you've ended by basically saying too much money is being thrown at this stuff, which is quite disconnected from the start of your argument.
>they’re not a product
I think it's pretty fair to say that they have close to doubled my productivity as a programmer. My girlfriend uses ChatGPT daily for her work, which is not "tech" at all. It's fair to be skeptical of exactly how far they can go but a claim like this is pretty wild.
Both your and her usage is currently being subsidized by venture capital money.
It remains to be seen how viable this casual usage actually is once this money dries up and you actually need to pay per prompt. We'll just have to see where the pricing will eventually settle, before that we're all just speculating.
I pay for chatgpt and would pay more.
> And I think the comparison to 1995, the year of the Netscape IPO and URLs starting to appear in commercials and on packaging for consumer products, is apt
My grandfather didn’t care about these and you don’t care about LLMs, we get it
> They’re a perfectly fine area of advanced research but they’re not a product
lol come on man
We’ll need at least three more years to sort out the Lycos, AltaVista, and HotBots of the LLM world.
What makes you more productive: say, a C++ compiler plus an IDE, or an LLM? Do C++ compilers and IDEs have similar valuations?
It does let beginners do a lot more than they are capable of currently.
But these are people that wanted to be in programming in the first place.
This "my mom can now code and got a job because of LLMs" myth, does this creature really exist in the wild?
No, it lets good engineers parallelize work. I can be adding a route to the backend while Cline with Sonnet 3.7 adds a button to the frontend. Boilerplate work that would take 20-30 minutes is handled by a coding agent. With Claude writing some of the backend routes with supervision, you've got a very efficient workflow. I do something like this daily in a 80k loc codebase.
I look forward to good standard integrations to assign a ticket to an agent and let it go through ci and up for a preview deploy & pr. I think there's lots of smaller issues that could be raised and sorted without much intervention.
Even if the VC-backed companies jacked up their prices, the models that I can run on my own laptop for "free" now are magical compared to the state of the art from 2 years ago. Ubiquity may come from everyone running these on their own hardware.
Takes like yours are just crazy given the pace of things. We can argue all day about whether people are "too bullish" or about the literal market size of enterprise AI, but truly, absolutely no one knows how good these things will get and the problems they'll overcome in the next 5 years. You saying "I am confused why people are still bullish" is implicitly building in some huge assumptions about the near future.
Most “AI” companies are simply wrapping the ChatGPT API in some form. You can tell from the job posts.
They aren't building anything themselves. I find this to be disingenuous at best, and to me it is a sign of a bubble.
I also think that re-branding Machine Learning as AI to also be disingenuous.
These technologies of course have their use cases and excel at some things, but this isn't the ushering in of actual, sapient intelligence, which for the majority of the term's existence was the de facto agreed standard for the term "AI". This technology lacks the actual markers of what is generally accepted as intelligence to begin with.
Their value proposition to VCs is market capture; they're marketing companies, and they outsource the technical heavy lifting to OpenAI and others.
Remember the quote that IBM thought there would be a total market for maybe 10 or 15 computers in the entire world? They were giant, and expensive, and very limited in application.
A popular myth, it seems to be made-up from a way-less-interesting statement about a single specific model of computer during a 1953 stockholder meeting:
> IBM had developed a paper plan for such a machine and took this paper plan across the country to some 20 concerns that we thought could use such a machine. I would like to tell you that the [IBM 701] machine rents for between $12,000 and $18,000 a month, so it was not the type of thing that could be sold from place to place. But, as a result of our trip, on which we expected to get orders for five machines, we came home with orders for 18.
Did you really think that people on HN, who I assume are very informed, wouldn't fall for a popular misconception? How can this happen?
And that might have been true for a period of time. Advancements made it so they could become smaller and more efficient, and opened up a new market.
LLMs today feel like the former, but are being marketed as the latter. Fully believe that advancements will make them better, but in their current state they're being touted for their possibilities, not their actual capabilities.
I'm for using AI now as the tool they are, but AI is a while off taking senior development jobs. So when I see them being hyped for doing that it just feels like a hype bubble.
Turned out there was only room for about 4 oligopolist cloud providers.
I believe that was actually the number they expected to get orders for on one particular sales tour.
That quote is as apocryphal as “640K should be enough for anyone”
Tesla is valued based on the hope that it'll be the first to full self-driving cars. I don't think stock markets need to make sense: you invest in things that, if they pan out, could have huge growth. That's why LLMs are being invested in - alternatives will make you some ROI, but if LLMs do break through to major disruption in even a handful of large markets, your ROI will be huge.
Waymo already has full self-driving cars, and it doesn't look like people buy Google stock on that sentiment.
I personally think Tesla is positioned well for EV first world compared to other brands. But Chinese companies are catching up.
"Waymo is already full self-driving cars"
They are neither FSD nor do they have cars at the moment.
That's not really true. Just the entertainment value alone is already causing OpenAI to rate limit its systems, and they're buying up significant amounts of NVIDIA's capacity, and NVIDIA itself is buying up significant portions of the entire world's chip-making budget. Even if just limited to entertainment, the value is immense, apparently.
That's a funny comparison. I can and do use cryptocurrency to pay for web hosting, a VPN, and a few other things, as it's become the native currency of the internet. I love LLMs too, but I agree with the parent comment that says it's inevitable they'll be replaced with something better, while Bitcoin seems to be sticking around for the long, long term.
In my office most people use chatGPT or a similar LLM every day. I don't know a single coworker that's ever used a cryptocurrency. One guy has bought some crypto stocks.
It can be an amazing technology and the valuation for companies be completely wrong.
We saw this with the web. Pets.com was not a billion-dollar company, but the web was real.
I am actually of the belief that LLMs will be amazing but that rank and file companies are going to be the ones that benefit the most.
Just like the internet.
Bearish/bullish are relative terms, it describes how you feel about X as opposed to how the rest of the world/market feel.
> because they're (at least currently) mostly pretty expensive to run.
But moore's law should kick in, shouldn't it?
The Steam Market, which is basically an NFT store, has been going for 10+ years.
> The current valuation for it is because the market thinks that it's able to write code like a senior engineer and have AGI, because that's how they're marketed by the LLM providers.
No it's not. If it was valued for that it'd be at least 10X what it is now.
Blockchain is here to stay, don't kid yourself.
While it could be said that LLMs are in the 'peak of inflated expectations', blockchain is definitely still in the 'trough of disillusionment'. Even if there was a way for blockchain to affordably facilitate everyday transactions without destroying the planet and somehow sideloading into government acceptance, it's not clear that there would be anything novel enough to motivate people to use it vs a bank - beyond a complete collapse of the banking system.
Of course. You can also still buy new pogs, and they have a subreddit.
Blockchain is here to stay; this is way past the point of "believing in the tech" - recently a wss:// order-book exchange (Hyperliquid) crossed $1T in volume traded, and they started in 2023.
Blockchains are becoming real-time data structures where everyone has admin level read-only access to everyone.
HN doesn't like blockchain. They had the chance to get in very early and now they're salty. I first heard about bitcoin on HN, before Silk Road made headlines.
I got involved with bitcoin in 2010. I co-founded a bitcoin unicorn. It is exactly because of this experience that I’m salty on blockchain.
"HN doesn't like blockchain"
HN doesn't believe in blockchain the same way Apple, Microsoft, Google, etc. do.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly?
IMO there are two distinct reasons for this:
1. You've got the Sam Altmans of the world claiming that LLMs are or nearly are AGI and that ASI is right around the corner. It's obvious this isn't true, even if LLMs are still incredibly powerful and useful. But Sam doing the whole "is it AGI?" dance gets old really quick.
2. LLMs are an existential threat to basically every knowledge worker job on the planet. Peoples' natural response to threats is to become defensive.
I’m not sure how anyone can claim number 2 is true, unless it’s someone who is a programmer doing mostly grunt code and thinks every knowledge worker job is similar.
Just off the top of my head there are plenty of knowledge worker jobs where the knowledge isn’t public, nor really in written form anywhere. There just simply wouldn’t be anything for AI to train on.
> LLMs are an existential threat to basically every knowledge worker job on the planet.
Given the typical problems of LLMs, they are not. You still need humans to check the results. It's like FSD: impressive when it works, bad when it doesn't, scary because you never know beforehand when it's failing.
Yeah, the vast majority of what I spend my time on in a day isn’t something an LLM can help with.
My wife and I both work on and with LLMs and they seem to be, like… 5-10% productivity boosters on a good day. I’m not sure they’re even that good averaged over a year. And they don’t seem to be getting a lot better in ways that change that. Also, they’re that good if you’re good at using them and I can tell you most people really, really are not.
I remember when it was possible to be “good at Google”. It was probably a similar productivity boost. I was good at Google. Most (like, over 95% of) people were not, and didn’t seem to be able to get there, and… also remained entirely employable despite that.
I used to be firmly in this camp, but I'm not any more.
Even if they fail 1% of the time, the cost savings are too great. Businesses will take the risk.
A lot of software developers have an initial bad experience and assume it's terrible and give up on it.
I feel bad for people who haven't yet experienced how useful these models are for programming.
Some also just prefer manually entering everything. Those people I will never understand.
how much time do I need to devote to see anything but garbage?
For reference, I program systems code in C/C++ in a large, proprietary codebase.
My experiences with OpenAI (a year ago or more), and more recently, Cursor, Grok-v3 and Deepseek-r1, were all failures. The latter two started out OK and got worse over time.
What I haven't done is asked "AI" to whip up a more standard application. I have some ideas (an ncurses frontend to p4 written in Python similar to tig, for instance), but haven't gotten around to it.
I want this stuff to work, but so far it hasn't. Now I don't think "programming" a computer in English is a very good idea anyway, but I want a competent AI assistant to pair program with. To the degree that people are getting results, it seems to me they are leveraging very high-level APIs/libraries of code which are not written by AI and solving well-solved, "common" problems (simple games, simple web or phone apps). Sort of like how people gloss over the heavy lifting done by language itself when they praise the results from LLMs in other fields.
I know it eventually will work. I just don't know when. I also get annoyed by the hype of folks who think they can become software engineers because they can talk to an LLM. Most of my job isn't programming. Most of my job is thinking about what the solution should be, talking to other people like me in meetings, understanding what customers really want beyond what they are saying, and tracking what I'm doing in various forms (which is something I really do want AI to help me with).
Vibe coding is aptly named because it's sort of the VB6 of the modern era. Holy cow! I wrote a Windows GUI App!!! It's letting non-programmers and semi-programmers (the "I write glue code in Python to munge data and API ins/outs" crowd) create usable things. Cool! So did spreadsheets. So did HyperCard. Andrej tweeting that he made a phone app was kinda cool but also kinda sad. If this is what the hundreds of billions spent on AI (and my bank account thanks you for that) delivers, then the bubble is going to pop soon.
I think there is a big problem of expectations. People are told that it is great for software development, so they try to use it on big existing software projects, and it sucks.
Usually that's because of context: LLMs are not very good at understanding a very large amount of context, but if you don't give LLMs enough context, they can't magically figure it out on their own. This relegates AI to only really being useful for pretty self-contained examples where the amount of code is small, and you can provide all the context it needs to do its job in a relatively small amount of text (few thousand words or lines of code at most).
That's why I think LLMs are only useful right now in real-world software development for things like one-off functions, new prototypes, writing small scripts, or automating lots of manual changes you have to do. For example, I love using o3-mini-high to take existing tests that I have and modifying them to make a new test case. Often this involves lots of tiny changes that are annoying to write, and o3-mini-high can make those changes pretty reliably. You just give it a TODO list of changes, and it goes ahead and does it. But I'm not asking these models how to implement a new feature in our codebase.
I think this is why a lot of software developers have a bad view of AI. It's just not very good at the core software development work right now, but it's good enough at prototypes to make people freak out about how software development is going to be replaced.
That's not to mention that often when people first try to use LLMs for coding, they don't give the LLMs enough context or instructions to do well. Sometimes I will spend 2-3 minutes writing a prompt, but I often see other people putting the bare minimum effort into it, and then being surprised when it doesn't work very well.
Serious question as someone who has also tried these things out and not found them very useful in the context of working on a large, complex codebase in a language that isn't Python or JavaScript: when I imagine the amount of time it would take me to select some test cases, copy and paste them, and then think of a todo list or prompt to generate another case, even assuming the output is perfect, I feel like I'm getting close to the amount of time and mental effort it would take me to just write the test. In a way, having to ask in English for what I want in code adds an additional barrier for me: rather than just doing the thing, I also have to think of a promptable description. Is that not a problem? Is it just fast enough that it doesn't matter? What's the deal?
I mean, for me personally, I am writing out the English TODO list while I am figuring out exactly what changes I need to make. So, the thinking and writing the prompt take up the same unit of time.
And in terms of time saved, if I am just changing string constants, it’s not going to help much. But if I’m restructuring the test to verify things in a different way, then it is helpful. For example, recently I was writing tests for the JSON output of a program, using jq. In this case, it’s pretty easy to describe the tests I want to make in English, but translating that to jq commands is annoying and a bit tedious. But o3-mini-high can do it for me from the English very well.
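To make that concrete, here is a minimal sketch of the kind of JSON-output check I mean, written in Python rather than jq, and with a made-up tool name, flag, and field names. It's trivial to describe in English but mechanically tedious to hand-write in bulk:

    import json
    import subprocess

    def test_json_output_has_expected_shape():
        # Run the (hypothetical) program under test and capture its JSON output.
        result = subprocess.run(
            ["mytool", "--format", "json"],  # hypothetical CLI and flag
            capture_output=True, text=True, check=True,
        )
        data = json.loads(result.stdout)

        # The tedious part: lots of small, mechanical assertions.
        assert data["status"] == "ok"
        assert isinstance(data["items"], list)
        for item in data["items"]:
            assert {"id", "name", "created_at"} <= set(item)
            assert isinstance(item["id"], int)

Handing the English description of those assertions to the model and letting it grind out the boilerplate is where it saves me time.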
Annoying to do myself, but easy to describe, is the sweet spot. It is definitely not universally useful, but when it is useful it can save me 5 minutes of tedium here or there, which is quite helpful. I think for a lot of this, you just have to learn over time what works and what doesn't.
Thanks for the reply, that makes sense. jq syntax is one of those things that I’m just familiar enough with to remember what’s possible but not how to do it, so I could definitely see an LLM being useful for that.
Maybe one of my problems is that I tend to jump into writing simple code or tests without actually having the end point clearly in mind. Often that works out pretty well. When it doesn’t, I’ll take a step back and think things through. But when I’m in the midst of it, it feels like being interrupted almost, to go figure out how to say what I want in English.
Will definitely keep toying with it to see where I can find some utility.
That definitely makes a lot of sense. I think if you are coding in a flow state on something, and LLMs interrupt that, then you should avoid them for those cases.
The areas that I've found LLMs work well for are usually small simple tasks I have to do where I would end up Googling something or looking at docs anyway. LLMs have just replaced many of these types of tasks for me. But I continue to learn new areas where they work well, or exceptions where they fail. And new models make it a moving target too.
Good luck with it!
> I think if you are coding in a flow state on something, and LLMs interrupt that, then you should avoid them for those cases.
Maybe that's why I don't like them. I'm always in a flow state, or reading docs and/or a lot of code to understand something. By the time I'm typing, I already know what exactly to write, and thanks to my vim-fu (and emacs-fu), getting it done is a breeze. Then comes the edit-compile-run, or edit-test cycle, and by then it's mostly tweaks.
I get why someone would generate boilerplate, but most of the time, I don't want the complete version from the get go. Because later changes are more costly, especially if I'm not fully sure of the design. So I want something minimal that's working, then go work on things that are dependent, then get back when I'm sure of what the interface should be. I like working iteratively which then means small edits (unless refactoring). Not battling with a big dump of code for a whole day to get it working.
Yeah, I think it matters a lot what type of work you do. I have to jump between projects a lot that are all in different languages with a lot of codebases I'm not deeply familiar with. So for me, LLMs are really useful to get up-to-speed on the knowledge I need to work on new projects.
If I've got a clear idea of what I want to write, there's no way I'm touching an LLM. I'm just going to smash out the code for exactly what I need. However, often I don't get that luxury as I'll need to learn different file system APIs, different sets of commands, new jargon, different standard libraries for the new languages, new technologies, etc...
It does an ok job with C#, but it's generally outdated code, e.g. the [Required] attribute rather than the required keyword. Plus it generates some unnecessary constructors occasionally.
Mostly I use it for stupid template stuff, for which it isn't bad. It's not the second coming, but it definitely speeds you up.
> Most of my job is thinking about what the solution should be, talking to other people like me in meetings, understanding what customers really want beyond what they are saying, and tracking what I'm doing in various forms
None of this is particularly unique to software engineering. So if someone can already do this and add the missing component with some future LLM why shouldn’t they think they can become a software engineer?
Yeah I mean, if you can reason about, say, how an automobile engine works, then you can reason about how a modern computer works too, right? If you can discuss the tradeoffs in various engine design parameters then surely you understand Amdahl's law, caching strategies of a modern CPU, execution pipelining, etc... We just need to give those auto guys an LLM and then they can do systems software engineering, right?
Did you catch the sarcasm there?
Are you a manager by any chance? The non-coding parts of my job largely require domain experience. How does an LLM provide you with that?
If your mind has trouble expanding outside the domain of "use this well known tool to do something that has already been done" then no amount of improvements will free you from your belief that chatbots are glorified autocomplete.
What?
It doesn't matter how useful it is, there will always be naysayers who can't see it, or who refuse to see it.
That's okay.
It's not my responsibility to convince or convert them.
I prefer to just let them be and not engage.
I hear you, I'm tired of getting people that don't care to care. It's the people that should know how cool this stuff is and don't - they frustrate me!
You people frustrate me because you don't listen when I say that I've tried to use AI to help with my job and it fails horribly in every way. I see that it is useful to you, and that's great, but that doesn't make it useful for everybody... I don't understand why you must have everyone agree with you, and why it "tires" you out to hear other people's contradicting opinions. It feels like a religion.
I mean, it is trivial to show that it can do things literally impossible even 5 years ago. And you don't acknowledge that fact, and that's what drives me crazy.
It's like showing someone from 1980 a modern smart phone and them saying, yeah but it can't read my mind.
I'm not trying to pick on you or anything, but at the top of the thread you said "I mean, I can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out" and now you're saying "it is trivial to show that it can do things literally impossible even 5 years ago"
This leads me to believe that the issue is not that llm skeptics refuse to see, but that you are simply unaware of what is possible without them--because that sort of fuzzy search was SOTA for information retrieval and commonplace about 15 years ago (it was one of the early accomplishments of the "big data/data science" era) long before LLMs and deepnets were the new hotness.
This is the problem I have with the current crop of AI tools: what works isn't new and what's new isn't good.
It's also a red flag to hear "it is trivial to show that it can do things literally impossible even 5 years ago" 10 comments deep without anybody doing exactly that...
All of what LLMs do now was impossible 5 years ago. Like it is so self-evident, that I don't know how to take the request for examples seriously.
> What specifically was impossible 5 years ago that llms can do
> It's so self-evident, that I don't know how to take the request for examples seriously
Do you see why people are hesitant to believe people with outrageous claims and no examples?
It's more like showing someone from 1980 a modern smart phone you call a Portable Mind Reader and them saying, "yeah, but it can't read my mind."
Are people really this hung up on the term “AI”? Who cares? The fact that this is a shockingly useful piece of technology has nothing to do with what it’s called.
Because the AI term makes people anthropomorphize those tools.
They "hallucinate", they "know", they "think".
They're just the result of matrix calculus on which your own pattern recognition capacities fool you into thinking there is intelligence there. There isn't. They don't hallucinate, their output is wrong.
The worst example I've seen of anthropomorphism was the blog post from a researcher working on adversarial prompting. The tool spewing "help me" words made them think they were hurting a living organism: https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-...
Speaking with AI proponents feels like speaking with cryptocurrency proponents: the more you learn about how things work, the more you understand they don't and just live in lalaland.
Because marketing keeps deeply overpromising things.
Maybe hype is overly beneficial to them, but if you promise me 1500 and I get 1100, then I will be underwhelmed.
And especially around LLM marketing hype is fairly extreme.
If you lived before the invention of cars, and if when they were invented, marketers all said "these will be able to fly soon" (which of course, we know now wouldn't have been true), you would be underwhelmed? You wouldn't think it was extremely transformative technology?
From where does the premise that "artificial intelligence" is supposed to be infallible and super human come from? I think 20th century science fiction did a good job of establishing the premise that artificial intelligence will be sometimes useful but will often fail in bizarre ways that seem interesting to humans. Misunderstandings orders, applying orders literally in a way humans never would, or just flat out going haywire. Asimov's stories, HAL9000, countless others. These were the popular media tropes about artificial intelligence and the "real deal" seems to line up with them remarkably well!
When businessmen sell me "artificial intelligence", I come prepared for lots of fuckery.
Have you considered you are using it incorrectly?
Have you considered that the problems you encounter in daily life just happen to be more present in the training data than problems other users encounter?
Stitching together well-known web technologies and protocols in well-known patterns, probably a good success rate.
Solving issues in legacy codebases using proprietary technologies and protocols, and non-standard patterns. Probably not such a good success rate.
Have you considered you have no knowledge of how I'm using AI?
Yes I have. I did some research and read some blog posts, and nothing really helped. Do you have any good resources I could look at maybe?
I think you would benefit from a personalized approach. If you like, send me a Loom or similar of you attempting to complete one software task with AI, that fails as you said, and I'll give you my feedback. Email in profile.
This is the correct approach. Also see: https://ezyang.github.io/ai-blindspots/
Well said
Far from just programming too. They're useful for so many things. I use it for quickly coming up with shell scripts (or even complex piped commands (or if I'm being honest even simple commands since it's easier than skimming the man page)). But I also use it to bounce ideas off of when negotiating contracts. Or to give me a spoiler-free reminder of a plot point I'd forgotten in a book or TV series. Or to explain legal or taxation issues (which I of course verify, but it points me in the right direction). Or any number of other things.
As the parent says, while far from perfect, they're an incredible aid in so many areas. When used well, they help you produce not just faster but also better results. The only trick really is that you need to treat it as a (very knowledgeable but overconfident) collaborator rather than an oracle.
I love using it to boilerplate code for a new API I want to integrate. Much better than having to manually search. In the near future, not knowing how to effectively use AI to enhance productivity will be a disadvantage to potential employers.
I use ChatGPT all the time. I really like it. It's not perfect; how I've described it (and I doubt that I'm unique in this): it's like having a really smart and eager intern at your disposal.
I say "intern" in the sense that its error-prone and kind of inexperienced, but also generally useful. I can ask it to automatically create a lot of the bootstrapping or tedious code that I always dread writing so that I can focus on the fun stuff, which is often the stuff that's pawned off onto interns and junior-level engineers. I think for the most part, when you treat it like that, it lives up to and sometimes even surpasses expectations.
I mean, I can't speak for everyone, but whenever I begin a new project, a large percentage of the first ~3 hours is simply copying and pasting and editing from documentation, either an API I have to call or some bootstrapping code from a framework or just some cruft to make built-in libraries work how you want. I hate doing all that, it actively makes me not want to start a new project. Being able to get ChatGPT to give me stuff that I need to actually get started on my project has made coding a lot more fun for me again. At this point, you can take my LLM from my cold dead hands.
I do think it will keep getting better, but I'm also at a point where even if it never improves I will still keep using it.
I guess it totally depends on the specific use cases. Just yesterday, I asked Gemini 2.5 for how to test a Laravel artisan command that internally calls another artisan command using the Pest testing framework. Gemini showed me a method that looked perfect. But when I tried to execute it, I got an error that the method doesn't exist. So I asked Gemini about this, and it told me: "Ah, apologies! That's a classic sign that you might be using a version of Laravel older than 5.7." So I told it that I am actually using the latest release (12.3), to which it responded with: "You're right to question things when you get an error! However, let's clarify the Laravel version first.
As of today, March 27, 2025, the latest stable version of Laravel is Laravel 11, which was released in March 2024. Laravel 12 has not been released yet (it's expected roughly in Q1 2026 based on the usual schedule).
Could you please double-check the exact Laravel version you are using?" So it did not believe me and I had to convince it first that I was using a real version. This went on for a while, with Gemini not only hallucinating stuff, but also being very persistent and difficult to convince of anything else.
Well, in the end it was still certain that this method should exist, even though it could not provide any evidence for it, and my searching through the internet and the Git history of the related packages also did not turn up any results.
So I gave up and tried it with Claude 3.7 which could also not provide any working solution.
In the end, I found an entirely different solution for my problem, but that wasn't based on anything the AIs told me, but just my own thinking and talking to other software developers.
I would not go so far as to call these AIs useless. In software development they can help with simple stuff and boilerplate code, and I found them a lot more helpful in creative work. This is basically the opposite of what I would have expected 5 years ago ^^
But for any important tasks, these LLMs are still far too unreliable. They often feel like they have a lot of knowledge, but no wisdom. They don't know how to apply their knowledge ideally, and they often basically brute-force it with a mix of strange creativity and statistical models that are apparently based on a vast amount of internet content, a big part of which is troll content and satire.
My issue with higher ups pushing LLMs is that what slows me down at work is not having to write the code. I can write the code. If all I had to do was sit down and write code, then I would be incredibly productive because I'm a good programmer.
But instead, my productivity is hampered by issues with org communication, structure, siloed knowledge, lack of documentation, tech debt, and stale repos.
I have for years tried to provide feedback and get leadership to do something about these issues, but they do nothing and instead ask "How have you used AI to improve your productivity?"
I've had the same experience as you, and also rather recently. I had to learn two lessons: first, what I could trust it with (as with Wikipedia when it was new), and second, what makes sense to ask it (as with YouTube when it was new). Once I got that down, it is one fabulous tool to have on my belt, among many other tools.
Thing is, the LLMs that I use are all freeware, and they run on my gaming PC. Two to six tokens per second are alright honestly. I have enough other things to take care of in the meantime. Other tools to work with.
I don't see the billion dollar business. And even if that existed, the means of production would be firmly in the hands of the people, as long as they play video games. So, have we all tripled our salaries?
If we haven't, is that because knowledge work is a limited space that we are competing in, and LLMs are an equalizer because we all have them? Because I was taught that knowledge work was infinite. And the new tool should allow us to create more, and better, and more thoroughly. And that should get us all paid better.
Right?
There are basically 3 categories of LLM users (very roughly).
1. People creating or dealing with imprecise information. People doing SEO spam, people dealing with SEO spam, almost all creative arts people, people writing corporatese or legalese documents or mails, etc. For these tasks LLMs are god-like.
2. People dealing with precise information and/or facts. For these people LLMs are no better than a parrot.
3. Subset of 2 - programmers. Because of the huge amount of stolen training data, plus almost perfect proofing software in the form of compilers, static analyzers, etc., for this case LLMs are more or less usable; the more data was used the better (JS is the best as I understand).
This is why people's reactions are so polarized. Their results differ.
Depends on your use case. If you don't need them to be the source of truth, then they work great, but if you do, the experience sucks because they're so unreliable.
The problems start when people start hyperventilating because they think since LLMs can generate tests for a function for you, that they'll be replacing engineers soon. They're only suitable for generating output that you can easily verify to be correct.
Indeed, isn’t that the point?
LLM training is designed to distill a massive corpus of facts, in the form of token sequences, into a much, much smaller bundle of information that encodes (somehow!) the deep structure of those facts minus their particulars.
They’re not search engines, they’re abstract pattern matchers.
I asked Grok to describe a picture I took of me and my kid at Hilton Head Island. Based on the plant life, it guessed it was a southeast barrier island in Georgia or the Carolinas. It guessed my age and my son's age. LLMs are completely insane tech for a 90s kid. The first fundamental advance in tech I've seen in my lifetime, like what it must've been like for people who used a telephone for the first time, or watched a television.
Flat TVs, digital audio players (the iPod), the smartphone, laptops, smartwatches... You have a very selective definition of advance in tech. Compare today (minus LLMs) with any movie depicting life in the nineties and you can see how much tech has developed.
It’s not that impressive to me, as a programmer.
The crisis in programming hasn’t been writing code. It has been developing languages and tools so that we can write less of it that is easy to verify as correct. These tools generate more code. More than you can read and more than you will want to before you get bored and decide to trust the output. It is trained on the most average code available that could be sucked up and ripped off the Internet. It will regurgitate the most subtle errors that humans are not good at finding. It only saves you time if you don’t bother reading and understanding what it outputs.
I don’t want to think about the potential. It may never materialize. And much of what was promised even a few years ago hasn’t come to fruition. It’s always a few years away. Always another funding round.
Instead we have massive amounts of new demand for liquid methane, infrastructure struggling to keep up, billions of gallons of fresh water wasted, all so that rich kids can vibe code their way to easy money and realize three months later they’ve been hacked and they don’t know what to do. The context window has been lost and they ran out of API credits. Welcome to the future.
Yeah, basically this. If I look at how it helps me as an individual, I can totally see how AI can sometimes be useful. If I look at the societal effect of AI, it becomes apparent that AI is just a net negative. Some examples:
- AI is great for disinformation
- AI is great at generating porn of women without their consent.
- Open source projects massively struggle as AI scrapers DDOS them.
- AI uses massive amounts of energy and water; most importantly, the expectation is that energy usage will rise drastically in a world where we need to lower it. If Sam Altman gets his way, we're toast.
- AI makes us intellectually lazy and worse thinkers. We were already learning less and less in school because of our impoverished attention span. This is even worse now with AI.
- AI makes us even more dependent on cloud vendors and third-parties, further creating a fragile supply chain.
Like AI ostensibly empowers us as individuals, but in reality I think it's a disservice, and the ones it truly empowers are the tech giants, as citizens become dumber and even more dependent on them and tech giants amass more and more power.
I can't believe I had to dig this deep to find this comment.
I have yet to see an AI-generated image that was "really cool".
AI images and videos strike me as the coffee pods of the digital world -- we're just absolutely littering the internet with garbage. And as a bonus, it's also environmentally devastating to the real world!
I live nearby a landfill, and go there often to get rid of yard waste, construction materials, etc. The sheer volume of perfectly serviceable stuff people are throwing out in my relatively small city (<200k) is infuriating and depressing. I think if more people visited their local landfills, they might get a better sense for just how much stuff humans consume and dispose. I hope people are noticing just how much more full of trash the internet has become in the last few years. It seems like it, but then I read this thread full of people that are still hyped about it all and I wonder.
This isn't even to mention the generated text... it's all just so inane and I just don't get it. I've tried a few times to ask for relatively simple code and the results have been laughable.
If you ask for obscure things, how do you know you are getting right answers? In my experience, unless the thing you are looking for is easily found with a Google search, LLMs have no hope of getting it correct. This is mostly from trying to code against an obscure API that isn't well documented, and the little documentation there is is spread across multiple wikis. And the LLMs keep hallucinating functions that simply do not exist.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly
Hossenfelder is a scientist. There's a certain level of rigour that she needs to do her job, which is where current LLMs often fall down. Arguably it's not accelerating her work to have to check every single thing the LLM says.
I use them everyday and they save me so much time and enable me to do things that I wouldn't be able to do otherwise just due to the amount of time it would take.
I think some people just aren't using them correctly or don't understand their limitations.
They are especially helpful for getting over thought paralysis when starting a new project.
I think the tech would have landed better with more people if it wasn't branded as "artificial intelligence."
I don't have a proposal for what a better name would have been, naming things is hard, but AI carries quite a bit of baggage and expectations with it.
It is an amazing technology and like crypto/blockchain it is nerdy to understand how it works and play with it. I think there are two things at stake here:
1. Some people are just uncomfortable with it because it “could” replace their jobs.
2. Some people are warning that the ecosystem bubble is significantly out of proportion. They are right, and having the whole stock market, companies, and US economy attached to LLMs is just downright irresponsible.
> Some people are just uncomfortable with it because it “could” replace their jobs.
What jobs are seriously at risk of being totally replaced by LLM's? Even in things like copywriting and natural language translation, which is somewhat of a natural "best case" for the underlying tech, their output is quite sub par compared to the average human's.
They can do fun and interesting stuff, but we keep hearing how they’re going to replace human workers, and too many people in positions of power not only believe they are capable of this, but are taking steps to replace people with LLMs.
But while they are fun to play with, anything that requires a real answer, but can’t be directly and immediately checked, like customer support, scientific research, teaching, legal advice, identifying humans, correctly summarizing text - LLMs are very bad at these things, make up answers, mix contexts inappropriately, and more.
I’m not sure how you can have played with LLMs so much and missed this. I hope you don’t trust what they say about recipes or how to handle legal problems or how to clean things or how to treat disease or any fact-checking whatsoever.
>I’m not sure how you can have played with LLMs so much and missed this. I hope you don’t trust what they say about recipes or how to handle legal problems or how to clean things or how to treat disease or any fact-checking whatsoever.
This is like a GPT3.5 level criticism. o1-pro is probably better at pure fact retrieval than most PhDs in any given field. I challenge you to try it.
In fact take the GPQA test yourself and see how you do then give the same questions to o1. https://arxiv.org/pdf/2311.12022
The main issue is that if you ask most LLMs to do something they aren't good at, they don't say "Sorry, I'm not sure how to do that yet," they says "Sure, absolutely! Here you go:" and proceed to make things up, provide numbers or code that don't actually add up, and make up references and sources.
To someone who doesn't actually check or have the knowledge or experience to check the output, it sounds like they've been given a real, useful answer.
When you tell the LLM that the API it tried to call doesn't exist it says "Oh, you're right, sorry about that! Here's a corrected version that should work!" and of course that one probably doesn't work either.
Yes. One of my early observations about LLMs was that we've now produced software that regularly lies to us. It seems to be a quite intractable problem. Also, since there's no real visibility as to how an LLM reaches a conclusion, there's no way to validate anything.
One takeaway from this is that labelling LLMs as "intelligent" is a total misnomer. They're more like super parrots.
For software development, there's also the problem of how up to date they are. If they could learn on the fly (or be constantly updated) that would help.
They are amazing in some ways, but they've been over-hyped tremendously.
The frustration of using an LLM is greater than the frustration of doing it myself. If it's going to be a tool, it needs to work. Otherwise, it's just a research toy.
Nothing it can do I couldn't do myself before: All the information it gives me I could get myself, though admittedly, slower.
I wonder if people that are amazed by LLM lack this information gathering skill.
After all, I have met plenty of architects and senior-level people that just… had zero Google and research skills.
I agree, they are an amazing piece of technology, but the investment and hype don't match the reality. This might age like milk, but I don't think OpenAI is going to make it. They burnt $9B to lose $5B in 2024, trying to raise money like their life depends on it.. because their life depends on it. From what I can tell, none of the AI model producers are profiting from their model usage at this point, except maybe Deepseek. This will be a market, they are useful, astonishingly impressive even, but IMO they are either going to become waaayy more expensive to use and/or the market/investment will greatly shrink to be sustainable.
> can ask for obscure things with subtle nuance where I misspell words and mess up my question and it figures it out. It talks to me like a person. It generates really cool images. It helps me write code. And just tons of other stuff that astounds me.
It is an impressive technology, but is it US$244.22bn [1] impressive (I know this stat is supposed to account for computer vision as well, but seeing how LLMs are now a big chunk of that, I think it's a safe assumption)? It's projected to grow to over US$1tr by 2031. That's higher than the market size of commercial aviation at its peak [2]. I'm sorry, but I just can't agree that a cool chatbot is approximately as important as flying.
[1] https://www.statista.com/outlook/tmo/artificial-intelligence...
[2] https://www.statista.com/markets/419/topic/490/aviation/#sta...
Couldn’t agree more!
When I saw GPT-3 in action in 2023, I couldn’t believe my eyes. I thought I was being tricked somehow. I’d seen ads for “AI-powered” services and it was always the same unimpressive stuff. Then I saw GPT-3 and within minutes I knew it was completely different. It was the first thing I’d ever seen that felt like AI.
That was only a few years ago. Now I can run something on my 8GB MacBook Air that blows GPT-3 out of the water. It’s just baffling to me when people say LLM’s are useless or unimpressive. I use them constantly and I can still hardly believe they exist!!
> I use them constantly and I can still hardly believe they exist!!
Exactly how I feel. I probably write 50 prompts/day, and a few times a week I still think, "I can't believe this is real tech."
LLMs are better at formally verifiable tasks like coding, also coding makes more money on a pure demand basis so development for it gets more resources. In descriptive science fields, it's not great because science fields don't generate a lot of text compared to other things, so the training data is dwarfed by the huge corpus of general internet text. The software industry created the internet and loves using it, so they have published a lot more text in comparison. It can be really bad in bio for example.
Is your testing adversarial or merely anecdotal curiosity? If you don't actively look for it why would you expect to find it?
It's bad technology because it wastes a lot of labor, electricity, and bandwidth in a struggle to achieve what most human beings can with minimal effort. It's also a blatant thief of copyrighted materials.
If you want to like it, guess what, you'll find a way to like it. If you try to view it from another persons use case you might see why they don't like it.
I think it’s more that the people who are boosting LLMs are claiming that perfect super intelligence is right around the corner.
Personally, I look back at how many years ago it was that we were seeing claims that truck drivers were all going to lose their jobs and society would tear itself apart over it within the next few years… and yet here we still are.
An LLM is like a mouse.
You no longer have the console as the primary interface, but a GUI, which 99.9+% of computer users control via a mouse.
You no longer have the screen as the primary interface, but an AUI, which 99.9+% of computer users control via a headset, earbuds, or a microphone and speaker pair.
You mostly speak and listen to other humans, and if you're not reading something they've written, you could have it read to you in order to detach from the screen or paper.
You'll talk with your computer while in the car, while walking, or while sitting in the office.
An LLM makes the computer understand you, and it allows you to understand the computer.
Even if you use smart glasses, you'll mostly talk to the computer generating the displayed results, and it will probably also talk to you, adding information to the displayed results. It's LLMs that enable this.
Just don't focus too much on whether the LLM knows how high Mount Kilimanjaro is; its knowledge of that fact is simply a hint that it can properly handle language.
Still, it's remarkable how useful they are at analyzing things.
LLMs have a bright future ahead, or whatever technology succeeds them.
I don’t even argue that they might get useful at some point, but when I point a mouse at a button and press the button it usually results in a reliable action. When I use the LLM (I have so far tried: Claude, ChatGPT, DeepSeek, Mistral) it does something but that something usually isn’t what I want (~the linked tweet). Prompting, studying and understanding the result and then cleaning up the mess for the low price of an expensive monthly sub leaves me with worse results than if I did the thing myself, usually takes longer and often leaves me with subtle bugs I’m genuinely afraid of growing into exploitable vulnerabilities. Using it strictly as a rubber duck is neat but also largely pointless. Since other people are getting something out of the tech, I’ll just assume that the hammer doesn’t fit my nails.
These are the beginnings and it will only improve. The premise is "I genuinely don't understand why some people are still bullish about LLMs", which I just can't share.
When the mouse and GUI were invented, nobody needed to say "just wait a couple years for it to improve and you'll understand why it's useful, until then please give me money". The benefits are immediately obvious and improve the experience for practically every computer user.
LLMs are very useful for some (mostly linguistic) tasks, but the areas where they're actually reliable enough to provide more value than just doing it yourself are narrow. But companies really need this tech to be profitable and so they try to make people use it for as many things as possible and shove it in everyone's face[0] in hopes that someone finds a use-case where the benefits are indeed immediately obvious and revolutionary.
[0] For example my dad's new Android phone by default opens a Gemini AI assistant when you hold the power button and it took me minutes of googling to figure out how to make it turn off the damn thing. Whoever at Google thought that this would make people like AI more is in the wrong profession.
Cursor is at $100m ARR. People (users, not just investors) are giving them money, not because they expect to see value in a couple of years.
It's like a mouse that some variable proportion of the time pretends it's moved the cursor and clicked a button, but actually it hasn't and you have to put a lot of work in to find out whether it did or didn't do what you expected.
It used to be annoying enough just having to clean the trackball, but at least you knew when it wasn't working.
This is true. But it needs to be more than a toy if it is to be economically viable.
So far the industrial applications haven't been that promising, code writing and documentation is probably the most promising but even there it's not like it can replace a human or even substantially increase their productivity.
I'm completely with you. The technology is absolutely fascinating in its own right.
That said, I do experience frustrations:
- Getting enraged when it messes up perfectly good code it wrote just 10 minutes ago
- Constantly reminding it we're NOT using jest to write tests
- Discovering it's created duplicate utilities in different folders
There's definitely a lot of hand-holding required, and I've encountered limitations I initially overlooked in my optimism.
But here's what makes it worthwhile: LLMs have significantly eased my imposter syndrome when it comes to coding. I feel much more confident tackling tasks that would have filled me with dread a year ago.
I honestly don't understand how everyone isn't completely blown away by how cool this technology is. I haven't felt this level of excitement about a new technology since I discovered I could build my own Flash movies.
It's fine if you don't care about correctness, but a lot of people do.
It depends. For small tasks like summarization or self-contained code snippets, it’s really good—like figuring out how to inspect a binary executable on Linux, or designing a ranking algorithm for different search patterns. If you only want average performance or don’t care much about the details, it can produce reasonable results without much oversight.
But for larger tasks—say, around 2,000 lines of code—it often fails in a lot of small ways. It tends to generate a lot of dead code after multiple iterations, and might repeatedly fail on issues you thought were easy to fix. Mentally, it can get exhausting, and you might end up rewriting most of it yourself. I think people are just tired of how much we expect LLMs to deliver, only for them to fail us in unexpected ways. The LLM is good, but we really need to push to understand its limitations.
I think its perception of usefulness depends on how often you ask/google questions. If you are constantly wondering about X thing, LLMs are amazing - especially compared to previous alternatives like googling or asking on Reddit.
If you don’t constantly look for information, they might be less useful.
I don't like tools that can't be trusted to work 100% of the time. Is this hard to grasp?
I'm a senior engineer with 20 years of experience and mostly find all of the AI bs for the last couple of years to be occasionally helpful for general stuff but absolutely incompetent when I need help with mildly complicated tasks.
I did have a eureka moment the other day with deepseek and a very obscure bug I was trying to tackle. One api query was having a very weird, unrelated side effect. I loaded up cursor with a very extensive prompt and it actually figured out the call path I hadn't been able to track down.
Today, I had a very simple task that eventually only took me half an hour to manually track. But I started with cursor using very similar context as the first example. It just kept repeatedly dreaming up non-existent files in the PR and making suggestions to fix code that doesn't exist.
So what's the worth to my company of my very expensive time? Should I spend 10,20,50 percent of my time trying to get answers from a chatbot, or should I just use my 20 years of experience to get the job done?
I've been playing with Gemini 2.5 Pro, throwing all kinds of problems at it that will help me with personal productivity, and it's mostly one-shotting them. I'm still in disbelief tbh. A lot of people who don't understand how to use LLMs effectively will be at an economic disadvantage.
Can you give some examples? Do you mean things like "How do I control my crippling anxiety", things like "What highways would be best to take to Chicago", things like "Write me a Python library to parse the file format in this hex dump", or things like "What should I make for dinner"?
As a 50+ nerd, for decades I carried the idea: can't we just build a sufficiently large neural net, throw some data at it, and have it somehow be usefully intelligent? So it's kind of showing strong signs of something I've been waiting for.
In the 70's I read in some science book for kids about how one day we will likely be able to use light emitting diodes for illumination instead of light bulbs, and this "cold light" will save us lots of energy. Waited out that one too; it turned out so.
Same as reading books, Internet, Wikipedia, working towards/keeping your health and fitness, etc...
The quote about books being a mirror reflecting genius or idiocy seems to apply.
I see LLMs as a kind of hyper-keyboard: speeding up typing AND structuring content, completing thoughts, and inspiring ideas.
Unlike a regular keyboard, an LLM transforms input contextually. One no longer merely types but orchestrates concepts and modulates language, almost like music.
Yet mastery is key. Just as a pianist turns keystrokes into a symphony through skill, a true virtuoso wields LLMs not as a crutch but as an amplifier of thought.
> And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly
More like we note the frequency with which these tools produce shallow bordering on useless responses, note the frequency with which they produce outright bullshit, and conclude their output should not be taken seriously. This smells like the fervor around ELIZA, but with several multinational marketing campaigns behind it pushing.
Probably because the applications of the technology are things like automating calls for collections agencies:
https://www.ycombinator.com/companies/domu-technology-inc/jo...
If we judge a technology by how it transforms our lives, LLMs and GenAI have mostly been a net negative (at least that is how it feels).
People were impressed by Eliza in 1964.
I’m reminded of how I always think current cutting edge good examples of CG in movies looks so real and then, consistently, when I watch it again in 10 years it always looks distractingly shitty.
I honestly believe the GP comment demonstrates a level of gullibility that AI hypesters are exploiting. Generative LLMs are text[1] generators, statistical machines extruding plausible text. To the extent that a human believes it to be credible, the output exhibits all the hallmarks of a confidence game. Once you know the trick, after Toto pulls back the curtain[2], it's not all that impressive.
1. I'm aware that LLMs can generate images and video as well. The point applies.
2. https://www.youtube.com/watch?v=NZR64EF3OpA
Perhaps you have already paid off your mortgage and saved up a million dollars for retirement? And you're not threatened by dismissal or salary reduction because supposedly "AI will replace everyone."
By the way, you don't need to be a 50+ year old nerd. Nerds are a special culture-pen where smart straight-A students from schools are placed so they can work, increase stakeholder revenues, and not even accidentally be able to do anything truly worthwhile that could redistribute wealth in society.
The speed at which anything progresses is impressive if you're not paying attention while other people toil away on it for decades, until one day you finally look up and say, "Wow, the speed at which this thing progressed is insane!"
I remember seeing an AI lab in the late 1980's and thinking "that's never going to work" but here we are, 40 years later. It's finally working.
Yeah, like I. I. Rabi said in regard to people no longer being amazed by the achievements of physics, "What more do you want, mermaids?"
Anyone who remembers further back than a decade or so remembers when the height of AI research was chess programs that could beat grandmasters. Yes, LLMs aren't C3PO or the like, but they are certainly more like that than anything we could imagine just a few years ago.
I'm glad I'm not the only person in awe with LLMs. It feels like it came straight out of science fiction novel. What does it take to impress people nowadays?
I feel like if teleportation was invented tomorrow, people would complain that it can't transport large objects so it's useless.
Absolutely, as someone with far less experience than you I feel quite spoiled I'll have this for help coding for the rest of time.
“The growth of the Internet will slow drastically, as the flaw in ‘Metcalfe’s law' becomes apparent: most people have nothing to say to each other! By 2005, it will become clear that the Internet’s impact on the economy has been no greater than the fax machine’s”
Same vibe.
Yeah the amount of "piffle work" that LLMs save me is astounding. Sure, I can look up fifty different numbers and copy them into excel. Or I can just tell an LLM "make a chart comparing measurements XYZ across devices ABC" and I'm looking at the info right there.
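For illustration only, with made-up device names and numbers, this is roughly the kind of throwaway comparison chart I mean; assembling even this by hand is exactly the piffle work that gets skipped:

    import matplotlib.pyplot as plt

    # Hypothetical measurements for hypothetical devices.
    devices = ["Device A", "Device B", "Device C"]
    idle_power_w = [1.2, 0.8, 2.1]
    load_power_w = [5.4, 4.9, 7.3]

    x = range(len(devices))
    width = 0.35
    plt.bar([i - width / 2 for i in x], idle_power_w, width, label="Idle (W)")
    plt.bar([i + width / 2 for i in x], load_power_w, width, label="Load (W)")
    plt.xticks(list(x), devices)
    plt.ylabel("Power draw (W)")
    plt.legend()
    plt.show()

The LLM version of this is one sentence of prompt instead of fifteen minutes of copy-pasting.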
Because we're being told it is a perfect superintelligence, that it is going to replace senior engineers. The hype cycle is real, and worse than blockchain ever was. I'm sure LLMs will be able to code a full enterprise app about the same time moon coin replaces the USD.
Probably because you don't have the same use-case as them... doing "code" is an "easy" use-case, but pondering on a humanities subject is much harder... you cannot "learn the structure" of humanities, you have to know the facts... and LLMs are bad at that
Because they're not perfect, and they're going to lobotomise societal ability to invent if we're not actually careful with them.
https://news.ycombinator.com/item?id=43504459
The real problem is that they are fundamentally unreliable once you start looking closer at their answers.
I often ask "So you say LLMs are worthless because you can't blindly trust the first thing they say? Do you blindly trust the first google search result? Do you trust every single thing your family members tell you?" It reminds me of my high school teachers saying Wikipedia can't be trusted.
You did not mention any useful business case with real economic impact, other than a very smart template generator for developers.
Choose a very narrow domain that you know well, and you quickly realize they are just repeating the training data.
I wholeheartedly agree with you and it’s funny reading the replies to your comment.
Basically people just doubling down on everything you just described. I can't quite put a finger on it, but it has a tinge of insecurity or something like that; I hope that's not the case and I'm just misinterpreting.
Indeed, it is the stuff of science fiction, and then you get an "akshually, it's just statistics" comment. I feel people are projecting their fears, because deep down, they're simply afraid.
I like LLMs for what they are. Classifiers. I don’t trust them as search engines because of hallucinations. I use them to get a bearing on a subject but then I’ll turn to Google to do the real research.
Because the marketers oversold it. That is why you are seeing a pushback. I also outright rejected them because 1) they were sold and marketed as end all be all replacements for human thought, and 2) they promised to replace only the parts of my job that I enjoy. Billboards were up in San Francisco telling my "bosses" that I was soon to be replaced, and the loudest and earliest voices told me that the craft I love is dead. Imagine Nascar drivers excitedly discussing how cool it was they wouldn't have to turn left anymore - made me wonder why everyone else was here.
It was, more or less, the same narrative arc as Bitcoin, and was (is) headed for a crash.
That said, I've spent a few weeks with augment, and it is revelatory, certainly. All the marketing - aimed at a suite I have no interest in - managed to convince me it was something it wasn't. It isn't a replacement, any more than a power drill is a replacement for a carpenter.
What it is, is very helpful. "The world's most fully functioning scaffolding script", an upgrade from Copilot's "the world's most fully functioning tab-completer". I appreciate its usefulness as a force multiplier, but I am already finding corners and places where I'd just prefer to do it myself. And this is before we get into the craft of it all - I am not excited by the pitch "worse code, faster", but the utility is undeniable in this capitalistic hell planet, and I'm not a huge fan of writing SQL queries anyway, so here we are!
Of course you'd be confused if you transform a list of basic fails people complain about into
> And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
But LLMs should be good enough to resolve this confusion, ask them!
It's like computer graphics and VR: Amazing advances over the years, very impressive, fun, cool, and by no means a temporary fad...
... But I do not believe we're on the cusp of a Lawnmower-Man future where someone's Metaverse eats all the retail-conference-halls and movie-theaters and retail-stores across the entire globe in an unbridled orgy of mind-shattering investor returns.
Similarly, LLMs are neat and have some sane uses, but the fervor about how we're about to invent the Omnimind and usher in the singularity and take over the (economic) world? Nah.
Ah yes, the "computer graphics, which has generated billions upon billions of revenue and change are just a fad" argument.
What next, "This Internet thing was just a fad" or "The industrial age was a fad"?
We find this thing you said:
> Ah yes, the "computer graphics, which has generated billions upon billions of revenue and change are just a fad" argument.
in reference to this thing that OP said:
> It's like computer graphics and VR: Amazing advances over the years, very impressive, fun, cool, and by no means a temporary fad...
Either you, yourself are an LLM, or you need to slow the fuck down and read.
That is not at all what they are saying. If you seriously think that, you have serious reading comprehension problems.
Exploring topics in a shallow fashion is fine with LLMs; doing anything deep is just too unreliable due to hallucination. All the models I've talked to desperately want to give a positive answer, and thus will often just lie.
Today's models are far from autonomous thinking machines; believing otherwise is a cognitive bias among the masses. It is just a giant calculator. It predicts "the most probable next word" from a sea of all combinations of next words.
What you're impressed with is 40% the human skill in creating an LLM, 0.5% value created by the model, and 59.5% the skills of all the people it ate and whose livelihoods it is now trying to destroy.
There are legitimate concerns to have in regards to LLMs, it's not all a sea of roses.
As far as capabilities? Meh. It will improve, or not, but we'll figure out really cool things regardless.
As far as breaking our reality and society? Absolutely :(
I don't see it as a bigger leap than the internet itself. I recall needing books on my desk or a road trip to the local bookshop to find out coding answers. Stack Overflow beats AI most days, but the LLMs are another nice tool.
Could it be that people who are too bullish are being deceived by smoke and mirrors?
As others have pointed out already, the hype about writing code like a senior engineer, or in general acting as a competent assistant, is what created the expectation in the first place. They keep over-promising but under-delivering. Who is the guy talking about AGI most of the time? Could it be the top executive of one of the largest gen AI companies, do you think? I won't deny it has occasionally a certain 'star-trek-computer' flair to it, but most of the time it feels like having a heavily degraded version of "Rain Man". He may count your cards perfectly one moment, then get stuck trying to untie his shoes. I stopped counting how many times it produced just outright wrong outputs, to the point of suggesting literally the opposite of what one is asking of them. I would not mind it so much if they were being advertised for what they are, not for what they could potentially be, if only another half a trillion dollars were invested in data centers. It is not going to happen with this technology; the issue is structural, not resource-related.
It is absurd. The research and learning power available right now is a miracle.
I go back and forth. I share your amazement. I used Gemini Deep Research the other day and was blown away. It claimed to go read 20 websites, showed its "thinking" and steps, and its conclusions at each step. Then it wrote a large summary (several pages).
On the other hand, I saw github recently added Copilot as a code reviewer. For fun I let it review my latest pull request. I hated its suggestions but could imagine a not too distant future where I'm required by upper management to satisfy the LLM before I'm allowed to commit. Similarly, I've asked ChatGPT questions and it's been programmed to only give answers that Silicon Valley workers have declared "correct".
The thing I always find frustrating about the naysayers is that they seem to think how it works today is the end of it. Like, I recently listened to an episode of EconTalk interviewing someone on AI and education. She lives in the UK and used Tesla FSD as an example of how bad AI is. Yet I live in California and see Waymo mostly working today and lots of people using it. I believe she wouldn't have used the Tesla FSD example, and would possibly have changed her world view at least a little, if she'd updated on seeing self-driving work.
> And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
Except this isn't true. The code quality varies dramatically depending on what you're doing, the length of the chat/context, etc. It's an incredible productivity booster, but even earlier today, I wasted time debugging hallucinated code because the LLM mixed up methods in a library.
The problem isn't so much that it's not an amazing technology, it's how it's being sold. The people who stand to benefit are speaking as though they've invented a god and are scaring the crap out of people making them think everyone will be techno-serfs in a few years. That's incredibly careless, especially when as a technical person, you understand how the underlying system works and know, definitively, that these things aren't "intelligent" the way they're being sold.
Like the startups of the 2010s, everyone is rushing, lying, and huffing hopium deluding themselves that we're minutes away from the singularity.
Really? I just get garbage. Both Claude and Copilot kept insisting that it was OK to use React hooks outside of function components. There have been many other situations where it gave me some code and even after refining the prompt it just gave me wrong or non-working code. I'm not expecting perfection, but at least don't waste my time with hallucinations or just flat-out stuff that doesn't work.
You forget the large group of people who proudly declare they are inventing AGI and that they can make everyone lose their jobs and starve. The complaints are aimed at them, not at you.
Keep in mind it understands nothing. The notion that LLMs understand anything is fundamentally flawed, as they do not demonstrate any markers of understanding.
> Just amazing, doing things we dreamed about for decades.
Chatbots like in the sci-fi of your nostalgia? I never dreamed about that shit, sorry.
> Wah, it can't write code like a Senior engineer with 20 years of experience!
Thank goodness for that too. I want it to help me with my job, not replace me.
No, it can't write code like a Junior Zig engineer with 1 year of experience.
The fact that people call them Markov chains when they clearly haven't used the early chat bots that were dumb Markov chains pisses me off.
The fact that you don't know what Markov chain means and get angry over others over that pisses me off.
Both are Markov chains. That you used to erroneously think "Markov chain" means a way to make a chatbot, rather than a general mathematical process, is on you, not them.
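For anyone who hasn't played with one, here's what a "dumb Markov chain" chatbot-era generator looks like - a minimal sketch, not any particular historical bot, where the state is just the previous word. The point of contention above is that an LLM is also Markovian in the formal sense (its next-token distribution depends only on the current context window), but the state and the learned distribution are incomparably richer:

```python
import random
from collections import defaultdict

def build_bigram_chain(text):
    """Classic word-level Markov chain: the state is only the previous word."""
    chain = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        chain[prev].append(nxt)
    return chain

def generate(chain, start, length=20):
    """Walk the chain by repeatedly sampling a follower of the current word."""
    word = start
    out = [word]
    for _ in range(length):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_bigram_chain(corpus)
print(generate(chain, "the"))
```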
Not one of them has managed to generate a successful promise-based implementation of reCAPTCHA v2 in JavaScript from scratch (https://developers.google.com/recaptcha/docs/loading), even though they have a million-plus references for this.
> it isn't a perfect superintelligence that understands everything perfectly?
It isn't ANY form of intelligence.
I have it help me with reporting (with private info taken out of course). I've easily saved myself hundreds of hours already.
These are the people who give themselves a sense of self-worth by saying that amazing things have no value.
Maybe Freud could explain.
I mean, how would you feel if you coded a menu in Python with certain choices, but when you used it the choices were never the same or in the same order, sometimes there were fake choices, sometimes they were improperly labelled, and sometimes the menu just completely failed to open? And you as a coder and you as a user have absolutely no control over any of those issues. Then, when you go online to complain, people say useful stuff like "Isn't it amazing that it does anything at all!? Give us a break, we're working on it bro."
That's how I see LLMs and the hype surrounding them.
I feel like both standpoints are true. Yes the tech is miraculous, and yes it has a very long way to go.
A lot of it is just plain denial. A certain subgenre of person will forever attack everything AI does because they feel threatened by it, and a certain other subgenre of person will just copy this behaviour and parrot opinions for upvotes/likes/retweets.
For me, LLMs are a bit like being shown a talking dog with the education and knowledge of a first-grade student: a talking dog is amazing in itself, and a truly impressive technical feat; that said, you wouldn't have the dog file your taxes or represent you in court.
To quote Joel Spolsky, "When you’re working on a really, really good team with great programmers, everybody else’s code, frankly, is bug-infested garbage, and nobody else knows how to ship on time.", and that's the state we end up if we believe in the hype and use LLMs willy-nilly.
That's why people are annoyed: not because LLMs cannot code like a senior engineer, but because lots of content marketing and company valuations depend on making people believe they can.
Same
And people keep forgetting how new this stuff is
This is like trashing video games in 1980 because Pong has awful graphics
I blame overpromised expectations from startups and public companies, screaming about AGI and superintelligence.
Truly amazing technology which is very good at generating and correcting text is marketed as a senior developer, a talented artist, and a black box that has the solution to all your problems. This impression shatters on the first blatant mistake, e.g. counting elephant legs: https://news.ycombinator.com/item?id=38766512
For me, I think they're valuable but also overhyped. They're not at the point they're replacing entire dev teams like some articles point out. In addition, they are amazingly accurate sometimes and amazingly misleading other times. I've noticed some ardent advocates ignore the latter.
It's incredibly frustrating when people think they're a miracle tool and blindly copy/paste output without doing any kind of verification. This is especially frustrating when someone who's supposed to be a professional in the field is doing it (copy-pasting non-working AI-generated code and putting it up for review).
That said, on one hand they multiply productive and useful information; on the other hand they kill productivity and spread misinformation. I still see them as useful, but not a miracle.
"It talks to me like a person."
No, it provides responses. It does not talk.
ChatGPT Advanced Voice Mode?
I'll keep bringing up this example whenever people dismiss LLMs.
I can ask Claude the most inane programming question and get an answer. If I were to do that on StackOverflow, I'd get downvoted, rude comments, and my question closed for being off-topic. I don't have to be super knowledgeable about the thing I'm asking about with Claude (or any LLM for that matter).
Even if you ignore the rudeness and elitism of power-users of certain platforms, there's no more waiting for someone to respond to your esoteric questions. Even if the LLM spews bullshit, you can ask it clarifying questions or rephrase until you see something that makes sense.
I love LLMs, I don't care what people say. Even when I'm just spitballing ideas[1], the output is great.
---
[1]: https://blog.webb.page/2025-03-27-spitball-with-claude.txt
Wicked, selfish children! Only the real power users understand that it generates really cool images and talks just like a person.
It's the classic HN-like anti-anything bubble we see with JavaScript frameworks. Hundreds of thousands of people are productive with them and enjoy them. They created entire industries and job fields. The same is happening with LLMs, but the usual counter-culture dev crowd is denying it while it's happening right before their eyes. I too use LLMs every day. I never click on a link and find it doesn't exist. When I want to take my mind off of things, I just talk with GPT.
You're being disingenuous. The tweet was talking about asserting the existence of fake articles, claiming that a paper was written in one year while summarizing a paper that explicitly says it was written in another, and severe hallucinations. Nowhere does she even imply that she's looking for superintelligence.
Obligatory Louis CK "everything is amazing and nobody is happy" bit goes here:
https://youtu.be/aGnMbKwP36U?si=WbXzphhhP8Hak1OQ
It’s a human nature thing - we’re supposed to be collecting nuts in the forest.
What I find interesting is that my experience has been 100% the opposite. I’ve been using ChatGPT, Claude, and Gemini for almost a year (well only the ChatGPT for a year since the rest are more recent.) I’ve been using them to help build circuits and write code. They are almost always wrong with circuit design, and create code that doesn’t work north of 80% of the time. My patience has dropped off to the point where I only experiment with LLM a few times a week because they are so bad. Yes it is miraculous that we can have a conversation, but it means nothing if the output is always wrong.
But I will admit the dora muckbang feet shit is fucking insane. And that just flat out scares the pants off me.
>They are almost always wrong with circuit design, and create code that doesn’t work north of 80% of the time.
Sorry but this is a total skill issue lol. 80% code failure rate is just total nonsense. I don't think 1% of the code I've gotten from LLMs has failed to execute correctly.
Do you work in circuit design?
[dead]
LLMs can't be trusted. They are like an overconfident idiot who is pretending quite impressively, but if you check the result there's just a bit too much bullshit in it. So there's practically zero gain in using LLMs except WHEN you actually need a text that's nice and eloquent bullshit.
Almost every time I've tried using LLMs I've fallen into the pattern of calling out, correcting, and arguing with the LLM, which is of course silly in itself, because they don't learn; they don't really "get it" when they are wrong. There's none of the benefit you'd get from talking to a human.
This is the place where tech shiny meets actual use cases, and users aren’t really good at articulating their problems.
It's also a slow-burn issue - you have to use it for a while for what is obvious to users to become obvious to people who are tech-first.
The primary issue is the hype and forecasted capabilities vs actual use cases. People want something they can trust as much as an authority, not as much as a consultant.
If I were to put it in a single sentence? These are primarily narrative tools, being sold as factual /scientific tools.
When this is pointed out, the conversation often shifts to “well people aren’t that great either”. This takes us back to how these tools are positioned and sold. They are being touted as replacements to people in the future. When this claim is pressed, we get to the start of this conversation.
Frankly, people on HN aren’t pessimistic enough about what is coming down the pipe. I’ve started looking at how to work in 0 Truth scenarios, not even 0 trust. This is a view held by everyone I have spoken to in fraud, misinformation, online safety.
There’s a recent paper which showed that GAI tools improved the profitability of Phishing attempts by something like 50x in some categories, and made previously loss making (in $/hour terms) targets, profitable. Schneier was one of the authors.
A few days ago I found out someone I know who works in finance, had been deepfaked and their voice/image used to hawk stock tips. People were coming to their office to sue them.
I love tech, but this is the dystopia part of cyberpunk being built. These are narrative tools, good enough to make people think they are experts.
The thing LLMs are really really good at, is sounding authoritative.
If you ask it random things the output looks amazing, yes. At least at first glance. That's what they do. It's indeed magical, a true marvel that should make you go: Woooow, this is amazing tech: Coming across as convincing, even if based on hallucinations, is in itself a neat trick!
But is it actually useful? The things they come up with are untrustworthy and on the whole far less good than previously available systems. In many ways, insidiously worse: It's much harder to identify bad information than it was before.
It's almost like we designed a system to pass Turing tests with flying colours but forgot that usefulness is what we actually wanted, not authoritative, human-sounding bullshit.
I don't think the LLM naysayers are 'unimpressed', or that they demand perfection. I think they are trying to make statements aimed at balancing things:
Both the LLMs themselves, and the humans parroting the hype, are severely overstating the quality of what such systems produce. Hence, and this is a natural phenomenon you can observe in all walks of life, the more skeptical folks tend to swing the pendulum the other way, and thus it may come across to you as them being overly skeptical instead.
I totally agree, and this community is far from the worst. In trans communities there's incredible hostility towards LLMs - even local ones. "You're ripping off artists", "A pissing contest for tech bros", etc.
I'm trans, and I don't disagree that this technology has aspects that are problematic. But for me at least, LLMs have been a massive equalizer in the context of a highly contentious divorce where the reality is that my lawyer will not move a finger to defend me. And he's lawyer #5 - the others were some combination of worse, less empathetic, and more expensive. I have to follow up a query several times to get a minimally helpful answer - it feels like constant friction.
ChatGPT was a total game-changer for me. I told it my ex was using our children to create pressure - feeding it snippets of chat transcripts. ChatGPT suggested this might be indicative of coercive control abuse. It sounded very relevant (my ex even admitted one time, in a rare, candid moment, that she feels a need to control everyone around her), so I googled the term - essentially all the components were there except physical violence (with two notable exceptions).
Once I figured that out, I asked it to tell me about laws related to controlling relationships - and it suggested laws directly addressing it (in the UK and Australia) and the closest laws in Germany (Nötigung, Nachstellung, violations of dignity, etc.), translating them to English - my best language. Once you name specific laws broken and provide a rationale for why there's a Tatbestand (i.e. the criterion for a violation is fulfilled), your lawyer has no option but to take you more seriously. Otherwise he could face a malpractice suit.
Sadly, even after naming specific law violations and pointing to email and chat evidence, my lawyer persists in dragging his feet - so much so that the last legal letter he sent wasn't drafted by him - it was ChatGPT. I told my lawyer: read, correct, and send to X. All he did was to delete a paragraph and alter one or two words. And the letter worked.
Without ChatGPT, I would be even more helpless and screwed than I am. It's far from clear I will get justice in a German court, but at least ChatGPT gives me hope, a legal strategy. Lastly - and this is a godsend for a victim of coercive control - it doesn't degrade you. Lawyers do. It completely changed the dynamics of my divorce (4 years - still no end in sight, lost my custody rights, then visitation rights, was subjected to confrontational and gaslighting tactics by around a dozen social workers - my ex is a social worker -, and then I literally lost my hair: telogen effluvium, tinea capitis, alopecia areata... if it's stress-related, I've had it), it gave me confidence when confronting my father and brother about their family violence.
It's been the ONLY reliable help, frankly, so much so I'm crying as I write this. For minorities that face discrimination, ChatGPT is literally a lifeline - and that's more true the more vulnerable you are.
[flagged]
I agree. I recently asked if a certain GPU would fit in a certain computer... And it understood that "fit" could mean physically inside but could also mean that the interface is compatible, and answered both.
WhY aRe PeOpLe BuLlIsH
Did it answer correctly though?
It did. It mentioned PCIe connectors, what connects to what, and said this computer has a motherboard with such-and-such PCIe, the card needs such-and-such, so it's compatible. Regarding physical size, it said that it depends on the physical size of the case (implying that it understood that the size of the card is known but the size of the computer isn't known to it).
[flagged]
It's quite insulting that you just assume I don't know how to read specs. You're either assuming based on nothing, or you're inferring from my comment in which case I worry for your reading comprehension. At no point did I say I didn't know how to find the answer or indeed that I didn't know the answer.
TBH, they produce trash results for almost any question I might want to ask them. This is consistently the case. I must use them differently than other people.
LLMs produce midwit answers. If you are an expert in your domain, the results are kind of what you would expect for someone who isn’t an expert. That is occasionally useful but if I wanted a mediocre solution in software I’d use the average library. No LLM I have ever used has delivered an expert answer in software. And that is where all the value is.
I worked in AI for a long time, and I like the idea. But LLMs are seemingly incapable of replacing anything of value currently.
The elephant in the room is that there is no training data for the valuable skills. If you have to rely on training data to be useful, LLMs will be of limited use.
> No LLM I have ever used has delivered an expert answer...and that's where all the value is.
If this were true, no one would hire junior employees and assistants. There's a huge amount of work that requires more time than expertise.
Here’s when we can start getting excited about LLMs: when they start making new and valid scientific discoveries that can radically change our world.
When an AI can say “Here’s how you make better, smaller, more powerful batteries, follow these plans”, then we will have a reason to worship AI.
When AI can bring us wonders like room-temperature superconductors, fast interstellar travel, anti-gravity tech, and solutions to world hunger and energy consumption, then it will have fulfilled the promise of what AI could do for humanity.
Until then, LLMs are just fancy search and natural language processors. Puppets with strings. It’s about as impressive as Google was when it first came out.
Google was fantastically impressive when it first came out.
Only because most people didn’t understand how it worked.
My experience (almost exclusively Claude), has just been so different that I don't know what to say. Some of the examples are the kinds of things I explicitly wouldn't expect LLMs to be particularly good at so I wouldn't use them for, and others, she says that it just doesn't work for her, and that experience is just so different than mine that I don't know how to respond.
I think that there are two kinds of people who use AI: people who are looking for the ways in which AIs fail (of which there are still many) and people who are looking for the ways in which AIs succeed (of which there are also many).
A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air. Not every use case is like this, but there are many.
-edit- Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
> Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air. Not every use case is like this, but there are many.
The problem is that I feel I am constantly being bombarded by people bullish on AI saying "look how great this is" but when I try to do the exact same things they are doing, it doesn't work very well for me
Of course I am skeptical of positive claims as a result.
I don't know what you are doing or why it's failed. Maybe my primary use cases really are in the top whatever percentile for AI usefulness, but it doesn't feel like it. All I know is that frontier models have already been good enough for more than a year to increase my productivity by a fair bit.
Your use case is in fact in the top whatever percentile for AI usefulness. Short simple scripting that won't have to be relied on due to never being widely deployed. No large codebase it has to comb through, no need for thorough maintenance and update management, no need for efficient (and potentially rare) solutions.
The only use case that would beat yours is the type of office worker that cannot write professional sounding emails but has to send them out regularly manually.
I fully believe it's far better at the kind of coding/scripting that I do than the kind that real SWEs do. If for no other reason than the coding itself that I do is far far simpler and easier, so of course it's going to do better at it. However, I don't really believe that coding is the only use case. I think that there are a whole universe of other use cases that probably also get a lot of value from LLMs.
I think that HN has a lot of people who are working on large software projects that are incredibly complex and have a huge number of interdependencies etc., and LLMs aren't quite to the point that they can very usefully contribute to that except around the edges.
But I don't think that generalizing from that failure is very useful either. Most things humans do aren't that hard. There is a reason that SWE is one of the best paid jobs in the country.
Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.
Real programming is on a totally different scale than what you're describing.
I think that's true for most jobs. Superficially, an AI looks like it can do a good job.
But LLMs:
1. Hallucinate all the time. If they were human we'd call them compulsive liars.
2. They are consistently inconsistent, so are useless for automation.
3. Are only good at anything they can copy from their data set. They can't create, only regurgitate other people's work.
4. AI influencing hasn't happened yet, but it will very soon start making LLMs useless, much like SEO has ruined search. You can bet there are already a load of people seeding the internet with advertising and misinformation aimed solely at AIs and AI reinforcement.
> Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.
For what it's worth, I mostly work on projects in the 100-200 files range, at 20-40k LoC. When using proper tooling with appropriate models, it boosts my productivity by at least 2x (being conservative). I've experimented with this by going a few days without using them, then using them again.
Definitely far from the massive codebases many on here work on, small beans by HN standards. But also decidedly not just writing one-off scripts.
> Real programming is on a totally different scale than what you're describing.
How "real" are we talking?
When I think of "real programming" I think of flight control software for commercial airplanes and, I can assure you, 1 month != 5,000 LoC in that space.
It's not about the size; it's more about whether the task is trivial.
And... I know people who now use AI to write their professional-sounding emails, and they often don't sound as professional as they think they do. It can be easy to just skim what an AI generates and think it's okay to send if you aren't careful, but the people you send those emails to actually have to read what was written and attempt to understand it, and doing that makes you notice things that a brief skim doesn't catch.
It's actually extremely irritating that I'm only half talking to the person when I email with these people.
It's kinda like machine translated novels. You have to really be passionate about the novel to endure these kinds of translations. That's when you realize how much work novel translators do to get a coherent result.
It's especially jarring when you have read translations that put thought into them. I noticed this in xianxia, i.e. Chinese power-fantasy, where the selection of what to translate and what to transliterate can have a huge impact. And then editorial work also becomes important if something in an earlier part needs to be changed based on later information.
I literally had a developer of an open source package I’m working with tell me “yeah that’s a known problem, I gave up on trying to fix it. You should just ask ChatGPT to fix it, I bet it will immediately know the answer.”
Annoying response of course. But I’d never used an LLM to debug before, so I figured I’d give it a try.
First: it regurgitated a bunch of documentation and basic debugging tips, which might have actually been helpful if I had just encountered this problem and had put no thought into debugging it yet. In reality, I had already spent hours on the problem. So not helpful
Second: I provided some further info on environment variables I thought might be the problem. It latched on to that. “Yes that’s your problem! These environment variables are (causing the problem) because (reasons that don’t make sense). Delete them and that should fix things.” I deleted them. It changed nothing.
Third: It hallucinated a magic numpy function that would solve my problem. I informed it this function did not exist, and it wrote me a flowery apology.
Clearly AI coding works great for some people, but this was purely an infuriating distraction. Not only did it not solve my problem, it wasted my time and energy, and threw tons of useless and irrelevant information at me. Bad experience.
The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.
If I give it all my information and add "I think the problem might be X, but I'm not sure", the LLM always agrees that the problem is X and will reinterpret everything else I've said to 'prove' me right.
Then the conversation is forever poisoned and I have to restart an entirely new chat from scratch.
98% of the utility I've found in LLMs is getting it to generate something nearly correct, but which contains just enough information for me to go and Google the actual answer. Not a single one of the LLMs I've tried have been any practical use editing or debugging code. All I've ever managed is to get it to point me towards a real solution, none of them have been able to actually independently solve any kind of problem without spending the same amount of time and effort to do it myself.
> The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.
I'm seeing this sentiment a lot in these comments, and frankly it shows that very few here have actually gone and tried the variety of models available. Which is totally fine, I'm sure they have better stuff to do, you don't have to keep up with this week's hottest release.
To be concrete - the symptom you're talking about is very typical of Claude (or earlier GPT models). o3-mini is much less likely to do this.
Secondly, prompting absolutely goes a huge way to avoiding that issue. Like you're saying - if you're not sure, don't give hints, keep it open-minded. Or validate the hint before starting, in a separate conversation.
I literally got this problem earlier today on ChatGPT, which claims to be based on o4-mini. So no, does not sound like it's just a problem with Claude or older GPTs.
And on "prompting", I think this is a point of friction between LLM boosters and haters. To the uninitiated, most AI hype sounds like "it's amazing magic!! just ask it to do whatever you want and it works!!" When they try it and it's less than magic, hearing "you're prompting it wrong" seems more like a circular justification of a cult follower than advice.
I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively. I buy that. But some more specific advice would be helpful. Cause as is, it sounds more like "LLMs are magic!! didn't work for you? oh, you must be holding it wrong, cause I know they infallibly work magic".
> I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively
I don't buy it this at all.
At best "learning to prompt" is just hitting the slot machine over and over until you get something close to what you want, which is not a skill. This is what I see when people "have a conversation with the LLM"
At worst you are a victim of sunk cost fallacy, believing that because you spent time on a thing that you have developed a skill for this thing that really has no skill involved. As a result you are deluding yourself into thinking that the output is better.. not because it actually is, but because you spent time on it so it must be
On the other hand, when it works it's darn near magic.
I spent like a week trying to figure out why a livecd image I was working on wasn't initializing devices correctly. Read the docs, read source code, tried strace, looked at the logs, found forums of people with the same problem but no solution, you know the drill. In desperation I asked ChatGPT. ChatGPT said "Use udevadm trigger". I did. Things started working.
For some problems it's just very hard to express them in a googleable form, especially if you're doing something weird almost nobody else does.
I started (re)using AI recently. It (and I) mostly failed until I decided on a rule:
If it's "dumb and annoying", I ask the AI; else I do it myself.
Since then, AI has been saving me a lot of time on dumb and annoying things.
Also, a few models are pretty good for basic physics/modeling stuff (getting basic formulas, fetching constants, doing some calculations). These are also pretty useful. I recently used one for ventilation/CO2-related stuff in my room and the calculations matched observed values pretty well; then it pumped out a broken Desmos-syntax formula, and I fixed that by hand and we were good to go. (A rough sketch of that kind of calculation is below.)
---
(dumb and annoying thing -> time-consuming to generate with no "deep thought" involved, easy to check)
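For what it's worth, here is the kind of well-mixed-room CO2 calculation referred to above, as a minimal sketch - the room volume, occupancy, ventilation rate, and per-person CO2 generation figures are assumptions for illustration, not the commenter's actual numbers:

```python
import math

V = 30.0            # room volume in m^3 (assumed)
Q = 25.0 / 3600.0   # ventilation rate: 25 m^3/h expressed in m^3/s (assumed)
G = 2 * 0.005e-3    # CO2 generation: 2 people * ~0.005 L/s each, in m^3/s (rough)
C_out = 420.0       # outdoor CO2 in ppm (approximate)
C0 = 420.0          # starting indoor CO2 in ppm

def co2_ppm(t_seconds):
    """Well-mixed box model: dC/dt = (G/V)*1e6 + (Q/V)*(C_out - C), C in ppm."""
    C_ss = C_out + (G / Q) * 1e6       # steady-state concentration in ppm
    k = Q / V                          # air-change rate in 1/s
    return C_ss + (C0 - C_ss) * math.exp(-k * t_seconds)

for hours in (0, 1, 2, 4, 8):
    print(f"{hours:>2} h: {co2_ppm(hours * 3600):7.0f} ppm")
```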
> For some problems it's just very hard to express them in a googleable form
I had an issue where my Mac would report that my tethered iPhone's batteries were running low when the battery was in fact fine. I had tried googling an answer, and found many similar-but-not-quite-the-same questions and answers. None of the suggestions fixed the issue.
I then asked the 'MacOS Guru' model for chatGPT my question, and one of the suggestions worked. I feel like I learned something about chatGPT vs Google from this - the ability of an LLM to match my 'plain English question without a precise match for the technical terms' is obviously superior to a search engine. I think google etc try synonyms for words in the query, but to me it's clear this isn't enough.
If the solution to "devices not setting up" was "udevadm trigger" and it took an LLM suggestion to get there, I question your google skills.
When I google "linux device not initializing correctly", someone suggesting "udevadm trigger" is the 5th result
Google isn't the same for everyone. Your results could be very different from mine. They're probably not quite the same as months ago either.
I may also have accidentally made it harder by using the wrong word somewhere. A good part of the difficulty of googling for a vague problem is figuring out how to even word it properly.
Also of course it's much easier now that I tracked down what the actual problem was and can express it better. I'm pretty sure I wasn't googling for "devices not initializing" at the time.
But this is where I think LLMs offer a genuine improvement -- being able to deal with vagueness better. Google works best if you know the right words, and sometimes you don't.
There is a difference between a directly correct answer and a “5th result”
There is, but if it's the 5th result then either that exact wording is magic or something is wrong with the story.
And it might not have been the first and only thing ChatGPT said. It got there fast but 5th result isn't too slow either.
Honestly this says more about how bad Google has become than about how good GPT is
This morning I was using an LLM to develop some SQL queries against a database it had never seen before. I gave it a starting point, and outlined what I wanted to do. It proposed a solution, which was a bit wrong, mostly because I hadn't given it the full schema to work with. Small nudges and corrections, and we had something that worked. From there, I iterated and added more features to the outputs.
At many points, the code would have an error; to deal with this, I just supply the error message, as-is to the LLM, and it proposes a fix. Sometimes the fix works, and sometimes I have to intervene to push the fix in the right direction. It's OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three.
A key part of the workflow, imo, was that we were working in the medium of the actual code. If the code is broken, we get an error, and can iterate. Asking for opinions doesn't really help...
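A minimal sketch of that error-feedback loop, assuming a throwaway SQLite database - the table, data, and query here are made up for illustration; the point is that the raw error text is exactly what gets pasted back into the chat:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 5.0)])

# Pretend this arrived from the LLM in the current chat turn.
llm_proposed_sql = """
    SELECT region, SUM(total) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
"""

try:
    for row in conn.execute(llm_proposed_sql):
        print(row)
except sqlite3.Error as exc:
    # This is the text you paste back to the model, as-is, for the next iteration.
    print(f"Paste back to the LLM: {exc}")
```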
I often wonder if people who report that LLMs are useless for code haven't cracked the fact that you need to to have a conversation with it - expecting a perfect result after your first prompt is setting it up for failure, the real test is if you can get to a working solution after iterating with it for a few rounds.
As someone who has finally found a way to increase productivity by adding some AI, my lesson has sort of been the opposite. If the initial response after you've provided the relevant context isn't obviously useful: give up. Maybe start over with slightly different context. A conversation after a bad result won't provide any signal you can do anything with, there is no understanding you can help improve.
It will happily spin forever responding in whatever tone is most directly relevant to your last message: provide an error and it will suggest you change something (it may even be correct every once in a while!), suggest a change and it'll tell you you're obviously right, suggest the opposite and you will be right again, ask if you've hit a dead end and yeah, here's why. You will not learn anything or get anywhere.
A conversation will only be useful if the response you got just needs tweaks. If you can't tell what it needs feel free to let it spin a few times, but expect to be disappointed. Use it for code you can fully test without much effort, actual test code often works well. Then a brief conversation will be useful.
Why would I do this, when I can just write it from scratch in less time than it takes you to have this conversation with the LLM?
Because once you get good at using LLMs you can write it with 5 rounds with an LLM in way less time than it would have taken you to type out the whole thing yourself, even if you got it exactly right first time coding it by hand.
I suspect this is only true if you are lousy at writing code or have a very slow typing speed
I suspect the opposite is only true if you haven't taken the time to learn how to productively use LLMs for coding.
(I've written a fair bit about this: https://simonwillison.net/tags/ai-assisted-programming/ and https://simonwillison.net/2025/Mar/11/using-llms-for-code/ and 80+ examples of tools I've built mostly with LLMs on https://tools.simonwillison.net/colophon )
Maybe I've missed it, but what did you use to perform the actual code changes on the repo?
You mean for https://tools.simonwillison.net/colophon ?
I've used a whole bunch of techniques.
Most of the code in there is directly copied and pasted in from https://claude.ai or https://chatgpt.com - often using Claude Artifacts to try it out first.
Some changes are made in VS Code using GitHub Copilot
I've used Claude Code for a few of them https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
Some were my own https://llm.datasette.io tool - I can run a prompt through that and save the result straight to a file
The commit messages usually link to either a "share" transcript or my own Gist showing the prompts that I used to build the tool in question.
So the main advantage is that LLMs can type faster than you?
Yes, exactly.
Burning down the rainforests so I don’t have to wait for my fingers.
The environmental impact of running prompts through (most) of these models is massively over-stated.
(I say "most" because GPT-4.5 is 1000x the price of GPT-4o-mini, which implies to me that it burns a whole lot more energy.)
If you do a basic query to GPT-4o every ten seconds it uses a blistering... hundred watts or so. More for long inputs, less when you're not using it that rapidly.
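The arithmetic behind that figure, as a rough sketch - the per-query energy number is an assumption based on commonly cited rough estimates (around 0.3 Wh for a short GPT-4o-class query), not an official figure:

```python
wh_per_query = 0.3                  # assumed energy per short query, in Wh (rough estimate)
queries_per_hour = 3600 / 10        # one query every ten seconds
average_watts = wh_per_query * queries_per_hour  # Wh per hour == average W
print(f"~{average_watts:.0f} W average draw")    # ~108 W
```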
This is honestly really unimpressive
Typing speed is not usually the constraint for programming, for a programmer that knows what they are doing
Creating the solution is the hard work, typing it out is just a small portion of it
I know. That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do. (One recent example: https://simonwillison.net/2024/Sep/10/software-misadventures... )
(I get boosts from LLMs to a bunch of activities too, like researching and planning, but those are less obvious than the coding acceleration.)
> That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do
This explains it then. You aren't a software developer
You get a productivity boost from LLMs when writing code because it's not something you actually do very much
That makes sense
I write code for probably between 50-80% of any given week, which is pretty typical for any software dev I've ever worked with at any company I've ever worked at
So we're not really the same. It's no wonder LLMs help you, you code so little that you're constantly rusty
I'm a software developer: https://github.com/simonw
I very much doubt you spend 80% of your working time actively typing code into a computer.
My other activities include:
- Researching code. This is a LOT of my time - reading my own code, reading other code, reading through documentation, searching for useful libraries to use, evaluating if those libraries are any good.
- Exploratory coding in things like Jupyter notebooks, Firefox developer tools etc. I guess you could call this "coding time", but I don't consider it part of that 10% I mentioned earlier.
- Talking to people about the code I'm about to write (or the code I've just written).
- Filing issues, or updating issues with comments.
- Writing documentation for my code.
- Straight up thinking about code. I do a lot of that while walking the dog.
- Staying up-to-date on what's new in my industry.
- Arguing with people about whether or not LLMs are useful on Hacker News.
"typing code is a small portion of programming"
"I agree, only 10% of what I do is typing code"
"that explains it, you aren't a software developer"
What the hell?
You should check out Simon’s wikipedia and github pages, when you have time between your coding sprints.
You must not be learning very many new things then if you can't see a benefit to using an LLM. Sure, for the normal crud day-to-day type stuff, there is no need for an LLM. But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.
Sure, it often spits out incomplete, non-ideal, or plain wrong answers, but that's where having SWE experience comes in to play to recognize it
> But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.
In the middle of this thought, you changed the context from "learning new things" to "not being faster than an LLM"
It's easy to guess why. When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything
But yes, you're right. I don't learn new things from scratch very often, because I'm not changing contexts that frequently.
I want to be someone who had 10 years of experience in my domain, not 1 year of experience repeated 10 times, which means I cannot be starting over with new frameworks, new languages and such over and over
"When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything"
Here's some code I threw together without even looking at yesterday: https://github.com/simonw/tools/blob/main/incomplete-json-pr... (notes here: https://simonwillison.net/2025/Mar/28/incomplete-json-pretty... )
Reading it now, here are the things it can teach me:
- That's a very clean example of CSS variables, which I've not used before in my own projects. I'll probably use that pattern myself in the future.
- Really nice focus box-shadow effect there, another one for me to tuck away for later.
- It honestly wouldn't have crossed my mind that embedding a tiny SVG inline inside a button could work that well for simple icons.
- Very clean example of clipboard interaction using navigator.clipboard.writeText
- And the final chunk of code on the page is a very pleasing implementation of a simple character-by-character non-validating JSON parser which indents as it goes: https://github.com/simonw/tools/blob/1b9ce52d23c1335777cfedf...
That's half a dozen little tricks I've learned from just one tiny LLM project which I only spent a few minutes on.
My point here is that if you actively want to learn things, LLMs are an extraordinary gift.
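To make that last item concrete, here is a minimal Python sketch of the same indent-as-you-go, non-validating idea - it is not the code from the linked JavaScript tool, just the general technique of tracking only string state and nesting depth, so even truncated JSON can be formatted:

```python
def pretty_print_incomplete_json(text, indent="  "):
    """Indent-as-you-go pretty printer for possibly incomplete JSON.

    Non-validating: it only tracks string/escape state and nesting depth,
    so it can format a stream that has been cut off mid-document.
    """
    out = []
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch in "{[":
            depth += 1
            out.append(ch + "\n" + indent * depth)
        elif ch in "}]":
            depth = max(depth - 1, 0)
            out.append("\n" + indent * depth + ch)
        elif ch == ",":
            out.append(",\n" + indent * depth)
        elif ch == ":":
            out.append(": ")
        elif ch.isspace():
            continue  # drop original whitespace; we re-insert our own
        else:
            out.append(ch)
    return "".join(out)

# Works even though the input is cut off mid-string:
print(pretty_print_incomplete_json('{"a": [1, 2, {"b": "tru'))
```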
Exactly! I learn all kinds of things besides coding-related things, so I don't see how it's any different. ChatGPT 4o does an especially good job of walking thru the generated code to explain what it is doing. And, you can always ask for further clarification. If a coder is generating code but not learning anything, they are either doing something very mundane or they are being lazy and just copy/pasting without any thought--which is also a little dangerous, honestly.
It really depends on what you're trying to achieve.
I was trying to prototype a system and created a one-pager describing the main features, objectives, and restrictions. This took me about 45 minutes.
Then I feed it into Claude and asked to develop said system. It spent the next 15 minutes outputting file after file.
Then I ran "npm install" followed by "npm run" and got a "fully" (API was mocked) functional, mobile-friendly, and well documented system in just an hour of my time.
It'd have taken me an entire day of work to reach the same point.
Yeah nah. The endless loop of useless suggestions or "solutions" is very easily achievable and common, at least in my use cases, no matter how much you iterate with it. Iterating gets counter-productive pretty fast, imo. (Using 4o.)
When I use Claude to iterate/troubleshoot I do it in a project and in multiple chats. So if I test something and it throws an error or gives an unexpected result, I'll start a new chat to deal with that problem, correct the code, update that in the project, then go back to my main thread and say "I've updated this" and provide it the file, "now let's do this". When I started doing this it massively reduced the LLM getting lost or going off on weird quests. Iteration in side chats, regroup in the main thread. And then possibly another overarching "this is what I want to achieve" thread where I update it on the progress and ask what we should do next.
I have been thinking about this a lot recently. I have a colleague who simply can’t use LLMs for this reason - he expects them to work like a logical and precise machine, and finds interacting with them frustrating, weird and uncomfortable.
However, he has a very black and white approach to things and he also finds interacting with a lot of humans frustrating, weird and uncomfortable.
The more conversations I see about LLMs the more I’m beginning to feel that “LLM-whispering” is a soft skill that some people find very natural and can excel at, while others find it completely foreign, confusing and frustrating.
It really requires self-discipline to ignore the enthusiasm of the LLM as a signal for whether you are moving in the direction of a solution. I blame myself for lazy prompting, but have a hard time not just jumping in with a quick project, hoping the LLM can get somewhere with it, and not attempt things that are impossible, etc.
> OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three
If you have any reasonable understanding of SQL, I guarantee you could brush up on it and write it yourself in less than a couple of hours unless you're trying to do something very complex
SQL is absolutely trivial to write by hand
Sure, I could do that. But I would learn where to put my join statements relative to the where statements, and then forget it again in a month because I have lots of other tihngs that I actually need to know on a daily basis. I can easily outsource the boilerplate to the LLM and get to a reasonable starting place for free.
Think of it as managing cognitive load. Wandering off to relearn SQL boilerplate is a distraction from my medium-term goal.
edit: I also believe I'm less likely to get a really dumb 'gotcha' if I start from the LLM rather than cobbling together knowledge from some random docs.
If you don’t take care to understand what the LLM outputs, how can you be confident that it works in the general case, edge cases and all? Most of the time that I spend as a software engineer is reasoning about the code and its logic to convince myself it will do the right thing in all states and for all inputs. That’s not something that can be offloaded to an LLM. In the SQL case, that means actually understanding the semantics and nuances of the specific SQL dialect.
Obviously to a mega super genius like yourself an LLM is useless. But perhaps you can consider that others may actually benefit from LLMs, even if you’re way too talented to ever see a benefit?
You might also consider that you may be over-indexing on your own capabilities rather than evaluating the LLM’s capabilities.
Lets say an llm is only 25% as good as you but is 10% the cost. Surely you’d acknowledge there may be tasks that are better outsourced to the llm than to you, strictly from an ROI perspective?
It seems like your claim is that since you’re better than LLMs, LLMs are useless. But I think you need to consider the broader market for LLMs, even if you aren’t the target customer.
Knowing SQL isn't being a "mega super genius" or "way talented". SQL is flawed, but being hard to learn is not among its flaws. It's designed for untalented COBOL mainframe programmers on the theory that Codd's relational algebra and relational calculus would be too hard for them and prevent the adoption of relational databases.
However, whether SQL is "trivial to write by hand" very much depends on exactly what you are trying to do with it.
That makes sense, and from what I’ve heard this sort of simple quick prototyping is where LLM coding works well. The problem with my case was I’m working with multiple large code bases, and couldn’t pinpoint the problem to a specific line, or even file. So I wasn’t gonna just copy multiple git repos into the chat
(The details: I was working with running a Bayesian sampler across multiple compute nodes with MPI. There seemed to be a pathological interaction between the code and MPI where things looked like they were working, but never actually progressed.)
I wonder if it breaks like this: people who don't know how to code find LLMs very helpful and don't realize where they are wrong. People who do know immediately see all the things they get wrong and they just give up and say "I'll do it myself".
> Small nudges and corrections, and we had something that worked. From there, I iterated and added more features to the outputs.
FWIW, I've seen people online refer to this as "vibe coding".
This is exactly my experience, every time! If I offer it the slightest bit of context it will say 'Ah! I understand now! Yes, that is your problem, …' and proceed to spit out some non-existent function, sometimes the same one it has just suggested a few prompts ago which we already decided doesn't exist/work. And it just goes on and on giving me 'solutions' until I finally realise it doesn't have the answer (which it will never admit unless you specifically ask it to – forever looking to please) and give up.
My experiences have all been like this too. I am puzzled by how some people say it works for them
I wrote this article precisely for people who are having trouble getting good results out of LLMs for coding: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
I’ve followed your blog for a while, and I have been meaning to unsubscribe because the deluge of AI content is not what I’m looking for.
I read the linked article when it was posted, and I suspect a few things that are skewing your own view of the general applicability of LLMs for programming. One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
I think it’s great that it’s a technology you’re passionate about and that it’s useful for you, but my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful. And that’s okay, it doesn’t have to be all things to all people. But it’s not fair to say that we’re just holding it wrong.
"my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful."
It's possible that changed this week with Gemini 2.5 Pro, which is equivalent to Claude 3.7 Sonnet in terms of code quality but has a 1 million token context (with excellent scores on long-context benchmarks) and an increased output limit too.
I've been dumping hundreds of thousands of tokens of codebase into it and getting very impressive results.
See this is one of the things that’s frustrating about the whole endeavor. I give it an honest go, it’s not very good, but I’m constantly exhorted to try again because maybe now that Model X 7.5qrz has been released, it’ll be really different this time!
It’s exhausting. At this point I’m mostly just waiting for it to stabilize and plateau, at which point it’ll feel more worth the effort to figure out whether it’s now finally useful for me.
Not going to disagree that it's exhausting! I've been trying to stay on top of new developments for the past 2.5 years and there are so many days when I'll joke "oh, great, it's another two new models day".
Just on Tuesday this week we got the first widely available high quality multi-modal image output model (GPT-4o images) and a new best-overall model (Gemini 2.5) within hours of each other. https://simonwillison.net/2025/Mar/25/
> One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
Take a look at the 2024 StackOverflow survey.
70% of professional developer respondents had only done extensive work over the last year in one of:
JS 64.6%, SQL 54.1%, HTML/CSS 52.9%, Python 46.9%, TS 43.4%, Bash/Shell 34.2%, Java 30%
LLMs are of course very strong in all of these. 70% of developers only code in languages LLMs are very strong at.
If anything, for the developer population at large, this number is even higher than 70%. The survey respondents are overwhelmingly American (where the dev landscape is more diverse), and self-select to those who use niche stuff and want to let the world know.
A similar argument can be made for median codebase size, in terms of LOC written every year. A few days ago he also gave Gemini Pro 2.5 a whole codebase (at ~300k tokens) and it performed well. Even in huge codebases, if any kind of separation of concerns is involved, that's enough to give it all the context relevant to the part of the code you're working on. [1]
[1] https://simonwillison.net/2025/Mar/25/gemini/
What’s 300k tokens in terms of lines of code? Most codebases I’ve worked on professionally have easily eclipsed 100k lines, not including comments and whitespace.
But really that’s the vision of actual utility that I imagined when this stuff first started coming out and that I’d still love to see: something that integrates with your editor, trains on your giant legacy codebase, and can actually be useful answering questions about it and maybe suggesting code. Seems like we might get there eventually, but I haven’t seen that we’re there yet.
We hit "can actually be useful answering questions about it" within the last ~6 months with the introduction of "reasoning" models with 100,000+ token context limits (and the aforementioned Gemini 1 million/2 million models).
The "reasoning" thing is important because it gives models the ability to follow execution flow and answer complex questions that span many different files and classes. I'm finding it incredible for debugging, eg: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8...
I built a files-to-prompt tool to help dump entire codebases into the larger models and I use it to answer complex questions about code (including other people's projects written in languages I don't know) several times a week. There's a bunch of examples of that here: https://simonwillison.net/search/?q=Files-to-prompt&sort=dat...
How many lines of context and understanding can we, human developers, keep in our heads, take into account, and refer to when implementing something?
Whatever the amount may be, it definitely fits into 300k tokens.
After more than a few years working on a codebase? Quite a lot. I know which interfaces I need and from where, what the general areas of the codebase are, and how they fit together, even if I don’t remember every detail of every file.
> But it’s not fair to say that we’re just holding it wrong.
<troll>Have you considered that asking it to solve problems in areas it's bad at solving problems is you holding it wrong?</troll>
But, actually seriously, yeah, I've been massively underwhelmed with the LLM performance I've seen, and just flabbergasted with the subset of programmer/sysadmin coworkers who ask it questions and take those answers as gospel. It's especially frustrating when it's a question about something that I'm very knowledgeable about, and I can't convince them that the answer they got is garbage because they refuse to so much as glance at supporting documentation.
LLMs need to stay bad. What is going to happen if we have another few GPT-3.5 to Gemini 2.5 sized steps? You're telling people who need to keep the juicy SWE gravy train running for another 20 years to recognize that the threat is indeed very real. The writing is on the wall and no one here (here on HN especially) is going to celebrate those pointing to it.
I don't think people really realize the danger of mass unemployment
Go look up what happens in history when tons of people are unemployed at the same time with no hope of getting work. What happens when the unemployed masses become desperate?
Naw I'm sure it will be fine, this time will be different
Just wanted to chime in and say how appreciative I’ve been about all your replies here, and overall content on AI. Your takes are super reasonable and well thought out.
Exactly which model did you use? You talk about LLMs as though they are all the same.
Alien 1: I gave Jeff Dean a giant complex system to build, he crushed it! Humans are so smart.
Alien 2: I gave a random human a simple programming problem and he just stared at me like an idiot. Humans suck.
It's worse than that.
I see people say, "Look how great this is," and show me an example, and the example they show me is just not great. We're literally looking at the same thing, and they're excited that this LLM can do a college grad's job to the level of a third grader, and I'm just not excited about that.
What changed my point of view regarding LLMs was when I realized how crucial context is in increasing output quality.
Treat the AI as a freelancer working on your project. How would you ask a freelancer to create a Kanban system for you? By simply asking "Create a Kanban system", or by providing them a 2-3 pages document describing features, guidelines, restrictions, requirements, dependencies, design ethos, etc?
Which approach will get you closer to your objective?
The same applies to LLM (when it comes to code generation). When well instructed, it can quickly generate a lot of working code, and apply the necessary fixes/changes you request inside that same context window.
It still can't generate senior-level code, but it saves hours when doing grunt work or prototyping ideas.
"Oh, but the code isn't perfect".
Nor is the code of the average junior dev, but their code still makes it to production in thousands of companies around the world.
I see it as a knowledge multiplier. You still need to know enough about the subject to verify the output.
They're sophisticated tools, as much as any other software.
About 2 weeks ago I started on a streaming markdown parser for the terminal because none really existed. I've switched to human coding now, but the first version was basically all LLM prompting, and a bunch of the code is still LLM-generated (maybe 80%). It's a parser; those are hard. There are stacks, states, lookaheads, look-behinds, feature flags, color spaces, support for things like links and syntax highlighting... all forward streaming. Not easy.
https://github.com/kristopolous/Streamdown
Exactly. Thanks to all the money involved in the hype, the incentives will always skew towards spamming naive optimism about its features.
> LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Exactly this.
I once had a function that would generate several .csv reports. I wanted these reports to then be uploaded to s3://my_bucket/reports/{timestamp}/.csv
I asked ChatGPT "Write a function that moves all .csv files in the current directory to an old_reports directory, calls a create_reports function, then uploads all the csv files in the current directory to s3://my_bucket/reports/{timestamp}/.csv with the timestamp in YYYY-MM-DD format"
And it created the code perfectly. I knew what the correct code would look like, I just couldn't be fucked to look up the exact calls to boto3, whether moving files was os.move or os.rename or something from shutil, and the exact way to format a datetime object.
It created the code far faster than I would have.
Like, I certainly wouldn't use it to write a whole app, or even a whole class, but individual blocks like this, it's great.
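For reference, a function along those lines is only a dozen-odd lines. Here's a rough sketch of the kind of thing being described (assuming the create_reports function and my_bucket bucket named in the prompt; this is an illustration, not the actual ChatGPT output):

```python
import datetime
import shutil
from pathlib import Path

import boto3

def rotate_and_upload_reports(bucket: str = "my_bucket") -> None:
    """Move existing CSVs aside, regenerate reports, then upload them to S3."""
    old_dir = Path("old_reports")
    old_dir.mkdir(exist_ok=True)

    # Move any existing .csv files out of the way (shutil.move, not os.move).
    for csv_file in Path(".").glob("*.csv"):
        shutil.move(str(csv_file), str(old_dir / csv_file.name))

    # Regenerate the reports. create_reports() is the existing report
    # generator referenced in the prompt above (assumed to be defined elsewhere).
    create_reports()

    # Upload the fresh CSVs under a YYYY-MM-DD prefix.
    timestamp = datetime.date.today().strftime("%Y-%m-%d")
    s3 = boto3.client("s3")
    for csv_file in Path(".").glob("*.csv"):
        s3.upload_file(str(csv_file), bucket, f"reports/{timestamp}/{csv_file.name}")
```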
I have been saying this about llms for a while - if you know what you want, how to ask for it, and what the correct output will look like, LLMs are fantastic (at least Claude Sonnet is). And I mean that seriously, they are a highly effective tool for productive development for senior developers.
I use it to produce whole classes, large sql queries, terraform scripts, etc etc. I then look over that output, iterate on it, adjust it to my needs. It's never exactly right at first, but that's fine - neither is code I write from scratch. It's still a massive time saver.
> they are a highly effective tool for productive development for senior developers
I think this is the most important bit many people miss. It is advertised as an autonomous software developer, or something that can take a junior to senior levels, but that's just advertising.
It is actually most useful for senior developers, as it does the grunt work for them, while grunt work is actually useful work for a junior developer as a learning tool.
Precisely -- you have to be experienced in your field to use these tools effectively.
These are power tools for the mind. We've been working with the equivalent of hand tools, now something new came along. And yeah, a hole hawg will throw you clear off a ladder if you're not careful -- does that mean you're going to bore 6" holes in concrete ceilings by hand? Think not.
> It is advertised as an autonomous software developer
By a few currently niche VC players, I guess. I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
> I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
Are you sure about that? [1]:
> "I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code," Amodei said at a Council of Foreign Relations event on Monday.
[1] https://www.entrepreneur.com/business-news/anthropic-ceo-pre...
This depends on the circles you move in.
I have capital allocator friends warning me about vibe coding taking my job
"How to ask for it" is the most important part. As soon as you realize that you have to provide the AI with CONTEXT and clear instructions (you know, like a top-notch story card on a scrum board), the quality and assertiveness of the results increase a LOT.
Yes, it WON'T produce senior-level code for complex tasks, but it's great at tackling junior- to mid-level code generation/refactoring, with minor adjustments (just like a code review).
So, it's basically the same thing as having a freelancer jr dev at your disposal, but it can generate working code in 5 min instead of 5 hours.
I've had so many cases exactly like your example here. If you build up an intuition that knows that e.g. Claude 3.7 Sonnet can write code that uses boto3, and boto3 hasn't had any breaking changes that would affect S3 usage in the past ~24 months, you can jump straight into a prompt for this kind of task.
It doesn't just save me a ton of time, it results in me building automations that I normally wouldn't have taken on at all because the time spent fiddling with os.move/boto3/etc wouldn't have been worthwhile compared to other things on my plate.
I think you have an interesting point of view and I enjoy reading your comments, but it sounds a little absurd and circular to discount people's negativity about LLMs simply because it's their fault for using an LLM for something it's not good at. I don't believe in the strawman characterization of people giving LLMs incredibly complex problems and being unreasonably judgemental about the unsatisfactory results. I work with LLMs every day. Companies pay me good money to implement reliable solutions that use these models and it's a struggle. Currently I'm working with Claude 3.5 to analyze customer support chats. Just as many times as it makes impressive, nuanced judgments it fails to correctly make simple trivial judgements. Just as many times as it follows my prompt to a tee, it also forgets or ignores important parts of my prompt. So the problem for me is it's incredibly difficult to know when it'll succeed and when it'll fail for a given input. Am I unreasonable for having these frustrations? Am I unreasonable for doubting the efficacy of LLMs to address problems that many believe are already solved? Can you understand my frustration to see people characterize me as such because ChatGPT made a really cool image for them once?
It's a weird circle with these things. If you _can't_ do the task you are using the LLM for, you probably shouldn't.
But if you can do the task well enough to at least recognize likely-to-be-correct output, then you can get a lot done in less time than you would do it without their assistance.
Is that worth the second order effects we're seeing? I'm not convinced, but it's definitely changed the way we do work.
I think this points to much of the disagreement over LLMs. They can be great at one-off scripts and other similar tasks like prototypes. Some folks who do a lot of that kind of work find the tools genuinely amazing. Other software engineers do almost none of that and instead spend their coding time immersed in large messy code bases, with convoluted business logic. Looping an LLM into that kind of work can easily be net negative.
Maybe they are just lazy around tooling. Cursor with Claude works well for project sizes much larger than I expected but it takes a little set up. There is a chasm between engineers who use tools well and who do not.
I don't really agree with framing it as lazy. Adding more tools and steps to your workflow isn't free, and the cost/benefit of each tool will be different for everyone. I've lost count of how many times someone has evangelized a software tool to me, LLM or not. Once in a while they turn out to be useful and I incorporate them into my regular workflow, but far more often I don't. This could be for any number of reasons: it doesn't fit my workflow well, or I already have a better way of doing whatever it does, or the tool adds more friction than it removes.
I'm sure spending more time fiddling with the setup of LLM tools can yield better results, but that doesn't mean that it will be worth it for everyone. In my experience LLMs fail often enough at modestly complex problems that they are more hassle than benefit for a lot of the work I do. I'll still use them for simple tasks, like if I need some standard code in a language I'm not too familiar with. At the same time, I'm not at all surprised that others have a different experience and find them useful for larger projects they work on.
I'm tired of people bashing LLMs. AI is so useful in my daily work that I can't understand where these people are coming from. Well, whatever...
As you said, these are examples of things I wouldn't expect LLMs to be good at, from people who dismiss the scenarios where LLMs are great. I don't want to convince anyone, to be honest - I just want to say they are incredibly useful for me and a huge time saver. If people don't want to use LLMs, that's fine by me, as I'll have an edge over them in the market. Thanks for the cash, I guess.
I'm starting to resign myself to just enjoying the benefits and let those who can't evolve fall behind.
I'm growing weary of trying to help people use these tools properly.
I'm pretty weary of all the people telling me I'm "holding it wrong"
I'll give you a simple and silly example which could give you additional ideas. LLMs can be great for checking whether people can understand something.
One day I came up with a joke and wondered whether people would "get it". I told the joke to ChatGPT and asked it to explain it back to me. ChatGPT did a great job and nailed what's supposedly funny about the joke. I used it in an email so I have no idea whether anyone found it funny, but at least I know it wasn't too obscure. If an AI can understand a joke, there's a good chance people will understand it too.
This might not be super useful, but it demonstrates that LLMs aren't only about generating text for copy-and-paste or retrieving information. It's "someone" you can bounce ideas off and ask for opinions, and that's how I use it most frequently.
Every time someone brings up "code that doesn't need to deal with edge cases," I like to point out that such code is not likely to be used for anything that matters.
Oh, but it is. I can have code that does something nice to have and doesn't need to be 100% correct, etc. For example, I want a background for my playful webpage. Maybe a WebGL shader. It might not be exactly what I asked for, but I can have it up and running in a few minutes. Or some non-critical internal tools - like a scraper for lunch menus from restaurants around the office. Or a simple parking-spot-sharing app. Or any kind of prototype, which in some companies are being created all the time. There are so many use cases that are forgiving regarding correctness and much more sensitive to development effort.
There is a cost burden to not being 100% correct when it comes to programming. You simply have chosen to ignore that burden, but it still exists for others. Whether it's for example a percent of your users now getting stalled pages due to the webgl shader, or your lunch scraper ddosing local restaurants. They aren't actually forgiving regarding correctness.
Which is fine for actual testing you're doing internally, since that cost burden is then remedied by you fixing those issues. However, no feature is as free as you're making it sound, not even the "nice to have" additions that seem so insignificant.
I never said it's free. (But also, aiming for 100% correctness is very, very expensive.) I'm talking about trading correctness, readability, security and maybe others for other metrics. What I said is just that not every project that has value should be optimized for the same metrics. Bank or medical software needs to be as close to 100% correct as possible. Some tool I'm creating for my team to simplify a process does not necessarily need to be. I would not mind my WebGL shader possibly causing problems for some users. It would get reported and fixed. Or not. It's my call what I spend my effort on.
Of course the tradeoffs should be well considered. That's why it may get out of hand real bad if software will be created (or vibe coded) by people with little understanding of these metrics and tradeoffs. I'm absolutely not advocating for that.
I’m always amazed in these discussions how many people apparently have jobs doing a bunch of stuff that either doesn’t need to be correct or is simple enough that it doesn’t require any significant amount of external context.
I'm always amazed by the arrogance that if you can't hack it, then everyone else can't either.
Yes, my arrogance amazes even me.
The point is more that everyone seems to acknowledge that a) output is spotty, and b) it’s difficult to provide enough context to work on anything that’s not fairly self-contained. And yet we also constantly have people saying that they’re using AI for some ridiculous percentage of their actual job output. So, I’m just curious how one reconciles those two things.
Either most people’s jobs consist of a lot more small, self-contained mini-projects than my jobs generally have, or people’s jobs are more accepting of incorrect output than I’m used to, or people are overstating their use of the tool.
Or something else!
In other words, people preaching LLMs are noobs with no real stake in what they are doing. But you can't really call people noobs these days.
Ha, a little harsher than I intended it to come off, but it really does make me wonder.
Is such code hard to write in the first place?
Automating the easy 80% sounds useful, but in practice I'm not convinced that's all that helpful. Reading and putting together code you didn't write is hard enough to begin with.
It's not hard, but it's time consuming.
The things I'm wary of are pitfalls that are often documented only in the command/function docs. Kinda like rsync and how it handles trailing slashes at the end of the path. Which is why I always took a moment to read them.
You never write code to automate something in your personal environment, or to analyze some one-off data, rather than to actually go into production?
Not GP, but more often than not I reach for tools I already know (sed, awk, python) or read the docs, which doesn't take that much time if you know how to get to the sections you need.
While you're scanning through the docs, I'm already done and working on the next task.
I write code like that all the time. It's used for very specific use cases, only by myself or something I've also written. It's not exposed to random end users or inputs.
> Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
I’ve never seen it from my students. Why do you think this? It’s trivial to pick a real book/article. No student is generating fake material whole cloth and fake references to match. Even if they could, why would they risk it?
Exactly. Lazy students just refer to vaguely related but existing material that they didn't read. Much better than llms! :-)
TBD whether that makes the effort to spot-check their references greater (does it actually say what the student - explicitly or implicitly - claims it does?) or less (proving the non-existence of an obscure reference is proving a negative).
I know arguments from authority aren't primary, but I think this point highlights some important context: Dr. Hossenfelder has gained international renown by publishing clickbait-y YouTube videos that ostensibly debunk scientific and technological advances of all kinds. She's clearly educated and thoughtful (not to mention otherwise gainfully employed), but her whole public persona kinda relies on assuming the exclusively-critical standpoint you mention.
I doubt she necessarily feels indebted to her large audience expecting this take (it's not new...), but that certainly does seem like a hard cognitive habit to break.
More often than not, when I inquire deeper, I find their prompting isn't very good at all.
"Garbage in, garbage out" as the law says.
Of course, it took a lot of trial and error for me to get to my current level of effectiveness with LLMs. It's probably our responsibility to teach these who are willing.
It seems hard to be bullish on LLMs as a generally useful tool if the solution to problems people have is "use trial and error to improve how you write your prompts, no, it's not obvious how to do so, yes, it depends heavily on the exact model you use."
You could say that about any power tool.
A Mitre Saw is an amazing thing to have in a woodshop, but if you don't learn how to use it you're probably going to cut off a finger.
The problem is that LLMs are power tools that are sold as being so easy to use that you don't need to invest any effort in learning them at all. That's extremely misleading.
Unlike the manufacturers of LLMs, the manufacturers of real-world power tools
* Know how they work
* Are legally liable for defects in design or manufacture that cause injury, death, or property damage
* Provide manuals that instruct the operator how to effectively and safely use the power tool
> Are legally liable for defects in design or manufacture that cause injury, death, or property damage
Except when you use them for purposes other than those the manufacturer declared - then it's on you. Similarly, you get plenty of warnings about the limitations and suitability of LLMs from the major vendors, including warnings directly in the UI. The limitations of LLMs are common knowledge. Like almost everyone, you ignore them, but then the consequences are on you too.
> Provide manuals that instruct the operator how to effectively and safely use the power tool
LLMs come with manuals much, much more extensive than any power tool ever (or at least since 1960s or such, as back then hardware was user-serviceable and manuals weren't just generic boilerplate).
As for:
> Know how they work
That is a real difference between power tool manufacturers and LLM vendors, but if you switch to comparing against the pharmaceutical industry, they don't know how most of their products work either. So knowing how a product works is not a requirement for it to be useful and worth having available.
So is using an LLM to write SQL for you like using a mitre saw instead of a table saw? I guess the crux is that you still need to do work either way.
Using LLMs to write SQL is a fascinating case because there are so many traps you could fall into that aren't really the fault of the LLM.
My favorite example: you ask the LLM for "most recent restaurant opened in California", give it a schema and it tries "select * from restaurants where state = 'California' order by open_date desc" - but that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
There are tricks that can help here - I've tried sending the LLM an example row from each table, or you can set up a proper loop where the LLM gets to see the results and iterate on them - but it reflects the fact that interacting with databases can easily go wrong no matter how "smart" the model you are using is.
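A minimal sketch of that "send an example row from each table" trick, assuming a SQLite database (the function name and prompt wording here are made up):

```python
import sqlite3

def build_sql_prompt(db_path: str, question: str) -> str:
    """Assemble a prompt containing each table's schema plus one sample row,
    so the model can see that the state column holds 'CA', not 'California'."""
    conn = sqlite3.connect(db_path)
    parts = [f"Answer the question below with a single SQL query.\nQuestion: {question}\n"]
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        schema = conn.execute(
            "SELECT sql FROM sqlite_master WHERE name = ?", (table,)).fetchone()[0]
        sample = conn.execute(f"SELECT * FROM {table} LIMIT 1").fetchone()
        parts.append(f"{schema}\n-- sample row: {sample}\n")
    return "\n".join(parts)
```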
> that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
As you’ve identified, rather than just giving it the schema, you give it the schema and some data when you tell it what you want.
A human might make exactly the same error - based on misassumption - and would then look at the data to see why it was failing.
If we expect an LLM to magically realise that, when you ask it to find something based on an identifier you told it is ‘California’, the query should actually be based on ‘CA’ rather than what you told it, then its failure to do so is not really the fault of the LLM.
I expect one of the database MCPs would do the trick too, if you didn't mind burning _several_ rounds of back-and-forth under the covers.
Many people have only tried the free version of ChatGPT, which is a completely different experience than the two most recent sonnet models.
And many people tried it a year ago and wrote it off as useless after trying to get results with basic one-shot prompting.
Those people, likely, will never change their opinion.
And that’s fine, because they won’t get the huge benefits that come from spending time learning how to use the tool properly.
Agreed. If one compares ChatGPT to, say, the Cline IDE plugin backed by Claude 3.7, they might well be blown away by how far behind ChatGPT seems. A lot of the difference has to do with prompting, for sure -- Cline helps there by generating prompts from your IDE and project context automatically.
Every once in a while I send a query off to ChatGPT and I'm often disappointed and jam on the "this was hallucinated" feedback button (or whatever it is called). I have better luck with Claude's chat interface but nowhere near the quality of response that I get with Cline driving.
You should try Claude Code. I was pretty impressed the first time I tried it. Go in with a specific task in mind
I want to sit next to you and stop you every time you use your LLM and say, “Let me just carefully check this output.” I bet you wouldn’t like that. But when I want to do high quality work, I MUST take that time and carefully review and test.
What I am seeing is fanboys who offer me examples of things working well that fail any close scrutiny— with the occasional example that comes out actually working well.
I agree that for prototyping unimportant code LLMs do work well. I definitely get to unimportant point B from point A much more quickly when trying to write something unfamiliar.
What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time? Nobody knows! A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result. If I need to ask an LLM to explain some fact to me, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.
> What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time?
Benchmarks could track that too - I don't know if they do, but that information should actually be available and easy to get.
When models are scored on e.g. "pass@10", i.e. at least one of 10 sampled attempts passes the challenge, and the benchmark is rerun periodically, that literally produces the information you're asking for: how frequently a given model fails at a particular task.
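Concretely, the standard unbiased pass@k estimator (from the Codex paper) turns "c correct out of n sampled attempts" into exactly that kind of per-task success probability; a small sketch (the function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, estimated from
    n total samples of which c were correct (unbiased pass@k estimator)."""
    if n - c < k:
        return 1.0  # fewer than k failures sampled, so at least one success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a model that solved a task in 30 of 100 sampled attempts
# has pass@1 = 0.30 but pass@10 ~ 0.98 -- two very different "failure rates".
print(pass_at_k(100, 30, 1), pass_at_k(100, 30, 10))
```

The same n and c also give you the raw per-attempt failure rate, (n - c) / n.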
> A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result.
For many tasks, validating a solution is order of magnitudes easier and cheaper than finding the solution in the first place. For those tasks, LLMs are very useful.
> If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if it's output is correct? No LLM proponent seems to want to answer this question.
How can you be sure whether a human you're asking isn't hallucinating/guessing the answer, or straight up bullshitting you? Apply the same approach to LLMs as you apply to navigating this problem with humans - for example, don't ask it to solve high-consequence problems in areas where you can't evaluate proposed solutions quickly.
> For many tasks, validating a solution is order of magnitudes easier and cheaper than finding the solution in the first place.
A good example that I use frequently is a reverse dictionary.
It's also useful for suggesting edits to text that I have written. It's easy for me to read its suggestions and accept/reject them.
I think part of it is that, from eons of experience, we have a pretty good handle on what kinds of mistakes humans make and how. If you hire a competent accountant, he might make a mistake like entering an expense under the wrong category. And since he's watching for mistakes like that, he can double-check (and so can you) without literally checking all his work. He's not going to "hallucinate" an expense that you never gave him, or put something in a category he just made up.
I asked Gemini for the lyrics to a song that I knew was on all the lyrics sites. To make a long story short, it gave me the wrong lyrics three times, apparently making up new ones the last two times. Someone here said LLMs may not be allowed to look at those sites for copyright reasons, which is fair enough; but then it should have just said so, not "pretended" it was giving me the right answer.
I have a python script that processes a CSV file every day, using DictReader. This morning it failed, because the people making the CSV changed it to add four extra lines above the header line, so DictReader was getting its headers from the wrong line. I did a search and found the fix on Stack Overflow, no big deal, and it had the upvotes to suggest I could trust the answer. I'm sure an LLM could have told me the answer, but then I would have needed to do the search anyway to confirm it--or simply implemented it, and if it worked, assume it would keep working and not cause other problems.
That was just a two-line fix, easy enough to try out and see if it worked, and guess how it worked. I can't imagine implementing a 100-line fix and assuming the best.
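For illustration, the kind of two-line fix being described looks roughly like this (a sketch, assuming the preamble is exactly four lines; the file and function names are hypothetical):

```python
import csv
import itertools

def read_daily_report(path: str) -> list[dict]:
    """Read the vendor CSV, skipping the four preamble lines that were added
    above the real header row, so DictReader picks up the correct headers."""
    with open(path, newline="") as f:
        return list(csv.DictReader(itertools.islice(f, 4, None)))
```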
It seems to me that some people are saying, "It gives me the right thing X% of the time, which saves me enough developer time (mine or someone else's) that it's worth the other (100-X)% of the time when it gives me garbage that takes extra time to fix." And that may be a fair trade for some folks. I just haven't found situations where it is for me.
Even better than the whole 'unknown, fluctuating, non-deterministic rates of failure' issue is the whole 'agentic' shtick. People proposing to chain together these fluctuating plausibility engines should study probability theory a bit more deeply to understand just what they are in for with these Rube Goldberg machines of text continuation.
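A quick back-of-the-envelope illustration: if each step in an agent chain succeeds independently with probability p (a generous assumption), the end-to-end success rate is p^n, which decays fast:

```python
# End-to-end success of an n-step chain is p**n when steps are independent.
for p in (0.99, 0.95, 0.90):
    for n in (5, 10, 20):
        print(f"per-step p = {p:.2f}, steps = {n:2d}: chain success = {p ** n:.2f}")
```

Even a 95%-reliable step leaves a 20-step chain succeeding only about a third of the time.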
I think it’s very odd that you think that people using LLMs regularly aren’t carefully checking the outputs. Why do you think that people using LLMs don’t care about their work?
> invented references that just don't exist"...all I can say is "press X to doubt
This doesn’t include lying and cheating, which LLMs can’t do.
On the other hand, AI is being used to solve problems that are already solved. I just recently got an ad for process-modeling software where one claim was that you don't always need to start from the ground up; you can tell the AI "give me the customer order process" and start from that point. That is basically what templates are for, with much less energy consumption.
> there are two kinds of people who use AI
You hit the nail on the head with this one. Around me, I've noticed that the bashing of LLMs comes from the smart people who want others to know they are smart.
I've noticed there seems to be a gatekeeping archetype that operates as a hard cynic to nearly everything, so that when they finally judge something positively they get heaps of attention.
It doesn't always correlate with narcissism, but it happens much more than chance.
I don’t think your main experience with LLMs (coding clearly defined tasks) refute what Sabine is saying.
The use cases are vastly different and the first is just _not_ world changing. It’s great, don’t get me wrong, but it won’t change the world.
>A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
Yes, somewhat. It's good for PowerShell/bash/cmd scripts and configs, but early models would hallucinate PowerShell cmdlets especially.
One thing I think is clear is society is now using a lot of words to describe things when the words being used are completely devoid of the necessary context. It's like calling a powder you've added to water "juice" and also freshly-squeezed fruit just picked perfectly ripe off a tree "juice". A word stretched like that becomes nearly devoid of meaning.
"I write code all day with LLMs, it's amazing!" is in the exact same category. The code you (general you, I'm not picking on you in particular) write using LLMs, and the code I write apart from LLMs: they are not the same. They are categorically different artifacts.
All fun and games until your AI-generated script deletes the production database. I think that's the point: the cost of faults in academic and financial settings is too high for LLMs to be useful.
The point is that given the current valuations, being good at a bunch of narrow use cases is just not good enough. It needs to be able to replace humans in every role where the primary output is text or speech to meet expectations.
I don't think that "replacing humans in every role" is the line for "being bullish on AI models". I think they could stop development exactly where they are, and they would still make pretty dramatic improvements to productivity in a lot of places. For me at least, their value already exceeds the $20/month I'm paying, and I'm pretty sure that way more than covers inference costs.
> I think they could stop development exactly where they are, and they would still make pretty dramatic improvements to productivity in a lot of places.
Yup. Not to mention, we don't even have time to figure out how to effectively work with one generation of models before the next generation of models gets released and raises the bar. If development stopped right now, I'd still expect LLMs to get better for years, as people slowly figure out how to use them well.
Completely agree. As is, Cursor and ChatGPT and even Bing Image Create (for free generation of shoddy ideas, styles, concepts, etc) are very useful to me. In fact, it would suit me if everything stalled at this point rather than improve to the point that everyone can catch up in how they use AI.
The most interesting thing about this post is how it reinforces how terrible the usability of LLMs still is today:
"I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error. I Google for the alleged quote, it doesn't exist. They reference a scientific publication, I look it up, it doesn't exist."
To experienced LLM users that's not surprising at all - providing citations, sources for quotes, useful URLs are all things that they are demonstrably terrible at.
But it's a computer! Telling people "this advanced computer system cannot reliably look up facts" goes against everything computers have been good at for the last 40+ years.
One of the things that’s hard about these discussions is that behind them is an obscene amount of money and hype. She’s not responding to realists like you. She’s responding to the bulls. The people saying these tools will be able to run the world by the end of this year, maybe next.
And that’s honestly unfair to you since you do awesome realistic and level headed work with LLM.
But I think it’s important when having discussions to understand the context within which they are occurring.
Without the bulls she might very well be saying what you are in your last paragraph. But because of the bulls the conversation becomes this insane stratified nonsense.
Possibly a reaction to Bill Gates recent statements that it will begin replacing doctors and teachers. It's ridiculous to say LLMs are incredibly useful and valuable. It's highly dubious to think they can be trusted with actual critical tasks without careful supervision.
I think it's ridiculous to say LLMs are NOT "incredibly useful and valuable", but I 100% agree that it's "highly dubious to think they can be trusted with actual critical tasks without careful supervision".
Yeah that's actually what I meant to type
It's honestly so scary because Sam Altman and his ilk would gladly replace all teachers with LLMs right now, because it makes their lines go up; it doesn't matter to them that it would result in a generation of dumb people in like 10 years. Honestly, it would just create more LLM users for them to sell to, so it's a win-win I guess, but it completely fucks up our world.
“Replacing teachers” is of course laughable.
Teachers are there to observe and manage behavior, resolve conflict, identify psychological risks and get in front of fixing them, set and maintain a positive tone (“setting the weather”), lift pupils up to that tone, and to summarize, assess and report on progress.
They are also there to grind through papers, tests, lesson plans, reports, marking, and letter writing. All of that will get easier with machine assistance.
Teaching is one of the most human-nature centric jobs in the world and will be the last to go. If AI can help focus the role of teacher more on using expert people skills and less on drudgery it will hopefully even improve the prospects of teaching as a career, not eliminate it.
Capability today and next year will probably be very different in reliability
As someone who uses LLMs to write code every day, I don't see a huge progress since last year, so I'm also not that sure about next year.
This isn't really a problem in tool-assisted LLMs.
Use google AI studio with search grounding. Provides correct links and citations every time. Other companies have similar search modes, but you have to enable those settings if you want good results.
Okay, but it's weird there is a "don't lie to me" button.
The "don't lie to me" button for a human is asking them, "where did you learn that fact?"
Grounding isn't very different from that.
How would that ever work? The only thing you can do is continue to refine high quality data sets to train on. The rate of hallucination only trends downwards on the high end models as they improve in various ways.
It's wikipedia in the 00s all over again being preached by roughly the same age and social demographic.
I’ve never had it give me bullshit citations when I specifically say look it up. It just gives me a clickable link.
This sort of concrete verifiable hallucinations can be trained out and probably will be soon
People have said this ever since ChatGPT was first released.
I become more and more convinced with each of these tweets/blogs/threads that using LLMs well is a skill set akin to using Search well.
It’s been a common mantra - at least in my bubble of technologists - that a good majority of the software engineering skill set is knowing how to search well. Knowing when search is the right tool, how to format a query, how to peruse the results and find the useful ones, what results indicate a bad query you should adjust… these all sort of become second nature the longer you’ve been using Search, but I also have noticed them as an obvious difference between people that are tech-adept vs not.
LLMs seem to have a very similar usability pattern. They’re not always the right tool, and are crippled by bad prompting. Even with good prompting, you need to know how to notice good results vs bad, how to cherry-pick and refine the useful bits, and have a sense for when to start over with a fresh prompt. And none of this is really _hard_ - just like Search, none of us need to go take a course on prompting - IMO folks just need to engage with LLMs as a non-perfect tool they are learning how to wield.
The fact that we have to learn a tool doesn’t make it a bad one. The fact that a tool doesn’t always get it 100% on the first try doesn’t make it useless. I strip a lot of screws with my screwdriver, but I don’t blame the screwdriver.
Agree. It's a tool like anything else.
On a side note, this lady is a fraud: https://www.youtube.com/watch?v=nJjPH3TQif0&themeRefresh=1
I don't know if she is a fraud, but she has definitely greatly amplified Rage Bait Farming and talking about things that are far outside of her domain of expertise as if she were an expert.
In no way am I credentialing her; lots of people can make astute observations about things they weren't trained in. But she has mastered both sounding authoritative and, at the same time, presenting things to get the most engagement possible.
I've frequently heard that once you get sucked into the YouTube algorithm, you have to keep making content to maintain rankings.
This trap reminds me of the Perry Bible Fellowship comic "Catch Phrase" which has been removed for being too dark but can still be found with a search.
https://www.youtube.com/watch?v=4L0PAXOH718 Sponge Bob realizes he has played himself out with his own catchphrase!
Wow, thank you. I rarely get a good cultural recommendation here, but PBF I didn't know about.
I raise you, Joan Cornellà
Thanks for sharing this. I was heavily involved in graduate physics when I was in school, and was very worried about what direction she'd take after the first big viral vid "telling her story." I wasn't sure it was well understood, or even understood at all, how blinkered her...viewpoint?...was.
LLMs function as a new kind of search engine, one that is amazingly useful because it can surface things that traditional search could never dream of. Don't know the name of a concept, just describe it vaguely and the LLM will pull out the term. Are you not sure what kind of information even goes into a cover letter or what's customary to talk about? Ask an LLM to write you one, it will be bland and generic sure but that's not the point because you now know the "shape" of what they're supposed to look like and that's great for getting unblocked. Have you stumbled across a passage of text that's almost English but you're not really sure what to look up to decipher it? Paste it into the LLM and it will tell you that it's "Early Modern English" which you can look up to confirm and get a dictionary for.
Broader than that, it’s critical thinking skills. Using search and LLMs requires analyzing the results and being able to separate what is accurate and useful from what isn’t.
From my experience this is less an application of critical skills and more a domain knowledge check. If you know enough about the subject to have accumulated heuristics for correctness and intuition for "lgtm" in the specific context, then it's not very difficult or intellectually demanding to apply them.
If you don't have that experience in this domain, you will spend approximately as much effort validating output as you would have creating it yourself, but the process is less demanding of your critical skills.
No, it is critical thinking skills, because the LLMs can teach you the domain, but you have to then understand what they are saying enough to tell if they are bsing you.
> If you don't have that experience in this domain, you will spend approximately as much effort validating output as you would have creating it yourself
Not true.
LLMs are amazing tutors. You have to use outside information, they test you, you test them, but they aren't pathologically wrong in the way that they are trying to do a Gaussian magic smoke psyop against you.
Knowledge certainly helps, but I’m talking about something more fundamental: your bullshit detector.
Even when you lack subject matter expertise about something, there are certain universal red flags that skeptics key in on. One of the biggest ones is: “There’s no such thing as a free lunch” and its corollary: “If it sounds too good to be true, it probably is.”
I'm not so sure about that. I was really anti-LLM with the previous generation of LLMs (GPT-3.5/4) but never stopped trying them out. I just found the results to be subpar.
Since reasoning models came about, I've been significantly more bullish on them, purely because they are less bad. They are still not amazing, but they are at a point where I feel like including them in my workflow isn't an impediment.
They can now reliably complete a subset of tasks without me needing to rewrite large chunks of it myself.
They are still pretty terrible at edge cases (uncommon patterns/libraries, etc.), but when on the beaten path they can actually improve productivity pretty decently. I still don't think it's 10x (well, today was the first time I felt a 10x improvement, but I was moving frontend code from a custom framework to React - more tedium than anything else in that, and the AI did a spectacular job).
You're using them wrong. Everyone is though I can't fault you specifically. Chatbot is like the worst possible application of these technologies.
Of late, deaf tech forums have been taken over by language model debates over which works best for speech transcription. (Multimodal language models are the state of the art in machine transcription. Everyone seems to forget that when complaining they can't cite sources for scientific papers yet.) The debates have gotten to the point that it's become annoying how much space the topic has taken over, just like it has here on HN.
But then I remember, oh yeah, there was no such thing as live machine transcription ten years ago. And now there is. And it's going to continue to get better. It's already good enough to be very useful in many situations. I have elsewhere complained about the faults of AI models for machine transcription - in particular, when they make mistakes they tend to hallucinate something that is superficially grammatical and coherent instead - but for a sporadic single phrase in an audio transcription that's sometimes tolerable. In many cases you still want a human transcriber, but the cost of that means the amount of transcription needed can never be satisfied.
It's a revolutionary technology. I think in a few years I'm going have glasses that continuously narrate the sounds around me and transcribe speech and it's going to be so good I can probably "pass" as a hearing person in some contexts. It's hard not to get a bit giddy and carried away sometimes.
> You're using them wrong. Everyone is though I can't fault you specifically.
If everyone is using them wrong, I would argue that says something more about them than the users. Chat-based interfaces are the thing that kicked LLMs into the mainstream consciousness and started the cycle/trajectory we’re on now. If this is the wrong use case, everything the author said is still true.
There are still applications made better by LLMs, but they are a far cry from AGI/ASI in terms of being all-knowing problem solvers that don’t make mistakes. Language tasks like transcription and translation are valuable, but by no stretch do they account for the billions of dollars of spend on these platforms, I would argue.
LLM providers actually have an incentive not to write literature on how to use LLM optimally, as that causes friction which means less engagement/money spent on the provider. There's also the typical tin-foil hat explanation of "it's bad so you'll keep retrying it to get the LLM to work which means more money for us."
Isn't this more a product of the hype though? At worst you're describing a product marketing mistake, not some fundamental shortcoming of the tech. As you say "chat" isn't a use case, it's a language-based interface. The use case is language prediction, not an encyclopedic storage and recall of facts and specific quotes. If you are trying to get specific facts out of an LLM, you'd better be using it as an interface that accesses some other persistent knowledge store, which has been incorporated into all the major 'chat' products by now.
Surely you're not saying everyone is using them wrong. Let's say only 99% of them are using LLMs wrong, and the remaining 1% creates $100B of economic value. That's $100B of upside.
Yes the costs of training AI models these days are really high too, but now we're just making a quantitative argument, not a qualitative one.
The fact that we've discovered a near-magical tech that everyone wants to experiment with in various contexts, is evidence that the tech is probably going somewhere.
Historically speaking, I don't think any scientific invention or technology has been adopted and experimented with so quickly and on such a massive scale as LLMs.
It's crazy that people like you dismiss the tech simply because people want to experiment with it. It's like some of you are against scientific experimentation for some reason.
“If everything smells like shit, check your shoe.”
If the goal is to layoff all the customer support and trap the customer in a tarpit with no exit, LLMs are likely the best choice.
The US can have fun with that. In the EU we'll likely get laws that force companies to let us talk to a human if it gets bad enough.
I think all the technology is already in place. There are already smart glasses with tiny text displays. Also smartphones have more than enough processing capacity to handle live speech transcription.
What are the best open-source live machine transcription tools, would you say? Know of any guides that make it easy to set up locally, if so?
I’ve had the exact same vibes around chatbots as an application of LLMs. But other than translation/transcription, what else is there?
> ...there was no such thing as live machine transcription ten years ago.
What? Then what the hell do you call Dragon NaturallySpeaking and other similar software in that niche?
Thru the 90s and 00s and well into the 10s I generally dismissed speech recognition as useless to me, personally.
I have a minor speech impediment because of the hearing loss. They never worked for me very well. I don't speak like a standard American - I have a regional accent and I have a speech impediment. Modern speech recognition doesn't seem to have a problem with that anymore.
IBM's ViaVoice from 1997 in particular was a major step. It was really impressive in a lot of ways but the accuracy rate was like 90 - 95% which in practice means editing major errors with almost every sentence. And that was for people who could speak clearly. It never worked for me very well.
You also needed to speak in an unnatural way [pause] comma [pause] and it would not be fair to say that it transcribed truly natural speech [pause] full stop
Such voice recognition systems before about 2016 also required training on the specific speaker. You would read many pages of text to the recognition engine to tune it to you specifically.
It could not just be pointed at the soundtrack to an old 1980s TV show then produce a time-sync'd set of captions accurate enough to enjoy the show. But that can be done now.
So, you started by saying
> ...there was no such thing as live machine transcription ten years ago.
Now you're saying that live machine transcription existed thirty years ago, but it has gotten substantially better in the intervening decades.
I agree with your amended claim.
If there's one common thread across LLM criticisms, it's that they're not perfect.
These critics don't seem to have learned the lesson that the perfect is the enemy of the good.
I use ChatGPT all the time for academic research. Does it fabricate references? Absolutely, maybe about a third of the time. But has it pointed me to important research papers I might never have found otherwise? Absolutely.
The rate of inaccuracies and falsehoods doesn't matter. What matters is, is it saving you time and increasing your productivity. Verifying the accuracy of its statements is easy. While finding the knowledge it spits out in the first place is hard. The net balance is a huge positive.
People are bullish on LLM's because they can save you days' worth of work, like every day. My research productivity has gone way up with ChatGPT -- asking it to explain ideas, related concepts, relevant papers, and so forth. It's amazing.
> Verifying the accuracy of its statements is easy.
For single statements, sometimes, but not always. For all of the many statements, no. Having the human attention and discipline to mindfully verify every single one without fail? Impossible.
Every software product/process that assumes the user has superhuman vigilance is doomed to fail badly.
> Automation centaurs are great: they relieve humans of drudgework and let them focus on the creative and satisfying parts of their jobs. That's how AI-assisted coding is pitched [...]
> But a hallucinating AI is a terrible co-pilot. It's just good enough to get the job done much of the time, but it also sneakily inserts booby-traps that are statistically guaranteed to look as plausible as the good code (that's what a next-word-guessing program does: guesses the statistically most likely word).
> This turns AI-"assisted" coders into reverse centaurs. The AI can churn out code at superhuman speed, and you, the human in the loop, must maintain perfect vigilance and attention as you review that code, spotting the cleverly disguised hooks for malicious code that the AI can't be prevented from inserting into its code. As qntm writes, "code review [is] difficult relative to writing new code":
-- https://pluralistic.net/2025/03/18/asbestos-in-the-walls/
> Having the human attention and discipline to mindfully verify every single one without fail? Impossible.
I mean, how do you live life?
The people you talk to in your life say factually wrong things all the time.
How do you deal with it?
With common sense, a decent bullshit detector, and a healthy level of skepticism.
LLM's aren't calculators. You're not supposed to rely on them to give perfect answers. That would be crazy.
And I don't need to verify "every single statement". I just need to verify whichever part I need to use for something else. I can run the code it produces to see if it works. I can look up the reference to see if it exists. I can Google the particular fact to see if it's real. It's really very little effort. And the verification is orders of magnitude easier and faster than coming up with the information in the first place. Which is what makes LLM's so incredibly helpful.
> I just need to verify whichever part I need to use for something else. I can run the code it produces to see if it works. I can look up the reference to see if it exists. I can Google the particular fact to see if it's real. It's really very little effort. And the verification is orders of magnitude easier and faster than coming up with the information in the first place. Which is what makes LLM's so incredibly helpful.
Well put.
Especially this:
> I can run the code it produces to see if it works.
You can get it to generate tests (and easy ways for you to verify correctness).
It's really funny how most anecdotes and comments about the utility and value of interacting with LLMs can be applied to anecdotes and comments about human beings themselves. The majority of people haven't realized yet that consciousness is assumed by our society, and that we, in fact, don't know what it is or whether we have it, let alone whether we can ascribe it to another entity.
> Does it fabricate references? Absolutely, maybe about a third of the time
And you don't have concerns about that? What kind of damage is that doing to our society, long term, if we have a system that _everyone_ uses and it's just accepted that a third of the time it is just making shit up?
No, I don't. Because I know it does and it's incredibly easy to type something into Google Scholar and see if a reference exists.
Like, I can ask a friend and they'll mistakenly make up a reference. "Yeah, didn't so-and-so write a paper on that? Oh they didn't? Oh never mind, I must have been thinking of something else." Does that mean I should never ask my friend about anything ever again?
Nobody should be using these as sources of infallible truth. That's a bonkers attitude. We should be using them as insanely knowledgeable tutors who are sometimes wrong. Ask and then verify.
The net benefit is huge.
No, that doesn't mean you should never ask your friend things again if they make that mistake. But, if 30% of all their references are made up then you might start to question everything your friend says. And looking up references to every claim you're reading is not a productive use of time.
If my friend has a million times more knowledge than the average human being, then I'm willing to put up with a 30% error rate on references.
And I'm talking about references when doing deep academic research. Looking them up is absolutely a productive use of time -- I'm asking for the references so I can read them. I'm not asking for them for fun.
Remember, it's hundreds of times easier to verify information than it is to find it in the first place. That's the basic principle of what makes LLM's so incredibly valuable.
But how can you be sure that the info is correct if it made up the reference? Where did it pull the info from? What good is a friend who's just bullshitting their way through every conversation hoping you won't notice?
A third of the time is an insane number; if 30% of the code I wrote contained non-existent headers, I would have been fired long ago.
A person who's bullshitting their way doesn't get a 70% accuracy. For yes/no questions they'll get 50%. For open ended questions they'll be lucky to get 1%.
You're really underestimating the difficulty of getting 70% accuracy for general open-ended questions.
And while you might think you're better than 70%, I'm pretty sure that if you didn't run your code through compilers and linters, and test it at least a couple of times, your code wouldn't get anywhere near 70% correct.
Because he reads the reference document…
"you might start to question everything your friend says"
That's exactly what the OP is saying. Verify everything.
Maybe I'm getting old, but sometimes it feels like everybody is young now and has only lived in a world where they can look up anything at a moment's notice, and now think they are infallible.
Having lived a decent chunk of my life pre-internet, or at least before fast and available internet, looking back at those days you realize just how often people were wrong about things. Old wives' tales, made-up statistics, imagined scenarios; people really do seem to confabulate a lot of information.
> And you don't have concerns about that? What kind of damage is that doing to our society, long term, if we have a system that _everyone_ uses and it's just accepted that a third of the time it is just making shit up?
Main problem with our society is that two thirds of what _everyone_ says is made up shit / motivated reasoning. The random errors LLMs make are relatively benign, because there is no motivation behind them. They are just noise. Look through them.
I think a third of the facts I state are false as stated, and I do not think I'm worse than the 30th percentile of humans at truthfulness.
You are not a trusted authority relied on by millions and expected to make decisions for them, and you could choose not to say something you aren't sure that you know.
You might be surprised to hear that people talk to other people and trust their judgements.
So, I've sometimes wondered about this.
Could it end up being a net benefit? will the realistic sounding but incorrect facts generated by A.I. make people engage with arguments more critically, and be less likely to believe random statements they're given?
Now, I don't know, or even think it is likely that this will happen, but I find it an interesting thought experiment.
That's hilarious; I had no idea it was that bad. And for every conscientious researcher who actually runs down all the references to separate the 2/3 good from the 1/3 bad, how many will just paste them in, adding to the already sky-high pile of garbage out there?
This. 100% this.
LLMs will spit out responses with zero backing and 100% conviction. People see citations and assume they're correct. We're conditioned for it thanks to... everything ever in history. Rarely do I need to check a Wikipedia entry's source.
So why don't people understand this: it is absolutely going to pour jet fuel on misinformation in the world. And we as a society are allowed to hold a higher bar for what we'll accept being shoved down our throats by corporate overlords who want their VC payout.
> People see citations and assume it's correct.
The solution is to set expectations, not to throw away one of the most valuable tools ever created.
If you read a supermarket tabloid, do you think the stories about aliens are true? No, because you've been taught that tabloids are sensationalist. When you listen to campaign ads, do you think they're true? When you ask a buddy about geography halfway across the world, do you assume every answer they give is right?
It's just about having realistic expectations. And people tend to learn those fast.
> Rarely do I need to check a wikipedia entry's source.
I suggest you start. Wikipedia is full of citations that don't back up the text of the article. And that's when there are even citations to begin with. I can't count the number of times I've wanted to verify something on Wikipedia, and there either wasn't a citation, or there was one related to the topic but that didn't have anything related to the specific assertion being made.
people lie more
I think many people are just not really good at dealing with "imperfect" tools. Different tools have different success probabilities; let's call that probability p here. People typically use tools with p=100%, or at least very close to it. But an LLM is a tool that is far from that, so making use of it takes a different approach.
Imagine there is a probabilistic oracle that can answer any yes/no question with success probability p. If p=100% or p=0% then it is obviously very useful. If p=50% then it is absolutely worthless. In other cases, such an oracle can still be used in different ways to get the answer we want, and it remains a useful thing.
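To make that concrete, here is a minimal sketch of how a far-from-perfect yes/no oracle can still be amplified, assuming (and this assumption is doing all the work) that its errors are independent across repeated queries:

    import random
    from collections import Counter

    def noisy_oracle(truth: bool, p: float) -> bool:
        """Return the true answer with probability p, the wrong one otherwise."""
        return truth if random.random() < p else not truth

    def ask_many(truth: bool, p: float, n: int) -> bool:
        """Ask the same yes/no question n times and take the majority answer."""
        votes = Counter(noisy_oracle(truth, p) for _ in range(n))
        return votes[True] >= votes[False]

    # With p = 0.7 a single query is wrong 30% of the time; 15 independent
    # queries with a majority vote are wrong only about 5% of the time.
    trials = 10_000
    errors = sum(ask_many(True, 0.7, 15) is False for _ in range(trials))
    print(f"error rate with 15 votes: {errors / trials:.3f}")

Real LLM errors tend to be correlated rather than independent, so the amplification is weaker in practice, but the underlying point stands: a p well away from 50% still carries usable signal.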
One of the magic things about engineering is that I can make usefulness out of unreliability. Voltage can fluctuate and I can transmit 1s and 0s, lines can fizz, machines can die, and I can reliably send video from one end to the other.
Unreliability is something we live in. It is the world. Controlling error, increasing signal over noise, extracting energy from the fluctuations. This is life, man. This is what we are.
I can use LLMs very effectively. I can use search engines very effectively. I can use computers.
Many others can’t. Imagine the sheer fortune to be born in the era where I was meant to be: tools transformative and powerful in my hands; useless in others’.
I must be blessed by God.
Many people are trapped in black-and-white thinking. It's like they can only think in binary: things that are neither good nor bad are heresy.
That is deep
Your point reminded me of Terence Tao's point that AI has a "plausibility problem". When it can't be accurate, it still disguises itself as accurate.
Its true success rate is by no means 100%, and sometimes is 0%, but it always tries to make you feel confident.
I’ve had to catch myself surrendering too much judgment to it. I worry a high school kid learning to write will have fewer qualms surrendering judgment
A scientific instrument that is unreliably accurate is useless. Imagine a kitchen scale that was off by +/- 50% every 3rd time you used it. Or maybe every 5th time. Or every 2nd.
So we're currently trying to use tools like this to help solve deeper problems, and they aren't up to the task. This is the point where we need to start over and get better tools. A bronze knife will never be as sharp, or hold its edge as well, as a steel knife, no matter how much you sharpen it. Same basic elements, very different material.
A bad analogy doesn't make a good argument. The best analogy for LLMs is probably a librarian on LSD in a giant library. They will point you in a direction if you have a question. Sometimes they will pull up the exact page you need, sometimes they will lead you somewhere completely wrong and confidently hand you a fantasy novel, trying to convince you it's a real science book.
It's completely up to your ability to both find what you need without them and verify the information they give you to evaluate their usefulness. If you put that on a matrix, this makes them useful in the quadrant of information that is both hard to find, but very easy to verify. Which at least in my daily work is a reasonable amount.
I think people confuse the power of the technology with the very real bubble we’re living in.
There’s no question that we’re in a bubble which will eventually subside, probably in a “dot com” bust kind of way.
But let me tell you…last month I sent several hundred million requests to AI, as a single developer, and got exactly what I needed.
Three things are happening at once in this industry… (1) executives are overpromising a literal unicorn with AGI, which is totally unnecessary for the ongoing viability of LLM's and is pumping the bubble. (2) the technology is improving and delivery costs are changing as we figure out what works and who will pay. (3) the industry's instincts are developing, so it's common for people to think "AI" can do something it absolutely cannot do today.
But again…as one guy, for a few thousand dollars, I sent hundreds of millions of requests to AI that are generating a lot of value for me and my team.
Our instincts have a long way to go before we’ve collectively internalized the fact that one person can do that.
> But let me tell you…last month I sent several hundred million requests to AI, as a single developer, and got exactly what I needed
There are 2.6 million seconds in a month. You are claiming to have sent hundreds of requests per second to AI.
That's exactly what happened – I called the OpenAI API, using custom application code running on a server, a few hundred million times.
It is trivial for a server to send/receive 150 requests per second to the API.
This is what I mean by instincts...we're used to thinking of developers-pressing-keys as a fundamental bottleneck, and it still is to a point. But as soon as the tracks are laid for the AI to "work", things go from speed-of-human-thought to speed-of-light.
A lot of people are feeding all the email and slack messages for entire companies through AI to classify sentiment (positive, negative, neutral etc), or summarize it for natural language search using a specific dictionary. You can process each message multiple ways for all sorts of things, or classify images. There's a lot of uses for the smaller cheaper faster llms
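For what it's worth, that kind of bulk classification is only a few lines around an API call. A rough sketch, assuming the OpenAI Python SDK; the model name, prompt, and messages below are placeholders, not anyone's production setup:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def classify_sentiment(message: str) -> str:
        """Label one message as positive / negative / neutral with a small, cheap model."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder: any small, cheap model
            messages=[
                {"role": "system",
                 "content": "Classify the sentiment of the user's message. "
                            "Reply with exactly one word: positive, negative, or neutral."},
                {"role": "user", "content": message},
            ],
        )
        return response.choices[0].message.content.strip().lower()

    # hypothetical batch of Slack/email messages
    messages = ["Great job on the launch!", "This is broken again.", "Meeting moved to 3pm."]
    print(list(zip(messages, (classify_sentiment(m) for m in messages))))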
Yeah I'm curious now.
If you have a lot of GPU's and you're doing massive text processing like spam detection for hundreds of thousands of users, sure.
But "as a single developer", "value for me and my team"... I'm confused.
I'm NDA'ed on the specifics, sorry.
In general terms, we had to call the OpenAI API X00,000,000 times for a large-scale data processing task. We ended up with about 2,000,000 records in a database, using data created, classified, and cleaned by the AI.
There were multiple steps involved, so each individual record was the result of many round trips between the AI and the server, and not all calls are 1-to-1 with a record.
None of this is rocket science, and I think any average developer could pull off a similar task given enough time...but I was the only developer involved in the process.
The end product is being sold to companies who benefit from the data we produced, hence "value for me and the team."
The real point is that generative AI can, under the right circumstances, create absurd amounts of "productivity" that wouldn't have been possible otherwise.
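The specifics are NDA'd, so purely as an illustration of the shape described above (several model round trips per record, with results landing in a database); every prompt, model name, and field below is a placeholder:

    import sqlite3
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(prompt: str) -> str:
        """Hypothetical one-shot helper: send a prompt, return the text reply."""
        r = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content

    def process(raw: str) -> tuple[str, str]:
        """Several round trips per record: extract, then classify, then clean."""
        extracted = ask(f"Extract the key fields from this text as JSON:\n{raw}")
        category = ask(f"Classify this record into a single category:\n{extracted}")
        cleaned = ask(f"Normalize this JSON (casing, dates, obvious typos):\n{extracted}")
        return cleaned, category

    db = sqlite3.connect("records.db")
    db.execute("CREATE TABLE IF NOT EXISTS records (data TEXT, category TEXT)")
    for raw in ["raw input 1...", "raw input 2..."]:  # placeholder inputs
        db.execute("INSERT INTO records VALUES (?, ?)", process(raw))
    db.commit()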
My experience is starkly different. Today I used LLMs to:
1. Write python code for a new type of loss function I was considering
2. Perform lots of annoying CSV munging ("split this CSV into 4 equal parts", "convert paths in this column into absolute paths", "combine these and then split into 4 distinct subsets based on this field.." - they're great for that)
3. Expedite some basic shell operations like "generate softlinks for 100 randomly selected files in this directory"
4. Generate some summary plots of the data in the files I was working with
5. Not to mention extensive use in Cursor & GH Copilot
The tool (Claude 3.7 mostly, integrated with my shell so it can execute shell commands and run python locally) worked great in all cases. Yes, I could've done most of it myself, but I personally hate CSV munging and bulk file manipulations, and it's super nice to delegate that stuff to an LLM agent. (A sketch of the kind of code the CSV prompts produce is below.)
edit: formatting
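As promised above, here's the flavour of what comes back for a prompt like "split this CSV into 4 equal parts" - small enough to eyeball, which is the whole point (pandas assumed, filename hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")              # hypothetical input file
    n_parts = 4
    chunk = -(-len(df) // n_parts)            # ceiling division: rows per part
    for i in range(n_parts):
        df.iloc[i * chunk:(i + 1) * chunk].to_csv(f"data_part{i + 1}.csv", index=False)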
These seem like fine use cases: trivial boilerplate stuff you’d otherwise have to search for and then munge to fit your exact need. An LLM can often do both steps for you. If it doesn’t work, you’ll know immediately and you can probably figure out whether it’s a quick fix or if the LLM is completely off-base.
That’s fair but it’s totally different use cases than the linked post discusses.
The click baited title is “I genuinely don’t understand how some people are still bullish about LLM”.
I guess the author can understand now?
When something was impossible only 3 years ago, barely worked 2 years ago, but works well now, there are very good reasons to be bullish, I suppose?
The hype cuts both ways.
> When something was impossible only 3 years ago, barely worked 2 years ago, but works well now
What exactly are you talking about? What are you stating works well now that did not work years ago? Claude as a milestone of code writing?
Also, in that case, if the current apparent successes come from a realm of tentative responses, we would need proof that the unreliable has become reliable. The observer will say, "they were tentative before, they often look tentative now, why should we think they will cross the threshold into a radical change?"
How did you integrate Claude into your shell
I wrote my own tool for that a while back as an LLM plugin, so I can do this:
I use that all the time, it works really well (defaulting to GPT-4o-mini because it's so cheap, but it works with Claude too): https://simonwillison.net/2024/Mar/26/llm-cmd/
I hacked something together a while back - a hotkey toggles between standard terminal mode and LLM mode. LLM mode interacts with Claude, and has functions / tool calls to run shell commands, python code, web search, clipboard, and a few other things. For routine data science tasks it's been super useful. Claude 3.7 was a big step forward because it will often examine files before it begins manipulating them and double-checks that things were done correctly afterwards (without prompting!). For me this works a lot better than other shell-integration solutions like Warp
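For anyone curious, the core loop of that kind of shell integration is roughly the sketch below - not the actual tool, just the shape of it, assuming the Anthropic Python SDK; the model name is a placeholder and the real thing has more tools and guardrails:

    import subprocess
    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

    TOOLS = [{
        "name": "run_shell",
        "description": "Run a shell command and return its combined stdout/stderr.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }]

    def llm_mode(prompt: str) -> str:
        """Send a request, run any shell commands the model asks for, loop until done."""
        messages = [{"role": "user", "content": prompt}]
        while True:
            response = client.messages.create(
                model="claude-3-7-sonnet-latest",  # placeholder model name
                max_tokens=1024,
                tools=TOOLS,
                messages=messages,
            )
            if response.stop_reason != "tool_use":
                return "".join(b.text for b in response.content if b.type == "text")
            # Execute each requested command and feed the output back to the model.
            messages.append({"role": "assistant", "content": response.content})
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    out = subprocess.run(block.input["command"], shell=True,
                                         capture_output=True, text=True)
                    results.append({"type": "tool_result", "tool_use_id": block.id,
                                    "content": out.stdout + out.stderr})
            messages.append({"role": "user", "content": results})

    print(llm_mode("generate softlinks for 5 randomly selected files in this directory"))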
Claude Code is available directly from Anthropic, but you have to request an invite as it's in "Research Preview"
There are third party tools that do the same, though
I’ve been using Claude a lot lately, and I must say I very much disagree.
For example, the other day I was chatting with it about the health risks associated with my high consumption of farmed salmon. It then generated a small program to simulate the accumulation of PCBs in my body. I could review the program, ask questions about the assumptions, etc. It all seemed very reasonable. A toxicokinetic analysis, it called it.
It then struck me how immensely valuable this is to a curious and inquisitive mind. This is essentially my gold standard of intelligence: take a complex question and break it down in a logical way, explaining every step of the reasoning process to me, and be willing to revise the analysis if I point out errors / weaknesses.
Now try that with your doctor. ;)
Can it make mistakes? Sure, but so can your doctor. The main difference is that here the responsibility is clearly on you. If you do not feel comfortable reviewing the reasoning then you shouldn’t trust it.
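For a sense of what that kind of generated program looks like, here is a minimal one-compartment accumulation sketch - not the program Claude produced, and the intake and half-life numbers are made-up placeholders, not real toxicology:

    import math

    # Made-up placeholder parameters -- not real toxicology.
    weekly_intake_ng = 500.0     # hypothetical PCB intake per week from farmed salmon
    half_life_years = 10.0       # assumed elimination half-life
    k = math.log(2) / (half_life_years * 52)   # first-order elimination rate per week

    body_burden = 0.0
    for week in range(1, 52 * 30 + 1):         # simulate 30 years, week by week
        body_burden = body_burden * math.exp(-k) + weekly_intake_ng
        if week % (52 * 5) == 0:
            print(f"year {week // 52:2d}: body burden ~ {body_burden:,.0f} ng")

    # This model levels off at intake / (1 - exp(-k)).
    print(f"steady state ~ {weekly_intake_ng / (1 - math.exp(-k)):,.0f} ng")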
"Trust but verify"
With an LLM it should be "Don't trust, verify" - but it isn't that hard to verify LLM claims, just ask it for original sources.
Compare to ye olde scientific calculators (90s), they were allowed in tests because even though they could solve equations, they couldn't show the work. And showing the work was 90% of the score. At best you could use one to verify your solution.
But then tech progressed and now calculators can solve equations step by step -> banned from tests at school.
> it isn't that hard to verify LLM claims, just ask it for original sources
Have you tried actually doing this? Most of the time it makes up urls that don't exist or contradict the answer it just gave.
Google it.
There's a bunch of scientific papers talking about that.
I bet plenty of those researchers have written Python programs they've uploaded to GitHub, and you just got one of their programs regurgitated.
It's not intelligence mate, it's just copying an existing program.
I don’t mind to be honest. I don’t expect more intelligence than that from my doctor either. I want them to identify the relevant science and regurgitate / apply it.
>It's not intelligence mate, it's just copying an existing program.
Isn't the 'intelligence' part the bit that takes a previously constructed 'thing' and makes it work in a new 'situation'? Pretty sure that's how humans work, too.
that's a wild assumption that has no place here
And how would you define intelligence then?
So many people putting expectations up to knock down about models. Infinite reasons to critique them.
Please dispense with anyone's "expectations" when critiquing things! (Expectations are not a fault or property of the object of the expectations.)
Today's models (1) do things that are unprecedented. Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, in real time (and faster if we want), is beyond any other artifact in existence. Including humans.
They are (2) progressing quickly. AI has been an active field (even through its famous "winters") for several decades, and they have never moved forward this fast.
Finally and most importantly (3), many people, including myself, continue to find serious new uses for them in daily work, that no other tech or sea of human assistants could replace cost effectively.
The only way I can make sense out of anyone's disappointment is to assume they simply haven't found the right way to use them for themselves. Or are unable to fathom that what is not useful for them is useful for others.
They are incredibly flexible tools, which means a lot of value, idiosyncratic to each user, only gets discovered over time with use and exploration.
That they have many limits isn't surprising. What doesn't? Who doesn't? Zeus help us the day AI doesn't have obvious limits to complain about.
> Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, is beyond any other artifact in existence
Very well said. That’s perhaps the area where I have found LLMs most useful lately. For several years, I have been trying to find a solution to a complex and unique problem involving the laws of two countries, financial issues, and my particular individual situation. No amount of Googling could find an answer, and I was unable to find a professional consultant whose expertise spans the various domains. I explained the problem in detail to OpenAI’s Deep Research, and six minutes later it produced a 20-page report—with references that all checked out—clearly explaining my possible options, the arguments for and against each, and why one of those options was probably best. It probably saved me thousands of dollars.
Are they progressing quickly? Or was there a step-function leap about 2 years ago, and incremental improvements since then?
I tried using AI coding assistants. My longest stint was 4 months with Copilot. It sucked. At its best, it does the same job as IntelliSense but slower. Other times it insisted on trying to autofill 25 lines of nonsense I didn't ask for. All the time I saved using Copilot was lost debugging the garbage Copilot wrote.
Perplexity was nice to bounce plot ideas off of for a game I'm working on... until I kept asking for more and found that it'll only generate the same ~20ish ideas over and over, rephrased every time, and half the ideas are stupid.
The only use case that continues to pique my interest is Notion's AI summary tool. That seems like a genuinely useful application, though it remains to be seen if these sorts of "sidecar" services will justify their energy costs anytime soon.
Now, I ask: if these aren't the "right" use cases for LLMs, then what is, and why do these companies keep putting out products that aren't the "right" use case?
Have you tried it recently? o3-mini-high is really impressive. If you ease into talking to it about your intent and outline the possible edge and corner cases, it will write nuanced Rust code 1,000 lines at a time, no problem.
My anecdotal experience is similar. For any important or hard technical questions relevant to anything I do, the LLM results are consistently trash. And if you are an expert in the domain you can’t not notice this.
On the other hand, for trivial technical problems with well known solutions, LLMs are great. But those are in many senses the low value problems; you can throw human bodies against that question cheaply. And honestly, before Google results became total rubbish, you could just Google it.
I try to use LLMs for various purposes. In almost all cases where I bother to use them, which are usually subject matters I care about, the results are poorer than I can quickly produce myself because I care enough to be semi-competent at it.
I can sort of understand the kinds of roles that LLMs might replace in the next few years, but there are many roles where it isn’t even close. They are useless in domains with minimal training data.
>For any important or hard technical questions relevant to anything I do, the LLM results are consistently trash. And if you are an expert in the domain you can’t not notice this.
This is also my experience. My day job isn't programming, but when I can feed an LLM secretarial work, or simple coding prompts to automate some work, it does great and saves me time.
Most of my day is spent getting into the details on things for which there's no real precedent. Or if there is, it hasn't been widely published on. LLMs are frustratingly useless for these problems.
Because it’s not a scientific research tool, it’s a most likely next text generator. It doesn’t keep a database of ingested information with source URLs. There are plenty of scientific research tools but something that just outputs text based on your input is no good for it.
I’m sure that in the future there will be a really good search tool that utilises an LLM but for now a plain model just isn’t designed for that. There are a ton of other uses for them, so I don’t think that we should discount them entirely based on their ability to output citations.
I think that to understand the diversity of opinions, we have to recognize a few different categories of users:
Category 1: people who don't like to admit that anything trendy can also be good at what it does.
Category 2: people who don't like to admit that anything made by for-profit tech companies can also be good at what it does.
Category 3: people who don't like to admit that anything can write code better than them.
Category 4: people who don't like to admit that anything which may be put people out of work who didn't deserve to be put out of work, and who already earn less than the people creating the thing, can also be good at what it does
Category 5: people who aren't using llms for things they are good at
Category 6: people who can't bring themselves to communicate with AIs with any degree of humility
Category 7: people to whom none of the above applies
I have a bunch of friends who don't get along well with each other, but I tend to get along with all of them. I believe this is about focusing on the good in people, and being able to ignore the bad. I think it's the same with tools. To me AI is an out-of-this-world, OP tool. Is it perfect? No. But it's amazing! The good I get out of it far surpasses its mistakes. Almost like people. People "hallucinate" and say wrong things all the time. But that doesn't make them useless or bad. So, whoever is having issues with AIs is probably having an issue dealing with people as well :) Learn how to deal with people, and learn how to deal with AI -- the single biggest skill you'll need in the 21st century.
Not trying to dismiss your anecdote, as both can very much coexist separately, and I actually think your analogy is spot on for LLMs! But it did make me think of something.
I once was in an environment where A got along with everyone, and B was hated by everyone else except for A. This wasn't because B saw qualities in A that no one else recognized; it was just that A was oblivious to / wasn't personally affected by all the valid reasons why everyone else disliked B. A to an extent thought of themselves as being able to see the good in B, but in reality they simply lacked the understanding of the effects of B's behavior on others.
Agree. Had the same thought when I saw her complaint about getting the publication year of the article wrong. If she had a grad student who could read, understand and summarize the article well, but inadvertently said it was from 2023 instead of 2025, (hopefully) she wouldn’t call that grad student unintelligent.
In general, LLMs have made many areas worse. Now you see people writing content using LLMs without understanding the content itself. It becomes really annoying, especially if you don't know this, ask "did you perhaps write this using an LLM?", and get a "yes" answer.
In programming circles it's also annoying when you try to help and you get fed garbage outputted by LLMs.
I believe models for generating visuals (image, video, and sound generation) are much more interesting, as that's an area where errors don't matter as much. Though the ethics of how these models have been trained is another matter.
The equivalent trope of this as recently as 5 years back would have been the lazy junior engineer copying code from Stack Overflow without fully grokking it.
I feel humans should be held to account for the work they produce irrespective of the tools they used to produce it.
The junior engineer who copied code he didn't understand from Stackoverflow should face the consequences as much as the engineer who used LLM generated code without understanding it.
I think the disconnect is that people who produce serious, researched and referenced work tend to believe that most work is like that. It is not. The majority of content created by humans is not referenced, it's not deeply researched, and a lot of it isn't even read, at least not closely. It sends a message just by existing in a particular place at a particular time. And its creation gainfully employs millions of people, which of course costs millions of dollars. That's why, warts and all, people are bullish on LLMs.
They are bullish for the right reasons. Those reasons are not the same ones you think of. They are betting that humans will keep getting addicted to more and more technological crutches and assistants as they keep inviting more workload onto their minds and bodies. There is no going back with this trend.
Why do we burden ourselves with such expectations? Look at cities like Dallas. It is designed for cars, not for humans walking. The buildings are far apart, workplaces are far away from homes, and everything looks like it was designed for some King Kong-like creatures.
The burden of expectations on humans is driven by technology. Technology makes you work harder than before. It didn't make your life easier. Check how hectic life has become for you now vs a laid-back village peasant a century back.
The bullishness on LLMs is betting on this trend of self-inflicted human agony and dependency on tech. Man is going back to the cradle. LLMs are the milk feeder.
I’ve used Claude today to:
Write code to pull down a significant amount of public data using an open API. (That took about 30 seconds - I just gave it the swagger file and said “here’s what I want”)
Get the data (an hour or so), clean the data (barely any time, gave it some samples, it wrote the code), used the cleaned data to query another API, combined the data sources, pulled down a bunch of PDFs relating to the data, had the AI write code to use tesseract to extract data from the PDFs, and used that to build a dashboard. That’s a mini product for my users.
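The tesseract step is representative of how small these pieces are; a sketch of that piece, assuming pdf2image (which needs poppler) and pytesseract are installed, with a hypothetical filename:

    from pdf2image import convert_from_path   # needs poppler installed
    import pytesseract

    def pdf_to_text(path: str) -> str:
        """OCR every page of a PDF and return the concatenated text."""
        pages = convert_from_path(path, dpi=300)
        return "\n".join(pytesseract.image_to_string(page) for page in pages)

    text = pdf_to_text("filing.pdf")   # hypothetical PDF pulled down earlier
    print(text[:500])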
I also had a play with Mistral’s OCR and have tested a few things using that against the data. When I was out walking my dogs I thought about that more, and have come up with a nice workflow for a problem I had, which I’ll test in more detail next week.
That was all while doing an entirely different series of tasks, on calls, in meetings. I literally checked the progress a few times and wrote a new prompt or copy/pasted some stuff in from dev tools.
For the calls I was on, I took the recordings, passed them into my local Whisper instance, fed the transcript into Claude with a prompt I use to extract action points, pasted those into a Google doc, and circulated them.
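That whole pipeline is a handful of lines; roughly this shape, using openai-whisper locally and the Anthropic SDK, with the model name and filename as placeholders:

    import whisper        # openai-whisper, runs locally (needs ffmpeg)
    import anthropic

    model = whisper.load_model("base")
    transcript = model.transcribe("call_recording.mp3")["text"]   # hypothetical file

    client = anthropic.Anthropic()
    reply = client.messages.create(
        model="claude-3-7-sonnet-latest",   # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": "Extract the action points from this call transcript "
                              "as a bulleted list with owners:\n\n" + transcript}],
    )
    print(reply.content[0].text)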
One of the calls was an interview with an expert. The transcript + another prompt has given me the basis for an article (bulleted narrative + key quotes) - I will refine that tomorrow, and write the article, using a detailed prompt based on my own writing style and tone.
I needed to gather data for a project I’m involved in, so had Claude write a handful of scrapers for me (HTML source > here is what I need).
I downloaded two podcasts I need to listen to - but only need to listen to five minutes of each - and fed them into whisper then found the exact bits I needed and read the extracts rather than listening to tedious podcast waffle.
I turned an article I’d written into an audio file using elevenlabs, as a test for something a client asked me about earlier this week.
I achieved about three times as much today as I would have done a year ago. And finished work at 3pm.
So yeah, I don’t understand why people are so bullish about LLMs. Who knows?
Yuck. Do your users know that they are reading recycled LLM content? Is this long winded post generated by an LLM?
Yeah, they are not “reading recycled LLM content”, no. The dashboard in question presents data from PDFs. They are very happy with being able to explore that data.
So much about this seems inauthentic. The post itself. The experience. The content produced. I wouldn’t like to be on the other end of the production of this content.
This just sounds like a normal day for someone who does research and analysis in 2025.
Where do you think expert analysis comes from?
Talk to experts, gather data, synthesize, output. Researchers have been doing this for a long time. There's a lot of grunt work LLM's can really help with, like writing scripts to collect data from webpages.
Great! You’re not the audience for it.
Who is?
The people who pay for what I do.
What happens on the day when those people just directly pay some AI model to do it?
Then that part of my work will change.
However, as this thread demonstrates repeatedly, using LLMs effectively is about knowing what questions to ask, and what to put into the LLM alongside the questions.
The people who pay me to do what I do could do it themselves, but they choose to pay me to do it for them because I have knowledge they don’t have, I can join the dots between things that they can’t, and I have access to people they don’t have access to.
AI won’t change any of that - but it allows me to do a lot more work a lot more quickly, with more impact.
So yeah, at the point that there’s an AI model that can find and select the relevant datasets, and can tell the user what questions to ask - when often they don’t know the questions they need to have answered, then yes, I’ll be out of a job.
But more likely I’ll have built that tool for my particular niche. Which is more and more what I’m doing.
AI gives me the agency to rapidly test and prototype ideas and double down on the things that work really well, and refine the things that don’t work so brilliantly.
Love the pragmatic and varied use. Nice one and thanks for some ideas.
This sounds like a lot of actions without any verification that the LLM didn't misinterpret things or just make something up.
Well the API calls worked perfectly. The LLM didn’t misinterpret that.
The data extraction via tesseract worked too.
The whisper transcript was pretty good. Not perfect, but when you do this daily you are easily able to work around things.
The summaries of the calls were very useful. I could easily verify those because I was on the calls.
The interview - again, the transcript is great. The bulleted narrative was guided - again - by me having been on the call. I verify the quotes against the transcript, and the audio if I've got any doubts.
Scrapers - again, they worked fine. The LLM didn’t misinterpret anything.
Podcasts - as before. Easy.
Article to voice - what’s to misinterpret?
Your criticism sounds like a lot of waffle with no understanding of how to use these tools.
How do you know a summary of a podcast you haven't listened to is accurate?
Firstly I am not summarising the podcast, simply using whisper to make a transcript.
But even if I was, because I do this multiple times a day and have been for quite some time, I know how to check for errors.
One part of that is a “fact check” built into the prompt, another part is feeding the results of that prompt back into the API with a second prompt and the source material and asking it to verify that the output of the first prompt is accurate.
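That second-pass check is simple to wire up; a minimal sketch of the pattern, assuming the Anthropic Python SDK, with the model name, prompts, and filename as placeholders:

    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-3-7-sonnet-latest"   # placeholder model name

    def ask(prompt: str) -> str:
        reply = client.messages.create(model=MODEL, max_tokens=2000,
                                       messages=[{"role": "user", "content": prompt}])
        return reply.content[0].text

    source = open("transcript.txt").read()   # hypothetical source material
    summary = ask("Summarize the key points of this transcript, quoting exactly "
                  "where possible:\n\n" + source)
    check = ask("Here is a source document and a summary of it. List every claim or "
                "quote in the summary that is NOT supported by the document, or reply "
                f"'OK' if everything checks out.\n\nDOCUMENT:\n{source}\n\nSUMMARY:\n{summary}")
    print(check)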
However the level of hallucination has dropped massively over time, and when you’re using LLMs all the time you quickly become attuned to what’s likely to cause them and how to mitigate them.
I don't mean this in an unpleasant way, but this question - and many of the other comments responding to my initial description of how I use LLMs - feels like the sort of thing people with only hand-wavy experience of LLMs think, having played with the free version of ChatGPT back in the day.
Claude 3.7 is far removed from ChatGPT at launch, and even now ChatGPT feels like a consumer-facing product while Claude 3.7 feels like a professional tool.
And when you couple that with detailed tried and tested prompts via the api in a multistage process, it is incredibly powerful.
Did you also do that while mewing and listening to an AI abridged audiobook version of the laws of power in chinese? Don't forget your morning ice face dunks.
No, I leave that to the highly amusing people like you.
I am neither bullish nor bearish. An LLM is a tool.
It's a hammer -- sometimes it works well. It summarizes the user reviews on a site... cool, not perfect, but useful.
And like every tool, it is useless for 90% of life's situations.
And I know when it's useful because I've already tried a hammer on 1000 things and have figured out what I should be using a hammer on.
Yep, that's exactly how Microsoft is operating as a company recently: Copilot is a hammer—it needs to be used EVERYWHERE.
Forget bug fixes and new feature rollouts, every department and product team at Microsoft needs to add Copilot. Microsoft customers MUST jump on the AI-bandwagon!
>> I am neither bullish or bearish. LLM is a tool...It's a hammer
If someone says, "This new type of hammer will increase productivity in the construction industry by 25%", it's something else in addition to being a tool. It's either a lie, or it's an incredible advance in technology.
It's a lie. Those constructed things would become increasingly sub-par requiring even more maintenance from workers that no longer exist because of the 25% efficiency gain. There is no win here. It's a shortcut for some people that will cost other people more time. It's burden shifting to the others on the team, in the industry, or in the near future. It's weak sauce from weak workers.
The claims are a lie. The tool is still useful. Even if the hammer doesn't increase productivity by 25%, if it feels more comfortable to hold and I can use it for longer, I'm going to be happy with it, independent from any marketing.
I feel like we'll laugh at posts like this in 5 years. It's not inaccurate in any way, it just misses the wood for the trees. Any new technology is always worse in some ways. Smart phones still have much worse battery life and are harder to type on than Blackberries. But imagine not understanding why people are bullish about Smartphones.
It's 100x easier to see how LLM's change everything. It takes very little vision to see what an advancement they are. I don't understand how you can NOT be bullish about LLM's (whether you happen to like them or not is a different question).
Think about the early days of digital photography. When digital cameras first emerged, expert critics from the photograph field were quick to point out issues like low resolution, significant noise, and poor color reproduction—imperfections that many felt made them inferior to film. Yet, those early digital cameras represented a breakthrough: they enabled immediate image review, easy sharing, and rapid technological improvements that soon eclipsed film in many areas. Just as people eventually recognized that the early “flaws” of digital photography were a natural part of a revolutionary leap forward, so too should we view the occasional hallucinations in modern LLMs as a byproduct of rapidly evolving technology rather than a fundamental flaw.
Or how about computer graphics? Early efforts to move 3D graphics hardware into the PC realm were met with extreme skepticism by my colleagues who were “computer graphics researchers” armed with the latest Silicon Graphics hardware. One researcher I was doing some work with in the mid-nineties remarked about PC graphics at the time: “It doesn’t even have a frame buffer. Look how terrible the refresh rate is. It flickers in a nauseating way.” Etc.
It’s interesting how people who are actual experts in a field where there is a major disruption going on often take a negative view of the remarkable new innovation simply because it isn’t perfect yet. One day, they all end up eating their words. I don’t think it’s any different with LLMs. The progress is nothing short of astonishing, yet very smart people continue to complain about this one issue of hallucination as if it’s the “missing framebuffer” of 1990s PC graphics…
I think there are multiple conversations happening that are tying to converge on one.
On one hand, LLMs are overhyped and not delivering on promises made by their biggest advocates.
On the other hand, any other type of technology (not so overhyped) would be massively celebrated in significantly improving a subset of niche problems.
It’s worth acknowledging that LLMs do solve a good set of problems well, while also being overhyped as a silver bullet by folks who are generally really excited about its potential.
Reality is that none of us know what the future is, and whether LLMs will have enough breakthroughs to solve more problems then today, but what they do solve today is still very impressive as is.
Yes, exactly. There is a bell curve of hype, where some people think autoregressive decoders will lead us to AGI if we just give it the right prompt or perhaps a trillion dollars of compute. And there are others who haven’t even heard of ChatGPT. Depending on which slice of the population you’re interacting with, it’s either under or over hyped.
I've been saying the same thing, though in less detail. AI is so driven by hype at the moment that it's unavoidable that it's going to collapse at some point. I'm not saying the current crop of AI is useless; there are plenty of useful applications, but it's clear lots of people expect more from it than it's capable of, and everybody is investing in it just because everybody else is.
But even if it does work, you still need to doublecheck everything it does.
Anyway, my RPG group is going to try roleplaying with AI generated content (not yet as GM). We'll see how it goes.
https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-cas...
> Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
> This includes e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations/directions to explore, you impute the coherent personality to the LLM. And you inject very few bits of steering by doing so, but those bits are load-bearing. If left to their own devices, LLMs won't pick those obviously correct ideas any more often than chance.
I wrote an AI assistant which generates working spreadsheets with formulas and working presentations with neatly laid out elements and styles. It's a huge productivity gain relative to starting from a blank page.
I think LLMs work best when they are used as a "creative" tool. They're good for the brainstorming part of a task, not for the finishing touches.
They are too unreliable to be put in front of your users. People don't want to talk to unpredictable chatbots. Yes, they can be useful in customer service chats because you can put them on rails and map natural language to predetermined actions. But generally speaking I think LLMs are most effective when used _by_ someone who's piloting them instead of wrapped in a service offered _to_ someone.
I do think we've squeezed 90%+ of what we could from current models. Throwing more dollars of compute at training or inference won't make much difference. The next "GPT moment" will come from some sufficiently novel approach.
People are bullish because, despite your rant about quality, even you still use it every day:
> I use GPT, Grok, Gemini, Mistral etc every day in the hope they'll save me time searching for information and summarizing it.
Even worse, you're continually waiting for it to get better. If the present is bright and the future is brighter, bullishness is justified.
I don't trust Sabine Hossenfelder on this. Even I, as a computer scientist/programmer with just some old experience from my master's courses in AI and ML, know much more than she does about how these things work.
She became more of an influencer than a scientist. And there is nothing wrong with that, as long as she doesn't pose as an authority on subjects she doesn't have a clue about. It's OK to have an opinion as an outsider, but it's not OK to pretend you are right and that you are an expert on every scientific or technical subject you happen to want to tweet about.
> I genuinely don't understand why some people are still bullish about LLMs.
I don't believe OP's thesis is properly backed by the rest of his tweet, which seems to boil down to "LLM's can't properly cite links".
If LLM's performing poorly on an arbitrary small-scoped test case makes you bearish on the whole field, I don't think that falls on the LLM's.
Her point is not just "LLMs can't cite links", but "LLMs make shit up". And that is absolutely a problem.
Some web development advice I remember from long ago is to do most of the styling after the functionality is implemented. If it looks done, people will think it is done.
LLMs did the "styling" first. They generate high-quality language output of the sort most of us would take as a sign of high intelligence and education in a human. A human who can write well can probably reason well, and probably even has some knowledge of facts they write confidently about.
When digital cameras first appeared, their initial generations produced low-resolution, poor-quality images, leading many to dismiss them as passing gimmicks. This skepticism caused prominent companies, notably Kodak, to overlook the significance of digital photography entirely. Today, however, film photography is largely reserved for niche professionals and specialized use-cases.
New technologies typically require multiple generations of refinement—iterations that optimize hardware, software, cost-efficiency, and performance—to reach mainstream adoption. Similarly, AI, Large Language Models (LLMs), and Machine Learning (ML) technologies are poised to become permanent fixtures across industries, influencing everything from automotive systems and robotics to software automation, content creation, document review, and broader business operations.
Considering the immense volume of new information generated and delivered to us constantly, it becomes evident that we will increasingly depend on automated systems to effectively process and analyze this data. Current challenges—such as inaccuracies and fabrications in AI-generated content—parallel the early imperfections of digital photography. These issues, while significant today, represent evolutionary hurdles rather than permanent limitations, suggesting that patience and continuous improvement will ultimately transform these AI systems into indispensable tools.
I'm also not bullish on this. In the sense that I don't think LLMs are going to get 10x better, but they are useful for what they can do already.
If I see what Copilot suggests most of the time, I would be very uncomfortable using it for vibe coding though. I think it's going to be... entertaining watching this trend take off. I don't really fear I'm going to lose my job soon.
I'm skeptical that you can build a business on a calculator that's wrong 10% of the time when you're using it 24/7. You're gonna need a human who can do the math.
Like others here, I use it to code (no longer a professional engineer, but keep side projects).
As soon as LLMs were introduced into the IDE, it began to feel like LLM autocomplete was almost reading my mind. With some context built up over a few hundred lines of initial architecture, autocomplete now sees around the same corners I am. It's more than just "solve this contrived puzzle" or "write snake". It combines the subject matter use case (informed by variable and type naming) with the underlying architecture and sometimes produces really breathtaking and productive results. Like I said, it took some time, but when it happened, it was pretty shocking.
Walk into any coffee shop or office and I can guarantee that you'll see several people actively typing into ChatGPT or Claude. If it was so useless, four years on, why would people be bothering with it?
I don't think you can even be bullish or bearish about this tech. It's here and it's changing pretty much every sector you can think of. It would be like saying you're not bullish about the Internet.
I honestly can't imagine life without one of these tools. I have a subscription to pretty much all of them because I get so excited to try out new models.
> It would be like saying you're not bullish about the Internet.
The Internet is great, but it did not usher in a golden age utopia for mankind. So it was certainly possible to overhype it.
She's a scientist. Most of the people on here are writing software which is essentially reinventing the wheel over and over. Of course you have a different experience of LLMs.
"Reinventing the wheel over and over" has been a trillion dollar business. Maybe it won't be for long.
Is that "bullish on LLMs" or not?
I don't know.. I've maintained skepticism, but recently AI has enabled solutions for client problems that would have been intractable with conventional coding. A team was migrating a years-old Excel-based workflow where no less than 3 spreadsheets contained thousands of call notes, often with multiple notes stuffed into the same column, separated inconsistently by a shorthand date and the initials of who was on the call. Sometimes with text arrows or other meta descriptions like (all calls after 3/5 were handled by Tim). They want to move all of this into structured Jira tickets and child tickets.
Joining the mess of freeform, redundant, and sometimes self-contradicting data into JSON lines, and feeding it into AI with a big explicit prompt containing example conversions and corrections for possible pitfalls, has resulted in almost magically good output. I added a 'notes' field to the output and instructed the model to call out anything unusual, and it caught lots of date typos by context, ambiguously attributed notes, and more.
It would have been a man-month or so of soul-drowningly tedious and error-prone intern-level work, but now it was 40 minutes and $15 of Gemini usage.
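Not the actual prompt or schema (those are the client's), but the general shape of that kind of conversion, assuming the google-generativeai package; the model name and example fields are placeholders, and real use needs more robust JSON handling:

    import json
    import google.generativeai as genai   # assumes the google-generativeai package

    genai.configure(api_key="...")                    # hypothetical key
    model = genai.GenerativeModel("gemini-1.5-pro")   # placeholder model name

    PROMPT = """You will get one line of messy call notes.
    Return only JSON with fields: date (ISO 8601), initials, summary, notes.
    Put anything ambiguous or unusual (typos, unattributed notes, meta remarks) in "notes".

    Example input:  3/5 JS - client wants refund?? (all calls after 3/5 handled by Tim)
    Example output: {"date": "2024-03-05", "initials": "JS",
                     "summary": "Client may want a refund.",
                     "notes": "Year assumed; calls after 3/5 handled by Tim."}

    Input: """

    def convert(raw_line: str) -> dict:
        reply = model.generate_content(PROMPT + raw_line)
        text = reply.text.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        return json.loads(text)

    print(convert("4/1 tm followup re invoice, see arrow above"))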
So, even if it's not a galaxy brained super intelligence yet, it is a massive change to be able to automate what was once exclusively 'people' work.
I think there is a lot of denial going around right now.
The present path of AI is nothing short of revolutionary. A lot of jobs and industries are going to suffer a major upheaval, and a lot of people are just living in some wishful-thinking moment where it will all go away.
I see people complaining it gives them bad results. Sure it does; so does every other source of information we parse. It's our job to check it ourselves. Still, the amount of time it saves me, even when I have to correct it, is huge.
I can give an example that has nothing to do with work. I was searching for the smallest miniATX computer cases that would accept at least 3 HDDs (3.5"). The amount of time LLMs saved me is staggering.
Sure, there was one wrong result in the mix, and sure, I had to double check all the cases myself, but, just not having to go through dozens of cases, find the dimensions, calculate the volume, check the HDDs in difficult to read (and sometimes obtain) pages, saved days of work - yes I had done a similar search completely manually about 5 years ago.
This is a personal example, I also have others at work.
It’s truly revolutionary and it’s just starting.
The author of the tweet is a physicist. Her work is at the edge of the boundary of human knowledge. LLM's are useless in this domain, at least when applied directly.
When I use LLM’s to explore applications of cutting edge nonlinear optics, I too am appalled about the quality of the output. When I use an LLM to implement a React program, something that has been done hundreds of times before by others, I find it performs well.
It's a common and fair criticism. LLM-based products promise to save time, but for many complex, day-to-day tasks - adding a feature to a 50M LOC codebase, writing a magazine-quality article, properly summarizing 5 SEC filings - they often don't. They require careful re-validation, and once you find a few obvious issues, trust erodes and the whole thing gets tossed.
This isn't a technology problem, it's a product problem - and one that may not be solvable with better models alone.
Another issue: people communicate uncertainty naturally. We say "maybe", "it seems", "I'm not sure, but...". LLMs suppress that entirely, for structural reasons. The output sounds confident and polished, which warps perception - especially when the content is wrong.
We've had the opposite experience, especially with o3-mini using Deep Research for market research & topic deep-dive tasks. The sources that are pulled have never been 404 for us, and typically have been highly relevant to the search prompt. It's been a huge time-saver. We are just scratching the surface of how good these LLMs will become at research tasks.
I used to be skeptical of the hype but it's hard to deny that they are incredible tools. For coding, they save me a few hours a week and this is just the beginning. A few months ago I would use them to generate simple pieces of code, but now they can handle refactoring across several files. Even if LLMs don't get smarter, the tooling around them will improve and they'll be even more useful. We'll also learn to use them better.
Also my gf who's not particularly tech savvy relies heavily on ChatGPT for her work. It's very useful for a variety of text (translation, summaries, answering some emails).
Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
> Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
Yeah. That was my first thought. There’s probably orders of magnitude more training data for software engineering than for theoretical physics (her field). Also, how much of software engineering is truly novel? Probably someone else has already come up with a decent solution to your problem, it’s “just” a matter of finding it.
Should we listen to Sabine in this case? Isn't this another manifestation of a generally intelligent person, who happens to be an expert in her field weighing in on something she's not an expert on, thinking her expertise transfers?
This is the most common 'smart person' fallacy out there.
As for my 2 cents, LLMs can do sequence modeling and prediction tasks, so as long as a problem can be reduced to sequence modeling (which is a lot of them!), they can do the job.
This is like saying that the Fourier Transform is played out because you can only do so much with manipulating signal frequencies.
Well, she's an expert in the field she's asking the LLM about, and she judges the responses to be nonsense. Would a non-expert be able to tell? This is mostly mirroring my experience with it, which is a straightforward report of "it doesn't work for me, even though I keep trying because everyone else is hyped about it and apparently it's getting better all the time".
I genuinely don't understand why some people are still using X.
Still significantly more popular than the alternatives.
It wasn't a phenomenal platform to begin with and never improved
LLMs are an incredible technological breakthrough which even their creators are using completely incorrectly - in my view, half because they want to make a lot of money, and half because they are enchanted by the allure of their own achievement and are desperate to generalize it. The ability to generate human-style language, art, and other media dynamically on demand, based on a prompt communicated in natural language, is an astonishing feat. It's immensely useful on its own.
But even its creators, who acknowledge it is not AGI, are trying to use it as if it were. They want to sell you LLMs as "AI" writ large, that is, they want you to use it as your research assistant, your secretary, your lawyer, your doctor, and so on and so forth. LLMs on their own simply cannot do those tasks. They are great for other uses: troubleshooting, assisting with creativity and ideation, prototyping concepts of the same, and correlating lots of information, so long as a human then verifies the results.
LLMs right now are flour, sugar, and salt, mixed in a bowl and sold as a cake. Because they have no reasoning capability, only rote generation via prediction, LLMs cannot process contextual information the way required for them to be trustworthy or reliable for the tasks people are trying to use them for. No amount of creative prompting can resolve this totally. (I'll note that I just read the recent Anthropic paper, which uses terms like "AI biology" and "concept" to imply that the AI has reasoning capacity - but I think these are misused terms. An LLM's "concept" of something bears no referent to the real world, only a set of weights to other related concepts.)
What LLMs need is some sort of intelligent data store, tuned for their intended purpose, that can generate programmatic answers for the LLMs to decipher and present. Even then, their tendency to hallucinate makes things tough - they might imagine the user requested something they didn't, for instance. I don't have a clear solution to this problem. I suspect whoever does will have solved a much bigger, more complex problem than the already massive one that LLMs have solved, and if they are able to do so, will have brought us much, much closer to AGI.
I am tired of seeing every company under the sun claim otherwise to make a buck.
I love this. The more people that say "I don't get it" or "it's a stochastic parrot", the more time I get to build products rapidly without the competition that there would be if everyone was effectively using AI. Effectively is the key.
It's cliche at this point to say "you're using it wrong" but damn... it really is a thing. It's kind of like how some people can find something online in one Google query and others somehow manage to phrase things just wrong enough that they struggle. It really is two worlds. I can have AI pump out 100k tokens with a nearly 0% error rate, meanwhile my friends with equally high engineering skill struggle to get AI to edit 2 classes in their codebase.
There are a lot of critical skills and a lot of fluff out there. I think the fluff confuses things further. The variety of models and model versions confuses things EVEN MORE! When someone says "I tried LLMs and they failed at task xyz" ... what version was it? How long was the session? How did they prompt it? Did they provide sufficient context around what they wanted performed or answered? Did they have the LLM use tools if that is appropriate (web/deepresearch)?
It's never a like-for-like comparison. Today's cutting-edge models are nothing like even 6-months ago.
Honestly, with models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3-mini-high, I'm not sure how people fail so hard at prompting and getting quality answers. The models practically predict your thoughts.
Maybe that's the problem, poor specifications in (prompt), expecting magic that conforms to their every specification (out).
I genuinely don't understand why some people are still pessimistic about LLMs.
Great points. I think much of the pessimism is based on fear of inadequacy. Also the fact that these things bring up truly base-level epistemological quandaries that question human perception and reality fundamentally. The average Joe doesn't want to think about how we don't know if consciousness is a real thing, let alone determine whether the robot is.
We are going through a societal change. There will always be people who reject AI no matter the capabilities. I'm at the point where if ANYTHING tells me that it's conscious... I just have to believe it and act according to my own morals.
I am making an effort to use LLMs at work, but in my workflow it's basically just a fancy auto complete. Having a more AI centric workflow could be interesting, but I haven't thought of a good way to rig that up. I'm also not really itching for something to do my puzzles for me. They're what gets me out of bed in the morning.
I haven't tried using LLMs for much else, but I am curious as long as I can run it on my own hardware.
I also totally get having a problem with the massive environmental impact of the technology. That's not AI's fault per se, but it's a valid objection.
I think LLMs have value, but what I'm really looking forward to is the day when everyone can just quietly use (or not use) LLMs and move on with their lives. It's like that one friend who started a new diet and can't shut up about it every time you see them, except instead of that one friend it's seemingly the majority of participants in tech forums. It's getting so old.
The author mentioned Gemini sometimes refusing to do something.
I’ve recently been using Gemini (mostly 2.0 flash) a lot and I’ve noticed it sometimes will challenge me to try doing something by myself. Maybe it’s something in my system prompt or the way I worded the request itself. I am a long time user of 4o so it felt annoying at first.
Since my purpose was to learn how to do something, being open minded I tried to comply with the request, and I can say that... it's been a really great experience in terms of knowledge retention. Even when I'm making mistakes, Gemini will point them out and explain them nicely.
I think the author is being overly pessimistic with this. The positives of an LLM agent outweigh the negatives when used with a Human-in-the-loop.
For people interested in understanding the possibilities of LLMs for use in a specific domain, see The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee (Microsoft Research VP), Isaac Kohane (Harvard Biomedical Informatics MD) et al. It is an easy read showing the authors' systematic experiments with using the OpenAI models via the ChatGPT interface for the medical/healthcare domain.
For a current-status follow-up to the above book, here is Peter Lee's podcast series The AI Revolution in Medicine, Revisited - https://www.microsoft.com/en-us/research/story/the-ai-revolu...
Instead of reading trivial blogs/tweets etc. which are useless, read the above to get a much better idea of an LLM's actual capabilities.
People have different opinions about this, but I think one problem is there are different questions.
One is - Google, Facebook, OpenAI, Anthropic, Deepseek etc. have put a lot of capital expenditure into training frontier large language models, and are continuing to do so. There is a current bet that growing the size of LLMs, with more or maybe even synthetic data, with some minor breakthroughs (nothing as big as the AlexNet deep learning breakthrough, or transformers), will have a payoff for at least the leading frontier model. Similar to Moore's law for ICs, the bet is that more data and more parameters will yield a more powerful LLM - without that much more innovation needed. So the question for this is whether the capital expenditure for this bet will pay off.
Then there's the question of how useful current LLMs are, whether we expect to see breakthroughs at the level of Alexnet or transformers in the coming decades, whether non-LLM neural networks will become useful - text-to-image, image-to-text, text-to-video, video-to-text, image-to-video, text-to-audio and so on.
So there's the business side question, of whether the bet that spending a lot of capital expenditure training a frontier model will be worth it for the winner in the next few years - with the method being an increase in data, perhaps synthetic data, and increasing the parameter numbers - without much major innovation expected. Then there's every other question around this. All questions may seem important but the first one is what seems important to business, and is connected to a lot of the capital spending being done on all of this.
I needed to search if bullish is a positive or a negative expectation.
https://www.nerdwallet.com/article/investing/bullish-vs-bear...
The current discussion about LLMs guarantees that both positive and negative expectations are a valid title for an article xD
I think she is right. And it may have some very real consequences. The latest Gen Alpha are AI natives. They have been using AI for one or two years now. And as they grow up, their knowledge will definitely be built on top of AI. This leads to a few fundamental problems.
1. AI inventing false information that then gets built into their foundational knowledge.
2. There is a lot less problem solving for them once they are used to AI.
I think the fundamentals of education need to take AI, or the current LLM chatbots, seriously and start asking or planning how to react to them. We have already witnessed Gen Z, in the era of Google, thinking they know everything, and if not, Google it. Thinking they "know it all", only to be battered in the real world.
AI may make it even worse.
Can our brains recall precise citations to tens of papers we read a while ago? For the vast majority, no. LLMs function somewhat similarly to our brains in many ways, as opposed to classical computers.
Their strengths and flaws differ from our brains, to be sure, but some of these flaws are being mitigated and improved on by the month. Similarly, unaided humans cannot operate successfully in many situations. We build tools, teams, and institutions to help us deal with them.
> LLMs function somewhat similarly to our brains in many ways,
Including the arrogance to confidently deliver a wrong answer. Which is the opposite of the reasons we use computers in the first place. Why this is worth billions of dollars is utterly beyond me.
> unaided humans cannot operate successfully in many situations
Absolute nonsense driven by a total lack of historical perspective or knowledge.
> We build tools, teams, and institutions to help us deal with them.
And when they lie to us we immediately correct that problem or disband them recognizing that they are more trouble than they could be worth.
>> unaided humans cannot operate successfully in many situations
> Absolute nonsense driven by a total lack of historical perspective or knowledge.
An LLM can give you a list of examples:
Historical Examples:
- During historical epidemics, structured record-keeping and statistical analysis (such as John Snow’s cholera maps in 1854) significantly improved outcomes.
- Development of physics, architecture, and engineering depended heavily on tools such as abacus, logarithmic tables, calculators, slide rules, to supplement human cognitive limitations.
- Astronomical calculations in ancient civilizations (Babylonian, Greek, Mayan) depended heavily on abacuses, tables, and other computational tools.
- The pyramids in ancient Egypt required extensive use of tools, mathematics, coordinated human labor, and sophisticated organization.
It's a fair point.
An LLM can do some pretty interesting things, but the actual applicability is narrow. It seems to me that you have to know a fair amount about what you're asking it to do.
For example, last week I dusted off my very rusty coding skills to whip up a quick and dirty Python utility to automate something I'd done by hand a few too many times.
My first draft of the script worked, but was ugly and lacked any trace of good programming practices; it was basically a dumb batch file, but in Python. Because it worked part of me didn't care.
I knew what I should have done -- decompose it into a few generic functions; drive it from an intelligent data structure; etc -- but I don't code all the time anymore, and I never coded much in Python, so I lack the grasp of Python syntax and conventions to refactor it well ON MY OWN. Stumbling through with online references was intellectually interesting, but I also have a whole job to do and lack the time to devote to that. And as I said, it worked as it was.
But I couldn't let it go, and then had the idea "hey, what if I ask ChatGPT to refactor this for me?" It was very short (< 200 lines), so it was easy to paste into the Chat buffer.
Here's where the story got interesting. YES, the first pass of its refactor was better, but in order to get it to where I wanted it, I had to coach the LLM. It took a couple passes through before it had made the changes I wanted while still retaining all the logic I had in it, and I had to explicitly tell it "hey, wouldn't it be better to use a data structure here?" or "you lost this feature; please re-add it" and whatnot.
In the end, I got the script refactored the way I wanted it, but in order to get there I had to understand exactly what I wanted in the first place. A person trying to do the same thing without that understanding wouldn't magically get a well-built Python script.
While I am amazed at the technology, at the same time I hate it. First, 90% of people misinterpret it and cite its output as fact. Second, it needs too much energy. Third, energy consumption is rarely mentioned in nerdy discussions about LLMs.
'But you're such a killjoy.'
Yes, it is an evil technology in its current shape. So we should focus on fixing it, instead of making it worse.
Note that there is a difference between being "bullish" (i.e. a market that is trending upwards) and "useful". I think there is general value in LLMs for semantic search and information extraction, but not as an exclusive path to AGI that some of the market expects for its overinflated valuation.
I say the author tells us more than the headline or first sentence of the post. If you have recently scrolled through Sabine's posts on Twitter, or her clickbaity thumbnails, facial expressions and headlines on YouTube [1], you would see that she is all-in for clicks. She often takes a popular belief, then negates it, and throws around counter-examples of why X or Y has absolutely failed. It's a repeating pattern to gain popularity, and it seems to work not only on Twitter and YouTube, but even here on Hacker News, given the massive amount of upvotes her post has.
[1] https://youtube.com/@SabineHossenfelder/featured
She's a physicist. LLMs are not for creating new information. They're for efficiently delivering established information. I use it to quickly inform me about business decisions all the time, because a thousand answers about those questions already exist. She's just using it for the wrong thing.
I think LLM skeptics and cheerleaders all have a point but I lean toward skepticism. And that's because though LLMs are easy to pick up, they're impossible to "master" and are extremely finicky. The tech is really fun to tinker with and capable of producing some truly awesome results in the hands of a committed practitioner. That puts it in the category of specialized tools. Despite this fact, LLMs are hyped and valued by the markets like they are breakthrough consumer products. My own experience plus the vague/underwhelming adoption and revenue numbers reported by the major vendors tell me that something's not quite right in this area of the industry.
I understand how developers can come to this conclusion if they're only using local models that can run on consumer GPUs since there's a time cost to prompting and the output is fairly low quality with a higher probability of errors and hallucinations.
But I don't understand how you can come to this conclusion when using SOTA models like Claude Sonnet 3.7; its responses have always been useful, and when it doesn't get it right the first time you can keep prompting it with clarifications and error responses. On the rare occasion it's unable to get it right, I'm still left with a bulk of useful code that I can manually fix and refactor.
Either way, my interactions with Sonnet are always beneficial. Maybe it's a prompt issue? I only ask it to perform small, specific, deterministic tasks and provide the necessary context (with examples when possible) to achieve them.
I don't vibe code or unleash an LLM on an entire code base since the context is not large enough and I don't want it to refactor/break working code.
I genuinely don't understand why some people are so critical of LLMs. This is new tech, we don't really understand the emergent effects of attention and transformers within these LLMs at all. It is very possible that, with some further theoretical development, LLMs which are currently just 'regurgitating and hallucinating' can be made to be significantly more performant indeed. In fact, reasoning models - when combined with whatever Google is doing with the 1M+ ctxt windows - are much closer to that than people who were using LLMs expected.
The tech isn't there yet, clearly. And stock valuations are way overboard. But LLMs as a tech != the stock valuations of the companies. And LLMs as a tech are here to stay, improve, and integrate into everyday life more and more - with massive impacts on education (particularly K-12) as models get better at thinking and explaining concepts, for example.
The key for LLM productivity, it seems to me, is grounding. Let me give you my last example, from something I've been working on.
I just updated my company's commercial PPT. ChatGPT helped me with:
- Deep Researching great examples and references for such presentations.
- Restructuring my argument and slides according to some articles I found in the previous step and thought were pretty good.
- Coming up with copy for each slide.
- Iterating on new ideas as I was progressing.
Now, without proper context and grounding, LLMs wouldn't be so helpful at this task, because they don't know my company, clients, product and strategy, and would be generic at best. The key: I provided it with my support portal documentation and a brain dump I recorded to text on ChatGPT with key strategic information about my company. Those are two bits of info I keep always around, so ChatGPT can help me with many tasks in the company.
From that grounding to the final PPT, it's pretty much a trivial and boring transformation task that would have cost me many, many hours to do.
The way I see it, it's less about the technicalities of accuracy and more about the long term human and societal problems it presents when widely adopted.
On one hand, every new technology that comes about unregulated creates a set of ethical and in this particular case, existential issues.
- What will happen to our jobs?
- Who is held accountable when a car navigation system designed by an LLM goes haywire and causes an accident?
- What will happen with education if we kill all entry level jobs and make technical skills redundant?
In a sense they're not new concerns in science, we research things to make life easier, but as technology advances, critical thinking takes a hit.
So yeah, I would say people are still right to be wary of LLMs rather than 'bullish', as that's the normal response to disruptive technology, and it will help us create adequate regulations to safeguard the future.
Simple example: My company is using Gemini to analyze every 13F filing and find correlations between all S&P500 companies the minute new earnings are released. We profited millions off of this in the last six months or so. Replicating this work alone without AI would require hiring dozens of people. How can I not be bullish on LLMs? This is only one of many things we are doing with it.
I do not understand how you can be bearish on LLMs. Data analysis, data entry, agents controlling browsers, browsing the web, doing marketing, doing much of customer support, writing BS React code for a promo that will be obsolete in 3 months anyway.
The possibilities are endless, and almost every week, there is a new breakthrough.
That being said, OpenAI has no moat, and there definitely is a bubble. I'm not bullish on AI stocks. I'm bullish on the tech.
LLMs as a tool that you use and check can be useful, especially for code. However, I think that putting some LLM in your customer-facing app/SaaS/game is like using the "I'm feeling lucky" button when Google introduced it. It only works for trivial things that you might not even have had to use a search engine for (finding the address of a website you had already visited). But since it's so cheap to implement and feels like it's doing the work of humans for a tiny fraction of the cost, they won't care about the customers and will implement it anyway. So it'll probably flood any system that can get away with basic mistakes - hopefully not systems where human lives are at stake.
LLMs are like any tool, you get what you put in. If you are frustrated with the results, maybe you need to think about what you're doing.
300/5290 functions decompiled and analyzed in less than three hours off of a huge codebase. By next weekend, a binary that had lost source code will have tests running on a platform it wasn't designed for.
What is this product so I can avoid it?
Need source.
I have such a hard time getting a read on Sabine Hossenfelder. I wish I was a physicist so I knew if she was full of shit
Sabine has moved well into the deep-end and has all sorts of bizarre anti-science content at this point. I had been getting more and more uncomfortable with some of her videos and couldn't really put it into words until one of the guys who blew up debunking flat earther stuff, professor dave, put out a few videos on what she's been doing.
https://www.youtube.com/watch?v=70vYj1KPyT4 https://www.youtube.com/watch?v=6P_tceoHUH4 https://www.youtube.com/watch?v=nJjPH3TQif0
She's a YouTuber who is happy to take whatever position she thinks will get her the most views and ad revenue, all while crying woe is me about how the scientific establishment shuns her.
As a former theoretical high energy physicist (who believes the field does have problems): yes, and there's no colleague of mine that I know of who gives a shit about what she says. She's just a YouTuber to us. As a general rule of thumb, you can assume any "celebrity" "scientist" constantly feuding in public is full of shit.
Sabine is on my blocklist as she very often puts out really ignorant and short-sighted perspectives.
LLMs are the most impactful technology we've had since the internet; that is why people are bullish on them. Anyone who fails to see that probably cannot tie their own shoes without a "peer-reviewed" mechanism, lol.
To each their own. I give o3 with Deep Research my notes and ask it for a high-level design document, then feed that to Claude and get a skeleton of a multi-service system, then build out functionality in each service with subsequent Claude requests.
Sure, it does middle-of-the-road stuff, but it comments the code well, I can manually tweak things at the various levels of granularity to guide it along, and the design doc is on par with something a senior principal would produce.
I do in a week what a team of four would take a month and a half to do. It's insane.
Sure, don't be bullish. I'm frantically piecing together enough hardware to run a decent sized LLM at home.
AI sludge in, AI sludge out.
It's just another version of someone who is relatively incompetent but can produce something vaguely convincing.
Two things can be true:
1) LLMs are a wonderous technology, capable of doing some really ingenious things
2) The hundreds of billions spent on them will not meet a positive ROI
Bonus 3) they are not good at everything & they are very bad at some things, but they are sold as good at everything.
This user just doesn't understand how to use an LLM properly.
The best solution to hallucination and inaccuracy is to give the LLM mechanisms for looking up the information it lacks. Tools, MCP, RAG, etc are crucial for use cases where you are looking for factual responses.
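A minimal sketch of that idea, assuming the OpenAI Python client (v1+) and its tool-calling interface; search_papers here is a stand-in for whatever real retrieval layer (RAG store, paper API, MCP server) you actually wire up:

    # Minimal sketch: let the model call a lookup tool instead of guessing.
    # Assumes OPENAI_API_KEY is set; search_papers() is a placeholder retriever.
    import json
    from openai import OpenAI

    client = OpenAI()

    def search_papers(query: str) -> str:
        # Placeholder: swap in your real index so the model can only cite
        # things that actually exist. A fixed result keeps the sketch runnable.
        return json.dumps([{"title": f"Stub result for: {query}", "url": "https://example.org"}])

    tools = [{
        "type": "function",
        "function": {
            "name": "search_papers",
            "description": "Look up real papers; never invent citations.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Give me a source for that claim about X."}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message

    if msg.tool_calls:  # the model chose to look something up instead of guessing
        call = msg.tool_calls[0]
        result = search_papers(**json.loads(call.function.arguments))
        messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
        final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
        print(final.choices[0].message.content)
    else:
        print(msg.content)

The point is simply that the citation comes out of the retrieval step, not out of the model's weights; everything else is plumbing.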
I'm not bullish, because many of these models rest on the consumption of copyrighted material or information that wasn't intended for mass consumption in this way.
Also, I like to think for myself. Writing code and thinking through what I am writing often exposes edge cases that I wouldn’t otherwise realize.
I was previously a non believer in LLMs, but I've come around to accepting them. Gemini has saved me so much time and energy it's actually insane. I wouldn't be able to do my work to my satisfaction (which is highly technical) without its support.
One is the boss's view, looking for an AI to replace his employees someday. I think that is a dead end. It just keeps getting more sophisticated and increasingly impressive, but it won't work.
The other is the worker's view, looking at AI as a powerful tool that can leverage one's productivity. I think that is looking promising.
I don't really care for the chat bot to give me accurate sources. I care about an AI that can provide likely places to look for sources and I'll build the tool chain to lookup and verify the sources.
A bunch of comments here seem to be missing a point : the author is (at least was ?) a scientist.
Her primary work interest is in the truth, not the statistically plausible.
Her point is that using LLM to generate truth is pointless, and that people should stop advertising llms as "intelligent", since, to a scientist, being "intelligent" and being "dead wrong" are polar opposite.
Other use cases have feedback loops - it does not matter so much if Claude spits out wrong code, provided you have a compiler and automated tests.
Scientists _are_ acting as compilers to check truth. And they rely on truths compiled by other scientists, just like your programs rely on code written by other people.
What if I tell you that, from now on, any third-party library that you call will _statistically_ work 76% of the time, and I have no clue what it does in the remaining X%? (I don't know what X is; I haven't asked ChatGPT yet.)
In the meantime, I still have to see a headline "AI X discovered life-changing new Y on its own" (the closest thing I know of is alpha fold, which I both know is apparently "changing the world of scientists", and yet feel has "not changed the world of your average joe, so far" - emphasis on the "so far" ) ; but I've already seen at least one headline of "dumb mistake made because an AI hallucinated".
I suppose we have to hope the trend will revert at some point ? Hope, on a Friday...
I don’t think it’s fair to say LLMs are “statistically plausible” text generators. It ignores the emergent world model that enables them to generalise outside of their training set. LLMs aren’t a lookup table, and they’re not n-grams.
I pay our lawyer quite a lot and he also makes mistakes. What have you: typos, somewhat imprecise language in contracts. But we work to fix it and everything's OK.
People are looking for perfect instead of better.
AI coding overall still seems to be underrated by the average developer.
They try to add a new feature or change some behavior in a large existing codebase and it does something dumb and they write it off as a waste of time for that use case. And that's understandable. But if they had tweaked the prompt just a bit it actually might've done it flawlessly.
It requires patience and learning the best way to guide it and iterate with it when it does something silly.
Although you undoubtedly will lose some time re-attempting prompts and fixing mistakes and poor design choices, on net I believe the frontier models can currently make development much more productive in almost any codebase.
"Bullish" I like these tools and they give me great power and flexibility.
vs
"Bullish" The LLM is going to revolutionise human behaviour and thought and bring about a new golden age.
Former is justifiable, the latter is just reinforcing the bubble.
"Yes, I have tried Gemini, and actually it was even worse in that it frequently refuses to even search for a source and instead gives me instructions for how to do it myself. Stopped using it for that reason."
Thank you Sabine. Every time I have mentioned that Gemini is the worst, and not even worthy of consideration, I have been bombarded with downvotes and told I am using it wrong.
https://nitter.net/skdh/status/1905132853672784121
For those who have Twitter blocked.
I feel like at this point when people make some claim about LLMs they need to actually include the model they are using. So many "LLMs can do / can't do X" claims, without reference to the model, which I think is relevant.
I am hoping that the LLM approach will face increasingly diminishing returns, however. So I am biased toward Sabine's griping. I don't want LLMs to go all the way to "AGI".
Well you see, there are lots of people that are dependent on bullshitting their way through life, and don't actually contribute anything new or novel to the world. For these people, LLMs are great because they can generate more bullshit than they ever could before. For those that are actually trying to do new and interesting things, well, an LLM is only as good as what it has seen before, and you're doing something new and exciting. Congratulations, you beat the machine - until they steal your work and feed it to the LLM.
Ah hah, this is what I think. LLMs are the ultimate bullshitting machines! When you need to produce some BS fast, an LLM is definitely the right tool for that job.
She should try Perplexity.
I hope she is aware of the limited context window and ability to retrieve older tokens from conversations.
I have used LLMs for the exact same purposes she has - summarize chapters or whole books and find the source of a quote - both with success.
I think the key to a successful output lies in the way you prompt it.
Hallucinations should be expected though; as we all hopefully know, LLMs are more of an autocomplete than an intelligence, and we should stick to that mindset.
I can't believe they're complaining about LLMs not being able to do unit conversion. Of all things, this is the least interesting and last task I would ever ask an LLM to do.
I can ask chatgpt to generate a painting of my wife and it does it instantly. That would cost thousands of pounds from a human.
People who don't work in tech have no idea how hard it is to do certain things at scale. Skilled tech people are severely underappreciated.
From a sub-tweet:
>> no LLM should ever output a url that gives a 404 error. How hard can it be?
As a developer, I'm just imagining a server having to call up all the URLs to check that they still exist (and the extra costs/latency incurred there)... And if any URLs are missing, getting the AI to re-generate a different variant of the response, until you find one which does not contain the missing links.
And no, you can't do it from the client side either... It would just be confusing if you removed invalid URLs from the middle of the AI's sentence without re-generating the sentence.
You almost need to get the LLM to engineer/pre-process its own prompts in a way which guesses what the user is thinking in order to produce great responses...
Worse than that though... A fundamental problem of 'prompt engineering' is that people (especially non-tech people) often don't actually fully understand what they're asking. Contradictions in requirements are extremely common. When building software especially, people often have a vague idea of what they want... They strongly believe that they have a perfectly clear idea but once you scope out the feature in detail, mapping out complex UX interactions, they start to see all these necessary tradeoffs and limitations rise to the surface and suddenly they realize that they were asking for something they don't want.
It's hard to understand your own needs precisely; even harder to communicate them.
"How hard can it be?"
If I recall correctly, that is one of Dilbert's management axioms: if I don't understand it, it cannot be difficult.
Indeed.
And I have used the following response to pointy haired bosses on a couple of occasions ( though I don't recommend it ).
'If it's so easy - feel free to do it yourself'.
>> no LLM should ever output a url that gives a 404 error. How hard can it be?
Very easy. We can replace the error handler with a bullshit generator, and these people will be satisfied, as the whole idea is bullshit by the way.
I think it could be possible if you're using thinking tokens or via function calls - decide all URLs that will be mentioned while thinking (don't display this to the user), delay for a few seconds to test they exist, push that into the prompt history and then output what the user sees. But it's a bit of an edge case really and would probably annoy users with the extra latency.
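For what it's worth, the checking half of that is tiny. Here's a rough Python sketch assuming the requests library, with the regeneration step left to whatever LLM client you use:

    # Rough sketch: pull URLs out of a draft answer, probe them, and only then
    # decide whether to show the draft or ask the model for a rewrite.
    import re
    import requests

    URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

    def dead_links(draft: str, timeout: float = 5.0) -> list[str]:
        """Return every URL in the draft that does not answer with a 2xx/3xx status."""
        dead = []
        for url in URL_RE.findall(draft):
            try:
                resp = requests.head(url, allow_redirects=True, timeout=timeout)
                if resp.status_code >= 400:
                    dead.append(url)
            except requests.RequestException:
                dead.append(url)
        return dead

    draft = "See https://example.com/ and https://example.com/made-up-404 for details."
    bad = dead_links(draft)
    if bad:
        # Feed this back into the conversation and ask the model to cite only
        # live URLs - at the cost of the extra latency mentioned above.
        print("Regenerate without:", bad)
    else:
        print(draft)

So the objection isn't really feasibility; it's the added latency and the extra generation passes.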
It could all be done quite easily if that were the priority, but it's not, because the priority is hype: keeping the cycle going to mint a few more ultra-wealthy assholes. Nothing about fixing 404s would require resources unavailable to these mega-corporations, and you and others carrying their water is part of why they don't.
Why do people who don't like using LLMs keep insisting they are useless for the rest of us? If you don't like to use them, then simply don't use them.
I use them almost daily in my job and get tremendous use out of them. I guess you could accuse me of lying, but what do I stand to gain from that?
I've also seen people claim that only people who don't know how to code, or people doing super simple done-a-million-times apps, can get value out of LLMs. I don't believe that applies to my situation, but even if it did, so what? I do real work for a real company delivering real value, and the LLM delivers value to me. It's really as simple as that.
Even outside of work. Two personal examples, out of dozens this week:
1. Asked ChatGPT for a table showing monthly daily max temp, rainfall in mm and numbers of rain days, for Vietnam, Cambodia and Thailand. And colour coded based on the temperatures. Then suggest two times of year, and a route direction, to hit the best conditions on a multi-week trip.
It took a couple of seconds, and it helpfully split Vietnam at Hanoi and HCM given their weather differences.
2. I'm trying to work out how I will build a chicken orchard - post material, spacing, mesh, etc. I asked ChatGPT for a comparison table of steel posts versus timber, and then to cost it out with varying scales of post spacing. Plus pros and cons of each, and likely effort to build. Again, it took a few seconds, including browsing local stores for indicative pricing.
On top of that, I've been even more impressed by a first week testing Cursor.
Yes, of course they did that. These things happily do whatever you ask them to. But did it spit out correct information?
Yeah, it did. And uncovered a local timber supply place I wasn't aware of, with decent pricing. I'm not expecting a final quote, but quick ways to compare and consider variations. I've used it in the same way to compare cladding costs/work for lining a building.
I don't think it's a matter of liking it or not. The use cases just differ considerably, and the tools are not equally useful or applicable across them. The OP's use case is probably one of the worst possible for LLMs right now, imo...
The internet is full of people spouting their opinions and implying other opinions are wrong. Both sides of the AI/LLM (and all in-between) are well represented.
I don't actually like LLMs that much or find them useful, but its clear there are some things they are good at and some things they are bad at, and this post is all about things they are weak at.
LLMs are a little bit magical but they are still a square peg. The fact they don't fit in a round hole is uninteresting. The interesting thing to debate is how useful they are at the things they are good at, not at the things they are bad at.
I haven't met any developers in real life who are hyped about LLMs. The only people who seem excited are managers and others who don't really know how to program.
> "By my personal estimate currently GPT 4o DeepResearch is the best one."
If the o3-based, three-month-old strongest model is the best one, it's proof that there were quite significant improvements in the last 2 years.
I can't name any other technology that improved as much in 2 years.
o1 and o1 pro helped me with filing tax returns and answered questions for me that (probably quite bad) tax accountants (and less smart models) weren't able to (of course I read the referenced laws; I don't trust the output either).
I don't mean to be disparaging to the original author, but I genuinely think a good litmus test today for the calibre of a human researcher's intelligence is what they are able to get out of the current state-of-the-art AI with all of its faults. Contrasting any of Terence Tao's comments over the last year with the comments above is telling. Perhaps it seems unfair to contrast such a celebrated mathematician with a popular science author, but in fact one would a priori expect that AI ought to be less helpful for the most talented mind in a field. Yet we seem to find exactly the opposite: this cottage industry of "academics" who, now more than a year since LLMs entered popular consciousness, seem to do nothing but decry "AI hype" - while both everyday people and seemingly the most talented researchers continue to pursue what interests them at a higher level.
If I were to be cynical, I think we've seen over the last decade the descent of most of academia, humanities as much as natural sciences, to a rather poor state, drawing entirely on a self-contained loop of references without much use or interest. Especially in the natural sciences, one can today with little effort obtain an infinitely more insightful and, yes, accurate synthesis of the present state of a field from an LLM than 99% of popular science authors.
ITT: ppl saying LLMs are v helpful
The keyword in title is "bullish". It's about the future.
Specifically I think it's about the potential of the transformer architecture & the idea that scaling is all that's needed to get to AGI (however you define AGI).
> Companies will keep pumping up LLMs until the day a newcomer puts forward a different type of AI model that will swiftly outperform them.
https://twitter-thread.com/t/1905132853672784121
Upon asking ChatGPT multiple times, it just wouldn't tell me why Vlad the Impaler has this name. It will refuse to say anything cruel, I guess, but it's very frustrating when history is not represented truthfully. When I'm asking about topics unknown to me, what is it hiding from me? I don't know; it's awful.
ChatGPT has a share button. This is like a bug report with no repro steps.
If people aren't linking the conversation, it's really hard to take the complaint seriously.
This is interesting. I have replaced Google search with ChatGPT and Meta's AI. They more than deliver. Thinking about my use cases, I use Google to recollect things or to give me a starting point for further research. LLMs are great for that, so I am never going back to Google. However, I am curious about the cases where the OP sees this great gap and these failures.
This article seems to focus on the shortcomings of LLMs being wrong, but fails to consider the value LLMs provide and just how large of an opportunity that value presents.
If you look at any company on earth, especially large ones, they all share the same line item as their biggest expense: labor. Any technology that can reduce that cost represents an overwhelmingly huge opportunity.
OP highlights application problems, and RAG specifically. But that is not an LLM problem.
Chat is such a “leaky” abstraction for LLMs
I think most people share the same negative experience because they only interact with LLMs through the chat UIs from OpenAI and Anthropic. The real magic moment for me was still the autocompletion moment from GitHub Copilot.
So, after reading many of the comments here:
What is a good tutorial or training to learn about LLMs from scratch?
I guess that’s the main problem, even more so for non-developers and -tech people. The learning curve is too steep, and people don’t know where to start.
Funny how most of the counter comments here used the form "my experience is different/it's amazing!" and then listed activities that are completely different from what Sabine listed :)
We really should stop reinforcing our echo bubbles and learn from other people. And sometimes be cool in the face of criticism.
She's a scientist. In that area LLMs are quite useful in my opinion and part of my daily workflow. Quick scripts that use APIs to get data, cleaning the data and converting it. Quickly working with Polars data frames. Dumb daily stuff like "take this data from my CSV file and turn it into a LaTeX table"... but most importantly, freeing up time from tedious administrative tasks (let's not go into detail here).
Also great for brainstorming and quick drafting grant proposals. Anything prototyping and quickly glued together I'll go for LLMs (or LLM agents). They are no substitute for your own brain though.
I'm also curious about the hallucinated sources. I've recently read some papers on using LLM-agents to conduct structured literature reviews, and they do it quite well and fairly reproducibly. I'm quite willing to build some LLM-agents to reproduce my literature review process in the near future since it's fairly algorithmic. Check for surveys and reviews on the topic, scan for interesting papers within, check sources of sources, go through A-tier conference proceedings for the last X years and find relevant papers. Rinse, repeat.
I'm mostly bullish because of LLM-agents, not because of using stock models with the default chat interface.
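For the curious, here's a toy sketch of that kind of citation-chasing loop; find_surveys, fetch_references and llm_is_relevant are hypothetical stubs standing in for a real paper API and a real model call, not anything I've shipped:

    # Toy sketch of the review loop described above: surveys first, then
    # sources of sources, breadth-first up to a fixed depth.
    from collections import deque

    def find_surveys(topic: str) -> list[str]:
        return ["survey-paper-id-1"]   # stub: e.g. query a paper search API

    def fetch_references(paper_id: str) -> list[str]:
        return []                      # stub: e.g. the paper's bibliography

    def llm_is_relevant(paper_id: str, topic: str) -> bool:
        return True                    # stub: e.g. ask the model to score the abstract

    def literature_review(topic: str, max_depth: int = 2) -> set[str]:
        """Breadth-first pass over surveys and their references, up to max_depth."""
        seen, keep = set(), set()
        queue = deque((pid, 0) for pid in find_surveys(topic))
        while queue:
            pid, depth = queue.popleft()
            if pid in seen or depth > max_depth:
                continue
            seen.add(pid)
            if llm_is_relevant(pid, topic):
                keep.add(pid)
                queue.extend((ref, depth + 1) for ref in fetch_references(pid))
        return keep

    print(literature_review("LLM agents for structured literature reviews"))

The important property is that every identifier in the output came from a real lookup, so the hallucinated-citation failure mode the OP complains about never enters the loop.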
It’s not about what they can do, it’s about what people think they can do and try to use them for, which is, on average, incredibly wrong.
They are impressive for what they are… but do you know what they are? I do, and that’s why I’m not that hyped about them.
LLM interactions are more a reflection of the user than of the technology itself. I'm starting to believe that people who think LLMs are terrible are suboptimal communicators. Because their input is bad, the output is too.
My experience mirrors hers. Asking questions is worthless because the answers are either 404 links, telling me how to use a search engine, or just flat out wrong and the code generated compiles maybe one time out of ten and when it does the implementation is usually poor.
When I evaluate them against areas where I possess professional expertise, I become convinced LLMs produce the Gell-Mann amnesia effect for any area I don't know.
I'm just glad I have something that can write logic in any unix shell script for me. It's the right combination of thing I don't have to do often, thing that doesn't fit my mental model, thing that works differently based on the platform, and thing I just can't be bothered to master.
I wonder when we can get an LLM which might be more "stupid" but knows what it does not know, rather than hallucinating... Though perhaps by then it would not be an LLM but entirely different tech :)
"Yes, it covers 99% of use cases, but look at this 1%...."
It is the same discussion as with autonomous driving: it causes far fewer accidents than humans, but look at these anecdotal accidents...
> 99% of use cases
You're being very generous.
To me, the bullishness stems from the observation that the current generation of LLMs can plausibly lead to a route to human-level intelligence / AGI, because they have this causal behavior: (memory/context + inputs) --> outputs. Vaguely similar to humans.
It feels like OP is not using LLM tools correctly. Yes, they hallucinate, but I've found they rarely do on the first run. It's only when you insist that they start doing it.
Experts will be in denial of LLMs for a long time, while the non-experts will swiftly use it to bridge their own knowledge gap. This is the use case for LLMs, maybe more so than 100% correctness.
Well strap in, because I only see this getting worse (or better, depending on your outlook). Faster chips, more power, more algorithm research and more data. I don't know what's coming exactly, but changes are coming fast.
WORD OF ADVICE FOR LLM BEARS!!
If you have tried to use LLMs and find them useless or without value, you should seriously consider learning how to correctly use them and doing more research. It is literally a skill issue and I promise you that you are in the process of being left behind. In the coming years Human + AI cooperatives are going to far surpass you in terms of efficiency and output. You are handicapping yourself by not becoming good at using them. These things can deliver massive value NOW. You are going to be a grumbling gray beard losing your job to 22 year old zoomers who spend 10 hours a day talking to LLMs.
If the future is not LLMs, then this is a waste of time.
If the future is only LLMs then we’re all cooked.
If the future is a hybrid, then those “grey beard” skills are a much higher barrier to entry than a few months tinkering with ChatGPT.
Please stop spreading FUD.
Weight the three possibilities however you want, but I think in scenario 3 you are coping hard and the grey-beard skills aren't nearly as valuable as you think. Is this not already the case? Grey beards already have well-known problems with age discrimination despite having such a unique and hard-to-earn skill set, no?
LLMs are not difficult to use and learn though, if so called "greybeard" skills are valueless (or close to) then knowing how to use an LLM certainly won't be valuable either!
Just like in the 2000-2010's knowing how to effectively Google things (while undoubtedly a skill) wasn't what made someone economically valuable.
>LLMs are not difficult to use and learn though, if so called "greybeard" skills are valueless (or close to) then knowing how to use an LLM certainly won't be valuable either!
I mean I challenge you to show you're as skilled at using LLMs as Janus or Pliny the Liberator.
https://x.com/repligate https://x.com/elder_plinius
I feel that sentiment on LLMs has almost approached a left/right-style ideological divide, where both sides seem to have a totally different interpretation of what they are seeing and what is important.
I'm genuinely just unimpressed by computers and space craft. Sometimes they just won't boot up, or blow up. I also just am unimpressed by wireless 4G/5G internet,... the printing press. (sarcasm)
Translation: "I genuinely don't understand why some people are still bullish about power tools [when they could just keep using hand saws]."
Can anyone offer insight and/or thoughts on his mention of knowledge graphs helping (beyond KGs providing robust source attribution)?
I've been looking at these recently but not specifically to solve "LLM issues".
Oh yes, LLM's are a failure when measured against my very specific metrics in my specific use case. Sigh.
Actually I think they work really well, in any case where you can detect errors, which the OP apparently can.
LLMs are not a miracle, they are a type of tool. The hype I am angry about is the “black magic? Well, clearly that can solve every problem” mode of thought.
A better model would be good for the stock market. The S&P 500 is shovel town. What would be bad is if AI fizzles out and doesn't deliver on the hype valuations.
in 2019 these things could hardly speak, in 2024 we trained one that outperforms all but the very best in the world at competition programming.
the bull case very obviously speaks for itself!
For me, using an LLM is simply about overcoming the biggest initial hurdle to getting something done. It allows me to eliminate that obstacle.
I realized it would be a bubble this Xmas when I ran across a neighbor of mine walking their dog. I told them what I did and they immediately asked if I worked with AI; they were looking for AI programmers. No regard for details, or what field, just get me AI programmers.
He's a person with money and he wants AI programmers. I bet there are millions like him.
Don't get me wrong though, I do believe in a future with LLMs. But I believe they will become more and more specialized for specific tasks. The more general an AI is, the more it's likely to fail.
It's more or less the same for every hype technology. Most people only glance over the executive slides version of the description of the technology (which is understandable because they have different priorities). Others have already told them that this is the "new thing" and they need it.
They suck. Just this week I had a Docker container Caddy issue, and they repeatedly made nonsensical suggestions.
It’s like a parrot and if you know what you’re doing you can catch tons of mistakes.
I use them every day and they work great. I even made a command (using Claude; actually Claude made everything in that script) that calls Gemini from the terminal so that I can ask shell-related questions directly there, just by doing: ai "how can I convert a webp to a png". The system prompt asks it to be brief, to use markdown (it displays nicely), and says that most questions are related to Linux; it also provides information about my OS (uname -a). The last code block is also copied to the clipboard. Super useful. I imagine there are plenty of similar utilities online.
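For anyone who wants to build something similar, here's a rough Python sketch, assuming the google-generativeai package and a GEMINI_API_KEY environment variable (the clipboard trick is left out):

    #!/usr/bin/env python3
    # Rough sketch of an "ai" terminal helper along the lines described above.
    import os
    import platform
    import sys

    import google.generativeai as genai

    SYSTEM = (
        "Be brief and answer in markdown. Most questions are about the Linux shell. "
        f"The user's OS is: {platform.platform()}"
    )

    def main() -> None:
        if len(sys.argv) < 2:
            sys.exit('usage: ai "how can I convert a webp to a png"')
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        model = genai.GenerativeModel("gemini-1.5-flash")
        # Prepend the system-style instructions to the question to keep the sketch simple.
        reply = model.generate_content(SYSTEM + "\n\nQuestion: " + " ".join(sys.argv[1:]))
        print(reply.text)

    if __name__ == "__main__":
        main()

Save it somewhere on your PATH as "ai", make it executable, and you get roughly the workflow described above.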
I wonder if this site isn't impressed because there are a lot of 1% coders here that don't understand what most people do at work. It's mostly administrative. Take this spreadsheet, combine with this one, stuff it into power BI, email it to Debbie so she can spell check it and send to the client. Y'all forget there are companies that actually make things that don't have insane valuations like your bullshit apps do. A company that scopes municipal sewer pipes can't afford a $500k/yr dev, so there's a few $60k/yr office workers that fiddle with reports and spreadsheets all day. It's literally a whole department in some cases. Those are the jobs that are about to be replaced and there are a lot of those jobs out there.
I think it's because it's mostly older people here, which tends to come with being uninspired and jaded from previous hype cycles. There's always been the same attitude towards cryptocurrency. Young people are drawn to exciting new things that really break open their existing worldviews. Bitcoin was pioneered by teenagers and young men.
Also, people in general quickly adapt. LLMs are absolute sci-fi magic but you forget that easily. Here's a comedian's view on that phenomenon https://www.youtube.com/watch?v=nUBtKNzoKZ4
I think of LLMs like smart but unreliable humans. You don't want to use them for anything that you need to have right. I would never have one write anything that I don't subsequently go over with a fine-toothed comb.
With that said, I find that they are very helpful for a lot of tasks, and improve my productivity in many ways. The types of things that I do are coding and a small amount of writing that is often opinion-based. I will admit that I am somewhat of a hacker, and more broad than deep. I find that LLMs tend to be good at extending my depth a little bit.
From what I can tell, Sabine Hossenfelder is an expert in physics, and I would guess that she already is pretty deep in the areas that she works in. LLMs are probably somewhat less useful at this type of deep, fact-based work, particularly because of the issue where LLMs don't have access to paywalled journal articles. They are also less likely to find something that she doesn't know (unlike with my use cases, where they are very likely to find things that I don't know).
What I have been hearing recently is that it will take a long time for LLMs to be better than humans at everything. However, they are already better than many, many humans at a lot of things.
I think there are a couple of truths to this.
1. Any low-hanging fruit that could easily be solved by an LLM probably would have been solved already by someone using standard methods.
2. Humans and LLMs have to spend some particular amount of energy to solve problems. Now, there are efficiencies that can lower/raise that amount of energy, but at the end of the day TANSTAAFL. Humans spend this in a lifetime of learning and eating, and LLMs spend this in GPU time and power. Even when AI gets to human level, it's never going to abstract this cost away; energy still needs to be spent to learn.
So many people have shown me these stupid-ass AI summaries for random things, and if you have even a basic understanding of the relevant issue, the answers just seem bizarre. This feels like cheating on my homework and not understanding, like using Photomath on homework does.
If you treat them like a temperamental muse then you’ll have far more success.
They excel in spitballing more than accurate citations.
LLMs are like having a nerdy friend who is willing to talk with you about anything.
The LLM output inference is often too fast to be illogical or nonsense.
Overestimated short term effects, but underestimated long term effects, as usual.
My take is that if you expect current LLMs to be some near perfect, near-AGI models, then you're going to be sorely disappointed.
If that disappoints you to such a degree that you simply won't use them, you might find yourself in a position some years ahead - could be 1...could be 2...could be 5...could be 10 - who knows, but when the time comes, you might just be outdated and replaced yourself.
When you closely follow the incremental improvements of tech, you don't really fall for the same hype hysteria. If you on the other hand only look into it when big breakthroughs are made, you'll get caught in the hype and FOMO.
And even if you don't want to explicitly use the tools, at least try to keep some surface-level attention to the progress and improvements.
I honestly believe that there are many, many senior engineers / scientists out there that currently just scoff at these models, and view them as some sort of toy tech that is completely overblown and overhyped. They simply refuse to use the tools. They'll point to some specific time a LLM didn't deliver, roll their eyes, and call it useless.
Then when these tools progress, and finally meet their standards, they will panic and scramble to get into the loop. Meanwhile their non-tech bosses and executives will see the tech as some magic that can be used to reduce headcount.
Here's one simple reason:
I have a very specific esoteric question like: "What material is both electrically conductive and good at blocking sound?" I could type this into google and sift through the titles and short descriptions of websites and eventually maybe find an answer, or I can put the question to the LLM and instantly get an answer that I can then research further to confirm.
This is significantly faster, more informative, more efficient, and a rewarding experience.
As others have said, its a tool. A tool is as good as how you use it. If you expect to build a house by yelling at your tools I wouldn't be bullish either.
Until they pull the rug and make you read through 10 paragraphs of ads first before getting to your answer.
I mean, the first link I got when I pasted that in is probably the Stack Exchange thread you would use to research further, along with other sources, which do seem relevant to the query.
I don't see how an LLM is significantly faster or more informative, since you still have to do the legwork to validate the answer. I guess if you're google-phobic (which a lot of people seem to be, especially on HN) then I can see how it's more rewarding to put it off until later in the process.
You can boil it down to this: what's easier for most people, looking through websites and search engine results for answers, or speaking in plain language? The answer is pretty obvious.
The validity of the answers is not 1:1 with its potential profitability.
Like James Baldwin said "people love answers, but hate questions."
Getting an answer faster is exponentially better than getting the more precise, more right, more nuanced answer, for most people, every time. Doing the due diligence is smart, but it's also after the fact.
It's just an example and you can "well actually" until the cows come home, but you're missing the point. I'm sure there are things you have found hard to google. Also, you are most likely (as you're commenting on Hacker News) not a good representation of the majority of the world.
idk if you have noticed, but Google is clearly using LLM technology in conjunction with its search results, so the assumption that they are just using traditional tech and not LLMs to inform or modify the result set is, I think, unlikely.
> I'm sure there are things you have found hard to google.
There are, but I haven’t found LLMs to be useful for them either.
Today, I had the question of “where can I buy firewood within a ten minute drive from my house, and what’s the cost at each place?”
There's no real way to get that info without going for a drive, or calling every potential location to ask.
The first stackexchange link I see answers the question of thermal conductivity, not electrical. Google is convinced I didn’t actually mean electrical. Forcing it to include electrical brings up nothing of use.
The Google AI summary suggests MLV which is wrong.
ChatGPT suggests using copper which is also wrong.
I call bullshit on the entire affair.
The answer I got from ChatGPT was: ...........
A material that is both electrically conductive and good at blocking sound is:

Lead (Pb)
- Electrical conductivity: Lead is a metal, so it conducts electricity, although it's not the most conductive (lower than copper or silver).
- Sound blocking: Lead is excellent at blocking sound due to its high density and mass, which help attenuate airborne sound effectively.

Other options depending on application:

Composite materials:
- Metal-rubber composites or metal-polymer composites can be engineered to conduct electricity (via embedded conductive metal layers or fillers) and block sound (due to the damping properties of the polymer/rubber layer).

Graphene or carbon-filled rubber:
- Electrically conductive due to graphene/carbon content.
- Sound damping from rubber base.
- Used in some specialized industrial or automotive applications.

Let me know if you need it optimized for a specific use case (e.g., lightweight, flexible, non-toxic).
...........
This took me less than 10 seconds.
Pretty damn good if you ask me.
You used the same search string and got a completely different answer?
That’s… weird. Definitely doesn’t inspire confidence.
I have a prompt personalization that says I'm a scientist/engineer. Perhaps that's why it gave me a better answer. If you consider the multitude of contexts you could ask this question in, it makes sense to give it a little personal background.
Use LLMs for what they are good at. Most of my prompts start with “What are my options for … ?” And they excel at that, particularly the recent reasoning models with the ability to search the web. They can help expand your horizons and analyze pros/cons from many angles.
Just today, I was thinking of making changes to my home theater audio setup and there are many ways to go about that, not to mention lots of competing products, so I asked ChatGPT for options and gave it a few requirements. I said I want 5.1 surround sound, I like the quality and simplicity of Sonos, but I want separate front left and right speakers instead of “virtual” speakers from a soundbar. I waited years thinking Sonos would add that ability, but they never did. I said I’d prefer to use the TV as the hub and do audio through eARC to minimize gaming latency and because the TV has enough inputs anyway, so I really don’t need a full blown AV receiver. Basically just a DAC/preamp that can handle HDMI eARC input and all of the channels.
It proceeded to tell me that audio-only eARC receivers that support surround sound don’t really exist as an off-the-shelf product. I thought, “What? That can’t be right, this seems like an obvious product. I can’t be the first one to have thought of this.” Turns out it was right, there are some stereo DAC/preamps that have an eARC input and I could maybe cobble together one as a DIY project, but nothing exactly like what I wanted. Interesting!
ChatGPT suggested that it’s probably because by the time a manufacturer fully implements eARC and all of the format decoding, they might as well just throw in a few video inputs for flexibility and mass-market appeal, plus one less SKU to deal with. And that kind of makes sense, though it adds excess buttons and bothers me from a complexity standpoint.
It then suggested WISA as a possible solution, which I had never heard of, and as a music producer I pay a lot of attention to speaker technology, so that was interesting to me. I’m generally pretty skeptical of wireless audio, as it’s rarely done well, and expensive when it is done well. But WISA seems like a genuine alternative to an AV receiver for someone who only wants it to do audio. I’m probably going to go with the more traditional approach, but it was fun learning about new tech in a brainstorming discussion. Google struggles with these sorts of broad research queries in my experience. I may or may not have found out about it if I had posted on Reddit, depending on whether someone knowledgeable happened to see my post. But the LLM is faster and knows quite a bit about many subjects.
I also can’t remember the last time it hallucinated when having a discussion like this. Whereas, when I ask it to write code, it still hallucinates and makes plenty of mistakes.
I mean the answer is simple, money. There's a bajillion dollars getting shoved into this crap, and most of the bulls are themselves pushing some AI thing. Look at YC, pretty much everything they're funding has some mention of AI. It's a massive bubble with people trying to cash in on the hype. Plus the managerial class being the scum they are, they're bullish because they keep getting sold on the idea that they can replace swathes of workers.
This is just the beginning - and it need not be nation states. Imagine instead of Russian disinfo, it is oil companies doing the same thing with positive takes on oil, climate change is a hoax, etc. Or <insert religion> pushing a narrative against a group they are against.
https://euromaidanpress.com/2025/03/27/russian-propaganda-ne...
I have been using Claude this week for the first time on a _slightly_ bigger SwiftUI project than the few lines of bash or SQL I had used it for before. I have never used Swift before, but I am amazed by how much Claude could do. It feels to me as if we are at the point where anyone can now generate small tools for themselves with low effort. Maybe not production ready, but good enough to use yourself. It feels like it should be good enough to empower the average user to break out of having to rely on pre-made apps to do small things. Kind of like bash for the average Joe.
What worked:
- generated a mostly working PoC with minimal input and a hallucinated UI layout, color scheme, etc. This is amazing because it did not bombard me with detailed questions; it just carried on and provided me with a baseline that I could then fine-tune
- corrected build issues when I simply copy-pasted the errors from Xcode
- got APIs working
- added debug code when it could not fix an issue after a few rounds
- resolved an API issue after I pointed it to a TypeScript SDK for the API (I literally gave it a link to the file and told it to try to use this to work out where the problem was)
- produces code very fast
What is not working great yet:
- it started off with one large file and crashed soon after because it hit a timeout when regenerating the file. I needed to ask it to split the file into a typical project structure
- some logic I asked it to implement explicitly got changed at some point during an unrelated task. To prevent this in the future I asked it to mark this code part as important and to only change it on explicit request. I don't know yet how long this code will stay protected
- by the time enough context had built up, usage warnings popped up in Claude
- only so many files are supported at the moment
So my takeaway is that it is very good at translating, i.e. API docs into code, errors into fixes. There is also a fine line between providing enough context and running out of tokens.
I am planning to continue my project to see how far I can push it. As I am getting close to the token limit now, I am thinking of structuring my app in a Claude-friendly way:
- clear internal APIs, kind of like header files, so that I can tell Claude what functions it can use without allowing it to change them or needing to tokenize the full source code (see the sketch after this list)
- adversarial testing. I don’t have tests yet, but I am thinking of asking one dedicated instance of Claude to generate tests. I will use other Claude instances for coding and provide them with failing test outputs like I do now with build errors. I hope it will fix itself similarly.
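To make the first point concrete, here is a minimal sketch of what such a "header-like" contract file could look like in Swift. The names (MediaItem, LibraryStore) are hypothetical placeholders, not from my actual app:

```swift
// A minimal sketch (names are hypothetical) of a "header-like" contract file:
// the only source shared with Claude, so the full implementation never has to
// enter the context window.
import Foundation

/// One item shown in the app's main list.
struct MediaItem: Identifiable, Codable {
    let id: UUID
    var title: String
    var addedAt: Date
}

/// The stable internal API. Generated UI code may call these methods,
/// but the protocol and its implementations are marked as off-limits for edits.
protocol LibraryStore {
    /// All items, newest first.
    func allItems() -> [MediaItem]

    /// Adds an item with the given title and returns it.
    func add(title: String) -> MediaItem

    /// Removes the item with the given id; returns true if anything was removed.
    func remove(id: UUID) -> Bool
}
```

The idea being that the UI code Claude generates depends only on this protocol, so I can paste just this file into the prompt instead of the whole implementation.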
Perhaps attitudes to this new phenomenon are correlated with propensity to skepticism in general.
I will cite myself as Exhibit A. I am the sort of person who takes almost nothing at face value. To me, physiotherapy, and oenology, and musicology, and bed marketing, and mineral-water benefits, and very many other such things, are all obviously pseudoscience, worthy of no more attention than horoscopes. If I saw a ghost I would assume it was a hallucination caused by something I ate.
So it seems like no coincidence that I reflexively ignore the AI babble at the top of search results. After all, an LLM is a language-rehashing machine which (as we all know by now) does not understand facts. That's terribly relevant.
I remember reading, a couple of years back, about some Very Serious Person (i.e. a credible voice, I believe some kind of scientist) who, after a three-hour conversation with ChatGPT, had become convinced that the thing was conscious. Rarely have I rolled my eyes so hard. It occurred to me then that skepticism must be (even) less common a mindset than I assumed.
Not related to the post, but you don't believe in physiotherapy? You know, the "help grandma go from a wheelchair to a walker after a fall through controlled exercise" people. You don't mean chiropractors, right? Terms do get weird from country to country, so maybe in the US it's more woo-adjacent.
I genuinely don't understand why everyone is so hyper-polarized on LLMs. I find them to be fantastically useful tools in certain situations, and definitely one of the most impressive technological developments in decades. It isn't some silver bullet, solve-all-issues solution. It can be wrong and you can't simply take things it outputs for granted. But that does not make it anywhere near useless or any less impressive. This is especially true considering how truly young the technology is. It literally didn't exist 10 years ago, and the iteration that thrust it into the public eye is less than 3 years old and has already advanced remarkably. I find the idea that LLMs are useless snake oil to be just as deluded a take as the people claiming we have solved AGI.
Regardless of whether she's right or not, isn't she the same person who recently said that Europe needs to invest tens of billions in foundation models as soon as possible?
> I use GPT, Grok, Gemini, Mistral etc every day
Maybe that's why?
As far as I understand LLMs, what is being asked is unfortunately close to impossible with them.
I also find it disingenuous that apologists are saying things close to "you are using it wrong", when LLM-based AI is advertised as something that should be trusted more and more (because it is more accurate, based on some arbitrary metrics) and that might save some time (on some undescribed task).
Of course, in that use case most would say to use your judgement and verify whatever is generated, but for the generation that uses LLMs as a source of knowledge (like some people use Wikipedia or Stack Overflow as a source of truth), verification will be difficult when LLM-generated content is all they have ever known as a source of knowledge.
This technology is still only a few years old.
Is there a way to block Sabine posts by default?
The rate of improvement, that's why.
LLMs are cheap gimmicks that at worst will be another lecherous sore on society like crypto, and at best a fad that mercifully disappears soon.
Might have to do with the fact she is German.
Sabine is very, very German!
Not everyone is as impressed as us by new tech, especially when it's kinda buggy.
The article makes a lot of good points. I get a lot of slop responses to both coding and non-coding prompts, but I've also gotten some really really good responses, especially code completion from Copilot. Even today, ChatGPT saved me a ton of Google searches.
I'm going to continue using it and taking every response with a grain of salt. It can only get better and better.
Never underestimate the momentum of a poorly understood idea that appears as magic to the average person. Once the money starts flowing, that momentum will increase until the idea hits a brick wall that its creators can't gaslight people about.
I hope that realization happens before "vibe coding" is accepted as standard practice by software teams (especially when you consider the poor quality of software before the LLM era). If not, it's only a matter of time before we refer to the internet as "something we used to enjoy."
I watched this creator on YT and she is dissatisfied with LLMs all the time. I just blocked her so I don't see her videos, because they give off nothing but a bad aura. Wishing her good luck finding a topic she will be happy to discuss.
I thought skdh was on Bluesky or Mastodon, with the link pointing there.
> "By my personal estimate currently GPT 4o DeepResearch is the best one."
Uhm, can I dismiss this statement right here? If you call 4o the best, that means you haven't genuinely explored other models before making such claims...
GPT 4o Deep Research is currently the only available interface to the full o3 (as opposed to o3 mini).
It's only amazing right now because we aren't paying full price for it.
Wait until the investors want their returns
I like them, but I still feel like I am employing a bunch of overconfident, narcissistic engineers, and if I ask them to do something, I am never really comfortable that the result is correct. What I want is a workforce where I can pass off a request and go home confident that the results were correctly implemented.
LLMs are useful for bolting slop together and wasting electricity. The blind leading the blind, taking on a new skin.
can we stop using such a confusing word as "bullish" please?
saying "why are people bullish" only to continue with bullying does not add any clarity to this world
maybe you can ask an llm to explain it.
no, llms are not useful according to this thread. how will we ever verify their output? it will never be a billion dollar business
What's confusing about the word bullish?
what's confusing about the word bullying?
Written by some grifter who knows their game is up due to generative AI.
These, along with the whiny creatives who think progress should be halted so that they can continue to be a bottleneck, are the only people moaning about AI.
I love when people spend a ton of time spouting takes like this.
I just hope they keep feeling that way and avoid LLMs. Less competition for those of us who are using them to make our jobs/lives easier every day.
strong Paul Krugman vibes
A while ago I asked Gemini to set an alarm, it said "Sure!"
...that was the LLM responding, and it did not set an alarm.
Sabine is an obvious grifter. Crazy to see her on the frontpage of HN
It's weird that LLM bulls do not directly engage with the criticism, which I would characterize as a fairly standard criticism of hallucination. Hallucinations are a problem. I don't know who Sabine is, but I don't have to.
Same with the top reply from Teortaxes containing zero relevant information, which Twitter in its infinite wisdom has decided is the “most relevant” reply. (The second “most relevant” reply is some ad for some bs crypto newsletter.)
It's weird that you'd guess (wrong) that I'm an "LLM bull", then start generalizing the actions of this supposed group. Hallucinations are a problem, no news value there. Sabine seems not to be able to get value out of LLMs, and concludes that they aren't useful. The logic is not exactly impressive.
It's a tool. It can be useful, it doesn't always work. Some people claim it's better than it is, some people claim it's worse. This isn't exactly rocket science.
Regardless of the tweet in question, Sabine is a grifter. Her novel takes on academia being some kind of conspiracy of people milking the system, and on physicists not being interested in making new discoveries, are nonsensical and only serve to increase her own profile. Look at this video of her trying to convince the world she received an email that apparently proves all her points correct. My BS detector tells me she wrote that email herself, but you be the judge: https://www.youtube.com/watch?v=shFUDPqVmTg
Nothing to see here. Just another hot take from an influencer way outside their domain of expertise.