Devs say AI crawlers dominate traffic, forcing blocks on entire countries

360 points by Bender 3 months ago

xyzal 3 months ago

I think the proper answer is to aim for the bots to get _negative_ utility value from visiting our sites, that is poisoning their well, not just zero value, that is to block them.

Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.

Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).

A request rate too inhuman? Here, take those generated articles about positive effect of catching measles on performance in bed.

And so on, and so forth ...

Nepenthes is nice, but word salad can be detected easily. It needs a feature that pre-generates linguistically plausible but factically garbage text via open models

sigmoid10 3 months ago

This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the side of crawler's creators. Distinguishing genuine traffic has always been hard and it will not get easier in the age of AI.
- hec126 3 months ago
  
  You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.
  
  rustc 3 months ago
  
  This would be terrible for accessibility for users using a screen reader.
  
  mostlysimilar 3 months ago
  
  So would the site shutting down because AI bots are too much traffic.
  
  MrResearcher 3 months ago
  
  <a aria-hidden="true"> ... </a> will result in a link ignored by screen readers.
  
  rustc 3 months ago
  
  Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.
- soco 3 months ago
  
  But the very comment you answered explains how to do it: a page forbidden in robots.txt. Does this method need explanation why it's ideal for sorting humans and google, from malicious crawlers?
  
  majewsky 3 months ago
  
  robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.
  
  ccgreg 3 months ago
  
  I'd say a bigger problem is that people disagree about the meaning of nofollow and noindex.
  
  sigmoid10 3 months ago
  
  The detection and bypass is trivial: Access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
  
  delichon 3 months ago
  
  Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?
  
  voidUpdate 3 months ago
  
  When was the last time you looked at robots.txt to find a page that wasn't linked anywhere else?
  
  zzo38computer 3 months ago
  
  It was a while ago, and it was not deliberate (wget downloaded robots.txt as well as the files I requested, and I was able to find many other files due to that, some of which could not be accessed due to requiring a password, but some were interesting (although I did not use wget to copy those other files; I only wanted to copy the files I originally requested)).
  
  sharlos201068 3 months ago
  
  Crawlers aren't interested in fake pages that aren't linked to anywhere, they're crawling the same pages your users are viewing.
  
  danielheath 3 months ago
  
  Adding a disallowed url to your robots.txt is a quick way to get a ton of crawlers to hit it, without linking to it from anywhere. Try it sometime.
  
  gadflyinyoureye 3 months ago
  
  Tuesday. But I have odd hobbies.
  
  brookst 3 months ago
  
  robots.txt is not a sitemap. If it worked that way you could just make a 5TB file linking to a billion pages that look like static links but are dynamically generated.
  
  ccgreg 3 months ago
  
  robots.txt has a maximum relevant size of 500 kib.
gdcbe 3 months ago

That’s vaguely what https://blog.cloudflare.com/ai-labyrinth/ is about
TonyTrapp 3 months ago

We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.
- sokoloff 3 months ago
  
  > dozens of IPs, so every IP just makes 1-2 requests in total
  Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.
  
  lucb1e 3 months ago
  
  I'm also affected. I presume that this is per day, not just once, yet it's fewer requests than a human would often do so you cannot block it. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to the IP address WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. This is not including several other scrapers that did set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note, I've only blocked them from one specific service that proxies+caches data from a third party, which I'm caching precisely because the third party site struggled with load, so more load from these bots isn't helping)
  That's up to 37e6/24/60/60 = 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more, some of them a few thousand per year, some of them a few dozen per year; but thankfully they don't unleash the whole IP range on me at once, it occasionally rotates through to new ranges to bypass blocks
  
  aorth 3 months ago
  
  Parent probably meant hundreds or thousands of IPs.
  Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.
  Edit: LOL didn't read the article unity after posting—they mention the Fedora Pagure server getting this traffic from Brazil last week too!
  Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.
  Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?
  I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
  
  TonyTrapp 3 months ago
  
  If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.
  
  giantg2 3 months ago
  
  That's probably per day, per bot. Now how does it look when there are thousands of bots? In most cases I think you're right, but I can also see how it can add up.
gchamonlive 3 months ago

If I'm hosting my site independently, with a rented machine and a cloudflare CDN hosting my code on a self managed gitlab instance, how should I go about implementing this? Is there something plug and play I can drop into nginx that would do this work for me of serving bogus content and leaving my gitlab instance unscathed by bots?
lomonosov 3 months ago

Russia is already doing poisoning with success, so it is a viable tactic!
https://www.heise.de/en/news/Poisoning-training-data-Russian...
lucb1e 3 months ago

> Is your user-agent too suspicious?
Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple android webview browser, I guess I'll have to go back to the fully-featured browser I migrated away from because it was much slower
> A request rate too inhuman?
I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to resolve when someone follows a link directly to a subpage and thus doesn't have a cookie set yet. Either way, blocks left and right are fun! More please!
jajko 3 months ago

Thats absolutely brilliant f*cked up idea, poisoning AI while fending them off.
Gotta get my daily dose of bleach for enhanced performance, chatgpt said so.
usrnm 3 months ago

Where are you going to get all that content? If it's static, it will get filtered out very fast, if it's dynamic and autogenerated, it might cost even more than just letting the crawler through
- hec126 3 months ago
  
  Generate it once every few weeks with LLaMa and then serve as static content?
j-bos 3 months ago

This makes tons of sense, the AI trainers spend endless resources aligning their llms, the least we could do ia spend a few minutes aligning their owners. Fixing things at the incentive level.
agilob 3 months ago

Also, lower upload rate to 5kb/s
PeterStuer 3 months ago

You have clearly no idea on the incompetence so many public administrations have in configuring robots.txt for data that is actually created for and specifically meant to be consumed programatically (think rss and atom feeds, REST api endpoints etc.). Half the time the person setting up the robots.txt just blankedly blocks everything, and does not even know (or care) to exclude those.
seper8 3 months ago

I love this idea haha

tedunangst 3 months ago

> It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.

If the target goes down after you scrape it, that's a feature.

prisenco 3 months ago

This has me wondering what it would take to do a bcrypt style slow hashing requirement to retrieve data from a site. Something fast enough that a single mobile client for a user wouldn't really feel the difference. But an automated scraper would get bogged down in the calculations.
Data is presented to the user with multiple layers of encryption that they use their personal key to decrypt. This might add an extra 200ms to decrypt. Degrades the user experience slightly but creates a bottleneck for large-scale bots.
- tschwimmer 3 months ago
  
  Check out Anubis - it's not quite what you're suggesting but similar in concept: https://anubis.techaro.lol/
  
  LPisGood 3 months ago
  
  How does it work? I don’t have time to read the code, and the website/docs seem to be under construction.
  Does it have the client do a bunch of SHA-256 hashes?
  
  01HNNWZ0MV43FF 3 months ago
  
  Yeah it's like hashcash. You have to try random numbers until you roll a hash with enough leading zeroes. Then you get a cookie (JWT I think) that's valid for a week.
  SHA2 can run on ASICs and isn't memory-hard, so I'm hoping someone will add something tougher
  
  prisenco 3 months ago
  
  Interesting! Not how I'd approach it but certainly thinking along the same lines.
  
  userbinator 3 months ago
  
  I just hit a site with this --- and hit the back button immediately.
  
  Retr0id 3 months ago
  
  Better than a 500 error
  
  fc417fc802 3 months ago
  
  Also better than a third party service IMO because of the privacy implications. You could even potentially give users the choice (complete overkill but technically you could do it).
  
  joeblubaugh 3 months ago
  
  Maybe once you can use something more professional as an interstitial page
  
  ndiddy 3 months ago
  
  From the announcement page: https://xeiaso.net/blog/2025/anubis/
  > RPM packages and unbranded (or customly branded) versions are available if you contact me and purchase commercial support. Otherwise your users have to see a happy anime girl every time they solve a challenge. This is a feature.
  
  xena 3 months ago
  
  I'm going to be making distro packages and binaries public. I will have to figure out white label monetization I guess.
  
  neurostimulant 3 months ago
  
  It's open source with MIT license, so you should be able to remove those images yourself if you want.
  
  Spivak 3 months ago
  
  [flagged]
  
  bakugo 3 months ago
  
  [flagged]
  
  williamdclt 3 months ago
  
  I’m no anime fan, but this sort of judgement on normalcy leaves me with a very sour impression about whoever says this sort of thing.
  
  bakugo 3 months ago
  
  [flagged]
  
  soulofmischief 3 months ago
  
  [flagged]
  
  dns_snek 3 months ago
  
  [flagged]
  
  imtringued 3 months ago
  
  You might say that, but claiming everyone is a pedophile is such a tired political play at this point. Its primary purpose is to dehumanize people so that blatant wrongdoing can be justified.
  The visceral reaction might be genuine, but the actual feelings are probably not. I have yet to see someone who actually "cares about the children". The vast majority of accusations e.g. democrats running a pedophile ring turn out to be completely manufactured. The democrats responded by giving more funding and starting projects combating child abuse, only for the republicans to gut the programs, who think they are a waste of tax payer money and an example of big government.
  
  dns_snek 3 months ago
  
  [flagged]
  
  GoblinSlayer 3 months ago
  
  Violent games make people violent, and money make people greedy again?
  
  dns_snek 3 months ago
  
  How does that relate to anything I said?
- puchatek 3 months ago
  
  If we are able to detect AI scrapers then I would welcome a more strategic solution: feed them garbage data instead of the real content. If enough sites did that then the inference quality would take a hit and eventually the perpetrators, too.
  But of course this is the more expensive option that can't really be asked of sites that already provide public services (even if those are paid for by ads).
  
  brookst 3 months ago
  
  I really don’t think we have such a lack of misinformation that we need to invest in creating more of it, no matter the motive.
- kevin_thibedeau 3 months ago
  
  The problem is that the server also has to do the work. Fine for an infrequent auth challenge. Not so fine for every single data request.
  
  tedunangst 3 months ago
  
  Tons of problems are easier to verify than to solve.
  
  fsckboy 3 months ago
  
  do any of them involve mining bitcoin?
  
  Retr0id 3 months ago
  
  sure
  
  saganus 3 months ago
  
  Maybe there is a way for the server to ask the client to do the work?
  Something similar to proof-of-work but on a much smaller scale than Bitcoin.
  
  mrheosuper 3 months ago
  
  just add some delay to your response, we don't have to waste any more energy on meaningless calculation.
  
  01HNNWZ0MV43FF 3 months ago
  
  Adding delay means you have to keep more connections open at a single time. Parallelism doesn't favor a server if your problem is already a small server getting hit by a big scraper
  
  Ma8ee 3 months ago
  
  How expensive is it to just keep a connection open?
  
  swiftcoder 3 months ago
  
  About 20 kilobytes of socket + TLS state, if you've really optimised it down to the minimum. Most server software isn't that lean, of course, so pick a framework designed for running a million or so concurrent connections on a single server (i.e. something like Nginx)
  
  prisenco 3 months ago
  
  Right it would need an algorithm with widely different encryption speeds vs decryption speeds. Lattice-based cryptography maybe?
  
  Retr0id 3 months ago
  
  Hash functions are all you need.
  
  Dylan16807 3 months ago
  
  Yeah, searching for hashes with some prefix is easy to set up.
- baq 3 months ago
  
  See also https://datatracker.ietf.org/wg/privacypass/about/ and https://developers.cloudflare.com/waf/tools/privacy-pass/
- koakuma-chan 3 months ago
  
  Companies running those bots have more than enough resources
  
  prisenco 3 months ago
  
  Nobody has unlimited resources. Everything is a cost-benefit analysis.
  For highly valuable information, they might throw the GDP of a small country at scraping your site. But most information isn't worth that.
  And there are a lot of bad actors who don't have the resources you're thinking of that are trying to compete with the big guys on a budget. This would cut them out of the equation.
  
  timewizard 3 months ago
  
  Make all websites intentionally waste energy as a strategy to defeat unscrupulous operators has negative costs and marginal benefits.
  
  prisenco 3 months ago
  
  Resources applied to prevent bad actors from degrading or destroying the commons has always been the cost of civilization.
  
  noosphr 3 months ago
  
  Then use it to mine monero or similar.
  The idea that you should pay for content shouldn't be an insane pipedream. It should be the default on the internet.
  Maybe then we wouldn't be in the situation where getting new users is an existential threat to the majority of websites.
  
  01HNNWZ0MV43FF 3 months ago
  
  I'll add a link on my site recommending that visitors petition their elected officials for a pollution tax
MathMonkeyMan 3 months ago

Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?
- randmeerkat 3 months ago
  
  > Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?
  The next scraper doesn’t get the data. People don’t realize we’re not compute limited for ai, we’re data limited. What we’re watching is the “data war”.
  
  cyanydeez 3 months ago
  
  at this point we're _good data_ limited, which has little to do with scraping.
  
  DrFalkyn 3 months ago
  
  Why kind of data that isn’t public would be so valuable for AI training?
  Seems like there’s a fuck ton. All of Wikipedia, GitHub for code, etc.
  I can understand targeting certain sites like Reddit, etc. but not random websites
  
  timewizard 3 months ago
  
  It's to rip off copyrighted content and profit from it instead of the original authors. It's like every other low rent and highly automated scam that finds it's way onto the internet.
  If you look closely even Google does this. This is probably why many popular sites started getting down ranked in the last 2 years. Now they're below the fold and Google can present their content as their own through the AI box.
  
  throwaway2037 3 months ago
  
  Please remember that Google only needs to be marginally better than the competition. And, of course, their primary biz is ads, not serving great results; that is a distant second priority.
  
  MathMonkeyMan 3 months ago
  
  Their biz is ads, but since search is winner takes all they need only be marginally better than the competition... twenty years ago.
  
  timewizard 3 months ago
  
  > Their biz is ads,
  Yea, but, the FTC doesn't want it to be.
  
  grotorea 3 months ago
  
  Discord I guess would be quite valuable, even the de facto public servers.
  
  XorNot 3 months ago
  
  Honestly it's hard to tell how much more value the LLM people are going to get out of another copy of the internet.
  It feels a lot like they're stuck for improvements but management doesn't want to hear it.
  
  Davidzheng 3 months ago
  
  It's a bit strange to talk about stuck when the most recent breakthrough is less than a year old.
  
  LPisGood 3 months ago
  
  I’m not sure what you mean by breakthrough, but if you’re talking about Deepseek, it’s more of an incremental improvement than a breakthrough.
  
  threatofrain 3 months ago
  
  Scraping social media is good data, even without ML. The fact that something is "happening" to people in a social space inherently has importance to people. The specter of law is more threatening to whether companies can get their hands on good data.
- DaSHacka 3 months ago
  
  Now the only way to obtain that information is through them
- TuxMark5 3 months ago
  
  I guess one could make a point that competition will no longer have the access to the scraped data.

xena 3 months ago

Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!

shric 3 months ago

You're well into https://refactoringenglish.com/tools/hn-popularity/ so enjoy the fame!
seafoamteal 3 months ago

I've seen Anubis a couple times irl, mostly on Sourcehut, and the first time I saw it I was like, "Hey, I remember that blog post!" Congratulations on making something both useful and usable!
true_blue 3 months ago

On the few sites I've seen using it so far, it's been a more pleasant (and cuter) experience for me than the captchas I'd probably get otherwise. good work!
- xena 3 months ago
  
  Thanks! The artist I'm contracting and I are in discussions on how to make the mascot better. It will be improved. And more Canadian.
Figs 3 months ago

Hmm. Instead of requiring JS on the client, why don't you add a delay on the server side (e.g. 1 second default, adjustable by server admin) for requests that don't have a session cookie? For each session keep a counter and a timestamp. Every time you get a request from a session, look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp. If the counter is greater than a configured threshold, slow-walk the response (e.g. add a delay before forwarding the request to the shielded web server -- or transfer the response back out at reduced bytes/second, etc.)
You can periodically remove tracking data for entries older than a threshold -- e.g. once a minute or so (adjustable) remove tracked entries that haven't made a request in the past minute to keep memory usage down.
That'd effectively rate limit the worst offenders with minimal impact on most well-behaved edge-case users (like me running NoScript for security) while also wasting less energy globally on unnecessary computation through the proof-of-work scheme, wouldn't it? Is there some reason I'm not thinking of that would prevent that from working?
- viraptor 3 months ago
  
  > look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp
  Not sure about author's motivation, but this part is why I don't track usage - PoW allows you to do everything statelessly and not keep any centralised database or write any data. The benefit of a system slowing down crawling should be minimal resource usage for the server.
- PufPufPuf 3 months ago
  
  In a "denial of service prevention" scenario, you need your cost to be lower than the cost of the attacker. "Delay on the server side" means keeping a TCP connection open for that long, and that's a limited resource.
techjamie 3 months ago

The ffmpeg website is also using it. First time I actually saw it in the wild.

rfurmani 3 months ago

After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.

keyle 3 months ago

Has someone made honeypot for AI yet?
Take all regular papers and change their words or keywords to something outrageous and watch it feed it to users.
- MIC132 3 months ago
  
  This kinda fits, though it's on a personal blog level:
  https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr...
- puchatek 3 months ago
  
  If there was a non-profit dedicated do this, I would donate
karlgkk 3 months ago

One thing that worked well for me was layering obstacles
It really sucks that this is the way things are, but what I did was
10 requests for pages in a minute, you get captchad (with a little apology and the option to bypass it by logging in). asset loads don’t count
After a captcha pass, 100 requests in an hour gets you auth walled
It’s really shitty but my industry is used to content scraping.
This allows legit users to get what they need. Although my users maybe don’t need prolonged access ahem.
- nomel 3 months ago
  
  What happens if you use the proper rate limiting status of 429? It includes a next retry time [1]. I'm curious what (probably small) fraction would respect it.
  [1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
  
  karlgkk 3 months ago
  
  Probably makes sense for a b2b app where you publish status codes as part of the api
  Bad actors don’t care and annoying actors would make fun of you for it on twitter
- rfurmani 3 months ago
  
  I've wanted to but wasn't sure how to keep track of individuals. What works for you? IP Addresses, cookies, something else?
  
  karlgkk 3 months ago
  
  I use IP addy. Users behind cgnat are already used to getting captcha the first time around
  There’s some stuff you can do, like creating risk scores (if a user changes ip and uses the same captcha token, increase score). Many vendors do that, as does my captcha provider.
- nukem222 3 months ago
  
  > This allows legit users to get what they need.
  Of course they could have just used the site directly.
  
  karlgkk 3 months ago
  
  If bots and scrapers respected the robots and tos, we wouldn’t be here
  It sucks!
- GoblinSlayer 3 months ago
  
  Or just buy cloudflare :)
- LPisGood 3 months ago
  
  What is your website?

Nckpz 3 months ago

I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"

ndiddy 3 months ago

This is yet another great example of the innovation that the AI industry is delivering. Why just limit your scraper bots to GET requests when there might be some juicy data to train on hidden behind that form? There's a reason why the a16z funded cracked vibe coder ninjas are taking over software engineering, they're full of wonderful ideas like this.
- ohgr 3 months ago
  
  I work with one of those guys. Morally bankrupt at every level of his existence.
bendangelo 3 months ago

There are bots that scrape https registration sites thats how they usually find you.

ipaddr 3 months ago

I've had a number of content sites I've shut down a few sites in the last few days because of the toll these aggressive AI bots. Alexa seems like the worst.

These were created 20 years ago and updated over the years. I use to get traffic but that's been slowed to 1,000 or less legitimate visitors over the last year. But now I have to deal with server down emails caused by these aggressive bots that don't respect the robots file.

Aurornis 3 months ago

> Alexa seems like the worst.
Many of the bots disguise themselves as coming from Amazon or other big company.
Amazon has a page where you can check some details to see if it’s really their crawler or someone imitating it.
- spookie 3 months ago
  
  yup, actually most that I've seen are impersonating amazon
svelle 3 months ago

It's funny how every time this topic comes up someone says "I've had this happen and x is the worst" With x being any of the big AI providers. Just a couple minutes ago I read the same in another thread and it was Anthropic. A couple weeks back it was Meta.
My conclusion is that they're all equally terrible then.
- Aurornis 3 months ago
  
  All of the crawlers present themselves as being from one of the major companies, even if they’re not.
  Setting user-agent headers is easy.
  
  cyanydeez 3 months ago
  
  at the same time, all the AI providers have some kind of web based AI agent, so let snot pretend they're crafting their services in care of other peoples websites.
  
  CaptainFever 3 months ago
  
  I highly doubt that people are using AI agent features so frequently and so concentrated-ly that it brings down websites.
- ipaddr 3 months ago
  
  I agree they are all creating negative value for site owners. In my personal experience this week blocking Amazon solved my server overload issue.

rco8786 3 months ago

I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.

Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.

i5heu 3 months ago

At which level of DDos one can claim damages from them?
- brookst 3 months ago
  
  You can claim whatever you want, but actually litigation is expensive and it’s not at all a sure thing that “I made a publicly available resource and they used it too much” is going to win damages. Maybe? Maybe not?
  
  i5heu 3 months ago
  
  So I could ddos anyone legally as long I have a some reason?
  
  brookst 3 months ago
  
  Sure. Courts tend to care about intent, but your are welcome to try to change that.
Neil44 3 months ago

I've got claude bot blocked too. It regularly took sites offline and ignored robots.txt. Claude bot is an asshole.
throwaway2037 3 months ago

robots.txt did not work?
- seabird 3 months ago
  
  Of course it didn't work. At best, the dorks doing this think there's a gamechanging LLM application to justify the insane valuations right around the corner if they just scrape every backwater site they can find. At worst, they're doing it because it's paying good money. Either way, they don't care, they're just going to ignore robots.txt.
- hsbauauvhabzb 3 months ago
  
  I have not monitored traffic in this way, but I imagine most AI companies would explicitly follow links listed in robots, even if not mentioned elsewhere on the site.
- epc 3 months ago
  
  I’ve been doing web sites for thirty years, robots.txt is at best a request to polite user agents to respect the server’s desires. None of the malicious crawlers respect it. None of the AI crawlers respect it.
  I’ve resorted to returning xml and zip bombs in canary pages. At best it slows them down until I block their network.
- lemper 3 months ago
  
  bro, since when vc funded ai companies have the courtesy to respect robots.txt?

userbinator 3 months ago

All these JS-heavy "anti bot" measures do is further entrench the browser monopoly, making it much harder for the minority of independents, while those who pay big $$$ can still bypass them. Instead I recommend a simple HTML form that asks questions with answers that LLMs cannot yet figure out or get consistently wrong. The more related to the site's content the questions are, the better; I remember some electronics forums would have similar "skill-testing" questions on their registration forms, and while some of them may be LLM'able now, I suspect many of them are still really CAPTCHAs that only humans can solve.

IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.

mvdtnz 3 months ago

How does JS entrench a browser monopoly? If you're not using vendor-specific JS extensions or non-standard APIs any browser should be able to execute your JS. Like most web developers I don't have a lot of patience for the people who refuse to run JS on their clients.
- userbinator 3 months ago
  
  The effort required to implement a JS engine and keep trendchasing the latest changes with it is a huge barrier to entry, not to mention the insane amount of fingerprinting and other privacy-hostile, anti-user techniques it enables.
  Seeing what used to be simple HTML forms turned into bloated invasive webapps to accomplish the exact same thing seriously angers me; and everyone else who wanted an easily accessible and freedom-preserving Internet.
Terr_ 3 months ago

Or require every fresh "unique" visitor to run some JS that takes X seconds to compute.
It's not nice for visitors using a very old smartphone, but it's arguably less-exclusionary than some of the tests and third-party gatekeepers that exist now.
In many cases we don't actually care about telling if someone is truly a human alone, as much as ensuring that they aren't a throwaway sockpuppet of a larger automated system that doesn't care about good behavior because a replacement is so easy to make.
- userbinator 3 months ago
  
  that takes X seconds to compute.
  Those who have the computing resources to do commercial scraping will easily get past that.
  In contrast, there are still many questions which a human can easily answer, but even the best LLMs currently can't.
  
  Terr_ 3 months ago
  
  It doesn't have to be bulletproof, it just has to create a cost that doesn't scale economically for them.
  
  userbinator 3 months ago
  
  Computing power is cheap, and getting cheaper for the big guys. Real humans are not.
  
  kristiandupont 3 months ago
  
  >there are still many questions which a human can easily answer, but even the best LLMs currently can't.
  I am genuinely curious: what is an example of such a question, if it's for a person you don't know (i.e. where you cannot rely on inside knowledge)?
  
  userbinator 3 months ago
  
  This was a notable example in that category: https://news.ycombinator.com/item?id=41058318
  Also https://news.ycombinator.com/item?id=38766512
  
  kristiandupont 3 months ago
  
  I just tried these with ChatGPT (4o) and it got both of them right. That's not to say that you won't be able to find something that still works but I think that particular hole is closing fast.
  
  spartanatreyu 3 months ago
  
  Yeah, any text based question with a text based answer eventually ends up getting posted on a forum for an AI model to scrape.
- a2128 3 months ago
  
  IIRC that's basically already part of what Cloudflare Turnstile does
GoblinSlayer 3 months ago

The algorithm is inspired by hashcash https://raw.githubusercontent.com/TecharoHQ/anubis/refs/head...

everdrive 3 months ago

One more reason we're moving away from privacy. Didn't load all the javascript domains? You're probably a bot. Not signed in? You're probably a bot. The web we knew is dying step by step.

One interesting thought: do we know if these AI crawlers intentionally avoid certain topics? Is pornography totally left unscathed by these bots? How about extreme political opinions?

GoblinSlayer 3 months ago

No, https://www.heise.de/en/news/Poisoning-training-data-Russian...
BiteCode_dev 3 months ago

That's an interesting thing: put things about all dictators known currently in power that they don't want to hear, and maybe they will back off.

banq 3 months ago

I have blocked these ip from the country: 190.0.0.0/8 207.248.0.0/16 177.0.0.0/8 200.0.0.0/8 201.0.0.0/8 145.0.0.0/8 168.0.0.0/8 187.0.0.0/8 186.0.0.0/8 45.0.0.0/8 131.0.0.0/16 191.0.0.0/8 160.238.0.0/16 179.0.0.0/8 186.192.0.0/10 187.0.0.0/8 189.0.0.0/8

HermanMartinus 3 months ago

I literally just published a post on this and how it affects Bear Blog.

https://herman.bearblog.dev/the-great-scrape

zzo38computer 3 months ago

My issue is not to prevent others from obtaining copies of the files, using Lynx or curl, disabling JavaScripts and CSS and pictures, etc. It is to prevent others from overloading the server due to badly behaved software.

I had briefly set up port knocking for the HTTP server (and only for HTTP; other protocols are accessible without port knocking), but due to a kernel panic I removed it and now the HTTP server is not accessible. (I may later put it back on once I can fix this problem.)

As far as I can tell, the LLM scrapers do not attempt to be "smart" about it at this time; if they do in future, you might try to take advantage of that somehow.

However, even if they don't, there are probably things that can be done. For example, check that if the declared user-agent declares things that it isn't doing, and display an error message if so (users who use Lynx will then remain unaffected and will still be able to access it). Another possibility is to try to confuse the scrapers however they are working, e.g. invalid redirects, valid redirects (e.g. to internal API functions of the companies that made them), invalid UTF-8, invalid compressed data, ZIP bombs (you can use the compression functions of HTTP to serve a small file that is too big when decompressed), EICAR test files, reverse pings (if you know who they really are), etc. What will work and what doesn't work depends on what software they are using.

edoloughlin 3 months ago

I'm being trite, but if you can detect an AI bot, why not just serve them random data? At least they'll be sharing some of the pain they inflict.

nosianu 3 months ago

You mean like this?
[2025-03-19] https://blog.cloudflare.com/ai-labyrinth/
> Trapping misbehaving bots in an AI Labyrinth
> Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives.
- barbazoo 3 months ago
  
  What a colossal waste of energy
- fc417fc802 3 months ago
  
  > No real human would go four links deep into a maze of AI-generated nonsense.
  ... I would. Out of curiosity and amusement I would most definitely do that. Not every time, and not many times, but I would definitely do that one or a few times.
  Guess I'm getting added to (yet another) Cloudflare naughty list.
  > It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
  In that case wouldn't it be faster and easier to restyle the CSS of wikipedia pages?
- mbesto 3 months ago
  
  Wait, what happens when a Cloudflare Worker AI meets an AI Labyrinth?!
  
  ronsor 3 months ago
  
  Cloudflare deletes itself.
  
  GoblinSlayer 3 months ago
  
  Rise of the machines.
noirscape 3 months ago

Bandwidth isn't free, not at the volume these crawlers scrape at; serving them random data (for example by leading them down an endless tarpit of links that no human would end up visiting) would still incur bandwidth fees.
Also it's not identifiable AI bot traffic that's detected (they mask themselves as regular browsers and hop between domestic IP addresses when blocked), it's just really obviously AI scraper traffic in aggregate: other mass crawlers have no benefit from bringing down their host sites, except for AI.
A search engine has nothing if it brings down the site they're scraping (and has everything to gain from identifying itself as a search engine to try and get favorable request speeds - the only thing they'd need to check is if the site in question isn't serving different data, but that's much cheaper), same with an archive scraper and those two are pretty much the main examples I can think of for most scraping traffic.
- BlarfMcFlarf 3 months ago
  
  Hmm, maybe you could zipbomb the data? Aka, you send a few kilobytes of compressed data that expands to many gigabytes on client side?
- gus_massa 3 months ago
  
  Reverse Slowloris?
  https://en.wikipedia.org/wiki/Slowloris_(cyber_attack)
- miohtama 3 months ago
  
  For Cloudflare, bandwidth is practically free.
- cyanydeez 3 months ago
  
  arnt a lot of these bots now actively loading javascript? you could just load a simple script that does the job .
  
  edoloughlin 3 months ago
  
  If they agree to mine crypto for you then you send valid data. Is this a win-win?
  (I feel I need to preemptively state that I am being sarcastic.)
- charcircuit 3 months ago
  
  >Bandwidth isn't free
  Via peering agreements it is.
  
  rcxdude 3 months ago
  
  Not something available to smaller sites
  
  charcircuit 3 months ago
  
  Yes, it is. They transitively get it via the agreements the smaller site's host's host makes. Or via services like Cloudflare.
  
  xena 3 months ago
  
  What button do I click in the AWS panel for that?
  
  charcircuit 3 months ago
  
  There is no button. AWS is where you go to light money on fire.
xena 3 months ago

You can detect the patterns in aggregate. You can't detect it easily at an individual request level.
- bluGill 3 months ago
  
  In short if you get several million requests and expect to only get 100 you won't know which are the real requests and which are the AI ones - but it is obvious that the vast majority are AI.
jmpeax 3 months ago

You skipped the last section "Tarpits and labyrinths: The growing resistance" of the article.
DecentShoes 3 months ago

Random data? Why not "recipes" that just say "Bezos is a pedo" over and over ?

ggm 3 months ago

Entire country blocks are lazy, and pragmatic. The US armed forces at one point blocked AU/NZ on 202/8 and 203/8 on a misunderstanding about packets from China, also from these blocks. Not so useful for military staff seconded into the region seeking to use public internet to get back to base.

People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)

noirscape 3 months ago

They can absolutely work if you aren't expecting any traffic from those countries whatsoever.
I don't expect any international calls... ever, so I block international calling numbers on my phone (since they are always spam calls) and it cuts down on the overwhelming majority of them. Don't see why that couldn't apply to websites either.
- koito17 3 months ago
  
  Although it's a very lazy practice, this is exactly how many Japanese sites (and internet services) fight against bad actors. In short, they block non-Japanese traffic and data center IPs. I expect these measures to become insufficient as consumers adopt IoT devices and provide ample amounts of residential IPs for botnets.
  As for phone numbers, businesses and individuals employ a similar strategy. Most "legitimate" phone numbers begin with 060 or 070. Due to lack of supply, telcos are gradually rolling out 080 numbers. 080 numbers currently have a bad reputation because they look unfamiliar to the majority of Japanese. Similarly, VoIP numbers all begin with 050, and many services refuse such numbers. Most people instinctively refuse to answer any call that is not from a 060 or 070 number.
- alabastervlog 3 months ago
  
  Country or region blocks based on IPs used to (c. ~2000) be pretty standard. Blackhole the blocks associated with China, Russia, and maybe Africa, and your failed-login logs drop from scrolling so fast you can't read them, to a handful of lines per minute. Almost all the traffic was from those blocks, and was malicious. Meanwhile, for many sites, your odds (especially back then) of getting legitimate traffic from, say, China, was nearly zero, so the cost of blocking them was effectively nothing.
  Cloudflare is basically still just this, but with more steps.
- ggm 3 months ago
  
  Sure. Absolutely works. Right up until it doesn't. I think the MIL was the wrong people to assume "we will never need packets from these network blocks"
  The other thing is that phone numbers follow a numbering scheme where +1 is north america and +64 is NZ. Its easy to know the longterm geographic consequence of your block, modulo faked out CLID. IP packets don't follow this logic and Amazon can deploy AWS nodes with IPs acquired in Asia, in any DC they like. The smaller hosting companies don't say the IP range they route for banks have no pornographers on them.
  It's really not sensible to use IP blocks except for the very specific cases like yours: "I never terminate international calls" is the NAT of firewalls: "I don't want incoming packets from strangers" sure the cheapest path is to block entire swathes of IPv4 and IPv6. But if you are in general service delivery, that rarely works. If you ran a business doing trade in China, you'd remove that block immediately.
- kragen 3 months ago
  
  It depends on whether the information on the website is supposed to be publicly available or not. "This information is publicly available except to people from Israel" sends a really terrible message.
  
  Retric 3 months ago
  
  It sends a great message to crack down on these companies, as long as you mention why it’s blocked.
  
  kragen 3 months ago
  
  "You're cut off from access to knowledge because you live in the same country as AI researchers"?
  
  Retric 3 months ago
  
  A DoS attack is a DoS attack even if someone is pretending to be a “Researcher.”
  People in Iran, Russia, etc get annoyed with sanctions but that’s kind of the point. If your government isn’t responding appropriately, yes you’ll get shafted it’s what you do after that which solves the problem.
  
  kragen 3 months ago
  
  Me, I prefer to relate to people as individuals rather than, as you are advocating, interchangeable representatives of their area of residence. If what you want is World War III, this is how you get it.
  In particular, universal access to knowledge is a fundamental principle of liberalism.
  
  Retric 3 months ago
  
  I’d personally love for someone to hand me a billion dollars no strings attached.
  That’s got nothing to do with solving the issues created by these people, but if you’re going to toss out meaningless non sequitur’s then I figure I might as well join in on the fun.
  
  cyanydeez 3 months ago
  
  "researchers" is like ignoring the whole Capitalists. We don't call them script researchers. We call them script kiddies.
  There's the whole other side of these AI researchers, and thats just slop artisans.

nashashmi 3 months ago

And this is why the Internet has become a maze of captcha’s.

fsckboy 3 months ago

yeah but the HN community would be ground zero for "automating scraping tasks"
"We have met the enemy and he is us." -- Walt Kelly

myzie 3 months ago

An aspect I find interesting is that these crawlers are all doing highly redundant work. As in, thousands of crawlers are running around the world, and each crawler may visit the same site and pages multiple times a week.

This seems like an opportunity for a company like Firecrawl, ScrapingBee, etc to offer built-in caching with TTLs so that redundant requests can hit the cache and not contribute to load on the actual site.

Even if each company that operates a crawler cached pages across multiple runs, I'd expect a large improvement in the situation.

For more dynamic pages, this obviously doesn't help. But a lot of the web's content is more static and is being crawled thousands of times.

I built something for my own company that crawls using Playwright and caches in S3/Postgres with a TTL for this purpose.

Does this make sense to anyone else? I'm not sure if I'm missing something that makes this harder than it seems on the surface. (Actual question!)

jaccola 3 months ago

I have considered this before, but then if the content can be cached why wouldn't the website just do this themselves?
They have the incentive, it is relatively easy and I don't think there's a huge benefit to centralisation (especially since it will basically be centralised to one of the big providers of caching anyways)
- myzie 3 months ago
  
  I'm definitely with you that sites should be leveraging CDNs and similar. But I get that many don't want to do any work to support bots that they don't want to exist in the first place.
  To me it seems like the companies actually doing the crawling have an incentive to leverage centralized caching. It makes their own crawling faster (since hitting the cache is much faster than using Playwright etc to load the page) and it reduces the impact on all these sites. Which would then also decrease the impact of this whole bot situation overall.
- brookst 3 months ago
  
  It would shift the complexity and cost of large scale caching to a provider that would sell to the scrapers. Not sure it has much value, but it’s kind of a classic three tier distribution system with a middleman to make life easier for both producer and consumer.
xena 3 months ago

What does the user agent oook like for if you wanted to crawl xeiaso.net?

time4tea 3 months ago

Yeah my small site got 170k requests from one bot in a few mins. Of course it was rate limited, but didn't seem to know 429 or 444 (drop connection) so kept on coming back for more. I do have ipset drop too but at the moment takes a person to enable it... just so much effort to stop these **s. Exhausting!

rambambram 3 months ago

Make AI 'pay' by delaying every request and presenting the content with a warning on top, like: AI bots are (sc)raping the internet, that's why we do this.

Or something like: AI is making your experience worse, complain here (link to OpenAI).

Maybe not the most technical solution, but this at least gets the signal across to regular human beings who want to browse a site. Puts all this AI bs in a bad spotlight.

boyter 3 months ago

Crawling, incidentally, I think is the biggest issue with making a new search engine these days. Websites flat out refuse to support any crawler [other] than Google, and Cloudflare and other protection services and CDN's flat out deny access to incumbents. It is not a level playing field.

I wrote the above some time ago. I think its even more true today. Its practically impossible to crawl the way the bigger players do and with the increased focus on legislation in this area its going to lock out smaller teams even faster.

The old web is dead really. There really needs to be a move to more independent websites. Thankfully we are starting to see more of this like the linked searchmysite discussed earlier today https://news.ycombinator.com/item?id=43467541

zlagen 3 months ago

that's a good point, in search we have google as a monopoly and since a big percentage of sites only want to be crawled by them it reinforces the monopoly. So a lot of people complain about bots not following robots.txt but if you follow them to the letter it's impossible to make anything useful. Also AFAIK robots.txt doesn't have any legal standing

zkmon 3 months ago

An opensource repo asking who is responsible for this AI invasions? Well, it is you, who is responsible for all this. What did you think when you helped tech to advance so rapidly, over-pacing the needs of humans? Read about panchatantra story of 4 brothers who got a dead tiger alive, just to boast of their skill and greatness.

ANarrativeApe 3 months ago

Excuse my ignorance, but is it time to update the open source licenses in the light of this behavior? If so, what should the evolved license wording be?

I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt...

mook 3 months ago

They already ignore copyright. The open source licenses are based on copyright, so changing the licenses wouldn't do squat, they'd still ignore it.
See also: Meta being sued for torrenting. Since this is an Ars Technica article, here's another one: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...
pabs3 3 months ago

That wouldn't be an Open Source Definition compliant license. More in this subthread:
https://news.ycombinator.com/item?id=43423595
johnnyanmac 3 months ago

From my little understanding, we have a sort of agreement in place with an item called robot.txt that's more or less a hanshake with such scrapers. Of course, the issue is these scrapers are blatantly ignoring robots.txt
A license can help as well, but what's a license without enforcement? These companies are simply treating the courts as a cost to do business.
- CaptainFever 3 months ago
  
  Close, robots.txt was originally for web crawlers, to reduce accidental denial-of-service attacks. It had nothing to do with the scraping (i.e. downloading content and parsing the HTML tags in a programmatic manner).
  
  WesolyKubeczek 3 months ago
  
  What do you think a search engine’s crawler bot is doing exactly? I could sure be wrong, but I have a hunch that “downloading content and paraing the HTML tags in a programmatic manner” describes it.
  
  CaptainFever 3 months ago
  
  Yes, but the difference is that the term "scraping" also targets things like automatically generating RSS feeds from HTML pages, which is not covered by robots.txt.
  
  WesolyKubeczek 3 months ago
  
  I thought robots.txt covered all automated, programmatic access by third parties where a bot slurps stuff and follows links, without splitting hairs about it.
  But what do I know, the young whippersnappers will just word lawyer me to death, so I better shut up and go away.

grotorea 3 months ago

Is this stuff only affecting the not for profit web? What are the for profit sites doing? I haven't seen Anubis around the web elsewhere. Are we just going to get more and tighter login walls and send everything into the deep web?

burkaman 3 months ago

For profit sites are making deals directly with the AI companies so they can get some more of that profit.
surfingdino 3 months ago

I think we killed the old web. We'll see new ways of communicating, publishing, and gathering over the internet. It's sad, but it's also exciting.
- epolanski 3 months ago
  
  Literally nothing in this data driven world (sports, technology, entertainment, everything is data driven and maximized) is exciting, nothing.
_xtrimsky 3 months ago

I think what they mean is that most not for profit small sites don't have expensive hardware or DDOS blocking mechanisms. A small 256mb ram vps might be enough for 1000 users per month traffic, but not enough for 200,000 users a day traffic.

eevilspock 3 months ago

The irony is that the people and employees of the AI companies will vehemently defend the morality of capitalism, private property and free markets.

Their robber baron behavior reveals their true values and the reality of capitalism.

johnnyanmac 3 months ago

Insert the Sinclair quote here. Anything to drive up the stock, no matter how immoral or illegal.
randmeerkat 3 months ago

> Their robber baron behavior reveals their true values and the reality of capitalism.
This is rather reductionist… By your same logic I could say that Stalin and Mao revealed the true values and reality of communism.
Let’s not elaborate on it further though and just leave this as a simple argument. Free market capitalism has led us to the most prosperous, peaceful, and advanced society humanity has ever ventured to create. Communism threatened that prosperity and peace with atrocities on a scale that exists beyond human comprehension. Capitalism, even with all of its faults, is the obvious choice.
- johnnyanmac 3 months ago
  
  It's rather strawman to bring up communism in a conversation that talked nothing about it, except that Capitalism is clearly flawed.
  Capitalism without law ends up with the same kind of authoritariasm as communism without law. Some Rich Guy ends up telling everyone what to do as a ruler with loose rules that no longer resemble the economic model. That's what people complain about when they bring up terms like "late stage capitalism".
- CyberDildonics 3 months ago
  
  Internet comments are inherently reductionist.

kh_hk 3 months ago

Proof of work is sufficient (although easy to bypass on targeted crawls) for protecting endpoints that are accessed via browsers, but plain public APIs have to resort to other more primitive methods like rate limiting.

Blocking by UA is stupid, an by country kind of wrong. I am currently exploring ja4 fingerprints, that together with other metrics (country, Arn, block list), might give me a good tool to stop malicious usage.

My point is, this is a lot of work, and it takes time off the budget you give to side projects.

j45 3 months ago

Surprised at how crawling as a whole seems to have taken a sustained step backwards with old best practices that are solved being new to devs today?

I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.

Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.

cyanydeez 3 months ago

I'm guessing its the rise of the AI agents that just go out and download "research" and not necessarily building LLMs.

kazinator 3 months ago

I've been seeing crawlers which report an Agent string like this:

  Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3405.80 Safari/537.36

Everything but the Chrome/ is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string. Always some two digit main version like 69, 70. Then a .0. and then some funny minor and build numbers: typically a four digit minor.

When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.

The attack quickly abated.

haswell 3 months ago

Lately I’ve been thinking a lot about what it would take to create a sort of “friends and family” Internet using some combination of Headscale/Tailscale. I want to feel free and open about what I’m publishing again, and the modern internet is making that increasingly difficult.

DrFalkyn 3 months ago

Chances are one your friends/ family will eventually be compromised in some way.
There’s always VPNs, though you can only be on at one at a time per device

CyberDildonics 3 months ago

Can't you just rate limit beyond what a person would ever notice and do it by slowing the response?

ordersofmag 3 months ago

Not if they hop to a different IP address every few requests. And they generally aren't bothered slow responses. It's not like they have to wait for one request to finish before they make another one (especially if they are making requests from thousands of machines).
- CyberDildonics 3 months ago
  
  You're saying that large companies are hitting individual websites with thousands of unrelated IP addresses?
  
  Vespasian 3 months ago
  
  Yep we've been seeing that on our random small scale site that used to be open (and mostly relevant to a very limited number of people).
  It was nice for interested guests to get an impression of what we are doing.
  First the AI crawlers came in from foreign countries that could be blocked.
  Then they beat down the small server by being very distributed, calling from thousands of ips one or two requests each.
  We finally put a stop to it by requiring a login with a message informing people to physically show up to gain access.
  Worked fine for over 15 years but AI finally killed it.
  
  jaggederest 3 months ago
  
  How do you think they do crawling if not like that? They'd be IP banned instantly if they used any kind of predictable IP regime for more than a few minutes.
  
  CyberDildonics 3 months ago
  
  I don't know what is actually happening, that's why I'm asking.
  Also you're implying that the only way to crawl is to essentially DDOS a website by blasting them from thousands of IP addresses. There is no reason crawlers can't do more sites in parallel and avoid hitting individual sites so hard. There are plenty of crawlers for the last few decades that don't cause problems, these are just stories about the ones that do.
xena 3 months ago

You'd think, but no :(

jrvarela56 3 months ago

This is going to start happening to brick and mortar businesses through their customer support channels

imtringued 3 months ago

Kitboga already deployed a counter-scam LLM that calls scammers to waste their time. The goal is to keep them on the line as long as possible.
sean_lynch 3 months ago

I’d love to hear more. What are they seeing?
- jrvarela56 3 months ago
  
  Unfounded. Assuming automated/voice-enabled assistants will spam for opening times, price & description of products/services, scheduling/booking, etc.
  In the long run it'll be an arms race but the transition will be rough for businesses as consumers can adopt these tools faster than SMBs or enterprises can integrate them.
cdolan 3 months ago

And enterprise call centers

101008 3 months ago

I have a blog with content about a non tech topic and I had any problem. I am all against AI scrappers, but I didn't notice any change being behind Cloudflare (funny enough, if I ask GPT and Claude about my website, they know it)

WesolyKubeczek 3 months ago

They could hit Cloudflare’s caches and run with that. The problems are if your site is dynamic and your CDN has to hit the origin every time.

miyuru 3 months ago

Isn't this by problem solved by using commoncrawl data. I wonder what changed to AI companies to do mass crawling individually.

https://commoncrawl.org/

JCharante 3 months ago

I believe we should have microtransactions to access resources. Pay a server a tiny amount and it'll return the content. This way if crawlers dominate traffic it just means they're paying a bunch for it

hbcondo714 3 months ago

> many AI companies engage in web crawling

Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.

CyberDildonics 3 months ago

How do they manage to get 'paywalled' content?
- hbcondo714 3 months ago
  
  Maybe 'paywalled' is not the best word but using their Identity Based Crawling feature with Managed Browsers[1], you can use an existing account and scrape content that requires authentication. This may not sound like anything new but IMHO, crawl4ai's workflow is easy to follow.
  [1] https://docs.crawl4ai.com/advanced/identity-based-crawling

aorth 3 months ago

I think it's strange to focus on "FOSS sites" in this article. I have regular corporate sites that are getting slammed by ChatGPT, Perplexity, ByteDance, and others too.

temp008 3 months ago

I wonder if a service exists where IPs known for crawling are reported and reputation of such IPs is tracked for others to use and ban / ratelimit by default

mrweasel 3 months ago

The problem with that approach is that you'll quickly add large swaths of IPs belonging to cloud service provides, such as AWS. We already know that AWS, Azure, GCP and Alibaba are part of the problem, so we can technically just rate-limit them already. I believe that all of them publish their IP ranges.
Google also publish the IP ranges for GoogleBot I believe, and Bing probably does the same, so we can then whitelist those IPs and still have sites appear in searches.
My issue is that the burden is again placed on everyone else, not the people/companies who are causing the problem.
It's crazy to me to think about how much needless capacity is built into the internet to deal with crawlers. The resource waste is just insane.

pdw 3 months ago

I wonder how long until we'll see DNS blocklists to blackhole IP addresses associated with scrapers. It seems like the logical evolution.

kh_hk 3 months ago

There are ip blocklist. Ranging from free, to paid, from text files to apis. Some of the sites offering IP blacklists also need to protect themselves from automated crawls too. It all goes full circle

navane 3 months ago

How do they know which square contains a bicycle?

_DeadFred_ 3 months ago

Sounds like every request should require crypto processing be fed back to the host on the client side as quid pro quo.

instagib 3 months ago

Can we force the bots to mine cryptocurrency?

userbinator 3 months ago

Anger someone enough and you'll get a real DDoS instead.

throwaway81523 3 months ago

Saaay, what happened to that assistant US attorney who prosecuted Aaron Swartz for doing exactly this? Oh wait, Aaron didn't have billions in VC backing. That's the difference.

kittikitti 3 months ago

One of the main peddlers for residential proxies is Israel, see BrightData.

zlagen 3 months ago

in the perfect world we would have a very high trust internet in which everyone follows the rules, checks robots.txt, rate limits, etc. But we don't live in such a world. Getting angry at these ai bots is useless. People should start considering what and how they host their data. If you're worried about bandwidth costs you have many alternatives to host your data for free or at a very little cost. i.e: github/gitlab/cloudflare.

If you're worried about your data getting scraped and used then maybe you can consider putting it behind a login or do some proof of work/soft captcha. Yeah, this isn't perfect but it will keep most dumb bots away.

Some people are hosting their sites like we're still in 1995 and times have changed.

internet101010 3 months ago

I block all traffic except that which comes from the country of Cloudflare.

superkuh 3 months ago

While the motivations may be AI related the cause of the problem is the first and original type of non-human person: corporations. Corporations are doing this, not human persons, not AI.

nektro 3 months ago

those crawlers and models are a scourge on the internet

yieldcrv 3 months ago

nice, a free way to keep our IPFS pins alive

i5heu 3 months ago

This is not how it works.
You have to actively pin data for it to be distributed.

EVa5I7bHFq9mnYK 3 months ago

Why can't Cloudflare do it? They know in real time which IPs spew millions of scrape requests to various sites, so they can classify them as AI bots and allow site owners to block them.

BiteCode_dev 3 months ago

Some people have issues with the idea that the entire web will eventually depend of a single point of failure and private for provide entity like cloudflare.
- EVa5I7bHFq9mnYK 3 months ago
  
  Well, should choose from two evils then
miohtama 3 months ago

Cloudflare already offers this feature
https://blog.cloudflare.com/declaring-your-aindependence-blo...

egypturnash 3 months ago

This is 100% off-topic but:

Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.

misnome 3 months ago

I also get this for 1+ years, and just assume that it’s a symptom of my adblockers/pihole working and screwing up something in their auto ad targeting.
- egypturnash 3 months ago
  
  I’m glad to know it’s not just me! It’s such a weird and specific failure state.

hoaxminion 3 months ago

[dead]

RamblingCTO 3 months ago

Sure, overly spamming websites is shitty behaviour. But blocking AI crawlers hurts you in the end. Guess what will replace SEO in the long run?

entropi 3 months ago

>But blocking AI crawlers hurts you in the end. Guess what will replace SEO in the long run?
Maybe. But even if that turns out to be true, what good is it for the source website? The "AI" will surely not share any money (or anything else that may help the source website) with the source anyways. Why would they, they already got the content and trained on it.
- RamblingCTO 3 months ago
  
  What good is it? If "AI" doesn't know about you down the line, you won't be discovered. Be it in LLM weights or via crawling (perplexity, jina reader etc.), you won't get any organic traffic. It's not about sharing profits.
  
  entropi 3 months ago
  
  Again, the "AI" doesn't care about the website. It doesn't even link to it in vast majority of the cases. Even if it did, the "AI" derives a lot of its business value from the fact that it is providing what the client requests while removing the need to visit potentially dozens of these pages. So the clients, in most cases, would not even click them (as they already got what they wanted).
xena 3 months ago

So you're willing to pay my hosting bills?
- RamblingCTO 3 months ago
  
  Re-read what I posted and stop projecting.
  
  xena 3 months ago
  
  But if the AI crawlers are taking the website down and money buys more server time, are you willing to do your part and use money to make sure your training data sources are solvent until you can replace them?
  
  RamblingCTO 3 months ago
  
  I am not training shit dude ... So stop projecting. And again: spamming is unfair, as I said.
- CyberDildonics 3 months ago
  
  What a bizarre response. They are just saying that some of these crawlers are being used for search engines in one way or another.