xyzal 3 days ago

I think the proper answer is to aim for the bots to get _negative_ utility from visiting our sites (that is, poisoning their well), not just zero value (that is, blocking them).

Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.
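
A rough sketch of the canary idea (Flask purely as an example; the path, the in-memory blocklist, and the garbage generator are placeholders):

  # robots.txt served alongside this app:
  #   User-agent: *
  #   Disallow: /secret-reports/
  from flask import Flask, request

  app = Flask(__name__)
  poisoned_ips = set()  # in practice, a shared store with an expiry

  def garbage_article() -> str:
      # Placeholder: pre-generated, plausible-looking but wrong text.
      return "<html><body><p>Peer-reviewed benefits of drinking bleach...</p></body></html>"

  @app.route("/secret-reports/")
  def canary():
      # Only a client ignoring robots.txt should ever request this path.
      poisoned_ips.add(request.remote_addr)
      return garbage_article()

  @app.before_request
  def poison_known_offenders():
      if request.remote_addr in poisoned_ips:
          return garbage_article()  # keep serving garbage instead of the real pages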

Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).

A request rate too inhuman? Here, take these generated articles about the positive effects of catching measles on performance in bed.

And so on, and so forth ...

Nepenthes is nice, but word salad can be detected easily. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.

  • sigmoid10 3 days ago

    This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the side of the crawlers' creators. Distinguishing genuine traffic has always been hard and it will not get easier in the age of AI.

    • hec126 3 days ago

      You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.

      • rustc 3 days ago

        This would be terrible for accessibility for users using a screen reader.

        • mostlysimilar 3 days ago

          So would the site shutting down because AI bots are too much traffic.

        • MrResearcher 2 days ago

          <a aria-hidden="true"> ... </a> will result in a link ignored by screen readers.

          • rustc a day ago

            Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.

    • soco 3 days ago

      But the very comment you answered explains how to do it: a page forbidden in robots.txt. Does this method need an explanation of why it's ideal for sorting humans and Google from malicious crawlers?

      • majewsky 3 days ago

        robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.

        • ccgreg 5 hours ago

          I'd say a bigger problem is that people disagree about the meaning of nofollow and noindex.

      • sigmoid10 3 days ago

        The detection and the bypass are trivial: access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
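
        Sketching the crawler-side check (two hypothetical egress proxies; this also ignores that genuinely dynamic pages will differ anyway):

          import hashlib
          import requests

          # Hypothetical egress points: one identity has been ignoring robots.txt, one hasn't.
          CLEAN_PROXY = {"https": "http://clean-egress.example:8080"}
          BURNED_PROXY = {"https": "http://burned-egress.example:8080"}

          def looks_poisoned(url: str) -> bool:
              clean = requests.get(url, proxies=CLEAN_PROXY, timeout=10).text
              burned = requests.get(url, proxies=BURNED_PROXY, timeout=10).text
              # If the same URL differs depending on who asks, distrust what the burned identity got.
              return hashlib.sha256(clean.encode()).hexdigest() != hashlib.sha256(burned.encode()).hexdigest()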

      • delichon 3 days ago

        Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?

        • voidUpdate 3 days ago

          When was the last time you looked at robots.txt to find a page that wasn't linked anywhere else?

          • zzo38computer 3 days ago

            It was a while ago, and it was not deliberate: wget downloaded robots.txt as well as the files I requested, and I was able to find many other files because of that. Some of them could not be accessed because they required a password, but some were interesting (although I did not use wget to copy those other files; I only wanted to copy the files I originally requested).

          • sharlos201068 3 days ago

            Crawlers aren't interested in fake pages that aren't linked to anywhere; they're crawling the same pages your users are viewing.

            • danielheath 3 days ago

              Adding a disallowed url to your robots.txt is a quick way to get a ton of crawlers to hit it, without linking to it from anywhere. Try it sometime.

          • brookst 3 days ago

            robots.txt is not a sitemap. If it worked that way you could just make a 5TB file linking to a billion pages that look like static links but are dynamically generated.

            • ccgreg 5 hours ago

              robots.txt has a maximum relevant size of 500 KiB.

  • TonyTrapp 3 days ago

    We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.

    • sokoloff 3 days ago

      > dozens of IPs, so every IP just makes 1-2 requests in total

      Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.

      • lucb1e 3 days ago

        I'm also affected. I presume that this is per day, not just once, yet it's fewer requests than a human would often make, so you cannot block on that basis. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to the IP address WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. That is not counting several other scrapers that do set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note: I've only blocked them from one specific service that proxies and caches data from a third party, which I'm caching precisely because the third-party site struggled with load, so more load from these bots isn't helping.)

        That's up to 37e6/24/60/60 = 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more, some of them a few thousand per year, some of them a few dozen per year; but thankfully they don't unleash the whole IP range on me at once. It occasionally rotates through to new ranges to bypass blocks.

      • aorth 3 days ago

        Parent probably meant hundreds or thousands of IPs.

        Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.

        Edit: LOL, didn't read the article until after posting—they mention the Fedora Pagure server getting this traffic from Brazil last week too!

        Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.
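
        The subnet matching itself is cheap to prototype outside of nginx; a rough sketch (the CIDRs below are placeholders; substitute each provider's published ranges):

          from ipaddress import ip_address, ip_network

          # Placeholder ranges only; real lists come from the providers' published IP ranges.
          RATE_LIMITED_NETS = [ip_network("198.51.100.0/24"), ip_network("203.0.113.0/24")]

          def is_datacenter(ip: str) -> bool:
              addr = ip_address(ip)
              return any(addr in net for net in RATE_LIMITED_NETS)

          # Requests from matching subnets get the strict limit (or a straight 429),
          # everyone else gets the normal one.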

        Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?

        I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.

      • TonyTrapp 3 days ago

        If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.

      • giantg2 3 days ago

        That's probably per day, per bot. Now how does it look when there are thousands of bots? In most cases I think you're right, but I can also see how it can add up.

  • gchamonlive 3 days ago

    If I'm hosting my site independently, with a rented machine and a Cloudflare CDN, and hosting my code on a self-managed GitLab instance, how should I go about implementing this? Is there something plug-and-play I can drop into nginx that would do this work for me, serving bogus content and leaving my GitLab instance unscathed by bots?

  • lucb1e 3 days ago

    > Is your user-agent too suspicious?

    Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser; I guess I'll have to go back to the fully-featured browser I migrated away from because it was much slower.

    > A request rate too inhuman?

    I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to resolve when someone follows a link directly to a subpage and thus doesn't have a cookie set yet. Either way, blocks left and right are fun! More please!

  • jajko 3 days ago

    That's an absolutely brilliant, f*cked up idea: poisoning AI while fending them off.

    Gotta get my daily dose of bleach for enhanced performance, chatgpt said so.

  • j-bos 3 days ago

    This makes tons of sense: the AI trainers spend endless resources aligning their LLMs; the least we could do is spend a few minutes aligning their owners. Fixing things at the incentive level.

  • usrnm 3 days ago

    Where are you going to get all that content? If it's static, it will get filtered out very fast; if it's dynamic and autogenerated, it might cost even more than just letting the crawler through.

    • hec126 3 days ago

      Generate it once every few weeks with LLaMa and then serve as static content?

  • agilob 3 days ago

    Also, lower upload rate to 5kb/s

  • PeterStuer 3 days ago

    You clearly have no idea of the incompetence so many public administrations show in configuring robots.txt for data that is actually created for and specifically meant to be consumed programmatically (think RSS and Atom feeds, REST API endpoints, etc.). Half the time the person setting up the robots.txt just blanket-blocks everything, and does not even know (or care) to exclude those.

  • seper8 3 days ago

    I love this idea haha

tedunangst 3 days ago

> It remains unclear why these companies don't adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don't overwhelm source websites.

If the target goes down after you scrape it, that's a feature.

  • prisenco 3 days ago

    This has me wondering what it would take to do a bcrypt style slow hashing requirement to retrieve data from a site. Something fast enough that a single mobile client for a user wouldn't really feel the difference. But an automated scraper would get bogged down in the calculations.

    Data is presented to the user with multiple layers of encryption that they use their personal key to decrypt. This might add an extra 200ms to decrypt. Degrades the user experience slightly but creates a bottleneck for large-scale bots.

    • tschwimmer 3 days ago

      Check out Anubis - it's not quite what you're suggesting but similar in concept: https://anubis.techaro.lol/

      • LPisGood 3 days ago

        How does it work? I don’t have time to read the code, and the website/docs seem to be under construction.

        Does it have the client do a bunch of SHA-256 hashes?

        • 01HNNWZ0MV43FF 3 days ago

          Yeah it's like hashcash. You have to try random numbers until you roll a hash with enough leading zeroes. Then you get a cookie (JWT I think) that's valid for a week.

          SHA2 can run on ASICs and isn't memory-hard, so I'm hoping someone will add something tougher
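
          The core of it is tiny; a toy sketch of the hashcash idea (not Anubis's actual code, and the difficulty value is made up):

            import hashlib
            import secrets

            DIFFICULTY = 20  # leading zero bits required; tune so a phone takes about a second

            def solve(challenge: bytes) -> int:
                nonce = 0
                while True:
                    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
                    if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
                        return nonce  # found a hash with DIFFICULTY leading zero bits
                    nonce += 1

            def verify(challenge: bytes, nonce: int) -> bool:
                # Server side: one hash, no matter how long the client searched.
                digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
                return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

            challenge = secrets.token_bytes(16)  # issued per visitor, then remembered via a signed cookie
            assert verify(challenge, solve(challenge))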

      • prisenco 3 days ago

        Interesting! Not how I'd approach it but certainly thinking along the same lines.

      • userbinator 3 days ago

        I just hit a site with this --- and hit the back button immediately.

        • Retr0id 3 days ago

          Better than a 500 error

          • fc417fc802 3 days ago

            Also better than a third party service IMO because of the privacy implications. You could even potentially give users the choice (complete overkill but technically you could do it).

      • joeblubaugh 3 days ago

        Maybe once you can use something more professional as an interstitial page

        • ndiddy 3 days ago

          From the announcement page: https://xeiaso.net/blog/2025/anubis/

          > RPM packages and unbranded (or customly branded) versions are available if you contact me and purchase commercial support. Otherwise your users have to see a happy anime girl every time they solve a challenge. This is a feature.

          • xena 3 days ago

            I'm going to be making distro packages and binaries public. I will have to figure out white label monetization I guess.

        • neurostimulant 3 days ago

          It's open source with MIT license, so you should be able to remove those images yourself if you want.

        • Spivak 3 days ago

          [flagged]

          • bakugo 3 days ago

            [flagged]

            • williamdclt 3 days ago

              I’m no anime fan, but this sort of judgement on normalcy leaves me with a very sour impression about whoever says this sort of thing.

              • dns_snek 3 days ago

                [flagged]

                • imtringued 3 days ago

                  You might say that, but claiming everyone is a pedophile is such a tired political play at this point. Its primary purpose is to dehumanize people so that blatant wrongdoing can be justified.

                  The visceral reaction might be genuine, but the actual feelings are probably not. I have yet to see someone who actually "cares about the children". The vast majority of accusations (e.g. Democrats running a pedophile ring) turn out to be completely manufactured. The Democrats responded by giving more funding to, and starting, projects combating child abuse, only for the Republicans, who think they are a waste of taxpayer money and an example of big government, to gut the programs.

                  • dns_snek 3 days ago

                    [flagged]

                    • GoblinSlayer 3 days ago

                      Violent games make people violent, and money makes people greedy again?

                      • dns_snek 3 days ago

                        How does that relate to anything I said?

    • puchatek 3 days ago

      If we are able to detect AI scrapers then I would welcome a more strategic solution: feed them garbage data instead of the real content. If enough sites did that then the inference quality would take a hit and eventually the perpetrators, too.

      But of course this is the more expensive option that can't really be asked of sites that already provide public services (even if those are paid for by ads).

      • brookst 3 days ago

        I really don’t think we have such a lack of misinformation that we need to invest in creating more of it, no matter the motive.

    • kevin_thibedeau 3 days ago

      The problem is that the server also has to do the work. Fine for an infrequent auth challenge. Not so fine for every single data request.

      • tedunangst 3 days ago

        Tons of problems are easier to verify than to solve.

        • fsckboy 3 days ago

          do any of them involve mining bitcoin?

      • saganus 3 days ago

        Maybe there is a way for the server to ask the client to do the work?

        Something similar to proof-of-work but on a much smaller scale than Bitcoin.

        • mrheosuper 3 days ago

          Just add some delay to your response; we don't have to waste any more energy on meaningless calculation.

          • 01HNNWZ0MV43FF 3 days ago

            Adding delay means you have to keep more connections open at a single time. Parallelism doesn't favor a server if your problem is already a small server getting hit by a big scraper

            • Ma8ee 3 days ago

              How expensive is it to just keep a connection open?

              • swiftcoder 3 days ago

                About 20 kilobytes of socket + TLS state, if you've really optimised it down to the minimum. Most server software isn't that lean, of course, so pick a framework designed for running a million or so concurrent connections on a single server (i.e. something like Nginx)

      • prisenco 3 days ago

        Right it would need an algorithm with widely different encryption speeds vs decryption speeds. Lattice-based cryptography maybe?

        • Retr0id 3 days ago

          Hash functions are all you need.

          • Dylan16807 3 days ago

            Yeah, searching for hashes with some prefix is easy to set up.

    • koakuma-chan 3 days ago

      Companies running those bots have more than enough resources

      • prisenco 3 days ago

        Nobody has unlimited resources. Everything is a cost-benefit analysis.

        For highly valuable information, they might throw the GDP of a small country at scraping your site. But most information isn't worth that.

        And there are a lot of bad actors who don't have the resources you're thinking of that are trying to compete with the big guys on a budget. This would cut them out of the equation.

        • timewizard 3 days ago

          Making all websites intentionally waste energy as a strategy to defeat unscrupulous operators has significant costs and only marginal benefits.

          • prisenco 3 days ago

            Resources applied to prevent bad actors from degrading or destroying the commons has always been the cost of civilization.

          • noosphr 3 days ago

            Then use it to mine monero or similar.

            The idea that you should pay for content shouldn't be an insane pipedream. It should be the default on the internet.

            Maybe then we wouldn't be in the situation where getting new users is an existential threat to the majority of websites.

          • 01HNNWZ0MV43FF 3 days ago

            I'll add a link on my site recommending that visitors petition their elected officials for a pollution tax

  • MathMonkeyMan 3 days ago

    Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?

    • randmeerkat 3 days ago

      > Why? What is the goal of a scraper, and how does disabling the source of the data benefit them?

      The next scraper doesn't get the data. People don't realize we're not compute-limited for AI, we're data-limited. What we're watching is the "data war".

      • cyanydeez 3 days ago

        at this point we're _good data_ limited, which has little to do with scraping.

        • DrFalkyn 3 days ago

          What kind of data that isn't public would be so valuable for AI training?

          Seems like there’s a fuck ton. All of Wikipedia, GitHub for code, etc.

          I can understand targeting certain sites like Reddit, etc. but not random websites

          • timewizard 3 days ago

            It's to rip off copyrighted content and profit from it instead of the original authors. It's like every other low-rent and highly automated scam that finds its way onto the internet.

            If you look closely even Google does this. This is probably why many popular sites started getting down ranked in the last 2 years. Now they're below the fold and Google can present their content as their own through the AI box.

            • throwaway2037 3 days ago

              Please remember that Google only needs to be marginally better than the competition. And, of course, their primary biz is ads, not serving great results; that is a distant second priority.

              • MathMonkeyMan 3 days ago

                Their biz is ads, but since search is winner takes all they need only be marginally better than the competition... twenty years ago.

                • timewizard 3 days ago

                  > Their biz is ads,

                  Yea, but, the FTC doesn't want it to be.

          • grotorea 3 days ago

            Discord I guess would be quite valuable, even the de facto public servers.

        • XorNot 3 days ago

          Honestly it's hard to tell how much more value the LLM people are going to get out of another copy of the internet.

          It feels a lot like they're stuck for improvements but management doesn't want to hear it.

          • Davidzheng 3 days ago

            It's a bit strange to talk about stuck when the most recent breakthrough is less than a year old.

            • LPisGood 3 days ago

              I’m not sure what you mean by breakthrough, but if you’re talking about Deepseek, it’s more of an incremental improvement than a breakthrough.

        • threatofrain 3 days ago

          Scraping social media is good data, even without ML. The fact that something is "happening" to people in a social space inherently has importance to people. The specter of law is more threatening to whether companies can get their hands on good data.

    • DaSHacka 3 days ago

      Now the only way to obtain that information is through them

    • TuxMark5 3 days ago

      I guess one could make a point that competition will no longer have the access to the scraped data.

rfurmani 3 days ago

After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.

  • karlgkk 3 days ago

    One thing that worked well for me was layering obstacles

    It really sucks that this is the way things are, but what I did was

    10 requests for pages in a minute and you get CAPTCHA'd (with a little apology and the option to bypass it by logging in). Asset loads don't count.

    After a CAPTCHA pass, 100 requests in an hour get you auth-walled.

    It's really shitty, but my industry is used to content scraping.

    This allows legit users to get what they need. Although my users maybe don't need prolonged access, ahem.
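
    Roughly, the logic is something like this (a simplified sketch; the thresholds and bookkeeping here are illustrative, not my exact setup):

      import time
      from collections import defaultdict

      page_hits = defaultdict(list)  # ip -> timestamps of page requests (asset loads never counted)

      def decide(ip: str, is_page: bool, passed_captcha: bool, logged_in: bool) -> str:
          now = time.time()
          if is_page:
              page_hits[ip].append(now)
          last_minute = sum(1 for t in page_hits[ip] if now - t < 60)
          last_hour = sum(1 for t in page_hits[ip] if now - t < 3600)

          if logged_in:
              return "serve"                   # authenticated users skip the obstacles
          if last_minute > 10 and not passed_captcha:
              return "captcha"                 # apologetic interstitial, or log in to bypass
          if passed_captcha and last_hour > 100:
              return "login_required"          # second layer: the auth wall
          return "serve"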

    • nomel 3 days ago

      What happens if you use the proper rate limiting status of 429? It includes a next retry time [1]. I'm curious what (probably small) fraction would respect it.

      [1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

      • karlgkk a day ago

        Probably makes sense for a b2b app where you publish status codes as part of the api

        Bad actors don’t care and annoying actors would make fun of you for it on twitter

    • rfurmani 3 days ago

      I've wanted to but wasn't sure how to keep track of individuals. What works for you? IP Addresses, cookies, something else?

      • karlgkk a day ago

        I use IP addy. Users behind cgnat are already used to getting captcha the first time around

        There’s some stuff you can do, like creating risk scores (if a user changes ip and uses the same captcha token, increase score). Many vendors do that, as does my captcha provider.

    • nukem222 3 days ago

      > This allows legit users to get what they need.

      Of course they could have just used the site directly.

      • karlgkk a day ago

        If bots and scrapers respected the robots and tos, we wouldn’t be here

        It sucks!

    • LPisGood 3 days ago

      What is your website?

xena 3 days ago

Wow it is so surreal to see a project of mine on Ars Technica! It's such an honor!

  • seafoamteal 3 days ago

    I've seen Anubis a couple times irl, mostly on Sourcehut, and the first time I saw it I was like, "Hey, I remember that blog post!" Congratulations on making something both useful and usable!

  • Figs 3 days ago

    Hmm. Instead of requiring JS on the client, why don't you add a delay on the server side (e.g. 1 second default, adjustable by server admin) for requests that don't have a session cookie? For each session keep a counter and a timestamp. Every time you get a request from a session, look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp. If the counter is greater than a configured threshold, slow-walk the response (e.g. add a delay before forwarding the request to the shielded web server -- or transfer the response back out at reduced bytes/second, etc.)

    You can periodically remove tracking data for entries older than a threshold -- e.g. once a minute or so (adjustable) remove tracked entries that haven't made a request in the past minute to keep memory usage down.

    That'd effectively rate limit the worst offenders with minimal impact on most well-behaved edge-case users (like me running NoScript for security) while also wasting less energy globally on unnecessary computation through the proof-of-work scheme, wouldn't it? Is there some reason I'm not thinking of that would prevent that from working?
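
    For concreteness, something like this is what I have in mind (a rough sketch; the thresholds and the session-cookie plumbing are hand-waved):

      import asyncio
      import time

      WINDOW = 60        # seconds of inactivity before an entry is forgotten
      THRESHOLD = 30     # requests per window before slow-walking starts
      DELAY = 1.0        # seconds added per request once over the threshold

      sessions = {}      # session_id -> (request_count, last_seen)

      async def throttle(session_id: str) -> None:
          count, _ = sessions.get(session_id, (0, 0.0))
          sessions[session_id] = (count + 1, time.time())
          if count + 1 > THRESHOLD:
              await asyncio.sleep(DELAY)  # delay before forwarding to the shielded server

      def sweep() -> None:
          # Run every minute or so: drop entries idle longer than WINDOW to bound memory.
          cutoff = time.time() - WINDOW
          for sid in [s for s, (_, seen) in sessions.items() if seen < cutoff]:
              del sessions[sid]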

    • viraptor 3 days ago

      > look up the tracked entry, increment the counter (or initialize it if not found) and update the timestamp

      Not sure about author's motivation, but this part is why I don't track usage - PoW allows you to do everything statelessly and not keep any centralised database or write any data. The benefit of a system slowing down crawling should be minimal resource usage for the server.

    • PufPufPuf 3 days ago

      In a "denial of service prevention" scenario, you need your cost to be lower than the cost of the attacker. "Delay on the server side" means keeping a TCP connection open for that long, and that's a limited resource.

  • true_blue 3 days ago

    On the few sites I've seen using it so far, it's been a more pleasant (and cuter) experience for me than the captchas I'd probably get otherwise. good work!

    • xena 3 days ago

      Thanks! The artist I'm contracting and I are in discussions on how to make the mascot better. It will be improved. And more Canadian.

  • techjamie 3 days ago

    The ffmpeg website is also using it. First time I actually saw it in the wild.

Nckpz 3 days ago

I recently started a side-project with a "code everything in prod" approach for fun. I've done this many times over the past 20 years and the bot traffic is usually harmless, but this has been different. I haven't advertised the hostname anywhere, and in less than 24 hours I had a bunch of spam form submissions. I've always expected this after minor publicity, but not "start server, instantly get raided by bots performing interactions"

  • ndiddy 3 days ago

    This is yet another great example of the innovation that the AI industry is delivering. Why just limit your scraper bots to GET requests when there might be some juicy data to train on hidden behind that form? There's a reason why the a16z funded cracked vibe coder ninjas are taking over software engineering, they're full of wonderful ideas like this.

    • ohgr 3 days ago

      I work with one of those guys. Morally bankrupt at every level of his existence.

  • bendangelo 3 days ago

    There are bots that scrape HTTPS registration sites; that's how they usually find you.

ipaddr 3 days ago

I've had a number of content sites, and I've shut down a few of them in the last few days because of the toll these aggressive AI bots take. Alexa seems like the worst.

These were created 20 years ago and updated over the years. I used to get traffic, but that's slowed to 1,000 or fewer legitimate visitors over the last year. But now I have to deal with server-down emails caused by these aggressive bots that don't respect the robots file.

  • Aurornis 3 days ago

    > Alexa seems like the worst.

    Many of the bots disguise themselves as coming from Amazon or another big company.

    Amazon has a page where you can check some details to see if it’s really their crawler or someone imitating it.

    • spookie 3 days ago

      Yup, actually most of the ones I've seen are impersonating Amazon.

  • svelle 3 days ago

    It's funny how every time this topic comes up, someone says "I've had this happen and X is the worst", with X being any of the big AI providers. Just a couple of minutes ago I read the same in another thread and it was Anthropic. A couple of weeks back it was Meta.

    My conclusion is that they're all equally terrible then.

    • Aurornis 3 days ago

      All of the crawlers present themselves as being from one of the major companies, even if they’re not.

      Setting user-agent headers is easy.

      • cyanydeez 3 days ago

        At the same time, all the AI providers have some kind of web-based AI agent, so let's not pretend they're crafting their services with care for other people's websites.

        • CaptainFever 3 days ago

          I highly doubt that people are using AI agent features so frequently, and in such a concentrated way, that it brings down websites.

    • ipaddr 3 days ago

      I agree they are all creating negative value for site owners. In my personal experience this week blocking Amazon solved my server overload issue.

rco8786 3 days ago

I got DoSed by ClaudeBot (Anthropic) just last week. Hitting a website I manage 700,000 times in one month and tripping our bandwidth limit with our hosting provider. What a PITA to have to investigate that, figure it out, block the user agent, and work with hosting provider support to get the limit lifted as a courtesy.

Noticed that the ChatGPT bot was 2nd in traffic to this site, just not enough to cause trouble.

  • i5heu 3 days ago

    At which level of DDoS can one claim damages from them?

    • brookst 3 days ago

      You can claim whatever you want, but actual litigation is expensive and it's not at all a sure thing that "I made a publicly available resource and they used it too much" is going to win damages. Maybe? Maybe not?

      • i5heu 3 days ago

        So I could DDoS anyone legally as long as I have some reason?

        • brookst 2 days ago

          Sure. Courts tend to care about intent, but you are welcome to try to change that.

  • Neil44 3 days ago

    I've got claude bot blocked too. It regularly took sites offline and ignored robots.txt. Claude bot is an asshole.

  • throwaway2037 3 days ago

    robots.txt did not work?

    • seabird 3 days ago

      Of course it didn't work. At best, the dorks doing this think there's a gamechanging LLM application to justify the insane valuations right around the corner if they just scrape every backwater site they can find. At worst, they're doing it because it's paying good money. Either way, they don't care, they're just going to ignore robots.txt.

    • hsbauauvhabzb 3 days ago

      I have not monitored traffic in this way, but I imagine most AI companies would explicitly follow links listed in robots, even if not mentioned elsewhere on the site.

    • epc 3 days ago

      I’ve been doing web sites for thirty years, robots.txt is at best a request to polite user agents to respect the server’s desires. None of the malicious crawlers respect it. None of the AI crawlers respect it.

      I’ve resorted to returning xml and zip bombs in canary pages. At best it slows them down until I block their network.

    • lemper 3 days ago

      bro, since when vc funded ai companies have the courtesy to respect robots.txt?

userbinator 3 days ago

All these JS-heavy "anti bot" measures do is further entrench the browser monopoly, making it much harder for the minority of independents, while those who pay big $$$ can still bypass them. Instead I recommend a simple HTML form that asks questions with answers that LLMs cannot yet figure out or get consistently wrong. The more related to the site's content the questions are, the better; I remember some electronics forums would have similar "skill-testing" questions on their registration forms, and while some of them may be LLM'able now, I suspect many of them are still really CAPTCHAs that only humans can solve.
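
The plumbing for such a form is trivial; the hard part is choosing questions that current LLMs actually fail. A minimal sketch (the sample question and accepted answers are purely illustrative):

  from flask import Flask, request

  app = Flask(__name__)

  # Illustrative only; a real site would rotate questions drawn from its own subject matter.
  QUESTION = "In a common-emitter amplifier, which terminal does the input signal drive?"
  ACCEPTED = {"base", "the base"}

  @app.route("/register", methods=["GET", "POST"])
  def register():
      if request.method == "POST":
          answer = request.form.get("skill_check", "").strip().lower()
          if answer not in ACCEPTED:
              return "Wrong answer, try again.", 403
          return "Welcome aboard."
      return f'<form method="post">{QUESTION} <input name="skill_check"><button>Go</button></form>'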

IMHO the fact that this shows up at a time when the Ladybird browser is just starting to become a serious contender is suspicious.

  • mvdtnz 3 days ago

    How does JS entrench a browser monopoly? If you're not using vendor-specific JS extensions or non-standard APIs any browser should be able to execute your JS. Like most web developers I don't have a lot of patience for the people who refuse to run JS on their clients.

    • userbinator 3 days ago

      The effort required to implement a JS engine and keep trendchasing the latest changes with it is a huge barrier to entry, not to mention the insane amount of fingerprinting and other privacy-hostile, anti-user techniques it enables.

      Seeing what used to be simple HTML forms turned into bloated invasive webapps to accomplish the exact same thing seriously angers me; and everyone else who wanted an easily accessible and freedom-preserving Internet.

  • Terr_ 3 days ago

    Or require every fresh "unique" visitor to run some JS that takes X seconds to compute.

    It's not nice for visitors using a very old smartphone, but it's arguably less-exclusionary than some of the tests and third-party gatekeepers that exist now.

    In many cases we don't actually care about telling if someone is truly a human alone, as much as ensuring that they aren't a throwaway sockpuppet of a larger automated system that doesn't care about good behavior because a replacement is so easy to make.

    • userbinator 3 days ago

      > that takes X seconds to compute.

      Those who have the computing resources to do commercial scraping will easily get past that.

      In contrast, there are still many questions which a human can easily answer, but even the best LLMs currently can't.

      • Terr_ 3 days ago

        It doesn't have to be bulletproof, it just has to create a cost that doesn't scale economically for them.

        • userbinator 3 days ago

          Computing power is cheap, and getting cheaper for the big guys. Real humans are not.

      • kristiandupont 3 days ago

        >there are still many questions which a human can easily answer, but even the best LLMs currently can't.

        I am genuinely curious: what is an example of such a question, if it's for a person you don't know (i.e. where you cannot rely on inside knowledge)?

    • a2128 3 days ago

      IIRC that's basically already part of what Cloudflare Turnstile does

everdrive 3 days ago

One more reason we're moving away from privacy. Didn't load all the javascript domains? You're probably a bot. Not signed in? You're probably a bot. The web we knew is dying step by step.

One interesting thought: do we know if these AI crawlers intentionally avoid certain topics? Is pornography totally left unscathed by these bots? How about extreme political opinions?

banq 3 days ago

I have blocked these ip from the country: 190.0.0.0/8 207.248.0.0/16 177.0.0.0/8 200.0.0.0/8 201.0.0.0/8 145.0.0.0/8 168.0.0.0/8 187.0.0.0/8 186.0.0.0/8 45.0.0.0/8 131.0.0.0/16 191.0.0.0/8 160.238.0.0/16 179.0.0.0/8 186.192.0.0/10 187.0.0.0/8 189.0.0.0/8

zzo38computer 3 days ago

My issue is not to prevent others from obtaining copies of the files, using Lynx or curl, disabling JavaScripts and CSS and pictures, etc. It is to prevent others from overloading the server due to badly behaved software.

I had briefly set up port knocking for the HTTP server (and only for HTTP; other protocols are accessible without port knocking), but due to a kernel panic I removed it and now the HTTP server is not accessible. (I may later put it back on once I can fix this problem.)

As far as I can tell, the LLM scrapers do not attempt to be "smart" about it at this time; if they do in future, you might try to take advantage of that somehow.

However, even if they don't, there are probably things that can be done. For example, check whether the declared user-agent claims things that the client isn't actually doing, and display an error message if so (users who use Lynx will then remain unaffected and will still be able to access it). Another possibility is to try to confuse the scrapers however they are working, e.g. invalid redirects, valid redirects (e.g. to internal API functions of the companies that made them), invalid UTF-8, invalid compressed data, ZIP bombs (you can use the compression functions of HTTP to serve a small file that is too big when decompressed), EICAR test files, reverse pings (if you know who they really are), etc. What will work and what doesn't work depends on what software they are using.
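
For instance, the HTTP-compression flavour of the ZIP bomb mentioned above takes very little code (a sketch; the size and the trap path are arbitrary):

  import gzip
  from flask import Flask, Response, request

  app = Flask(__name__)

  # ~100 MiB of zeros shrinks to roughly 100 KiB of gzip; scale to taste.
  BOMB = gzip.compress(b"\0" * (100 * 1024 * 1024), compresslevel=9)

  @app.route("/trap")
  def trap():
      if "gzip" not in request.headers.get("Accept-Encoding", ""):
          return "nothing to see here"
      # The client advertises gzip support, so let it decompress the whole thing.
      return Response(BOMB, headers={"Content-Encoding": "gzip", "Content-Type": "text/html"})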

myzie 3 days ago

An aspect I find interesting is that these crawlers are all doing highly redundant work. As in, thousands of crawlers are running around the world, and each crawler may visit the same site and pages multiple times a week.

This seems like an opportunity for a company like Firecrawl, ScrapingBee, etc to offer built-in caching with TTLs so that redundant requests can hit the cache and not contribute to load on the actual site.

Even if each company that operates a crawler cached pages across multiple runs, I'd expect a large improvement in the situation.

For more dynamic pages, this obviously doesn't help. But a lot of the web's content is more static and is being crawled thousands of times.

I built something for my own company that crawls using Playwright and caches in S3/Postgres with a TTL for this purpose.

Does this make sense to anyone else? I'm not sure if I'm missing something that makes this harder than it seems on the surface. (Actual question!)
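
The cache layer itself isn't much code; a stripped-down sketch of what I mean, with SQLite standing in for the S3/Postgres store:

  import sqlite3
  import time

  TTL = 24 * 3600  # re-fetch any given URL at most once a day

  db = sqlite3.connect("crawl_cache.db")
  db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT, fetched REAL)")

  def get(url: str, fetch) -> str:
      row = db.execute("SELECT body, fetched FROM pages WHERE url = ?", (url,)).fetchone()
      if row and time.time() - row[1] < TTL:
          return row[0]                # cache hit: the origin site never sees this request
      body = fetch(url)                # e.g. a Playwright-backed fetcher
      db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, body, time.time()))
      db.commit()
      return body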

  • jaccola 3 days ago

    I have considered this before, but then if the content can be cached why wouldn't the website just do this themselves?

    They have the incentive, it is relatively easy, and I don't think there's a huge benefit to centralisation (especially since it will basically be centralised to one of the big providers of caching anyway).

    • myzie 3 days ago

      I'm definitely with you that sites should be leveraging CDNs and similar. But I get that many don't want to do any work to support bots that they don't want to exist in the first place.

      To me it seems like the companies actually doing the crawling have an incentive to leverage centralized caching. It makes their own crawling faster (since hitting the cache is much faster than using Playwright etc to load the page) and it reduces the impact on all these sites. Which would then also decrease the impact of this whole bot situation overall.

    • brookst 3 days ago

      It would shift the complexity and cost of large scale caching to a provider that would sell to the scrapers. Not sure it has much value, but it’s kind of a classic three tier distribution system with a middleman to make life easier for both producer and consumer.

  • xena 3 days ago

    What does the user agent look like if you wanted to crawl xeiaso.net?

edoloughlin 3 days ago

I'm being trite, but if you can detect an AI bot, why not just serve them random data? At least they'll be sharing some of the pain they inflict.

  • nosianu 3 days ago

    You mean like this?

    [2025-03-19] https://blog.cloudflare.com/ai-labyrinth/

    > Trapping misbehaving bots in an AI Labyrinth

    > Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives.

    • barbazoo 3 days ago

      What a colossal waste of energy

    • fc417fc802 3 days ago

      > No real human would go four links deep into a maze of AI-generated nonsense.

      ... I would. Out of curiosity and amusement I would most definitely do that. Not every time, and not many times, but I would definitely do that one or a few times.

      Guess I'm getting added to (yet another) Cloudflare naughty list.

      > It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.

      In that case wouldn't it be faster and easier to restyle the CSS of wikipedia pages?

    • mbesto 3 days ago

      Wait, what happens when a Cloudflare Worker AI meets an AI Labyrinth?!

      • ronsor 3 days ago

        Cloudflare deletes itself.

  • noirscape 3 days ago

    Bandwidth isn't free, not at the volume these crawlers scrape at; serving them random data (for example by leading them down an endless tarpit of links that no human would end up visiting) would still incur bandwidth fees.

    Also it's not identifiable AI bot traffic that's detected (they mask themselves as regular browsers and hop between domestic IP addresses when blocked), it's just really obviously AI scraper traffic in aggregate: other mass crawlers have no benefit from bringing down their host sites, except for AI.

    A search engine gains nothing if it brings down the site it's scraping (and has everything to gain from identifying itself as a search engine to try to get favorable request speeds; the only thing it needs to check is whether the site in question is serving it different data, and that's much cheaper). The same goes for an archive scraper, and those two are pretty much the main examples I can think of for most scraping traffic.

    • BlarfMcFlarf 3 days ago

      Hmm, maybe you could zipbomb the data? Aka, you send a few kilobytes of compressed data that expands to many gigabytes on client side?

    • miohtama 3 days ago

      For Cloudflare, bandwidth is practically free.

    • cyanydeez 3 days ago

      Aren't a lot of these bots now actively loading JavaScript? You could just load a simple script that does the job.

    • charcircuit 3 days ago

      >Bandwidth isn't free

      Via peering agreements it is.

      • rcxdude 3 days ago

        Not something available to smaller sites

        • charcircuit 3 days ago

          Yes, it is. They transitively get it via the agreements the smaller site's host's host makes. Or via services like Cloudflare.

      • xena 3 days ago

        What button do I click in the AWS panel for that?

        • charcircuit 3 days ago

          There is no button. AWS is where you go to light money on fire.

  • xena 3 days ago

    You can detect the patterns in aggregate. You can't detect it easily at an individual request level.

    • bluGill 3 days ago

      In short if you get several million requests and expect to only get 100 you won't know which are the real requests and which are the AI ones - but it is obvious that the vast majority are AI.

  • jmpeax 3 days ago

    You skipped the last section "Tarpits and labyrinths: The growing resistance" of the article.

  • DecentShoes 3 days ago

    Random data? Why not "recipes" that just say "Bezos is a pedo" over and over ?

ggm 3 days ago

Entire country blocks are lazy, and pragmatic. The US armed forces at one point blocked AU/NZ on 202/8 and 203/8 over a misunderstanding about packets from China, which also come from those blocks. Not so useful for military staff seconded into the region seeking to use the public internet to get back to base.

People need to find better methods. And, crawlers need to pay a stupidity tax or be regulated (dirty word in the tech sector)

  • noirscape 3 days ago

    They can absolutely work if you aren't expecting any traffic from those countries whatsoever.

    I don't expect any international calls... ever, so I block international calling numbers on my phone (since they are always spam calls) and it cuts down on the overwhelming majority of them. Don't see why that couldn't apply to websites either.

    • koito17 3 days ago

      Although it's a very lazy practice, this is exactly how many Japanese sites (and internet services) fight against bad actors. In short, they block non-Japanese traffic and data center IPs. I expect these measures to become insufficient as consumers adopt IoT devices and provide ample amounts of residential IPs for botnets.

      As for phone numbers, businesses and individuals employ a similar strategy. Most "legitimate" phone numbers begin with 060 or 070. Due to lack of supply, telcos are gradually rolling out 080 numbers. 080 numbers currently have a bad reputation because they look unfamiliar to the majority of Japanese. Similarly, VoIP numbers all begin with 050, and many services refuse such numbers. Most people instinctively refuse to answer any call that is not from a 060 or 070 number.

    • alabastervlog 3 days ago

      Country or region blocks based on IPs used to (c. ~2000) be pretty standard. Blackhole the blocks associated with China, Russia, and maybe Africa, and your failed-login logs drop from scrolling so fast you can't read them, to a handful of lines per minute. Almost all the traffic was from those blocks, and was malicious. Meanwhile, for many sites, your odds (especially back then) of getting legitimate traffic from, say, China, was nearly zero, so the cost of blocking them was effectively nothing.

      Cloudflare is basically still just this, but with more steps.

    • ggm 3 days ago

      Sure. Absolutely works. Right up until it doesn't. I think the MIL was the wrong people to assume "we will never need packets from these network blocks"

      The other thing is that phone numbers follow a numbering scheme where +1 is North America and +64 is NZ. It's easy to know the long-term geographic consequence of your block, modulo faked-out CLID. IP packets don't follow this logic, and Amazon can deploy AWS nodes with IPs acquired in Asia in any DC they like. The smaller hosting companies don't promise that the IP ranges they route for banks have no pornographers on them.

      It's really not sensible to use IP blocks except for very specific cases like yours. "I never terminate international calls" is the NAT of firewalls: "I don't want incoming packets from strangers." Sure, the cheapest path is to block entire swathes of IPv4 and IPv6, but if you are in general service delivery, that rarely works. If you ran a business doing trade in China, you'd remove that block immediately.

    • kragen 3 days ago

      It depends on whether the information on the website is supposed to be publicly available or not. "This information is publicly available except to people from Israel" sends a really terrible message.

      • Retric 3 days ago

        It sends a great message to crack down on these companies, as long as you mention why it’s blocked.

        • kragen 3 days ago

          "You're cut off from access to knowledge because you live in the same country as AI researchers"?

          • Retric 3 days ago

            A DoS attack is a DoS attack even if someone is pretending to be a “Researcher.”

            People in Iran, Russia, etc. get annoyed with sanctions, but that's kind of the point. If your government isn't responding appropriately, yes, you'll get shafted; it's what you do after that which solves the problem.

            • kragen 3 days ago

              Me, I prefer to relate to people as individuals rather than, as you are advocating, interchangeable representatives of their area of residence. If what you want is World War III, this is how you get it.

              In particular, universal access to knowledge is a fundamental principle of liberalism.

              • Retric 2 days ago

                I’d personally love for someone to hand me a billion dollars no strings attached.

                That's got nothing to do with solving the issues created by these people, but if you're going to toss out meaningless non sequiturs then I figure I might as well join in on the fun.

          • cyanydeez 3 days ago

            "researchers" is like ignoring the whole Capitalists. We don't call them script researchers. We call them script kiddies.

              There's the whole other side of these AI researchers, and that's just slop artisans.

nashashmi 3 days ago

And this is why the Internet has become a maze of CAPTCHAs.

  • fsckboy 3 days ago

    yeah but the HN community would be ground zero for "automating scraping tasks"

    "We have met the enemy and he is us." -- Walt Kelly

time4tea 3 days ago

Yeah, my small site got 170k requests from one bot in a few minutes. Of course it was rate limited, but it didn't seem to understand 429 or 444 (drop connection), so it kept coming back for more. I do have an ipset drop too, but at the moment it takes a person to enable it... just so much effort to stop these **s. Exhausting!

rambambram 3 days ago

Make AI 'pay' by delaying every request and presenting the content with a warning on top, like: AI bots are (sc)raping the internet, that's why we do this.

Or something like: AI is making your experience worse, complain here (link to OpenAI).

Maybe not the most technical solution, but this at least gets the signal across to regular human beings who want to browse a site. Puts all this AI bs in a bad spotlight.

zkmon 3 days ago

An open-source repo asking who is responsible for this AI invasion? Well, it is you who are responsible for all this. What did you think when you helped tech advance so rapidly, outpacing the needs of humans? Read the Panchatantra story of the four brothers who brought a dead tiger back to life, just to boast of their skill and greatness.

boyter 3 days ago

Crawling, incidentally, I think is the biggest issue with making a new search engine these days. Websites flat out refuse to support any crawler other than Google's, and Cloudflare and other protection services and CDNs flat out deny access to anyone but the incumbents. It is not a level playing field.

I wrote the above some time ago. I think it's even more true today. It's practically impossible to crawl the way the bigger players do, and with the increased focus on legislation in this area it's going to lock out smaller teams even faster.

The old web is dead really. There really needs to be a move to more independent websites. Thankfully we are starting to see more of this like the linked searchmysite discussed earlier today https://news.ycombinator.com/item?id=43467541

  • zlagen 3 days ago

    That's a good point: in search we have Google as a monopoly, and since a big percentage of sites only want to be crawled by them, it reinforces the monopoly. So a lot of people complain about bots not following robots.txt, but if you follow it to the letter it's impossible to make anything useful. Also, AFAIK robots.txt doesn't have any legal standing.

ANarrativeApe 3 days ago

Excuse my ignorance, but is it time to update the open source licenses in the light of this behavior? If so, what should the evolved license wording be?

I appreciate that this could be easily circumvented by a 'bad actor', but it would make this abuse overt...

  • johnnyanmac 3 days ago

    From my little understanding, we have a sort of agreement in place with an item called robots.txt that's more or less a handshake with such scrapers. Of course, the issue is these scrapers are blatantly ignoring robots.txt.

    A license can help as well, but what's a license without enforcement? These companies are simply treating the courts as a cost to do business.

    • CaptainFever 3 days ago

      Close, robots.txt was originally for web crawlers, to reduce accidental denial-of-service attacks. It had nothing to do with the scraping (i.e. downloading content and parsing the HTML tags in a programmatic manner).

      • WesolyKubeczek 3 days ago

        What do you think a search engine's crawler bot is doing exactly? I could sure be wrong, but I have a hunch that "downloading content and parsing the HTML tags in a programmatic manner" describes it.

        • CaptainFever 3 days ago

          Yes, but the difference is that the term "scraping" also targets things like automatically generating RSS feeds from HTML pages, which is not covered by robots.txt.

          • WesolyKubeczek 3 days ago

            I thought robots.txt covered all automated, programmatic access by third parties where a bot slurps stuff and follows links, without splitting hairs about it.

            But what do I know, the young whippersnappers will just word lawyer me to death, so I better shut up and go away.

grotorea 3 days ago

Is this stuff only affecting the not for profit web? What are the for profit sites doing? I haven't seen Anubis around the web elsewhere. Are we just going to get more and tighter login walls and send everything into the deep web?

  • burkaman 3 days ago

    For profit sites are making deals directly with the AI companies so they can get some more of that profit.

  • _xtrimsky 2 days ago

    I think what they mean is that most not-for-profit small sites don't have expensive hardware or DDoS-blocking mechanisms. A small 256 MB RAM VPS might be enough for 1,000 users per month of traffic, but not enough for 200,000 users a day.

  • surfingdino 3 days ago

    I think we killed the old web. We'll see new ways of communicating, publishing, and gathering over the internet. It's sad, but it's also exciting.

    • epolanski 3 days ago

      Literally nothing in this data driven world (sports, technology, entertainment, everything is data driven and maximized) is exciting, nothing.

kh_hk 3 days ago

Proof of work is sufficient (although easy to bypass on targeted crawls) for protecting endpoints that are accessed via browsers, but plain public APIs have to resort to other more primitive methods like rate limiting.

Blocking by UA is stupid, and by country kind of wrong. I am currently exploring JA4 fingerprints, which, together with other metrics (country, ASN, block list), might give me a good tool to stop malicious usage.

My point is, this is a lot of work, and it takes time off the budget you give to side projects.

eevilspock 3 days ago

The irony is that the people and employees of the AI companies will vehemently defend the morality of capitalism, private property and free markets.

Their robber baron behavior reveals their true values and the reality of capitalism.

  • johnnyanmac 3 days ago

    Insert the Sinclair quote here. Anything to drive up the stock, no matter how immoral or illegal.

  • randmeerkat 3 days ago

    > Their robber baron behavior reveals their true values and the reality of capitalism.

    This is rather reductionist… By your same logic I could say that Stalin and Mao revealed the true values and reality of communism.

    Let’s not elaborate on it further though and just leave this as a simple argument. Free market capitalism has led us to the most prosperous, peaceful, and advanced society humanity has ever ventured to create. Communism threatened that prosperity and peace with atrocities on a scale that exists beyond human comprehension. Capitalism, even with all of its faults, is the obvious choice.

    • johnnyanmac 3 days ago

      It's rather a strawman to bring up communism in a conversation that said nothing about it, only that capitalism is clearly flawed.

      Capitalism without law ends up in the same kind of authoritarianism as communism without law. Some rich guy ends up telling everyone what to do as a ruler, with loose rules that no longer resemble the economic model. That's what people complain about when they bring up terms like "late stage capitalism".

    • CyberDildonics 3 days ago

      Internet comments are inherently reductionist.

kazinator 3 days ago

I've been seeing crawlers which report an Agent string like this:

  Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3405.80 Safari/537.36

Everything but the Chrome/ is the same. They come from different IP addresses and make two hit-and-run requests. The different IPs always use a different Chrome string: always some two-digit main version like 69 or 70, then a .0., and then some funny minor and build numbers, typically a four-digit minor.

When I was hit with a lot of these a couple of weeks ago, I put in a custom rewrite rule to redirect them to the honeypot.

The attack quickly abated.
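
The heuristic is easy to express; roughly what my rewrite rule matches, transcribed into Python (the version cutoff is just what I happened to pick):

  import re

  # Claims of an ancient two-digit Chrome major with a four-digit minor, e.g. Chrome/70.0.3405.80
  SUSPICIOUS_CHROME = re.compile(r"Chrome/(\d{2})\.0\.\d{4}\.\d+")

  def is_suspicious(user_agent: str) -> bool:
      m = SUSPICIOUS_CHROME.search(user_agent)
      return bool(m) and int(m.group(1)) < 90  # cutoff is arbitrary; real Chrome is far past this

  # Matching requests get rewritten to the honeypot instead of the real page.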

j45 3 days ago

Surprised at how crawling as a whole seems to have taken a sustained step backwards, with old best practices that were long since solved being new to devs today.

I can't help but wonder if big AI crawlers belonging to the LLMs wouldn't be doing some amount of local caching with Squid or something.

Maybe it's beneficial somehow to let the websites tarpit them or slow down requests to use more tokens.

  • cyanydeez 3 days ago

    I'm guessing it's the rise of the AI agents that just go out and download "research", not necessarily the crawlers building LLMs.

haswell 3 days ago

Lately I’ve been thinking a lot about what it would take to create a sort of “friends and family” Internet using some combination of Headscale/Tailscale. I want to feel free and open about what I’m publishing again, and the modern internet is making that increasingly difficult.

  • DrFalkyn 3 days ago

    Chances are one of your friends/family will eventually be compromised in some way.

    There's always VPNs, though you can only be on one at a time per device.

jrvarela56 3 days ago

This is going to start happening to brick and mortar businesses through their customer support channels

  • imtringued 3 days ago

    Kitboga already deployed a counter-scam LLM that calls scammers to waste their time. The goal is to keep them on the line as long as possible.

  • sean_lynch 3 days ago

    I’d love to hear more. What are they seeing?

    • jrvarela56 3 days ago

      Unfounded speculation on my part: I'm assuming automated/voice-enabled assistants will spam them for opening times, prices & descriptions of products/services, scheduling/booking, etc.

      In the long run it'll be an arms race but the transition will be rough for businesses as consumers can adopt these tools faster than SMBs or enterprises can integrate them.

  • cdolan 3 days ago

    And enterprise call centers

CyberDildonics 3 days ago

Can't you just rate limit beyond what a person would ever notice and do it by slowing the response?

  • ordersofmag 3 days ago

    Not if they hop to a different IP address every few requests. And they generally aren't bothered by slow responses. It's not like they have to wait for one request to finish before they make another one (especially if they are making requests from thousands of machines).

    • CyberDildonics 3 days ago

      You're saying that large companies are hitting individual websites with thousands of unrelated IP addresses?

      • Vespasian 3 days ago

        Yep we've been seeing that on our random small scale site that used to be open (and mostly relevant to a very limited number of people).

        It was nice for interested guests to get an impression of what we are doing.

        First the AI crawlers came in from foreign countries that could be blocked.

        Then they beat down the small server by being very distributed, calling from thousands of IPs with one or two requests each.

        We finally put a stop to it by requiring a login with a message informing people to physically show up to gain access.

        Worked fine for over 15 years but AI finally killed it.

      • jaggederest 3 days ago

        How do you think they do crawling if not like that? They'd be IP banned instantly if they used any kind of predictable IP regime for more than a few minutes.

        • CyberDildonics 3 days ago

          I don't know what is actually happening, that's why I'm asking.

          Also you're implying that the only way to crawl is to essentially DDoS a website by blasting it from thousands of IP addresses. There is no reason crawlers can't do more sites in parallel and avoid hitting individual sites so hard. There have been plenty of crawlers over the last few decades that don't cause problems; these are just stories about the ones that do.

  • xena 3 days ago

    You'd think, but no :(

aorth 3 days ago

I think it's strange to focus on "FOSS sites" in this article. I have regular corporate sites that are getting slammed by ChatGPT, Perplexity, ByteDance, and others too.

miyuru 3 days ago

Isn't this problem solved by using Common Crawl data? I wonder what changed to make AI companies do mass crawling individually.

https://commoncrawl.org/

JCharante 3 days ago

I believe we should have microtransactions to access resources. Pay a server a tiny amount and it'll return the content. This way if crawlers dominate traffic it just means they're paying a bunch for it

101008 3 days ago

I have a blog about a non-tech topic and I haven't had any problems. I am all against AI scrapers, but I didn't notice any change being behind Cloudflare (funny enough, if I ask GPT and Claude about my website, they know it).

  • WesolyKubeczek 3 days ago

    They could hit Cloudflare's caches and run with that. The problems start if your site is dynamic and your CDN has to hit the origin every time.

hbcondo714 3 days ago

> many AI companies engage in web crawling

Individuals do too. Tools like https://github.com/unclecode/crawl4ai make it simple to obtain public content but also paywalled content including ebooks, forums and more. I doubt these folks are trying to do a DDoS though.

  • CyberDildonics 3 days ago

    How do they manage to get 'paywalled' content?

    • hbcondo714 3 days ago

      Maybe 'paywalled' is not the best word but using their Identity Based Crawling feature with Managed Browsers[1], you can use an existing account and scrape content that requires authentication. This may not sound like anything new but IMHO, crawl4ai's workflow is easy to follow.

      [1] https://docs.crawl4ai.com/advanced/identity-based-crawling

temp008 3 days ago

I wonder if a service exists where IPs known for crawling are reported and reputation of such IPs is tracked for others to use and ban / ratelimit by default

  • mrweasel 3 days ago

    The problem with that approach is that you'll quickly add large swaths of IPs belonging to cloud service providers, such as AWS. We already know that AWS, Azure, GCP and Alibaba are part of the problem, so we can technically just rate-limit them already. I believe that all of them publish their IP ranges.

    Google also publishes the IP ranges for GoogleBot, I believe, and Bing probably does the same, so we could then whitelist those IPs and still have sites appear in searches.

    My issue is that the burden is again placed on everyone else, not the people/companies who are causing the problem.

    It's crazy to me to think about how much needless capacity is built into the internet to deal with crawlers. The resource waste is just insane.
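
    A rough sketch of that approach, using the ranges AWS and Google publish; the URLs and JSON field names below are the ones documented at the time of writing, so verify them before relying on this:

      import json
      import urllib.request
      from ipaddress import ip_address, ip_network

      AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
      GOOGLEBOT_RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"

      def fetch_json(url):
          with urllib.request.urlopen(url, timeout=10) as resp:
              return json.load(resp)

      def load_aws_networks():
          # AWS lists IPv4 ranges under "prefixes" with an "ip_prefix" field.
          return [ip_network(p["ip_prefix"]) for p in fetch_json(AWS_RANGES_URL)["prefixes"]]

      def load_googlebot_networks():
          # Googlebot entries carry either an "ipv4Prefix" or an "ipv6Prefix" field.
          return [ip_network(p.get("ipv4Prefix") or p["ipv6Prefix"])
                  for p in fetch_json(GOOGLEBOT_RANGES_URL)["prefixes"]]

      def classify(ip_str, aws_nets, googlebot_nets):
          ip = ip_address(ip_str)
          if any(ip in net for net in googlebot_nets):
              return "whitelist"   # known search crawler
          if any(ip in net for net in aws_nets):
              return "rate-limit"  # traffic from a cloud provider
          return "normal"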

pdw 3 days ago

I wonder how long until we'll see DNS blocklists to blackhole IP addresses associated with scrapers. It seems like the logical evolution.

  • kh_hk 3 days ago

    There are IP blocklists, ranging from free to paid, from text files to APIs. Some of the sites offering IP blocklists also need to protect themselves from automated crawls. It all goes full circle.
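
    For reference, a minimal sketch of how a DNSBL-style lookup usually works; the zone name here is a placeholder, and a scraper-focused list would publish its own:

      import socket

      # Standard DNSBL convention: reverse the IPv4 octets, append the list's zone,
      # and resolve. Any A record means "listed"; NXDOMAIN means "not listed".
      def is_listed(ipv4: str, zone: str = "dnsbl.example.org") -> bool:
          reversed_ip = ".".join(reversed(ipv4.split(".")))
          try:
              socket.gethostbyname(f"{reversed_ip}.{zone}")  # e.g. 4.3.2.1.dnsbl.example.org
              return True
          except socket.gaierror:
              return False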

_DeadFred_ 3 days ago

Sounds like every request should require some crypto processing on the client side, fed back to the host as a quid pro quo.

navane 3 days ago

How do they know which square contains a bicycle?

instagib 3 days ago

Can we force the bots to mine cryptocurrency?

  • userbinator 3 days ago

    Anger someone enough and you'll get a real DDoS instead.

zlagen 3 days ago

In a perfect world we would have a very high-trust internet in which everyone follows the rules, checks robots.txt, rate limits, etc. But we don't live in such a world. Getting angry at these AI bots is useless. People should start considering what and how they host their data. If you're worried about bandwidth costs you have many alternatives to host your data for free or at very little cost, e.g. GitHub/GitLab/Cloudflare.

If you're worried about your data getting scraped and used then maybe you can consider putting it behind a login or doing some proof of work/soft captcha. Yeah, this isn't perfect but it will keep most dumb bots away.

Some people are hosting their sites like we're still in 1995 and times have changed.
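
For the proof-of-work suggestion above, a minimal hashcash-style sketch; the difficulty, encoding, and challenge format are arbitrary choices for illustration:

  import hashlib
  import secrets

  DIFFICULTY = 4  # leading zero hex digits required; tune to taste

  def make_challenge() -> str:
      # Server side: issue a random challenge with the page or API response.
      return secrets.token_hex(16)

  def solve(challenge: str) -> int:
      # Client side: brute-force a nonce whose hash meets the target.
      nonce = 0
      while True:
          digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
          if digest.startswith("0" * DIFFICULTY):
              return nonce
          nonce += 1

  def verify(challenge: str, nonce: int) -> bool:
      # Server side: cheap check before serving the real content.
      digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
      return digest.startswith("0" * DIFFICULTY)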

kittikitti 3 days ago

One of the main peddlers of residential proxies is from Israel; see BrightData.

throwaway81523 3 days ago

Saaay, what happened to that assistant US attorney who prosecuted Aaron Swartz for doing exactly this? Oh wait, Aaron didn't have billions in VC backing. That's the difference.

internet101010 3 days ago

I block all traffic except that which comes from the country of Cloudflare.

nektro a day ago

those crawlers and models are a scourge on the internet

superkuh 3 days ago

While the motivations may be AI related the cause of the problem is the first and original type of non-human person: corporations. Corporations are doing this, not human persons, not AI.

yieldcrv 3 days ago

nice, a free way to keep our IPFS pins alive

  • i5heu 3 days ago

    This is not how it works.

    You have to actively pin data for it to be distributed.

EVa5I7bHFq9mnYK 3 days ago

Why can't Cloudflare do it? They know in real time which IPs spew millions of scrape requests to various sites, so they can classify them as AI bots and allow site owners to block them.

egypturnash 3 days ago

This is 100% off-topic but:

Is it just me or does Ars keep on serving videos about the sound design of Callisto Protocol in the middle of everything? Why do they keep on promoting these videos about a game from 2022? They've been doing this for months now.

  • misnome 3 days ago

    I've also been getting this for 1+ years, and just assume that it's a symptom of my adblockers/Pi-hole working and screwing up something in their automatic ad targeting.

    • egypturnash 2 days ago

      I’m glad to know it’s not just me! It’s such a weird and specific failure state.

RamblingCTO 3 days ago

Sure, overly spamming websites is shitty behaviour. But blocking AI crawlers hurts you in the end. Guess what will replace SEO in the long run?

  • entropi 3 days ago

    >But blocking AI crawlers hurts you in the end. Guess what will replace SEO in the long run?

    Maybe. But even if that turns out to be true, what good is it for the source website? The "AI" will surely not share any money (or anything else that may help the source website) with the source anyways. Why would they, they already got the content and trained on it.

    • RamblingCTO 3 days ago

      What good is it? If "AI" doesn't know about you down the line, you won't be discovered. Be it in LLM weights or via crawling (perplexity, jina reader etc.), you won't get any organic traffic. It's not about sharing profits.

      • entropi 3 days ago

        Again, the "AI" doesn't care about the website. It doesn't even link to it in the vast majority of cases. Even if it did, the "AI" derives a lot of its business value from providing what the client requests while removing the need to visit potentially dozens of these pages. So the clients, in most cases, would not even click through (as they already got what they wanted).

  • xena 3 days ago

    So you're willing to pay my hosting bills?

    • RamblingCTO 3 days ago

      Re-read what I posted and stop projecting.

      • xena 3 days ago

        But if the AI crawlers are taking the website down and money buys more server time, are you willing to do your part and use money to make sure your training data sources are solvent until you can replace them?

        • RamblingCTO 3 days ago

          I am not training shit dude ... So stop projecting. And again: spamming is unfair, as I said.

    • CyberDildonics 3 days ago

      What a bizarre response. They are just saying that some of these crawlers are being used for search engines in one way or another.