greatgib 2 days ago

A single $5 vps should be able to handle easily tens of thousands of requests...

Serving simple thumbnails on top of that isn't much extra, either. So sad that the trend of "fullstack" engineers being just frontend js/ts devs took off, with thousands of companies having no clue at all about how to serve websites, run backends, or do server engineering...

  • bigiain 2 days ago

    It's 1999 or 2000, and "proper" web developers, who wrote Perl (as God intended) or possibly C (if they were contributors to the Apache project), started to notice the trend of Graphic Designers over-reaching from their place as html jockeys, and running whole dynamic websites using some abomination called PHP.

    History repeats itself...

    • XCSme 2 days ago

      I still use PHP. Is your point that everyone will happily use NextJS in 10 years? I doubt it.

      • Grimblewald 19 hours ago

        I really hope not, because I really hate how JS has fucked the internet. Just look at how shit the experience on www.reddit.com is vs old.reddit.com. Old uses a lot of JS now, which has made the experience a touch worse, but it still mostly serves a static HTML page. It loads quickly, renders quickly, and lets me do useful preference-based things on my end.

        I hate what JS has done to the internet, and I think it plays a heavy hand in the internet's enshittification.

      • threatofrain a day ago

        The proper comparison is with Laravel. How will Laravel fare vs Next in 10 years? Hard to say, they could both be equally legacy by then.

    • navs a day ago

      > started to notice the trend of Graphic Designers over-reaching from their place as html jockeys, and running whole dynamic websites using some abomination called PHP.

      Your point is they shouldn't?

      • bigiain 20 hours ago

        Nah, my point was that people who ignored the incumbent "wisdom" in the late 90's actually took over the web.

        As much as some "technical" people deride PHP and the sort of self-taught developers that were using it back then, WordPress pretty much is "the web" if you exclude Facebook and other global-scale centralised web platforms, and the bits of the non-FAANG-et-al-owned web that aren't WordPress are very likely to be PHP too. Hell, even Facebook might still count as a PHP site.

        In 30 years' time, it won't be the most elegant or pure language or framework choices that dominate, it'll be the languages/frameworks that people who don't care about elegance or purity end up using to get their ideas onto the internet. If I had to guess, it'll likely be LLM-written Python, deeply influenced by and full of idioms from publicly available 2018-2024 era open source Python code that the AI grifters hoovered up and trained their initial models on.

  • e____g 2 days ago

    > A single $5 vps should be able to handle easily tens of thousands of requests

    Sure, given enough time. Did you miss a denominator?

    • SkiFire13 2 days ago

      Nah, obviously they meant that the VPS will die after those thousands of requests and you will have to buy a new one /s

leerob 2 days ago

(I work at Vercel) While it's good our spend limits worked, it clearly was not obvious how to block or challenge AI crawlers¹ from our firewall (which it seems you manually found). We'll surface this better in the UI, and also have more bot protection features coming soon. Also glad our improved image optimization pricing² would have helped. Open to other feedback as well, thanks for sharing.

¹: https://vercel.com/templates/vercel-firewall/block-ai-bots-f...

²: https://vercel.com/changelog/faster-transformations-and-redu...

  • ilyabez 2 days ago

    Hi, I'm the author of the blog (though I didn't post it on HN).

    1) Our biggest issue right now is unidentified crawlers with user agents resembling regular users. We get hundreds of thousands of requests from those daily and I'm not sure how to block them on Vercel.

    I'd love them to be challenged. If a bot doesn't identify itself, we don't want to let it in.

    2) While we fixed the Image Optimization part and optimized caching, we're now struggling with ISR Write costs. We deploy often and the ISR cache is reset on each deploy.

    We are about to put Cloudflare in front of the site, so that we can set Cache-Control headers and cache SSR pages (rather than using ISR) independently.
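
    For context, a minimal sketch of what we mean (assuming the Pages Router; the route and cache values here are illustrative, not our exact config):

      // pages/episodes/[slug].tsx (illustrative path)
      import type { GetServerSideProps } from "next";

      export const getServerSideProps: GetServerSideProps = async ({ res }) => {
        // s-maxage lets a shared cache (e.g. Cloudflare) keep the SSR output;
        // stale-while-revalidate serves a stale copy while the page re-renders
        res.setHeader(
          "Cache-Control",
          "public, s-maxage=86400, stale-while-revalidate=3600"
        );
        return { props: {} };
      };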

    • leerob a day ago

      Not sure if you will see the reply, but please reach out to lee at vercel.com and I'm happy to help.

  • zamalek 2 days ago

    I'm sure what you can share is limited, as I'm guessing this is cat and mouse. That being said, is there anything you can share about your implementation?

    • leerob 2 days ago

      We’re working on a bot filtering system that blocks all non-browser traffic by default. Alongside that, we’re building a directory of verified bots, and you’ll be able to opt in to allow traffic only from those trusted sources. Hopefully shipping soon.

      • sroussey 2 days ago

        Verified bots? You mean the companies that got big reading your info, so now you know who they are, while not allowing any newcomers, so the people who were taking the data all this time get rewarded by having the competition killed off for them. lol.

        • Spivak 2 days ago

          You have it exactly right sans the reason to allow them in the first place. They're bots that provide reciprocal value to the site owner. Otherwise why even bother letting them through.

          It's wild how people don't get that facebook and googlebot get let through paywalls and such because they bring the site real, tangible revenue. If you want the same privileges you have to start by providing monetary value to the sites you index. Lead gen is hard and major search engines provide crazy value for next to nothing.

          • otherme123 2 days ago

            Do they? AI bots provide me with nothing (best case scenario) or serve my content in their pages without "read more" links, thus lowering my number of visitors.

            Search bots, and especially Google, provide my site a lot of value. They respect the robots.txt, I can see that about half my visits come from search, and they identify properly as bots. It's almost impossible to notice a search bot in the graphs.

            But AI bots suck. They don't even read the robots.txt, they hit the site as hard as it can hold, when they receive a 5xx, a 444 or a 426 they interpret it as "keep requesting hard until you get a 200", they can easily DoS or bankrupt a small site, and they use fake user agents. As the OP post shows, their activity can be clearly seen in the log graphs as huge spikes coming from a single client. OpenAI scanned 100% of one of my sites (more than 20,000 individual pages) in two days, causing intermittent DoS, while Google is at 80% of the sitemap.xml. And cherry on top, I still can't see a single visit in my logs that comes from their services.

            I think you might be confusing search bots with AI bots.

            • Spivak 2 days ago

              I think what I said is agreeing with you completely.

          • wredcoll 2 days ago

            Yes, but googlebot wasn't providing value when it first started. That's the issue here.

bhouston 2 days ago

The issue is that Vercel's Image API is ridiculously expensive and also not efficient.

I would recommend using Thumbor instead: https://thumbor.readthedocs.io/en/latest/. You could have ChatGPT write up a React image wrapper pretty quickly for this.
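
Something along these lines would do it, as an untested sketch (assumes an unsigned "unsafe" Thumbor endpoint at a placeholder host; use HMAC-signed URLs in production):

  // ThumborImage.tsx: tiny wrapper that builds Thumbor resize URLs
  const THUMBOR_BASE = "https://thumbor.example.com"; // placeholder host

  type Props = { src: string; width: number; height: number; alt: string };

  export function ThumborImage({ src, width, height, alt }: Props) {
    // "unsafe" skips URL signing; fine for a quick test, not for production
    const url = `${THUMBOR_BASE}/unsafe/${width}x${height}/smart/${encodeURIComponent(src)}`;
    return <img src={url} width={width} height={height} alt={alt} loading="lazy" />;
  }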

  • styfle a day ago

    The article explains that they were using the old Vercel price and that the new price is much cheaper.

    > On Feb 18, 2025, just a few days after we published this blog post, Vercel changed their image optimization pricing. With the new pricing we'd not have faced a huge bill.

  • qudat 2 days ago

    We use imgproxy at https://pico.sh

    Works great for us

    • omnimus 2 days ago

      Maybe at least link to the project https://imgproxy.net/ next time? Your comment is basically an ad for your product. I am sure many clicked the link expecting there to be some image resizing proxy solution…

gngoo 2 days ago

I once sat down to calculate the costs of my app if it ever went viral while hosted at Vercel. That has put me off hosting anything on Vercel ever, or even touching NextJS. It feels like total vendor lock-in once you have something running there, and you kind of end up paying them 10x more than if you had taken the extra time to deploy it yourself.

  • arkh 2 days ago

    > you kind of end up paying them 10x more than if you had taken the extra time to deploy it yourself

    The lengths to which many devs will go to avoid learning server management (or SQL).

    • einsteinx2 a day ago

      See also the entire job of “AWS Cloud Engineer” aka “I want to spend years learning how to manage proprietary infrastructure instead of just learning Linux server management” and the companies that hire them aka “we don’t have money to hire sysadmins to run servers, that’s crazy! Instead let’s pay the same salaries for a team of cloud engineers and be locked in to a single vendor paying 10x the price for infra!” It’s honestly mind boggling to me.

      • colonial a day ago

        Server management has gotten vastly easier over time as well, especially if you're just looking to host stuff "for fun."

        Even without fancy orchestration tools, it's very easy to put together a few containers on your dev machine (something like Caddy for easy TLS and routing + hand rolled images for your projects) and just shotgun them onto the cheapest server you can find. At that point the host is just a bootloader for Podman and can be made maximally idiot-proof (see Fedora CoreOS.)
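
        The routing layer for that kind of setup can be a Caddyfile of just a few lines, something like this (hostname and port are placeholders):

          # Caddy obtains and renews TLS certificates automatically
          example.com {
              encode gzip
              reverse_proxy app:3000
          }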

  • sharps_xp 2 days ago

    i also do the sit-down-and-calculate exercise. i always end up down a rabbit hole of how to make a viral site as cheaply as possible. it always ends up in the same place: redis, sqlite, SSE, on suspended fly machines, and a CDN.

jhgg 2 days ago

$5 to resize 1,000 images is ridiculously expensive.

At my last job we resized a very large amount of images every day, and did so for significantly cheaper (a fraction of a cent for a thousand images).

Am I missing something here?

  • jsheard 2 days ago

    It's the usual PaaS convenience tax: you end up paying an order of magnitude or so premium for the underlying bandwidth and compute. AIUI Vercel runs on AWS, so in their case it's a compound platform tax; AWS is expensive even before Vercel adds their own margin on top.

    • cachedthing0 2 days ago

      I would call it an ignorance tax; PaaS can be fine if you know what you are doing.

  • Banditoz 2 days ago

    Yeah, curious too.

    Can't the `convert` CLI tool resize images? Can that not be used here instead?
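
    Something like this, for instance (sizes and quality picked arbitrarily):

      # fit the cover into a 300px-wide box, keep aspect ratio, recompress
      convert cover.jpg -resize 300x -quality 80 cover-300.jpg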

    • giantrobot 2 days ago

      Whoa there Boomer, that doesn't sound like it uses enough npm packages! It also doesn't sound very web scale. /s

  • mvdtnz 2 days ago

    You're not missing anything. A generation of programmers has been raised to believe platforms like Vercel / Next.js are not only normal, but ideal.

  • BonoboIO 2 days ago

    Absolutely insane pricing. Maybe it works for small blogs, but didn't they think this through?

    Millions of episodes: of course they will be visited and the optimization will run.

    • ilyabez 2 days ago

      Hi, I'm the author of the blog (though I didn't post it on HN).

      The site was originally secondary to our business and was built by a contractor. We didn't pay much attention to it until we actually added the episode pages and the bots discovered them.

      I saw a lot of disparaging comments here. It's definitely our fault for not understanding the implications of what the code was doing. We didn't mention the contractor in the post, because we didn't want to throw them under the bus. The accountability is all ours.

ashishb 2 days ago

As someone who maintains a Music+Podcast app as a hobby project, I intentionally have no servers for it.

You don't need one. You can fetch RSS feeds directly on mobile devices; it is faster, less work to maintain, and has a smaller attack surface for rogue bots.

  • bn-l 2 days ago

    If you want to do something interesting with the feeds it would be harder.

    • ashishb 2 days ago

      > If you want to do something interesting with the feeds it would be harder.

      I am curious: What do you do with the feeds that can't be done in a client-side app? Aggregation across all users or a recommendation system is one thing, but even that can be done via the clients sending analytics data back to the servers.

      • ilyabez a day ago

        Hi, I'm the author of the blog (though I didn't post it on HN).

        If you want to have a cross-platform experience (mobile + web), you'd have to have a server component.

        We do transcript, chapter and summary extraction on the server (they are reused across customers), RSS fetching is optimized (so we don't hit the hosts from all the clients independently), our playlists are server-side (so they can be shared across platforms). As we build out the app, features like push notifications will require a server component too.

        I agree with you that a podcast app can be built entirely client-side, but that will be limiting for more advanced and/or expensive use cases (like using LLMs).

        • ashishb a day ago

          > We do transcript, chapter and summary extraction on the server

          Yeah, do this server side but let the clients request it after fetching RSS feeds directly.

          And you will both reduce bandwidth usage and increase the reliability of your app.

          There is no reason to transform images at all IMHO

VladVladikoff 2 days ago

Death by stupid microservices. Even at 1.5 million pages and the traffic they are talking about, this could easily be hosted on a fixed $80/month Linode.

  • KennyBlanken 2 days ago

    This isn't specific to microservices. I've seen two organizations with a lot of content have their websites brought to their knees because multiple AI crawlers were hitting them.

    One of them was pretending to be a very specific version of Microsoft Edge, coming from an Alibaba datacenter. Suuuuuuuuuuuuuuuuuure. Blocked its IP range and about ten minutes later a different subnet was hammering away again. I ended up just blocking based off the first two octets; the client didn't care, none of their visitors are from China.

    All of this was sailing right through Cloudflare.

    • VladVladikoff 2 days ago

      I’ve dealt with AI crawlers. I’ve even seen 8 different AI crawlers at once. And yes some have been very aggressive, and I have even blocked some who are particularly bad (ignoring robots.txt rules). But their traffic is a tiny fraction of what my infrastructure sees on a regular basis. A well optimized platform, with good caching, shouldn’t really struggle with a few crawlers.

    • afarah1 2 days ago

      Honest question, why is rate limiting insufficient?

      Can be done in two lines in nginx, which is not just a common web server but is also used as an API gateway or proxy.

      You can rate limit by IP pretty aggressively without affecting human traffic.
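
      Roughly this, as a sketch (the zone size, rate and burst values are just examples):

        # http context: one shared zone keyed by client IP, 5 requests/second
        limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

        server {
            location / {
                # allow short bursts, answer the rest with 429
                limit_req zone=perip burst=20 nodelay;
                limit_req_status 429;
                # ...proxy_pass / root as usual
            }
        }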

      • Aeolun 2 days ago

        One /24 of IPs hammering on your website at a rate-limited 2 rps is still a combined ~500 requests/s. I'm not sure many sites can sustain that.

        • evantbyrne 2 days ago

          For a public website? Well if you don't have thousands of pages, then the solution would be as simple as installing Varnish, which is good practice anyways. If you actually have enough unique paths for an unauthenticated botnet to saturate, well that's a bit more complicated.

          • Aeolun 2 days ago

            Many sites hosted on Vercel, I suppose. If sites are hosted on nginx/varnish I’d be surprised if they didn’t do an order of magnitude more.

            • evantbyrne a day ago

              Yeah the playbook for serverless is to target developers that don't know anything about infrastructure, lock them in with proprietary APIs, and then hit them with a huge bill once they have any real traffic.

        • ndriscoll a day ago

          If you're using nginx as a proxy like the above commenter suggested and you're serving static/cached pages (which should be possible for most public pages), it can do over 10k RPS even on my N100 mini PC (the limit there is actually the 1 Gbit NIC).
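
          Micro-caching is most of the trick; a sketch (paths, times and the upstream name are placeholders):

            # cache proxied responses so repeated bot hits never reach the app
            proxy_cache_path /var/cache/nginx keys_zone=pages:50m max_size=1g inactive=10m;

            server {
                location / {
                    proxy_cache pages;
                    proxy_cache_valid 200 5m;
                    proxy_cache_use_stale error timeout updating;
                    proxy_pass http://app_upstream;
                }
            }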

GodelNumbering 2 days ago

Wow, this is interesting. I launched my site like a week ago and only submitted it to Google. But all the crawlers (especially the SEO bots) mentioned in the article were heavily crawling it within a few days.

Interestingly, the OpenAI crawler visited over a thousand times, many of them as "ChatGPT-User/1.0", which is supposed to be used when a user searches from ChatGPT. Not a single referred visitor, though. Makes me wonder if it's at all beneficial to content publishers to allow bot crawls.

I ended up banning every SEO bot and a bunch of other bots in robots.txt.
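
Roughly like this (an illustrative subset, not my exact file):

  User-agent: AhrefsBot
  Disallow: /

  User-agent: SemrushBot
  Disallow: /

  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /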

  • marcusb 2 days ago

    I've seen a bunch of requests with forged ChatGPT-related user agent headers (at least, I believe many are forged - I don't think OpenAI uses Chinese residential IPs or Tencent cloud for their data crawling activities.)

    Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.

    • GodelNumbering 2 days ago

      I checked the IPs on those; they belonged to MSFT.

    • hansvm 2 days ago

      Does it suffice to load the content with JS or WASM to keep them out, or are they using some sort of emulated/headless browser?

      If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?
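
      Something like this is what I had in mind, as a rough sketch (hardly foolproof, and it also punishes real browsers without WebGPU):

        // only reveal the real content if a GPU-backed adapter shows up
        async function looksLikeRealBrowser(): Promise<boolean> {
          const gpu = (navigator as any).gpu; // WebGPU entry point
          if (!gpu) return false;             // many headless scrapers lack it
          const adapter = await gpu.requestAdapter();
          return adapter !== null;
        }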

      • marcusb 2 days ago

        I haven't tested that behavior, sorry.

        • hansvm 2 days ago

          No worries. I'll get around to it. I was just curious if you might've explored a bit. Thank you.

nullorempty 2 days ago

Yeah, AI crawlers - add that to my list of phobias. Though for a bootstrapped startup, why not look to cut all recurring expenses and just deploy ImageMagick, which I am sure will do the trick for less.

outloudvi 2 days ago

Vercel has a fairly generous free quota and a non-negligibly high pricing scheme beyond it - I think people still remember https://service-markup.vercel.app/ .

For the crawl problem, I want to wait and see whether robots.txt proves enough to stop GenAI bots from crawling, since I confidently believe these GenAI companies are too "well-behaved" to respect robots.txt.

  • otherme123 2 days ago

    This is my experience with AI bots. This is my robots.txt:

      User-agent: *
      Crawl-Delay: 20

    Clear enough. Google, Bing and others respect the limits, and while about half my traffic is bots, they never DoS the site.

    When a very well-known AI bot crawled my site in August, they fired up everything: fail2ban put them temporarily in jail multiple times, the nginx request limit per IP was serving 426 and 444 to more than half their requests (but they kept hammering the same URLs), and some human users contacted me complaining about the site going 503. I had to block the bot IPs at the firewall. They ignore (if they even read) the robots.txt.

  • dvrj101 2 days ago

    Nope, they have been ignoring robots.txt since the start. There are multiple posts about it all over the internet.

randunel 2 days ago

> Optimizing an image meant that Next.js downloaded the image from one of those hosts to Vercel first, optimized it, then served to the users.

So Metacast generates bot traffic on other websites, presumably to "borrow" their content and serve it to their own users, but they don't like it when others do the same to them.

  • ilyabez 2 days ago

    Hi, I'm the author of the blog (though I didn't post it on HN).

    I'd encourage you to read up on how the podcast ecosystem works.

    Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.

    All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.

    This is how podcasts have been distributed since they were introduced by Apple in the early 2000s. This is why podcasting still remains an open, decentralized ecosystem.

    • randunel 2 days ago

      Do you or do you not visit and respect "robots.txt" on the hosts you've mentioned in your blog post as downloading via next.js?

mediumsmart 2 days ago

Don’t feed the bots. Why a pixel image at all? Take an SVG and make it pulse while playing.

CharlieDigital 2 days ago

Is there no CDN? This feels like it's a non-issue if there's a CDN.

  • ilyabez a day ago

    Hi, I'm the author of the blog (though I didn't post it on HN).

    We're going to put Cloudflare in front of our Vercel site and control cache for SSR pages with Cache-Control headers.

    • CharlieDigital a day ago

      I'm kind of surprised that Next.js -- being known for SSR and SSG -- isn't offered with a CDN solution OOB on Vercel; seems like a no-brainer.

      Last startup, we ran Astro.js behind CloudFront and we were able to serve a pretty large volume of public-facing traffic from just 2 server nodes with 3-tiered caching (Redis for data caching, application-level output caching, and CloudFront, with CF doing a lot of the heavy lifting).

ramesh31 2 days ago

The cost of getting locked into Vercel.

dylan604 2 days ago

I guess it goes to show how jaded I am, but as I was reading this, it felt like an ad for Vercel. I'm so sick of marketing content being submitted as actual content, that when I read a potentially actual blog/post-mortem, my spidey senses get all tingly about potential advertising. However, I feel like if I turn down the sensitivity knob, I'll be worse off than knee jerk thinking things like this are ads.

  • ilyabez a day ago

    Hi, I'm the author of the blog (though I didn't post it on HN).

    I can assure you it is not an ad for Vercel.

bitbasher 2 days ago

$5 for 1,000 image optimizations? Is Vercel not caching the optimization? Why would it be doing more than one per image on a fresh deploy?

cratermoon 2 days ago

"Step 3: robots.txt"

Will do nothing to mitigate the problem. As is well known, these bots don't respect it.

  • randunel 2 days ago

    Would you reckon OP's bot(s) respect it when borrowing content from the large variety (their words) of podcast sources they scrape?

    • ilyabez a day ago

      Hi, I'm the author of the blog (though I didn't post it on HN).

      I've addressed this topic in another comment above and will copy it here.

      I'd encourage you to read up on how the podcast ecosystem works.

      Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.

      All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.

      This is how podcasts have been distributed since they were introduced by Apple in the early 2000s. This is why podcasting still remains an open, decentralized ecosystem.

      • randunel a day ago

        Replace "podcasts" with "search results" in your comment, and "RSS feed" with "LLM output" and you've got yourself the exact same argument for what's going on today. The company names are different, of course, but not by much because some of the players stayed the same.

        Your lack of reply to "do you observe robots.txt when you download content such as images" is basically a "no".

      • cratermoon a day ago

        If they are well-coded, they don't constantly crawl. They use and pay attention to headers like ETag, If-Modified-Since and/or If-None-Match and support conditional requests.
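
        A polite fetcher is only a few lines; a sketch (error handling omitted):

          // reuse the ETag from the previous poll so an unchanged feed
          // comes back as a cheap 304 instead of a full download
          async function fetchFeed(url: string, etag?: string) {
            const res = await fetch(url, {
              headers: etag ? { "If-None-Match": etag } : {},
            });
            if (res.status === 304) return { changed: false, etag };
            return {
              changed: true,
              etag: res.headers.get("ETag") ?? undefined,
              body: await res.text(),
            };
          }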

        Badly behaving RSS readers on the other hand....

        https://rachelbythebay.com/w/2024/05/27/feed/

sergiotapia 2 days ago

Another story for https://serverlesshorrors.com/

It's crazy how these companies are really fleecing their customers who don't know any better. Is there even a way to tell Vercel: "I only want to spend $10 a month max on this project, CUT ME OFF if I go past it."? This is crazy.

I spend $12 a month on BunnyCDN. And $9 a month on BunnyCDN's image optimizer that allows me to add HTTP params to the url to modify images.

1.33TB of CDN traffic. (ps: can't say enough good things about bunnycdn, such a cool company, it does exactly what you pay for, nothing more, nothing less)

This is nuts dude

  • jsheard 2 days ago

    > Is there even a way to tell Vercel: "I only want to spend $10 a month max on this project, CUT ME OFF if I go past it."?

    Yes, actually. There's a lot to complain about with Vercel, but to their credit they do offer both soft and hard spending limits, unlike most other newfangled clouds.

    OTOH, god help you if you're on Netlify; there you're looking at $0.55/GB with unbounded billing...

  • leerob 2 days ago

    > Is there even a way to tell Vercel: "I only want to spend $10 a month max on this project, CUT ME OFF if I go past it."? This is crazy.

    (I work at Vercel). Yes, there are soft and hard spend limits. OP was using this feature, it's called "spend management": https://vercel.com/docs/spend-management

    • ilyabez a day ago

      Hi, I'm the author of the blog (though I didn't post it on HN).

      We sure do, though we were confused by the wording at first. We thought "stop deployment" meant that we wouldn't be able to deploy, so we had it turned off initially.

      @leerob helped us figure it out on the Vercel subreddit, then we turned it on.

  • sgarland 2 days ago

    +1 for BunnyCDN. It's fantastic.

andrethegiant 2 days ago

It’s a shame that the knee-jerk reaction has been to outright block these bots. I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

[disclaimer: I run https://pure.md, which helps websites shield from this traffic]

  • mtlynch 2 days ago

    >I think in the future, websites will learn to serve pure markdown to these bots instead of blocking. That way, websites prevent bandwidth overages like in the article, while still informing LLMs about the services their website provides.

    Why?

    There's no value to the website for a bot scraping all of their content and then reselling it with no credit or payment to the original author.

    • wongarsu 2 days ago

      Unless you're selling something. If you have articles praising your product/service/person and "comparison" articles of the "top 10 X in 2025" variety (where your offering happens to be number one), you want the bots to find you.

      The LLM SEO game has only just begun. Things will only go downhill from here.

    • randunel 2 days ago

      OP in this case is by no means the original author. In this linked post, they mentioned they scrape third parties themselves. OP's bots might not be as sophisticated, but they're still "borrowing" others' content the same way.

    • andrethegiant 2 days ago

      ChatGPT and others have some sort of attribution, where they link to the original webpage. How or when they decide to attribute is unclear. But websites are starting to pay attention to GEO (generative engine optimization) so that their brand isn’t entirely ignored by ChatGPT and others.

      • Incipient 2 days ago

        I do agree that LLM-as-search is likely going to become more and more prevalent as inference gets cheaper and faster, and people don't care too much about 'minor' hallucinations.

        What I don't see, however, is any way this new way of searching will give back. There is some handwaving argument about links, but the entire value prop of an LLM is that you DON'T need to go to the source content.

      • genewitch 2 days ago

        could have just left it as SEO and changed the S to "Slop"

  • pavel_lishin 2 days ago

    > I think in the future, websites will learn to serve pure markdown to these bots instead of blocking.

    What for? Why would I serve anything to these leeches?

    • randunel 2 days ago

      Because you (in this case OP) also generate bot traffic to "borrow" content from other websites to serve to your own users. Ironic, no?

      • pavel_lishin a day ago

        Ah. I didn't realize andrethegiant's website was meant to _serve content_ to the bots, instead of _shielding us_ from the bots.

        I fell for a classic "To Serve Man" situation.

  • RamblingCTO 2 days ago

    I think you're a bit late to the game ;) I built and sold 2markdown last year, which was then copied by firecrawl/mendable. And then you also have jina reader. Also "compare with" in the footer does nothing.

  • Swizec 2 days ago

    If only there were some way for websites to serve information and provide interactivity in a machine readable format. Like some sort of application programming interface. You could even return different formats based on some sort of 4-letter code at the end of a URL like .html, .json, .xml, etc.

    And what if there was some standard sort of way for robots to tell your site what they're trying to do with some sort of verb like GET, PUT, POST, DELETE etc. They could even use a standard way to name the resource they're trying to interact with. Like a universal resource finder of some kind. You could even use identifiers to be specific! Like /items/ gives you a list of items and /items/1.json gives you data about a specific item.

    That would be so awesome. The future is amazing.

    • marcusb 2 days ago

      The only thing that would make that even more perfect would be if there was some way for the site owner to signal to prospective bots which parts of the site are open to the bots to visit. I know this seems really complicated, but I really think it could be expressed in a simple text file.

      • tough 2 days ago

        i don't know, robots.txt sounds too complicated for 2025

        • thwarted 2 days ago

          I would have worded this as "it sounds too simple for 2025".

    • mubou 2 days ago

      Accept: and rel="alternate" were literally made for this

  • dmitrygr 2 days ago

    Until these bots become good citizens (eg: respecting robots.txt), I will be serving them gzipped gibberish that decompresses to terabytes.

    The ball is in their court. You don’t get to demand civility AFTER being a dick. You apologize and HOPE you’re forgiven.

    • randunel 2 days ago

      What do you reckon, does OP in this post respect robots.txt or do they "borrow" content in a similar manner, without respecting such standards?

    • AlienRobot 2 days ago

      I thought the AI wars would be fought with bombs vs. bots not with ZIP bombs vs. bots.

  • tough 2 days ago

    how would one serve them .txt instead?

    • andrethegiant 2 days ago

      Add a Cloudflare snippet / some other edge function, and transform the response to convert to plaintext
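
      As a rough sketch of that kind of edge function (the UA list and the HTML-to-text step are simplified placeholders):

        // serve a plaintext rendering to known crawler UAs, pass everyone else through
        const BOT_UA = /GPTBot|ClaudeBot|CCBot|Bytespider/i;

        export default {
          async fetch(request: Request): Promise<Response> {
            if (!BOT_UA.test(request.headers.get("User-Agent") ?? "")) {
              return fetch(request); // humans get the normal site
            }
            const page = await fetch(request);
            const html = await page.text();
            // crude tag stripping; a real version would emit proper markdown
            const text = html
              .replace(/<script[\s\S]*?<\/script>/gi, "")
              .replace(/<[^>]+>/g, " ");
            return new Response(text, {
              headers: { "content-type": "text/plain; charset=utf-8" },
            });
          },
        };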

  • happyzappy 2 days ago

    Cool globe graphic on that site :)

  • detaro 2 days ago

    Or, you know, AI crawlers could behave and get all that without any extra work for anybody. What makes you think they'll suddenly respect your scheme?

cachedthing0 2 days ago

"Together they sent 66.5k requests to our site within a single day."

Only script kiddies get into problems at such low numbers. I'm sure security is your next 'misconfiguration'. Better search for an offline job in the entertainment industry.

  • aledalgrande 2 days ago

    I know the language earned you the downvotes (please be kind), but the author of the article is ex-Google and ex-AWS; I too would expect some better infra in place (caching?) and certainly not Vercel.

    • ilyabez a day ago

      Hi, I'm the author of the blog (though I didn't post it on HN).

      Some context here.

      I was actually a PM at Google & AWS, not an engineer. Even though I have a CS degree, I had not been professionally coding for almost 20 years. This is my comeback to software development and I've got a lot to catch up on. Hope this sets the stage appropriately.

      I mentioned in an earlier comment that we didn't actually build the site and it was on the back burner until we added the episode pages and got hit by the costs. It's a lesson learned indeed and we're now treating the website as a first-class citizen in our stack.

      • aledalgrande 18 hours ago

        Good for you Ilya, best of luck with your company!