Video scraping: extracting JSON from a 35s screen capture for 1/10th of a cent

309 points by simonw 8 months ago

It's a cool example and I guess there are or will be very convenient apps that will stream the last X min of screen recording and offer help with what you see.

But it just hurts my programmer soul that it is somehow more effective to record an app, that first renders (semi-)structured text into pixels, then record those millions of pixels into a video, send that over the network to a cloud and run it through a neural network with billions of parameters, than it is to access the 1 kilobyte of text that's already loaded into the memory and process locally.

And yes there are workflows to do that as demonstrated by other comments, but it's a lot of effort that will be constantly thwarted by apps changing their data-structures or obfuscating whatever structure they have or just because software is so layered and complicated that it's hard to get to the data.

luke-stanley 8 months ago

I'm glad this worked for Simon, but I would probably prefer using a User Script that scrapes DOM text changes and streams them to a small local web server to append to a JSONL file that has the URL, text change and timestamp. Probably since I already have something doing this, it allows me to do a backup of things I'm looking at in real time, like streaming LLM generations, and it just relies on normal browser technology. I should probably share my code since it's quite useful. I'm a bit uncomfortable relying on a LLM to transcribe something where there is a stream of text that could be used in a robust way, and with real data, Vs well trained but indirect token magic. A middle ground might be to have grounded extraction and evidence chains, with timestamps, screenshots, cropped regions it's sourcing from, spelled out reasoning. There's the extraction / retrieval step and there's a kind of data normalisation. Of course, it's nice that he's got something that just works with two or three steps, it's good the technology is getting quite reliable and cheap a lot of the time, but still, we could do better.

simonw 8 months ago

I did something a bit like that recently to scrape tweets out of Twitter: https://til.simonwillison.net/twitter/collecting-replies
luke-stanley 8 months ago

I put my Userscript and Python server script here: https://gist.github.com/lukestanley/c3a37ab61a45e72b74995a5c... It tries to save as Markdown. I'm sure it could be much better in many ways. But it works well enough for me right now.
ranger_danger 8 months ago

The userscript idea is great, I could think of some uses for this, such as text-to-speech for live comments. Do you know of any examples of projects already doing this?
- ian_hn 8 months ago
  
  Things like this? https://greasyfork.org/en/scripts?q=speech

pridkett 8 months ago

Video scraping doesn’t need to be just screen captures. I’ve demoed a solution with Gemini where you take a video walking up and down aisles in a retail store and it captured 100% accurate data on product name, quantity/size, sku, and price for a little under 75% of the products. And that was back in January.

This has huge implications for everything from competitive pricing, to understanding store layouts, to creating your own grocery store inflation monitor. Just subtly take a video and process it.

And the models have only gotten better.

tgv 8 months ago

> This has huge implications for everything from competitive pricing, to understanding store layouts
Even smaller stores have been monitoring their competitors since a long time.
> your own grocery store inflation monitor
You could also check your itemized bill.
- jasonjayr 8 months ago
  
  > You could also check your itemized bill.
  But only for the things you buy.
  Note too, some big retail stores actually have a "license" or "contract" for customers hiding behind the service desk, and often, video recording is one of the things they forbid. It's not "illegal" to do so, but if they catch you, and insist you leave, and you refuse, now you're trespassing, and that has legal consequences.
  
  pests 8 months ago
  
  I've explored the mystery shopper apps a few times.
  A common task is to take photos of shelf's and the products / pricing.
  Its framed as "make sure our employees are doing it correct" but based on the strict image requirements (needed for later computer processing) I have the feeling it is actually a competitor trying to get shelf and price info. It feels a little cloak and dagger.
TechDebtDevin 8 months ago

I've done this as well with bookshelves at thrift stores that are completely unorganized. I don't want to lose my mind reading every title on every book binding, so I take a picture and ask the model to list all the books on the shelf then I can easily scan through the list to see if something catches my eye.
- pridkett 8 months ago
  
  This is an old thread and will probably be lost now, but this sort of tool has been industrialized now. At the library book sales in my area, you'll see a handful of people there right as they open, they'll have either a scanner device or an app on their phone. They just scan over the spines of the books and it beeps when it finds something interesting. It highlights the book and they pick it out.
  On the one hand, great application of technology. On the other hand, the folks using them have zero interest in reading the books. So, when the library happens to have a nearly brand new copy of "Harry Potter and the Order of the Phoenix: Illustrated Edition" there's almost no chance that someone in the community who wants to read the book will get it. Instead it gets snapped up by the resellers who make a couple of bucks off sending it to someone else who may be across the country.
thegabriele 8 months ago

Is there any possibility to view such demo? Or learning something more?
chamomeal 8 months ago

Holy shit I’ve been thinking about doing this for my local grocery stores, so I can stop wondering things like “who has lady fingers”.
I was wondering how well it would work. That’s such good news
- bippihippi1 8 months ago
  
  would be way easier if stores interface with search engines directly. majority of stores have inventory systems. just a matter of time until they intehrate. if you search for a product you can sometimes see places that have it nearby.
  
  cxr 8 months ago
  
  Often their own data is wrong.

anigbrowl 8 months ago

Being Google, isn't it highly likely that the price is a loss leader which will later be changed once customers are sufficiently locked in? I get that this is more convenient than doing it programatically or manually, but that seems like a reason to using something other than gmail. This approach just seems incredibly wasteful to me.

simonw 8 months ago

Pretty much everyone who prices out Gemini has the same question - these prices just seem WAY to cheap to be sustainable long term.
I've tried bouncing this off some Google employees and the general vibes I got back from them is that Google is very good at running stuff like this at a scale that drives down the cost for individual queries, so they seemed confident that these prices were not a loss leader strategy.
I don't know if I can believe that though. It's just SO cheap!
- kanwisher 8 months ago
  
  In past Google notoriously 10x prices once it’s no longer the main pet project and they can’t just lose cash on it

cloudking 8 months ago

"Accessing my Gmail data programatically" is pretty painless with Apps Script, you can pipe those emails into a Sheet. LLMs are pretty good at writing Apps Script too, I've used them for dozens of small internal tools like your use case https://developers.google.com/apps-script/reference/gmail

etewiah 8 months ago

You've got me thinking. Would this work for real estate data? A lot of sites make it quite hard to grab their raw data. Also, perhaps it could gain some insights from the photos...

TechDebtDevin 8 months ago

Been scraping real estate data off every major real estate site for a while. They practically give away their data, there's zero reason to introduce an added cost for llms.
Sure you could do this, and it would work, but you'd spend about 100000x what I do with a $10 Hetzner VPS and a small amount of proxy bandwidth.
bambax 8 months ago

It's crazy to think we live in a world where video to llm ocr is simpler (and cheaper?) than plain old html parsing. Maybe someone will rebuild the Twitter API like this?!?
simonw 8 months ago

I'm certain it would. That would be a really fun experiment to run!
jerpint 8 months ago

Could also work for social media which can be hard to scrape

dxxvi 8 months ago

It seems to me that I don't know how to use the Google AI Studio. There is this youtube video with music lines at the top the video. I want to extract those music lines and put them in images. This is what I did: download the video with yt-dlp, get all the points of time at which I want to get the music line, use ffmpeg to extract frames at those points of time, use imagemagick to crop the music lines in those frames. It works but the tedious part is that I have to get all the points of time when there's a new music line. Does anybody know how to ask AI to do that?

euroderf 8 months ago

Couldn't he have sent it as a fax and then photographed the fax ?

m-hodges 8 months ago

This is gonna push things towards some very unfortunate DRM.

toomuchtodo 8 months ago

Can't stop the webcam->LLM
- okwhateverdude 8 months ago
  
  Poison pixels and other cat-and-mouse things will definitely happen.

TacticalCoder 8 months ago

That is amazing! I tried something not dissimilar, but only with a single picture: I took an AI model (I think it was GPT 4o, don't remember as I'm using several) and asked it:

    "Analyze this screenshot and tells me if everything seems legit"

And it could the difference between a legit screenshot and a phishing attempt. Without me telling it which site name was legit or not.

And it was pretty detailed: "There's an 'l' in interactlvebrokers.ie" that is made to look like an 'i'". Or something like that.

We're already getting and we're going to get lots* of shiny new helper tools.

maeil 8 months ago

One of the huge challenges is preventing false positives. Did you also try giving it a legit screenshot?
- bambax 8 months ago
  
  False positives are one thing, but the real danger are false negatives when someone trusts the model blindly. I'd rather examine a website with my own eyes than ask an LLM what it thinks.

siruncledrew 8 months ago

Has anyone tried this with Llama 3.2 or any other “open source” options?

teruakohatu 8 months ago

I admit this is a pretty cool technique, but what is missing is how accurate the data extraction was. Without knowing that it is not possible to judge how useful this technique is.

kaveet 8 months ago

> You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers. It got everything right.
simonw 8 months ago

I watched the 35 second long video and confirmed by eyeballing the JSON that the result was exactly correct.
korkybuchek 8 months ago

He said in his tweet that he verified the results.

zahlman 8 months ago

>You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers.

>I could have clicked through the emails and copied out the data manually one at a time. This is error prone and kind of boring. For twelve emails it would have been OK, but for a hundred it would have been a real pain.

This seems contradictory to me.

simonw 8 months ago

Why's that?
Navigating to and then copying and pasting specific text out of twelve different emails (and then removing commas and dollar signs and reformatting dates as YYYY-MM-DD) is still a whole lot more work than watching a 35s video to check that it did the tedious data entry for you.
For the 100 email version I'd do more of a spot check, depending on how high the stakes were.

spenczar5 8 months ago

Suppose I give Gemini a 10 minute video. Will it spend 10 minutes “watching” it if I ask it to extract something? Or does it know how to speed up the video? I assume it must do some sort of preprocessing like extracting keyframes; it surely (?) can’t be looking at the raw encoded video bytes, after all.

simonw 8 months ago

It won't take 10 minutes, but it might still take a minute or two (for Pro) - though Flash and Flash 8B should be significantly faster.
It does process a version of the raw video but it can run that faster than the default video playback rate.
That is quite a bit of detail here: https://ai.google.dev/gemini-api/docs/vision?lang=python#pro...
"The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second."

danjc 8 months ago

I think this sort of thing is what Microsoft intended with Recall. The problem is the privacy implications are horrible.

simonw 8 months ago

Something I really like about this technique is that I stay in complete control of what I expose to the model. If I don't want something fed into the model I omit it from the screen recording.

Havoc 8 months ago

Still amazed that video is so "cheap" on tokens despite being way more bytes than text

odo1242 8 months ago

Pretty sure there's some strong preprocessing bring applied to that video though. Maybe even to the point of extracting text and deduplicating it between frames.