ripperdoc a day ago

It's a cool example, and I guess there are or will be very convenient apps that stream the last X minutes of a screen recording and offer help with what you see.

But it just hurts my programmer soul that it is somehow more effective to record an app that first renders (semi-)structured text into pixels, capture those millions of pixels into a video, send that over the network to a cloud, and run it through a neural network with billions of parameters, than it is to access the one kilobyte of text that's already loaded into memory and process it locally.

And yes, there are workflows to do that, as demonstrated by other comments, but it's a lot of effort that will be constantly thwarted by apps changing their data structures, obfuscating whatever structure they have, or simply being so layered and complicated that it's hard to get to the data.

luke-stanley 3 days ago

I'm glad this worked for Simon, but I would probably prefer using a user script that scrapes DOM text changes and streams them to a small local web server, which appends them to a JSONL file with the URL, text change, and timestamp. I already have something doing this; it lets me back up things I'm looking at in real time, like streaming LLM generations, and it relies only on normal browser technology. I should probably share my code since it's quite useful. I'm a bit uncomfortable relying on an LLM to transcribe something when there's a stream of text that could be captured robustly, with real data, versus well-trained but indirect token magic. A middle ground might be grounded extraction with evidence chains: timestamps, screenshots, the cropped regions it's sourcing from, spelled-out reasoning. There's the extraction/retrieval step and then a kind of data normalisation. Of course, it's nice that he's got something that just works in two or three steps, and it's good the technology is getting quite reliable and cheap a lot of the time, but still, we could do better.
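
For reference, the receiving end of a setup like that is tiny. A rough sketch of the idea (illustrative only, not my actual code; the port and filename are placeholders): a local server that appends posted {url, text, timestamp} events to a JSONL file, with the user script POSTing MutationObserver text changes to it:

    # append_server.py - tiny local endpoint that appends posted JSON events to a JSONL file
    import json, time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    LOG_PATH = "dom_changes.jsonl"   # placeholder filename

    class AppendHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            event = json.loads(self.rfile.read(length))
            record = {
                "url": event.get("url"),
                "text": event.get("text"),
                "timestamp": event.get("timestamp", time.time()),  # fall back to server time
            }
            with open(LOG_PATH, "a", encoding="utf-8") as f:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
            self.send_response(204)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8765), AppendHandler).serve_forever()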

pridkett 3 days ago

Video scraping doesn’t need to be just screen captures. I’ve demoed a solution with Gemini where you take a video while walking up and down the aisles of a retail store, and it captured 100% accurate data on product name, quantity/size, SKU, and price for a little under 75% of the products. And that was back in January.

This has huge implications for everything from competitive pricing, to understanding store layouts, to creating your own grocery store inflation monitor. Just subtly take a video and process it.

And the models have only gotten better.

  • tgv 2 days ago

    > This has huge implications for everything from competitive pricing, to understanding store layouts

    Even smaller stores have been monitoring their competitors for a long time.

    > your own grocery store inflation monitor

    You could also check your itemized bill.

    • jasonjayr a day ago

      > You could also check your itemized bill.

      But only for the things you buy.

      Note too that some big retail stores actually have a "license" or "contract" for customers tucked away behind the service desk, and video recording is often one of the things it forbids. Recording isn't "illegal", but if they catch you, insist you leave, and you refuse, you're now trespassing, and that has legal consequences.

      • pests a day ago

        I've explored the mystery shopper apps a few times.

        A common task is to take photos of shelves and the products/pricing.

        It's framed as "make sure our employees are doing it correctly", but based on the strict image requirements (needed for later computer processing), I have the feeling it is actually a competitor trying to get shelf and price info. It feels a little cloak-and-dagger.

  • TechDebtDevin a day ago

    I've done this as well with bookshelves at thrift stores that are completely unorganized. I don't want to lose my mind reading every title on every book binding, so I take a picture and ask the model to list all the books on the shelf, and then I can easily scan through the list to see if something catches my eye.

  • thegabriele a day ago

    Is there any possibility of viewing such a demo, or of learning something more?

  • chamomeal a day ago

    Holy shit I’ve been thinking about doing this for my local grocery stores, so I can stop wondering things like “who has lady fingers”.

    I was wondering how well it would work. That’s such good news

    • bippihippi1 a day ago

      It would be way easier if stores interfaced with search engines directly. The majority of stores have inventory systems; it's just a matter of time until they integrate. If you search for a product, you can sometimes see places that have it nearby.

      • cxr 15 hours ago

        Often their own data is wrong.

anigbrowl 2 days ago

Being Google, isn't it highly likely that the price is a loss leader which will later be raised once customers are sufficiently locked in? I get that this is more convenient than doing it programmatically or manually, but that seems like a reason to use something other than Gmail. This approach just seems incredibly wasteful to me.

  • simonw 2 days ago

    Pretty much everyone who prices out Gemini has the same question - these prices just seem WAY too cheap to be sustainable long term.

    I've tried bouncing this off some Google employees, and the general vibe I got back was that Google is very good at running stuff like this at a scale that drives down the cost of individual queries, so they seemed confident that these prices were not a loss-leader strategy.

    I don't know if I can believe that though. It's just SO cheap!

    • kanwisher a day ago

      In the past, Google has notoriously 10x'd prices once something is no longer the main pet project and they can't just lose cash on it.

dxxvi a day ago

It seems to me that I don't know how to use Google AI Studio. There is this YouTube video with music lines at the top of the video. I want to extract those music lines and put them in images. This is what I did: download the video with yt-dlp, get all the points in time at which I want to grab the music line, use ffmpeg to extract frames at those points, and use imagemagick to crop the music lines out of those frames. It works, but the tedious part is that I have to find all the points in time when there's a new music line. Does anybody know how to ask AI to do that?
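
For concreteness, the manual pipeline looks roughly like this (a sketch, not my exact commands; the timestamps, filenames and crop box are made up, and here I crop with ffmpeg's crop filter instead of imagemagick):

    # extract_lines.py - grab the music-line strip at hand-picked timestamps
    import subprocess

    TIMESTAMPS = ["0:05", "0:31", "1:02"]   # the tedious part: found by watching the video
    CROP = "crop=iw:ih*0.2:0:0"             # keep the top 20% of each frame (adjust to taste)

    for i, ts in enumerate(TIMESTAMPS):
        subprocess.run([
            "ffmpeg", "-ss", ts, "-i", "video.mp4",
            "-frames:v", "1", "-vf", CROP,
            f"line_{i:03d}.png",
        ], check=True)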

euroderf 3 days ago

Couldn't he have sent it as a fax and then photographed the fax ?

etewiah 4 days ago

You've got me thinking. Would this work for real estate data? A lot of sites make it quite hard to grab their raw data. Also, perhaps it could gain some insights from the photos...

  • TechDebtDevin a day ago

    Been scraping real estate data off every major real estate site for a while. They practically give away their data; there's zero reason to introduce the added cost of LLMs.

    Sure you could do this, and it would work, but you'd spend about 100000x what I do with a $10 Hetzner VPS and a small amount of proxy bandwidth.

  • bambax a day ago

    It's crazy to think we live in a world where video-to-LLM OCR is simpler (and cheaper?) than plain old HTML parsing. Maybe someone will rebuild the Twitter API like this?!?

  • simonw 4 days ago

    I'm certain it would. That would be a really fun experiment to run!

  • jerpint 4 days ago

    Could also work for social media which can be hard to scrape

zahlman 19 hours ago

>You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers.

>I could have clicked through the emails and copied out the data manually one at a time. This is error prone and kind of boring. For twelve emails it would have been OK, but for a hundred it would have been a real pain.

This seems contradictory to me.

  • simonw 19 hours ago

    Why's that?

    Navigating to and then copying and pasting specific text out of twelve different emails (and then removing commas and dollar signs and reformatting dates as YYYY-MM-DD) is still a whole lot more work than watching a 35s video to check that it did the tedious data entry for you.

    For the 100 email version I'd do more of a spot check, depending on how high the stakes were.
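
    The cleanup is small but fiddly - per value it's something like this (illustrative formats, not the exact strings from my emails):

        # strip currency formatting and normalize the date to YYYY-MM-DD
        from datetime import datetime

        amount = float("$1,234.56".lstrip("$").replace(",", ""))
        date = datetime.strptime("October 4, 2024", "%B %d, %Y").strftime("%Y-%m-%d")
        print(amount, date)   # 1234.56 2024-10-04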

spenczar5 a day ago

Suppose I give Gemini a 10 minute video. Will it spend 10 minutes “watching” it if I ask it to extract something? Or does it know how to speed up the video? I assume it must do some sort of preprocessing like extracting keyframes; it surely (?) can’t be looking at the raw encoded video bytes, after all.

  • simonw a day ago

    It won't take 10 minutes, but it might still take a minute or two (for Pro) - though Flash and Flash 8B should be significantly faster.

    It does process a version of the raw video but it can run that faster than the default video playback rate.

    There is quite a bit of detail here: https://ai.google.dev/gemini-api/docs/vision?lang=python#pro...

    "The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second."

m-hodges 4 days ago

This is gonna push things towards some very unfortunate DRM.

  • toomuchtodo 3 days ago

    Can't stop the webcam->LLM

    • okwhateverdude 3 days ago

      Poison pixels and other cat-and-mouse things will definitely happen.

siruncledrew a day ago

Has anyone tried this with Llama 3.2 or any other “open source” options?

teruakohatu 4 days ago

I admit this is a pretty cool technique, but what is missing is how accurate the data extraction was. Without knowing that, it is not possible to judge how useful this technique is.

  • kaveet 3 days ago

    > You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers. It got everything right.

  • simonw 3 days ago

    I watched the 35 second long video and confirmed by eyeballing the JSON that the result was exactly correct.

  • korkybuchek 4 days ago

    He said in his tweet that he verified the results.

danjc 3 days ago

I think this sort of thing is what Microsoft intended with Recall. The problem is the privacy implications are horrible.

  • simonw 3 days ago

    Something I really like about this technique is that I stay in complete control of what I expose to the model. If I don't want something fed into the model I omit it from the screen recording.

Havoc 3 days ago

Still amazed that video is so "cheap" on tokens despite being way more bytes than text

  • odo1242 2 days ago

    Pretty sure there's some strong preprocessing being applied to that video though. Maybe even to the point of extracting text and deduplicating it between frames.

TacticalCoder a day ago

That is amazing! I tried something not dissimilar, but only with a single picture: I took an AI model (I think it was GPT-4o, but I don't remember, as I'm using several) and asked it:

    "Analyze this screenshot and tells me if everything seems legit"
And it could tell the difference between a legit screenshot and a phishing attempt, without me telling it which site name was legit or not.

And it was pretty detailed: "There's an 'l' in interactlvebrokers.ie that is made to look like an 'i'." Or something like that.

We're already getting, and we're going to get, *lots* of shiny new helper tools.
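
The same check is only a few lines of client code, e.g. this sketch with the Gemini Python client (illustrative only - not the model or code I actually used):

    # check_screenshot.py - ask a vision model whether a screenshot looks legit
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    screenshot = genai.upload_file("screenshot.png")
    response = model.generate_content([
        screenshot,
        "Analyze this screenshot and tell me if everything seems legit.",
    ])
    print(response.text)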

  • maeil a day ago

    One of the huge challenges is preventing false positives. Did you also try giving it a legit screenshot?

    • bambax a day ago

      False positives are one thing, but the real danger is false negatives, when someone trusts the model blindly. I'd rather examine a website with my own eyes than ask an LLM what it thinks.