Back before transformers, or even LSTMs, we used to joke that image recognition was so far ahead of language modeling that we should just convert our text to PDF and run the pixels through a CNN.
Text is linear, whereas an image is parallel. When people read, they often don't scan the text strictly left to right (or whatever direction the language uses), but rather take it in all at once or non-linearly: first locking on keywords, then reading adjacent words to get the meaning, often even skipping filler sentences unconsciously. Sequential reading of text is very inefficient.
I absolutely don’t “read the text all at once” and do read “left to right”. Could be why I usually find that my reading speed is slower than most. Although I’ve never really had a hard time with comprehension or remembering details.
The causal masking means future tokens don't affect previous tokens' embeddings as they evolve through the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion's non-linear way of generating text. Vision transformers use bidirectional encoding because of the non-causal nature of image pixels.
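To make that masking distinction concrete, here is a minimal sketch in plain NumPy (not tied to any particular library): a lower-triangular causal mask, as used by decoder-only LLMs, versus the all-ones mask a bidirectional/vision encoder effectively uses. The sizes and random scores are placeholders.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Lower-triangular: token i may attend to tokens 0..i only.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    # Every position attends to every other position (ViT-style patches).
    return np.ones((n, n), dtype=bool)

n = 4
scores = np.random.randn(n, n)                       # raw attention logits
masked = np.where(causal_mask(n), scores, -np.inf)   # block future positions
weights = np.exp(masked) / np.exp(masked).sum(-1, keepdims=True)  # row softmax
print(weights.round(2))   # upper triangle is 0: no attention to future tokens
```

All rows are still computed in parallel; the mask only zeroes out attention to later positions, which is the "yes and no" above.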
People do skip words or scan for key phrases, but reading still happens in sequence. The brain depends on word order and syntax to make sense of text, so you cannot truly read it all at once. Skimming just means you sample parts of a linear structure, not that reading itself is non-linear. Eye-tracking studies confirm this sequential processing (check out the Rayner study in Psychological Bulletin if you are interested).
Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).
There is an interesting discussion down thread about ADHD and sequential reading. As someone who has ADHD I may be biased by how my brain works. I definitely don't read strictly linearly, there is a lot of jumping around and assembling of text.
> Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).
My initial reaction was to say speak for yourself about what reading is or isn’t, and that text is written linearly, but the more I think about it, the more I think you have a very good point. I think I read mostly linear and don’t often look ahead for punctuation. But sentence punctuation changes both the meaning and presumed tone of words that preceded it, and it’s useful to know that while reading the words. Same goes for something like “, Barry said.” So meaning in written text is definitely not 100% linear, and that justifies reading in non-linear ways. This, I’m sure, is one reason that Spanish has the pre-sentence question mark “¿”. And I think there are some authors who try to put who’s talking in front most of the time, though I can’t name any off the top of my head.
What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!
I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode - image and logical and emotional readings all go at each their own paces, rather than a singular OCR engine feeding them all with 1D text
That description feels relatable to me. Maybe buffered more than buttered, in my case ;)
It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?
I do this. I'm autistic and have ADHD so I'm not representative of the normal person. However, I don't think this is entirely uncommon.
The relevant technical term is "saccade"
> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.
> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.
I do this too. I suspect it may involve a subtly different mechanism from the saccade itself though? If the saccade is the behavior, and per the eyewiki link skimming is a voluntary type of saccade, there’s still the question of what leads me to use that behavior when I read (and others to read more linearly). Although you could certainly watch my eyes “saccade” around as I move nonlinearly through a passage, I’m not sure it’s out of a lack of control.
Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.
That eyewiki entry was really cool. Among the unexpectedly interesting bits:
> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].
I also ping-pong around the page (ADHD'er). At times I read a sentence or two in linear fashion, then start jumping, or move to the end and read backwards, or any mix of this depending.
If you're an adult you probably have compensated for the saccades and developed a strategy that doesn't force you to read linearly. This is much of what "speed reading" courses try to do intentionally.
Yet again Hollywood is prescient. This post reminds me of the language of the aliens in Arrival. It seems like the OP would see that as a reasonable input to an LLM.
I made exactly this point at the inaugural Portland AI tinkerers meetup. I had been messing with large document understanding. Converting PDF to text and then sending to gpt was too expensive. It was cheaper to just upload the image and ask it questions directly. And about as accurate.
Not pixels, but percels. Pixels are points in the image, while a "percel" is a unit of perceptual information. It might be a pixel with an associated sound, in a given moment of time. In the case of humans, percels include other senses as well, and they can also be annotated with your own thoughts (i.e. percels can also include tokens or embeddings).
Of course, NNs like LLMs never process a percel in isolation, but always as a group of neighboring percels (aka context), with an initial focus on one of the percels.
I had written up a proposal for a research grant to work on basically exactly this idea.
It got reviewed by 2 ML scientists and one neuroscientist.
Got totally slammed (and thus rejected) by the ML scientists due to „lack of practical application“ and highly endorsed by the neuroscientist.
There’s so much unused potential in interdisciplinary research but nobody wants to fund it because it doesn’t „fit“ into one of the boxes.
Sounds like those ML "scientists" were actually just engineers.
A lot of progress is made through engineering challenges
This is also "science"
That's unfortunate. My personal sense is that while agentic LLMs are not going to get us close to AGI, a few relatively modest architectural changes to the underlying models might actually do it, and I do think mimicry of our own self-referential attention is a very important component of that.
While the current AI boom is a bubble, I actually think the AGI nut could get cracked quietly by a company with even modest resources, if they get lucky with the right fundamental architectural changes.
Isn't this effectively what the latent space is? A bunch of related vectors that all bundle together?
I love this idea, but can't find anything about it. Is this a neologism you just coined? If so, is there any particular paper or work that led you to think about it in those terms?
Yes, I just coined the neologism. It was supposed to be partly sarcastic (why stay at pixels, why not just go fully multimodal and treat the missing channels as missing information?). I am kind of surprised it got so upvoted.
(IME, often my comments which I think are deep get ignored but silly things, where I was thinking "this is too much trolling or obvious", get upvoted; but don't take it the wrong way, I am flattered you like it.)
Assuming channels can be effectively merged into a single percel vector, that would open up interesting channels even beyond human perception, e.g. lidar. Or it would be interesting to train a model that feels at home in 4D space.
I think there's a decent chance you may have just created the ideal name for what will become one of the most important concepts ever. Bravo!
Deep things often, though not always, take more attention to appreciate than the superficial. It's a precious resource people are seldom disposed to allocate much of when headline-surfing HN.
Should future attributions in white papers go to js8 from HN?
This is an interesting thought. Trying to imagine how you represent that as a vector.
You still need to map percels to a latent space. But perhaps with some number of dimensions devoted to modes of perception? E.g. audio, visual, etc
I was going to say toxel
Like a tokenized 3D voxel?
Tokenized pixel. I understand now that's not what js8 was talking about, so my original comment doesn't really make sense
"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.
Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.
It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than the tokenizer.
I consume all text as images when I read, as a vision-capable person, so it kind of passes the "evolution does it that way" test; maybe we shouldn't be that surprised that vision is a great input method?
Actually, thinking more about that: I consume "text" as images and also as sounds… I kind of wonder, instead of the render-and-OCR this suggests, if we did TTS and encoded, say, the MP3 samples of the vocalization of each word, whether that would be fewer bytes than the rendered-pixels version… probably depends on the resolution / sample rate.
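For a rough sense of the byte counts being compared, here is a back-of-envelope sketch; the word length, speech duration, bitrate, and glyph size are arbitrary assumptions, not measurements.

```python
# Rough byte-count comparison for one spoken/rendered word (~5 characters).
# All numbers below are assumptions for illustration only.

word_chars = 5
utf8_bytes = word_chars                      # ASCII text: 1 byte per character

# Audio: ~0.4 s of speech at 64 kbit/s MP3
speech_seconds = 0.4
mp3_bitrate_bps = 64_000
audio_bytes = speech_seconds * mp3_bitrate_bps / 8

# Image: word rendered ~12 px tall, ~7 px per char wide, 1 byte/pixel grayscale
img_bytes = 12 * (7 * word_chars) * 1

print(f"UTF-8: {utf8_bytes} B, MP3: {audio_bytes:.0f} B, raw pixels: {img_bytes} B")
# Plain text is orders of magnitude smaller; both audio and raw pixels blow it
# up before any learned compression is applied.
```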
Funny, I habitually read while engaging TTS on the same text. I have even made a Chrome extension for web reading: it highlights text and reads it, while keeping the current position in the viewport. I find using 2 modalities at the same time improves my concentration. TTS is sped up to 1.5x to match reading speed. Maybe it is just because I want to reduce visual strain. Since I consume a lot of text every day, it can be tiring.
This feature is also built into Edge (and I agree it's great), but I mostly use it so I can listen to pages while doing chores around the office or closing my eyes.
What I would love is an easy way to just convert the page to an mp3 that queues into my podcast app, to listen to while taking a walk or driving. It probably exists, but I haven't spent a lot of time looking into it.
I do this too. It's great. The term I've seen used to describe this is 'Immersion Reading'. It seems to be quite a popular way for neurodivergent people to get into reading.
Any chance you could share the source?
I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan them with my eyes.
What’s your extension? Sounds interesting!
Just FYI, Firefox reader mode does the same thing. It's a little button in the address bar.
Reading mode in chrome does this too. Although the tts sounds like it's far behind sota
The pixels-to-sound path would pass through "reading", so there might be information loss. It is no longer just pixels.
There was the Byte Latent Transformer, to end the tokenizer, which seemingly went nowhere. https://ai.meta.com/research/publications/byte-latent-transf...
The FAIR team is currently subject to TBD Labs politics.
Ok but what are you going to decode into at generation time, a jpeg of text? Tokens have value beyond how text appears to the eye, because we process text in many more ways than just reading it.
There are some concerns here that should be addressed separately:
> Ok but what are you going to decode into at generation time, a jpeg of text?
Presumably, the output may still be in token space, but for the purpose of conditioning on context for the immediate next token, it must then be immediately translated into a suitable input space.
> we process text in many more ways than just reading it
Since a token stream is a straightforward function of textual input, in the case of textual input we should expect the conversion of the character stream into semantic/syntactic units to happen inside the LLM.
Moreover, in the case of OCR, graphical input preserves and degrades information in the way that humans expect; what comes to mind is the eggplant/dick emoji symbolism, or smiling emojis whose graphical similarity can't be deduced from their proximity in Unicode codepoints.
Output really doesn't have to be the same datatype as the input. Text tokens are good enough for a lot of interesting applications, and transforming percels (name suggested by another commenter here) into text tokens is exactly what an OCR model is trained to do anyway.
I do not get it, either. How can a picture of text be better than the text itself? Why not take a picture of the screen while you're at it, so the model learns how cameras work?
From the paper I saw that the model includes an approximation of the layout, diagrams and other images of the source documents.
Now imagine growing up only allowed to read books and the internet through a browser with CSS, images and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.
In a very simple way: because the image can be fed directly into the network without first having to transform the text into a series of tokens as we do now.
But the tweet itself is kinda an answer to the question you're asking.
I guess it is because of the absurdly high information density of text - so text is quite a good input.
It seems like we're still pretty far away from that being viable, if chatgpt is any indication. Whenever it suggests "should I generate an image of that <class design, timeline, data model, etc>, it really helps visualize it!", the result is full of hallucinations.
Somewhat related:
There's this older paper from Lex Flagel and others where they transform DNA-based text, stuff we'd normally analyse via text files, into images and then train CNNs on the images. They managed to get the CNNs to re-predict population genetics measurements we normally get from the text-based DNA alignments.
https://academic.oup.com/mbe/article/36/2/220/5229930
One of the most interesting aspects of the recent discussion on this topic is how it underscores our reliance on lossy abstractions when representing language for machines. Tokenization is one such abstraction, but it's not the only one: using raw pixels or speech signals is a different kind of approximation. What excites me about experiments like this is not so much that we'll all be handing images to language models tomorrow, but that researchers are pressure-testing the design assumptions of current architectures. Approaches that learn to align multiple modalities might reveal better latent structures or training regimes, and that could trickle back into more efficient text encoders without throwing away a century of orthography. But there's also a rich vein to mine in scripts and languages that don't segment neatly into words: alternative encodings might help models handle those better.
Of course PowerPoint is the best input to LLMs. They will come to that eventually.
I'd actually prefer to communicate to ChatGPT via Microsoft Paint. Much more efficient than typing.
Leading scientists claim interpretative dance is the AI breakthrough the world has been waiting for!
In all seriousness, I found those sorting dance videos to be really educationally effective (when coupled with going over the pseudocode) - e.g. https://youtu.be/3San3uKKHgg?si=09EQYJNIkRqvQgWG
It's slides all the way down. Once models support this natively, it's a major threat to slides ai / gamma and the careers of product managers.
Yeah, I’ve seen great results with this approach.
Clippy knew this all along.
Karpathy's points are correct (of course).
One thing I like about text tokens, though, is that the model learns some understanding of the text input method (particularly the QWERTY keyboard).
"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.
This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
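A minimal sketch of that idea: a standard weighted edit distance where substitution cost shrinks for physically adjacent QWERTY keys. The coordinates and the /3 scaling are arbitrary choices for illustration.

```python
# Weighted edit distance where substitutions between neighbouring QWERTY keys
# are cheap, so "hwllo" ends up "closer" to "hello" than "hxllo".
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, i) for r, row in enumerate(QWERTY_ROWS) for i, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

def typo_distance(s: str, t: str) -> float:
    # Classic DP edit distance; substitution cost scaled by key proximity.
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i-1] == t[j-1] else min(1.0, key_distance(s[i-1], t[j-1]) / 3)
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + sub)
    return d[m][n]

print(typo_distance("hello", "hwllo"))  # small: 'w' and 'e' are neighbours
print(typo_distance("hello", "hxllo"))  # larger: 'x' is far from 'e'
```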
But assuming that pixel input gets us to an AI capable of reading, they would presumably also be able to detect HWLLO as semantically close to HELLO (similarly to H3LL0, or badly handwritten text - although there would be some graphical structure in these latter examples to help). At the end of the day we are capable of identifying that... Might require some more training effort but the result would be more general.
I'm particularly sympathetic to typo learning, which I think gets lost in the synthetic data discussion (mine here: https://www.youtube.com/watch?v=yXPPcBlcF8U )
But I think in this case you can still generate typos in images and it'd be learnable. Not a hard issue relevant to the OP.
It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.
It will require like 20x the compute
A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".
If we had a million times the compute? We might have brute forced our way to AGI by now.
But we don't have a million times the compute; we have the compute we have, so it's fair to argue that we want to prioritize other things.
Why do you suppose this is a compute limited problem?
It's kind of a shortcut answer by now. Especially for anything that touches pretraining.
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times the compute for the new method than for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
Thanks.
Also, saying it needs 20x compute is exactly that. It's something we could do eventually but not now
Why so much compute? Can you tie it to the problem?
Tokenizers are the reason LLMs are even possible to run at a decent speed on our best hardware.
Removing the tokenizer would 1/4 the context and 4x the compute and memory, assuming an avg token length of 4.
Also, you would probably need to 4x the parameters, since the model has to learn relationships between individual characters as well as words and sentences etc.
There's been a few studies on small models, even then those only show a tiny percentage gain over tokenized models.
So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.
And that fails when you use more than 1/4 of the context. So realistically you need to support the same context, and your compute goes up another 4x, to 16x.
That's why
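Spelling that arithmetic out as a quick sketch, with an assumed average of 4 characters per token; the quadratic term is the attention cost over sequence length.

```python
# Back-of-envelope: cost of dropping the tokenizer and feeding raw characters,
# assuming ~4 characters per token on average (an assumption, not a measurement).
chars_per_token = 4
context_tokens = 8_192                 # a token-based context window
context_chars = context_tokens * chars_per_token

# Per-layer attention cost scales roughly quadratically with sequence length,
# MLP/other costs roughly linearly.
attn_ratio = (context_chars / context_tokens) ** 2   # 16x
linear_ratio = context_chars / context_tokens        # 4x

print(f"Same text covered: {context_chars} chars vs {context_tokens} tokens")
print(f"Attention FLOPs: ~{attn_ratio:.0f}x, linear-layer FLOPs: ~{linear_ratio:.0f}x")
print(f"Or keep compute flat and accept ~1/{chars_per_token} of the effective context")
```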
Thanks. That helps a lot.
Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
You don't have to use the same token dictionary for input and output. There are things like simultaneously predicting multiple tokens ahead, as an auxiliary loss and for speculative decoding, where the output is larger than the input; similarly, you could have a model where the input tokens combine multiple output tokens. You would still need a forward pass per output token during autoregressive generation, but prefill would require fewer passes and the KV cache would be smaller too, so it could still produce a decent speedup.
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV cache-compression methods might deliver better speedups without degrading the output as much.
Interesting idea! Haven’t heard that before.
Recent and related:
Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)
DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)
I wouldn't think it would be good for coding assistants, or things where character precision is important.
OTOH maybe the information implied by syntax coloring could make syntax patterns easier to recognize and internalize? Once internalized, perhaps it'd retain and use that syntax understanding on plaintext too if you fine tune it by gradually removing the color coding. Similar approaches have worked for improving their innate (no "thinking", no tool use) arithmetic accuracy.
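If one wanted to test that, generating syntax-colored renderings for training data is straightforward; a sketch below assuming the pygments library (and Pillow, which its image output relies on) is installed, using its built-in PNG formatter.

```python
# Render a snippet of code to a syntax-highlighted PNG, the kind of input a
# pixel-based model could be trained on (and later fine-tuned without color).
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import ImageFormatter

code = "def add(a, b):\n    return a + b\n"

png_bytes = highlight(code, PythonLexer(), ImageFormatter(line_numbers=False))
with open("snippet.png", "wb") as f:
    f.write(png_bytes)
```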
It might be helpful for intuiting the structure of a program. Imagine if you had to read code all on a single line, with newlines represented with \n.
I can get the feel of a piece of code just by looking at it. Even if you blurred the image, just the shape of the lines of code conveys a lot of information.
True, but LLMs are already really good at that kind of thing. Even back in 2015, before transformers, here's a karpathy blog post showing how you could find specific neurons that tracked things like indent position, approx column location, long quotes, etc.
https://karpathy.github.io/2015/05/21/rnn-effectiveness/
That said, I do think algorithms and system designs are very visual. It's way harder to explain heaps and merge sorts and such from just text and code. Granted, it's 2025 now and modern LLMs seem to have internalized those types of concepts ~perfectly for a while now, so IDK if there's much to gain by changing approaches at that level anymore.
Another example might be the way people used to show off their Wordle scores on Twitter when the game first came out. Just posting the gray, green and yellow squares by themselves, sans text, communicates a surprising amount of information about the player's guesses.
https://xcancel.com/karpathy/status/1980397031542989305
Thanks. There are also these:
- https://addons.mozilla.org/en-US/firefox/addon/toxcancel/
- https://chromewebstore.google.com/detail/xcancelcom-redirect...
Thanks! Added to toptext also.
> more information compression (see paper) => shorter context windows, more efficiency
It seems crazy to me that image inputs (of text) are smaller and more information dense than text - is that really true? Can somebody help my intuition?
It must be the tokenizer. Figuring out words from an image is harder (edges, shapes, letters, words, ...), yet internal representations are more efficient.
I always found it strange that tokens can't just be symbols, but instead there's an alphabet of 500k tokens, completely removing low-level information from language (rhythm, syllables, etc.), a side effect being simple edge cases like counting the two r's in "strawberry", or no way to generate predefined rhyming patterns (without constrained sampling). There's an understandable reason for these big token dictionaries, but it feels like a hack.
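You can see that character-level information disappear by inspecting how a BPE vocabulary splits a word. A small sketch using the tiktoken library (an assumption that it is installed; the exact split depends on the vocabulary you load):

```python
# Inspect how a BPE vocabulary chops up a word; the model never sees the
# individual letters, only the IDs of these chunks.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)
# Counting the r's requires reasoning across chunk boundaries rather than
# reading characters directly, which is why such questions trip models up.
```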
I absolutely think that it can, but it depends on what meaning you associate with each pixel.
Is it feasible that if we have a tokeniser that works on ELF (or PE/COFF) binaries, then we could have LLMs trained on existing binaries and have them generate binary code directly, skipping the need for programming languages?
I've thought about this a lot, and it comes down ultimately to context size. Programming languages themselves are sort of a "compression technique" for assembly code. Current models even at the high end (1M context windows) do not have near enough workable context to be effective at writing even trivial programs in binary or assembly. For simple instructions sure, but for now the compression of languages (or DSLs) is a context efficiency.
I can attest that existing LLMs work surprisingly well for disassembly.
Possible but not precise depending on your use case. LLM compilers would suffer from the same sort of propensity to bugs as humans.
Seems we're now at a point in time when OCR is doing so well that printing text out and letting computers literally read it is suggested to be superior to processing the encoded text directly.
Neural networks have essentially solved perception. It doesn't matter what format your data comes in, as long as you have enough of it to learn the patterns.
The information density of a bitmap representation of text is just silly low compared to normal textual encodings, even compressed.
PDF is arguably a confusing format for LLMs to read.
I think the DCT is a compelling way to interact with spatial information when the channel is constrained. What works for jpeg can likely work elsewhere. The energy compaction properties of the DCT mean you get most of the important information in a few coefficients. A quantizer can zero out everything else. Zig zag scanned + RLE byte sequences could be a reasonable way to generate useful "tokens" from transformed image blocks. Take everything from jpeg encoder except for perhaps the entropy coding step.
At some level you do need something approximating a token. BPE is very compelling for UTF8 sequences. It might be nearly the most ideal way to transform (compress) that kind of data. For images, audio and video, we need some kind of grain like that. Something to reorganize the problem and dramatically reduce the information rate to a point where it can be managed. Compression and entropy is at the heart of all of this. I think BPE is doing more heavy lifting than we are giving it credit for.
I'd extend this thinking to techniques like MPEG for video. All frame types also use something like the DCT too. The P and B frames are basically the same ideas as the I frame (jpeg), the difference is they take the DCT of the residual between adjacent frames. This is where the compression gets to be insane with video. It's block transforms all the way down.
An 8x8 DCT block for a channel of SDR content is 512 bits of raw information. After quantization and RLE (for typical quality settings), we can get this down to 50-100 bits of information. I feel like this is an extremely reasonable grain to work with.
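A minimal sketch of that pipeline: DCT of an 8x8 block, crude quantization, zig-zag scan, and run-length coding. It uses scipy's `dctn`; the single quantization constant (instead of a JPEG table) and the sample gradient block are assumptions for illustration.

```python
# 8x8 block -> DCT -> quantize -> zig-zag -> run-length "grains".
import numpy as np
from scipy.fft import dctn

def zigzag_order(n=8):
    # Visit (row, col) pairs in JPEG zig-zag order by sorting anti-diagonals.
    idx = [(r, c) for r in range(n) for c in range(n)]
    return sorted(idx, key=lambda rc: (rc[0] + rc[1],
                                       rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def block_to_tokens(block: np.ndarray, q: float = 16.0):
    coeffs = dctn(block.astype(float) - 128.0, norm="ortho")  # energy compaction
    quant = np.round(coeffs / q).astype(int)                  # zero out small terms
    stream = [quant[r, c] for r, c in zigzag_order()]
    # Run-length encode as (value, run) pairs; the zero tail collapses.
    rle, run = [], 1
    for prev, cur in zip(stream, stream[1:] + [None]):
        if cur == prev:
            run += 1
        else:
            rle.append((prev, run))
            run = 1
    return rle

block = np.tile(np.arange(8) * 32.0, (8, 1))   # a smooth horizontal ramp
print(block_to_tokens(block))                  # a handful of (value, run) grains
```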
I can listen to music in my head. I don't think this is an extraordinary property but it is kind of neat. That hints at the fact that I somehow must have encoded this music. I can't imagine I'm storing the equivalent of a MIDI file, but I also can't imagine that I'm storing raw audio samples because there is just too much of it.
It seems to work for vocals as well, not just short samples but entire works. Of course that's what I think, but there is a pretty good chance they're not 'entire', but it's enough that it isn't just excerpts and if I was a good enough musician I could replicate what I remember.
Is there anybody that has a handle on how we store auditory content in our memories? Is it a higher level encoding or a lower level one? This capability is probably key in language development so it is not surprising that we should have the capability to encode (and replay) audio content, I'm just curious about how it works, what kind of accuracy is normally expected and how much of such storage we have.
Another interesting thing is that it is possible to search through it fairly rapidly to match a fragment heard to one that I've heard and stored before.
> Is there anybody that has a handle on how we store auditory content in our memories?
It's so weird that I don't know this. It's like I'm stuck in userland.
https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)
You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.
I couldn't imagine how rendering text tokens to images could bring any savings, but then I remembered each token is converted into hundreds of floating point numbers before being fed to the neural network. So in a way it's already rendered into a multidimensional pixel (or hundreds of arbitrary 2-dimensional pixels). This paper shows that you don't need that many numbers to keep the accuracy, and that using numbers which represent the text visually (which is pretty chaotic) is just as good as the way we currently do it.
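A rough numeric version of that comparison; the embedding width, precision, and glyph size below are assumptions chosen for illustration, not taken from the paper.

```python
# How big is "one token" on each side of the comparison?
# Assumed numbers for illustration only.
embed_dim = 4096          # hidden width of a largish LLM
bytes_per_float = 2       # fp16/bf16
token_embedding_bytes = embed_dim * bytes_per_float          # 8 KiB per token

glyph_h, glyph_w = 16, 8  # one rendered character, grayscale
chars_per_token = 4
rendered_token_bytes = glyph_h * glyph_w * chars_per_token   # 512 B per token

print(token_embedding_bytes, rendered_token_bytes)
# The raw pixels of a token's text are far smaller than the vector the model
# immediately expands that token into, which is why "pixels as input" is less
# absurd than it first sounds.
```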
Sometimes you want to be Unicode-precise, such as when checking if domain names are legit.
This should be "pixels are (maybe) a better representation than the current representation of tokens", which is very different. Text is surely more information dense than an image containing the same text, so the problem is finding the best representation of text. If each word is expanded to a very large embedding and you see pixels doing better, then the problem is in the representation, not in text vs image.
I don't quite follow. The way I see it, what the LLM "reads" depends on the input modality. If the input is a human it will be in text form, has to be. If the input is through a camera then yes, even text will be camera frames and pixels, and that is how I expect the LLM to process it. So I would think a vision LLM would already be doing this.
> if the input is a human it will be in text form, has to be.
Why can't it be a sequence of audio waveforms from human speech?
I'm probably one of the least educated software engineers on LLMs, so apologies if this is a very naive question. Has anyone done any research into just using words as the tokens rather than (if I understand it correctly) 2-3 characters? I understand there would be limitations with this approach, but maybe the models would be smaller overall?
You would need dictionaries with millions of tokens, which would make models much larger. Also, any word with too low a frequency to appear in the dictionary is now completely unknown to your model.
The way modern tokenizers are constructed is by iteratively doing frequency analysis of arbitrary length sequences using a large corpus. So what you suggested is already the norm, tokens aren't n-grams. Words and any sequence really that is common enough will already be one token only, the less frequent a sequence is the more tokens it needs. That's the Byte-pair encoding algorithm:
https://en.wikipedia.org/wiki/Byte-pair_encoding
It's also not lossy compression at all, it's lossless compression if anything, unlike what some people have claimed here.
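For anyone who hasn't seen it, here is a compressed sketch of the BPE training loop described above, on a toy character-level corpus (real tokenizers start from bytes and much larger corpora).

```python
# Toy byte-pair-encoding trainer: repeatedly merge the most frequent adjacent
# pair of symbols until the desired number of merges is reached.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int):
    words = [list(w) for w in corpus]   # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere in the corpus.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(train_bpe(["low", "lower", "lowest", "newest", "widest"], 6))
# Frequent sequences become single tokens; rare words fall back to smaller
# pieces, so nothing is ever out of vocabulary.
```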
Shocking comments here, what happened to HN? People are so clueless it reads like reddit wtf
Thanks, that's really interesting. Do they correct for spelling mistakes or internationalised spellings? For example, does `colour` and `color` end up in the same token stream?
Along with the other commenter's point, the reason the dictionary would get so big is that words sharing a stem would have all their variations as different tokens (cat, cats, sit, sitting, etc). Also, any out-of-dictionary words or compound words, e.g. "cat bed", would not be addressable.
Presumably anyone tokenizing Chinese characters, which are basically entire words.
There is other research that works with pixels of text, such as this recent paper I saw at COLM 2025 https://arxiv.org/abs/2504.02122.
Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?
Yeah, mapping Chinese characters into linear UTF-8 space throws a lot of information away. Each language brings some ideas for text processing; the sentencepiece inventor is Japanese, for example, and Japanese doesn't have explicit word delimiters.
Yeah, that sounds quite interesting. I'm wondering whether there is a bigger gap in performance (= quality) between text-only<->vision OCR in Chinese language than in English.
There is indeed a lot of semantic information contained in the signs that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that purely has to draw a connection between "tree" and "forest" would have a much harder time seeing that connection independent of whether it's fed that as text or vision tokens.
Chinese text == Method of loci
Many Chinese students have good enough memory to recall a particular paragraph and understand the meaning, but no idea how those words are pronounced.
We're going to get closer and closer to removing all hand-engineered features of neural network architecture, and letting a giant all-to-all fully connected network collapse on its own into the appropriate architecture for the data, a true black box.
Which is the logical conclusion. If the neural network can distill a model out of complex input data, especially when many models are routinely trained with data augmentation practices that actively degrade the input to achieve generalization, then why are we stuck wearing silk-glove tokenizers?
Not criticizing per se but I just watched this recent (and great!) interview where he extols how special written language is. That was my take away at least. Still trying to wrap my head around this vision encoder approach. He’s way smarter than me! https://youtu.be/lXUZvyajciY
Really interesting analysis of the latest DeepSeek innovation. I'm tempted to connect it to the information density of logographic script, in which DeepSeek's engineers would all be natively fluent.
For reference, here's the paper: https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...
The paper is quite interesting, but efficiency on OCR tasks does not mean it could be plugged into a general LLM directly without performance loss. If you trained a tokenizer only on OCR text, you might already get better compression.
Could someone explain to me the difference? They both get turned to tensors of floats.
Eh, some part of the model will be translating those pixels into tokens anyway. We're just moving the extra step into the black box.
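Right, a ViT-style front end "tokenizes" pixels anyway: cut the image into patches and linearly project each one. A minimal sketch with made-up sizes (nothing specific to DeepSeek-OCR):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
# A strided convolution is the standard trick for "split into patches and
# linearly project each patch".
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # one RGB image
patches = patchify(image)                    # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): a sequence of "visual tokens"
print(tokens.shape)
```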
The text should be printed and a photo of the printed paper on a wooden table should be passed as input into the LLM.
There are many unicode characters that look alike. There are also those zero width characters.
> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.
> Maybe it makes more sense that all inputs to LLMs should only ever be images.
So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?
If you can read your input on your screen, your computer apparently already knows how to convert your text to images.
No? He’s talking about rendered text
From the post he's referring to text input as well:
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting, or at least wondering, whether the vision encoder should be the only input path to the LLM, with the model reading the text through it. There would be a rasterization step on the text input to generate the image.
Thus you don't need to draw a picture; you just generate a raster of the text and feed it to the vision model.
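A rough sketch of that rasterization step with Pillow, using its default font (a real pipeline would care much more about fonts, wrapping, DPI, and page layout):

```python
from PIL import Image, ImageDraw, ImageFont

def rasterize(text, width=896, line_height=20):
    """Render plain text to a grayscale image, one line per row. Sketch only."""
    font = ImageFont.load_default()
    lines = text.split("\n")
    img = Image.new("L", (width, line_height * len(lines) + 8), color=255)
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, 4 + i * line_height), line, fill=0, font=font)
    return img

rasterize("Even if you happen to have pure text input,\nmaybe you'd prefer to render it.").save("input.png")
```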
All inputs being embeddings can work if you have Matryoshka-style embeddings; the hard part is adaptively selecting the embedding size for a given datum.
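The inference-time half of that is simple once an embedding has been trained Matryoshka-style: keep a prefix of the vector and renormalize. The sizes below are placeholders, and the adaptive per-datum size selection is exactly the part this sketch does not solve:

```python
import numpy as np

def truncate_matryoshka(embedding, dim):
    """Matryoshka-style embeddings are trained so a prefix is itself usable;
    at inference you slice to the desired size and renormalize."""
    prefix = np.asarray(embedding, dtype=float)[:dim]
    return prefix / (np.linalg.norm(prefix) + 1e-12)

full = np.random.randn(768)              # hypothetical full-size embedding
small = truncate_matryoshka(full, 128)   # cheaper 128-d view of the same input
```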
I mean, text is, after all, highly stylised images
It's trivial for text to be pasted in and converted to pixels (that's what my computer, and every computer on the planet, does when showing me text).
Back before transformers, or even LSTMs, we used to joke that image recognition was so far ahead of language modeling that we should just convert our text to PDF and run the pixels through a CNN.
Text is linear, whereas an image is parallel. When people read, they often don't scan the text strictly from left to right (or in another direction, depending on the language), but rather take it in all at once or non-linearly: first locking onto keywords, then reading adjacent words to get the meaning, often even skipping filler sentences unconsciously.
Sequential reading of text is very inefficient.
I absolutely don’t “read the text all at once” and do read “left to right”. Could be why I usually find that my reading speed is slower than most. Although I’ve never really had a hard time with comprehension or remembering details.
I remember doing speed reading courses back when I was young and a big part of it was learning to read a paragraph diagonally.
It's much, much faster. At first there's a loss of understanding, of course, but once you've practiced enough you'll be much faster.
LLMs don't "read" text sequentially, right?
The causal masking means future tokens don't affect previous tokens' embeddings as they evolve through the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion's non-linear way of generating text. Vision transformers use bidirectional encoding because of the non-causal nature of image pixels.
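Concretely, the "yes and no" is a triangular mask applied to the attention scores before the softmax. A toy sketch with no heads, batching, or scaling:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw query-key attention scores
# Mask out the upper triangle so position i can only attend to positions <= i.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(weights)  # lower-triangular rows: token N never "sees" token N+1
```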
Didn't Anthropic show that the models engage in a form of planning, such that they predict possible future tokens which then affect the prediction of the next token: https://transformer-circuits.pub/2025/attribution-graphs/bio...
Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change that the token N can't "see" N+1.
Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.
If the attention is masked, then yes they do.
I think you’re making a lot of assumptions about how people read.
He isn't; plenty of studies have been done on the topic. Eyes dart around a lot when reading.
People do skip words or scan for key phrases, but reading still happens in sequence. The brain depends on word order and syntax to make sense of text, so you cannot truly read it all at once. Skimming just means you sample parts of a linear structure, not that reading itself is non-linear. Eye-tracking studies confirm this sequential processing (check out the Rayner study in Psychological Bulletin if you are interested).
Thanks for the reference!
Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).
There is an interesting discussion down thread about ADHD and sequential reading. As someone who has ADHD I may be biased by how my brain works. I definitely don't read strictly linearly, there is a lot of jumping around and assembling of text.
> Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).
My initial reaction was to say speak for yourself about what reading is or isn’t, and that text is written linearly, but the more I think about it, the more I think you have a very good point. I think I read mostly linear and don’t often look ahead for punctuation. But sentence punctuation changes both the meaning and presumed tone of words that preceded it, and it’s useful to know that while reading the words. Same goes for something like “, Barry said.” So meaning in written text is definitely not 100% linear, and that justifies reading in non-linear ways. This, I’m sure, is one reason that Spanish has the pre-sentence question mark “¿”. And I think there are some authors who try to put who’s talking in front most of the time, though I can’t name any off the top of my head.
What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!
I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode - image and logical and emotional readings all go at each their own paces, rather than a singular OCR engine feeding them all with 1D text
Is that crazy? I'm not buying that it is.
That description feels relatable to me. Maybe buffered more than buttered, in my case ;)
It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?
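Loosely, yes: each attention head computes its own weighting over the same sequence, in parallel. A toy sketch with made-up sizes, just to show the shape of the analogy:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(1, 10, 64)  # 10 token embeddings
out, weights = mha(tokens, tokens, tokens, average_attn_weights=False)
print(weights.shape)             # (1, 4, 10, 10): one attention map per head
```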
Don't know, probably? I'm a linear reader
I do this. I'm autistic and have ADHD so I'm not representative of the normal person. However, I don't think this is entirely uncommon.
The relevant technical term is "saccade"
> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.
> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.
https://eyewiki.org/Saccade
Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading
I do this too. I suspect it may involve a subtly different mechanism from the saccade itself though? If the saccade is the behavior, and per the eyewiki link skimming is a voluntary type of saccade, there’s still the question of what leads me to use that behavior when I read (and others to read more linearly). Although you could certainly watch my eyes “saccade” around as I move nonlinearly through a passage, I’m not sure it’s out of a lack of control.
Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.
That eyewiki entry was really cool. Among the unexpectedly interesting bits:
> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].
I also ping-pong around the page (ADHD'er). At times I read a sentence or two in linear fashion, then start jumping, or move to the end and read backwards, or any mix of this, depending.
If you're an adult you probably have compensated for the saccades and developed a strategy that doesn't force you to read linearly. This is much of what "speed reading" courses try to do intentionally.
some of us with ADHD just kind of read all the words at once
Yet again Hollywood is prescient. This post reminds me of the language of the aliens in Arrival. It seems like the OP would see that as a reasonable input to an LLM.
I made exactly this point at the inaugural Portland AI tinkerers meetup. I had been messing with large document understanding. Converting PDF to text and then sending to gpt was too expensive. It was cheaper to just upload the image and ask it questions directly. And about as accurate.
https://portland.aitinkerers.org/talks/rsvp_fGAlJQAvWUA
It's kind of beautiful that they can actually do that.