> This blog post is mostly AI-generated using a PySpur workflow with minor human edits.
it's funny that this was clear about 5% in, just from the classic chatgpt-style format and tone
Okay, so this is a PySpur ad, alright. Since I'm interested in this kind of tool, and I see on their GitHub that they don't have loops yet, I have to ask: does anyone know of a similar (node/DAG-based) tool that does support looping?
It seems to be a common problem; so far, I've played with Rivet, n8n, and "LLM Party" nodes for ComfyUI; they all seem to focus on everything except letting you conveniently loop the flows.
> so this is a PySpur ad, alright
They should have also posted the PySpur pipeline; it would be interesting to see the agentic flow they used in this article. I do a lot of these kinds of workflows manually, copy-pasting stuff, and I'd like to have some tools to design AI flows. PySpur looks pretty interesting visually.
We will push it to our repo very soon :)
One little thing I consider a brilliant UX innovation of ComfyUI is that it automatically embeds the entire workflow as metadata in images it produces. That means you can take any image generated in ComfyUI and just drag&drop it back into ComfyUI, and it'll recreate the workflow that produced it. This also enables a neat way of sharing workflows explicitly - you export the workflow as a PNG[0] and publish it, and others can both view it as an image and import it into their ComfyUI instance. As a result, if you see someone sharing their AI art or workflow made with ComfyUI, there's a good chance you can literally just drag&drop the image to get the "source code".
I think all node-based tools should offer this kind of export, and I humbly suggest that PySpur would benefit from having it too :).
--
[0] - Right click on canvas, Workflow Image -> Export -> png.
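For anyone curious how that embedding works under the hood: the workflow lives in the PNG's text chunks, so it can be pulled back out with a few lines of Python. A minimal sketch, assuming Pillow is installed and that ComfyUI keeps writing the "prompt"/"workflow" text chunks it uses today (the file name is made up):

```python
import json
from PIL import Image

img = Image.open("comfyui_output.png")      # any image exported by ComfyUI
workflow_json = img.info.get("workflow")    # full node graph, if embedded
prompt_json = img.info.get("prompt")        # the executed prompt/graph

if workflow_json:
    workflow = json.loads(workflow_json)
    print(f"embedded workflow has {len(workflow.get('nodes', []))} nodes")
```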
Thank you so much, this is super valuable feedback and actually very easy to add. We will add this soon!
We do support loops! :) Just haven't advertised it inside the Github readme yet.
One of the reasons why we started building pyspur is precisely because none of the other tools support loops!
If you need more support, shoot me an email; you can find it at the bottom of the GitHub readme.
EDIT: Just added a screenshot to the readme.
It was going pretty well until the exclamation point at the end of the first paragraph.
Replaced the exclamation point with a dot, hope it's better now!
let me know if you have more feedback!
It's blatantly obvious; nobody uses so many bullet points.
I do :(
It's a great and concise way to write!
Yep, me too. Always loved to write in bullet points and in fact, I prompted the LLMs explicitly to do so.
True! We love bullet points! :)
If this is the startup that finally unleashes AI spam bot articles and comments to the top of HackerNews, I'm gonna quit the internet and move into a log cabin.
It was already the case that no one reads the article before commenting. Soon it will be that no one writes the article either.
Just need AI to write the comments and the circle of life will be complete
Or we just skip the middlemen and exchange our prompts instead.
Back to the core issue - apparently few people took a long enough look at the article to notice it was co-written by AI; i.e. there were human editors in the loop. Sure, the format is a bit off-putting, but that's IMO mostly because nobody can be arsed to write like that, even if their own thesis supervisor told them they should, as proper structuring makes it easier for the reader to understand a complex topic.
Anyway, point is, I personally have no issue with people using AI to improve their texts - LLMs already are better at writing than most people anyway. Just as long as the saved effort is put into ensuring the content itself is valuable and communicated well.
Totally agree!
It doesn't matter whether you use AI or not; what matters is whether your articles are relevant and useful or not
no, for copyright and other social agreements, there has to be a declaration of authorship for original works.. maybe others can expand and clarify; certainly will vary on the major marketplaces in the world
> for copyright and other social agreements, there has to be a declaration of authorship for original works
It's the Internet - we never cared about such things here. Attribution and linking, yes. "Copyright" and "authorship of original works" - are you sure you're not a legacy publisher desperate to insert itself into the free exchange of knowledge and put up a toll gate? :).
I'm joking, but only a little. Unless you actually believe LLMs sin against the Church of Intellectual Property with every token they produce, this complaint feels out of place in the context of a blog post summarizing research work done in the open. There are situations in which one could try to argue LLMs violate the rights of some authors, but this isn't one of them.
your clown talk? yeah, probably not copyrightable
The US copyright system is not a one-line profane insult topic, to me.. we are different, yes
Hey, the more clownish I talk, the better chance I have at getting it protected as original work!
I loved bullet points before AI was a thing. Now I'm accused of being AI.
Same here lol! I ask it to write in bullet points every time
Fair point! Do you prefer a different format or tone? We really like the concise bullet point format :)
it's not the bullet points per se; the general structure of the analysis has a certain vibe to it, at a level deeper than just the visual presentation
but this is something where it's up to you to decide what you want from your ghostwriting. my comments would not a system prompt make
I see! We wanted an overview of KV caching internally and quickly understand the latest methods without much fluff, I think it did a great job at that
I'd rather eat sand than read an AI-generated article. If you don't care enough to write it, I don't care enough to read it.
I don't like formatting in bullet points and listicles much, but the contents are pretty good, they cover many papers in a lightweight way, you can get a decent overview in 10 minutes for what would take hours to research.
> the contents are pretty good, they cover many papers in a lightweight way, you can get a decent overview in 10 minutes for what would take hours to research.
Exactly, I'm still surprised it works so well.
Also, which formatting do you prefer? I explicitly prompted it to write everything in bullet points because I find it more digestible
I prefer prose because it has better flow. Usually I have to specify this manually or the LLMs will happily generate bulletpoints and listicles.
More recently I prefer chain-of-thought prose to the final answer. I can trigger it with a prompt even on non-reasoning models and it's usually easier to follow.
How long would it take you to generate your own equally good summary of these papers using Claude? Maybe 30 seconds?
Would be curious to compare the results!
Hi, OP here; this article helped me a lot to better understand KV caches, which is ultimately why I co-wrote it with AI + read it several times before posting
getting tired of these blog posts that end with "this post is AI-generated" as if it's going to surprise us. it's getting repetitive. imo, articles should say up front whether they're ai generated, so the reader doesn't feel stupid after reading the whole thing
with that said, i love the content! will be bookmarking for future reference
Hi, OP here. My intention wasn't to "gotcha" anyone by mentioning it at the end; it was simply to be upfront. Many blog posts/content put out these days are obviously 100% AI-generated, yet it's never mentioned. This one was probably 80%/20% (I still did many manual edits).
Glad you overall liked it!
At the start of the article there's a very clear profile picture and name of the author: Jean Kaddour.
If you want to be upfront, you should mention at the start that it's written by AI instead of showing this fake author.
This would give people the choice on whether to read it.
Putting it at the end is just to give you plausible deniability. Clearly your intention is to present this as if it was written by this Mr. Kaddour, which is a lie.
EDIT: they removed the fake author in response to this comment
Ah, good catch! You seem to really assume the worst on our side, but fine, this is HN :)
The author was simply there because of the website template we used; by default, it wants you to specify an author so we did. I removed the author now, thanks for making me aware!
If you really have good intentions, you need to state clearly that it was written by AI at the top.
The place that you just removed the fake author from would be a good position, I think; you could even put the logo of the AI you used where the profile picture was.
I feel like we’re living in strange times where your comment appears to be AI generated as well. You complain about the surprise at the end and then offer up a similar structural surprise in your reply.
Strange times indeed, given that I naturally write comments structured similarly to GP. Hell, I'm probably even more of an LLM with human face than GP, because I capitalize the first words in my sentences, exactly like ChatGPT does.
Not sure if I'm getting this. Is this cache implemented as part of the forward pass through the network, in a general Python datastructure like a dict? Or is the cache somehow part of the fabric of the neural network?
The KV cache is typically stored in a data structure external to the trained weights—often a buffer or set of tensors kept alongside the model’s forward pass (e.g., in PyTorch, one might store it in a dictionary-like container). It’s not baked into the neural network parameters themselves; instead, it’s an auxiliary memory that holds precomputed key-value pairs so the model doesn’t have to re-encode past tokens on each new inference step.
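To make that concrete, here's a toy sketch of such an external cache: a plain per-layer dict of PyTorch tensors that grows by one position per decoding step (shapes and names are illustrative, not from any particular framework):

```python
import torch

kv_cache = {}  # layer index -> (keys, values); a plain dict, not model weights

def append_kv(layer: int, k_new: torch.Tensor, v_new: torch.Tensor):
    """Append this step's keys/values and return everything cached so far.

    Tensors are assumed to be [batch, heads, seq, head_dim]."""
    if layer not in kv_cache:
        kv_cache[layer] = (k_new, v_new)
    else:
        k, v = kv_cache[layer]
        kv_cache[layer] = (torch.cat([k, k_new], dim=2),   # grow along seq dim
                           torch.cat([v, v_new], dim=2))
    return kv_cache[layer]  # attention for the new token reads all cached K/V
```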
Neither. Think of it as something like redis or memcached. It's external to the program, and the program will run just fine without it. But it avoids a lot of duplicate work.
It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention (which introduces vLLM) paper for how most systems implement it nowadays.
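Roughly, the idea in that paper is to split the cache into fixed-size blocks and give each sequence a block table mapping logical token positions to physical blocks, much like virtual-memory paging. A minimal sketch of just the bookkeeping, with made-up names and sizes:

```python
BLOCK_SIZE = 16
free_blocks = list(range(1024))   # pool of physical KV-cache blocks on the GPU

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""
    def __init__(self):
        self.blocks = []

    def slot_for(self, token_pos: int):
        block_idx, offset = divmod(token_pos, BLOCK_SIZE)
        while len(self.blocks) <= block_idx:
            self.blocks.append(free_blocks.pop())   # allocate a block on demand
        return self.blocks[block_idx], offset       # where this token's K/V go
```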
Very clean writeup. On the attention sinks, you mention they enable "infinite-length sequence processing". What does that mean exactly in practice? Isn't deepseek still capped at 128k?
Thank you! Great question.
"Infinite-length sequence processing" in StreamingLLM refers to handling much longer sequences than the model's training window (e.g., millions of tokens), by combining a sliding window for recent tokens with fixed attention sinks from the start of the sequence.
I can't speak for DeepSeek, but if I had to guess, I'd say that the infinite context window isn’t practical because storing all past tokens eventually becomes too expensive.
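The eviction policy itself is simple; a minimal sketch with made-up defaults (real implementations also re-index positions relative to the cache):

```python
def tokens_to_keep(cache_len: int, num_sinks: int = 4, window: int = 1020):
    """Indices of cached tokens to keep: a few initial 'attention sink' tokens
    plus a sliding window of the most recent ones; everything else is evicted."""
    if cache_len <= num_sinks + window:
        return list(range(cache_len))            # nothing to evict yet
    sinks = list(range(num_sinks))               # always keep the first tokens
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent
```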
Agreed on the writeup itself. It's beautifully written and presented. Kudos to Jean Kaddour and anyone else that may have been involved in putting it together.
Thank you so much, glad you liked it
When you say sequence length, does it only count the output tokens or are input tokens also included in that?
Thanks for the post, it was an excellent read!
Thanks for reading! In most contexts (including this one), seq length encompasses both the initial input (prompt) tokens and the output tokens the model generates. It’s the total length of all tokens processed by the model so far.
Please do! Seeing that you used multiple research papers to back up this writing inspired me to use this in my current research project for the literature review and eventual write up.
The template will be hugely helpful for a non-programmer like me.
hmm. after my engineering degree put all of the vector math in the form
k = Wx
seeing
k = xW
is jarring. Is there a reason for using horizontal vectors? Common for data science docs?
It’s mostly a convention. In many deep learning frameworks (PyTorch, TensorFlow, etc.), inputs are stored with the “batch × length × hidden-dim” shape, effectively making the token embeddings row vectors. Multiplying “xW” is then the natural shape-wise operation. On the other hand, classical linear algebra references often treat vectors as column vectors and write “Wx.”
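A quick shape check of the row-vector convention in PyTorch (dimensions are made up):

```python
import torch

batch, length, d_model, d_k = 2, 8, 64, 16
x = torch.randn(batch, length, d_model)   # token embeddings as row vectors
W_k = torch.randn(d_model, d_k)           # projection matrix

k = x @ W_k                               # "xW": shapes line up naturally
print(k.shape)                            # torch.Size([2, 8, 16])
```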
Isn't batch-first a Pytorch thing? I started with Tensorflow and it's batch-last.
TFv1 or TFv2? AFAIK it's batch-first in TFv2
You are in the right here. Horizontal (row) vectors are common for (some) deep learning docs, but column vectors are the literature standard elsewhere.
It is more efficient to compute k = xW with the weights transposed than k = Wx.
What's specific to deepseek here that other models do not use, or are you just riding the keyword wave?
DeepSeek proposed the multi-head latent attention technique! :)
As far as I know, they are the only ones using it so far
Fair point, thanks for the clarification; it seems this was first proposed in https://arxiv.org/pdf/2405.04434? I was confused by your title mentioning DeepSeek, but then the first paragraph reverts to "...language models like ChatGPT and DeepSeek faster at generating text".
Right, that's a good point. I'll adjust the intro a bit. We wanted to provide a more holistic overview on what MLA is, what came before it, and why it matters :) hope it was useful!
Just refined it a bit; I hope it's clearer now!
Neat. Can you share the workflow that created this blog? What models did it use?
Thanks! Yes, will push it as template to our repo (https://github.com/PySpur-Dev/pyspur) soon! We used o1 and claude3.5
How were the images in the blog generated?
> each token needs to "look at" or "attend to" all other tokens to understand the context.
> First token: Look at 1 token (cost: O(1^2))
Umm, is this right? There is no token existing before the first token is generated, so how do you look at it? AI slop?
The phrase, “the first token looks at 1 token,” is simply a shorthand for the self-attention step when the sequence length is one. Although there are no preceding tokens, we still treat it as an O(1^2) operation where the first token effectively attends to itself (or a special [BOS] token). This approach preserves the big-O analysis when summing over all tokens.
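To see why that bookkeeping matters once you sum over all tokens, here's a back-of-the-envelope comparison (counting token-to-token "look at" operations only, purely illustrative):

```python
n = 1000  # total sequence length

# Without a KV cache, step t re-encodes all t tokens: ~t^2 "look at" operations.
no_cache = sum(t * t for t in range(1, n + 1))   # grows like n^3

# With a KV cache, step t only attends from the new token to the t cached keys.
with_cache = sum(t for t in range(1, n + 1))     # grows like n^2

print(no_cache, with_cache)   # 333833500 vs 500500 for n = 1000
```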
> Umm, is this right?
No way to know until you painstakingly verify every single assertion that the AI made! The author of this article certainly didn't, and the content was good enough to them.
Trust me, AGI is almost there.
I did :)
Yes, 80% of the post is generated, yet I still reviewed everything!
And if I had written everything from scratch, this would have probably taken a week rather than a few hours.