Wheatman 11 hours ago

Sorry, this is kind of a repost, since my last post violated some guidelines.

This paper, which I read a few days ago, establishes the existence of strong model collapse: in their own words, "as little as 1% of the training data" being synthetic can still lead to model collapse.

What I found interesting, beyond their results (which boil down to: model collapse can be steep, diminishing only as the ratio of synthetic to real data gets smaller, and larger models experiencing more severe collapse), is that in the end they talk about the importance of labeling and curating real data.
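For anyone unfamiliar with the term, the basic mechanism can be sketched with a toy Gaussian example (my own illustration, not the paper's setup): fit a distribution to some data, sample the next "generation" of training data from the fit, and repeat. Finite-sample estimation error compounds across generations, and the learned variance tends to drift toward zero.

```python
# Toy sketch of model collapse: each generation is trained purely on
# samples from the previous generation's fitted Gaussian. This is an
# illustration of the general phenomenon, not the paper's experiment.
import numpy as np

def collapse_demo(generations=200, n=20, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, n)   # generation 0: "real" data
    sigmas = [data.std()]
    for _ in range(generations):
        # fit mean/std, then sample the next generation's training set
        data = rng.normal(data.mean(), data.std(), n)
        sigmas.append(data.std())
    return sigmas

sigmas = collapse_demo()
print(f"std at generation 0:   {sigmas[0]:.4f}")
print(f"std at generation 200: {sigmas[-1]:.4f}")
```

With a fraction of real data mixed back in each generation (the ratio the paper studies), the shrinkage slows, which is consistent with their finding that collapse only diminishes as the synthetic/real ratio gets smaller.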

How would real data be labelled? Wouldn't it be easier to label generated data? And isn't the internet already stuffed with generated images, SEO LLM articles, and more? All of which are becoming harder and harder to detect.

I don't want to call this a pre-war steel situation, since humans are still making human-made content, but with LLM-made, SEO-optimised content flooding the web en masse, and the first three image results on Google being AI-generated, it's starting to seem more like finding a needle in a haystack.

What do you guys think?

gnabgib 11 hours ago

Article title: Strong Model Collapse