The toilet flushing one is full of weird, unrelated noises.
The tennis video, as others commented, is good, but there is a noticeable delay between the action and the sound. And the "loving couple holding AI hands and then dancing", well, the input is already cringe enough.
For all these diffusion models, it looks like we are 90% of the way there; now we just need the final 90%.
The video to audio examples are really impressive! The video featuring the band showcases some of the obvious shortcomings of this method (humans will have very precise expectations about the kinds of sounds 5 trombones will make)—but the tennis example shows its strengths (decent timing of hit sounds, eerily accurate acoustics for the large internal space). I'm very excited to see how this improves a few more papers down the line!
There were a lot of shortcomings.
- The woman playing what I think was an Erhu[1] seemed to be imitating traditional music played by that instrument, but really badly (it sounded much more like a human voice than the actual instrument does). Also, I'm not even sure if it was able to tell which instrument it was, or if it was picking up on other cues from the video (which could be problematic, e.g. if it profiles people based on their race and attire)
- Most of the sound was pretty delayed from the visual cues. Not sure why
- The nature sounds were pretty muddy
- (I realize this is from video to music, but) the video with pumping upbeat music set to the text "Maddox White witnessed his father getting butchered by the Capo of the Italian mob" was almost comically out of touch with the source
Nevertheless, it's an interesting demo and highlights more applications for AI which I'm expecting we'll see massive improvements in over the next few years! So despite the shortcomings I agree it's still quite impressive.
[1] https://en.wikipedia.org/wiki/Erhu
Really, the next big leap is something that gives me more meaningful artistic control over these systems.
It's usually "generate a few, one of them is not terrible, none are exactly what I wanted", then modify the prompt, wait an hour or so ...
The workflow reminds me of programming 30 years ago - you did something, then waited for the compile, see if it worked, tried something else...
All you've got are a few crude tools and a bit of grit and patience.
On the i2v tools I've found that if I modify the input to make the contrast sharper, the shapes more discrete, and the object easier to segment, then I get better results. I wonder if there are hacks like that here.
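Concretely, the kind of input pre-conditioning I mean is roughly this (a minimal Pillow sketch, not anything the model authors suggest; the enhancement factors and the precondition_for_i2v name are just placeholders you'd tune per clip):

    # Rough sketch: bump contrast, saturation and edge sharpness on the
    # conditioning image before handing it to the i2v model.
    from PIL import Image, ImageEnhance, ImageFilter

    def precondition_for_i2v(path, out_path):
        img = Image.open(path).convert("RGB")
        img = ImageEnhance.Contrast(img).enhance(1.4)   # sharper contrast
        img = ImageEnhance.Color(img).enhance(1.2)      # stronger, easier-to-segment shapes
        img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150))  # crisper edges
        img.save(out_path)

    precondition_for_i2v("frame.png", "frame_sharp.png")

Then the sharpened frame goes to the i2v model instead of the raw one. Whether an analogous trick (say, cleaning up the clip or isolating the on-screen action) helps a video-to-audio model like this one is an open question.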
> The workflow reminds me of programming 30 years ago - you did something, then waited for the compile, see if it worked, tried something else...
Well sure... if your compiler was the equivalent of the Infinite Improbability Drive.
I assume you're referring to the classic positive/negative prompts that you had to attach to older SD 1.5 workflows. From the examples in the repo as well as the paper, it seems like AudioX was trained to accept relatively natural English using Qwen2.
No, I'm talking about pretty recent stuff. I was dealing with https://huggingface.co/bytedance-research/UNO and https://huggingface.co/HiDream-ai/HiDream-I1-Full earlier today.
These were both released this month.
What I'd like to see is some kind of i2i with multiple image inputs and guidance.
So I can roughly sketch, and I don't mean ControlNet or anything where I'm dealing with complex 3D characters, but give some kind of destination - and I don't mean the crude stuff that inpainting gives ... none of these things are what I'm talking about.
I'm familiar with the ComfyUI workflows and stay pretty on top of things. I've used the Krita and Photoshop plugins and even built a Civitai MCP server for bringing in models. AFAIK nobody else has done this yet.
None of these are hands-on in the right way.
Thanks for the links. I've added HiDream-I1 to the prompt adherence comparison chart. From my testing, it has adherence capabilities comparable to Flux.
https://genai-showdown.specr.net
Just my reading, not well separated from my own views, but: he wants a steam-powered paintbrush with a thousand buttons that only professionally trained artists are allowed to use and that does nothing in the hands of an average person. He's done with proxy manipulation through metadata such as fabricated museum captions, negative Danbooru tags, and lines painted over existing works. This is not exactly the problem definition made above, nor a fair description of the problem, and it's the opposite of "democratization" concepts, but I do believe that's what it is. It's also what Photoshop is, anyway. Untrained users can barely draw a smiley with Photoshop open on a Wacom display.
If I really think about it, it feels just weird to me that fabricated metadata is supposed to be enough to yield art. Metadata by definition does not contain the data; it's management data artificially made to be as disconnected from the data as possible. The connections left between the two are basically unsafe side effects.
I wish OpenAI and its followers would quit setting bridges ablaze left and right, though I know that's a tall order.
Here's some surreal art I made today as an example: https://9ol.es/dress.mp4 ... this was UNO/Wan/Kdenlive via Pinokio for the first 2 ... there's AI slop, and then there's AI as an interesting new medium for exploring the strange ... that's what I want to do more of.
The song is https://www.youtube.com/watch?v=6K2U6SuVk5s
Yeah, we all have our own workflows. For me, I usually have a very specific visual concept in mind (which I will block out roughly on graph paper if need be). I can usually get to where I want to go with a combination of inpainting and various types of controlnets.
Like this: (Created by noted thespian Gymnos Henson)
https://specularrealms.com/wp-content/uploads/2024/11/Gorgon...
Does time pass slower on the English Internet?
That "pseudo-human laughter" gave me some real chills; didn't realize uncanny valley for audio is a real thing but damn...
I regularly use AI music services to build rock songs out of my lyrics, old poetry, popular songs, etc., and sometimes they hallucinate in creepy ways, adding after the song ends either evil laughter, horror sounds, demon-like voices, or singing in completely made-up languages. They're creepy but fun and interesting at the same time. Creepy sounds aside, I've had a lot of fun experimenting with AI music hallucinations, as they sometimes create interesting and unusual things that spark more creativity (I'm already a musician); I sometimes felt like someone who grew up listening only to bad pop trash being suddenly exposed to Frank Zappa.
Sometimes when I lie awake at night I wonder what it is about things that are "almost human" that terrifies so many of us so deeply.
It's like the markings on the back of tigers' heads that simulate eyes to prevent predators from attacking them. I'm sure there used to be something that tigers benefited from having this defense against, enough for it to survive encoding into their DNA, right?
So, what was it that encoded this fear response into us?
Dead things, and behaviors that don't align with our predictive models, shift the context to one of threat - if something shaped like something you understand starts behaving in a way that you no longer understand, you'll become progressively more concerned. If a pencil started rolling around aggressively chasing you, it'd evoke fear, even though you'd probably defend yourself fairly capably.
If enough predictive models are broken, people feel like they've gone crazy - various drugs and experiments demonstrate a lot of these factors.
The interesting thing about uncanny valley is that the stimuli are on a threshold, and humans are really good at picking up tiny violations of those expectations, which translates to unease or fear.
Corpses. Bodies not dressed as deceased imply the existence of a threat.
Other hominids as well as visibly diseased humans.
I was a bit disappointed, even though there is no reason I should expect much in this space:
- Tennis clip => ball is strongly unsynced with hit
- Dark-mood beach video, no one on the screen => very upbeat audio mood, lots of laughter as if it were summer on a busy beach
- Music inpainting completely switching the style of the audio (e.g. on the siren)
- "Electronic music with some buildup" => the generation just turns the volume up?
I guess we still have some road to cover, but it feels like early image generation, with its mangled hands and off visual features. At least the generations are not complete nonsense.
Audio, but not Speech, right?