Any plans for sound recognition AI in AutoTagger?

Started by monochrome, August 18, 2025, 11:43:46 PM


monochrome

I know that most people here use IMatch for photos - me too.

I've had some success with Mistral's Voxtral to describe field recordings - especially if you guide it by sending a description of an image that you took while recording the sound.

I took a picture of the scene with my camera while recording, then set that image as the cover image of the FLAC file containing the sound. Sending just the sound to Voxtral gave mediocre results, but when I sent the photo to Pixtral first and then included the generated description in the prompt to Voxtral, the results were much better.
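Roughly, the two-step flow looks like this in Python. This is only a sketch: `describe_image` and `describe_audio` are hypothetical placeholders for whatever call sends a file plus a prompt to Pixtral and Voxtral. They are not real Mistral or IMatch functions, and the prompt wording is just an example.

```python
from typing import Callable

# Hypothetical stand-in type: takes raw file bytes and a text prompt and
# returns the model's text reply. Wire it to whatever client you actually use.
DescribeFn = Callable[[bytes, str], str]

BASE_AUDIO_PROMPT = "Describe the sounds in this field recording."


def build_audio_prompt(scene_description: str) -> str:
    """Compose the audio prompt, embedding the image description as context."""
    scene = scene_description.strip()
    if not scene:
        # No usable image description: fall back to the unguided prompt.
        return BASE_AUDIO_PROMPT
    return (
        f"{BASE_AUDIO_PROMPT} A photo taken at the recording location "
        f'was described as: "{scene}". Use this as context, but only '
        "describe what is audible."
    )


def describe_recording(
    image: bytes,
    audio: bytes,
    describe_image: DescribeFn,
    describe_audio: DescribeFn,
) -> str:
    """Step 1: describe the cover image. Step 2: use that to guide the audio model."""
    scene = describe_image(image, "Describe this scene in two or three sentences.")
    return describe_audio(audio, build_audio_prompt(scene))
```

The only real logic is in `build_audio_prompt`: the image description goes into the audio prompt as context, and the two model calls stay swappable.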

(On a side note, I wonder if you could do it in reverse for movies - use Voxtral to describe any speech or sounds, and then use that information to guide the image recognition. With IMatch's variables you could probably set up a pipeline of AutoTagger AIs.)