Any plans for sound recognition AI in AutoTagger?

Started by monochrome, August 18, 2025, 11:43:46 PM

monochrome

I know that most people here use IMatch for photos - me too.

I've had some success with Mistral's Voxtral to describe field recordings - especially if you guide it by sending a description of an image that you took while recording the sound.

I would use my camera to take a picture of the scene when I was recording the sound, then set that image as the cover image of the FLAC file with the sound. Just sending the sound to Voxtral gave mediocre results, but if I sent the photo to Pixtral, and then included the generated description in the prompt to Voxtral, the result would be much better.
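
A rough Python sketch of that two-step flow, in case it helps anyone picture it. The model calls are left as placeholder functions: `describe_image` and `describe_audio` are hypothetical stand-ins for whatever client you use to reach Pixtral and Voxtral, not real Mistral API calls, and the prompt wording is just an example.

```python
from typing import Callable


def build_audio_prompt(image_description: str) -> str:
    """Combine the image description with the instruction for the audio model."""
    return (
        "Describe the sounds in this field recording. "
        "A photo taken at the recording location was described as follows:\n"
        f"{image_description.strip()}"
    )


def describe_recording(
    image: bytes,
    audio: bytes,
    describe_image: Callable[[bytes], str],       # hypothetical: wraps the image model
    describe_audio: Callable[[bytes, str], str],  # hypothetical: wraps the audio model
) -> str:
    """Step 1: describe the photo. Step 2: use that text to guide the audio description."""
    scene = describe_image(image)
    prompt = build_audio_prompt(scene)
    return describe_audio(audio, prompt)
```

The two callables are passed in so the same flow works with any provider, and so it can be tried with dummy functions before wiring up real models.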

(On a side note, I wonder if you could do it in reverse for movies - use Voxtral to describe any speech or sounds, and then use that information to guide the image recognition. With IMatch's variables you could probably set up a pipeline of AutoTagger AIs.)

Nonmons

Interesting approach using Voxtral with photos as context. It makes sense that combining image + sound metadata could improve AutoTagger results. I wonder if in future updates, support for direct audio tagging (like detecting ambient sounds, voices, or instruments) could be added alongside image recognition. That would make the workflow even more powerful.

bathtubthis

Interesting idea! As far as I know, AutoTagger currently focuses on image recognition only, and there are no plans for sound recognition AI yet. Your workaround using Voxtral with supporting images sounds like a clever approach though, and combining multiple AIs in a pipeline could definitely enhance results.

Mario

No plans to integrate speech recognition / transcription into AutoTagger at this point. I think this is rather a niche "problem".
Feel free to add a feature request so other users can see it, comment on it, like it.

As far as I know, neither Ollama nor LM Studio currently supports speech-to-text, so this would require IMatch to run yet another AI like Whisper or whatever is the hot thing this week. I think the cloud-based AI services all support speech-to-text in one way or another. But that's probably a development minefield for me, starting with things like extracting speech from videos, dealing with audio files (all involving software patents and restrictive license conditions) etc.