Some experiences from using the new OpenAI 4.1 and Google Gemini 2.0 Flash models

Started by Stenis, May 07, 2025, 02:30:51 AM


Stenis

The older OpenAI 4.0 models and Gemma 3 models with AutoTagger

At first I was thrilled just to see this AI-supported texting of Description and Keywords actually working, but I soon realized there were quite a few challenges in getting the models to follow my intentions via the three field-specific prompts and the general fourth prompt. It was also very easy to spot some limitations and peculiarities:

OpenAI GPT-4.0 Mini was OK in general and reasonably fast (maybe 4 seconds per picture), and it also managed to keep the mandatory structure I wanted on the first line.

A Description example: Marrakesh Morocco 2025 - In the Medina - (Autotagger text)
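For anyone curious how such a mandatory first line can be requested outside of IMatch, here is a minimal sketch against the OpenAI API (the model name, file name and prompt wording are illustrative assumptions on my part, not IMatch's actual internals):

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Any JPEG works here; the file name is just an example.
with open("medina.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# The fixed prefix mirrors the mandatory first-line structure above.
prompt = ("Describe this photo. Start the description with exactly "
          "'Marrakesh Morocco 2025 - In the Medina - ' and then add "
          "two or three factual sentences.")

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + image_b64}},
        ],
    }],
)
print(response.choices[0].message.content)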

BUT, it wasn't at all usable on my safari pictures, where I wanted it to add "Species", "Family" and "Scientific Latin name". It was in fact hopeless: it mixed up Thomson's gazelles with Grant's gazelles, missed dik-dik antelopes and waterbucks completely, and confused wildebeests with buffaloes and everything in between.

BUT the bigger OpenAI GPT-4.0 nailed almost every species I threw at it, at a substantially higher cost. At the time I had no problem accepting that, because it simply had to be correct.

I also tried Gemma 3 (the 4 GB model) with Ollama and I liked how Gemma treated the language. It wrote more vividly and better than these earlier OpenAI models, BUT:

- First, it was completely unaware of the "mandatory" demands I had for the first line of text, which for me was a dealbreaker.
- Second, it was as bad as GPT-4.0 Mini at nailing the species.


Then we got an upgrade of the OpenAI models, and we got Google Gemini 2.0 Flash and 2.0 Flash Lite, and that changed everything, because both mostly lived up to what OpenAI and Google had promised in their marketing news.

Both OpenAI GPT-4.1 Mini and Gemini 2.0 Flash Lite were much better at nailing the species I threw at them, getting almost 100% right, but there they part ways, because GPT-4.1 Mini is much slower than Gemini 2.0 Flash Lite.

Gemini 2.0 Flash Lite is now my choice because it keeps the mandatory structure I demand, writes really well, identifies species really well, and even identifies landmarks in a way that is often nothing short of astonishing. On top of that it is really fast and efficient, compared with both the older OpenAI 4.0 models and the new ones, and I think it is also cheaper to run than the competition. What this extra speed has done for my workflow's efficiency and my productivity cannot be overestimated; my wait states are almost gone compared to how it was before. So what Google has managed to do with Gemini 2.0 Flash Lite is to lift the quality of the output, increase the speed, and lower the price of the service, all at once. That will be hard to beat.

Together with the new IMatch implementation of the Gemini 2.0 Flash models, this is a true game changer, and it lifts metadata maintenance to a whole new level compared to how it was just a couple of weeks ago. It is hard to overestimate the importance of speed for productivity, and this well-designed AutoTagger interface fits right into the fast workflows Gemini 2.0 Flash has now made possible for all of us.

BUT there are still a few blemishes! So far I have seen a few problems, and the worst is that it sometimes skips updating files, especially RAW files. Sometimes it is enough to just update them again, but I have seen cases where I had to use the bigger Gemini 2.0 Flash model instead to solve it. These cases are easy to spot, since they lack the yellow pen symbol after the update.

Still, all of them initially hang for me and have to be released with the famous three-finger grip Shift-Alt-S.

Note: With the OpenAI models you also have to increase the "Rate Limits" for each model, so that they match the performance limits in the spec sheets at OpenAI.
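To illustrate what such a requests-per-minute limit means in practice, here is a tiny client-side throttle sketch (the RPM figure is made up; the real per-model limits are in OpenAI's documentation, and IMatch applies its own Rate Limits settings internally):

import time

class RpmThrottle:
    """Spaces out calls so they never exceed a requests-per-minute budget."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm  # seconds between two requests
        self.last_call = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

throttle = RpmThrottle(rpm=500)  # hypothetical 500 requests/minute tier
# for image in images:           # 'images' and 'tag_image' are placeholders
#     throttle.wait()
#     tag_image(image)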







Stenis

Sorry for the spelling! I must have been tired :-)
I wanted to correct it but it seems closed for editing now.

I will add a few examples later so you get a hint about how good these improvements in Gemini 2.0 Flash Lite really are.

Jingo

I have also been playing with the Gemini Lite models for my bird photography and it is EXCELLENT at identifying and describing the birds. The ability to take the AI keywords, map them to the thesaurus and file only the hierarchical keywords is a big time saver for me. Game changer.... Well done Mario!

Example Images fed in (keywords were not present already):

[attachment: 9aj4w7RD6n.png]

Results:
[attachment: qkMMjbQxJ0.png]


A smaller, harder-to-identify bird:
[attachment: IMatch2025x64_GFxwxZMSly.png]

Results:
[attachment: IMatch2025x64_23Mafq45IU.png]

One more for good measure... looks like a Willet, right? But notice the legs are yellow and not blue/green... will the AI get it right?

[attachment: IMatch2025x64_hgcqC3triS.png]

Bingo! Amazing...

AI|Wildlife; Animals|Bird; Aquatic; Grey; Habitat; Long Beak; Migration; Objects|Water; Ornithology; Places|Nature; Shorebird; Shoreline; Spotted; Wading; Yellow Legs

The bird in the image appears to be a Lesser Yellowlegs (Tringa flavipes). This medium-sized shorebird is characterized by its long, slender, bright yellow legs and a slightly upturned, straight black beak. It has a greyish-brown back and wings, with a white belly.  Lesser Yellowlegs are commonly found in wetlands, marshes, and along coastlines, particularly during migration. They primarily feed on insects, small crustaceans, and aquatic invertebrates, which they forage for in shallow water. During the breeding season, they nest in boreal forests of North America. They are known for their call, a distinctive series of two to four clear, whistled notes.
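For anyone wondering what that keyword-to-thesaurus mapping amounts to conceptually, here is a tiny sketch (the thesaurus entries are invented for illustration; IMatch does this internally against its own Thesaurus):

# Flat AI keywords are looked up in a thesaurus so that only the
# hierarchical form gets filed; unmapped terms can be reviewed manually.
thesaurus = {
    "Shorebird": "Animals|Bird|Shorebird",
    "Water": "Objects|Water",
    "Nature": "Places|Nature",
}

ai_keywords = ["Shorebird", "Water", "Nature", "Migration"]

hierarchical = [thesaurus[k] for k in ai_keywords if k in thesaurus]
unmapped = [k for k in ai_keywords if k not in thesaurus]
print(hierarchical)  # ['Animals|Bird|Shorebird', 'Objects|Water', 'Places|Nature']
print(unmapped)      # ['Migration']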

Mario

Impressive. Keep in mind to switch to a larger image size when the object in the image is quite small - this gives the AI more pixels to work with. Costs a bit more, though.
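As a rough sketch of that trade-off (the pixel sizes are example values only; in IMatch the rendition size is an AutoTagger setting):

from PIL import Image  # Pillow

def rendition(path: str, long_edge: int) -> Image.Image:
    """Return the image scaled down so its long edge fits the given size."""
    img = Image.open(path)
    img.thumbnail((long_edge, long_edge))  # keeps the aspect ratio
    return img

small = rendition("willet.jpg", 512)   # cheaper: fewer pixels for the AI
large = rendition("willet.jpg", 1536)  # costs more, but a small bird keeps detail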

Stenis

Thanks Jingo for those nice examples, which confirm my own experiences using Gemini 2.0 Flash Lite. For me this is a double improvement: no need to use the bigger models for jobs like this, which I was forced to do with OpenAI GPT-4.0. It gets cheaper and is also so much faster.

It seems like Gemini is far better at utilizing threads and performing efficient parallel processing than Gemma and OpenAI are and were.

Thanks for the input about the image size, Mario. I have also noticed, though, that some pictures with animals at a distance that the earlier generation missed are actually handled perfectly now by Gemini Lite. It is a very big difference, and the new Gemini models perform in a way that makes them much more reliable than the older models.

So Mario, your implementation of Gemini 2.0 Flash is a fantastic selling point for the whole IMatch platform that you didn't have a couple of weeks ago. I remember you were a little sceptical at first :-) and Jingo is so right in his comments on the keyword automation and on how the thesaurus can be optimized to solve a keyword problem many other systems have no solution for.

It looks like most things around the AI automation of Description texts, Keywords and thesaurus maintenance just fell into place together with the IMatch AutoTagger.

Stenis

I think I have to revise some of my early reflections on the performance of Gemini 2.0 Flash Lite and OpenAI GPT-4.1 Mini. In my first test Gemini was really much faster, but that might have been because OpenAI Mini was choked by the default "Rate Limits" settings (AI Rate Limits). I have since adjusted them, and now even OpenAI GPT-4.1 Mini really flies.

I clocked a batch of 260 pictures that got updated with different text on every picture, with a lot of my fixed restrictions in all three tag prompts of IMatch, and all 260 were finished in 1 minute and 45 seconds! That is about 2.5 pictures per second.

So, I have switched back to OpenAI after seeing that Gemini 2.0 Flash Lite sometimes failed to add texts, especially to my RAW files. I don't really know why yet. I also feel that OpenAI writes my texts a little more to my liking - they are more structured. I'm really impressed by the results. Tonight I might have processed close to 100 pictures from very different sessions: some old digitized color slides from 1976 taken in Nepal, and some more recent ones from gymnastics competitions my grandchildren have participated in, where I would have struggled to get all the gymnastics terminology correct myself.

Another thing is that it has been running through these images without any hanging issues. I don't know if that has anything to do with the adjustment of the "Rate Limits", but it has been working very well now. I wonder if others have any experiences with the "Rate Limits".

Stenis

"Tonight, I might have processed close to 100 pictures from very different sessions."

That should of course be 1000.

With the speed I have just experienced with OpenAI 4.1 Mini after adjusting the "Rate Limits", it shouldn't take even 7 minutes to process Descriptions and Keywords for 1000 files. I can't even imagine how many hours it would have taken me to write texts as good as the ones I get now with AutoTagger and the OpenAI GPT-4.1 Mini API myself.

This morning I have also studied a few very interesting videos about the improvements to OpenAI ChatGPT, so I think I also have to download and study the ChatGPT user manual in depth, to gain better knowledge of prompting and of the various models, tools and settings one has to handle to get that system to open up for me.

This is so fantastically interesting.

sinus

Quote from: Stenis on May 28, 2025, 12:29:16 PM
With the speed I have just experienced with OpenAI 4.1 Mini after adjusting the "Rate Limits", it shouldn't take even 7 minutes to process Descriptions and Keywords for 1000 files. I can't even imagine how many hours it would have taken me to write texts as good as the ones I get now with AutoTagger and the OpenAI GPT-4.1 Mini API myself.
...
This is so fantastically interesting.

Very interesting, indeed.
But at the moment, I think, if you want correct descriptions and keywords, you have to check them all. And that time you should add to your 7 minutes.

Of course, if you are aware that there may be some errors or not-so-good sentences in the description, you could leave it and simply check/correct it when you send them to another person, or simply when you have time.

AI is very fascinating, yep, but still at the beginning, I think. 
Best wishes from Switzerland! :-)
Markus

Mario

Quote from: sinus on May 28, 2025, 01:05:37 PM
AI is very fascinating, yep, but still at the beginning, I think.
I think AIs are already good enough for most photographers' workflows.

They can read text in images, facial expressions, racing numbers, runner numbers, etc. They recognize most tourist spots, landmarks, important buildings, car brands and types, boat types, fashion styles, clothing names, food and drink, animals and insects, sports, products on shelves, and can interpret charts among many other things.

The latest models understand context and "the style I want you to return your description in" quite well. This allows for fine-tuning of length, style, order of elements, keywords, etc.

Of course, there are some variations between models from OpenAI/Google/Mistral. However, these differences evolve with each new model generation.

Locally running models (privacy-friendly and free) aren't as powerful as massive cloud-based models. But even local models (like Google's Gemma 3) are more than sufficient for most purposes—I use it all the time.

Humans still have an advantage in edge cases or when deep contextual knowledge and expertise are required, such as scientific documentation, brand identity, etc.

But these situations are really just edge cases.

Creating keywords, descriptions, headlines, and maybe some traits for the kind of photos most IMatch users produce is pretty much a solved problem. If you have not tried the latest OpenAI / Google models, or the Gemma 3 model via Ollama, give it a try. You might be surprised.

Note: The prompt is often the key. A good prompt produces good results. Finding a good prompt may take a bit of experimenting.
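For example, a prompt along these lines (purely illustrative, not a recommended default) pins down both the structure and the keyword rules discussed in this thread:

Describe this image for a photo archive. Begin with one line of the form
'<City> <Country> <Year> - <Location> - ', then add two or three factual
sentences. Also return at most eight keywords: single words, in singular
form, separated by semicolons.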

Stenis

Quote from: sinus on May 28, 2025, 01:05:37 PM
Quote from: Stenis on May 28, 2025, 12:29:16 PM
With the speed I have just experienced with OpenAI 4.1 Mini after adjusting the "Rate Limits", it shouldn't take even 7 minutes to process Descriptions and Keywords for 1000 files. I can't even imagine how many hours it would have taken me to write texts as good as the ones I get now with AutoTagger and the OpenAI GPT-4.1 Mini API myself.
...
This is so fantastically interesting.


Very interesting, indeed.
But at the moment, I think, if you want correct descriptions and keywords, you have to check them all. And that time you should add to your 7 minutes.

Of course, if you are aware that there may be some errors or not-so-good sentences in the description, you could leave it and simply check/correct it when you send them to another person, or simply when you have time.

AI is very fascinating, yep, but still at the beginning, I think.

Absolutely, but there are really fewer errors than I expected. For example, if I have been lazy and not confirmed people's identities or written their names, my mandatory first-line text can be too general when there are people in the motifs. But if faces are confirmed, the texts often turn out surprisingly well written.

If there are signs in the pictures, even in Arabic, Hebrew, Norwegian or Russian for example, they get correctly translated into English, if that happens to be the language used.

It now identifies both animals and landmarks correctly almost 100% of the time with the new version 4.1, even with the "Mini" version, which is a huge improvement over the old 4.0 version. Before 4.1 I HAD to use the full, bigger model to identify animals correctly. Now the new 4.1 Mini model handles this very well.

Even though I really believe in the importance of "context" for images (after working seven years at the City Museum of Stockholm), I often felt that I just couldn't justify spending too many sentences on every picture when writing them manually, with tens of thousands of them to handle (and at my age, my days are numbered). For that reason the auto-generated texts often end up better and more informative than the ones I write myself, even though I know how to write informative texts.

Before AutoTagger I might have used Photo Mechanic to batch a common text onto a whole bunch of similar pictures, as I saw I had done previously on some pictures from the famous Frogner Park in Oslo, which is full of statues by the famous Norwegian sculptor Gustav Vigeland. With AutoTagger I got texts that were really more relevant than the more generic texts I had written before. So there are different ups and downs in a realistic comparison of the quality of these texts. With the AutoTagger texts I got descriptions that really reflected the unique expressions of all these statues, which was a great improvement over a simpler, more general text.

I'm also an old teacher and languages are important to me, and I really think that especially the OpenAI APIs write surprisingly well.

Stenis

[attachment: Frogner skulpturer.jpg]

Here are some text examples:

Pic 1:
Oslo Norway 2011 - Frogner Park with statues in granite by the famous Axel Vigeland - This image shows three granite statues of human figures sitting and crouching together. These sculptures are part of the Vigeland installation in Frogner Park, Oslo. The statues display detailed human anatomy and express different emotions. The stone material and realistic style highlight the human condition.

Keywords: Art, Figure, Human, Sculpture, Statue, Stone

Pic 2:
Oslo Norway 2011 - Frogner Park with statues in granite by the famous Axel Vigeland - This image shows a stone statue of an adult bending over with two children holding onto the legs. The statue is covered with snow on the upper back and head. The scene is captured at night with snow on the ground. The sculpture is part of a public art display, likely representing family or care.

Keywords: Adult, Art, Child, Family, Night, Snow, Statue

Pic 3:
Oslo Norway 2011 - Frogner Park with statues in granite by the famous Axel Vigeland - A stone sculpture of two people embracing ((a child) my correction) is partially covered with snow. The scene is set at night, showing the detailed texture of the stone and snow. This artwork reflects human form and deep emotion through careful craftsmanship.

Keywords: Human, Night, Sculpture, Snow, Statue, Stone

Pic 4:
Oslo Norway 2011 - Frogner Park with statues in granite by the famous Axel Vigeland - This image shows a stone sculpture pillar composed of numerous intertwined human figures carved in relief. The figures are mostly nude and depicted in various poses, forming a tall, cylindrical shape against a dark night sky. This is part of the Vigeland installation ((called the Monolith) my correction), a famous artwork known for its detailed and expressive human forms.

Keywords: Artwork, Figure, Human, Pillar, Sculpture, Stone

I think this is surprisingly good: much better and more informative than my own older, more rudimentary, shorter and more general texts.

The snow might have caused the first problem: the figures do embrace, but not just each other - mainly a child. The AI might have got that right too without the snow.

I think this is really great!
I think the latest upgrades of Google's and OpenAI's models have been a real game changer.
When I do correct texts, it is mostly because the AI just can't decode roles and relations between the people in the motifs.
I even know some people who have problems with that :-)

The keywords are fine by me. I want a flat structure with keywords made up of single words in singular form, and I have prompted for that, so I get results like these with very few problems. Just one problem comes to mind: for some reason the word Market was spelled in capitals, like MARKET, which I have corrected in the Thesaurus, so I guess even that is OK now. Otherwise the auto-keywording works without giving me any problems at all. I have limited the keywords to eight and set them to be matched against the Thesaurus, so I have a sort of fixed vocabulary after having processed maybe 10,000 pictures so far.


Stenis

So, I can just confirm what Mario writes.
Autotagging Descriptions and Keywords is very much a solved problem already today.
There is very little drama and few problems around autotagging.
This is pretty amazing since it is just a couple of years this tech has been available.

... and the formidable thing is that as long as Mario keeps adding support for new, even better models, it will get even better in the future.



Jingo

I have been very impressed with the hit rate for my bird photography... but it is not always correct, of course. I'm going to go back and see what it can do for some of my vacation photos now... it's been a while since I tagged those photos, and if I can get some general keywords on them, that will save a ton of time.

Curious: what is the overall consensus on the top AIs at the moment (perhaps by category)?

Mario

Quote
Curious: what is the overall consensus on the top AIs at the moment (perhaps by category)?
There is no real answer to that. There are AI benchmarks which evaluate and compare AIs from OpenAI, Google, Anthropic, DeepSeek, Mistral and others in various areas, from math to logic, argumentation, text generation etc.

Google for AI benchmark to find them.

The "Create description and keywords" we're using in AutoTagger is merely a side show and not part of any benchmark I know. The latest models are all hybrid, mixed expert and reasoning models, more suited for math, logic, reasoning, chat.  No improvements for the part we're interested in for AutoTagger. They may be even as good as it ever gets, for that specific purpose.

The latest Gemini models (2.0 is available in IMatch, 2.5 is currently in "Preview" mode at Google) and the 4.1 generation of OpenAI (available in IMatch) are very, very good. A big step upwards from the previous generation (which AutoTagger still supports, so you can make your own comparison).

Locally running models like Gemma 3 are privacy-friendly and free, but are (currently) no match for the big models: 12 billion parameters (which already requires a graphics card with 16 GB of VRAM) compared to the 500 billion parameters of models running in the cloud.

BUT, the bigger models provide more details, know more landmarks and such. But if all you want is a description of an image and some matching (hierarchical) keywords, running Gemma 3 locally in Ollama or LM Studio for AutoTagger is already more than sufficient.
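If you want to try that outside of IMatch, captioning an image with a local Gemma 3 through Ollama's REST API looks roughly like this (the model tag and file name are assumptions; pull the model first with "ollama pull gemma3:12b"):

import base64, json, urllib.request

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = json.dumps({
    "model": "gemma3:12b",     # local multimodal model served by Ollama
    "prompt": "Describe this photo in two factual sentences.",
    "images": [image_b64],     # Ollama accepts base64-encoded images
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])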

Stenis

Quote from: Jingo on May 29, 2025, 12:40:35 PM
I have been very impressed with the hit rate for my bird photography... but it is not always correct, of course. I'm going to go back and see what it can do for some of my vacation photos now... it's been a while since I tagged those photos, and if I can get some general keywords on them, that will save a ton of time.

Curious: what is the overall consensus on the top AIs at the moment (perhaps by category)?

I have also found a strong desire to go back now and reprocess most of my older metadata work, because the AI Description texts really seem to turn out better and more consistent than many I have written earlier. Pretty unexpected, but true in my case. I did not see that coming :-)

I'm pretty sure there isn't any real consensus, since we approach the task from very different angles and with very different needs.

Some want it cheap, or don't want to support the big American players, and try to run everything locally under Ollama or something else. Some prioritize speed and productivity, and others have specific quality needs that not every solution can meet. There are also differences in the providers' business models to consider. I feel confident with the OpenAI solution, where I can prepay a suitable amount and don't need to open up for continuous, uncontrolled withdrawals from my card. In the Google Gemini case they give you credit and bill you afterwards. I prefer the OpenAI way because I don't like surprises.

That said, this doesn't seem to be much of an issue, since it is really affordable in both cases, as far as I have been able to see when using the Lite or Mini models.

I don't believe in a "one size fits all" approach to this.

I would never recommend OpenAI GPT-4.1 Mini to everyone in general just because it fits my needs pretty well, but I am very satisfied with the results I get now and the overall productivity gains I have seen. Since so much of the various systems' performance depends deeply on our own prompt engineering, everyone will have to find their own way to a great extent, and it is very fortunate that the configuration options in AutoTagger are so flexible.