SQLite Vector Search for Natural Language Queries

Started by monochrome, August 15, 2025, 04:15:17 PM

monochrome

IMatch's new AI descriptions and keywords are simply amazing. Finally I have all my 100k+ photos roughly organized. Not perfectly, but enough for it to be very useful. One thing I found annoying though is that when searching you have to match the word in the description exactly. Like, you can't search for "car" when your AI God decided to use "vehicle".

So I thought I'd play around with two things:

1. There is a locally runnable model (80MB) called "all-MiniLM-L6-v2" that can turn a sentence into an LLM embedding - that is, a vector in "semantic space" where sentences that have the same meaning end up close to each other.
2. There is a vector search extension to SQLite (sqlite-vec) that enables you to quickly find all vectors that are close to a given vector.

The plan was to (1) take all AI image descriptions I have (2) turn them into embedding vectors (3) stuff it all into SQLite-Vec and (4) use that for natural language queries.

It worked. It's only a proof-of-concept so far, but I could now search with phrases like "something that symbolizes childhood" and get a list of photos with my daughter playing, or "everyday life" and get photos matching that concept, even though there were no keyword or word matches.
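
For anyone who wants to reproduce this, the whole pipeline fits in a short script. A minimal sketch, assuming the sentence-transformers and sqlite-vec Python packages are installed; the table name and sample descriptions here are invented:

```python
# Minimal sketch of the four steps above. Assumes:
#   pip install sentence-transformers sqlite-vec
# Table name and sample descriptions are made up for illustration.
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80 MB, 384-dim vectors

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)                    # load the vector search extension
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE vec_desc USING vec0(embedding float[384])")

# Steps (1)-(3): embed each AI description and stuff it into sqlite-vec.
descriptions = {
    1: "A girl plays with a red ball in the garden",
    2: "A vintage vehicle parked on a cobblestone street",
}
for file_id, text in descriptions.items():
    db.execute("INSERT INTO vec_desc(rowid, embedding) VALUES (?, ?)",
               (file_id, serialize_float32(model.encode(text).tolist())))

# Step (4): a natural language query returns the nearest neighbours.
q = serialize_float32(model.encode("something that symbolizes childhood").tolist())
print(db.execute("SELECT rowid, distance FROM vec_desc "
                 "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
                 (q,)).fetchall())
```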

I don't know where I'm going with this but I think this could be a cool File Window App, or even better a pluggable search provider if such a thing can be realized.

Mario

I have had something like this on my to-do list for a while. Just not sure how much demand there is for something like this.

Quote: Like, you can't search for "car" when your AI God decided to use "vehicle".

You can map keywords returned by the service to the ones you prefer. And use synonyms to add both car and vehicle keywords, independent from which one the AI produces.

For searching, use a regexp like (car.*|vehic.*) to find car, cars, vehicle, vehicles.
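
In plain regex terms, the idea looks like this (a Python illustration of the pattern; IMatch's own regexp dialect may differ in details):

```python
import re

# One alternation with a trailing wildcard covers singular and plural forms.
pattern = re.compile(r"\b(car|vehic)\w*", re.IGNORECASE)

print([bool(pattern.search(s))
       for s in ("Two cars", "A red vehicle", "A bicycle")])
# -> [True, True, False]
```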

Jingo

I've wanted to extend my Fancy Search app for some time with a "wizard"-like interface to allow for easily combined searching, like other programs have (XNViewMP, for example):

[Attachment: screenshot of XNViewMP's combined search interface]

I could see adding something like an "AI natural language" search engine as well.... just need to find the time to play around and do it!  :P

Mario

#3
I wonder how often such searches are needed by IMatch users?
You have the search bar which can do lots.
You have the Filter Panel, which can do lots.
I would guess that this is simply sufficient for almost all use cases.

Natural language queries are fine, I guess, until you get too many false positives or the demands on hardware are too high, e.g. when running AutoTagger and the embedding model at the same time with Ollama. Embedding models are usually a lot smaller than general models, though.

I could always extend the search bar (and allow for a popup to have more room for entering text) like

@tag:title contains "Car" AND (@tag:city is "London" OR @tag:city contains "Lon") to allow fine-grained per-tag and per-group searches. But there are not many (if any) requests for this.

And you can always combine two Metadata value filters to search in title and city, or select the city in the location category and search for the title in the search bar.

It all depends on what users want. I have many ideas for new features for the search bar, the Quick Filter and basically everything else. But I have learned not to do too much initially - rather ship an MVP and wait for user feedback. I would have guessed we would get a lot of feature requests for the Quick Filter, for example. But there has been only one FR so far, with zero likes.
Same for AutoTagger, Peek View etc. Many things that could be added...

mastodon

It would be great if you have not keyworded everything, because there are so many synonyms.

monochrome

Quote from: Mario on August 16, 2025, 12:28:26 PM: Just not sure how much demand there is for something like this.
That's the crux, isn't it? I really don't know because I don't know how useful it is. Might try it out with a small app - just dump the ImatchDB and have a small webapp that lets me search.


Quote: For searching, use a regexp like (car.*|vehic.*) to find car, cars, vehicle, vehicles.
I was really trying to get away from that. When you have a small controlled vocabulary it works, but when you have natural-language descriptions that have more semantic meaning than just a bag of words I don't see this scaling.

There are really two use cases here as I see it - you either have a controlled vocabulary in which case keyword searches as we do now are perfect, or you have natural language descriptions in which case I think the search engine has to follow?

I'd imagine most people with lots of metadata have a controlled vocabulary, because that's what we've had in our toolbox; with AI image descriptions that may change, but I really don't know.

Mario

Maybe we can better control the prompt, so the AI does not use vehicle or car or automobile or transport to describe the same thing? AutoTagger can do this for keywords already...

I've worked with embeddings and vector databases and it has potential for some situations.

I could cook up a feature for IMatch which produces embeddings from descriptions or keywords or, actually, any tag the user wants, and stores them inside the IMatch database. Or even in a separate database.

BUT, the problem is, this is all super-fast evolving technology.
The way embeddings are created varies widely and is model-dependent. The open source libraries you plan to use may become abandoned tomorrow. The vector extensions for SQLite may be abandoned or replaced by something incompatible tomorrow. Also the model may become stale.

This results in a large technical debt.
Perfectly OK for your own hobby project or app. Not so much in the IMatch context.

The beauty of the AutoTagger design is that it is totally free of any technical debt. If the API of Ollama, LM Studio, OpenAI etc. changes, I only need to update what I call a "connector" to interface with the new API and fetch keywords and texts.
IMatch databases store no model-specific content. Nothing depends on the actual model used, or the AI technology stack that produced the response.

Everything is done "in the moment" and the results produced by AutoTagger become regular keywords and tag values and are then independent from the AI service and model used to create them.

All this AI tech is fun, whether you use Python or Rust or C++. But when things become part of a long-lived project like IMatch, where I have to deal with maintenance and compatibility for years, things become a lot more complex.

Embeddings are model-specific, so which model to choose, and will it exist and be compatible in a year or three? This is true whether you use your own libraries to produce the embeddings, or rely on the corresponding features available in Ollama, LM Studio or the cloud-based AIs.

Is there a migration plan to migrate embeddings produced by model v1 to model v3? Or do they become worthless at some point in time? We can of course throw away the existing embeddings and re-create them for a different model when needed. That's a disruptive process, though. Also, for larger IMatch databases, this can be really time-consuming.

Which vector database system to choose for storing the embeddings? It must have a maintenance outlook of many years and bindings for C/C++ and C# so I can use it, not only Python, as is usual for AI libraries.

Again, I definitely see potential in this and I have played with embeddings for various purposes.
But there is much more involved than getting it to "run" with some pip and node.js commands for a hobby project. 
And, as I said, is there a real use case for IMatch and actual demand for this?

PandDLong

Quote from: monochrome on August 16, 2025, 07:01:45 PM: There are really two use cases here as I see it - you either have a controlled vocabulary in which case keyword searches as we do now are perfect, or you have natural language descriptions in which case I think the search engine has to follow?

I adhere to a controlled vocabulary in order for all of the search tools to work effectively and I am comfortable with building complex searches using operators.  It all works for me, because I know the tools I have in the toolkit and I accept there are limitations.  

However, I can see natural language capability being the next logical step: not only is AI a partial driver with auto-tagging, but natural language processing is also becoming quite mainstream. For many people, a controlled vocabulary is likely a foreign concept and challenging to apply (they are probably not IMatch users today).

Controlled vocabularies have limitations. If I have a picture with a convertible, I have to decide whether I want to add the term to my controlled vocabulary or just keyword it with the more generic 'car' or, even more generically, 'vehicle'. A controlled vocabulary that gets too large loses its effectiveness and becomes more challenging to apply correctly, but if it is too small, there are other issues. There is a balance that is needed, and it can be challenging to find and stick to.

Natural language processing has the potential to remove many of those limitations. It also simplifies having multiple contributors to a database where each uses slightly different semantics and language - AI has effectively done this even for people who currently manage their database alone, because AI amounts to an additional contributor.

In the fullness of time and development, perhaps only a description will be needed, and keywords will become redundant and fade into legacy.

Mario's points are very valid; such development has significant risk and can create technical debt, as this field is evolving rapidly.

I certainly have no expectation of IMatch delivering on such capability until the time is right.  I do appreciate seeing the forward thinking and this thread.

Michael



ben

Very interesting thread.

I think I am a user who uses both scenarios.
I have very strict keywords for certain things, but I also like to simply add metadata to "describe things in easy words". When I search for images, it's very easy to find those with strict keywords. The others I usually find quicker by simply going through my (also strict) folder structure. But it would be so much better to find them with natural language.
So, if there is a way for IMatch to evolve in this direction, I would be very happy with it.

Mario

#9
I've made a few more tests with my testbed this morning, using the top three embedding models available for Ollama.
I can generate embeddings from any tag and run queries against these embeddings. The query result is the n best-matching files for the search text.

Results are OK-ish, more or less, depending on what your expectations are.
Searching for "person" does not find images where the description contains "A stylish model is captured in an outdoor setting" or "A model poses in a black blazer". Finding these images would have been nice.

Update: When I use the "nomic-embed-text" model instead of "mxbai-embed-large", searching for "person" gives much better results, including files with descriptions such as "A young woman poses for", "A healthcare professional" or "A powerful wave crashes against the Phare de Brest lighthouse in France. A solitary figure stands on the rocks".

So, a lot of room for experiments and tests. 
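
A testbed like this can be sketched in a few lines of Python against Ollama's /api/embed endpoint (a sketch assuming Ollama runs locally on the default port; the sample descriptions are stand-ins from this post):

```python
# Sketch: compare how two Ollama embedding models rank descriptions
# for the query "person". Assumes a local Ollama on the default port.
import math
import requests

def embed(model: str, texts: list[str]) -> list[list[float]]:
    r = requests.post("http://localhost:11434/api/embed",
                      json={"model": model, "input": texts})
    r.raise_for_status()
    return r.json()["embeddings"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

descriptions = [
    "A stylish model is captured in an outdoor setting",
    "A young woman poses for the camera",
    "A powerful wave crashes against the Phare de Brest lighthouse",
]

for model in ("mxbai-embed-large", "nomic-embed-text"):
    vectors = embed(model, descriptions)
    query = embed(model, ["person"])[0]
    ranked = sorted(zip((cosine(query, v) for v in vectors), descriptions),
                    reverse=True)
    print(model)
    for score, text in ranked:
        print(f"  {score:.3f}  {text}")
```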

Problem is that there is no "in-process" vector database (like SQLite). There is an extension for SQLite (https://github.com/asg017/sqlite-vec), but it is apparently a one-person project (no problem with that), the last update was in January, and not much progress has been made since then. The original author is still responsive to questions and promises an update soon.

If this project is no longer maintained, the embeddings will still work, but may fail at some point in the future, when an SQLite update breaks the extension. Then something new must be found...

I could implement this as an experimental feature and, for the experiment, store the embeddings in a separate database, in the same folder as the IMatch database. This keeps the experimental data out of the IMatch database, so no "harm" can be done to the IMatch database.

Then an experimental feature that allows the user to create embeddings manually for the description tag. And when experimental features are enabled and the embeddings database exists, IMatch would offer an additional search mode for the File Window search bar "Natural Language Query" or something.

This would allow me to actually test this with some databases. And when it's good enough, ship it with an update so users can try it when they want and let me know what they think.

Luckily, the hardware requirements for embeddings are far lower than for vision models, so even users with a regular graphics card and 4 GB VRAM or less should be able to use this.

What do you think?

Oh, performance:

On my system, generating embeddings for 2,300 files takes about 120 seconds, so 8 to 9 minutes per 10,000 files.
This has to be done only once for each file, though.

And the embedding must be updated when the metadata changes.
And this requires Ollama to be running, or the embedding is stale and IMatch must somehow remember that and update the embedding the next time Ollama is found running.

monochrome

Quote from: Mario on August 17, 2025, 04:51:53 PM: What do you think?

I defer to you regarding product feature inclusions - first off, you have to maintain them; but you also know more about how people use this.

I do want to point you to SQLite-rembed, which seems to be a good place to start if you just want something up and running fast.

Mario

That's from the same author as the sqlite-vec extension I talked about. It also seems a bit abandoned.

I really wish the SQLite team would do their own vector extension with a decade of maintenance guarantee.
I already get query times for 100,000 embeddings in < 0.5 s. SQLite is good for this.

ben

Quote from: Mario on August 17, 2025, 04:51:53 PM: I could implement this as an experimental feature and, for the experiment, store the embeddings in a separate database, in the same folder as the IMatch database. This keeps the experimental data out of the IMatch database, so no "harm" can be done to the IMatch database.

This would allow me to actually test this with some databases. And when It's good enough, ship it with an update so users can try when they want and let me know what they think.

Luckily, the hardware requirements for embeddings are far less than for vision models, so even user with a regular graphic card and 4GB VRAM or less should be able to use this.

What do you think?

Sounds fantastic.
I would absolutely like to test it.

bekesizl

Quote from: Mario on August 17, 2025, 04:51:53 PM: I could implement this as an experimental feature and, for the experiment, store the embeddings in a separate database, in the same folder as the IMatch database. This keeps the experimental data out of the IMatch database, so no "harm" can be done to the IMatch database.

Then an experimental feature that allows the user to create embeddings manually for the description tag. And when experimental features are enabled and the embeddings database exists, IMatch would offer an additional search mode for the File Window search bar "Natural Language Query" or something.
Sounds like a reasonable compromise for a fantastic new feature.

Mario

I've had a bit of extra time to play with this.

First I got rid of external libraries. I try to keep IMatch's dependencies on 3rd party code minimal.

I've looked into the math needed for calculating the KNN and other vector measures and implemented it myself as extensions for SQLite. IMatch already uses a number of custom "functions" for SQLite, e.g. for face recognition, so I have some experience with this.

IMatch can now manage embedding vectors in normal tables, almost as fast, which avoids many of the potential pitfalls.
My implementation currently uses Ollama with one of three models to calculate the embeddings for a user-definable selection of tags. The tags for which embeddings exist in the database can then be queried using "natural language" queries.
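
The shape of that approach, sketched in Python for illustration (IMatch does this in C++; the function and table names here are purely illustrative):

```python
# Sketch: a custom scalar function gives SQLite a vector distance, so
# embeddings can live as float32 BLOBs in a perfectly normal table.
import math
import sqlite3
import struct

def cosine_distance(blob_a: bytes, blob_b: bytes) -> float:
    a = struct.unpack(f"{len(blob_a) // 4}f", blob_a)
    b = struct.unpack(f"{len(blob_b) // 4}f", blob_b)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

db = sqlite3.connect(":memory:")
db.create_function("vec_cosine_distance", 2, cosine_distance,
                   deterministic=True)
db.execute("CREATE TABLE embeddings (file_id INTEGER PRIMARY KEY, vec BLOB)")

# KNN then becomes a plain ORDER BY + LIMIT over the normal table:
#   SELECT file_id FROM embeddings
#   ORDER BY vec_cosine_distance(vec, :query_vec)
#   LIMIT 50
```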

The AI magic in the embeddings allows IMatch to find descriptions referencing automobile, car, vehicle, ride etc. when you search for car. It will find descriptions containing terms like man, woman, boy, girl, model, actor, actress ... when you search for person. Not sure yet how far this can go, because I first need to implement it and let it process a good-sized database with 20,000 to 30,000 descriptions to see what is possible.

When this becomes available, users need to trigger an initial "embedding" run after setting up which tags they want to include and some details about Ollama and the model. During this run, IMatch produces embeddings for all files with metadata for one or more of these tags.

Later, IMatch automatically updates the embeddings when the metadata of a file changes or new files are added.

I don't really like the dependency on an external service (Ollama / LM Studio) for this. But it takes large teams to build something that works like Ollama and LM Studio, and that is way out of my league. I'm just me.

Since embeddings work via the background processing queue, the files to update just remain in the queue when Ollama is not running or unwell. This is how AutoTagger deals with this situation, too.

Anyway, I think a couple of days or a week to implement this will be worth it, just to see how good this is for fuzzy searches where the user does not know the exact words and regexp won't do.

The model I currently like best produces embedding vectors with 768 dimensions (!). I've got the calculations fairly fast thanks to IMatch requiring AVX anyway, which means I could use the dedicated AVX routines for fast matrix math available in Intel/AMD processors. Performance looks OK so far, more than fast enough for the IMatch use cases.
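
For a sense of the scale involved, here is a vectorized batch-scoring sketch in Python/numpy (IMatch uses hand-written AVX code instead; the sizes mirror the numbers mentioned in this thread):

```python
# Sketch: score 100,000 stored 768-dimensional embeddings against one
# query vector and keep the 50 best matches. Random data stands in for
# real embeddings.
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(100_000, 768)).astype(np.float32)
stored /= np.linalg.norm(stored, axis=1, keepdims=True)  # normalize once

query = rng.normal(size=768).astype(np.float32)
query /= np.linalg.norm(query)

scores = stored @ query                  # cosine similarity = dot product
top50 = np.argsort(scores)[-50:][::-1]   # indices of the 50 best matches
```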

monochrome

:o

Oh, wow, that went fast!

I think using Ollama for the first implementation is 100% the right way. We don't know how useful this even is.

For future reference, I found this: https://github.com/skeskinen/bert.cpp/blob/master/examples/main.cpp - a C++ library to run sentence-transformers models in C++. Seems simple enough for a built-in alternative to Ollama. It's based on the much larger - and well supported - ggml library and is about 2000 lines of code. 

Mario

BERT has been part of llama.cpp for a while, and Ollama is based on llama.cpp anyway. They just added and maintain a ton of code on top of llama.cpp (same as LM Studio) to make all this work reliably, deal with offloading and splitting, and more.

Mario

#17
OK, the prototype implementation is working. I have a menu command that produces embeddings for the descriptions of all files in the database that have one. Other tags could be processed too, when this becomes a feature. Maybe keywords, or AI description, or traits, whatever.

I've got it working reasonably fast: between 2,000 and 3,000 images processed per minute. The embedding models are quite small (the biggest one is only 600 MB) and it seems Ollama handles multiple threads very well in this case. IMatch uses up to 8 parallel threads when creating embeddings, keeping within the max GPU limits set by the performance profile.
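
The threading pattern is roughly this (a Python sketch, again assuming Ollama's /api/embed endpoint; the model name and placeholder texts are illustrative):

```python
# Sketch: generate embeddings with up to 8 parallel requests to Ollama.
from concurrent.futures import ThreadPoolExecutor
import requests

def embed_one(text: str) -> list[float]:
    r = requests.post("http://localhost:11434/api/embed",
                      json={"model": "nomic-embed-text", "input": text})
    r.raise_for_status()
    return r.json()["embeddings"][0]

descriptions = ["A young woman poses for the camera"] * 100  # placeholder
with ThreadPoolExecutor(max_workers=8) as pool:
    vectors = list(pool.map(embed_one, descriptions))
```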

And I have a dialog where I can type in a search to search the description.

Internally, the search text is converted into an embedding (via Ollama) and then IMatch calculates the "closest" embeddings matching the search text in the database. The 50 best matches are displayed in a Result Window. The search dialog remains open, which means you can search again and the Result Window updates.

For Interested Users: What Does It Actually Do?

Using Ollama and a specialized AI model trained on large bodies of text, IMatch feeds the description into the model and gets a vector with 768 dimensions back. This vector is stored in the database.

When you search for a description, IMatch creates an embedding from your search text, again using Ollama. This also produces a vector with 768 dimensions. Now IMatch uses some fancy algorithms to compare the vectors stored in the database with your search vector.

The idea is that the model is smart and produces vectors close to each other for similar terms. For example, "ball" and "balls" are close, as are "sport" and "running", or "car", "auto", "automobile" and "vehicle".

IMatch only creates embeddings from your descriptions; this does not "produce" any new information. But, when it works, it allows you to find images with descriptions similar to your search term, even if the description does not use exactly the same words as your search. Searching for woman also finds descriptions containing women or girl or female etc.
Searching for Mexican food also finds descriptions mentioning chilli. And so on.

Whether this is actually helpful for users, and worth the extra effort and AI time, depends on many things, I guess: how the AI descriptions and other contents of your database vary, how you search, whether you have good keywords with synonyms to use in searches, etc.

Update

I've got this working "good enough" and added all the options the user might want to change to a dialog where you can also run queries to see if this works (for you).

Here is an example from today.
My query was "Mexican food" (searching the XMP description) and the new "natural language query based on AI embeddings 0.1" brought up all images in my test database which match the description.

The interesting part is that it also brought up images showing chili dishes where the description does not contain the terms food or Mexican. See the screenshot below.

The embeddings and "AI magic" IMatch employs for this make a connection between the query "Mexican food" and the terms in the description and decide that they are relevant.

Look at the descriptions. There is no mention of food or Mexican, but the images are still found. IMatch could not do this until now.

[Attachment: screenshot of the "Mexican food" query results]

Let me know what you think. Worth releasing as an Experimental Feature? Would you test it and provide feedback?
The models are far less demanding on GPU memory and should run fine even on CPU only. Ollama takes care of that.