Imagine that winter is coming to the south of the planet, you are going on a vacation trip to Patagonia, and you want to buy some cozy clothes. You go to that nice search engine that used to say “don’t be evil” and type “Jackets for Patagonia weather” into the search field, not thinking of any particular brand, but the search results are all about THAT particular brand. What happened? Probably a lack of context (is Patagonia a brand or a place?). One can argue that the database can be refined by adding categories to names, so that you can store “apple” as both a fruit and a company, or “Patagonia” as both a place and a brand. It will add some complexity to the DB model and probably make queries take longer to execute, but it is doable.

Now, say you saw a picture of a nice jacket online, and you want to search by image. What kind of magic happens under the hood for a database to be able to resolve queries when the search condition is an image? A hashed version of the image? But how does one compare shape and color? How does one know it is a jacket and not a lab coat? And how do you store all that data in an efficient and searchable way? Enter Vector Databases.

Vector Databases

Math detour

Vectors are mathematical objects that represent quantities with both magnitude and direction. They can be represented as arrays of numbers, where each element corresponds to a specific dimension.

For example, (-2, 3) is a 2-D vector. Add one more element and you get a 3-D vector, such as (2, 1, 3). As you can see, each element adds a dimension, and you can actually go to N dimensions! This is the subject matter of Linear Algebra.
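If you prefer code to pictures, here is the same idea in NumPy (assuming it is installed):

    import numpy as np

    v2 = np.array([-2, 3])      # a 2-D vector
    v3 = np.array([2, 1, 3])    # a 3-D vector
    vn = np.random.rand(384)    # a 384-D vector: higher dimensions work the same way

    # The magnitude (length) of a vector is its Euclidean norm
    print(np.linalg.norm(v2))   # sqrt((-2)^2 + 3^2) = 3.6055...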

But what does that have to do with databases? Actually, it is not the database itself; it is the data that is represented by vectors. And that data is stored in databases.

Vectors and databases

In vector databases, vectors represent complex data, such as text, images, sounds, etc., in a format that a machine can understand. This transforms qualitative attributes into quantitative features that can be easily compared using math operations.

If we go back to the jacket example, each qualitative attribute of the jacket, like Size: M, Color: Blue, Sleeves: Long, Fabric: Cotton, etc., can be encoded as a number and become one component of the vector.
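To make that concrete, here is a minimal sketch; the attribute-to-number mappings are made up for illustration (real systems learn such representations instead of hand-coding them):

    # Hand-made lookup tables, purely illustrative
    SIZE = {"S": 1, "M": 2, "L": 3}
    COLOR = {"red": 0, "blue": 1, "green": 2}
    SLEEVES = {"short": 0, "long": 1}
    FABRIC = {"cotton": 0, "wool": 1, "polyester": 2}

    def jacket_to_vector(size, color, sleeves, fabric):
        """Encode each qualitative attribute as one numeric component."""
        return [SIZE[size], COLOR[color], SLEEVES[sleeves], FABRIC[fabric]]

    print(jacket_to_vector("M", "blue", "long", "cotton"))  # [2, 1, 1, 0]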

Another good example is facial recognition.

Imagine each facial feature as a dimension (mouth, nose, eyes, that wrinkle that you hide). The unique combination of these features for each person creates their facial vector.

When you upload a picture to the futuristic facial recognition system, it compares its vector to the stored vectors of known people. If it finds a close match, Bingo! Person recognized!
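A toy version of that comparison, assuming the faces have already been turned into vectors, could use Euclidean distance with a hand-picked threshold:

    import math

    # Hypothetical "facial vectors" of known people (real systems learn these)
    known_people = {
        "Alice": [0.1, 0.9, 0.4],
        "Bob": [0.8, 0.2, 0.5],
    }

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def recognize(face_vector, threshold=0.2):
        """Return the closest known person, or None if nobody is close enough."""
        name, dist = min(
            ((n, euclidean(face_vector, v)) for n, v in known_people.items()),
            key=lambda pair: pair[1],
        )
        return name if dist <= threshold else None

    print(recognize([0.12, 0.88, 0.41]))  # Alice
    print(recognize([0.5, 0.5, 0.5]))     # None: no close match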

Creating vectors, also known as EMBEDDINGS


Source: https://tge-data-web.nyc3.digitaloceanspaces.com/docs/Vector%20Databases%20-%20A%20Technical%20Primer.pdf

Embeddings are vector representations of data points that capture their semantic meaning. That is critical, the “semantic meaning.” Why? Because then you can use the meaning of a text and compare it with the meaning of an image! That’s how a machine can decide that a picture of a winter jacket is properly described by the phrase “winter jacket.” This is something called “semantic search.”

How are embeddings generated? MACHINE LEARNING. The actual heavy lifting of the embedding process happens within a trained model. One can create one’s own model or use a pre-trained one that fits one’s needs. A good open source library to embed text and images is Sentence Transformers. A full tutorial is outside the scope of this article, but a minimal sketch gives the flavor.
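The sketch below assumes the sentence-transformers package is installed (pip install sentence-transformers); the model name is a popular general-purpose choice that is downloaded on first use:

    from sentence_transformers import SentenceTransformer

    # "all-MiniLM-L6-v2" is a small, general-purpose pre-trained model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = ["winter jacket", "lab coat", "cozy clothes for Patagonia weather"]
    embeddings = model.encode(sentences)

    # One vector per sentence; this model produces 384-dimensional embeddings
    print(embeddings.shape)  # (3, 384)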

Storing and searching embeddings

Now we know that vector databases store… vectors. But those vectors represent entities and carry description and meaning. One can be tempted to use as many dimensions as possible in order to describe an object down to the smallest detail. This leads to high-dimensional spaces.

High-dimensional data

This refers to data that has a large number of attributes. A song is described by rhythm, instruments, tempo, pitch, length, topic, key, etc. Each of those is a new dimension.

As the number of dimensions increases, the volume of the vector space grows. By a lot. High dimensionality can capture a great amount of detail, BUT it comes with a cost when managing and searching this data. This is called The Curse of Dimensionality: as the dimensionality increases, the distance between data points becomes less meaningful, and the computational cost of finding similar points can become unreasonable. To alleviate this problem, dimensionality reduction techniques are key. Things like “Feature Selection,” “Hashing,” and “Principal Component Analysis (PCA)” are useful for this task.
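To illustrate the last one, here is a minimal PCA sketch with scikit-learn (assuming it and NumPy are installed; the data is random, just to show the shapes):

    import numpy as np
    from sklearn.decomposition import PCA

    vectors = np.random.rand(1000, 128)  # 1,000 random "embeddings", 128-D each

    # Project down to 16 dimensions while keeping as much variance as possible
    pca = PCA(n_components=16)
    reduced = pca.fit_transform(vectors)

    print(vectors.shape, "->", reduced.shape)  # (1000, 128) -> (1000, 16)
    print(f"variance kept: {pca.explained_variance_ratio_.sum():.0%}")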

Distance metrics

OK, I have embeddings in my Vector Database. Now what? Well, now: More math.

Distance metrics are math functions that determine the “distance” between two points in a vector space. Different metrics capture relationships between different aspects of the data.

The most popular are (a quick sketch of each follows the list):

  • Euclidean Distance: Very sensitive to the magnitude of vectors
  • Manhattan Distance: Useful for image retrieval and financial analysis
  • Cosine Distance: Great for evaluating text similarity where word frequency matters less than the actual meaning.
  • Jaccard Similarity: Used to find similar customers based on purchase history
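Here is a quick sketch of all four with NumPy (toy vectors, assuming NumPy is installed):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    # Euclidean distance: straight-line distance, sensitive to magnitude
    euclidean = np.linalg.norm(a - b)  # 3.74...

    # Manhattan distance: sum of absolute differences along each axis
    manhattan = np.abs(a - b).sum()  # 6.0

    # Cosine distance: 1 - cosine similarity; ignores magnitude, keeps direction
    cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 0.0 (parallel)

    # Jaccard similarity on sets, e.g., two customers' purchase histories
    p1, p2 = {"jacket", "boots", "hat"}, {"jacket", "boots", "gloves"}
    jaccard = len(p1 & p2) / len(p1 | p2)  # 0.5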

Similarity search: The ultimate goal


Source: https://weaviate.io/blog/distance-metrics-in-vector-search

In the end, all this fuss is about finding similar things. Fast and accurate. This is achieved through algorithms like K-Nearest Neighbors (KNN) and Approximate Nearest Neighbor (ANN), each with its pros and cons. These algorithms are part of the database engine, already implemented and ready to use. But as with any other database, indexes are key to improving execution times. 
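As a baseline, exact KNN can be brute-forced in a few lines (a NumPy sketch; real engines are far more sophisticated):

    import numpy as np

    def knn(query, vectors, k=3):
        """Return the indices of the k vectors closest to the query (Euclidean)."""
        distances = np.linalg.norm(vectors - query, axis=1)  # distance to every vector
        return np.argsort(distances)[:k]                     # indices of the k smallest

    vectors = np.random.rand(10_000, 64)  # 10,000 stored embeddings, 64-D each
    query = np.random.rand(64)
    print(knn(query, vectors))

Note that this scans every stored vector on every query; that linear cost is exactly what the indexing options below try to avoid.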

By creating indexes on specific dimensions, vector databases can significantly accelerate similarity search ops. Some popular indexing options are:

  • Flat index
  • Inverted File Index (IVF)
  • Approximate Nearest Neighbors Oh Yeah (ANNOY)
  • Product Quantization (PQ)
  • Hierarchical Navigable Small World (HNSW) (sketched below)
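Here is a minimal HNSW sketch using the hnswlib library (assuming it is installed via pip install hnswlib; the parameters are illustrative, not tuned):

    import numpy as np
    import hnswlib

    dim, num_elements = 64, 10_000
    data = np.random.rand(num_elements, dim).astype(np.float32)

    # Build an HNSW index over the vectors using cosine distance
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))

    # ef controls the accuracy/speed trade-off at query time
    index.set_ef(50)
    labels, distances = index.knn_query(data[:1], k=3)
    print(labels, distances)  # a vector should be its own nearest neighbor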

Vector databases and vector stores

A vector database is designed from scratch to store, index, and query vector data. Examples are Pinecone and Qdrant.

A vector store is a regular, traditional database for storage and search that also has the ability to store and retrieve vector data. Examples are ClickHouse (https://clickhouse.com/blog/vector-search-clickhouse-p1) and PostgreSQL with the pgvector extension (see a great walk-through here: https://www.percona.com/blog/create-an-ai-expert-with-open-source-tools-and-pgvector/).
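As a quick taste of the latter, here is a minimal pgvector sketch from Python (assuming psycopg2 is installed and a PostgreSQL server with the pgvector extension is reachable; the connection string is hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=demo user=postgres")  # hypothetical credentials
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
    )
    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[0,0,5]')")

    # "<->" is pgvector's Euclidean distance operator; closest rows come first
    cur.execute("SELECT id FROM items ORDER BY embedding <-> '[1,2,4]' LIMIT 1")
    print(cur.fetchone())

    conn.commit()
    cur.close()
    conn.close()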

Wrapping up

AI and Machine Learning are pushing the boundaries of technology, not only in the breakthrough achievements that end users see (ChatGPT, Bard, Copilot, etc.) but also in what is happening under the hood. ML-trained models require databases tailored to their needs, and the industry is keeping up with those requirements. What lies ahead is an enduring test and a continued convergence with the open source space. Exciting times!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!
