Imagine that winter is coming to the south of the planet, you are going on a vacation trip to Patagonia, and you want to buy some cozy clothes. You go to that nice search engine that used to say “don’t be evil” and type “Jackets for Patagonia weather” into the search field, not thinking of any particular brand, but the search results are all about THAT particular brand. What happened? Probably a lack of context (is Patagonia a brand or a place?). One can argue that the database can be refined by adding categories to names, so that you can store “apple” as both a fruit and a company, or “Patagonia” as both a place and a brand. It will add some complexity to the DB model and probably make queries take longer to execute, but it is doable.

Now, say you saw a picture of a nice jacket online, and you want to search by image. What kind of magic happens under the hood for a database to be able to resolve queries when the search condition is an image? A hashed version of the image? But how does one compare shape and color? How does one know it is a jacket and not a lab coat? And how do you store all that data in an efficient and searchable way? Enter Vector Databases.

Vector Databases

Math detour

Vectors are mathematical objects that represent quantities with both magnitude and direction. They can be represented as arrays of numbers, where each element corresponds to a specific dimension.

For example, (-2, 3) is a 2-D vector. Add one more element and you get a 3-D vector, such as (2, 1, 3). As you can see, each element adds a dimension, and you can actually go to N dimensions! This is the subject matter of Linear Algebra.
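If you prefer code to pictures, here is the same idea in NumPy (assuming it is installed):

    import numpy as np

    v2 = np.array([-2, 3])      # a 2-D vector
    v3 = np.array([2, 1, 3])    # a 3-D vector
    vn = np.random.rand(384)    # a 384-D vector: higher dimensions work the same way

    # The magnitude (length) of a vector is its Euclidean norm
    print(np.linalg.norm(v2))   # sqrt((-2)^2 + 3^2) = 3.6055...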

But what does that have to do with databases? Actually, it is not the database itself; it is the data that is represented by vectors. And that data is stored in databases.

Vectors and databases

In vector databases, vectors represent complex data, such as text, images, sounds, etc., in a format that a machine can understand. This transforms qualitative attributes into quantitative features that can be easily compared using math operations.

If we go back to the jacket example, each qualitative attribute of the jacket, like Size: M, Color: Blue, Sleeves: Long, Fabric: Cotton, etc., can be encoded as a number and become one component of the vector.
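To make that concrete, here is a minimal sketch; the attribute-to-number mappings are made up for illustration (real systems learn such representations instead of hand-coding them):

    # Hand-made lookup tables, purely illustrative
    SIZE = {"S": 1, "M": 2, "L": 3}
    COLOR = {"red": 0, "blue": 1, "green": 2}
    SLEEVES = {"short": 0, "long": 1}
    FABRIC = {"cotton": 0, "wool": 1, "polyester": 2}

    def jacket_to_vector(size, color, sleeves, fabric):
        """Encode each qualitative attribute as one numeric component."""
        return [SIZE[size], COLOR[color], SLEEVES[sleeves], FABRIC[fabric]]

    print(jacket_to_vector("M", "blue", "long", "cotton"))  # [2, 1, 1, 0]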

Another good example is facial recognition.

Imagine each facial feature as a dimension (mouth, nose, eyes, that wrinkle that you hide). The unique combination of these features for each person creates their facial vector.

When you upload a picture to the futuristic facial recognition system, it compares its vector to the stored vectors of known people. If it finds a close match, Bingo! Person recognized!
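A toy version of that comparison, assuming the faces have already been turned into vectors, could use Euclidean distance with a hand-picked threshold:

    import math

    # Hypothetical "facial vectors" of known people (real systems learn these)
    known_people = {
        "Alice": [0.1, 0.9, 0.4],
        "Bob": [0.8, 0.2, 0.5],
    }

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def recognize(face_vector, threshold=0.2):
        """Return the closest known person, or None if nobody is close enough."""
        name, dist = min(
            ((n, euclidean(face_vector, v)) for n, v in known_people.items()),
            key=lambda pair: pair[1],
        )
        return name if dist <= threshold else None

    print(recognize([0.12, 0.88, 0.41]))  # Alice
    print(recognize([0.5, 0.5, 0.5]))     # None: no close match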

Creating vectors, also known as EMBEDDINGS


Source: https://tge-data-web.nyc3.digitaloceanspaces.com/docs/Vector%20Databases%20-%20A%20Technical%20Primer.pdf

Embeddings are vector representations of data points that capture their semantic meaning. That is critical, the “semantic meaning.” Why? Because then you can use the meaning of a text and compare it with the meaning of an image! That’s how a machine can decide that a picture of a winter jacket is properly described by the phrase “winter jacket.” This is something called “semantic search.”

How are embeddings generated? MACHINE LEARNING. The actual heavy lifting of the embedding process happens within a trained model. One can create one’s own model or use a pre-trained one that fits one’s needs. A good open source library to embed text and images is Sentence Transformers. A full tutorial is outside the scope of this article, but a minimal sketch gives the flavor.
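The sketch below assumes the sentence-transformers package is installed (pip install sentence-transformers); the model name is a popular general-purpose choice that is downloaded on first use:

    from sentence_transformers import SentenceTransformer

    # "all-MiniLM-L6-v2" is a small, general-purpose pre-trained model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = ["winter jacket", "lab coat", "cozy clothes for Patagonia weather"]
    embeddings = model.encode(sentences)

    # One vector per sentence; this model produces 384-dimensional embeddings
    print(embeddings.shape)  # (3, 384)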

Storing and searching embeddings

Now we know that vector databases store… vectors. But those vectors represent entities and carry description and meaning. One can be tempted to use as many dimensions as possible in order to describe an object down to the smallest detail. This leads to high-dimensional spaces.

High-dimensional data

This refers to data that has a large number of attributes. A song is described by rhythm, instruments, tempo, pitch, length, topic, key, etc. Each of those is a new dimension.

As the number of dimensions increases, the volume of the vector space grows. By a lot. High dimensionality can capture a great amount of detail, BUT it comes with a cost when managing and searching this data. This is called The Curse of Dimensionality: as the dimensionality increases, the distance between data points becomes less meaningful, and the computational cost of finding similar points can become unreasonable. To alleviate this problem, dimensionality reduction techniques are key. Things like “Feature Selection,” “Hashing,” and “Principal Component Analysis (PCA)” are useful for this task.
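To illustrate the last one, here is a minimal PCA sketch with scikit-learn (assuming it and NumPy are installed; the data is random, just to show the shapes):

    import numpy as np
    from sklearn.decomposition import PCA

    vectors = np.random.rand(1000, 128)  # 1,000 random "embeddings", 128-D each

    # Project down to 16 dimensions while keeping as much variance as possible
    pca = PCA(n_components=16)
    reduced = pca.fit_transform(vectors)

    print(vectors.shape, "->", reduced.shape)  # (1000, 128) -> (1000, 16)
    print(f"variance kept: {pca.explained_variance_ratio_.sum():.0%}")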

Distance metrics

OK, I have embeddings in my Vector Database. Now what? Well, now: More math.

Distance metrics are math functions that determine the “distance” between two points in a vector space. Different metrics capture relationships between different aspects of the data.

The most popular are (a quick sketch of each follows the list):

  • Euclidean Distance: Very sensitive to the magnitude of vectors
  • Manhattan Distance: Useful for image retrieval and financial analysis
  • Cosine Distance: Great for evaluating text similarity where word frequency matters less than the actual meaning.
  • Jaccard Similarity: Used to find similar customers based on purchase history
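Here is a quick sketch of all four with NumPy (toy vectors, assuming NumPy is installed):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    # Euclidean distance: straight-line distance, sensitive to magnitude
    euclidean = np.linalg.norm(a - b)  # 3.74...

    # Manhattan distance: sum of absolute differences along each axis
    manhattan = np.abs(a - b).sum()  # 6.0

    # Cosine distance: 1 - cosine similarity; ignores magnitude, keeps direction
    cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 0.0 (parallel)

    # Jaccard similarity on sets, e.g., two customers' purchase histories
    p1, p2 = {"jacket", "boots", "hat"}, {"jacket", "boots", "gloves"}
    jaccard = len(p1 & p2) / len(p1 | p2)  # 0.5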

Similarity search: The ultimate goal


Source: https://weaviate.io/blog/distance-metrics-in-vector-search

In the end, all this fuss is about finding similar things. Fast and accurate. This is achieved through algorithms like K-Nearest Neighbors (KNN) and Approximate Nearest Neighbor (ANN), each with its pros and cons. These algorithms are part of the database engine, already implemented and ready to use. But as with any other database, indexes are key to improving execution times. 
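As a baseline, exact KNN can be brute-forced in a few lines (a NumPy sketch; real engines are far more sophisticated):

    import numpy as np

    def knn(query, vectors, k=3):
        """Return the indices of the k vectors closest to the query (Euclidean)."""
        distances = np.linalg.norm(vectors - query, axis=1)  # distance to every vector
        return np.argsort(distances)[:k]                     # indices of the k smallest

    vectors = np.random.rand(10_000, 64)  # 10,000 stored embeddings, 64-D each
    query = np.random.rand(64)
    print(knn(query, vectors))

Note that this scans every stored vector on every query; that linear cost is exactly what the indexing options below try to avoid.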

By creating indexes on specific dimensions, vector databases can significantly accelerate similarity search ops. Some popular indexing options are:

  • Flat index
  • Inverted File Index (IVF)
  • Approximate Nearest Neighbors Oh Yeah (ANNOY)
  • Product Quantization (PQ)
  • Hierarchical Navigable Small World (HNSW) (sketched below)
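Here is a minimal HNSW sketch using the hnswlib library (assuming it is installed via pip install hnswlib; the parameters are illustrative, not tuned):

    import numpy as np
    import hnswlib

    dim, num_elements = 64, 10_000
    data = np.random.rand(num_elements, dim).astype(np.float32)

    # Build an HNSW index over the vectors using cosine distance
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))

    # ef controls the accuracy/speed trade-off at query time
    index.set_ef(50)
    labels, distances = index.knn_query(data[:1], k=3)
    print(labels, distances)  # a vector should be its own nearest neighbor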

Vector databases and vector stores

A vector database is designed from scratch to store, index, and query vector data. Examples are Pinecone and Qdrant.

A vector store is a regular, traditional database for storage and search that also has the ability to store and retrieve vector data. Examples are ClickHouse (https://clickhouse.com/blog/vector-search-clickhouse-p1) and PostgreSQL with the pgvector extension (see a great walk-through here: https://www.percona.com/blog/create-an-ai-expert-with-open-source-tools-and-pgvector/).
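As a quick taste of the latter, here is a minimal pgvector sketch from Python (assuming psycopg2 is installed and a PostgreSQL server with the pgvector extension is reachable; the connection string is hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=demo user=postgres")  # hypothetical credentials
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
    )
    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[0,0,5]')")

    # "<->" is pgvector's Euclidean distance operator; closest rows come first
    cur.execute("SELECT id FROM items ORDER BY embedding <-> '[1,2,4]' LIMIT 1")
    print(cur.fetchone())

    conn.commit()
    cur.close()
    conn.close()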

Wrapping up

AI and Machine Learning are pushing the boundaries of technology, not only in the breakthrough achievements that end users see (ChatGPT, Bard, Copilot, etc.) but also in what is happening under the hood. ML-trained models require databases tailored to their needs, and the industry is keeping up with those requirements. What lies ahead is an enduring test and a continued convergence with the open source space. Exciting times!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!
