2023 was the year of Artificial Intelligence (AI). A lot of companies are thinking about how they can improve the user experience with AI, and the most common first step is to use company data (internal docs, ticketing systems, etc.) to answer customer questions faster or even automatically.

In this blog post, we will explain the basic concepts and show how you can build your own AI expert with your company's data. To look under the hood at how things work, we will deliberately avoid non-open source tooling. This also lets anyone try it out for free and, at the same time, makes it easy to switch the code over to a paid API, like OpenAI's.

The goal

We will execute the following steps:

  1. Provision the infrastructure
  2. Capture company data and store it in PostgreSQL with pgvector
  3. Ask questions and generate responses
    • The prompt will consist of two parts:
      • User question
      • Context – our data
    • LLM will use the prompt to generate a response to the user’s question

Create an AI Expert With Open Source

To keep the post readable, we will show code concepts in the blog post, while full examples can be found in the GitHub repository: blog-data/percona-ai-pgvector. I will provide links to the corresponding lines.

Glossary and terms

This section explains some terms used in this blog post that might be new to the reader.

Vector embeddings

A vector embedding is a numerical representation of data that captures its meaning and relationships. Essentially, it is a list of numbers. You can encode any data as vector embeddings – text, pictures, video, voice – and perform a semantic search to find matches and similarities. Embeddings are usually stored in vector databases, like pgvector. Once you have vector embeddings for your text, you can easily calculate how similar they are using math operations (like cosine distance). You can pre-process your company documents and generate embeddings, then search through them using the user's input.
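For example, with two toy vectors, the cosine distance is just a couple of NumPy operations (the vectors below are made up for illustration – real embeddings have hundreds or thousands of dimensions, but the math is the same):

    import numpy as np

    # Toy three-dimensional "embeddings"; real ones have hundreds or thousands of dimensions
    doc = np.array([0.2, 0.9, 0.1])
    query = np.array([0.25, 0.8, 0.05])

    # Cosine distance = 1 - cosine similarity; the smaller it is, the closer the meaning
    distance = 1 - np.dot(doc, query) / (np.linalg.norm(doc) * np.linalg.norm(query))
    print(distance)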

Large Language Model (LLM)

LLMs are what power AI. An LLM is a collection of deep learning algorithms that can execute various tasks – understanding text, generating responses, converting data, etc. GPT-3 and GPT-4 from OpenAI are examples of LLMs.

Tokens

In the context of Large Language Models (LLMs), the term “token” refers to a chunk of text that the model reads or generates. A token is typically not a word; it could be a smaller unit, like a character or a part of a word, or a larger one, like a whole phrase. This depends on the model.

Inputs and outputs go through the process of tokenization – splitting text into smaller units. For example, OpenAI uses Byte-Pair Encoding (BPE). The phrase “Hello, World!” is thirteen characters, but only four tokens:

  • “Hello”
  • “,”
  • “ world”
  • “!”

Furthermore, these tokens are given numerical identifiers that, as you might have guessed, are placed into a vector. So the phrase will look like this:
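(The exact identifiers depend on the tokenizer; the sketch below uses the GPT-2 BPE tokenizer from the Hugging Face transformers library purely as an example.)

    from transformers import AutoTokenizer

    # GPT-2's tokenizer is a well-known Byte-Pair Encoding (BPE) implementation
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    tokens = tokenizer.tokenize("Hello, world!")   # ['Hello', ',', 'Ġworld', '!'] - Ġ marks a leading space
    ids = tokenizer.encode("Hello, world!")        # four integer identifiers, one per token
    print(tokens, ids)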

This vector is used to create an embedding or as a prompt.

Hugging Face hub

Hugging Face is an open source company that builds solutions and tools for machine learning and AI. They have a hub that exposes thousands of LLMs, data sets, and demo apps. We will use models and libraries that Hugging Face provides.

Provision infrastructure

pgvector

pgvector is a PostgreSQL extension that allows storing vector embeddings. You can install it with Percona Distribution for PostgreSQL. I will install it on Kubernetes using Percona Operator for PostgreSQL and its latest experimental Custom Extension feature. This feature allows you to build an extension, store it in an S3 bucket, and instruct the operator to use it without the need to rebuild container images.

You can find the relevant manifests and the pre-built extension in the k8s-operator folder in the repo.

In my Custom Resource manifest (cr.yaml), I create the user and the database for our experiments and also instruct the operator to load the extension from the bucket:
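(The full cr.yaml lives in the k8s-operator folder of the repo. The fragment below is only a sketch – the users and extensions sections follow the Operator's documented fields, but double-check the names against the CRD version you are running; the bucket, secret, and extension version are placeholders.)

    # Fragment of cr.yaml - illustrative only
    spec:
      users:
        - name: ai                # user for our experiments
          databases:
            - ai                  # database for our experiments
      extensions:
        storage:
          type: s3
          bucket: pg-extensions   # bucket holding the pre-built pgvector package
          region: eu-central-1
          secret:
            name: cluster1-extensions-secret
        custom:
          - name: pgvector
            version: 0.5.1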

Preparing the database

Once the database is up and running, let’s create the table:
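(The exact schema used in the post is in python/01-pg-provision.py; the documents table below, with its url and content columns, is a minimal sketch.)

    -- Enable the extension and create a table for document chunks
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
        id        bigserial PRIMARY KEY,
        url       text,
        content   text,
        embedding vector(1024)   -- 1024 dimensions to match the UAE-Large-V1 model
    );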

vector is the new data type introduced by pgvector. We create an embedding column with the vector type and 1024 vector dimensions. Different models have different dimensions. For example, OpenAI's text-embedding-ada-002 has 1536 dimensions. We are going to use the UAE-Large-V1 model, which has 1024 embedding dimensions and is a current leader on the Massive Text Embedding Benchmark (MTEB) leaderboard. You should pick your model carefully from the very beginning, as changing it later requires you to re-create the vector embeddings for your data.

pgvector introduces three new operators that can be used to calculate similarity:

  • <-> – Euclidean distance
  • <#> – negative inner product
  • <=> – cosine distance

Cosine distance, which is based on the angle between vectors, is the usual choice in most AI products. For example, you can do something like:
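(A sketch against the hypothetical documents table from above; the query embedding would normally be passed in as a bind parameter from the application.)

    -- Three documents closest (by cosine distance) to the user's query embedding
    SELECT url
    FROM documents
    ORDER BY embedding <=> %s::vector   -- the user's query embedding, passed from the application
    LIMIT 3;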

This query will output the top three URLs ordered by similarity – the ones closest to the user's search. For simplicity, we create a match_documents function that does the magic for us. You can find it in python/01-pg-provision.py.
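Conceptually, such a helper can look roughly like this (the actual definition is in the repo; column names follow the table sketch above):

    -- Rough sketch of a match_documents helper
    CREATE OR REPLACE FUNCTION match_documents(query_embedding vector(1024), match_count int)
    RETURNS TABLE (url text, content text, similarity float)
    LANGUAGE sql STABLE
    AS $$
        SELECT d.url,
               d.content,
               1 - (d.embedding <=> query_embedding) AS similarity
        FROM documents d
        ORDER BY d.embedding <=> query_embedding
        LIMIT match_count;
    $$;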

GPU and python

GPUs are great at parallel processing, which is essential for creating vector embeddings. You can parse documents and create embeddings without a GPU, but it might take 10x longer. Public clouds provide a variety of GPU-powered machines and operating system images for deep learning. For Linux machines, you will need to install NVIDIA drivers and CUDA.

Python is widely popular for machine learning tasks. It has a lot of libraries and tools for them that are accessible through pip.

In the install script, you can find various commands for Ubuntu that install the necessary GPU drivers and Python libraries. We will discuss these libraries in detail below.

Generate embeddings from company data

Load embedding model

To generate embeddings, we will use the sentence_transformers library from Hugging Face.

As promised, we are using the UAE-Large-V1 model. It will be automatically fetched from the Hugging Face hub when you run the script for the first time.
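Loading the model takes just a couple of lines (WhereIsAI/UAE-Large-V1 is the model's hub ID; the sample sentence is illustrative, and the exact code is in python/02-put.py):

    from sentence_transformers import SentenceTransformer

    # Downloads the model from the Hugging Face hub on the first run, then loads it onto the GPU
    model = SentenceTransformer("WhereIsAI/UAE-Large-V1", device="cuda")

    embedding = model.encode("How do I install pgvector?")  # a 1024-dimensional numpy array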

device='cuda' instructs the code to use the GPU. You can switch to cpu instead, but it will result in suboptimal performance.

Scrape the company documents

This strictly depends on the data you have. In our example, we scrape Percona documentation and blog posts, which are both publicly available. We use the BeautifulSoup Python package, which simplifies HTML and XML parsing. You might also want to add logic to remove noise from the documents before converting them into embeddings – repeated menus, various banners, etc. – so that only the pure content remains.
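A minimal scraping sketch might look like this (the URL and the article tag are placeholders – the right selector depends on your site's markup):

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.percona.com/blog/"            # placeholder; iterate over your real pages
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Keep only the main content, dropping menus, banners, and other noise
    article = soup.find("article") or soup.body
    text = article.get_text(separator="\n", strip=True)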

Data chunks

You can take a huge article and convert it into a single vector embedding, but that might harm your semantic search. It is recommended to split the text into smaller chunks – paragraphs, sentences, or sometimes even words. The decision on how to split the data should be driven by the model you are going to use to generate answers in the end and by the data itself.

For example, our documentation is stored in Markdown. In our code, we use langchain's MarkdownTextSplitter to generate chunks out of these files. See python/02-put.py.
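A minimal sketch (the chunk sizes and the file name are just examples to illustrate the API):

    from langchain.text_splitter import MarkdownTextSplitter

    # Chunk size and overlap are tuning knobs that depend on your model and data
    splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

    with open("docs/installation.md") as f:   # hypothetical markdown file
        chunks = splitter.split_text(f.read())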

Load data into pgvector

Once we create the embeddings, it is time to put them into pgvector. We use psycopg2 to connect to PostgreSQL and the pgvector library to work with the vectors themselves:
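(Connection parameters below are placeholders, and url, chunk, and model come from the earlier snippets.)

    import psycopg2
    from pgvector.psycopg2 import register_vector

    conn = psycopg2.connect(host="localhost", dbname="ai", user="ai", password="secret")
    register_vector(conn)   # teaches psycopg2 how to send and receive the vector type

    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (url, content, embedding) VALUES (%s, %s, %s)",
            (url, chunk, model.encode(chunk)),
        )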

Gluing it all together

Conceptually, our code looks like this:
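(A simplified pseudo-Python sketch that glues the previous snippets together; scrape_and_clean and urls_to_index are hypothetical names.)

    # High-level flow of python/02-put.py, greatly simplified
    for url in urls_to_index:                        # the pages we want to index
        text = scrape_and_clean(url)                 # BeautifulSoup step
        for chunk in splitter.split_text(text):      # MarkdownTextSplitter step
            embedding = model.encode(chunk)          # UAE-Large-V1 step
            cur.execute(
                "INSERT INTO documents (url, content, embedding) VALUES (%s, %s, %s)",
                (url, chunk, embedding),
            )
    conn.commit()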

You can find the code that we used to parse the data and put it into pgvector in python/02-put.py.

Search and answer

A quick search through vectors

Once vector embeddings are in the database, we can do a semantic search with the match_documents function that we created before.

Our python/03-quick-search.py does the following (a simplified sketch follows the list):

  1. Takes an input string (user question)
  2. Converts it into vector embedding
  3. Calls match_documents function 
  4. Returns the top five semantically relevant URLs for the input string. It can also return the content, as it is stored in the database as well.
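A simplified sketch of those steps (connection parameters are placeholders, and the match_documents signature is the one sketched earlier; the real script is python/03-quick-search.py):

    import psycopg2
    from pgvector.psycopg2 import register_vector
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("WhereIsAI/UAE-Large-V1", device="cuda")

    conn = psycopg2.connect(host="localhost", dbname="ai", user="ai", password="secret")
    register_vector(conn)

    question = "How do I upgrade PostgreSQL with the Operator?"   # 1. user question
    query_embedding = model.encode(question)                      # 2. vector embedding

    with conn.cursor() as cur:
        cur.execute("SELECT url FROM match_documents(%s, 5)", (query_embedding,))  # 3-4. top five URLs
        for (url,) in cur.fetchall():
            print(url)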

Answer the question with context

Now we have the context (our company data), and it all boils down to sending the proper prompt – user question + context.

We are going to use (see 04-context-search.py) a high-level Hugging Face pipeline abstraction wrapper with the T5 model. It is a basic pre-trained model for various use cases – question answering, text generation, etc.

Depending on the pipeline and the model, the prompt will look different. But for the text2text pipeline, it will look like this:
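(A hedged sketch: t5-base is used as an example checkpoint, and the question:/context: prompt layout is one common convention for T5-style models; 04-context-search.py has the exact setup.)

    from transformers import pipeline

    # device=0 uses the first GPU; drop it to run on the CPU
    generator = pipeline("text2text-generation", model="t5-base", device=0)

    query = "How do I install pgvector?"                        # the user's question
    context = "pgvector can be installed with Percona ..."      # text returned by the semantic search

    prompt = f"question: {query} context: {context}"
    answer = generator(prompt, max_length=200)[0]["generated_text"]
    print(answer)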

Here, query is the question that the user asked, and context is the data that we fetched from the database with our semantic search.

Things to consider

Model training

In our examples, we do not perform any model training and assume it can provide answers from the context we provide. It works for demo purposes but is suboptimal for real-world use cases. It is recommended to train the model on your data first. This is a topic for a separate blog post.

Reusing scripts

The scripts in the GitHub repo are given as an example. Language models and the way you structure your data in the databases strictly depend on your use cases. Do not blindly reuse the scripts.

Using public APIs

We are deliberately using open source tools, but you can easily use public APIs for embedding generation and question answering. For example, for embedding generation, you can use the OpenAI API:
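(A minimal sketch using the openai Python client, version 1.x; the API key is read from the OPENAI_API_KEY environment variable.)

    from openai import OpenAI

    client = OpenAI()  # uses the OPENAI_API_KEY environment variable

    resp = client.embeddings.create(
        model="text-embedding-ada-002",   # 1536-dimensional embeddings
        input="How do I install pgvector?",
    )
    embedding = resp.data[0].embedding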

Wrap up

Storing embeddings in pgvector is a great choice, as your teams don’t need to learn new technologies or change the existing PostgreSQL libraries. With open source tools, you can easily create your own AI or chatbot trained on proprietary company data. This will boost user experience and automate various business processes.

Try out Percona Operator for PostgreSQL with the Custom Extensions feature and pgvector. With it, you will have your database up and running in seconds.
