2023 was the year of Artificial Intelligence (AI). A lot of companies are thinking about how they can improve the user experience with AI, and the most common first step is to use company data (internal docs, ticketing systems, etc.) to answer customer questions faster or even automatically.

In this blog post, we will explain the basic concepts and show how you can build your own AI expert with your company's data. To look under the hood at how things work, we will deliberately avoid non-open source tooling. This also lets anyone try it out for free and, at the same time, makes it easy to switch the code over to a paid API, like OpenAI's.

The goal

We will execute the following steps:

  1. Provision the infrastructure
  2. Capture company data and store it in PostgreSQL with pgvector
  3. Ask questions and generate responses
    • The prompt will consist of two parts:
      • User question
      • Context – our data
    • LLM will use the prompt to generate a response to the user’s question

Create an AI Expert With Open Source

To keep the post readable, we will show code concepts in the blog post, while full examples can be found in the GitHub repository: blog-data/percona-ai-pgvector. I will provide links to the corresponding lines.

Glossary and terms

This section explains some terms used in this blog post that might be new to the reader.

Vector embeddings

A vector embedding is a numerical representation of data that captures its meaning and relationships. Essentially, it is a list of numbers. You can encode any data as vector embeddings – text, pictures, video, voice – and perform a semantic search to find matches and similarities. Embeddings are usually stored in vector databases, like pgvector. Once you have vector embeddings for your text, you can easily calculate how similar they are using math operations (like cosine distance). You can pre-process your company documents and generate embeddings, then search through them using the user's input.
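For example, with two toy vectors, the cosine distance is just a couple of NumPy operations (the vectors below are made up for illustration – real embeddings have hundreds or thousands of dimensions, but the math is the same):

    import numpy as np

    # Toy three-dimensional "embeddings"; real ones have hundreds or thousands of dimensions
    doc = np.array([0.2, 0.9, 0.1])
    query = np.array([0.25, 0.8, 0.05])

    # Cosine distance = 1 - cosine similarity; the smaller it is, the closer the meaning
    distance = 1 - np.dot(doc, query) / (np.linalg.norm(doc) * np.linalg.norm(query))
    print(distance)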

Large Language Model (LLM)

LLMs are what power AI. An LLM is a collection of deep learning algorithms that can execute various tasks – understanding text, generating responses, converting data, etc. GPT-3 and GPT-4 from OpenAI are examples of LLMs.

Tokens

In the context of Large Language Models (LLMs), the term “token” refers to a chunk of text that the model reads or generates. A token is typically not a word; it could be a smaller unit, like a character or a part of a word, or a larger one, like a whole phrase. This depends on the model.

Inputs and outputs go through the process of tokenization – splitting text into smaller units. For example, OpenAI uses Byte-Pair Encoding (BPE). The phrase “Hello, World!” is thirteen characters, but only four tokens:

  • “Hello”
  • “,”
  • “ world”
  • “!”

Furthermore, these tokens are given numerical identifiers that, as you might have guessed, are placed into a vector. So the phrase will look like this:
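(The exact identifiers depend on the tokenizer; the sketch below uses the GPT-2 BPE tokenizer from the Hugging Face transformers library purely as an example.)

    from transformers import AutoTokenizer

    # GPT-2's tokenizer is a well-known Byte-Pair Encoding (BPE) implementation
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    tokens = tokenizer.tokenize("Hello, world!")   # ['Hello', ',', 'Ġworld', '!'] - Ġ marks a leading space
    ids = tokenizer.encode("Hello, world!")        # four integer identifiers, one per token
    print(tokens, ids)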

This vector is used to create an embedding or as a prompt.

Hugging Face hub

Hugging Face is an open source company that builds solutions and tools for machine learning and AI. They have a hub that exposes thousands of LLMs, data sets, and demo apps. We will use models and libraries that Hugging Face provides.

Provision infrastructure

pgvector

pgvector is a PostgreSQL extension that allows storing vector embeddings. You can install it with Percona Distribution for PostgreSQL. I will install it on Kubernetes using Percona Operator for PostgreSQL and its latest experimental Custom Extension feature. This feature allows you to build an extension, store it in an S3 bucket, and instruct the operator to use it without the need to rebuild container images.

You can find the relevant manifests and the pre-built extension in the k8s-operator folder in the repo.

In my Custom Resource manifest (cr.yaml), I create the user and the database for our experiments and also instruct the operator to load the extension from the bucket:
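(The full cr.yaml lives in the k8s-operator folder of the repo. The fragment below is only a sketch – the users and extensions sections follow the Operator's documented fields, but double-check the names against the CRD version you are running; the bucket, secret, and extension version are placeholders.)

    # Fragment of cr.yaml - illustrative only
    spec:
      users:
        - name: ai                # user for our experiments
          databases:
            - ai                  # database for our experiments
      extensions:
        storage:
          type: s3
          bucket: pg-extensions   # bucket holding the pre-built pgvector package
          region: eu-central-1
          secret:
            name: cluster1-extensions-secret
        custom:
          - name: pgvector
            version: 0.5.1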

Preparing the database

Once the database is up and running, let’s create the table:
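(The exact schema used in the post is in python/01-pg-provision.py; the documents table below, with its url and content columns, is a minimal sketch.)

    -- Enable the extension and create a table for document chunks
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
        id        bigserial PRIMARY KEY,
        url       text,
        content   text,
        embedding vector(1024)   -- 1024 dimensions to match the UAE-Large-V1 model
    );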

vector is the new data type introduced by pgvector. We create an embedding column with the vector type and 1024 vector dimensions. Different models have different dimensions. For example, OpenAI's text-embedding-ada-002 has 1536 dimensions. We are going to use the UAE-Large-V1 model, which has 1024 embedding dimensions and is a current leader on the Massive Text Embedding Benchmark (MTEB) leaderboard. You should pick your model carefully from the very beginning, as changing it later requires you to re-create the vector embeddings for your data.

pgvector introduces three new operators that can be used to calculate similarity:

  • <-> – Euclidean distance
  • <#> – negative inner product
  • <=> – cosine distance

Cosine distance, which is based on the angle between vectors, is the usual choice in most AI products. For example, you can do something like:
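(A sketch against the hypothetical documents table from above; the query embedding would normally be passed in as a bind parameter from the application.)

    -- Three documents closest (by cosine distance) to the user's query embedding
    SELECT url
    FROM documents
    ORDER BY embedding <=> %s::vector   -- the user's query embedding, passed from the application
    LIMIT 3;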

This query will output the top three URLs ordered by similarity – the ones closest to the user's search. For simplicity, we create a match_documents function that does the magic for us. You can find it in python/01-pg-provision.py.
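Conceptually, such a helper can look roughly like this (the actual definition is in the repo; column names follow the table sketch above):

    -- Rough sketch of a match_documents helper
    CREATE OR REPLACE FUNCTION match_documents(query_embedding vector(1024), match_count int)
    RETURNS TABLE (url text, content text, similarity float)
    LANGUAGE sql STABLE
    AS $$
        SELECT d.url,
               d.content,
               1 - (d.embedding <=> query_embedding) AS similarity
        FROM documents d
        ORDER BY d.embedding <=> query_embedding
        LIMIT match_count;
    $$;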

GPU and python

GPUs are great at parallel processing, which is essential for creating vector embeddings. You can parse documents and create embeddings without a GPU, but it might take 10x longer. Public clouds provide a variety of GPU-powered machines and operating system images for deep learning. For Linux machines, you will need to install NVIDIA drivers and CUDA.

Python is widely popular for machine learning tasks. It has a lot of libraries and tools for them that are accessible through pip.

In the install script, you can find various commands for Ubuntu that install the necessary GPU drivers and Python libraries. We will discuss these libraries in detail below.

Generate embeddings from company data

Load embedding model

To generate embeddings, we will use the sentence_transformers library from Hugging Face.

As promised, we are using the UAE-Large-V1 model. It will be automatically fetched from the Hugging Face hub when you run the script for the first time.
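Loading the model takes just a couple of lines (WhereIsAI/UAE-Large-V1 is the model's hub ID; the sample sentence is illustrative, and the exact code is in python/02-put.py):

    from sentence_transformers import SentenceTransformer

    # Downloads the model from the Hugging Face hub on the first run, then loads it onto the GPU
    model = SentenceTransformer("WhereIsAI/UAE-Large-V1", device="cuda")

    embedding = model.encode("How do I install pgvector?")  # a 1024-dimensional numpy array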

device='cuda' instructs the code to use the GPU. You can switch to cpu instead, but it will result in suboptimal performance.

Scrape the company documents

This strictly depends on the data you have. In our example, we scrape Percona documentation and blog posts, which are both publicly available. We use the BeautifulSoup Python package, which simplifies HTML and XML parsing. You might also want to add logic to remove noise from the documents before converting them into embeddings – repeated menus, various banners, etc. – so that only the pure content remains.
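A minimal scraping sketch might look like this (the URL and the article tag are placeholders – the right selector depends on your site's markup):

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.percona.com/blog/"            # placeholder; iterate over your real pages
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Keep only the main content, dropping menus, banners, and other noise
    article = soup.find("article") or soup.body
    text = article.get_text(separator="\n", strip=True)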

Data chunks

You can take a huge article and convert it into a single vector embedding, but that might harm your semantic search. It is recommended to split the text into smaller chunks – paragraphs, sentences, or sometimes even words. The decision on how to split the data should be driven by the model you are going to use to generate answers in the end and by the data itself.

For example, our documentation is stored in Markdown. In our code, we use langchain's MarkdownTextSplitter to generate chunks out of these files. See python/02-put.py.
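A minimal sketch (the chunk sizes and the file name are just examples to illustrate the API):

    from langchain.text_splitter import MarkdownTextSplitter

    # Chunk size and overlap are tuning knobs that depend on your model and data
    splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

    with open("docs/installation.md") as f:   # hypothetical markdown file
        chunks = splitter.split_text(f.read())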

Load data into pgvector

Once we create the embeddings, it is time to put them into pgvector. We use psycopg2 to connect to PostgreSQL and the pgvector library to work with the vectors themselves:
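(Connection parameters below are placeholders, and url, chunk, and model come from the earlier snippets.)

    import psycopg2
    from pgvector.psycopg2 import register_vector

    conn = psycopg2.connect(host="localhost", dbname="ai", user="ai", password="secret")
    register_vector(conn)   # teaches psycopg2 how to send and receive the vector type

    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (url, content, embedding) VALUES (%s, %s, %s)",
            (url, chunk, model.encode(chunk)),
        )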

Gluing it all together

Conceptually, our code looks like this:
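(A simplified pseudo-Python sketch that glues the previous snippets together; scrape_and_clean and urls_to_index are hypothetical names.)

    # High-level flow of python/02-put.py, greatly simplified
    for url in urls_to_index:                        # the pages we want to index
        text = scrape_and_clean(url)                 # BeautifulSoup step
        for chunk in splitter.split_text(text):      # MarkdownTextSplitter step
            embedding = model.encode(chunk)          # UAE-Large-V1 step
            cur.execute(
                "INSERT INTO documents (url, content, embedding) VALUES (%s, %s, %s)",
                (url, chunk, embedding),
            )
    conn.commit()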

You can find the code that we used to parse the data and put it into pgvector in python/02-put.py.

Search and answer

A quick search through vectors

Once vector embeddings are in the database, we can do a semantic search with the match_documents function that we created before.

Our python/03-quick-search.py does the following (a simplified sketch follows the list):

  1. Takes an input string (user question)
  2. Converts it into vector embedding
  3. Calls match_documents function 
  4. Returns the top five semantically relevant URLs for the input string. It can also return the content, as it is stored in the database as well.
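A simplified sketch of those steps (connection parameters are placeholders, and the match_documents signature is the one sketched earlier; the real script is python/03-quick-search.py):

    import psycopg2
    from pgvector.psycopg2 import register_vector
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("WhereIsAI/UAE-Large-V1", device="cuda")

    conn = psycopg2.connect(host="localhost", dbname="ai", user="ai", password="secret")
    register_vector(conn)

    question = "How do I upgrade PostgreSQL with the Operator?"   # 1. user question
    query_embedding = model.encode(question)                      # 2. vector embedding

    with conn.cursor() as cur:
        cur.execute("SELECT url FROM match_documents(%s, 5)", (query_embedding,))  # 3-4. top five URLs
        for (url,) in cur.fetchall():
            print(url)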

Answer the question with context

Now we have the context (our company data), and it all boils down to sending the proper prompt – user question + context.

We are going to use (see 04-context-search.py) a high-level Hugging Face pipeline abstraction wrapper with the T5 model. It is a basic pre-trained model for various use cases – question answering, text generation, etc.

Depending on the pipeline and the model, the prompt will look different. But for the text2text pipeline, it will look like this:
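(A hedged sketch: t5-base is used as an example checkpoint, and the question:/context: prompt layout is one common convention for T5-style models; 04-context-search.py has the exact setup.)

    from transformers import pipeline

    # device=0 uses the first GPU; drop it to run on the CPU
    generator = pipeline("text2text-generation", model="t5-base", device=0)

    query = "How do I install pgvector?"                        # the user's question
    context = "pgvector can be installed with Percona ..."      # text returned by the semantic search

    prompt = f"question: {query} context: {context}"
    answer = generator(prompt, max_length=200)[0]["generated_text"]
    print(answer)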

Here, query is the question that the user asked, and context is the data that we fetched from the database with our semantic search.

Things to consider

Model training

In our examples, we do not perform any model training and assume it can provide answers from the context we provide. It works for demo purposes but is suboptimal for real-world use cases. It is recommended to train the model on your data first. This is a topic for a separate blog post.

Reusing scripts

The scripts in the GitHub repo are given as an example. Language models and the way you structure your data in the databases strictly depend on your use cases. Do not blindly reuse the scripts.

Using public APIs

We are deliberately using open source tools, but you can easily use public APIs for embedding generation and question answering. For example, for embedding generation, you can use the OpenAI API:
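(A minimal sketch using the openai Python client, version 1.x; the API key is read from the OPENAI_API_KEY environment variable.)

    from openai import OpenAI

    client = OpenAI()  # uses the OPENAI_API_KEY environment variable

    resp = client.embeddings.create(
        model="text-embedding-ada-002",   # 1536-dimensional embeddings
        input="How do I install pgvector?",
    )
    embedding = resp.data[0].embedding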

Wrap up

Storing embeddings in pgvector is a great choice, as your teams don’t need to learn new technologies or change the existing PostgreSQL libraries. With open source tools, you can easily create your own AI or chatbot trained on proprietary company data. This will boost user experience and automate various business processes.

Try out Percona Operator for PostgreSQL with the Custom Extensions feature and pgvector. With it, you will have your database up and running in seconds.
