Hadi Fadlallah
Creating a full-text index using Sphinx search engine

Building full-text indexes using the Sphinx search engine

August 4, 2020 by

In the previously published article, Getting started with Sphinx search engine, we talked about the Sphinx search engine and how to install it on the Windows operating system. In this article, we will talk about building full-text indexes using Sphinx. We will be covering seven topics:

  1. The Sphinx configuration file
  2. Defining data sources
  3. Defining full-text indexes
  4. Indexer and Searchd services options
  5. Database preparation
  6. Indexing
  7. Using Linked servers to access indexes

The Sphinx configuration file

As we mentioned in the previous article, a full-text index is defined within a configuration file called “Sphinx.conf” which is composed of three main sections:

  1. Data Source definitions: where we should specify the data we need to index; the database management system, the connection string, the source table, or SQL command
  2. Full-text indexes definitions: where we should define the full-text indexes we need to build
  3. Program-wide settings: this section is dedicated to the indexer and searchd services

Defining data sources

Each source must be declared in the configuration file using the source block, as shown below:

source source_example
{
type = …
sql_host = …
sql_user = …
sql_pass = …
sql_db = …
mssql_winauth = …
mssql_unicode = …
}

After the keyword “source” you should write the source name (we used source_example), then within the source block, you should define the following parameters:

  • Type: The data source type. Known types: odbc, mssql (Microsoft SQL Server), mysql
  • Sql_host: The data source host/instance
  • Sql_user: The authentication user
  • Sql_pass: The authentication password
  • Sql_db: The database name
  • Mssql_winauth: Use Windows authentication?
  • Mssql_unicode: Use Unicode?

We can define the data source in one block (class) or, you can write multiple classes and use inheritance as we will do in this article.

First, we will add the “base” source block to define the connection string parameters as following:

source base
{
type = mssql
sql_host = Mypc\SQLDATA
sql_user =
sql_pass =
sql_db = AdventureWorks2017
mssql_winauth = 1
mssql_unicode = 1
}

Since we will be using Windows authentication, then there is no need to specify the user and password parameters.

After defining the connection string class, we will create another class that inherits from the “base” class, where we will specify the data structure. Each column can be defined as an attribute or as a field.

Attributes are used for filtering; there are not full-text indexed. Besides, fields are full-text indexed and can be used for filtering. For each column, we should specify the data type using the Sphinx search engine keywords

In this example, we will read the name, price, and modified date from the [Production].[Product] table.

source product_base: base
{
sql_field_string = name
sql_attr_float = price
sql_attr_timestamp = date
}

The name column is defined as a field, while price and date are defined as attributes.

After defining the data structure, we should define the SQL command used as a data source using the “sql_query” parameter as following:

source product : product_base
{
sql_query = \
Select \
ProductID as ‘id’, \
Name as ‘name’, \
ListPrice as ‘price’, \
ModifiedDate_TS as ‘date’ \
From Production.Product
}

It is mandatory to select the primary key (ProductID) as the first column in the SQL command, even if it is not defined within the data structure block (Id column – similar to defined attributes – is not full-text indexed). Note that only one main query can be defined within the full-text index.

As we mentioned before, we can put all code block below within one block as follows:

source product
{
type = mssql
sql_host = Mypc\SQLDATA
sql_user =
sql_pass =
sql_db = AdventureWorks2017
mssql_winauth = 1
mssql_unicode = 1
sql_field_string = name
sql_attr_float = price
sql_attr_timestamp = date
sql_query = \
Select \
ProductID as ‘id’, \
Name as ‘name’, \
ListPrice as ‘price’, \
ModifiedDate_TS as ‘date’ \
From Production.Product
}

Defining full-text indexes

To define a full-text index using Sphinx, we should specify the following properties within the “index” code block.

  • Source: The data source class defined within the configuration file
  • Path: The directory path where the index files will be stored (This path must exist)
  • Charset_type: The index character set used

The index code block should look like the following:

Index product
{
source = product
path = E:/sphinx/sphinx-3.2.1/data/index/product
charset_type = utf-8
}

Note that more attributes can be used within the index and data source blocks. We can learn more about these attributes from the official sphinx documentation.

Indexer and Searchd services options

As we mentioned in the previous article, Sphinx has two main services:

  • Indexer: The service used to build full-text indexes
  • Searchd: The daemon used to search the created full-text indexes

As we also mentioned that Searchd service parameters should be defined within the configuration file as follows.

searchd
{
listen = 9306:mysql41
pid_file = E:/sphinx/sphinx-3.2.1/data/searchd.pid
log = E:/sphinx/sphinx-3.2.1/data/log/log.txt
query_log = E:/sphinx/sphinx-3.2.1/data/log/query_log.txt
binlog_path = E:/sphinx/sphinx-3.2.1/data/binlog/
}

These parameters were explained in the previous article within the “Setting up Sphinx” section.

Also, we can configure the Indexer service using the Indexer source block. Some important properties can be set, such as:

  • mem_limit: Indexing RAM usage limit. Optional, default is 128M. Enforced memory usage limit that the indexer will not go above
  • max_iops: Maximum I/O operations per second, for I/O throttling. Optional, default is 0 (unlimited)

indexer
{
mem_limit = 520M
}

Database preparation

One thing worth mentioning is that Sphinx cannot handle the date and DateTime data types. All date fields should be converted into UNIX_Timestamp (integer value). We can use the following SQL function to convert dates into Unix_Timestamp.

To use ModifiedDate as an attribute within the Sphinx search engine index, we should create a new filed of type integer (ModifiedDate_TS) and fill it using the function we created above:

The result should be as follows:

Converting the datetime column to Unix timestamp

Figure 1 – Converting the datetime column to Unix timestamp

Indexing

Now we should open the command prompt utility and use the following command to start building full-text indexes:

E:\Sphinx\sphinx-3.2.1\bin\indexer –all –config E:\sphinx\sphinx-3.2.1\etc\sphinx.conf –rotate –print-queries

The indexer service generates the following output:

Creating a full-text index using Sphinx search engine

Figure 2 – Creating a Sphinx full-text index

If no error messages are shown, the full-text index is created successfully (It is recommended to check for warning messages).

To check the created index, we can open the MySQL command prompt utility using the following command:

mysql -h 127.0.0.1 -P 9306

When the MySQL command prompt is open, we can query the created index using the following command:

The result should be as following:

Querying the full-text index created using the Sphinx search engine

Figure 3 – Querying the created full-text index

Connecting to Sphinx search engine using SQL Server linked server object

The last thing we will explain in this article is how to use a linked server object to connect to the Sphinx search engine and to execute queries.

To establish a connection with Sphinx from the SQL Server management studio, we should first install the MySQL ODBC connector from the following link.

After installation is completed, use the following lines of code to create a linked server to the Sphinx:

After the linked server is created, we can use OPENQUERY() option to send queries to the Sphinx engine as follows:

We may receive the following error:

OLE DB provider “MSDASQL” for linked server “SPHINX_SEARCH” returned message “[MySQL][ODBC 8.0(a) Driver]Driver does not support server versions under 4.1.1”.

As mentioned in the official Sphinx search engine forum, this error is thrown since the MySQL server version is not specified within the configuration file, while the default value is under version 4.4.1. To solve this problem, we added the following line within the “searchd” code clock in the configuration file:

Linked server error

Figure 4 – Linked server error

mysql_version_string = 8.0.20

The whole block should look like the following:


searchd
{
listen = 9306:mysql41
pid_file = E:/sphinx/sphinx-3.2.1/data/searchd.pid
log = E:/sphinx/sphinx-3.2.1/data/log/log.txt
query_log = E:/sphinx/sphinx-3.2.1/data/log/query_log.txt
binlog_path = E:/sphinx/sphinx-3.2.1/data/binlog/
mysql_version_string = 8.0.20
}

Note that we can use the “mysql -V” command to get the MySQL Server version installed on your machine.

Now, we should restart the SphinxSearch service and try to rerun the query. The result should look like:

Result shown using linked server

Figure 5 – Querying Sphinx using linked server object

Why connecting to Sphinx using SQL Server linked servers? The answer is that in many cases, we need to benefit from the Sphinx full-text indexes and integrate the result with other data stored in SQL Server. Note that it may be an excellent choice to go with a similar scenario when we need to minimize processing on SQL Server databases, and indexes are stored on a separate machine.

Conclussion

In this article, we provided a step-by-step guide for building full-text indexes in SQL Server databases using the Sphinx search engine. We covered several topics such as setting up the Sphinx configuration file, the database preparation needed before indexing, building full-text indexes, and how to use SQL Server linked server object to connect to Sphinx and execute queries.

In the next article in this series, we will talk about the Manticore search engine, which is a separate product built from Sphinx’s second release (2.x).

Table of contents

Getting started with Sphinx search engine
Building full-text indexes using the Sphinx search engine
Manticore search: a continuation of the Sphinx search engine
Hadi Fadlallah
Indexes, Search

About Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com . Also, he published several article series about Biml, SSIS features, Search engines, Hadoop, and many other technologies. Besides working with SQL Server, he worked with different data technologies such as NoSQL databases, Hadoop, Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional. On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science focusing on Big Data quality assessment techniques. Hadi really enjoys learning new things everyday and sharing his knowledge. You can reach him on his personal website. View all posts by Hadi Fadlallah

168 Views