This blog was originally published in February 2019 and was updated in September 2023.

Web applications rely on databases to run the internet, powering everything from e-commerce platforms to social media networks to streaming services. MySQL is one of the most popular database management systems, playing a pivotal role in the functionality and performance of web applications.

In today’s blog, I’ll take a look at MySQL collation and charset settings to shed light on how they impact the performance of web applications and how to use them to effectively communicate with your users.

Understanding Character Sets and Encoding in MySQL

Character sets and encoding in MySQL play a vital role in how data is stored and retrieved in a database. A character set is a collection of characters, such as letters, numbers, and symbols, each with a unique representation; it defines which characters can be stored and how they are interpreted.

Character encoding refers to the method used to represent characters as binary data for storage and transmission. It specifies how characters are converted into binary code and vice-versa. 

The choice of character set and encoding impacts not only efficiency but also how the data appears to users.

How Character Sets Affect Data Storage and Retrieval

You can specify the character set for each column when you create a table, indicating the set of characters allowed in that column. This affects the type and range of characters that can be inserted into the column.

When data is inserted into the database, it is converted into the specified character set’s binary representation and stored accordingly. When retrieving data, MySQL converts the binary representation back into characters according to the character set and encoding rules. This conversion ensures that the data appears correctly to users and can be processed and displayed as intended.
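As a minimal sketch (the table and column names here are hypothetical), the character set and collation can be declared per column when the table is created:

CREATE TABLE products (
    id INT PRIMARY KEY,
    -- single-byte Western European characters only
    sku VARCHAR(32) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
    -- full Unicode, including supplementary characters such as emoji
    description TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
);

Any value written to sku must be representable in latin1, while description can hold text from any script; MySQL performs the binary conversion described above transparently on insert and on read.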

An Example Illustrating Character Set Concepts

If you have a MySQL database for a multilingual website, you might use the UTF-8 character set, which supports characters from various languages, including English, Chinese, Arabic, and others. Using UTF-8 encoding, you can store and retrieve data in these languages seamlessly, ensuring that text displays correctly for users worldwide.

However, if you use a character set that doesn’t support specific characters, such as storing Arabic text in a column that uses a Chinese character set, the data may be rejected or stored as replacement characters and will not display correctly.
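To make that concrete, here is a small hypothetical example: a utf8mb4 column accepts text from any script, while trying to put the same text into a latin1 column fails or is mangled, depending on the server’s sql_mode:

CREATE TABLE greetings (
    lang VARCHAR(10),
    message VARCHAR(100) CHARACTER SET utf8mb4
);

-- All three rows are stored and come back intact
INSERT INTO greetings VALUES ('en', 'Hello'), ('zh', '你好'), ('ar', 'مرحبا');

-- If message were declared CHARACTER SET latin1 instead, the same INSERT
-- would be rejected with an "Incorrect string value" error under strict
-- SQL mode, or the characters would be replaced with '?' in non-strict mode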

MySQL and Percona work better together. Download and install Percona Distribution for MySQL today!

MySQL Collation and its Relationship with Character Sets

Collation refers to a set of rules and conventions that dictate how character data is compared and sorted. It plays a crucial role in determining the order in which data is retrieved from the database and how various string operations, such as searching and filtering, are performed.

Collation is closely intertwined with character sets, defining how characters within a specific character set are ordered. To ensure consistent results, it’s important to select compatible collations for your chosen character sets. Incompatibilities between character sets and collations can lead to unexpected sorting and comparison outcomes, which could lead to issues in database operations and application functionality. 
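You can ask the server which collations belong to a given character set, which is a quick way to avoid pairing a column with an incompatible collation (MySQL rejects such a pairing with a "COLLATION ... is not valid for CHARACTER SET ..." error):

-- All collations that belong to the utf8mb4 character set
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Every installed character set and its default collation
SHOW CHARACTER SET;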

Choosing the Appropriate Character Set

When deciding on the character set for your data, several important considerations should guide your decision-making process. First and foremost, you should take into account the nature of your data and your target audience. Is your data primarily composed of Latin-based text, or do you expect a diverse audience that requires support for various languages? Additionally, it’s crucial to ensure that the character set you select aligns with the encoding used in your web application, ensuring seamless communication between your database and the application layer.

If your application serves a global audience, it’s often wise to opt for a Unicode character set like UTF-8 (utf8mb4 in MySQL). Unicode provides comprehensive support for various languages and scripts, making it an excellent choice for multilingual applications. However, if your application predominantly serves a single language or region, choosing a character set optimized for that specific context can lead to more efficient data storage and improved overall performance.

Dealing with the complexities of multilingual data may involve not only choosing the right character set but also ensuring that your database design, application code, and collation settings are configured to handle multilingual content effectively. It’s essential to plan for data input, storage, and retrieval in a manner that accommodates working with diverse character sets and languages while maintaining a seamless user experience.
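One practical piece of that planning is making sure the connection between the application and MySQL negotiates the same character set the tables use. A common approach (shown here for utf8mb4; adjust to your chosen set) is:

-- Per connection, issued by the application or client after connecting
SET NAMES utf8mb4;

-- Or as server-wide defaults in the [mysqld] section of my.cnf:
--   character_set_server = utf8mb4
--   collation_server     = utf8mb4_general_ci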

Impact of Charset and Collation on Indexing Strategies

Charset and collation choices in MySQL can have a significant impact on indexing strategies. These choices influence how data is stored, sorted, and compared within the database, which directly affects how indexes function.

When it comes to indexing, one of the key considerations is the length of indexed values — especially for text columns. Different character sets have different storage requirements for characters, with some requiring more bytes to represent certain characters. For example, UTF-8, a widely used character set, can use one to four bytes per character, while Latin1 typically uses one byte per character. This difference in storage size can lead to variations in the size of index entries, which can impact efficiency.
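To put rough numbers on that: InnoDB limits index key prefixes to 767 bytes with the older COMPACT/REDUNDANT row formats and 3072 bytes with DYNAMIC, so a VARCHAR(255) key that needs at most 255 bytes in latin1 can need up to 1020 bytes in utf8mb4. A hypothetical sketch of how that shows up in table definitions:

-- latin1: at most 255 bytes of key data per value
CREATE TABLE t_latin1 (
    name VARCHAR(255) CHARACTER SET latin1,
    INDEX idx_name (name)
);

-- utf8mb4: up to 4 bytes per character, so a 191-character prefix
-- (191 x 4 = 764 bytes) is a common way to stay under the 767-byte limit
CREATE TABLE t_utf8mb4 (
    name VARCHAR(255) CHARACTER SET utf8mb4,
    INDEX idx_name (name(191))
);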

Collation determines how string comparisons are executed, which affects how indexes are used in sorting and searching operations. For example, if your application requires case-insensitive searches, selecting a case-insensitive collation can be more efficient than using a case-sensitive one. However, it’s essential to note that different collations can have different performance implications. Some collations may be faster for sorting operations but slower for searching, while others may be optimized for specific languages or uses.

To illustrate, let’s look at a real-world example. Suppose you’re in the process of building a multilingual system that needs to support a diverse array of languages. In this scenario, opting for a UTF-8 character set is a logical choice, as it provides comprehensive language support, and you could employ a collation that facilitates case-insensitive searches, enhancing user-friendly information retrieval.

Alternatively, if you find yourself developing an application where space constraints are a concern, you may opt for a more space-efficient character set like Latin1. Here, it becomes crucial to select a collation that strikes a balance between sorting and searching performance, ensuring efficient data handling.
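A rough sketch of those two table definitions (names are hypothetical; utf8mb4_0900_ai_ci is the MySQL 8.0 default collation, so on 5.7 you would use utf8mb4_general_ci or utf8mb4_unicode_ci instead):

-- Multilingual system: full Unicode plus accent- and case-insensitive
-- matching for user-friendly searches
CREATE TABLE articles_intl (
    title VARCHAR(200)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Space-constrained, single-language system: one byte per character
-- and a simple case-insensitive collation
CREATE TABLE articles_latin (
    title VARCHAR(200)
) CHARACTER SET latin1 COLLATE latin1_swedish_ci;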

Ultimately, the influence of character sets and collations on your indexing strategies should meet the unique requirements and priorities of your application, ensuring optimal database performance and a positive user experience.

Are you looking to optimize database efficiency and reduce the risk of downtime? This on-demand webinar covers everything you need to know. Watch the recorded webinar.

Charset and Collation Effects on Query Execution

Choosing the correct charset and collation settings in a MySQL database can significantly impact query execution time and overall database performance. When not appropriately configured, they can become bottlenecks in query execution. For example, if your database uses a character set that doesn’t match the character set of the data sent in a query, MySQL may need to perform character set conversions on the fly, resulting in slower query execution times.

In addition, collation settings affect string comparisons, which are common in queries involving WHERE clauses or JOIN operations. Using an inefficient or mismatched collation can lead to suboptimal query performance. For instance, if your columns use a case-insensitive collation but most of your queries need case-sensitive matching, every comparison does work you don’t need, and overriding the collation at query time can prevent MySQL from using an index.
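For example (with a hypothetical users table whose email column is indexed and uses a case-insensitive utf8mb4 collation), forcing a different collation at query time changes the comparison semantics, but it also typically stops MySQL from using the index on that column:

-- Matches regardless of letter case and can use the index on email
SELECT * FROM users WHERE email = 'Ada@Example.com';

-- Case-sensitive match, but the COLLATE clause on the indexed column
-- usually forces a full scan instead of an index lookup
SELECT * FROM users WHERE email COLLATE utf8mb4_bin = 'Ada@Example.com';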

To demonstrate the importance of optimizing charset and collation settings, here’s an example. Say you have a web application with a user database using a UTF-8 character set and a case-insensitive collation, and the application queries user data to validate login credentials. By switching to a UTF-8 character set with a case-sensitive collation, user authentication queries can be significantly faster, as they no longer require case-insensitive comparisons. This optimization improves query execution times and provides a much better user experience.
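A minimal sketch of that change, assuming a hypothetical users table on MySQL 8.0 (utf8mb4_0900_as_cs is 8.0’s accent- and case-sensitive collation; utf8mb4_bin is an alternative that also exists on 5.7):

-- Before: username compares case-insensitively (e.g., utf8mb4_0900_ai_ci)
ALTER TABLE users
    MODIFY username VARCHAR(64)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs;

-- 'Alice' and 'alice' are now distinct values, so the login lookup
-- no longer pays for case folding during comparison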

Testing Read-Only CPU-Intensive Workloads

Following my post MySQL 8 is not always faster than MySQL 5.7, this time I decided to test very simple read-only CPU-intensive workloads when all data fits in memory. In this workload, there are NO IO operations, only memory and CPU operations.

My Testing Setup

Environment specification

  • Release | Ubuntu 18.04 LTS (bionic)
  • Kernel | 4.15.0-20-generic
  • Processors | physical = 2, cores = 28, virtual = 56, hyperthreading = yes
  • Models | 56xIntel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
  • Memory Total | 376.6G
  • Provider | packet.net x2.xlarge.x86 instance

I will test two workloads, sysbench oltp_read_only and oltp_point_select, varying the number of threads:

sysbench oltp_read_only --mysql-ssl=off --report-interval=1 --time=300 --threads=$i --tables=10 --table-size=10000000 --mysql-user=root run

sysbench oltp_point_select --mysql-ssl=off --report-interval=1 --time=300 --threads=$i --tables=10 --table-size=10000000 --mysql-user=root run

The results for OLTP read-only (latin1 character set):

Threads | MySQL 5.7.25 throughput | MySQL 8.0.15 throughput | Throughput ratio (5.7 / 8.0)
1       | 1241.18                 | 1114.4                  | 1.11
4       | 4578.18                 | 4106.69                 | 1.11
16      | 15763.64                | 14303.54                | 1.10
24      | 21384.57                | 19472.89                | 1.10
32      | 25081.17                | 22897.04                | 1.10
48      | 32363.27                | 29600.26                | 1.09
64      | 39629.09                | 35585.88                | 1.11
128     | 38448.23                | 34718.42                | 1.11
256     | 36306.44                | 32798.12                | 1.11

The results for point_select (latin1 character set):

Threads | MySQL 5.7.25 throughput | MySQL 8.0.15 throughput | Throughput ratio (5.7 / 8.0)
1       | 31672.52                | 28344.25                | 1.12
4       | 110650.7                | 98296.46                | 1.13
16      | 390165.41               | 347026.49               | 1.12
24      | 534454.55               | 474024.56               | 1.13
32      | 620402.74               | 554524.73               | 1.12
48      | 806367.3                | 718350.87               | 1.12
64      | 1120586.03              | 972366.59               | 1.15
128     | 1108638.47              | 960015.17               | 1.15
256     | 1038166.63              | 891470.11               | 1.16

We can see that in the OLTP read-only workload, MySQL 8.0.15 is slower by 10%, and for the point_select workload MySQL 8.0.15 is slower by 12-16%.

Although the difference is not dramatic, it is enough to show that MySQL 8.0.15 does not perform as well as MySQL 5.7.25 in the workloads I am testing.

However, it appears that the dynamic of the results will change if we use the utf8mb4 character set instead of latin1.

Let’s compare MySQL 5.7.25 latin1 vs. utf8mb4, as utf8mb4 is now the default CHARSET in MySQL 8.0.

But before we do that, let’s also take a look at COLLATION.

MySQL 5.7.25 uses utf8mb4_general_ci as the default collation for utf8mb4. However, I read that to get proper sorting and comparison for Eastern European languages, you may want to use the utf8mb4_unicode_ci collation. For MySQL 8.0.15, the default collation is utf8mb4_0900_ai_ci.
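For reference, one way to run this kind of comparison (this is an assumption about the setup, not the exact commands used for the numbers below) is to create the sysbench schema with the character set and collation under test before the prepare step:

-- Schema for the utf8mb4 runs
CREATE DATABASE sbtest
    DEFAULT CHARACTER SET utf8mb4
    DEFAULT COLLATE utf8mb4_general_ci;

-- For the latin1 runs, the same with
-- DEFAULT CHARACTER SET latin1 DEFAULT COLLATE latin1_swedish_ci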

So let’s compare each version latin1 vs utf8mb4 (with default collation). First 5.7:

Threads | utf8mb4_general_ci | latin1   | latin1 ratio
4       | 2957.99            | 4578.18  | 1.55
24      | 13792.55           | 21384.57 | 1.55
64      | 24516.99           | 39629.09 | 1.62
128     | 23977.07           | 38448.23 | 1.60

So here we can see that utf8mb4 in MySQL 5.7 is much slower than latin1 (by 55-62%).

And the same for MySQL 8.0.15

MySQL 8.0, default collations

Threads | utf8mb4_0900_ai_ci (default) | latin1   | latin1 ratio
4       | 3968.88                      | 4106.69  | 1.03
24      | 18446.19                     | 19472.89 | 1.06
64      | 32776.35                     | 35585.88 | 1.09
128     | 31301.75                     | 34718.42 | 1.11

For MySQL 8.0, the hit from utf8mb4 is much lower (up to 11%).

Now let’s compare all of the collations for utf8mb4.

For MySQL 5.7

MySQL 5.7 utf8mb4

Threads | utf8mb4_general_ci (default) | utf8mb4_bin | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci
4       | 2957.99                      | 3328.8      | 2157.61            | 1942.78
24      | 13792.55                     | 15857.29    | 9989.96            | 9095.17
64      | 24516.99                     | 28125.16    | 16207.26           | 14768.64
128     | 23977.07                     | 27410.94    | 15970.6            | 14560.6

If you plan to use utf8mb4_unicode_ci, you will take an even bigger performance hit (compared to utf8mb4_general_ci).

And for MySQL 8.0.15

MySQL 8.0 utf8mb4

Threads | utf8mb4_general_ci | utf8mb4_bin | utf8mb4_unicode_ci | utf8mb4_0900_ai_ci (default)
4       | 3461.8             | 3628.01     | 3363.7             | 3968.88
24      | 16327.45           | 17136.16    | 15740.83           | 18446.19
64      | 28960.62           | 30390.29    | 27242.72           | 32776.35
128     | 27967.25           | 29256.89    | 26489.83           | 31301.75

So now let’s compare MySQL 8.0 vs MySQL 5.7 in utf8mb4 with default collations:

MySQL 8.0 and MySQL 5.7, default collations

Threads | MySQL 8.0 utf8mb4_0900_ai_ci | MySQL 5.7 utf8mb4_general_ci | MySQL 8.0 ratio
4       | 3968.88                      | 2957.99                      | 1.34
24      | 18446.19                     | 13792.55                     | 1.34
64      | 32776.35                     | 24516.99                     | 1.34
128     | 31301.75                     | 23977.07                     | 1.31

So there we are. In this case, MySQL 8.0 is actually better than MySQL 5.7 by 31-34%.

Conclusions After Testing

There are several observations to make:

  • MySQL 5.7 outperforms MySQL 8.0 with the latin1 charset
  • MySQL 8.0 outperforms MySQL 5.7 by a wide margin if we use the utf8mb4 charset
  • Be aware that utf8mb4 is now the default in MySQL 8.0, while MySQL 5.7 uses latin1 by default
  • When running comparisons between MySQL 8.0 and MySQL 5.7, be aware of which charset you are using, as it can affect the comparison a lot (see the check below)
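A quick way to check which defaults a server is actually running with:

-- Server-wide defaults
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

-- The full picture, including client and connection settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';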

Best Practices for Charset and Collation Optimization

Optimizing charset and collation settings in MySQL involves careful planning and execution to ensure compatibility, efficiency, and long-term maintainability. Choose character sets and collations that align with your application’s needs: consider the types of data your database will store and process, as well as the languages and character sets used in your application. For multilingual applications, utf8mb4 (MySQL’s full UTF-8 implementation) is a popular choice due to its broad character support. When selecting collations, opt for those that match your application’s case-sensitivity requirements to prevent unnecessary overhead.

Making charset and collation changes to an existing database is trickier. Back up your data before making any modifications, conduct a thorough analysis of the database’s current state and the potential impact of the changes, and test them in a controlled environment to ensure they won’t disrupt operations.
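For a single table, the usual mechanism is ALTER TABLE ... CONVERT TO, which rewrites the existing data in the new character set; a sketch with a hypothetical table name (run it against a staging copy first):

-- Convert the table and all of its character columns
ALTER TABLE customers
    CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Change only the table default for columns added later, leaving
-- existing column definitions untouched
ALTER TABLE customers
    DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;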

Also, you should regularly review and update your charset and collation settings as your application evolves and data requirements change. Robust monitoring solutions, like Percona Monitoring and Management, can track query performance, identify bottlenecks, and ensure that charset and collation settings continue to align with your application’s demands.

Looking to upgrade to MySQL 8.0 or stay on 5.7? Percona can help.

Choosing the appropriate character set and collation for your data involves weighing your data’s nature, your target audience, and your application’s needs. As the benchmarks above show, optimized charset and collation settings play a real role in query execution times, the user experience, and overall database performance.

Percona offers comprehensive support solutions to ensure a smooth transition to MySQL 8.0 or EOL support for 5.7.


Comments

Camille

Interesting results. I guess that the performance drop comes from additional conversions, do you confirm?
So what is the right setup if I need latin1 on MySQL 8?

vadimtk

Camille,

You can set latin1 in MySQL 8 in different ways.

1. Globally, in my.cnf:
character_set_server=latin1
collation_server=latin1_swedish_ci

2. Per table:
CREATE TABLE t (c CHAR(20)) CHARACTER SET latin1;

3. Per column:
CREATE TABLE t2 (col1 VARCHAR(5) CHARACTER SET latin1);

Karl

This is something I’ve thought a lot about recently when designing tables.

Since we can specify charset and collation per column, does it not make a lot of sense to be careful to set latin1 as the charset for varchar columns that are used either for joins or being indexed? This is of course only possible if we know that the column is ANSI only which it often is for many things such as guids and hashes.

We have a system with a lot of indexes and joins being done on varchars. That’s just the nature of what’s being done and can’t really be helped by normalization, at least not with a performance benefit. It’s obvious that indexes are a lot larger with utf8mb4 meaning that (I assume) a lot less data can be kept in RAM. It’s interesting that MySQL chose this default, is it really that big a problem that people can’t store emoticons into their tables by default compared to performance in high usage scenarios?

It would be very interesting to see benchmarks on the impact of index size, performance with data size larger than RAM and join performance on latin1 vs utf8mb4.

Karl

Ok I take it back, I don’t actually join on varchars. However almost all our lookups are on secondary indexes on varchar(64) columns. That’s just the way it has to be because this data comes from external sources.

But the lookup speed / data set larger than RAM issue with indexes on varchar columns with different charsets is still something that would be very interesting to hear more about!

Gaetano Giunta

It would be interesting to know if there is any difference in performance, for MySQL 5.7, between the ‘historical’ 3-byte utf8_general_ci and the ‘modern’ utf8mb4.
This could be, e.g., a driving factor in deciding about upgrading existing databases (which have no stringent need to support characters outside the BMP at the moment)