Modern web applications increasingly need to support a diverse range of languages and character sets, and with the rise of globalization, storing and processing multilingual data has become essential. MySQL, one of the most popular relational database management systems, has supported utf8mb4 since version 5.5.3 and made it the default character set in version 8.0. In this blog post, we will explore utf8mb4 and its advantages in MySQL 8.0, backed by practical examples.

Understanding utf8mb4

Before diving into the benefits, let’s clarify what utf8mb4 represents. In MySQL, “utf8” refers to a character encoding that supports the Unicode character set using a maximum of three bytes per character. However, the original utf8 implementation in MySQL does not cover all Unicode characters. utf8mb4, on the other hand, is a modified version of utf8 that supports the complete Unicode character set, including emojis and other supplementary characters, by using a maximum of four bytes per character.

The original utf8 implementation in MySQL only supports characters from the Basic Multilingual Plane (BMP), that is, code points U+0000 through U+FFFF. The BMP covers most characters in everyday use across modern scripts, but not emojis, many historic scripts, or the rarer CJK ideographs. Those live in the supplementary planes and need four bytes in UTF-8, which is exactly what utf8mb4 adds.
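To make the byte counts concrete, here is a small sketch that compares character length and byte length for four sample characters. It uses hex literals with the _utf8mb4 introducer so the result does not depend on the client's encoding; the labels and column aliases are only illustrative.

```sql
-- CHAR_LENGTH() counts characters, LENGTH() counts bytes.
-- Each hex literal is the UTF-8 encoding of a single character.
SELECT 'U+0061'  AS code_point,
       CHAR_LENGTH(_utf8mb4 X'61')       AS chars,
       LENGTH(_utf8mb4 X'61')            AS bytes
UNION ALL
SELECT 'U+00E9',
       CHAR_LENGTH(_utf8mb4 X'C3A9'),
       LENGTH(_utf8mb4 X'C3A9')
UNION ALL
SELECT 'U+20AC',
       CHAR_LENGTH(_utf8mb4 X'E282AC'),
       LENGTH(_utf8mb4 X'E282AC')
UNION ALL
SELECT 'U+1D306',                         -- outside the BMP
       CHAR_LENGTH(_utf8mb4 X'F09D8C86'),
       LENGTH(_utf8mb4 X'F09D8C86');
```

Every row should report one character, with byte lengths of 1, 2, 3, and 4 respectively. Only the last one exceeds what the three-byte character set can hold.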

Here is a table showing the difference between utf8, utf8mb3, and utf8mb4:

| Feature | utf8 | utf8mb3 | utf8mb4 |
| --- | --- | --- | --- |
| Maximum number of bytes per character | 3 | 3 | 4 |
| Characters supported | Basic Multilingual Plane (BMP) | BMP | BMP + supplementary planes |
| Default in MySQL | No | No | Yes (since MySQL 8.0) |
| Deprecation status | Deprecated | Deprecated | Not deprecated |

Note: Historically, MySQL used utf8 as an alias for utf8mb3. Starting with MySQL 8.0.28, the name utf8mb3 is used exclusively, in place of utf8, in the output of SHOW statements and in Information Schema tables when referring to that character set. In the future, utf8 is expected to become a reference to utf8mb4. To avoid any ambiguity, it is recommended to specify utf8mb4 explicitly whenever you mean that character set.
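As a rough sketch of what "explicitly specify utf8mb4" can look like in practice, the statements here name the character set at the database, table, and connection level. The database and table names (app_db, notes) are hypothetical, and utf8mb4_0900_ai_ci exists only in MySQL 8.0; on 5.7 a collation such as utf8mb4_unicode_ci would be needed instead.

```sql
-- Hypothetical names, for illustration only.
CREATE DATABASE IF NOT EXISTS app_db
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;   -- MySQL 8.0 default collation for utf8mb4

CREATE TABLE app_db.notes (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body VARCHAR(255) NOT NULL
) DEFAULT CHARSET = utf8mb4;

-- Make the client connection use utf8mb4 as well, so four-byte
-- characters are not mangled between the client and the server.
SET NAMES utf8mb4;
```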

As you can see, utf8 and utf8mb3 are the same three-byte character set under two names, and the main difference from utf8mb4 is the maximum number of bytes per character. utf8 and utf8mb3 can only store characters in the Basic Multilingual Plane (BMP), while utf8mb4 can also store characters in the supplementary planes. This means that utf8mb4 can support a wider range of characters, including emojis, some mathematical symbols, and other special characters.

Another difference is their default status in MySQL. In MySQL 5.7 and earlier, the default server character set was latin1, so utf8 (utf8mb3) had to be chosen explicitly. Starting with MySQL 8.0, utf8mb4 is the default character set.
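If you are unsure what a given server defaults to, a quick check along these lines can help. The column aliases are arbitrary.

```sql
-- Server-level and current-database defaults.
SELECT @@character_set_server   AS server_charset,
       @@collation_server       AS server_collation,
       @@character_set_database AS database_charset;

-- List the UTF-8 character sets this server knows about,
-- with their default collation and maximum bytes per character.
SHOW CHARACTER SET LIKE 'utf8%';
```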

Finally, utf8 and utf8mb3 are deprecated in MySQL 8.0. This means that they will eventually be removed from MySQL, so it is recommended to use utf8mb4 instead.
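One possible way to find what still relies on the deprecated character set is to query the Information Schema and then convert tables one at a time. This is only a sketch: the table name app_db.legacy_users is hypothetical, and a conversion can change column types and index sizes, so it is worth trying on a copy first.

```sql
-- Columns still using the three-byte character set. Depending on the
-- server version it is reported as 'utf8' or 'utf8mb3'.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND TABLE_SCHEMA NOT IN
      ('mysql', 'sys', 'information_schema', 'performance_schema');

-- Convert a single table (hypothetical name) and all of its
-- character columns to utf8mb4.
ALTER TABLE app_db.legacy_users
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
```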

So, if you need to store all Unicode characters, including emojis and other supplementary characters, then you should use utf8mb4. However, if you only need to store characters from the BMP, then utf8 may be sufficient.

Here is an example comparison of utf8 and utf8mb4 using MySQL tables and queries:

MySQL 5.7

Table:
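A table consistent with the rest of this example might look like the sketch here. It assumes an auto-increment id plus name and email columns in the three-byte utf8 character set; the column types and lengths are illustrative guesses.

```sql
-- Assumed schema: types and lengths are illustrative.
CREATE TABLE users (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name  VARCHAR(100) CHARACTER SET utf8 NOT NULL,
  email VARCHAR(100) CHARACTER SET utf8 NOT NULL
);
```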

We insert three rows into the users table, the third of which contains the emoji.
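A three-row insert of this kind could be written as follows. The names and email addresses are made-up sample values; only the four-byte character in the third name matters. The statement assumes the connection uses utf8mb4 and that strict SQL mode is on, as it is by default in MySQL 5.7.

```sql
SET NAMES utf8mb4;  -- let the client send four-byte characters

-- Sample values are hypothetical.
INSERT INTO users (name, email) VALUES
  ('Alice', 'alice@example.com'),
  ('Bob',   'bob@example.com'),
  ('𝌆',     'symbol@example.com');  -- U+1D306, outside the BMP
```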

The insert fails with “ERROR 1366 (HY000): Incorrect string value: ‘\xF0\x9D\x8C\x86’ for column ‘name’ at row 3.” The bytes F0 9D 8C 86 are the four-byte UTF-8 encoding of the character ‘𝌆’ (U+1D306), which lies outside the BMP, so the three-byte character set used for the ‘name’ column in the ‘users’ table cannot store it.

MySQL 8.0

Table:

This table uses the utf8mb3 character set for both the name and email columns. This means that the table can store all characters from the BMP, but it cannot store emojis or other supplementary characters.
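A definition matching this description might look like the following sketch. The id column and the column types and lengths are assumptions; on MySQL 8.0 the utf8mb3 name is accepted but raises a deprecation warning.

```sql
-- Assumed schema: types and lengths are illustrative.
CREATE TABLE users (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name  VARCHAR(100) CHARACTER SET utf8mb3 NOT NULL,
  email VARCHAR(100) CHARACTER SET utf8mb3 NOT NULL
);
```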

Query:
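A three-row insert against this table, with made-up sample values and a supplementary character in the third name, could look like this:

```sql
SET NAMES utf8mb4;  -- let the client send four-byte characters

-- Sample values are hypothetical.
INSERT INTO users (name, email) VALUES
  ('Alice', 'alice@example.com'),
  ('Bob',   'bob@example.com'),
  ('𝌆',     'symbol@example.com');  -- U+1D306, needs four bytes
```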

Like the previous example, the insert fails with “ERROR 1366 (HY000): Incorrect string value: ‘\xF0\x9D\x8C\x86’ for column ‘name’ at row 3.” The ‘name’ column in the ‘users’ table uses a three-byte character set, so it cannot store the four-byte character ‘𝌆’ (U+1D306).
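Leaving out the offending row lets the statement succeed. A sketch with made-up sample values:

```sql
-- Only BMP characters here, so a three-byte column accepts both rows.
-- Sample values are hypothetical.
INSERT INTO users (name, email) VALUES
  ('Alice', 'alice@example.com'),
  ('Bob',   'bob@example.com');
```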

This query inserts only the first two rows into the users table. Those rows contain simple text data. The third row, which contains an emoji, is left out because the utf8 character set cannot store it.

Output:

SELECT * FROM users;

This query returns the two rows in the users table, with the ID, name, and email of each user. The third row, containing the emoji, is absent: its insert failed because the utf8 character set cannot store emojis.

Table:

To ensure proper storage of emojis, let’s create the table columns using the utf8mb4 character set. Afterward, we can proceed to check if the emoji insertion works correctly.
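A utf8mb4 version of the table could be created along these lines. The schema is again an assumption (id, name, email with illustrative types), and the DROP is there only because the earlier example may have left a users table behind.

```sql
-- Remove the three-byte table from the previous example, if present.
DROP TABLE IF EXISTS users;

-- Assumed schema: types and lengths are illustrative.
CREATE TABLE users (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name  VARCHAR(100) CHARACTER SET utf8mb4 NOT NULL,
  email VARCHAR(100) CHARACTER SET utf8mb4 NOT NULL
);
```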

Query:

This table uses the utf8mb4 character set for both the name and email columns. This means that the table can store all characters from the full Unicode character set, including emojis and other supplementary characters.

This query inserts three rows into the users table. The first two rows contain simple text data, while the third row contains an emoji. The emoji will be stored correctly in the database because the utf8mb4 character set can store emojis.
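With utf8mb4 columns, a three-row insert like this sketch should complete without error 1366. The values are made up, and the connection is assumed to use utf8mb4.

```sql
SET NAMES utf8mb4;  -- connection must also carry four-byte characters

-- Sample values are hypothetical.
INSERT INTO users (name, email) VALUES
  ('Alice', 'alice@example.com'),
  ('Bob',   'bob@example.com'),
  ('𝌆',     'symbol@example.com');  -- stored as four bytes
```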

Output:

SELECT * FROM users;

This query returns all rows in the users table, with the ID, name, and email of each user. The emoji comes back intact because the utf8mb4 character set can store it.
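If the terminal or client cannot render the character, the stored bytes can still be checked directly. This sketch assumes the users table has a name column as described; the aliases are arbitrary.

```sql
-- Compare character count, byte count, and raw bytes per row.
SELECT id,
       name,
       CHAR_LENGTH(name) AS chars,
       LENGTH(name)      AS bytes,
       HEX(name)         AS utf8_hex
FROM users;
```

For a name consisting only of 𝌆, this should report one character, four bytes, and the hex value F09D8C86.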

Conclusion

As you can see, the utf8mb4 character set can store all characters from the full Unicode character set, including emojis and other supplementary characters. This makes it a good choice for storing complex text data, text searches, and comparisons. The utf8 character set, on the other hand, can only store characters from the BMP. This means that it cannot store emojis or other supplementary characters.

In general, it is recommended to use utf8mb4 for all new applications. This will ensure that your data can be stored and processed correctly, regardless of the characters that it contains.

