As businesses grow and develop, the requirements that they have for their data platform grow along with it. As such, one of the more common questions I get from my clients is whether or not their system will be able to endure an anticipated load increase. Or worse yet, sometimes I get questions about regaining normal operations after a traffic increase caused performance destabilization.

As the subject of this blog post suggests, this all comes down to proper capacity planning. Unfortunately, this topic is more of an art than a science, given that there is really no foolproof algorithm or approach that can tell you exactly where you might hit a bottleneck with server performance. But we can discuss common bottlenecks, how to assess them, and have a better understanding as to why proactive monitoring is so important when it comes to responding to traffic growth.

Hardware considerations

The first thing we have to consider here is the resources that the underlying host provides to the database. Let’s take a look at each common resource. In each case, I’ll explain why a 2x increase in traffic doesn’t necessarily mean you’ll have a 2x increase in resource consumption.

Memory

Memory is one of the easier resources to predict and forecast and one of the few places where an algorithm might help you, but for this, we need to know a bit about how MySQL uses memory.

MySQL has two main memory consumers. Global caches like the InnoDB buffer pool and MyISAM key cache and session-level caches like the sort buffer, join buffer, random read buffer, etc.

Global memory caches are static in size as they are defined solely by the configuration of the database itself. What this means is that if you have a buffer pool set to 64Gb, having an increase in traffic isn’t going to make this any bigger or smaller. What changes is how session-level caches are allocated, which may result in larger memory consumption.

A tool that was popular at one time for calculating memory consumption was mysqlcalculator.com. Using this tool, you could enter in your values for your global and session variables and the number of max connections, and it would return the amount of memory that MySQL would consume. In practice, this calculation doesn’t really work, and that’s due to the fact that caches like the sort buffer and join buffer aren’t allocated when a new connection is made; they are only allocated when a query is run and only if MySQL determines that one or more of the session caches will be needed for that query. So idle connections don’t use much memory at all, and active connections may not use much more if they don’t require any of the session-level caches to complete their query.

The way I get around this is to estimate the amount of memory consumed on average by sessions as such…

({Total memory consumed by MySQL} – {sum of all global caches}) / {average number of active sessions}

Keep in mind that even this isn’t going to be super accurate, but at least it gives you an idea of what common session-level memory usage looks like. If you can figure out what the average memory consumption is per active session, then you can forecast what 2x the number of active sessions will consume.

This sounds simple enough, but in reality, there could be more to consider. Does your traffic increase come with updated code changes that change the queries? Do these queries use more caches? Will your increase in traffic mean more data, and if so, will you need to grow your global cache to ensure more data fits into it?

With the points above under consideration, we know that we can generally predict what MySQL will do with memory under a traffic increase, but there may be changes that could be unforeseen that could change the amount of memory that sessions use.

The solution is proactive monitoring using time-lapse metrics monitoring like what you would get with Percona Monitoring and Management (PMM). Keep an eye on your active session graph and your memory consumption graph and see how they relate to one another. Checking this frequently can help you get a better understanding of how session memory allocation changes over time and will give you a better understanding of what you might need as traffic increases.

CPU

When it comes to CPU, there’s obviously a large number of factors that contribute to usage. The most common is the queries that you run against MySQL itself. However, having a 2x increase in traffic may not lead to a 2x increase in CPU as, like memory, it really depends on the queries that are run against the database. In fact, the most common cause of massive CPU increase that I’ve seen isn’t traffic increase; it’s code changes that introduced inefficient revisions to existing queries or new queries. As such, a 0% increase in traffic can result in full CPU saturation.

This is where proactive monitoring comes into play again. Keep an eye on CPU graphs as traffic increases. In addition, you can collect full query profiles on a regular basis and run them through tools like pt-query-digest or look at the Query Analyzer (QAN) in PMM to keep track of query performance, noting where queries may be less performant than they once were, or when new queries have unexpected high load.

Disk space

A 2x increase in traffic doesn’t mean a 2x increase in disk space consumption. It may increase the rate at which disk space is accumulated, but that also depends on how much of the traffic increase is write-focused. If you have a 4x increase in reads and a 1.05X increase in writes, then you don’t need to be overly concerned about disk space consumption rates.

Once again, we look at proactive monitoring to help us. Using time-lapse metrics monitoring, we can monitor overall disk consumption and the rate at which consumption occurs and then predict how much time we have left before we run out of space.

Disk IOPS

The amount of disk IOPS your system uses will be somewhat related to how much of your data can fit into memory. Keep in mind that the disk will still need to be used for background operations as well, including writing to the InnoDB redo log, persisting/checkpointing data changes to table spaces from the redo log, etc. But, for example, if you have a large traffic increase that’s read-dependent and all of the data being read in the buffer pool, you may not see much of an IOPS increase at all.

Guess what we should do in this case? If you said “proactive monitoring,” you get a gold star. Keep an eye out for metrics related to IOPS and disk utilization as traffic increases.

Before we move on to the next section, consider the differences in abnormal between disk space and disk IOPS. When you saturate disk IOPS, your system is going to run slow. If you fill up your disk, your database will start throwing errors and may stop working completely. It’s important to understand the difference so you know how to act based on the situation at hand.

Database engine considerations

While resource utilization/saturation are very common bottlenecks for database performance, there are limitations within the engine itself. Row-locking contention is a good example, and you can keep an eye on row-lock wait time metrics in tools like PMM. But, much like any other software that allows for concurrent session usage, there are mutexes/semaphores in the code that are used to limit the number of sessions that can access shared resources. Information about this can be found in the semaphores section in the output of the “SHOW ENGINE INNODB STATUS” command.

Unfortunately, this is the single hardest bottleneck to predict and is based solely on the use case. I’ve seen systems running 25,000+ queries per second with no issue, and I’ve also seen systems running ~5,000 queries per second that ran into issues with mutex contention.

Keeping an eye on metrics for OS context switching will help with this a little bit, but unfortunately this is a situation where you normally don’t know where the wall is until you run right into it. Adjusting variables like innodb_thread_concurrency can help with this in a pinch, but when you get to this point, you really need to look at query efficiency and horizontal scaling strategies.

Another thing to consider is configurable hard limits like max_connections, where you can limit the upper bound of the number of connections that can connect to MySQL at any given time. Keep in mind that increasing this value can impact memory consumption as more connections will use more memory, so use caution when adjusting upward.

Conclusion

Capacity planning is not something you do once a year or more as part of a general exercise. It’s not something you do when management calls you to let you know a big sale is coming up that will increase the load on the hosts. It’s part of a regular day-to-day activity for anyone that’s operating in a database administrator role.

Proactive monitoring plays a big part in capacity planning. I’m not talking about alert-based monitoring that hits your pager when it’s already too late, but evaluating metrics usage on a regular basis to see what the data platform is doing, how it’s handling its current traffic, etc. In most cases, you don’t see massive increases in traffic all at once; typically, it’s gradual enough that you can monitor as it increases and adjust your system or processes to avoid saturation.

Tools like PMM and the Percona Toolkit play a big role in proactive monitoring and are open source for free usage. So if you don’t have tools like this in place, this comes in at a price point that makes tool integration easier for your consideration.

Also, if you still feel concerned about your current capacity planning, you can always reach out to Percona Managed Services for a performance review or query review that will give you a detailed analysis of the current state of your database along with recommendations to keep it as performant as possible.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Subscribe
Notify of
guest

2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Maxwell Stevenson

Hey Peter,

Nice stuff!
Can you tell me how can I see issues on the database server, using PMM in a simple and fast way?