2022 was an exciting year for Percona Monitoring and Management (PMM). We’ve added and improved many features, including Alerting and Backup Management. These updates are designed to keep databases running at peak performance and simplify database operations. But as companies grow and see more demand for their databases, we need to ensure that PMM also remains scalable so you don’t need to worry about its performance while tending to the rest of your environment.

PMM2 uses VictoriaMetrics (VM) as its metrics storage engine. Percona co-founder Peter Zaitsev wrote a detailed post about the migration from Prometheus to VictoriaMetrics. One of the most significant performance differences in PMM2 comes from the use of VM, which can also be seen in the performance comparison of node_exporter metrics between Prometheus and VictoriaMetrics.

Planning resources for a PMM Server host instance can be tricky because the numbers change depending on the DB instances being monitored. For example, a higher number of data samples ingested per second, or a monitored database with a huge number of tables (1,000+), can affect performance; similarly, the exporter configuration or a custom metric resolution can also impact the performance of the PMM Server host. The point is that scaling up PMM isn't linear, and this post is only meant to give you a general idea and serve as a good starting point when planning to set up PMM2.
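
If you need to trade some metric granularity for lower load, resolutions can be changed through the PMM Server API. Below is a minimal sketch, assuming a PMM Server reachable at pmm.example.com with admin credentials (both placeholders); it uses PMM2's /v1/Settings/Change endpoint, so verify the payload against the API documentation for your version.

```python
import requests

# Placeholder server address and credentials -- adjust for your setup.
PMM_SERVER = "https://pmm.example.com"
AUTH = ("admin", "admin")

# Coarser resolutions mean fewer samples ingested per second, which is one
# of the main levers on PMM Server resource usage. Values are duration
# strings; the PMM2 defaults are hr=5s, mr=10s, lr=60s.
payload = {
    "metrics_resolutions": {
        "hr": "10s",   # high resolution
        "mr": "30s",   # medium resolution
        "lr": "120s",  # low resolution
    }
}

resp = requests.post(
    f"{PMM_SERVER}/v1/Settings/Change",
    json=payload,
    auth=AUTH,
    verify=False,  # only if your server uses a self-signed certificate
)
resp.raise_for_status()
print(resp.json())
```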

The VictoriaMetrics team has also published some best practices, which are worth consulting when planning resources for a PMM2 setup.

Home Dashboard for PMM2
We tested PMM version 2.33.0 with its default configuration, and it can monitor more than 1,000 MySQL services, with the databases running the default sysbench read-write workload. We observed that the overall performance of PMM monitoring 1,000 database services was good, and no significant resource usage spikes occurred; this is a HUGE increase in performance and capacity over previous versions! Please note that these tests focused on standard metrics gathering and display; we'll use a future blog post to benchmark some of the more intensive Query Analytics (QAN) performance numbers.

Capacity planning and setup details

We used a dedicated host with a 32-core CPU and 64 GB of RAM for our testing.

CPU Usage for PMM Server Host System

The CPU usage averaged 24%, as you can see in the picture above.

Memory Utilization for PMM Server Host System

Virtual memory utilization averaged 48 GB of RAM.

VictoriaMetrics maintains an in-memory cache for mapping active time series to internal series IDs. The cache size depends on the memory available to VictoriaMetrics on the host system, so planning for enough RAM on the host is important for good performance and to avoid a high percentage of slow inserts.
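
One way to keep an eye on this is to watch the share of slow inserts directly. The sketch below queries VictoriaMetrics through PMM's /prometheus/ proxy path (an assumption to verify for your version) using VM's vm_slow_row_inserts_total and vm_rows_inserted_total counters; the VictoriaMetrics docs suggest keeping this share low, as a sustained high value usually means the host needs more RAM.

```python
import requests

# Placeholder PMM Server address and credentials -- adjust for your setup.
PMM_SERVER = "https://pmm.example.com"
AUTH = ("admin", "admin")

# Share of "slow inserts": rows whose series were not found in the
# in-memory cache and required extra work to map to series IDs.
query = (
    "sum(rate(vm_slow_row_inserts_total[5m]))"
    " / sum(rate(vm_rows_inserted_total[5m]))"
)

resp = requests.get(
    f"{PMM_SERVER}/prometheus/api/v1/query",
    params={"query": query},
    auth=AUTH,
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(f"slow insert share: {float(sample['value'][1]):.2%}")
```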

As for overall disk usage when monitoring 1,000 database services, the average comes out to roughly 0.25 bytes per data point, so you should plan for roughly 500 GB to 1 TB of storage for the default 30-day retention period (a quick arithmetic check follows the panels below).

Average datapoints size

Stats on VictoriaMetrics
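
As a back-of-the-envelope check, here is that arithmetic in a short sketch. The bytes-per-data-point figure and retention period come from this post; the ingestion rate is an assumed placeholder you should replace with the actual samples/sec shown on your own VictoriaMetrics stats panel.

```python
# Rough disk sizing for a PMM Server from ingestion rate and retention.

BYTES_PER_DATAPOINT = 0.25      # observed average in our 1,000-service test
RETENTION_DAYS = 30             # PMM default retention period
INGESTION_RATE = 750_000        # samples/sec -- assumed; check your own panel

seconds = RETENTION_DAYS * 24 * 60 * 60
disk_bytes = INGESTION_RATE * seconds * BYTES_PER_DATAPOINT

print(f"~{disk_bytes / 1e9:.0f} GB for {RETENTION_DAYS} days of retention")
# ~486 GB -- within the 500 GB to 1 TB ballpark quoted above
```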

We recommend at least 2 GB of RAM and two CPU cores as the bare minimum requirement for setting up a PMM Server to monitor your database services. With this minimum setup, you can comfortably monitor up to three database services, possibly more, depending on the factors in your environment already mentioned.

Based on our observations across the various setups we have done with PMM, with a reasonably powerful pmm-server host system (8+ GB RAM and 8+ cores), it is optimal to target monitoring roughly 32 databases per CPU core, or 16 databases per GB of RAM. Keeping this in mind is really helpful when planning resources for your monitoring setups (see the sizing sketch after the table below).

Number of Database Services Monitored    Min Recommended Requirement
0-250 services                           8 cores, 16 GB RAM
250-500 services                         16 cores, 32 GB RAM
500-1,000 services                       32 cores, 64 GB RAM
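
Here is that rule of thumb folded into a small sizing helper. It is a rough starting point under this post's assumptions (default configuration, push metrics mode), not a formula; real requirements shift with workload, table counts, and metric resolution.

```python
import math

def pmm_server_sizing(num_services: int) -> tuple[int, int]:
    """Rough PMM Server sizing from the rule of thumb above:
    ~32 database services per CPU core and ~16 per GB of RAM,
    with a floor of two cores / 2 GB (the bare minimum setup)."""
    cores = max(2, math.ceil(num_services / 32))
    ram_gb = max(2, math.ceil(num_services / 16))
    return cores, ram_gb

for n in (3, 250, 500, 1000):
    cores, ram = pmm_server_sizing(n)
    print(f"{n:>5} services -> {cores:>2} cores, {ram:>3} GB RAM")
# 1000 services -> 32 cores, 63 GB RAM, roughly matching the table above
```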

PMM scalability improved dramatically through UX and performance research

In earlier versions of PMM2, the Home Dashboard could not load with more than 400 DB services, resulting in a frustrating experience for users: interacting with UI elements such as filters and date pickers was impossible. We conducted thorough research to improve scalability and the user experience of our Home Dashboard for 1,000 database services. Our findings revealed that the design of the Home Dashboard heavily impacted scalability, and poor UX resulted in unresponsive pages.

As a solution, we redesigned the Home Dashboard, and the results were significant. The new dashboard provides a much better user experience, with more critical information displayed and scalability for environments of up to 1,000 DB services. The overall load time has improved dramatically, going from 50+ seconds to roughly 20 seconds, and there are no longer any unresponsive errors in the UI. Users can now interact with filters on other dashboards seamlessly as well!

There are still some limitations we’re working on addressing

  • Instance Overview dashboards, which ship with PMM, do not work well with such a large number of instances, so we recommend not relying on them when this many databases are being monitored; they work well only with a maximum of around 400 database services.
  • There is a known issue where a "Request-URI Too Large" pop-up message appears because of some large query requests; this also causes problems when setting a big time range for observing metrics from the monitored databases. Our team is planning to implement a fix for this soon.
  • QAN takes 50+ seconds to load when 400+ database services are monitored. Also, the overall interaction with QAN feels laggy when searching and applying filters across a big list of services/nodes. Our team is working on improving the overall user experience of the QAN app, with fixes coming in future releases of PMM.

Not a formula but a rule of thumb

Overall resource usage in PMM depends on the configuration and workload, and it may vary between setups, so it's difficult to say, "for monitoring this number of DB services, you need a machine of that size." This post is meant to show how the PMM Server scales and performs with the default setup, with all database host nodes configured in the default metrics mode (push).

We also plan a follow-up post on performance and scalability in which we'll highlight results for different dashboards and QAN, showcasing the improvements we have made over the last few PMM releases.

Tell us what you think about PMM!

We’re excited about all of the improvements we’ve made, but we’re not done yet! Have some thoughts about how we can improve PMM, or want to ask questions? Come talk to us in the Percona Forums, and let us know what you’re thinking!

PMM Forum
