MongoDB is a non-relational document database that stores data in flexible, JSON-like documents, making it easy to store unstructured data. First released in 2009, it is the most widely used NoSQL database and has been downloaded more than 325 million times.

MongoDB is popular with developers because it is easy to get started with. Over the years, many features have been introduced that have turned MongoDB into a robust solution capable of storing terabytes of data for applications.

As with any database, developers and DBAs working with MongoDB should look at how to optimize the performance of their database, especially nowadays with cloud services, where each byte processed, transmitted, and stored costs money. The ability to get started so quickly with MongoDB means that it is easy to overlook potential problems or miss out on simple performance improvements. In this piece, we’ll look at ten essential techniques you can apply to make the most of MongoDB for your applications.

Tip #1:

Always enable authorization for your production environments

Securing your database from the beginning prevents unauthorized access and security breaches. This is especially important given today’s escalating number of hacking attempts.

The bigger the database, the bigger the damage from a leak. There have been numerous data leaks due to the simple fact that authorization and authentication are disabled by default when deploying MongoDB for the first time. While this is not a performance tip, it is essential to enable authorization and authentication right from the start, as doing so will save you potential pain down the road from unauthorized access or data leakage.

When you deploy a new instance of MongoDB, the instance has no user, password, or access control by default. In recent MongoDB versions, the default IP binding changed to 127.0.0.1, and a localhost exception was added, which reduced the potential for database exposure when installing it. 

However, this is still not ideal from a security perspective. The first piece of advice is to create the admin user and restart the instance with the authorization option enabled. This prevents any unauthorized access to the instance.

To create the admin user:
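A minimal sketch in the mongo shell, assuming a user named admin with the root role (pick the user name, password, and roles that suit your deployment):

    use admin
    db.createUser({
      user: "admin",
      pwd: passwordPrompt(),   // prompts for the password instead of echoing it
      roles: [ { role: "root", db: "admin" } ]
    })

Then enable authorization in mongod.conf and restart the instance:

    security:
      authorization: enabled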


Tip #2:

Always upgrade to the latest patch set for any given major version

Sometimes companies have to stay on versions that have gone EOL (End-of-Life) due to legacy application or driver dependency issues. Even so, it is recommended to upgrade your stack as soon as possible to take advantage of improvements and any security-related software updates.

It is critically important not to use versions that are labeled with “critical issues” or “not recommended for production use” warnings. Those releases have been found to include bugs that can cause serious performance impacts or data integrity issues – potentially leading to data corruption or data loss.

It should seem obvious, but one of the most common issues we see with production instances is developers running a MongoDB version that is not suitable for production in the first place. This can be because the version is outdated, such as a retired version that should be updated to a newer iteration containing all the necessary bug fixes. Alternatively, the version can be too new and not yet tested enough for production use.

This can be due to a few different reasons. As developers, we are normally keen to use the latest and greatest versions of our tools. We also want to be consistent across all the stages of development, from initial build and test through to production, as this decreases the number of variables we have to support, the potential for issues, and the cost of managing all those instances.

For some, this can mean using versions that are not signed off for production deployment yet. For others, it can mean sticking with a specific version that is tried and trusted. This is a problem from a troubleshooting perspective when an issue is fixed in a later version of MongoDB that is approved for production but has not been deployed yet. Alternatively, you may forget about that database instance that is ‘just working’ in the background and miss when you need to implement a patch!

In response to this, you should regularly check if your version is suitable for production using the release notes of each version. For example, MongoDB 5.0 provides the following guidance in its release notes: https://www.mongodb.com/docs/upcoming/release-notes/5.0/ 

The guidance here would be to use MongoDB 5.0.11, as this has the required updates in place. If you don’t update to this version, you will run the risk of losing data.

While it might be tempting to stick with one version, keeping up with upgrades is essential to prevent problems in production. You may want to take advantage of newly added features, but you should put these through your test process first to see if there are any problems that might affect your overall performance before moving them into production.

Lastly, you must check the MongoDB Software Lifecycle Schedules and plan the upgrades of your clusters before each version reaches its end of life: https://www.mongodb.com/support-policy/lifecycles

End-of-life versions do not receive patches, bug fixes, or any kind of improvement. This can leave your database instances exposed and vulnerable.

From a performance perspective, getting the right version of MongoDB for your production applications involves being ‘just right’ – not too near the bleeding edge that you encounter bugs or other problems, but also not too far behind that you miss out on vital updates.

Tip #3:

Always use Replica Sets with at least three full data-bearing nodes for your production environments

A replica set is a group of MongoDB processes that maintains the same data on all of the nodes used by an application. It provides redundancy and high availability for your data. When you have multiple copies of your data on different database servers — or even better, in different data centers worldwide — replication provides a high level of fault tolerance in case of a disaster.

MongoDB replica sets consist of PRIMARY and SECONDARY nodes. The PRIMARY node receives writes from the application server and then automatically replicates the data to all SECONDARY nodes via the built-in replication process, which applies changes via the oplog. Replication provides redundancy, and having multiple copies of the same data increases data availability.

It is best practice to have an odd number of replica set members to maintain voting quorum. The minimum number of replica set members for a replica set is three nodes. A replica set can currently have up to 50 nodes. Only seven of those members can have voting rights. Those votes determine which member becomes PRIMARY in case of problems, failovers, or elections.

The basic replica set would consist of the following:

  • PRIMARY 
  • SECONDARY 
  • SECONDARY
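A minimal sketch of initiating such a three-node set from the mongo shell, assuming three hypothetical hosts named mongo1, mongo2, and mongo3:

    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "mongo1.example.net:27017" },
        { _id: 1, host: "mongo2.example.net:27017" },
        { _id: 2, host: "mongo3.example.net:27017" }
      ]
    })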

All the nodes of the replica set work together: the PRIMARY node receives the writes from the app server, and the data is then copied to the secondaries. If something happens to the primary node, the replica set will elect a secondary as the new primary. To make this process work more efficiently and ensure a smooth failover, it is important to have the same hardware configuration on all the nodes of the replica set. Additionally, secondary nodes can be used to improve or direct reads by utilizing read preference options such as secondaryPreferred or nearest, which also support hedged reads.

After you deploy a replica set to production, it is important to check the health of the replica set and its nodes. MongoDB has two important commands for checking this in practice.

rs.status() provides information on the current status of the replica set using data derived from the heartbeat packets sent by the other members of the replica set. It’s very useful as a tool to check the status of all the nodes in a replica set.

rs.printSecondaryReplicationInfo() provides a formatted report of the replica set status. It’s very useful to check if any of the secondaries are behind the PRIMARY on data replication, as this would affect your ability to recover all your data in the event of something going wrong. If secondaries are too far behind the primary, then you can end up losing a lot more data than you are comfortable with.

However, these commands provide point-in-time information rather than continuous monitoring of the health of your replica set. In a real production environment, or if you have many clusters to check, running these commands can become time-consuming and annoying, so we recommend using a monitoring system for your clusters, such as Percona Monitoring and Management.

Tip #4:

Avoid the use of query types or operators that can be expensive

Sometimes the simplest way to search for something in a database is to use a regular expression or $regex operation. A lot of developers choose this option, but it can actually harm your search operations at scale. You should avoid $regex queries, especially when your database is big.

A $regex query consumes a lot of CPU time, and it will normally be extremely slow and inefficient. Creating an index doesn’t help much, and sometimes the performance is worse with the index than without it.

For example, we can run a $regex query on a collection of 10 million documents and use .explain(true) to view how many milliseconds the query takes. The snippets below are a sketch, assuming a hypothetical users collection with a name field.

Without an index:
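    db.users.find({ name: { $regex: "acme" } }).explain(true)

With an index created on the queried field:

    // create the index, then re-run the same query to compare timings
    db.users.createIndex({ name: 1 })
    db.users.find({ name: { $regex: "acme" } }).explain(true)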

We can see in this example that the index creation didn’t help to improve the $regex performance.

It’s common to see a new application using $regex operations for search requests. Neither the developers nor the DBAs notice any performance issues at first, because the collections are small and the application has very few users in the beginning.

However, when the collections become bigger, and more users start to use the application, the $regex operations start to slow down the cluster and become a nightmare for the team. Over time, as your application scales and more users want to carry out search requests, the level of performance can drop significantly.

Rather than using $regex queries, you can use text indexes to support your text search.

This is more efficient than $regex but requires you to prepare in advance by adding text indexes to your data sets. These indexes can include any field whose value is a string or an array of string elements. A collection can only have one text search index, but that index can cover multiple fields.

Using the same collection as in the example above, we can test how many milliseconds the same query takes using text search:
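A sketch, again assuming the hypothetical users collection:

    // create a text index on the field, then query it with $text
    db.users.createIndex({ name: "text" })
    db.users.find({ $text: { $search: "acme" } }).explain(true)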

In practice, the same query with text search took four seconds less than the $regex query. Four seconds in “database time,” let alone online application time, is a lot.

To conclude: if you can solve the query using text search, use it instead of $regex, and reserve $regex for use cases where it is really necessary. Also, always limit the use of other expensive query patterns and operators, such as findAndModify and $nin.

Tip #5:

Think wisely about your index strategy

Putting some thought into your queries at the start can have a massive impact on performance over time. First, you need to understand your application and the kinds of queries that you expect to process as part of your service. Based on this, you can create an index that supports them. 

Indexing can help to speed up read queries, but it comes with an extra cost in storage, and indexes will slow down write operations. Consequently, you will need to think about which fields should be indexed so you can avoid creating too many indexes.

For example, if you are creating a compound index, following the Equality, Sort, Range (ESR) rule is a must, and using an index to sort the results improves the speed of the query.
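A sketch of the ESR rule on a hypothetical orders collection, where queries filter on status (equality), sort by orderDate, and filter on amount (range):

    // equality field first, then the sort field, then the range field
    db.orders.createIndex({ status: 1, orderDate: -1, amount: 1 })
    db.orders.find({ status: "shipped", amount: { $gt: 100 } }).sort({ orderDate: -1 })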

Similarly, you can always check if your queries are really using the indexes that you have created with .explain(). Sometimes we see a collection with indexes created, but the queries either don’t use the indexes or instead use the wrong index entirely. It’s important to create only the indexes that will actually be used for the read queries. Having indexes that will never be used is a waste of storage and will slow down write operations.

When you look at the .explain() output, three main fields are important to observe: nReturned, totalKeysExamined, and totalDocsExamined. For example:
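A trimmed executionStats sketch of that shape (nReturned here is illustrative; the other counts match the example discussed below):

    "executionStats": {
      "nReturned": 207254,
      "totalKeysExamined": 0,
      "totalDocsExamined": 207254
    }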

In this example, no indexes are being used. We can tell because the number of keys examined is 0 while the number of documents examined is 207254. Ideally, a query should have a ratio of nReturned/totalKeysExamined = 1. For example:
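A healthy output of the same shape, with illustrative values:

    "executionStats": {
      "nReturned": 1000,
      "totalKeysExamined": 1000,
      "totalDocsExamined": 1000
    }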

As a final piece of advice: if .explain() shows a particular query using the wrong index, you can force the query to use a particular index with .hint(). This overrides MongoDB’s default index selection and query optimization process, allowing you to specify the index that is used or to carry out a forward or reverse collection scan.
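A sketch, reusing the hypothetical orders collection from above:

    // force a specific index by its key pattern
    db.orders.find({ status: "shipped" }).hint({ status: 1, orderDate: -1, amount: 1 })

    // force a forward ($natural: 1) or reverse ($natural: -1) collection scan
    db.orders.find().hint({ $natural: 1 })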


Tip #6:

Watch for changes in query patterns, application changes, or index usage over time

Every database is unique and particular to its application, and it grows and changes over time. Nobody knows how an application will grow or how its queries will change. Whatever assumptions you make, your predictions will inevitably be wrong, so it is essential to check your database and indexes periodically.

For instance, you may plan a specific query optimization approach and a particular index, only to realize after a year that few queries are using that index anymore and it is no longer necessary. Carrying on with this approach will cost you more in storage while providing no improvement in application performance.

For this reason, it’s necessary to perform query optimizations and look at the indexes for each collection frequently.

MongoDB has some tools to do query optimization, such as the database profiler or the .explain() method; we recommend using them to find which queries are slow, how the indexes are being used by the queries, and where you may need to improve your optimizations.

Alongside removing indexes that are not used efficiently, look out for duplicate indexes that you don’t need as well.

We use some scripts to check if there are duplicate indexes or if there are any indexes that are not being used. You can check and use them from our repository: 

https://github.com/percona/support-snippets/tree/master/mongodb/scripts

Similarly, you can look at how many results you want to get from a query, as providing too many results can have a performance and time impact. You can limit the number of query results with .limit(), as sometimes you only need the first five results of a query rather than tens or hundreds of responses.

Another useful approach is to use projections to get only the necessary data. If you need only one field of the document, use a projection instead of retrieving the entire document and filtering on the app side.

Lastly, if you need to order the results of a query, be sure that you are using an index and take advantage of it to improve your efficiency.
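A sketch combining these three patterns on a hypothetical products collection, with an index on price to support the sort:

    db.products.createIndex({ price: 1 })

    // return only the five cheapest books, projecting just the fields we need
    db.products.find(
      { category: "books" },          // filter
      { name: 1, price: 1, _id: 0 }   // projection
    ).sort({ price: 1 }).limit(5)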

Tip #7:

Don’t run multiple mongod processes on the same server

Even though it’s possible to run multiple mongod processes on the same server using different ports, we strongly recommend not doing this.

When you run multiple mongod processes on the same server, it’s very difficult to monitor them and the resources they are consuming (CPU, RAM, network, etc.). Consequently, if there is any problem, it is extremely difficult to find out what is going on and get to the root cause of the issue.

We have seen many cases where customers found a resource problem on the server but, because they were running multiple mongod instances, struggled to troubleshoot it, since discovering which specific process had the problem was far more difficult.

Similarly, we sometimes see that developers have implemented a sharded cluster to scale their application data, but multiple shards are running on the same server. In these circumstances, the router will send a lot of queries to the same node, which may overload it and lead to poor performance – the opposite of what the sharding strategy is meant to achieve.

The worst-case scenario for these kinds of deployments involves replica sets. Imagine running a replica set for resiliency and availability, and then more than one member of the replica set is running on the same server. This is a recipe for potential disaster and data loss if the physical server that supports those nodes has a problem. Rather than architecting your application for resiliency, you would have made the whole deployment more likely to fail.

If you have a larger sharded environment where you are using many mongos (query router) processes, it is possible to run multiple mongos processes on the same nodes. If you choose to do so, it is recommended to run them in separate containers or VMs to avoid “noisy neighbor” resource contention issues and to aid in troubleshooting.

Tip #8:

Employ a reliable and robust backup strategy

So, you have a cluster with replication, but do you want to sleep better? Back up your data frequently!

Having backups of your data is a good strategy, and it allows you to restore the data from an earlier point in time if you need to recover from an unplanned event.

There are different options for backing up your data:

Mongodump / Mongorestore: Mongodump reads data from MongoDB and creates a BSON file that Mongorestore can use to populate a MongoDB database. These are efficient tools for backing up small MongoDB deployments. On the plus side, you can select a specific database or collection to back up, and this approach doesn’t require stopping writes on the node. However, it doesn’t back up any indexes you have created, so when restoring, you would need to re-create those indexes. Logical backups are, in general, very slow and time-consuming, so you would have to factor that time into your restore process. Lastly, this approach is not recommended for sharded clusters, which are more complex deployments.
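A minimal sketch from the command line (the URI, database name, and paths are placeholders):

    mongodump --uri="mongodb://localhost:27017" --db=mydb --out=/backups/mydb
    mongorestore --uri="mongodb://localhost:27017" /backups/mydb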

Percona Backup for MongoDB is an open source, distributed, and low-impact solution for consistent backups of MongoDB sharded clusters and replica sets. It enables backups for MongoDB servers, replica sets, and sharded clusters. It can support logical, physical, and point-in-time recovery backups and backups to anywhere, including AWS S3, Azure, or filesystem storage types.

However, it does require initial setup and configuration on all the nodes that you would want to protect.

Physical / file system backups: You can create a backup of a MongoDB deployment by making a copy of MongoDB’s underlying data files.

You can use different methods for this type of backup, from manually copying the data files and Logical Volume Management (LVM) snapshots to cloud-based snapshots. These are usually faster than logical backups and can be copied or shared to remote servers. This approach is especially recommended for large data sets, and it is convenient when building a new node in the same cluster.

However, you cannot select a specific database or collection when restoring, and you cannot do incremental backups. 

Lastly, taking the backup from a dedicated node is recommended, as the process requires halting writes, which affects application performance.

Any chosen solution must include testing your restore process periodically.

Tip #9:

Know when to shard your replica set and why choosing a shard key is important

Sharding is the most complex architecture you can deploy with MongoDB.

As your database grows, you will need to add more capacity to your server. This can involve adding more RAM, more I/O capacity, or even more powerful CPUs to handle processing. This is called vertical scaling. However, if your database grows so much that it goes beyond the capacity of a single machine, then you may have to split the workload up. Several things might lead to this — for instance, there may not be a physical server large enough to handle the workload, or the server instance would cost so much that it would be unaffordable to run. In these circumstances, you need to start thinking about sharding your database, which is called horizontal scaling.

Horizontal Scaling involves dividing the database over multiple servers and adding additional servers to increase capacity as required. For MongoDB, this process is called sharding, and it relies on a sharding key to manage how workloads are split up across machines.

Choosing a shard key may be the most difficult task in MongoDB. It’s necessary to think, study the data sets and queries, and plan ahead before choosing the key, because it’s very difficult to revert once sharding has been carried out. Before MongoDB 4.4, assigning a shard key was a one-way process that could not be undone. MongoDB 4.4 made it possible to refine a shard key, while MongoDB 5.0 and above can change the shard key with the reshardCollection command.

If you choose a bad shard key, a large percentage of documents may go to one of the shards and only a few to another. This will make the sharded cluster unbalanced, affecting performance over time. An unbalanced cluster typically results when a monotonically growing key is chosen to shard a collection, as all the documents over a given value would go to one shard rather than being distributed evenly.
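One common mitigation is a hashed shard key, which spreads monotonically increasing values evenly across the shards. A sketch, assuming a hypothetical mydb.orders collection keyed on orderId:

    sh.enableSharding("mydb")
    sh.shardCollection("mydb.orders", { orderId: "hashed" })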

Alongside looking at the value used to shard data, you will also need to think about the queries that will run against the shards. Queries must use the shard key so that the mongos can route them to the right shards in the cluster. If a query doesn’t use the shard key, the mongos will send it to every shard in the cluster, affecting performance and making the sharding strategy inefficient.


The ability to reshard is something that developers and DBAs have been requesting for a long time. However, be aware that the resharding process is very resource intensive and can have negative performance impacts, especially when dealing with applications that have heavy write patterns.

Tip #10:

Don’t throw money at the problem

Last, but not least, it’s very common to see teams throwing money at the problems that they have with their databases.

Adding more RAM and CPU, or moving to a larger instance or a bigger machine, can overcome a performance problem. However, taking this kind of action without analyzing the real problem or bottleneck can lead to more of the same kinds of problems in the future. Instead of immediately reaching for the credit card, it is sometimes better to think laterally and find a better solution. In most cases, the answer is not spending more money on resources but optimizing your implementation for better performance at the same resource level.

While cloud services make it easy to scale up instances, the costs can quickly mount up. Worse, this is an ongoing expense that will carry on over time. By looking first at areas like query optimization and performance, it’s possible to avoid additional spending. Some of the customers we have worked with were able to downgrade their EC2 instances after optimizing, saving their companies a lot of money every month.

As a general recommendation, adopt a cost-saving mindset: take your time to analyze the problem and think of a better solution before starting with cloud expansions.

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.


Download Percona Distribution for MongoDB Today!
