Redis® Transactions & Long-Running Lua Scripts

Redis offers two mechanisms for handling transactions: MULTI/EXEC-based transactions and Lua script evaluation. Lua scripting is the recommended approach and is widely used.

Our Redis™ customers who have deployed Lua scripts often report this error: “BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE”. In this post, we explain the transactional properties of Redis scripts, what this error means, and why it calls for extra care on Sentinel-managed systems that can fail over.

Transactional Nature of Redis Lua Scripts

Redis “transactions” aren’t really transactions as understood conventionally – in case of errors, there is no rollback of writes made by the script.

“Atomicity” of Redis scripts is guaranteed in the following manner:

  • Once a script begins executing, all other commands/scripts are blocked until the script completes. So, other clients either see the changes made by the script or they don’t. This is because they can only execute either before the script or after the script.
  • However, Redis doesn’t do rollbacks, so on an error within a script, any changes already made by the script will be retained and future commands/scripts will see those partial changes.
  • Since all other clients are blocked while the script executes, it is critical that the script is well-behaved and finishes in time.
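The no-rollback behavior in the second point can be illustrated with a toy model. The sketch below is plain Python, not Redis: `run_script` and the dictionary `store` are hypothetical stand-ins for a script and the keyspace, and a value of `None` is an assumed marker for a command that fails. It only shows that writes made before an error stay in place.

```python
def run_script(store, steps):
    """Apply each (key, value) step to the store in order.

    A value of None stands in for a command that raises an error.
    There is no rollback: earlier writes remain in the store.
    """
    for key, value in steps:
        if value is None:
            raise RuntimeError(f"command failed for {key!r}")
        store[key] = value


store = {}
steps = [("a", 1), ("b", 2), ("c", None), ("d", 4)]

try:
    run_script(store, steps)
except RuntimeError as err:
    error_message = str(err)

# The writes to "a" and "b" survive; "d" was never reached.
print(store)
```

Any command that runs after the failed script sees the partial state, which is why a script should validate its inputs before it performs its first write.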

The ‘lua-time-limit’ Value

It is highly recommended that a script complete within a time limit. Redis enforces this only weakly, through the ‘lua-time-limit’ setting, which is the maximum time (in ms) that a script is expected to run. The default is 5000 ms (5 seconds). That is a very long time for CPU-bound activity, since scripts have limited access and cannot run commands that access the disk.

However, the script is not killed when it executes beyond this time. Redis starts accepting client commands again, but responds to them with a BUSY error.

If you must kill the script at this point, there are two options available:

  • The SCRIPT KILL command can be used to stop a script that hasn’t yet performed any writes.
  • If the script has already performed writes and must still be killed, use SHUTDOWN NOSAVE to shut down the server completely, without saving the dataset.

It is usually better to simply wait for the script to complete its operation. Complete information on the ways to kill a running script, and the related behavior, is available in the documentation.
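On the client side, waiting usually means retrying a command that was rejected with BUSY. The following is a minimal sketch of that idea, not part of any client library: `is_busy_error`, `call_with_busy_retry`, and `fake_command` are hypothetical names, and detecting the error by its "BUSY" message prefix is an assumption, since client libraries differ in how they surface it.

```python
import time


def is_busy_error(exc):
    """Return True if the error text starts with the BUSY prefix (assumed format)."""
    return str(exc).startswith("BUSY")


def call_with_busy_retry(command, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run command(); on a BUSY error, wait and retry with doubling delays."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return command()
        except Exception as exc:
            # Re-raise anything that is not BUSY, or BUSY on the final attempt.
            if not is_busy_error(exc) or attempt == attempts - 1:
                raise
            sleep(delay)
            delay *= 2


# Stand-in for a client call: fails twice with BUSY, then succeeds.
replies = iter([
    RuntimeError("BUSY Redis is busy running a script."),
    RuntimeError("BUSY Redis is busy running a script."),
    "OK",
])


def fake_command():
    reply = next(replies)
    if isinstance(reply, Exception):
        raise reply
    return reply


waits = []  # record the delays instead of actually sleeping
result = call_with_busy_retry(fake_command, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run; a real caller would leave the default so the retries are actually spaced out.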

Behavior on Sentinel-Monitored High Availability Systems

Sentinel-managed high availability systems add a new wrinkle to this. In fact, this discussion applies to any high availability system that depends on polling the Redis servers for health:

  • Long-running scripts will initially block client commands. Later when the ‘lua-time-limit’ has passed, the server will start responding with BUSY errors.
  • Sentinels will consider such a node as unavailable, and if this persists beyond the down-after-milliseconds value configured on the Sentinels, they will determine the node to be down.
  • If such a node is the master, a failover will be initiated. A replica node might get promoted and could start accepting new connections from clients.
  • Meanwhile, the old master will eventually finish executing the script and come back online. Sentinel will then reconfigure it as a replica, and it will begin syncing with the new master. Any data written by the script will be lost.

Demonstration

We set up a sensitive high availability system to demonstrate this failover behavior. The setup has 2 Redis servers running in a master/replica configuration, monitored by three Sentinels.

The lua-time-limit value was set to 500 ms, so the server starts responding to clients with BUSY errors once a script has run for longer than 500 ms. The down-after-milliseconds value on the Sentinels was set to 5 seconds, so a node that keeps reporting errors is marked DOWN after 5 seconds.
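To see how these two settings interact, here is a simplified timeline model in Python. It is a sketch under stated assumptions, not Sentinel's actual logic: `client_view` and `marked_down` are hypothetical helpers, ping intervals and network delays are ignored, and the node is assumed to count as unhealthy for the whole time the script runs (first blocked, then answering BUSY).

```python
def client_view(elapsed_ms, script_runtime_ms, lua_time_limit_ms):
    """What a client sees elapsed_ms after the script started."""
    if elapsed_ms >= script_runtime_ms:
        return "normal replies"
    if elapsed_ms < lua_time_limit_ms:
        return "blocked"
    return "BUSY error"


def marked_down(script_runtime_ms, down_after_ms):
    """Simplified rule: the node is marked down if it gives no healthy
    reply for longer than down-after-milliseconds."""
    return script_runtime_ms > down_after_ms


LUA_TIME_LIMIT_MS = 500   # value used in this demonstration
DOWN_AFTER_MS = 5000      # value used in this demonstration

# Hypothetical script runtimes, observed 1 second after the script starts.
for runtime_ms in (300, 2000, 8000):
    view = client_view(1000, runtime_ms, LUA_TIME_LIMIT_MS)
    down = marked_down(runtime_ms, DOWN_AFTER_MS)
    print(runtime_ms, view, down)
```

Under this model, a script that finishes within 5 seconds causes BUSY errors at most, while anything longer, such as the endless loop used in this demonstration, leads to the master being marked down.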

We execute the following Lua script on the master:

local i = 0
-- Loop forever, writing a new key on every iteration.
while true do
  local key = "Key-" .. i
  local value = "Value-" .. i
  redis.call('set', key, value)
  i = i + 1
  redis.call('time')
end

This keeps writing entries into the Redis master. We subscribe to the events on one of the sentinels to observe the behavior.

The script is initiated on the master:

$ redis-cli -a  --eval test.lua
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.

Here is a truncated sequence of activities as seen on Sentinel:

3) "+vote-for-leader"
4) "9096772621089bb885eaf7304a011d9f46c5689f 1"
1) "pmessage"
2) "*"
3) "+sdown" <<< master marked DOWN
4) "master test 172.31.2.48 6379"
1) "pmessage"
2) "*"
3) "+odown"
4) "master test 172.31.2.48 6379 #quorum 3/2"
1) "pmessage"
2) "*"
3) "-role-change" << role change initiated
4) "slave 172.31.28.197:6379 172.31.28.197 6379 @ test 172.31.2.48 6379 new reported role is master"
1) "pmessage"
2) "*"
3) "+config-update-from"
4) "sentinel 9096772621089bb885eaf7304a011d9f46c5689f 172.31.2.48 26379 @ test 172.31.2.48 6379"
1) "pmessage"
2) "*"
3) "+switch-master"
4) "test 172.31.2.48 6379 172.31.28.197 6379"

Later, when the old master is brought online, it is changed to a replica:

3) "-role-change"
4) "slave 172.31.2.48:6379 172.31.2.48 6379 @ test 172.31.28.197 6379 new reported role is master"
1) "pmessage"
2) "*"
3) "-sdown"
4) "slave 172.31.2.48:6379 172.31.2.48 6379 @ test 172.31.28.197 6379"
1) "pmessage"
2) "*"
3) "+role-change"
4) "slave 172.31.2.48:6379 172.31.2.48 6379 @ test 172.31.28.197 6379 new reported role is slave"

All the data written to the old master via the script is lost.

Recommendations

  • You must know the characteristics of your long-running scripts before deploying them in production.
  • If your script regularly breaches the lua-time-limit, you must review the script thoroughly for possible optimizations. You can also break it down into pieces that complete in acceptable durations.
  • If you must run scripts that breach the lua-time-limit, consider scheduling them during periods when other client activity is low.
  • The value of the lua-time-limit can also be increased. This would be an acceptable solution if other client applications that execute in parallel with the script can tolerate receiving extremely delayed responses rather than a BUSY error and retrying later.
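Breaking a script into pieces, as suggested above, often comes down to having the caller feed it a bounded batch of keys per invocation. The sketch below shows only the client-side batching in Python; `batches` is a hypothetical helper, and the batch size of 4 is an arbitrary illustration rather than a recommended value.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


keys = [f"Key-{i}" for i in range(10)]
chunks = list(batches(keys, 4))

for chunk in chunks:
    # Each chunk would be handed to its own short script invocation,
    # letting other clients run between invocations.
    print(chunk)
```

The trade-off is that atomicity now holds only within each batch: other clients can observe the data between invocations, so this fits jobs that do not need the whole change to appear at once.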

Additional considerations on Sentinel-monitored high availability systems:

  • If the scripts are only doing read operations and you have replicas available, you can move these scripts to the replicas.
  • Change the Sentinel parameter down-after-milliseconds to a value that will ensure that failovers aren’t initiated. You must do this only after careful consideration because increasing the value drastically will compromise the high availability characteristics of your system. This could also cause genuine server failures to be ignored.
