
Magic Pocket: Dropbox’s Exabyte-Scale Blob Storage System


Summary

Facundo Agriel dives into the architecture of Magic Pocket, some early key design patterns, and the challenges of operating such a system at this scale.

Bio

Facundo Agriel is currently the tech lead for Dropbox's exabyte-scale blob storage system. This team manages everything from customized storage machines with many petabytes of capacity to the client APIs other teams use internally. Prior to working at Dropbox, Facundo worked at Amazon on a variety of scheduling problems for Amazon's last mile delivery team.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Agriel: I'm Facundo Agriel. I'm a software engineer at Dropbox. I'm going to be talking about Magic Pocket, which is an exabyte scale blob storage system.

Magic Pocket is used to store all of Dropbox's customer data. We also have many internal use cases. At its core, Magic Pocket is a very large key-value store where the value can be an arbitrarily sized blob. Our system has over four nines of availability. We operate across three geographical regions in North America. Our writes are optimized for 4-megabyte blobs, writes are immutable, and we also optimize for cold data. Overall, our system handles tens of millions of requests per second. A lot of the traffic behind the scenes comes from verifiers and background migrations. We currently have more than 600,000 storage drives deployed, and all of this also utilizes thousands of compute machines.

OSD (Object Storage Device)

Let's talk about the most important components of Magic Pocket, which are the object storage devices, which we call OSDs. This is an actual OSD in one of our data centers; you can see all the drives in here. Typically, we have about 100 disks per OSD. We utilize SMR technology, and each storage device has over 2 petabytes of capacity. Let's talk about why we use SMR. SMR stands for Shingled Magnetic Recording. It's different from conventional Perpendicular Magnetic Recording, or PMR, drives, which allow random writes across the whole disk. With SMR, the tradeoff is that you get increased density by doing sequential writes instead of random writes. As you can see here, the tracks are squeezed together, which means the write head overwrites the next track as it passes over it. You can still read from that track; it's just that you can't randomly write in any place that you want. This is actually perfect for us, based on the workload patterns that I just told you about, so it ends up working really well for us. SMR drives also have a conventional zone on the outer diameter that allows for caching of random writes if you need to. This conventional zone is typically less than about 1% of the total capacity of the drive.

Architecture of Magic Pocket

Let's talk about the architecture of Magic Pocket now. At a very high level, we operate out of three zones: a West Coast zone, a Central zone, and an East Coast zone. The first subsection here is a pocket. A pocket is a way to represent a logical version of everything that makes up our system. We can have many different instances of Magic Pocket. One of those could be a test pocket, and you can also run an instance of Magic Pocket on a developer desktop. We also have a stage pocket, which comes before production. Databases and compute are not shared between pockets, so they operate completely independently of each other.

Zone

Now that we have a high-level view of what a pocket is, let's talk about the different components. Let's get into what a zone is. Within a zone, the first service we have is the frontend service, and this is the service that we expose to our clients. All of our clients interact with us through this frontend service; it's what they call to make any requests. The types of requests that our clients typically make are a PUT request with a key and a blob, a GET request given some key, a delete call, a scan for what hashes are available in the system, or an update of some metadata on a specific key. When a GET request comes in, the first thing we have to consult is the hash index. The hash index is a bunch of sharded MySQL databases, and everything is sharded by this hash. A hash is basically the key for a blob; we simply take the SHA256 of that blob. Internally, we call those blocks, which are pieces of a file, typically no more than 4-megabyte chunks. In the index, a hash is mapped to a cell and a bucket, and we also have a checksum for that specific hash. A cell is another isolation unit; this is where all of the storage devices actually live. The index table just points to a specific cell and a bucket, so it's another level of indirection telling us where to go within the cell. The cells can be quite large: they can be over 100 petabytes of customer data in size. They do have some limitations on how much they can grow. If the system as a whole is low on capacity, we simply open up a new cell. That's how our system is able to horizontally scale forever, with some limitations, of course.
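As an illustration of that first hop, here is a minimal Python sketch of hashing a block and looking up which cell and bucket it maps to. The shard count, shard-selection scheme, and row layout are assumptions for illustration, not Dropbox's actual implementation.

```python
import hashlib

NUM_INDEX_SHARDS = 1024  # assumed shard count, purely illustrative


def block_hash(block: bytes) -> str:
    """Blocks are pieces of a file, at most ~4 MB, keyed by their SHA-256."""
    return hashlib.sha256(block).hexdigest()


def index_shard_for(hash_hex: str) -> int:
    """The index is sharded by hash; here a hash prefix picks the shard."""
    return int(hash_hex[:8], 16) % NUM_INDEX_SHARDS


def lookup_block(index_shards: list[dict], block: bytes) -> dict:
    """Return the index row telling us which cell and bucket hold this block."""
    h = block_hash(block)
    row = index_shards[index_shard_for(h)][h]
    # e.g. {"cell": "cell-17", "bucket": 4821, "checksum": "..."}
    return row
```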

Let's talk about what else we have in a zone. Another component in the zone is the cross-zone replicator. The idea behind the cross-zone replicator is that within Magic Pocket, we do cross-zone replication, meaning we store your data in multiple regions. This is done asynchronously in the background. Once a commit happens and you know where your data is, we automatically queue it up in the cross-zone replicator so it can be replicated over to another zone. We also have a control plane. The control plane basically manages traffic coordination within a zone; for example, if we have migrations in the background, it generates a plan so as not to impede live traffic. We also use it to manage reinstallations of specific storage machines. For example, we might want to do a kernel upgrade or an operating system upgrade, and the control plane will take care of that for us. It also manages cell state information: the cells have certain attributes as well, for example the type of hardware involved, and so on. That's what it's doing.

Cell

We talked about how a hash is mapped to a cell and a bucket. Let's get into what a cell and a bucket are. Within the cell we have buckets. The first thing we have to do if we want to fetch our blob back is consult a component called the bucket service. The bucket service knows about a few things: it knows about buckets and volumes. Any time we do a fetch, we first find the bucket. That bucket is mapped to a volume, and the volume is mapped to a set of OSDs. The volume can be open or closed, there's a generation associated with it, and volumes are of different types: a volume can be in a replicated or erasure coded state. When we ask for a bucket, the volume will tell us which specific OSD has our blob. Let's say we go to bucket one, which maps to volume 10. Then within the volume, we find this OSD here. We simply need to ask that OSD for our specific blob, and it will hand it back to us. That basically completes the way we retrieve a blob from Magic Pocket. If you want to do a write, it's a little bit different, but essentially the same; it's just that the buckets are pre-created for us. The frontend service simply needs to figure out which buckets are open for it to write to, and it will write to buckets that are open and ready to be written to. In this case here, if it's an open volume ready to be written into, we write to this set of OSDs, so within a volume it will be mapped to these four. That's where your data will be stored for that specific volume.
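The resolution path inside a cell might look roughly like the following sketch, assuming simple in-memory stand-ins for the bucket service and a hypothetical OSD client; it is not Magic Pocket's real code.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    volume_id: int
    osd_addresses: list[str]  # the OSDs holding this volume's data
    is_open: bool             # open volumes accept new writes
    erasure_coded: bool       # closed volumes may be erasure coded


class BucketService:
    """Knows which volume backs each bucket, and which OSDs back each volume."""

    def __init__(self, bucket_to_volume: dict[int, int], volumes: dict[int, Volume]):
        self.bucket_to_volume = bucket_to_volume
        self.volumes = volumes

    def volume_for_bucket(self, bucket_id: int) -> Volume:
        return self.volumes[self.bucket_to_volume[bucket_id]]


def get_block(bucket_service: BucketService, osd_client, bucket_id: int, hash_hex: str) -> bytes:
    """Resolve bucket -> volume -> OSDs, then ask an OSD for the blob."""
    volume = bucket_service.volume_for_bucket(bucket_id)
    for addr in volume.osd_addresses:   # any replica of a replicated volume will do
        blob = osd_client.fetch(addr, volume.volume_id, hash_hex)
        if blob is not None:
            return blob
    raise KeyError(f"{hash_hex} not found in volume {volume.volume_id}")
```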

Let's talk about a couple of other things within the cell. Another component that's very important is the coordinator. The coordinator is not in the path of a PUT or GET request; it works in the background. What it does is manage all of the buckets and volumes, as well as all the storage machines themselves. It's constantly health checking storage machines, asking them what information they know, and reconciling that information with the bucket service and the bucket database. Other things that it does: it will do erasure coding, and it will do repairs. For example, if a storage machine goes out, data will need to move to another machine, and the coordinator is in charge of doing that. It will optimize data by moving things around within the cell, or if there's a machine that needs to go offline, it will take care of that too by moving data to some other machine. The way it moves data around, it doesn't actually copy the data itself; that's all done by a component called the volume manager. The volume manager is basically in charge of moving data around and recreating new volumes when we need to, and so on. I talked a little bit about some of the background traffic that we have. A lot of verification steps also happen within the cell, as well as outside of the cell. We'll talk about that as well.

Buckets, Volumes, Extents

Let's talk a little bit more about buckets and volumes, and what those are in more detail. We have three concepts: buckets, volumes, and extents. You can think about a bucket as a logical storage unit. I mentioned that if we want to do a write, we write into a bucket, which is associated with a volume, and then finally there's this other concept called an extent. An extent is just about 1 or 2 gigabytes of data found on a disk. If we want to do a write, we simply figure out which buckets are open. Assuming we found this bucket 1, we have to get the set of OSDs associated with it. Then when we do the writes, we make the write within the extents themselves. The extent information and the volume information in the buckets are all managed by the coordinator I talked about before. If any data is missing, or things like that, you can always get it from the other remaining storage machines, and the coordinator will be in charge of finding a new placement for the extent that was lost. To recap: a bucket, think of it as logical storage, typically 1 to 2 gigs. A volume is composed of one or more buckets, depending on the type, and a set of OSDs. The type is, again, whether it's replicated or erasure coded, and whether it's open or not. Once we close a volume, it's never opened up again.
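To make that recap concrete, one possible way to model the bucket/volume/extent relationship is sketched below; all field names are illustrative assumptions, not Dropbox's schema.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    osd_address: str        # which storage machine holds it
    disk_offset: int        # where on disk the extent starts
    length: int             # roughly 1-2 GB of data


@dataclass
class Bucket:
    bucket_id: int          # logical storage unit, typically 1-2 GB
    extents: list[Extent]   # the physical extents backing this bucket


@dataclass
class VolumeInfo:
    volume_id: int
    bucket_ids: list[int]   # one or more buckets, depending on the type
    osd_addresses: list[str]
    generation: int
    erasure_coded: bool     # replicated or erasure coded
    is_open: bool           # once closed, a volume is never opened again
```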

How to Find a Blob in Object Storage Devices

Let's talk about how we find the actual blob on a storage machine. We have the blob we want to fetch, and we now know the volume and the OSDs that volume is associated with, so these OSDs are supposed to have our blob. We store the addresses of these different OSDs, and we talk directly to those OSDs. When the OSDs load up, they load all the extent information and create an in-memory index from the hashes they have to the disk offset. For example, if you had some hash of blob foo, it will be located at disk offset 19292. In this case, this volume is of type replicated, not erasure coded, so we have a full copy available on all four OSDs mentioned here. That's for fetching the block. If you want to do a PUT, it will be the same type, 4x replicated, and you simply do a write to every single OSD. We do the requests in parallel, and we don't ack back until the write has been completed on all of the storage machines.
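A small sketch of both ideas follows, under assumed interfaces: the in-memory index an OSD builds at startup, mapping each hash to an extent and disk offset, and a replicated PUT that only acknowledges once every OSD has acknowledged.

```python
from concurrent.futures import ThreadPoolExecutor


class OsdIndex:
    """In-memory index an OSD builds when it loads its extents at startup."""

    def __init__(self):
        self.offsets = {}  # hash -> (extent_id, disk_offset)

    def load_extent(self, extent_id: str, records):
        for hash_hex, disk_offset in records:   # records read from the extent on disk
            self.offsets[hash_hex] = (extent_id, disk_offset)

    def locate(self, hash_hex: str):
        return self.offsets[hash_hex]           # e.g. ("extent-42", 19292)


def replicated_put(osd_clients, volume_id: int, hash_hex: str, blob: bytes) -> None:
    """Write to every replica in parallel; ack only after all replicas ack."""
    with ThreadPoolExecutor(max_workers=len(osd_clients)) as pool:
        futures = [pool.submit(c.write, volume_id, hash_hex, blob) for c in osd_clients]
        for f in futures:
            f.result()  # re-raises if any replica failed, so the PUT is not acked
```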

Erasure Coding

Let's talk about the difference between a replicated volume and an erasure coded volume, and how we handle it. With 4x replication and two zones that we replicate the volume to, that can be quite expensive: overall this would be 8x replication. That's very costly, and we want to lower our replication factor to make our systems more efficient. This is where erasure codes come in. When a volume is in a replicated state, shortly after the volume is almost full it will be closed, and it becomes eligible to be erasure coded. Erasure codes help us reduce the replication cost with similar durability to straight-up replication. In this case, we have an erasure code, let's say Reed-Solomon 6 plus 3: we have 6 data OSDs, and we have 3 parities over here. We call this a volume group, or grouping. You'll have a single blob in one of the data extents. Then, if you lose any one of these OSDs, you can reconstruct it from the remaining parities and the other data extents.

Let's go into a little bit more detail here on erasure coding. In this case, as I mentioned, you can read from the other data extents and parities for the reconstruction. As you can imagine, this area becomes really interesting. There are a lot of variations of erasure codes out there with many different tradeoffs around overhead. Let's talk briefly about that. Erasure codes can be quite simple: you can use something like XOR, where you can reconstruct a piece from the XOR of the others, or you can use very custom erasure codes. There are a lot of tradeoffs. For example, if you want to do fewer reads, you might have higher overhead; if you want to tolerate multiple failures within the volume group, your overhead, so your replication factor, is likely to increase. There's a very interesting paper by Microsoft called Erasure Coding in Windows Azure Storage, by Huang and others. It came out a few years ago, but it's super interesting, and it's very similar to what we actually do within Magic Pocket as well.
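As a toy illustration of the simplest case, a single XOR parity lets you rebuild any one missing data extent from the survivors; this shows the idea only, not the codes Magic Pocket actually uses.

```python
def xor_parity(extents: list[bytes]) -> bytes:
    """Parity is the byte-wise XOR of all data extents (assumed equal length)."""
    parity = bytearray(len(extents[0]))
    for extent in extents:
        for i, b in enumerate(extent):
            parity[i] ^= b
    return bytes(parity)


def reconstruct_missing(surviving_extents: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing extent by XORing the parity with the survivors."""
    return xor_parity(surviving_extents + [parity])


data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(data)
assert reconstruct_missing([data[0], data[2]], parity) == data[1]
```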

I mentioned the example before with Reed-Solomon 6, 3, so 6 data extents and 3 parities. With the codes that they came up with, called local reconstruction codes (LRC), there's a concept of optimizing for read cost. In Reed-Solomon 6, 3, the read cost if you have a failure is going to be the full set of remaining extents, so your read penalty will be 6. They came up with equivalent codes such that for single data failures you have the same read cost, but a lower storage overhead. In this example, their storage overhead, their replication factor, is roughly 1.33x, where the same setup with Reed-Solomon is going to be 1.5x. It may not seem like big savings, but at a very large scale this ends up being quite a lot. Of course, this optimizes for one failure within the group, which is typically what you will see in production. If you can repair quickly enough, you will hardly ever see more than one failure within a volume group; those kinds of multi-failure events are a lot rarer and typically don't happen that often, so it's ok to make that tradeoff. Just to reiterate: 1.33x overhead for the same read cost as a Reed-Solomon code. A super interesting paper. You can continue to lower your replication factor even further: you can see it far outpaces Reed-Solomon codes for lower overhead with similar reconstruction read costs. Typically, for LRC 12, 2, 2, you can tolerate any three failures within the group, but you can't tolerate arbitrary four-failure combinations; only some of them can be reconstructed.
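A quick back-of-the-envelope check of the overheads quoted above, where storage overhead is simply total fragments divided by data fragments:

```python
def storage_overhead(data_fragments: int, parity_fragments: int) -> float:
    return (data_fragments + parity_fragments) / data_fragments


print(storage_overhead(6, 3))        # Reed-Solomon (6, 3): 1.5x
print(storage_overhead(12, 2 + 2))   # LRC (12, 2, 2): ~1.33x
```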

Can We Do Better?

Can we do better than this? Notice that we have a 2x overhead for cross-zone replication. Even though our internal zone replication factor is quite good, and close to ideal, we still have this multi-region replication that we do. What else can we do? A while ago, the team made some really good observations about the type of data that we have in our system. The observation was that 90% of retrievals are for data uploaded in the last year. As you go through the graph, you can see that 80% of the retrievals happen within the first 100 days. This is quite interesting: it means that we have a bunch of data that's essentially cold and not accessed very much, and we want to optimize for this workload. We have a workload with few reads. We want similar latency to what we have today for the rest of Magic Pocket. The writes don't have to be in the hot path, meaning we don't have to do live writes into the cold storage; it can happen at some point later. We want to keep similar durability and availability guarantees, but lower that replication factor further down from 2x. Another thing that we can do here is make use of more than one of our regions.

Cold Storage System

Now I'm going to talk about our cold storage system and how it works at a very high level. The inspiration came from Facebook's warm blob storage system; there was a paper written a few years ago that had a very interesting idea. The idea is as follows. Let's say you have a blob and you split it in half: the first half is blob1, and the second half is blob2. Then you take a third part, which is the XOR of blob1 and blob2. We call those fragments. Those fragments are individually stored in different zones: you have blob1, blob2, and blob1 XOR blob2. If you need to get the full blob back, you simply need any two of the fragments: blob1 and blob2, blob1 and the XOR, or blob2 and the XOR. You need any two of the regions to be available to do the fetch. If you want to do a write, you have to have all regions fully available. Again, this is fine, because the migrations happen in the background, so we're not doing them live.
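A minimal sketch of that fragment scheme, assuming whole blobs in memory and glossing over padding and fragment metadata:

```python
def make_fragments(blob: bytes):
    """Split a blob in half and add an XOR fragment; each goes to a different zone."""
    half = (len(blob) + 1) // 2
    frag1 = blob[:half]
    frag2 = blob[half:].ljust(half, b"\x00")           # pad so both halves line up
    xor_frag = bytes(a ^ b for a, b in zip(frag1, frag2))
    return frag1, frag2, xor_frag


def reassemble(original_len: int, frag1=None, frag2=None, xor_frag=None) -> bytes:
    """Any two of the three fragments are enough to rebuild the blob."""
    if frag1 is None:
        frag1 = bytes(a ^ b for a, b in zip(frag2, xor_frag))
    if frag2 is None:
        frag2 = bytes(a ^ b for a, b in zip(frag1, xor_frag))
    return (frag1 + frag2)[:original_len]


blob = b"hello magic pocket"
f1, f2, x = make_fragments(blob)
assert reassemble(len(blob), frag1=f1, xor_frag=x) == blob   # zone holding frag2 unavailable
```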

Cold Storage Wins

Let's talk about some of the wins that we get from our cold storage. We went from 2x replication to 1.5x replication, which is a huge amount of savings, nearly 25% off the 2x replication that we had. Another win around durability is that the fragments themselves are still internally erasure coded. The migration, as I said, is done in the background. When you do a fetch from multiple zones, that incurs a lot of overhead on the backbone bandwidth. What we do is hedge requests, such that we send the request to the two closest zones to where the originating service is. Then, if we don't hear a response from those two zones within a period of a few hundred milliseconds, we fetch from the remaining zone as well. That saves quite a lot on the backbone itself.
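The hedging logic might be sketched like this, assuming a hypothetical async fetch_fragment(zone, key) call and a 300 ms hedge delay standing in for "a few hundred milliseconds":

```python
import asyncio

HEDGE_DELAY_SECONDS = 0.3   # assumed; "a few hundred milliseconds"


async def fetch_two_fragments(fetch_fragment, zones_by_distance, key):
    """Ask the two closest zones first; hedge to the third only if they are slow."""
    closest_two, remaining = zones_by_distance[:2], zones_by_distance[2:]
    tasks = {asyncio.create_task(fetch_fragment(zone, key)) for zone in closest_two}
    done, pending = await asyncio.wait(tasks, timeout=HEDGE_DELAY_SECONDS)
    if len(done) < 2:
        # The close zones didn't both answer in time: fan out to the remaining zone too.
        pending |= {asyncio.create_task(fetch_fragment(zone, key)) for zone in remaining}
        while len(done) < 2 and pending:
            newly_done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            done |= newly_done
    for task in pending:
        task.cancel()               # drop the slow requests to spare the backbone
    return [task.result() for task in list(done)[:2]]
```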

Release Cycle

Let's move over to some operational discussion of Magic Pocket. The first thing I want to talk about here is how we do releases. Our release cycle is around four weeks end-to-end across all our zones. We first start off with a series of unit and integration tests before the changes can be committed. The unit and integration tests typically run a full version of Magic Pocket with all of our dependencies; you can run this fully on a developer box. We also have a durability stage. The durability tester runs a longer suite of tests with full verification of all the data: we'll do a bunch of writes, and we'll make sure that the data is all fully verified. It's about one week per stage here. This is to allow the verifications to happen in each individual zone; typically, we can do all the verifications for metadata checks within one week. The release cycle is basically automated end-to-end. We have checks that run as we push code changes forward, and the changes will automatically get aborted, or will not proceed, if there are any alerts and things like that. Only in exceptional cases do we have to take manual control.

Verifications

Let's get into verifications. We have a lot of verifications that happen within our system. The first one is called the Cross-zone verifier. The idea behind this is that we have clients upstream from us that know about data, so files, and how that maps to specific hashes in our system; the Cross-zone verifier is essentially making sure that these two systems are in sync all the time. Then we have an Index verifier. The Index verifier scans through our index table and asks every single storage machine if it knows about a specific blob. We won't actually fetch the blob from disk; we just ask, do you have it, based on what it recently loaded from the extents it's storing. Then we have the Watcher. The Watcher is a full validation of the actual blobs themselves. We do sampling here; we don't do this for all of our data. We validate this after one minute, an hour, a day, and a week. Then we have a component called the Trash inspector. This makes sure that once an extent is deleted, all the hashes in that extent have actually been deleted. It's a last-minute verification that we do. Then we also scrub, or scan, through the extents, checking a checksum that's on the extents themselves.
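Purely as an illustration, an Index verifier pass could look roughly like the sketch below, with all names hypothetical: walk the index rows and ask each OSD that should hold a block whether it has it, without reading the blob off disk.

```python
def verify_index_shard(index_rows, osds_for_row, osd_client, report_mismatch):
    """index_rows yields (hash, row) pairs from one shard of the index table."""
    for hash_hex, row in index_rows:
        for osd_address in osds_for_row(row):                    # OSDs that should hold this hash
            if not osd_client.has_block(osd_address, hash_hex):  # metadata-only check, no disk read
                report_mismatch(hash_hex, osd_address, row)      # flag for repair/investigation
```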

Operations

Let's go to more operations. We deal with lots of migrations. We operate out of multiple data centers, so there are migrations to move out of different data centers all the time. We have a very large fleet of storage machines that we manage, and you have to know what is happening all the time. There's lots of automated chaos going on, so we have tons of disaster recovery events happening to test the reliability of our system. Upgrading Magic Pocket at this scale is just as difficult as the system itself. Some of the operations that we do are around managing background traffic. Background traffic accounts for most of our traffic and disk IOPS; for example, the disk scrubber is constantly scanning through all the disks and checking the checksums on the extents. We do a couple of things: traffic by service is categorized into different traffic tiers, live traffic is prioritized by the network, and we're ok with dropping background traffic. I talked about the control plane; it generates plans for a lot of the background traffic based on any forecasts we have about a data center migration that we need to do, the type of migration we're doing, maybe it's for cold storage, and so on.

Around failures, a couple of interesting notes: 4 extents are repaired every second, and the extents can be anywhere from 1 to 2 gigs in size. We have a pretty strict SLA on repairs of less than 48 hours. It's fine if we occasionally go over this, but because the 48 hours is baked into our durability model, we want to keep repair time as low as possible. OSDs get allocated into our system automatically based on the size of the cell, the current utilization, and whether there's any free pool within the data center they're operating in. We have lots of stories around fighting ongoing single points of failure, like SSDs of a certain variety filling up all around the same time. We have to manage all those things.

Another note on migrations: we have lots of them. Two years ago, we migrated out of the SJC region. There was a ton of planning behind the scenes for something like this to happen. I'll show you a quick graph of a plan for our migration out of SJC. This is essentially the amount of data that we had to migrate out of SJC over time. The light red line is the trend line, and the blue line is about what we were expecting. Initially, when we started the migration, it was going really badly; then over time we got really good, and then we had this really long tail end that we didn't really know how to address. For these migrations that are very large, hundreds of petabytes in size, there's a lot of planning that goes on behind the scenes, and we have to give ourselves extra time to make sure that we can finish in time.

Forecasting

Forecasting is another very important part of managing a storage system at this scale. Storage is growing constantly. Sometimes we have unexpected growth we need to account for and absorb into our system. We may have capacity crunch issues due to the supply chain going bad, say COVID disruptions. We always need to have a backup plan as soon as we figure out that there are problems ahead, because it takes so long to actually get new capacity ordered and delivered to a data center; it's not instantaneous. Finally, our forecasts are fed directly into the control plane to perform these migrations, based on what our capacity teams tell us.

Conclusion

In conclusion, a couple of notes that have helped us manage Magic Pocket. Protect and verify your system: always be verifying your system. This has a very large overhead, but it's worth having this end-to-end verification. It's incredibly empowering for engineers to know that the system is always verifying itself for any inconsistencies. At this scale, it's actually very much preferred to move slowly. Durability is one of the most important things that we care about within our system, so we move slowly and steadily, waiting for those verifications to happen before we deploy anything. Always thinking about risk and what can happen is another important mindset to keep. Keeping things simple is very important: when we have very large-scale migrations, if you have too many optimizations, there's too much to keep in your mental model, and that makes things difficult when you have to plan ahead or debug issues. We always try to prepare for the worst. We always have a backup plan, especially for migrations or if there are any single points of failure within our system, and when we're deploying changes, we make sure that things are not a one-way door.

Questions and Answers

Does Dropbox maintain its own data center, or do you partner with hyperscalers like AWS for colocation?

Agriel: We do a little bit of both, actually. In North America regions, we lease our data centers. In other regions where it doesn't make sense, we utilize, for example, S3, and AWS for compute as well. A bit of a mix.

Anand: You leverage S3?

Agriel: Yes. For example, in some European regions, some customers there want their data actually to exist locally, in their region, for various compliance reasons. In those cases, we actually utilize S3 because we don't have data center presence there.

Anand: Then, do you use any type of compression of the actual data before storing?

Agriel: We don't compress the data before storing it. Once the data is actually uploaded and in our servers, that's when we do compression and encryption. The data is encrypted and compressed at rest. There's obviously tradeoffs with that. If you were to compress data on the clients, we have many different types of clients from desktops to mobile. It could actually be quite expensive for the user to do that.

Anand: Just curious, for SMR disks, how does the cost compare to other options?

Agriel: We work very closely with different hardware vendors to always adopt the latest hard drive technology. A few years ago, probably five or six years ago now, we started utilizing SMR technology very heavily. I talked about the tradeoffs. Compared to PMR, so traditional drives, SMR is a lot cheaper given the density. I can't talk about the exact cost on a per-gigabyte basis, but it is significantly cheaper. It works so well for us, and we save a ton of money by adopting SMR technology. We actually published some information about the read speed and IOPS performance of SMR versus PMR. In the end, for us, because we have sequential reads and writes, it didn't make too much of a difference. There are some latency differences with SMR, but given our workloads, it's pretty comparable to PMR. Because again, we don't do these random writes, and so on, so it works very well for us.

Are there any other disk types, other than SMR, that you may also consider in the future?

Agriel: The industry is moving to various technologies in the future. SMR is hitting capacity limits over the next few years. To give you some idea, the latest SMR drives coming out this year are about 26 terabytes per drive, which is huge. Looking out three, four, or five years from now, it looks like laser-assisted technology is going to be the next big thing for the different vendors. Look out for HAMR (heat-assisted magnetic recording) technology, for example. Those drives are expected to increase density to 40 terabytes within the next few years. That looks like the next big thing that's coming for new drive technology.

Anand: Do you have tiered storage, for less recently accessed data, that you do behind the scenes?

Agriel: We don't do tiered storage. What we used to have is SSD drives as a write cache on a per-OSD basis. The problem that ended up happening is that that became our limiting bottleneck for writes. We actually ended up getting rid of it recently, and we're able to get a lot more write throughput across the fleet. The other thing that I talked about is colder data, for which we utilize a cold storage tier; it's still backed by SMR drives. We have also explored, for example, using SSDs more heavily, and that's something that we might leverage. It's just incredibly hard to do caching well at this scale, for various reasons: production issues, verifications, inconsistencies, all that stuff is really hard once you go through the motions. That's one of the limiting factors. For the cold storage tier, because we built it on top of existing infrastructure, it was quite straightforward to build.

Anand: You've seen the system evolve for a few years now. What do you have on the horizon that you'd like to see built, or you'd like to build or design into the system?

Agriel: I think, for us, there are always a few different things. The steady-state status quo that we're looking into is continuing to scale the system for continuous growth. We're growing at double digits per year, which typically means that in three or four years we essentially have to double the overall capacity of our system. When that happens, that has a lot of implications for our system. A lot of things that you thought would continue to scale actually run into various vertical scaling limits. That's something that we're actively looking into for next year. The other one is around having more flexibility around our metadata. We use sharded MySQL behind the scenes, but that has its own set of problems around being able to manage MySQL at that scale. If you want to easily add new columns, and so on, that's a huge pain. If you want to continue to scale that up, you also have to completely do a split, and that's costly. Most likely our metadata stack will change next year; that's what we're looking at. Then the last one is hardware advancements: supporting, for example, HAMR once it comes out, and being able to get ahead of the curve on that, is something that we're always working on as well.

Anand: Do you have any war stories that come to your mind, parts of this system that woke you up late at night, burned you, and you guys had to jump on it and come up with maybe a short-term solution, long-term solution?

Agriel: There are a lot of interesting stories that we've had to deal with over the years around memory corruption and so on, and finding areas within our system that haven't had proper protection. For example, corruptions are happening all the time, but because of these protections and verifications going on behind the scenes, we don't really notice them; the system fails gracefully. Once in a while, we'll get woken up in the middle of the night for some new random corruption that we didn't know about. One of them actually happened recently. The data is fine, because it is replicated in many regions. Even if the data is corrupted, it still continues to live in what we call trash, which is another protection mechanism: when data is deleted, it's soft-deleted, so you can still recover it. We had an all-hands-on-deck situation a few weeks ago where this happened, and it took many late nights to figure out where exactly the problem might be. Obviously, we don't log everything, so being able to figure out where the problem came from is very difficult. I think we've tracked it down to one or two places now.

Anand: Trash was filling up faster than you could get rid of it, because the soft deletes were not real deletes, you were running up storage.

Agriel: We always keep data around in trash for seven days. That's part of the problem with the memory corruption. We found just a single hash, a single piece of data, that was corrupted. As I was mentioning, being able to track that stuff down is very difficult, even though we have many different places where we do various checks along the way.

 


 

Recorded at:

Jul 07, 2023
