Stuff The Internet Says On Scalability For February 14th, 2020

Wake up! It's HighScalability time:

Visualize the huge scale of Deep Time by identifying key reference points along the way.

Do you like this sort of Stuff? Without your support on Patreon Stuff won't happen. I also wrote Explain the Cloud Like I'm 10 for everyone needing to understand the cloud (who doesn't?). On Amazon it has 93 mostly 5 star reviews (152 on Goodreads). Please be a real cloud hero and recommend it.

Number Stuff:

  • #1: California has the most tech jobs and shows no signs of losing its dominance. Amazon is the #1 US tech employer.
  • 54 million: peak transactions per second on a DynamoDB table during Prime Day.
  • 200ms: time saved by enabling HTTP/2 server push. Send Link headers and your application will load 25-50% faster.
  • 30 miles: quantum entanglement over fiber.
  • 2 billion: WhatsApp users, up from 1.5 billion 2 years ago.
  • 1400%: increase in job postings for AR and VR. Blockchain only increased 9%.
  • 29: average lines of code produced per day by @antirez over 10 years working on redis.
  • 70%: US smart speaker market owned by Amazon Echo.
  • 85%: reduction in storage space by streaming apps to your phone.
  • 30%: of users pay for a gaming subscription service.
  • 34 hours: time it takes to communicate to Voyager 1 from Earth.
  • 135,000: pneumatic tube messages Sears processed per day, back in the day.
  • 450: kilometers of pneumatic tubes running under the streets of Paris in 1945. Canisters traveled at about 22 MPH. You could even set an address and the tube would route itself.
  • 4.3 billion: records stolen in data breaches over the last 15 years.
  • $1.5 billion: Alibaba cloud revenue for the quarter, 62% growth rate, solidly in fourth place.

Quotable Stuff:

  • @jimmy_wales: We [Wikipedia] already store data. In a database [as opposed to blockchain]. It works well.
  • @QuinnyPig: Myth: Your AWS bill is a function of how many customers you have. Fact: Your AWS bill is a function of how many engineers you have.
  • @ali01: It is mindblowing that roughly the same protocol design for the Internet Protocol (IP) has taken the internet from nothing to ~15B connected devices and has survived for 40 years through the period of fastest technological change in history. What lessons can we draw? A thread. The internet has succeeded to the degree that it has for many reasons, but the most important one is the economic flywheel that it created: more providers of bandwidth, leads to more developers building applications, leads to more users who (via those apps) demand more bandwidth. The core enabler of this flywheel—sustained through a period during which our basic technology for moving bits from A to B has improved by 1,000,000x—has been the radical minimalism and generality of IP. Let’s walk through how it worked.
  • cldellow: If your data is in S3, my experience is that you can push ~20-40MB/core/sec on most instances. OP is probably talking about an x1e.32xlarge. According to Daniel Vassalo's S3 benchmark [1], it can do about 2.7GB/sec. So your 4TB DB might take ~30min to fetch. Bandwidth is free, you'd pay $2 for the 30 min of compute, and some fractions of pennies for the few hundred S3 requests.
  • thedance: A small computer with 1 SSD will take at least 10-20 minutes to make a pass over 1TB of data, if everything is perfectly pipelined.
  • auiya: Was at Google in 2007 when the launch(es) were happening. The commenter at the bottom of the article is correct, the existential threat Google faced wasn't from Apple, it was from US mobile carriers. Google wanted leverage over the carriers to bring Google products to feature phones more easily. They had no intention of actually releasing Android at the time, and it wasn't close to being ready for release anyway. Apple then released iPhone and beat Google to the punch, and I remember us all thinking "oh shit this is WAY better than the janky, blackberry-esque prototypes we have in development"[1]. So everyone at Google scrambled to get something out the door in response, and the compromise was the HTC G1/Dream on the cheapest carrier at the time (T-mobile). I was using a Nokia N95 at the time, and while the Dream was laughably bad, I did appreciate the bits of the Dream which compared favorably, like mobile web browsing and application development. I still feel Android is mostly a reactive product with many clumsy holdovers from v1, but I digress.
  • Simon Peyton Jones: I think that by and large 99.9% of the population has no visceral, sort of, gut feel for just how complicated, remarkable and fragile our software infrastructure is. The search box looks simple but there’s the millions of lines of code… If you could see that in a way that you can see an aircraft carrier or some complicated machine that you can see inside, you’d have a more visceral sense for how amazing it is that it works at all, still less that it works so well. But you don’t get that sense from a computer program because it’s so tiny, right? All of my intellectual output for my entire life, including GHC, would easily fit on a USB stick.
  • siffland: Funny story, about 6 years ago we got a HP DL980 server with 1TB of memory to move from an Itanium HP-UX server. The test database was Oracle and about 600GB in size. We loaded the data and they had some query test they would run and the first time took about 45 minutes (which was several hours faster than the HPUX), They made changes and all the rest of the runs took about 5 minutes for their test to complete. Finally someone asked me and i manually dropped the buffers and cache and back to about 45 minutes.
  • @QuinnyPig: For highly tuned, highly specific workload constraints like this I'd question whether running MySQL on top of EC2 instances makes more sense. RDS hides the knobs and dials you usually don't need, but in this case it sounds like you need it.
  • @Nick_Craver~ "What's the schema of the JSON payload like?" "Well, it depends..." F*ck. PLEASE STOP DOING THAT. Don't make it an object if there's one and an array if there's more, etc. Please be consistent. It screws a lot of platforms and efficiency when there's a dynamic schema.
  • Foremski: what does this say about the effectiveness of Google's ads? They aren't very good and their value is declining at an astounding and unstoppable pace.
  • Geoff Huston: We are by no means near the end to the path in the evolution of subsea cable systems, and ideas on how to improve the cost and performance abound. Optical transmission capability has increased by a factor of around 100 every decade for the past three decades and while it would be foolhardy to predict that this pace of capability refinement will come to an abrupt halt, it also has to be admitted that sustaining this growth will take a considerable degree of technological innovation in the coming years.
  • crimsonalucard: Functional programming is about function composition. If you haven't picked up on this notion, then you haven't fully understood FP.
  • @rishmishra: For microservices, which are better--shallow or deep healthchecks? Shallow: Return 200 OK bc the service itself is up Deep: Return 200 OK after checking if the service can access S3, Postgres, Redis, Elasticsearch, Graphite, Sentry, and other dependencies.
  • mantap: The incredible thing about this is that when NASA gave the contract to both Boeing and SpaceX, SpaceX was supposed to be the risky new startup and Boeing was supposed to be the safe "nobody ever got fired for choosing IBM" backup option. How times have changed.
  • Bill Fefferman: In conclusion, we are currently at a critical juncture in the field of quantum computing in which quantum computational theory and experiment are being synergized in a revolutionary new way.   The Google/UCSB result is certainly impressive and is evidence of significant technical progress.  Even if it would take only 2.5 days rather than 10,000 years to simulate with a classical supercomputer, it is still the case that a small, 53-qubit machine is competitive with the largest classical supercomputer currently under construction (costing $0.5B).  A vibrant quantum computing industry is emerging, with Google focused on pushing the boundaries of experimental prototypes and IBM developing a significant user base by deploying more manufacturable machines in its quantum cloud service.  While it is unclear exactly what the future holds, it is quite clear we are well on the exciting road to a coming quantum age.
  • echelon: Microservices are for companies with 500+ engineers. They help teams manage complex interdependencies by creating strong ownership boundaries, strong product definition, loose coupling, and allow teams to work at their own velocities and deployment cadences. If you're doing microservices with 20 people, you're doing it wrong.
  • jnwatson: The company I work for is 100% remote. 100% remote is far easier than 10% remote, because all of your processes must evolve around that. My general observation is that the worse your processes are before your transition, the worse the conversion will go. "By the skin of your teeth" and informal processes are less efficient. Communication must be documented and asynchronous.
  • Enrico Signoretti~ the hard disk “will disappear from the small datacenter. … No more hard drives in the houses, no more hard drives in small business organisations, no more hard drives in medium enterprises as well. These kinds of organisations will rely on flash and the cloud, or only the cloud probably...upcoming 18 and 20TB capacity disk drives, going on up to 50TB, will increasingly be designed and built for hyperscalers such as Facebook and eBay, and public cloud service providers like AWS and Azure.
  • Memory Guy: UPMEM announced what the company claims are: “The first silicon-based PIM benchmarks.”  These benchmarks indicate that a Xeon server that has been equipped with UPMEM’s PIM DIMM can perform eleven times as many five-word string searches through 128GB of DRAM in a given amount of time as the Xeon processor can perform on its own.  The company tells us that this provides significant energy savings: the server consumes only one sixth the energy of a standard system.  By using algorithms that have been optimized for parallel processing UPMEM claims to be able to process these searches up to 35 times as quickly as a conventional system. Furthermore, the same system with an UPMEM PIM is said to sequence a genome ten times as fast as the system that uses standard DRAM, once again using one sixth as much energy.
  • Aaron Marshall: In both businesses, the days of “growth at all costs” seem to be over. (In fact, that’s what Khosrowshahi told investors.) It’s all about strategy, discipline, and making money now. In fact, Uber and Lyft seemed to have shifted perspectives in Silicon Valley, on Wall Street, and all over the globe. Investors seem more motivated to find companies with paths to profitability than they were, say, last spring—right before both ride-hail companies debuted on the public markets.
  • Justin Ryan: We recently moved from a REST infrastructure to gRPC at Netflix. That's the new, cool thing to do, and it's been a huge win. There's just productivity gains just around how you would define that schema. The protobuff does that for us. It's been really great, and then the generating the clients. We had a lot of people who were in the habit of making heavyweight clients because it's, "It's REST. There's no way to do it, so I'm going to make my own," and they got really big. The gRPC really let us say, "No, that's it. There's one generated client, that's all you use. It's nice and thin."
  • Ryan Zezeski: But as any software veteran knows, projects often don’t survive the whims of management. No one is fired for picking Linux (these days), but they might be for picking something else. I already experienced this once before, as a core developer of the Riak database. We were rigorous, paying homage to the theoretics of distributed systems, but with a focus on bringing that theory to the masses. So much so that our last CEO said we had to stop doing so much “computer science”. He meant it as an insult, but we wore it as a badge of honor. But hey, MongoDB had a sweet API and BJSON, who cares if it lost your data occasionally [1]. I understand that people like to stick with what is popular. I respect that decision — it is theirs to make. But I’ll never be a part of that crowd. I want to use software that speaks to me, software that solves the problems I have, software guided by similar values to my own. For me, no project does this more than SmartOS and the illumos kernel. It is my Shawshank Redemption in a sea of MCU.
  • Todd Montgomery: This is how it starts. You start thinking about early, upfront, "What are the limitations that I'm going to be working under?" I do this with protocols, I look at the protocol itself because, it doesn't matter what happens in the implementation, if the protocol says that, "This happens, then this has to happen," the natural sequential part. You have to realize that that may be a bottleneck, sure, but you're not going to get around it unless you break it apart somehow. Then, start looking at what things can you break apart and which things can you start to do. Then, you start looking at the things like your caches, you start looking at branch prediction, you start looking at virtual memory, you start looking at all the other things after that.

Useful Stuff:

  • The first installment of our bigger than usual big brain segment for today. I think you'll like how he thinks. And as usual I can only capture a fraction of what's being said, so you'll want to listen to the whole interview, but here's the first part of my gloss on Jim Keller: Moore's Law, Microprocessors, Abstractions, and First Principles~
    • The way an old computer ran is you fetched instructions and executed them in order. Do the load. Do the add. Do the compare. The way modern computers work is you fetch a large number of instructions, say a window of 500, then you find the dependency graph between the instructions, then you execute those little micro-graphs in independent units. They run deeply out of order. The computer has a bunch of bookkeeping tables that say what order these operations should appear to finish in. But to go fast you must fetch a lot of instructions and find all the parallelism. (There's a toy sketch of this idea at the end of this segment.)
    • People say computers should be simple and clean, but it turns out the market for simple and clean slow computers is zero.
    • Today modern microprocessors find 10x parallelism. Branch prediction happens at high-90s percent accuracy. Twenty years ago you recorded which way a branch went last time and did the same thing next time; that worked 85% of the time. Today something that looks like a neural network is used. You take all the program flows and do deep pattern recognition on how the program is executing. You do that multiple ways and you have something that chooses which way. There's a little supercomputer in your computer that calculates which way branches go. Getting to 85% accurate branch prediction took a thousand bits. Getting to 99%, which is what lets the window grow from 50 instructions to 500, takes tens of megabits, three to four orders of magnitude more.
    • What people don't understand is that computers produce a deterministic answer even though the execution flow is highly non-deterministic. If you run a program 100 times it never runs the same way twice, but it gets the same answer every time.
    • The best computer architects aren't that interested in people and the best people managers aren't that interested in designing computers. 
    • Most people don't think simple enough. Imagine you want to make a loaf of bread. A recipe tells you exactly what to do. To understand bread you need to understand biology, supply chains, grinders, yeast, physics, thermodynamics. There's so many levels of understanding there. When people build and design things they are frequently executing some stack of recipes. The problem with that is recipes often have a limited scope. If you have a deep understanding of cooking you can cook bread, omelets, etc, there's a different way of viewing everything. When you get to be an expert at something you're hoping to achieve deeper understanding, not just a large set of recipes to go execute. Executing recipes is unbelievably efficient, if that's what you want to do. If it's not what you want to do then you're stuck. That difference is crucial. 
    • If you constantly unpack everything for deeper understanding then you'll never get anything done. If you don't unpack for understanding when you need to you'll do the wrong thing. 
    • If you want to make a lot of progress in computer architecture you need to start from scratch every 5 years.
    • A project first goes up and then shows diminishing returns over time. To get to the next level you need to start a new project. The initial starting point of that new project will be lower than the return of the old project, but it will end higher. You have two kinds of fear: short term disaster and long term disaster. People with a quarter by quarter business objective are terrified of changing anything. People who are building for a long term objective know that the short term limitations block them from long term success. You can do multiple projects at the same time, but you can't make everyone happy.
    • People think Moore's Law is one thing, transistors get smaller. But under the sheets there's literally thousands of innovations that each have their own diminishing return curve. The result has been an exponential improvement. We keep inventing new innovations. If you're an expert on one of those diminishing return curves and you can see its plateau you'll probably tell people this is done. Meanwhile some other group of people are doing something different. That's just normal.
    • A modern transistor is 1000 x 1000 x 1000 atoms. You get quantum effects down at 2-10 atoms. You can imagine a transistor down to 10 x 10 x 10. That's a million times smaller. There are techniques now to put down atoms at a single atomic layer. You can place atoms if you want to. It's just that from a manufacturing process perspective if placing an atom takes 10 minutes and you need to put 10^23 atoms together to make a computer it would take a long time. The innovation stack is very broad.
    • I'm expecting more transistors every 2-3 years by a number large enough that how you think of computer architecture has to change.
    • Done for now. Tune in next week for our next exciting episode...
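    • Before moving on, here's the toy dependency-graph sketch promised above (mine, not Keller's, and it ignores register renaming, memory disambiguation, and in-order retirement). The idea: over a small window of instructions, issue anything whose inputs are already produced rather than executing strictly in program order, so the three loads go together, then the add, then the multiply.
      from dataclasses import dataclass

      @dataclass
      class Instr:
          name: str
          reads: set   # registers this instruction consumes
          writes: set  # registers it produces

      # A tiny "window" of fetched instructions, listed in program order.
      window = [
          Instr("load r1", set(), {"r1"}),
          Instr("load r2", set(), {"r2"}),
          Instr("add r3 = r1 + r2", {"r1", "r2"}, {"r3"}),
          Instr("load r4", set(), {"r4"}),                  # independent of the add
          Instr("mul r5 = r3 * r4", {"r3", "r4"}, {"r5"}),
      ]

      produced, pending, cycle = set(), list(window), 0
      while pending:
          # Issue everything whose inputs have already been produced this "cycle".
          ready = [i for i in pending if i.reads <= produced]
          print(f"cycle {cycle}: issue {[i.name for i in ready]}")
          for i in ready:
              produced |= i.writes
              pending.remove(i)
          cycle += 1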

  • A very thoughtful version of "it depends": STOP!! You don’t need Microservices.
    • Am I here to tell you “Not to use microservices”? Absolutely not!! As a technology advocate and enthusiast, you’re entitled to have your favorites. What makes you prominent, however, is the ability to choose pragmatically, when the options are between “the right choice” and “your favorite choice.”
    • Is your application large enough to be broken down to Microservices?
    • Do you really need to scale Individual components of the Application?
    • Do you have transactions spanned across services?
    • Is there a need of frequent communication between the services?
    • The added complexity. Though the microservices are originally designed to reduce the complexity by breaking down the application into smaller pieces, the architecture in itself, is complex to deploy and maintain.
    • The cost of distribution. 
    • Adaptation of Devops. You cannot maintain and monitor microservices without the presence of a dedicated devops team.
    • Tight integrations. Some applications are tightly coupled by nature. 
    • Lack of experience.
    • End to End testing. 
    • Chaotic data contracts. Carving out data contracts for special needs will cost you time and space.
    • Legacy codebase. Are you sure the RabbitMQ framework you just developed, works well with your legacy application hosted in an IBM AIX server?
    • Debugging distresses. More services = more log files.

  • John Locke needs an update. Tesla Remotely Removed Autopilot Features from Used Model S After ‘Audit’. We need a new theory of property for the modern digital age. Our ownership of almost everything digital is now mediated by a controlling entity. That hasn't been true in the past. This is something people need to consider when thinking through the tangle of issues associated with Huawei. Every proxy, every mediation layer is an attack vector. All Huawei users are subject to update risk. This is not just Huawei of course; it applies to any remotely updatable device. It doesn't matter how thoroughly you vet a codebase now; that code is just one update away from being weaponized. And when people assure you they've checked out a system and decided it's secure, they're probably fooling themselves (Researcher Discloses Critical Flaws Affecting Millions of HiSilicon Chips).

  • The second installment of our bigger than usual big brain segment for today. Jeremy Daly with a great two-episode interview—Part 1, Part 2—with the NoSQL whisperer himself: Rick Houlihan. I suggest when Rick talks NoSQL you set your podcast player to .5x, that way it will appear as real time to us mere mortals. While we hear the generic NoSQL pitch quite often, I just love the bits where Rick talks about lessons he has learned working with real AWS customers on real problems. You can't get that type of wisdom anywhere else. And since Jeremy really knows his stuff, that only heightens the excellence. (There's a small code sketch of his quote-delta pattern at the end of this segment.)
    • Everything flows from this: So if you think about the relational database today, it's about normalized data. We're all very familiar with the idea of a normalized data model where you have multiple tables, we have all these relationships, parent child relationships and many to many relationships. And so we built these tables that contain this data and then we have this ad hoc query engine that we write called SQL. We write queries in SQL to return the data that our application needs. So the server, the database actually restructures the data and reformats the data on the fly whenever we need it to satisfy a request. Well, NoSQL on the other hand eliminates that CPU overhead and that's really what the cost of the relational database is and the reason why it can't scale because it takes so much CPU to reformat that data. So with NoSQL, what we're going to do is we're going to actually denormalize the data somewhat and we're going to tune it to what we call the access pattern, tune it to the access pattern to create an environment that allows the server to satisfy the request with simple queries. So we don't actually have to join the data together. So we talk about the modeling and whatnot in my sessions, we can get into how do we do that, but the fundamental crux of the issue here is that the relational database burns a lot of CPU to join the data and produce these materialized views whereas the NoSQL database kind of stores the data that way and makes it easier for the application to use it.
    • 90% of the applications we build have a very limited number of access patterns. They execute those queries regularly and repeatedly throughout the day. So that's the area that we're going to focus on when we talk about NoSQL.
    • The biggest mistake. Hands down. We use multiple tables, right? I mean, the bottom line is multi-table designs are never going to be efficient in NoSQL no matter what the scale. I mean, you can have the smallest application that you're working with, you can have the largest application you're working with, it's just going to get worse, right?
    • There is no join operator, but I can still achieve the same result because a join is essentially a grouping of objects and so that's what we're really doing with the NoSQL table. So it's much, much more time efficient to do an index scan than it is to do nested loop joins, and that's really what it comes down to: cost and efficiency.
    • As a matter of fact, one of my favorite stories is from my days at MongoDB. I was working with a university customer and I think 80% of the data they had in their system was attribute names, out of 64 terabytes of data. If you use kind of meaningful abbreviations, right? Like Customer ID, CID, your developers are going to understand what CID means; these are going to become kind of just parts of your vocabulary. Please do that. You'll save yourself a fortune in the long run, especially at scale.
    • I mean, if you want to increase the throughput of any NoSQL database, you'd talk about parallel access, right? So in DynamoDB, what we're going to try and do is if your access pattern exceeds 1,000 WCUs or 3,000 RCUs for a single logical key and now bear in mind that I had ... It sounds like that's not a lot, but I have architected, I don't even know how many thousands of applications at this point on DynamoDB and right, sharding comes into play like, I don't know, less than 1% of the time.
    • So that's basically what I basically recommended to him was, "Hey, create the first version of the quote and then store deltas. Every time someone changes something, just store what changed." Now when you go use the same query where customer ID equals X starts with quote ID, but what you're getting is the top level quote and all the deltas and then the client side, you can just quickly apply the deltas and show them the current version and then whenever they need to see the previous versions, you just back the deltas off as they back through the various versions of the quote. So this caused a significant decrease in their WCU provisioning after they went from a thousand WCUs provision to 50. That's a 95% reduction. So that was a really good example of how understanding that ... Don't store data you don't need to store. Denormalization does not always mean copying data, right?
    • One of the biggest problems we see in NoSQL is facilitated by the databases that support these really, really large objects, right? Things like, I think, MongoDB supports a 16 megabyte document, right? And the reality is that I don't know very many access patterns, and again, I've worked with thousands of applications at this point, that need to get 16 megabytes of data in a single request.
    • And I was like, "Don't do that." Just store the items that people booked and then on the client side, when they just say, "Here's the day that I want to book an appointment for," send them down the things that are booked, then let them figure out what slots are available. I am a big fan of pushing whatever logic I can down to the end point, right? Give them a chunk of data and let them triage this, give them enough data to do the two or three things that I know they're about to do as soon as they make that request, right?
    • Preload some of that data so that they know, I know 99% of the users that come in here when they ask for this, the next thing they hit is that, okay, great or the next thing they hear is one of these three things. Great, guess what? They're going to get all three of those things and it saves round trips to the server and what are we talking about? Most of the time we're talking about pushing down a couple of kilobytes of data, right? 
    • So and with databases, if you think about this, what ... It's not the database that locks you in, it's the data, right? When I deploy 10 terabytes of data someplace, I'm locked in. It doesn't matter if I'm on MongoDB or Cassandra or whatever. If I want to go somewhere else, I've got to move the data. That takes a long time, right?
    • Anyways, when you boil it down to that lowest common denominator, it doesn't really matter if I'm using Cassandra, MongoDB, DynamoDB or Cosmos DB, who cares? It's all ... The data model is the data model and it's all select star where X equals and they all do that just as well as each other.
    • As far as the time series data goes, that's a really good use case for Dynamo and we'll see a lot of people roll that time series data and use that Streams, Lambda processing to update those partitioned analytics, right? Top end, last end, average and all that stuff and then they'll age out the actual item data, right? And do exactly what you said. So they'll TTL that data off the table, it will roll up into S3 go into parquet files and sit in S3 and then when they need to query it, they just select the top level roll-ups out of DynamoDB. If they need to do some kind of ad hoc query, then they do exactly what you said, run the Athena queries and whatnot, but for the summary aggregations, they're still serving that up out of DynamoDB. They're just doing it and it works really well like you said because once those time-bound partitions are loaded and they're loaded, they're in chain.
    • So across customers, I'm seeing a larger number of apps these days. I'm seeing people starting to realize that, "Hey, you know what? We can manage this relational data. It's just it's not non-relational. It's de-normalized." Right? I think people are starting to understand that that non-relational term is a misnomer and that we're really just looking at a different way of modeling the data. So you're starting to see more complex relational data in these NoSQL databases which I'm really glad to see and I expect that trend will continue.
    • I think a thing to remember about that too is you're not paying for the number of items you read, you're paying for the amount of data that you read.
    • It's a choice. It's optimized for the read or for the write, depends on the velocity of the workload. Depends on the nature of the access pattern. There could be times I want to do one versus the other.
    • If you have an OLTP application, you'd be crazy to deploy on anything else because you're just going to pay a fraction of the cost. I mean, literally, whatever that EC2 instance cost you, I will charge you 10% to run the same workload on DynamoDB.
    • Yeah, processing data and memory doesn't cost much money. I mean, the best example is in what the community tells us. I got a tweet from a customer the other day; he told me he'd just deprecated his MongoDB cluster, which had three small instances. They were costing about $500 a month. It took him 24 hours to write the code and migrate the data. He switched it all over to DynamoDB, and he's paying $50 a month.
    • I mean, it's just, it's amazing when you look at it. I mean, when you think about it, it's like, "That actually makes sense because the average data center utilization of an enterprise application today is about 12%." Right? That means 88% of your money is getting burned into the vapor.
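    • Rick's "store deltas" advice maps to code pretty directly. Here's a minimal sketch with boto3 (my own hypothetical table, key, and attribute names, not his actual schema): the base quote and its deltas share a partition key, one query fetches them all, and the client folds the deltas in.
      import boto3
      from boto3.dynamodb.conditions import Key

      table = boto3.resource("dynamodb").Table("quotes")  # hypothetical table name

      def get_current_quote(customer_id: str, quote_id: str) -> dict:
          # One query returns the base quote item plus every delta item,
          # because they all live under the same partition key.
          resp = table.query(
              KeyConditionExpression=Key("PK").eq(f"CUSTOMER#{customer_id}")
              & Key("SK").begins_with(f"QUOTE#{quote_id}")
          )
          items = resp["Items"]
          quote = dict(items[0]["data"])      # base item: SK == "QUOTE#<id>"
          for delta in items[1:]:             # deltas: SK == "QUOTE#<id>#DELTA#<ts>"
              quote.update(delta["changes"])  # apply only the fields that changed
          return quote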

  • Why Kafka Is so Fast? This is an awesome article you can really learn from. Lots of meaty details to steal/borrow. (There's a small producer-batching sketch at the end of this segment.)
    • Apache Kafka is optimized for throughput at the expense of latency and jitter, while preserving other desirable qualities, such as durability, strict record order, and at-least-once delivery semantics. 
    • Kafka [can] safely accumulate and distribute a very high number of records in a short amount of time.
    • Log-structured persistence. Kafka utilizes a segmented, append-only log, largely limiting itself to sequential I/O for both reads and writes, which is fast across a wide variety of storage media.  There is a wide misconception that disks are slow; however, the performance of storage media (particularly rotating media) is greatly dependent on access patterns. The performance of random I/O on a typical 7,200 RPM SATA disk is between three and four orders of magnitude slower when compared to sequential I/O.
    • Record batching.  Kafka clients and brokers will accumulate multiple records in a batch — for both reading and writing — before sending them over the network. Batching of records amortizes the overhead of the network round-trip, using larger packets and improving bandwidth efficiency.
    • Batch compression. Especially when using text-based formats such as JSON, the effects of compression can be quite pronounced, with compression ratios typically ranging from 5x to 7x. Furthermore, record batching is largely done as a client-side operation, which transfers the load onto the client and has a positive effect not only on the network bandwidth but also on the brokers’ disk I/O utilization.
    • Cheap consumers. Kafka doesn’t remove messages after they are consumed — instead, it independently tracks offsets at each consumer group level. The progression of offsets themselves is published on an internal Kafka topic __consumer_offsets. Again, being an append-only operation, this is fast. The contents of this topic are further reduced in the background (using Kafka’s compaction feature) to only retain the last known offsets for any given consumer group. Consumers in Kafka are ‘cheap’, insofar as they don’t mutate the log files
    • Unflushed buffered writes.  Kafka doesn’t actually call fsync when writing to the disk before acknowledging the write; the only requirement for an ACK is that the record has been written to the I/O buffer. This is a little known fact, but a crucial one: in fact, this is what actually makes Kafka perform as if it were an in-memory queue — because for all intents and purposes Kafka is a disk-backed in-memory queue (limited by the size of the buffer/pagecache).
    • Client-side optimisations. Kafka takes a different approach to client design. A significant amount of work is performed on the client before records get to the server. This includes the staging of records in an accumulator, hashing the record keys to arrive at the correct partition index, checksumming the records and the compression of the record batch. The client is aware of the cluster metadata and periodically refreshes this metadata to keep abreast of any changes to the broker topology. This lets the client make low-level forwarding decisions; rather than sending a record blindly to the cluster and relying on the latter to forward it to the appropriate broker node, a producer client will forward writes directly to partition masters. Similarly, consumer clients are able to make intelligent decisions when sourcing records, potentially using replicas that are geographically closer to the client when issuing read queries.
    • Zero-copy. Kafka uses a binary message format that is shared by the producer, the broker, and the consumer parties so that data chunks can flow end-to-end without modification, even if it’s compressed. Kafka solves this problem on Linux and UNIX systems by using Java’s NIO framework, specifically, the transferTo() method of a java.nio.channels.FileChannel. This method permits the transfer of bytes from a source channel to a sink channel without involving the application as a transfer intermediary. 
    • Avoiding the GC. The heavy use of channels, native buffers, and the page cache has one additional benefit — reducing the load on the garbage collector (GC). The real gains are in the reduction of jitter; by avoiding the GC, the brokers are less likely to experience a pause that may impact the client, extending the end-to-end propagation delay of records.
    • Stream parallelism. Concurrency is ingrained into its partitioning scheme and the operation of consumer groups, which is effectively a load-balancing mechanism within Kafka — distributing partition assignments approximately evenly among the individual consumer instances within the group. Compare this to a more traditional MQ: in an equivalent RabbitMQ setup, multiple concurrent consumers may read from a queue in a round-robin fashion, but in doing so they forfeit the notion of message ordering.
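    • Most of the batching and compression wins above are client configuration. A minimal producer sketch using kafka-python (my example, not the article's code; the broker address and topic name are assumptions):
      import json
      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers="localhost:9092",  # assumed broker address
          linger_ms=20,                        # wait up to 20 ms so batches can fill
          batch_size=64 * 1024,                # 64 KB batches amortize network round trips
          compression_type="gzip",             # whole batches compressed on the client
          acks=1,                              # leader-only ack; a latency/durability trade-off
          value_serializer=lambda v: json.dumps(v).encode("utf-8"),
      )

      for i in range(10_000):
          producer.send("events", {"id": i, "payload": "..."})  # assumed topic name
      producer.flush()  # block until accumulated batches are sent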

  • The internet is no longer the essential enabler of the tech economy. That title now belongs to the cloud (We Need to Talk About ‘Cloud Neutrality’). This is not true. It confuses the necessary with the sufficient. Remove every cloud tomorrow and companies could limp back to functionality within a year. Remove the internet and no cloud matters at all—ever. The cloud is a useful, recreatable tool. The internet is an irreplaceable platform that is so path dependent it may never be recreatable again.

  • RAM may not be as fast as you'd hoped. Memory Bandwidth Napkin Math. (There's a quick do-it-yourself benchmark sketch at the end of this segment.)
    • New 2020 Numbers Every Programmer Should Know
      • L1 Latency:    1 ns
      • L2 Latency:    2.5 ns
      • L3 Latency:    10 ns
      • RAM Latency:   50 ns
      • (per core)
      • L1 Bandwidth:  210 GB/s 
      • L2 Bandwidth:  80 GB/s
      • L3 Bandwidth:  60 GB/s
      • (whole system)
      • RAM Bandwidth: 45 GB/s 
    • RAM Performance
      • Upper Limit: 45 GB/s
      • Napkin Estimate: 5 GB/s
      • Lower Limit: 1 GB/s
    • Cache Performance — L1/L2/L3 (per core)
      • Upper Limit (w/ simd): 210 GB/s / 80 GB/s / 60 GB/s
      • Napkin Estimate: 25 GB/s / 15 GB/s / 9 GB/s
      • Lower Limit: 13 GB/s / 8 GB/s / 3.5 GB/s
    • Random access from RAM is slow. Catastrophically slow. Less than 1 GB/s slow for both int32. Random access from the cache is remarkably quick. It's comparable to sequential RAM performance. Let this sink in. Random access into the cache has comparable performance to sequential access from RAM. The drop off from sub-L1 16 KB to L2-sized 256 KB is 2x or less. I think this has profound implications.
    • Pointer chasing is 10 to 20 times slower. Friends don't let friends use linked lists. Please, think of the children cache.
    • Thinking about bytes-per-second or bytes-per-frame is another lens to look through.
    • staticfloat: The reason I find random access into cache having the same performance as sequential access as not that profound is because it falls out directly from the above scenario: sequential access into RAM _is_ random access of cache! The reason sequential access to RAM is fast is because the values are in cache due to having fetched an entire cache line; therefore randomly accessing those same cache buckets in a random order is equivalent (from this perspective).
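    • If you want a feel for the sequential-versus-random gap on your own machine, here's a crude sketch (my own; NumPy fancy indexing is only a rough proxy for true random access, and your numbers will differ from the article's):
      import time
      import numpy as np

      N = 64_000_000                      # ~256 MB of int32, much bigger than L3
      data = np.arange(N, dtype=np.int32)
      perm = np.random.permutation(N)     # a random visiting order

      t0 = time.perf_counter(); data.sum();       t1 = time.perf_counter()
      t2 = time.perf_counter(); data[perm].sum(); t3 = time.perf_counter()

      gb = data.nbytes / 1e9
      print(f"sequential: {gb / (t1 - t0):.1f} GB/s")
      print(f"random:     {gb / (t3 - t2):.1f} GB/s")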

  • Use a messaging system to tie all your components together? You'll want to read Forging SWIFT MT Payment Messages for fun and pr... research! It shows the kind of thinking you'll need to secure your system. The main recommendation is defense in depth (there's a generic message-signing sketch at the end of this segment): 
    • In the context of Message Queues, there is one particular control which I think is extremely valuable: The implementation of channel specific message signing! This, as demonstrated by SWIFT's LAU control, is a good way in which to ensure the authenticity of a message.
    • As discussed, LAU is - as far as I know at the time of writing - a SWIFT product / message partner specific control. However its concept is universal and could be implemented in many forms, two of which are:
    • Update your in-house applications to support message signing, natively;
    • Develop a middleware component which performs message signing on each system, locally.
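    • LAU itself is SWIFT-specific, but "channel specific message signing" in general is just a MAC over each message with a key shared by the two ends of the channel. A generic sketch (my own, not SWIFT's actual LAU algorithm; the key and message are placeholders):
      import hashlib
      import hmac

      CHANNEL_KEY = b"per-channel secret, provisioned out of band"  # placeholder

      def sign(payload: bytes) -> str:
          return hmac.new(CHANNEL_KEY, payload, hashlib.sha256).hexdigest()

      def verify(payload: bytes, signature: str) -> bool:
          # Constant-time comparison; the consumer drops/alerts on any mismatch.
          return hmac.compare_digest(sign(payload), signature)

      msg = b'{"type": "MT103", "amount": "1000.00"}'  # stand-in for a payment message
      sig = sign(msg)
      assert verify(msg, sig)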

  • Running your new fancy microprocessor on Windows? It may not like that. What was once considered enterprise grade will soon be just another Tuesday. The 64 Core Threadripper 3990X CPU Review: In The Midst Of Chaos, AMD Seeks Opportunity: From our multithreaded test data, there can only be two conclusions. One is to disable SMT, as it seems to get performance uplifts in most benchmarks, given that most benchmarks don’t understand what processor groups are. However, if you absolutely have to have SMT enabled, then don’t use normal Windows 10 Pro: use Pro for Workstations (or Enterprise) instead. At the end of the day, this is the catch in using hardware that's skirting the line of being enterprise-grade: it also skirts the line with triggering enterprise software licensing. Thankfully, workstation software that is outright licensed per core is still almost non-existent, unlike the server realm.

  • Lessons learnt using Serverless Lambda Functions (there's a small connection-reuse sketch at the end of this segment):  
    • For me, the real question is: Am I running a microfunction or a microservice?
    • This is where Serverless is still immature. Diagnosis is limited. Insights are only superficial.
    • That’s the key. While it is easy to quickly develop applications utilising serverless frameworks, not being able to run the same code elsewhere is a problem waiting to happen. Adopting clean architecture principles of abstracting away the core use cases of the application from the surrounding interfaces is the way to go.
    • But how does one efficiently manage resources such as socket connections? Should we create and close all HTTP and JDBC connections per request? That’s not efficient. The thing is, resource management is hard. The more resources a Lambda needs to fulfil a request the trickier it gets to manage those resources. 
    • With ECS and EKS maturing, it is not a bad idea to start with the ECS EC2 launch type, monitor the resource usage of an application over time, tune it, and then finally deploy it to an ECS Fargate launch type. Why not AWS Lambda? At runtime, both Fargate and Lambda have the same risks and opportunities.
    • Modelling an entire domain with multiple aggregates, entities and use cases and then shoe-horning those into a single lambda function is a very bad idea. One can indeed build a Lambda application which is essentially a collection of Lambda functions, but why would I do that? Why would I want to break up the 6 Java methods supported by my Service class into 6 individual lambda functions? Why wouldn’t I deploy the whole domain as a single microservice using a Docker container as opposed to deploying 6 different microfunctions?
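    • On the resource-management question, the usual compromise is to create expensive resources once per execution environment, outside the handler, so warm invocations reuse them rather than reconnecting per request. A sketch (mine, not the article's; the environment variable, table, and query are placeholders):
      import os

      import boto3
      import psycopg2  # assumes this function talks to a Postgres-compatible database

      # Created once per container at cold start, then reused by warm invocations.
      s3 = boto3.client("s3")
      db = psycopg2.connect(os.environ["DATABASE_URL"])

      def handler(event, context):
          # In production you'd also detect and replace a stale/broken connection.
          with db.cursor() as cur:
              cur.execute("SELECT count(*) FROM orders")
              (count,) = cur.fetchone()
          return {"statusCode": 200, "body": str(count)}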

  • What are the limits that warrant even considering microservices? james_s_tayler: We need to support an order of magnitude more daily active users than we currently do. I'm not exactly sure how close the current system would actually get to that, but my gut feel is it wouldn't hold up. It does OK as is, but only OK. It's a combination of three different problems working against us in concert. 1) compute layer is multitenant but the databases are single tenant (so one physical DB server can hold several hundred tenant databases with each customer having their own). 2) We're locked into some very old dependencies we cannot upgrade because upgrading one thing cascades into needing to upgrade everything. This holds us back from leveraging some benefits of more modern tech. 3) certain entities in the system have known limits whereby when a customer exceeds a certain threshold the performance on loading certain screens or reports becomes unacceptable. Most customers don't come near those limits but a few do. The few that do sometimes wind up blowing up a database server from time to time affecting other clients.

  • Migrating to CockroachDB
    • While I'm a huge fan of PostgreSQL, I wanted something that provided better out of the box support for high availability. Overall, I'm happy with how the effort turned out and with CockroachDB in general. Because it uses PostgreSQL's wire protocol, existing PostgreSQL drivers should work as-is.
    • CockroachDB belongs to a class of databases referred to as NewSQL. They're called this because they maintain most of what traditional SQL databases offer while adding scalability as a first class citizen. Cockroach not only provides ACID guarantees and SQL support (including joins), but also consistent data distribution via the raft consensus algorithm.
    • At scale, CockroachDB is supposed to perform well. But going from our single PostgreSQL instance to a 3 node cluster, performance is worse. This is true even for simple queries involving 1 node (say, getting a record by id or running locally with a single node). Still, a few extra milliseconds of latency is a fair price for better availability. But the performance characteristics aren't the same as a relational database. You won't have access to the same mature monitoring and diagnostic tools, and you'll have access to fewer levers. You should benchmark your most complicated and largest queries. (There's a bare-bones connection sketch below.)
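    • Because it speaks the Postgres wire protocol, connecting looks exactly like Postgres; you mostly just point the driver at CockroachDB's port. A bare-bones sketch with psycopg2 (host, credentials, and table are placeholders, and sslmode=disable is only sane for a local insecure dev cluster):
      import psycopg2

      conn = psycopg2.connect(
          host="localhost",
          port=26257,            # CockroachDB's default SQL port
          user="root",
          dbname="defaultdb",
          sslmode="disable",
      )
      with conn, conn.cursor() as cur:
          cur.execute("SELECT id, email FROM users WHERE id = %s", (42,))
          print(cur.fetchone())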

  • As a user of Usenet in the 1980s and a person who set up their own UUCP node, I can say Usenet was great because it was all we had and it worked perfectly for the networked world at the time. It delighted and depressed just like social networks do today. People were awesome and horrible in equal measure. It was not an age of heroes. Usenet lost because federated systems can't change and improve as fast as modern centralized systems. They lose on the features that attract and keep users. Which is why email, while not lost, isn't what it should be. Is there much in modern social networks that couldn't have been done in email? Not really. SMS is just the subject line of an email. But email was locked in by its protocol and architecture, whereas Twitter and Facebook are not. I wish it weren't true, and I'd love someone to figure it out, but we've had a constant stream of TikToks and not one new Usenet. Part II: Usenet — Let's Return to Public Spaces.

  • A dark web tycoon pleads guilty. But how was he caught?
    • Early on August 2 or 3, 2013, some of the users noticed “unknown Javascript” hidden in websites running on Freedom Hosting. Hours later, as panicked chatter about the new code began to spread, the sites all went down simultaneously. The code had attacked a Firefox vulnerability that could target and unmask Tor users—even those using it for legal purposes such as visiting Tor Mail—if they failed to update their software fast enough. 
    • While in control of Freedom Hosting, the agency then used malware that probably touched thousands of computers. The ACLU criticized the FBI for indiscriminately using the code like a “grenade.”
    • The FBI had found a way to break Tor’s anonymity protections, but the technical details of how it happened remain a mystery. 

  • Information security has to be proactive, not reactive. The war against space hackers: how the JPL works to secure its missions from nation-state adversaries. (There's a toy reachability sketch at the end of this segment.)
    • Viswanathan has focused largely on two key projects: the creation of a model of JPL’s ground data systems — all its heterogeneous networks, hosts, processes, applications, file servers, firewalls, etc. — and a reasoning engine on top of it. This is then queried programmatically. (Interesting technical side note: the query language is Datalog, a non-Turing-complete offshoot of venerable Prolog which has had a resurgence of late.)
    • With the model, ad hoc queries such as “could someone in the JPL cafeteria access mission-critical servers?” can be asked, and the reasoning engine will search out pathways, and itemize their services and configurations. Similarly, researchers can work backwards from attackers’ goals to construct “attack trees,” paths which attackers could use to conceivably reach their goal, and map those against the model, to identify mitigations to apply.
    • His other major project is to increase the JPL’s “cyber situational awareness” — in other words, instrumenting their systems to collect and analyze data, in real time, to detect attacks and other anomalous behavior. For instance, a spike in CPU usage might indicate a compromised server being used for cryptocurrency mining.
    • This is a departure from reactive security measures taken in the past (noticing a problem and then making a call). Nowadays, JPL watches for malicious and anomalous patterns, ranging from a brute-force attack indicated by many failed logins followed by a successful one, to machine-learning based detection of a command system operating outside its usual baseline parameters.
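    • The "could someone in the cafeteria reach a mission-critical server?" style of query is, at its core, a path search over a model of the network. A toy sketch (mine; the real thing is a Datalog reasoning engine over a much richer model, and this topology is made up):
      from collections import deque

      # Who can open a connection to whom (hypothetical topology).
      reachable_from = {
          "cafeteria-wifi": ["guest-vlan"],
          "guest-vlan": ["printer-net"],
          "office-lan": ["build-server", "mission-gateway"],
          "mission-gateway": ["mission-critical-server"],
      }

      def find_path(src, dst):
          # Breadth-first search for any pathway from src to dst in the model.
          queue, seen = deque([[src]]), {src}
          while queue:
              path = queue.popleft()
              if path[-1] == dst:
                  return path
              for nxt in reachable_from.get(path[-1], []):
                  if nxt not in seen:
                      seen.add(nxt)
                      queue.append(path + [nxt])
          return None  # no pathway found under this model

      print(find_path("cafeteria-wifi", "mission-critical-server"))  # -> None, good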

  • How badly must you hate something to call it out at the moment of your greatest triumph? Taika Waititi slams Apple’s MacBook keyboards after winning first Oscar. But he's not wrong.

  • Love these "how we evolved" over time articles. It shows once again how complex working systems start from simple working systems. Scaling the Edge: How Booking.com Powers a Global Application Delivery Network with HAProxy: During this talk I’m going to share with you how we scaled our load balancing infrastructure from a pair of load balancers to hundreds of load balancers handling billions of requests a day internally and externally.

Soft Stuff:

  • buraksezer/olric: Distributed, eventually consistent, in-memory key/value data store and cache. It can be used both as an embedded Go library and as a language-independent service. With Olric, you can instantly create a fast, scalable, shared pool of RAM across a cluster of computers.
  • Ravenbrook/mps: This is the Memory Pool System Kit -- a complete set of sources for using, modifying, and adapting the MPS.  This document will give you a very brief overview and tell you where to find more information.

Pub Stuff:

  • Kubernetes for Full-Stack Developers: The book is structured around a few central topics: Learning Kubernetes core concepts; Modernizing applications to work with containers; Containerizing applications; Deploying applications to Kubernetes; Managing cluster operations
  • Canopus: A scalable and massively parallel consensus protocol: The goal in Canopus is to achieve high throughput and scalability with respect to the number of participants. It achieves high throughput mainly by batching, and achieves scalability by parallelizing communication along a virtual overlay leaf only tree (LOT). Canopus trades off latency for throughput. It also trades off fault-tolerance for throughput.
  • A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions~ To overcome this obstacle, researchers at Texas A&M University, in collaboration with computer scientists at Intel Labs, have now developed a complete automated way of identifying the source of errors caused by software updates. Their algorithm, based on a specialized form of machine learning called deep learning, is not only turnkey, but also quick, finding performance bugs in a matter of a few hours instead of days...newer desktops and servers have hundreds of performance counters, making it virtually impossible to keep track of all of their statuses manually and then look for aberrant patterns that are indicative of a performance error. That is where Muzahid’s machine learning comes in. By using deep learning, the researchers were able to monitor data coming from a large number of the counters simultaneously by reducing the size of the data, which is similar to compressing a high-resolution image to a fraction of its original size by changing its format. In the lower dimensional data, their algorithm could then look for patterns that deviate from normal.
  • From models of galaxies to atoms, simple AI shortcuts speed up simulations by billions of times: The technique, called Deep Emulator Network Search (DENSE), relies on a general neural architecture search co-developed by Melody Guan, a computer scientist at Stanford University. It randomly inserts layers of computation between the networks’ input and output, and tests and trains the resulting wiring with the limited data. If an added layer enhances performance, it’s more likely to be included in future variations. Repeating the process improves the emulator. Guan says it’s “exciting” to see her work used “toward scientific discovery.” Muhammad Kasim, a physicist at the University of Oxford who led the study, which was posted on the preprint server arXiv in January, says his team built on Guan’s work because it balanced accuracy and efficiency.