Stuff The Internet Says On Scalability For January 28th, 2022

Never fear, HighScalability is here!


Think your software diagram is complex? This is a single cell modeled using X-ray, nuclear magnetic resonance (NMR), and cryo-electron microscopy datasets. Gael McGill

My Stuff:

  • Love this Stuff? I need your support on Patreon to keep this stuff going.
  • Know anyone who needs to fix their cloud-obliviousness? My book teaches them all they need to know about the cloud. Explain the Cloud Like I'm 10. It has 364 mostly 5 star reviews on Amazon. Here's a 100% antibody free review:
  • Do you like Zone 2, Zone 5, interval, reHIT, or HIIT workouts? I made an app for that. Max reHIT Workout. I’m not just the programmer, I’m a client. I use it 4 times a week and if you want to retrain your mitochondria, get fit, get healthy, and live longer, I think you’ll like it too.

Number Stuff:

  • 1 septillion: stars in the universe. 40 quintillion black holes. 10 septillion planets. Just a tad under Apple’s yearly revenue.
  • $51.7 billion: Microsoft revenue, cloud revenues up 46%.
  • $123.9 billion. Apple’s revenue for the quarter ending Dec. 25. Up 11% YoY. Services come in at $19.52 billion, up 24% YoY.
  • 5%: of total music streams are generated by the top 200 new tracks. Older music accounts for the rest. Old songs represent 70% of the US music market and their share is growing at the expense of new music. So when you see music catalogs being acquired, this is why.
  • $60 billion: paid by Apple to App Store developers in 2021. So do you really need to make everything digital so much harder with a 30% tax?
  • 3.8 trillion: Hours Spent On Mobile Apps During 2021. Up 30% from 2019.
  • 77TB: data lost to backup error. Remember, backups are really all about the restores.
  • $3.5 Billion: lifetime earnings of creators on Patreon. $1.5B in 2021, representing a 50% uptick in annual earnings when compared to 2020.
  • 2 billion: monthly users for Instagram.
  • $1 billion: in revenue per year by 8 mobile games in 2021.
  • 45 million: global population of developers by 2030, up from 26.9 million in 2021
  • 90%: of the O’Reilly respondents indicated that their organizations are using the cloud. Up 2%. AWS (62%), Microsoft Azure (48%), and Google Cloud (33%). Amazon is down from 67%. 20% plan to migrate all of their applications. 47% said that their organizations are pursuing a cloud first strategy.
  • 366 million: reddit posts in 2021. 2.3 billion comments. 46 billion up votes.
  • 60 million: EC2 launches each day. Double 2019. IAM handles half a billion API calls per second using a hierarchical edge cache. Over 150M Lambda invocations per minute. Over 200M API Gateway calls per minute. Over 275M ElastiCache hits per minute.
  • $3.6M/minute: Shopify peak sales on Black Friday hosted on GCP. 79% mobile. 30TB/min of egress traffic across their infrastructure.
  • ~14 petabytes: of data across ~18,000 servers in Netflix’s EVCache.
  • 52: SpaceX launches planned for 2022. 31 last year. 26 in 2020.
  • 2 Tbps: DDoS attack blocked by Cloudflare. Network-layer DDoS attacks increased by 44%
  • 78%: of the web is powered by PHP.

Quotable Stuff:

  • satyrnein: using microservices was like getting drunk: a way to briefly push all your problems out of your mind and just focus on what's in front of you. But your problems didn't really go away, and in fact you just made them worse
  • @houlihan_rick: Non-relational data is a term invented by marketing people to explain to other marketing people how #NoSQL is different from RDBMS. All data is relational. Developers know this. What developers need to know is how to model those relationships efficiently in NoSQL.
  • @muratdemirbas: In distributed systems, even when you forget the question, quorums are the answer.
  • @picocreator: At http://uilicious.com - we do this with multi provider. Our infra, is generally layered according to the services. So our first layer (the main API) which is state heavy, is unfortunately an active/backup setup. The layers below, which are nearly stateless is active/active
  • @bryson3gps: The DynamoDB Global Table. We’ll deploy an active application stack in each of the desired regions. Clients are routed by whichever stack is closest (Route53). DynamoDB replicates the data cross-region. It’s a solid pattern that’s serving us well.
  • wim: I'm not saying the cloud doesn't have a place, but I think 99% of the long tail of services don't even need to do migrations at 3am. And running a MySQL instance on a dedicated server with humongous amounts of RAM and speedy NVME drives for $100/month or so is not a bad deal.
  • lewisjoe: But what is really happening in the industry is that, the new age DevOps engineers start their learning with containers and kubernetes as the base truth - and then are hired based on their experience around that ecosystem. This inadvertently leads to an industry full of kubernetes experts who nail every service with k8s hammer and then drive insane amounts of cloud infra bills. I miss the old era cloud where the offerings and the ecosystem were friendly to indie devs as much as they were for BigCorps.
  • 0xbadcafebee: You throw bodies at it. A small bunch of people will be overworked, stressed, constantly fighting fires and struggling to fight technical debt, implement features, and keep the thing afloat. Production is always a hair away from falling over but luck and grit keeps it running. To the team it's a nightmare, to the business everything is fine.
  • @QuinnyPig: This may surprise some people, but my primary concern with  @awscloud  service pricing is how hard it is to predict, not that they're too expensive. I'd even argue a couple services are priced too low.
  • Lee Atchison: You can’t solve scaling and availability by code. You can’t solve it by just wishing it away. It’s a combination of developers and management doing the right things, and putting processes and procedures in place in order to do the right things. All those things together have to work as a cohesive whole, in order for you to be able to build a system that’s highly available and still be able to scale.
  • David Barnard~ The people who are going to pay a lifetime subscription are going to be the ones who use your product for years and years…if you’re going to pick off some of your best highest intent users you’re probably significantly short changing your true lifetime value.
  • graveyard hashing: artificially increasing the number of tombstones placed in an array until they occupy about half the free spots. These tombstones then reserve spaces that can be used for future insertions
  • @QuinnyPig: When you're building something, your spend is ALL dev environments. That narrative is sticky. Once you start scaling, you still think of dev environments as "expensive" despite the fact that it's now sub-5% of your  @awscloud bill.
  • bottle_roket: I would say purely from a coding perspective the workload is typical ~40ish hours a week. The stress comes from the biannual performance reviews that grade you explicitly across 4 axis. If you got your projects delivered on time with high quality code that’s just 1 axis. What did you do to improve the codebase? What did you do to help drive the mission of the team? How many code reviews did you do(they count)? What did you do to improve the team culture? I think these things are all important, but everywhere else I have worked a lot of these are more implicit. At FB you need to have bullet points and evidence of these contributions every 6 months to get a satisfactory rating. Couple this typical giant corporation red tape (legal, marketing sign off, metrics reviews) to getting anything released.
  • Ben Adam: When you operate at this type of scale [Amazon], centralization is the enemy of efficiency. This is a paradox. Being efficient on a macro level requires being (very) inefficient at the micro level. For example, almost every organization will build their own tooling to effectively solve the same problems but tailored to their specific use cases. Every org (probably) has its own forecasting system, way of publishing content to the website, etc. A good example was when I joined, I wanted to see any design systems for internal tools. It turned out there were 56! The most surprising thing I encountered when joining was how manual most processes are. It blew my mind how many business critical processes were managed with excel spreadsheets being shared via email chains. It is incredible how flexible and effective Excel is for such a wide variety of use-cases.
  • @TAndersen_nSCIr: Your eyes have much wider brightness range than any video camera ever made, and many many times better than the $1 camera on a Tesla.  That is not a representation of what human eyes would see. It's a cheap video camera.
  • Samuel Gershman: The brain is evolution’s solution to the twin problems of limited data and limited computation. Our mind consists of multiple systems for learning and decision making that only exchange limited amounts of information with one another.
  • @akshatvig: I have used Power of two random choices for load balancing in multiple systems. It works as it uses load info to pick a host, & prevents herd behavior. Reduces cost, latency & SPOF.
  • Orbital Index: JWST’s deployment process involves 50 deployments and 344 single-points of failure. As Thomas Zurbuchen said, “Those who are not worried or even terrified about this are not understanding what we are trying to do.”
  • Springboard: the secret history of the first real smartphone: Scrappy only gets you so far.
  • dagw: I work in a very boring industry very far away from SV and spinning up and down machines for doing calculations and analysis is a godsend. I have a 'cluster' of 6 machines with 128 GB of RAM each set up and ready that I can start up with a script and only pay for the less than 200 hours a year I actually need them. Sometimes however I need 256 GB of RAM so I just change a parameter in my script and, magic, I have 256 GB of RAM. For other workloads the optimal setup might be 100 1 core machines with 2 GB each. So I type some commands, and now I have that instead. And since it's my 'private' cluster I don't have to worry about queuing my jobs and waiting for the machines to be free. There is no way my department would have bought me those machines as physical hardware.
  • @JoeEmison: Today, we are live on @vercel, having switched from AWS S3+Cloudfront+Lambda@edge. Ultimately, Vercel offers a much better overall experience for both developers and users. (Still using AppSync+Cognito+Dynamo+Lambda for back end). The trickiest part with any hosting solution for us is that (a) we run multiple React applications, and (b) we have a monorepo. We had to change the way we do a number of things, but being able to run Next.js v10+ and use other great features of Vercel made it an easy call.
  • hintymad: I'd like to remind everyone about Uber's experience: no EC2-like functionality until at least 2018, probably even now. Teams would negotiate with CTO for more machines. Uber's container-based solution didn't support persistent volumes for years. Uber's distributed database was based on friendfeed's design and was notoriously harder to use than DynamoDB or Cassandra. Uber's engineers couldn't provision Cassandra instances via API. They had to fill in a 10-pager to justify their use cases. Uber's on-rack router broke back in 2017 and the networking team didn't know about it because their dashboard was not properly set up. Uber tried but failed to build anything even closer to S3. Uber's HDFS cluster was grossly inefficient and expensive. That is, Uber's productivity sucked because they didn't have the out-of-box flexibility offered by cloud.
  • Tbray:  estimates of global IT spending are north of $4T. That math says that 95% of IT isn’t on the cloud yet. Will it all move there? No, but it feels inevitable that the cloud revenue potential is at least 10× today. And where is that 10×? I’ll tell you where it isn’t: In the kind of startup and cloud-native scenarios that led the charge onto the cloud over the last fifteen years. It’s in Establishment IT
  • @kennwhite: I had several former colleagues reach out to me today about the outage yesterday and several joked about the "AWS guy looking for a job". This is something that's widely misunderstood. World class engineering teams embrace blame-free postmortems (COEs in Amazon parlance)
  • @Obdurodon: "Stateless" is just another way of saying you've left maintenance of state to someone more competent.
  • dijit: The truth is somewhere in the middle. I’m a sysadmin, I know hardware. I think it’s a complete myth that hardware is hard: especially compared with the irreducible complexity that is AWS. But: I find myself coming back to the cloud. Why? It costs more and you have less control. Scale up is not as important as it seems and the 10x cost difference would mean scale is not a factor either. But, in my experience, not dealing with an IT department is the main reason.
  • Josef Cruz: Young developers are inexperienced, cheaper and you can “persuade them more quickly to implement even the stupidest ideas without thinking about them.”
  • Clive Thompson: It’s a coveted device, with models costing as much as $180 million, that is used in making microchip features as tiny as 13 nanometers at a rapid clip. That level of precision is crucial if you’re Intel or TSMC and want to manufacture the world’s fastest cutting-edge computer processors. The final machine, assembled at ASML’s headquarters in the Netherlands, is the size of a small bus and filled with 100,000 tiny, coordinated mechanisms, including a system that generates a specific wavelength of high-energy ultraviolet light by blasting molten drops of tin with a laser 50,000 times a second. It takes four 747s to ship one to a customer.
  • @0xabad1dea: robot: why are children like that? human: imagine if it took 20 years to build your processor, component by component… but it was trying to execute your operating system the entire time. robot: ada lovelace christ
  • @tef_ebooks: i know i'm old in internet terms because when people tell me about some sort of decentralized platform i can be like, "freenet? oh. so diaspora? oh. you mean salmon? oh." and keep going with every long forgotten and subsequently reimplemented protocol over the last 20 years
  • Gabriel Orozco: My sculptures derive from a concrete experience, from an encounter with something in the world that interests me and with which I establish a relationship. A dialogue. The materials I use vary in relation to the events that affect my life.
  • wpietri: Yeah, "move fast and break things" is a much better ethos for low-stakes software than high-stakes hardware.
  • @MarkSailes3: From cold to 300 requests/sec on #AWSLambda with #Java and #GraalVM. All the code, IAC (CDK) and load test scripts. 650ms (p99) cold starts, <5ms (p50) warm starts.
  • @rakyll: At some point, all the optimizations turn into (1) how to colocate relevant data, (2) how to colocate relevant workloads, (3) how to colocate data with those workloads if necessary.
  • @physicsJ: How do we orbit the Milky Way? We move 230 km every second in orbit around the center (less than 0.1% of light speed). It takes 230 million years to complete 1 Galactic Year at this rate and we bob up/down during it as we're attracted to other stars (bobbing exaggerated here)
  • @MissAmyTobey: "that architecture is fine I guess, but it's so... boooring" "yes. I am very proud of that aspect in particular"
  • marknca: This is why containers have skyrocketed in popularity. Especially compared to serverless designs over the past three years. I see a lot of container-based solutions that would be better as serverless designs. Better in that they would be more efficient, less costly, and scale easier. Why do these container-based solutions keep popping up? Containers hit the sweet spot. They are familiar enough but push the envelope in interesting ways. They allow builders to be more productive using modern development methods. At the same time, they don’t require a new mental model.
  • TuringTest: Under those constraints, you're better off with an anytime algorithm with a good heuristic, which gives a viable good-enough suboptimal solution fast, and which can be left running while the expert analyzes the best solution found so far, to decide if it is acceptable.
  • kmeisthax: The problem is that much of the demand for cheap, shelf-stable backup media has gone away over the last decade or two. Most backup jobs either go to cloud or to disk, not tape. And because of that, tape drives are more expensive than ever, which takes away much of the benefit of the cheap media…Now, let's say you're not a movie studio with a massive digital archival problem. You just want to backup 5 terabytes. Unfortunately, 5 terabytes is so little that tape vendors have forgotten how to count that low. It's far cheaper (not to mention, more performant) to just buy a bunch of disk drives and migrate data between them. Or just pay Amazon to do it. Which is what everyone wound up doing.
  • bpodgursky: I talked to a guy yesterday who was looking forward to Starlink GA so he could get reliable uplink from an arduino sensor array on a boat used to service aquaculture farms in the middle of the ocean. I'm pretty optimistic about how much of the world this is going to connect.
  • @kellabyte: I feel like the document data model is dying a slow death. SQL is having a resurgence in a big way tackling ease of scalability & replication & analytics Mongo is pushing a data model that doesn’t play very nicely with rest of the industry. CosmosDB is struggling w/ high cost.
  • @jeremy_daly: 24. Plan work around your life, not life around your work - I worked 70+ hr weeks in my 20s & early 30s. While I'm sure there's a correlation between that & my "successes", there's also a lot of regret for missed time with family & friends.
  • @softwarejameson: We recently found that the new 2021 M1 MacBooks cut our Android build times in half. So for a team of 9, $32k of laptops will actually save $100k in productivity over 2022. The break-even point happens at 3 months.
  • @slightlylate: My default response to "should we use React?" is "are you social network with a successful product, long sessions, a staffed performance infra team, clear perf go/no-go latency cuts, and the ability to write everything 3+ times?" This makes people upset.
  • Yo-Yo Ma: All tradition is the result of successful innovation.
  • elg: We're a B2B/Enterprise SaaS and most tenants require that we erase all their data at the end of the contract. Some require customer-managed encryption keys. The only way to meet this requirement is to have every tenant isolated in their own database (and their own S3 bucket etc). If data is mixed, when one tenant leaves you must go through all copies of all backups, purge their rows, then re-save the cleaned up backups. Nearly impossible in practice.
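
The "power of two random choices" trick @akshatvig mentions above is easy to try out. Here is a minimal sketch, assuming load is just a count of in-flight requests per host; the host names and the `pick_host` and `simulate` helpers are made up for illustration and are not from any of the systems quoted.

```python
import random


def pick_host(loads, rng=random):
    """Pick a host using the power-of-two-choices rule.

    loads: dict mapping host name -> current number of in-flight requests.
    Two distinct hosts are sampled at random and the less loaded one wins.
    """
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b


def simulate(num_hosts=10, num_requests=10_000, seed=1):
    """Assign requests one by one and return the final load per host."""
    rng = random.Random(seed)
    loads = {f"host-{i}": 0 for i in range(num_hosts)}
    for _ in range(num_requests):
        loads[pick_host(loads, rng)] += 1
    return loads
```

Because each request only compares two hosts, nobody needs a global view of load, and because the two candidates are random, a briefly idle host does not get stampeded by every client at once.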

Useful Stuff:

  • When you've made $5 Billion you can build one heck of an architecture. Or does it go the other way? How Pokémon GO scales to millions of requests?:
    • During these events, transactions go from 400K per second to close to a million in a matter of minutes as soon as regions come online.
    • You are correct, 5-10TB of data per day gets generated and we store all of it in BigQuery and BigTable.
    • There are lots of services we scale, but Google Kubernetes Engine and Cloud Spanner are the main ones. Our front end service is hosted on GKE and it's pretty easy to scale the nodes there — Google Cloud provides us with all the tools we need to manage our Kubernetes cluster
    • At any given time, we have about 5000 Spanner nodes handling traffic. We also have thousands of Kubernetes nodes running specifically for Pokémon GO, plus the GKE nodes running the various microservices that help augment the game experience. All of them work together to support millions of players playing all across the world at a given moment. And unlike other massively multiplayer online games, all of our players share a single “realm,” so they can always interact with one another and share the same game state.
    • When a user catches a Pokémon, we receive that request via Cloud Load Balancing. All static media, which is stored in Cloud Storage, is downloaded to the phone on the first start of the app. We also have Cloud CDN enabled at Cloud Load Balancing level to cache and serve this content. First, the traffic from the user's phone reaches Global Load Balancer which then sends the request to our NGINX reverse proxy. The reverse proxy then sends this traffic to our front-end game service. The third pod in the cluster is the Spatial Query Backend. This service keeps a cache that is sharded by location. This cache and service then decides which Pokémon is shown on the map, what gyms and PokéStops are around you, the time zone you’re in, and basically any other feature that is location based. The way I like to think about it is the frontend manages the player and their interaction with the game, while the spatial query backend handles the map. The front end retrieves information from spatial query backend jobs to send back to the user.
    • We also write the protobuf representation of each user action into Bigtable for logging and tracking data with strict retention policies. We also publish the message from the frontend to a Pub/Sub topic that is used for the analysis pipeline.
    • Everything on our servers is deterministic. Therefore, even if multiple players are on different machines, but in the same physical location, all the inputs would be the same and the same Pokémon would be returned to both users. There’s a lot of caching and timing involved however, particularly for events. It’s very important that all the servers are in sync with settings changes and event timings in order for all of our players to feel like they are part of a shared world.
    • We also have some streaming jobs for cheat detection, looking for and responding to improper player signals. Also for setting up PokéStops and gyms and habitat information all over the world we take in information from various sources, like OpenStreetMap, the US Geological Survey, and WayFarer, where we crowdsource our POI data, and combine them together to build a living map of the world.
    • The only thing that the Niantic SRE team needs to ensure is that they have the right quota for these events, and since these are managed services, there is much less operational overhead for the Niantic team.
    • We use Google Cloud Monitoring which comes built in, to search through logs, build dashboards, and fire an alert if something goes critical.
    • Also Building Uber’s Fulfillment Platform for Planet-Scale using Google Cloud Spanner.
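
One way to picture the "everything on our servers is deterministic" point above: if the result is a pure function of location and time, any server computes the same answer without coordination. The sketch below is not Niantic's code. The species table, the `world_seed`, the cell ids, and the 15-minute bucket are all assumptions for illustration.

```python
import hashlib

# Hypothetical spawn table, for illustration only.
SPECIES = ["Pidgey", "Rattata", "Zubat", "Magikarp"]


def time_bucket(epoch_seconds: int, bucket_seconds: int = 900) -> int:
    """Collapse a timestamp into a coarse bucket so nearby requests agree."""
    return epoch_seconds // bucket_seconds


def spawn_for(cell_id: str, bucket: int, world_seed: str = "demo-seed") -> str:
    """Return the same species for the same (cell, time bucket) on any machine.

    sha256 is used instead of Python's built-in hash(), which is salted per
    process and would give different answers on different servers.
    """
    key = f"{world_seed}:{cell_id}:{bucket}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(SPECIES)
    return SPECIES[index]
```

Two players standing in the same cell during the same bucket get the same answer regardless of which server handles them, which is also why the quote stresses keeping settings and event timings in sync: a different seed or bucket size on one server would split the shared world.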

  • Over the years Twitter has been a great source of architecture ideas. We’re seeing the end of an era as Twitter moves, at least partially to GCP. Why not all the way? It is not known. Processing billions of events in real time at Twitter:
    • At Twitter, we process approximately 400 billion events in real time and generate petabyte (PB) scale data every day
    • As our data scale is growing fast, we face high demands to reduce streaming latency and provide higher accuracy on data processing, as well as real-time data serving.
    • The new architecture is built on both Twitter Data Center services and Google Cloud Platform. On-premise, we built preprocessing and relay event processing which converts Kafka topic events to pubsub topic events with at-least-once semantics. On Google Cloud, we used streaming Dataflow jobs to apply deduping and then perform real-time aggregation and sink data into BigTable. The whole system can stream millions of events per second with a low latency of up to ~10s and can scale up with high traffic in both our on-prem and cloud streaming systems. We use Cloud Pubsub as a message buffer while guaranteeing no data loss throughout our on-prem streaming system. This is followed by deduping to achieve near exactly-once processing.
    • On Google Cloud, we use a Twitter internal framework built on Google Dataflow for real time aggregation. The Dataflow workers handle deduping and aggregation in real time. The deduping procedure accuracy is dependent on the timed window. We tuned our system to achieve best effort deduping in the deduping window. We proved the high deduping accuracy by simultaneously writing data into BigQuery and querying continuously on the percentage of duplicates, explained below. Lastly, the aggregated counts with query keys are written to Bigtable.
    • This new architecture saves the cost to build the batch pipelines, and for real-time pipelines, we are able to achieve higher aggregation accuracy and stable low latency. Also, we do not need to maintain different real-time event aggregations in multiple data centers.
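
To see why deduping accuracy "is dependent on the timed window", here is a toy version of at-least-once delivery followed by windowed dedup and counting. It is plain Python, not the Dataflow API or Twitter's framework. The event tuple shape, the `dedupe_and_count` name, and the 60 second window are assumptions.

```python
from collections import Counter


def dedupe_and_count(events, window_seconds=60):
    """Best-effort dedup followed by per-key counting.

    events: iterable of (event_id, key, timestamp), sorted by timestamp.
    An event is dropped when the same event_id was accepted within the last
    window_seconds. Returns (counts per key, number of duplicates dropped).
    """
    accepted_at = {}  # event_id -> timestamp of the last accepted copy
    counts = Counter()
    dropped = 0
    for event_id, key, ts in events:
        prev = accepted_at.get(event_id)
        if prev is not None and ts - prev <= window_seconds:
            dropped += 1  # redelivery inside the window: ignore it
            continue
        accepted_at[event_id] = ts
        counts[key] += 1
    return counts, dropped
```

A redelivery that shows up after the window has passed is counted again, which is the "near exactly-once" caveat in the quote. A real pipeline would also evict old ids from `accepted_at` so memory stays bounded.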

  • To multi-region or not to multi-region? Only if you're Netflix.

  • Wisdom dump. 42 things I learned from building a production database [at Facebook]:
    • In 2017, I went to Facebook on a sabbatical from my faculty position at Yale. I created a team to build a storage system called Delos at the bottom of the Facebook stack (think of it as Facebook’s version of Chubby). We hit production with a 3-person team in less than a year; and subsequently scaled the team to 30+ engineers spanning multiple sub-teams. In the four years that I led the team (until Spring 2021), we did not experience a single severe outage
    • Lots of good stuff, but I really liked these more non-obvious lessons:
    • Make your project robust to re-orgs. A company management hierarchy is inherently fragile (a tree is a 1-connected graph, after all); socialize the project continuously with managers who might take over in the future. Do whatever it takes to make sure that manager churn does not result in unfair career outcomes for ICs.
    • Do not compete on raw performance or efficiency with other teams; this will escalate into an arms race where both teams waste time optimizing their systems for point workloads, generating apples-to-oranges comparisons, etc. Compete on fundamental design characteristics.
    • For storage systems, bias heavily in the beginning towards consistency and durability rather than availability; these are harder to measure and harder to fix if broken. Because availability is easier to measure, there will be external pressure to prioritize it first; push back.

  • Videos from the Next.js Conf 2021 are now available.

  • Who doesn’t want to lower their development costs? How Slack used Infrastructure Observability for Changing the Spend Curve
    • We modeled how users were using CI and crafted a strategy to increase overall throughput and reduce errors at peak usage through parallelization. We experimented with configurations of instances to understand workload performance and resiliency using end to end metrics. This allowed for an overall increase in total fleet capacity (through oversubscription of executors per instance) while reducing costs by approximately 70% and decreased error rates by approximately 50% (compared to the previous executor and instance type metrics).
    • We drove a magnitude change in our CI infrastructure spend by using three ideas:
    • Adaptive capacity to decrease the cost of each test by changing the infrastructure runtime.
    • Circuit breakers to decrease the number of tests by changing the infrastructure workflow.
    • Pipeline changes to decrease the number of tests by changing our user workflows.
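
The post does not say how Slack built its circuit breakers, so what follows is only a generic sketch of the pattern: stop starting new work after repeated failures, then let a trial job through once a cool-down has passed. The class name, the thresholds, and the injectable `clock` are all assumptions.

```python
import time


class CircuitBreaker:
    """Open after max_failures consecutive failures; retry after reset_seconds."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a new job may be started right now."""
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a trial job through.
        return self.clock() - self.opened_at >= self.reset_seconds

    def record(self, success):
        """Report the outcome of a job that was allowed to run."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
```

In a CI setting the idea would be to check `allow()` before queueing a test run against a dependency that is known to be failing, so the fleet does not burn instance hours on tests that cannot pass.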

  • Avoiding fallback in distributed systems [at AWS]: we favor code paths that are exercised in production continuously rather than rarely. We focus on improving the availability of our primary systems, by using patterns like pushing data to systems that need it instead of pulling and risking failure of a remote call at a critical time. Finally, we watch out for subtle behavior in our code that could flip it into a fallback-like mode of operation, such as by performing too many retries. If fallback is essential in a system, we exercise it as often as possible in production, so that fallback behaves just as reliably and predictably as the primary mode of operating.
    • MalnarThe: We manage a large SaaS on AWS. This outage affected a single Kafka broker that had to be manually replaced. Our customer didn't notice a blip. It's not hard to do multi-AZ, if you have the scale to not make it extra expensive.
    • Frennzy: "We look at multi-AZ periodically. It's really expensive. We are already under a lot of pressure to hit margin targets; you add in multi-AZ there's no way to do it. And forget multi-cloud. The dev cost is high, and schlepping data between providers isn't cheap either."
    • hvgk: This is one that gets me every time. oh we’ll just roll out our shit in middle-of-nowhere-west-1c if it all goes to crap. First time someone does that DR drill they find out there’s a core service missing they have built their entire empire on. Also there’s contention whenever that happens. Last AWS outage we experienced, entire availability zones had no capacity available. If they did it took forever to spin anything up and migrate volume snapshots over as well as everyone else was doing the same thing at the same time.

  • Videos for GOTO 2021 are now available.

  • Bigger players, with substantial revenue producing infrastructure, will make changes incrementally out of necessity. The result may look hybrid, but the thought process is value driven. JPMorgan Chase & Co:
    • Part of their modernization strategy is migrating some data oriented processes to the cloud, because presumably it’s easier to extract value from data in the cloud given data gravity and all the available service pipelines.
    • Yet at the same time they’re spending $2 billion on brand-new data centers.
    • And they want to go multi-cloud, running loads in AWS, Microsoft, and Google. The cloud will primarily be used for big data on risk, fraud, marketing, capabilities, offers, customer satisfaction, dealing with errors and complaints, and prospecting. Since clouds aren’t lowering their prices, using them to leverage each other is a good idea.
    • Despite what some people say, moving to the cloud can save money compared to on-premise, but savings is not the big win. They run their Card system that manages 60 million accounts on a mainframe that runs in an old data center. They project moving it to the cloud would save $30 million or $40 million a year. But they don’t want to move it to the cloud just to save money, but so all the data can be streamed into its risk, marketing, fraud, real-time systems so magic ML sauce can be applied.
    • Another reason for the cloud: continuous deployment. The mainframe was updated only 4 times a year. In the cloud it’s easy to continuously improve, deploy, and test your system.

  • The great thing about caching is you never know why your data isn’t updating. But if you can get beyond that, here’s a nice series on caching. Serverless Caching Strategies : API layer, DynamoDB DAX, Lambda runtime, App Sync, and CDN.

  • But Cloudflare is a cloud. It’s in the name. You don’t need the cloud.

  • Examining the covidtests.gov architecture. Having ordered a few kits it was interesting to learn they went serverless. No doubt much easier than whatever they were doing before...and the order process was quick and painless. Well done.
    • (Edit: A source tells me the USPS did in fact design and implement the cloud architecture of the site.). Contrast this approach with 2013 and the era of the HealthCare.gov launch when the state of the art in government was to administer servers in private data centers with numerous moving parts and failure modes without the resources and experience to handle intense public demand and traffic.

  • I hereby give my permission to Amazon to add 50 msec to every page render if they would just STOP showing me books I’ve already read! Please. Pretty please with optimized javascript on top. @amilajack: At amazon, every 100ms of latency costed us 1% in sales. My team's goal was to cut Amazon.com's load time in half. Here's what we did:
    • Load all content above the fold first. We saw a 500ms win when migrating features and content to load ATF, increasing sales by 5%.
    • Server side render everything. http://amazon.com is server side rendered and has no client side rendering framework. The potential latency hit didn't justify it. We were stuck with jQuery 1.6.4. SSR React wasn't fast enough for us. This blew my mind.
    • Track critical performance metrics. We tracked: time to load content above the fold, time to first suggestion, time to visual completeness, time to load ads, time to page layout ready, and many more…
    • A/B test all critical changes. If any release regressed our sales metrics, we'd revert. Simple as that.
    • Drop unused polyfills
    • Serve browser-specific bundles. We detected browsers server side (via UA sniffing) and served bundles optimized for that browser.
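
The last item, server-side UA sniffing to pick a bundle, can be sketched in a few lines. This is only an illustration of the idea, not Amazon's code: the bundle file names, the version thresholds, and the `pick_bundle` function are all made up for the example.

```python
import re

# Hypothetical bundle names; the tweet thread does not name the real ones.
BUNDLES = {
    "modern": "app.modern.js",  # assumed to ship without polyfills
    "legacy": "app.legacy.js",  # assumed to include polyfills for older browsers
}

# Hypothetical minimum major versions that count as "modern".
MODERN_MIN_VERSION = {"Chrome": 80, "Firefox": 75}

# Matches tokens such as "Chrome/97" or "Firefox/60" in a User-Agent string.
UA_PATTERN = re.compile(r"(Chrome|Firefox)/(\d+)")


def pick_bundle(user_agent: str) -> str:
    """Choose a bundle from the User-Agent header, defaulting to the legacy one."""
    match = UA_PATTERN.search(user_agent)
    if match:
        browser, major = match.group(1), int(match.group(2))
        if major >= MODERN_MIN_VERSION[browser]:
            return BUNDLES["modern"]
    # Unknown or old browsers get the safe bundle with polyfills.
    return BUNDLES["legacy"]
```

Falling back to the legacy bundle for anything unrecognized is the conservative choice: a mis-sniffed browser pays some extra bytes instead of getting a broken page.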

  • Algorithms everywhere.
    • Why This Zig-Zag Coast Guard Search Pattern is Actually Genius. Searching for something in water is different from searching on land because water moves in different directions and can change flow direction over time. So they use a pattern called “Victor Sierra” to perform a sector search. It's a circular pattern of triangles around a point. The pattern allows them to find the drift of the water while starting searching immediately. They also use expanded square search, parallel search, creep search, barrier search, track line search, and shore line search patterns.
    • An E. coli biocomputer solves a maze by sharing the work: rather than engineering a single type of cell to do all the work, they design multiple types of cells, each with different functions, to get the job done. Working in concert, these engineered microbes might be able to “compute and solve problems more like multicellular networks in the wild.”

  • Videos from the J-Fall 2021 conference are now available.

  • Roblox experienced a 73 hour service outage causing over 50 million players to have to do something else. Roblox Return to Service 10/28-10/31 2021.
    • Roblox’s core infrastructure runs in Roblox data centers. We deploy and manage our own hardware, as well as our own compute, storage, and networking systems on top of that hardware. The scale of our deployment is significant, with over 18,000 servers and 170,000 containers…to run thousands of servers across multiple sites, we leverage a technology suite commonly known as the “HashiStack.” Nomad…but the caching system, which regularly handles 1B requests-per-second across its multiple layers during regular system operation
    • Much heroic debugging and trial of different solutions that did not work, until…
    • We disabled the streaming feature for all Consul systems, including the traffic routing nodes. The config change finished propagating at 15:51, at which time the 50th percentile for Consul KV writes lowered to 300ms. We finally had a breakthrough. Why was streaming an issue? HashiCorp explained that, while streaming was overall more efficient, it used fewer concurrency control elements (Go channels) in its implementation than long polling. Under very high load – specifically, both a very high read load and a very high write load – the design of streaming exacerbates the amount of contention on a single Go channel, which causes blocking during writes, making it significantly less efficient.
    • But there was more…BoltDB tracks these free pages in a structure called its “freelist.” Typically, write latency is not meaningfully impacted by the time it takes to update the freelist, but Roblox’s workload exposed a pathological performance issue in BoltDB that made freelist maintenance extremely expensive.
    • It had been 54 hours since the start of the outage. With streaming disabled and a process in place to prevent slow leaders from staying elected, Consul was now consistently stable. The team was ready to focus on a return to service.
    • An interesting problem after a cold restart is the repopulation of caches…With cold caches and a system we were still uncertain about, we did not want a flood of traffic that could potentially put the system back into an unstable state. To avoid a flood, we used DNS steering to manage the number of players who could access Roblox. This allowed us to let in a certain percentage of randomly selected players while others continued to be redirected to our static maintenance page.
    • Morgawr: Usually the way it works is so that we have multiple clearly-identified and properly-handed-off roles. There's an Incident Commander (IC) role, whose job is to basically oversee the whole situation, there's various responders (including a primary one) whose job is to mitigate/fix the problems usually relating their own teams/platform/infra (networking, security, virtualization clusters, capacity planning, logging, etc. depends on the outage). There's also sometimes a communication person (I forget the role name specifically) whose job is to keep people updated, both internal to the outage (responders, etc) and outsiders (dealing with public-facing comms, either to other internal teams affected by the outage or even external customers). Depending on the size of the outage, the IC may establish a specific "war room" channel (used to be an IRC chatroom, not sure what they use these days though) where most communication from various interested parties will take place. The advantage of a chatroom is that it lets you maintain communication logs and timestamps (useful for postmortem and timeline purposes), and it helps when handing off to the next oncall during a shift change (they can read the history of what happened).
    • benbjohnson: BoltDB author here. Yes, it is a bad design. The project was never intended to go to production but rather it was a port of LMDB so I could understand the internals. I simplified the freelist handling since it was a toy project. At Shopify, we had some serious issues at the time (~2014) with either LMDB or the Go driver that we couldn't resolve after several months so we swapped out for Bolt. And alas, my poor design stuck around.
    • ineedasername: "circular dependencies in our observability stack" This appears to be why the outage was extended, and was referenced elsewhere too. It's hard to diagnose something when part of the diagnostic tool kit is also malfunctioning.
    • phgn: Like the Facebook outage a few months ago, when their DNS being down prevented them from communicating internally.
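
The cold-cache ramp-up, letting in a percentage of randomly selected players and sending the rest to a maintenance page, is easy to picture in code. The post-mortem says this was done with DNS steering but does not show how, so the sketch below only models the admission decision. The target names, the `steer` and `ramp` functions, and the ramp schedule are assumptions made for the example.

```python
import random


def steer(admit_percent: float, rng: random.Random) -> str:
    """Answer one lookup with either the game or the maintenance target.

    Both target names are placeholders for whatever the real steering returns.
    """
    if not 0 <= admit_percent <= 100:
        raise ValueError("admit_percent must be between 0 and 100")
    return "game" if rng.random() * 100 < admit_percent else "maintenance"


def ramp(schedule, lookups_per_step, seed=0):
    """Simulate a ramp and return (percent, observed admitted fraction) per step."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    results = []
    for percent in schedule:
        admitted = sum(
            steer(percent, rng) == "game" for _ in range(lookups_per_step)
        )
        results.append((percent, admitted / lookups_per_step))
    return results


if __name__ == "__main__":
    # Hypothetical schedule: open the doors a little at a time while caches warm.
    for percent, observed in ramp([0, 10, 50, 100], lookups_per_step=10_000):
        print(f"target {percent:>3}% -> observed {observed:.1%}")
```

Because each lookup is decided independently, the admitted fraction only approximates the target. That is fine when the goal is to throttle load rather than to guarantee any particular player gets in.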

  • When you're way too ahead of your time. Ever seen an Automatic Hostess? These were jukeboxes from the 1940s that had no phonograph mechanism. Let’s say you walk up to a unit in a tavern and insert a coin, would it contact the cloud and play music over the internet? No! None of that existed. You would speak into a microphone, which connected to an operator in a studio, who would locate the record, and play the request!

  • Building a serverless application is easier and better than ever in 2022. How Serverless Saved Money on My Heating Bill. Scraping: Browserless. Database: Supabase. Web framework: Remix. Hosting: Fly.io. Cost: 0.

  • What's the best serverless compute platform out today, and why? Answers are all over the place, which is a good thing.
    • Lambda. Appsync/Cognito/Dynamo/Lambda. Firebase/FB Auth/GCF. Vercel/Fauna. Something with Hasura? OpenFaaS? Cloud Run. Fargate. NextJs. Vercel. Browser. Mobile apps. Cloudflare Workers. Jamstack. Cloudflare Pages. Google Sheets. Github Actions. Ethereum. Azure Container Instances. webcode.run. App Engine. Firestore function. GKE autopilot. Back4app. Akka Serverless. MongoDB Realm.
    • Interesting question: Does running the platform yourself count as serverless? I agree with Mick Pollard: who provides the serverless platform doesn’t matter; from a developer perspective, what matters is your deployment model.
    • Lightbend is an Akka based PaaS and Platform that you might find interesting.
    • There’s also Durable Objects from Cloudflare.

  • Videos from Handmade Seattle 2021 are now available.

  • Creating greenfield apps in the cloud makes good cloud sense, but it’s not often we hear about lifting and shifting (sort of) a large SAP system to the cloud. Could moving SAP to the cloud be a win over on-premise? It seems so. Fender’s CIO Talks Tuning Up SAP with a Migration to AWS:
    • That must have been stressful: Over a three-day weekend, we migrated all of our SAP loads into the AWS cloud. We upgraded the systems. We migrated to SAP HANA. We switched the operating systems.
    • When you’re stuck with your existing equipment it’s hard to make huge performance improvements: The outcome was very significant performance improvements. Certain reports, business intelligence reports that would run 10, 20, 30 plus minutes would run [in] subseconds. It’s really a combination of the AWS infrastructure on one side and the HANA database on the other side.
    • When you’re stuck with existing boxes it’s hard to play around with different architectures: We have since then resized the production system. We started off with very large instances that AWS offers. We were able to, in a stairstep approach, reduce that size.
    • When you’re stuck with existing boxes it’s hard to add capacity: we just acquired a new company, PreSonus. They’re running on an Oracle ERP system. We have plans to migrate them over to SAP. We literally are copying a sandbox so they can start getting used to the system. It’s something that we would have a really difficult time setting up with our higher environment with AWS. That can be done within a day.
    • Applying ML to your data is a big theme behind cloud migrations: What I’m looking forward to is working with AWS, taking this huge amount of intelligence, of transactional data we have in the SAP system and using that dataset with Amazon's AI and machine learning algorithms to provide insight that we just can’t do today with the classic tools we have available.
    • Control is always the issue, but with control comes great responsibility: The only drawback would be a significant Amazon outage. I believe one of the [recent] outages really impacted the SAP instances. Having said that, we had a few outages in our own environment. But there we had more control.

  • Your bill will be as low as the thought you put into your architecture. How we handle 80TB and 5M page views a month for under $400:
    • Serving 80TB of bandwidth from S3 is estimated to cost around $4,000 per month alone
    • Cloudflare, for us, is primarily a massive caching layer.
    • For asset downloads, these cached downloads make up for about 85% of all traffic.
    • Because of Argo (which I’ll explain below) Cloudflare can cache this kind of traffic even better than the huge asset files, resulting in about a 93% cache ratio.
    • The cost is $40 per month.
    • Alone, Backblaze’s B2 clone of S3 service would cost us about $130 per month for that last 15% that Cloudflare doesn’t handle. But, we can get around that because of a partnership between Backblaze and Cloudflare they call the Bandwidth Alliance. As long as we use both services together, and pay for the $20 cloudflare subscription, we don’t get charged for download traffic at all.
    • polyhaven.com is built with Next.js – a javascript framework created by Vercel. Their base fee is $20 per month, with additional costs based on usage. Since we use Cloudflare in front of Vercel and are super careful about what can be cached and what can’t (e.g. anything requiring user authentication), we generally don’t go over the included usage limits
    • I decided to splurge a bit and go for a cloud solution where I wouldn’t have to worry about reliability, performance, scaling or integrity ever again: Google Firestore. This is certainly not the cheapest option, at around $100 per month it accounts for about half of our monthly web budget, but I still think it’s the right way to go. To avoid database reads as much as possible, we cache as much as possible. Our data doesn’t change that often, or when it does (e.g. download counts) it’s not very important to show users the latest information anyway. Cloudflare all the way
    • To give us the most control over caching database reads, and also to avoid racking up bills in Vercel, we have a separate $5 server (yes, seriously) on Vultr that runs our API.
    • All of our images shown on the website (thumbnails, renders, previews, etc.) are stored on Bunny.net – another budget CDN. This costs us about $27 per month, depending on traffic. Bunny.net doesn’t just store our images though, they also have an optimization service which allows us to dynamically resize and compress images for the website.
    • Almost everything is hosted on some cloud service that is inherently scalable by their design (e.g. vercel’s serverless functions). As they charge per usage, they actually want you to scale indefinitely.
    • kijin: I have some clients who use AWS and others who prefer colo and/or dedicated servers from traditional datacenters. The latter group can afford to over-provision everything by 3-4x, even across different DC's if necessary. DC's aren't yesterday's dinosaurs anymore. The large ones have a bunch of hardware on standby that you can order at 3 a.m. and start running deployment scripts in minutes.
    • virtuallynathan: At Netflix we’re doing close to 400Gbps on 1U commodity hardware, and pretty inexpensive.
    • mad182: I handle ~150TB and 26M page views for ~$500 by simply renting a few dedicated servers at hetzner. And if I didn't need quite a lot of processing power (more than average website), it would be much lower. I only need so many servers for the CPU power, not traffic. My app allows for "share nothing" architecture, basically using multiple DNS A records as load balancing. Currently it has 6 servers. Even if 5 of them would go down at the same time, the site would still work as intended (though probably couldn't handle peak load with less than 3 or 4). If one or two are down, nothing happens. Also completely reinstalling a server takes around an hour.
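
The Poly Haven numbers are worth running through a calculator. The sketch below uses only figures quoted in the bullets above. The `LINE_ITEMS_USD` grouping is my own reading of them: the excerpt does not say whether the $40 Cloudflare figure already includes the $20 subscription, and storage costs are not itemized, so treat the total as a rough check and not as their actual bill.

```python
# Figures quoted in the article excerpt above.
S3_ESTIMATE_USD = 4000  # estimated monthly cost of serving everything from S3
TOTAL_TB = 80           # monthly bandwidth
CACHED_PERCENT = 85     # share of asset download traffic served from Cloudflare's cache

# Monthly line items as quoted; several are approximate ("around", "about").
LINE_ITEMS_USD = {
    "Cloudflare": 40,
    "Vercel base fee": 20,
    "Firestore": 100,
    "Vultr API server": 5,
    "Bunny.net": 27,
}


def origin_traffic_tb(total_tb: float, cached_percent: float) -> float:
    """Traffic that still reaches the origin after the cache takes its share."""
    return total_tb * (100 - cached_percent) / 100


def implied_rate_per_tb(monthly_cost: float, total_tb: float) -> float:
    """Dollars per TB implied by a monthly cost estimate."""
    return monthly_cost / total_tb


def monthly_total(items: dict) -> float:
    """Sum of the quoted line items."""
    return sum(items.values())


if __name__ == "__main__":
    print(f"Origin traffic: {origin_traffic_tb(TOTAL_TB, CACHED_PERCENT)} TB")
    print(f"Implied S3 rate: ${implied_rate_per_tb(S3_ESTIMATE_USD, TOTAL_TB)}/TB")
    print(f"Quoted line items: ${monthly_total(LINE_ITEMS_USD)}/month")
```

The interesting number is the first one: an 85% cache ratio turns 80TB of downloads into about 12TB at the origin, which is why the rest of the stack can be so cheap.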

  • How do you give developers an instance of a complex product stack to test against? It’s harder than it sounds. Scaling productivity on microservices at Lyft (Part 3): Extending our Envoy mesh with staging overrides. Lyft started with Onebox, which spun up a VM running 100+ of their services to test against. This ran into scaling problems, so they moved to a shared staging environment that relies on a lot of Envoy magic: We fundamentally shifted our approach for the isolation model: instead of providing fully isolated environments, we isolated requests within a shared environment. At its core, we enable users to override how their request flows through the staging environment to conditionally exercise their experimental code.
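
To make "isolating requests within a shared environment" concrete, here is a minimal sketch of per-request routing overrides. It is not Lyft's implementation, which lives in Envoy. The header name, the `service=host:port` format, the service names, and the addresses are all invented for illustration. It also assumes the override header is propagated on every hop, otherwise downstream calls would fall back to the shared instances.

```python
# Hypothetical default routes for a shared staging environment.
DEFAULT_ROUTES = {
    "rides": "rides.staging:80",
    "pricing": "pricing.staging:80",
}

# Hypothetical header carrying per-request overrides,
# e.g. "pricing=10.0.0.7:8080,rides=10.0.0.9:8080".
OVERRIDE_HEADER = "x-staging-overrides"


def parse_overrides(header_value: str) -> dict:
    """Parse 'service=host:port' pairs, skipping anything malformed."""
    overrides = {}
    for item in header_value.split(","):
        service, sep, target = item.strip().partition("=")
        if sep and service.strip() and target.strip():
            overrides[service.strip()] = target.strip()
    return overrides


def resolve(service: str, headers: dict) -> str:
    """Pick the upstream for one request: the override if present, else the default."""
    overrides = parse_overrides(headers.get(OVERRIDE_HEADER, ""))
    return overrides.get(service, DEFAULT_ROUTES[service])
```

A request carrying `x-staging-overrides: pricing=10.0.0.7:8080` reaches the developer's experimental pricing instance, while every other service, and every other request, keeps using the shared staging deployment.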

  • So far the James Webb Space telescope has been a great success. Engineering FTW! The James Webb Space Telescope — making 300 points of failure reliable and The James Webb Space Telescope — Success through Redundancy.
    • conservatively, out of the thousands of tasks which must be completed before the telescope can be considered deployed successfully, over 300 are “single points of failure.”
    • testing, testing, and testing. The JWST cost so much (over 9,000,000,000 dollars) and took so long to create (originally planned for 2013, only launched in the final week of 2021) because so many unplanned challenges arose and had to be solved before the launch.
    • The high cost of the JWST meant that only one would be built — therefore redundancy was not possible to assure success. Some of the specific requirements of the JWST mean that it cannot be repaired once launched — therefore repairability was not possible to assure success. The only solution left for NASA managers & engineers was making sure that the JWST would be as reliable as possible.
    • The three ways in which NASA engineers could ensure the success of the Webb telescope: (1) Redundancy, building multiple telescopes so that if one failed, the others could complete the mission. (2) Repairability, building the telescope in a way which could be fixed if anything went wrong. (3) Reliability/Resiliency, building the telescope so that it could cope with any failures which might occur.
    • NASA invested heavily in the third option, designing, building and perfecting mechanisms which would guarantee the success of the mission.
    • the system has redundancy if it can be replaced and the mission will still succeed.
    • Essentially, redundancy is a form of horizontal scaling, adding more “of the same,” where each component is not aware of or dependent on the others.
    • Common misconceptions about space-grade integrated circuits.
    • Deep Space Network Now: The real time status of communications with our deep space explorers
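
Why are 300 single points of failure so scary? A little textbook reliability arithmetic shows it. The sketch assumes the failure points are independent, and the per-item probabilities are purely illustrative; they are not NASA's figures. The function names are mine.

```python
def series_success(p_each: float, n: int) -> float:
    """Chance that n independent steps all succeed, each with probability p_each."""
    return p_each ** n


def redundant_success(p_each: float, copies: int) -> float:
    """Chance that at least one of several independent copies succeeds."""
    return 1 - (1 - p_each) ** copies


def required_per_item(target: float, n: int) -> float:
    """Per-step reliability needed for n steps in series to hit a target overall."""
    return target ** (1 / n)


if __name__ == "__main__":
    n = 300  # single points of failure, per the article
    print(f"300 steps at 99.9% each: {series_success(0.999, n):.1%} overall")
    print(f"Per-step reliability needed for 99% overall: {required_per_item(0.99, n):.5%}")
    print(f"Two independent 90% copies: {redundant_success(0.9, 2):.1%}")
```

Even at 99.9% per step, 300 steps in series succeed only about three times in four. With redundancy and repair both off the table, the only lever left is pushing each step's reliability very close to 100%, which is where all the testing went.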

  • Videos from Strange Loop 2021 are now available.

  • Welcome player. Your quest is to reduce the AWS bill by 50% in the next 2 months. New side quests may be generated as goals are achieved. Actually, lots of good advice in the responses.

  • Why is there both good and evil in the world? For the same reason jitter makes communication more efficient.
    • An Injection of Chaos Solves Decades-Old Fluid Mystery: At some point, the molecular motion causes the fluid flow to become chaotic, surging and rippling in convoluted eddies that loop back on themselves. The onset of chaos is what impedes the fluid’s movement.
    • Philip Ball: Consequently, cells may have evolved adaptations that use noise to their advantage, and Elowitz’s model of the combinatorial logic of regulatory networks “may be one example of such adaptation,” Wagner said. “Cells may have sloppy systems whose power emerges from the right kind of combinatorics.”

  • Fun stuff. Power Loss Siren: Making Meta resilient to power loss events: To help increase resiliency, we built the Power Loss Siren (PLS) — a rack level, low latency, distributed power loss detection and alert system. PLS helps mitigate the impact of these events. It leverages existing in-rack batteries to notify services about impending power loss without requiring additional hardware, so that engineers or the services themselves can take action. PLS also features a simple API for services to implement mitigation handlers while servers run on battery power. With PLS support, services can failover proactively, rather than reactively after servers go down.

  • People are just so gosh darn clever. A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution. usmannk: NSO used a compression format's instructions to create logic gates and then from there "a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations", all within a single pass of decompression. Combine this with a buffer overflow and you've got your sploit.

  • GraphLoad: A Framework to Load and Update Over Ten-Billion-Vertex Graphs with Performance and Consistency: GraphLoad is a scalable graph loading framework eBay developed and deployed in production. It has been loading and updating a graph with over 15 billion vertices and over 20 billion edges since May 2020. NuGraph is a graph database platform developed at eBay that is cloud-native, scalable and performant. It is built upon the open-source graph database JanusGraph, with FoundationDB as the backend that stores graph elements and indexes. Also, eBay’s Global Secondary Indexes.

  • gerbilly: The current generation couldn't invent the internet. You know why, cos they would never have the patience to spec it out like the old timers did. Go read a few RFCs and try to imagine a scrum team today putting as much thought into an up front design.
    • Yah, no. No doubt the original crowd that invented the internet were brilliant, but not uniquely brilliant across all time and space. They had a couple of advantages. One: all they had were single computers. There was no cloud. So all those lovely protocols people love were developed out of necessity, not foresight. Everything makes sense when you think through the lens of trying to share a disparate set of geographically distributed single, large, expensive computers. Computers were heterogeneous, communications were heterogeneous, locations were heterogeneous, motivations were heterogeneous, so protocols were it. Protocols allowed all those different interests to align. We don’t have protocols today because nobody really wants them. Protocols don’t allow you to create a moat around your IP. So F* that. It’s a completely different world. Two: They got to build small working systems and then evolve larger working systems from that seed. They could do that because they had time, money, and purpose. Nobody was depending on them. Again, it was a completely different world.

  • How Uber Migrated Financial Data from DynamoDB to Docstore. LedgerStore is an immutable, ledger-style database storing business transactions. Over this period of time we realized that operating LedgerStore with DynamoDB as a backend was becoming expensive. We decided to change the LedgerStore backend to be one of our in-house, homegrown databases. The estimated yearly savings are $6 million per year, and we also laid the foundation for such other future initiatives.

  • How to Build a Supersonic Trebuchet. What’s to love about this video is the analytical approach to design and problem solving. Oh, and it's a supersonic trebuchet!

  • Just in case you still have more room in your mental basement for more microservices talk. Some thoughts on microservices. I hear it’s for scaling developers, not users. Just what I heard at the barbershop.

  • Why Zillow Couldn’t Make Algorithmic House Pricing Work. In short, algorithms don’t handle Black Swan events well.

  • Databases are a hard business. Ten years of NewSQL: Back to the future of distributed relational databases:
    • It is clear from the list above that the original NewSQL vendors did not succeed in disrupting the database market. In fairness, they were setting out to tackle one of the most significant challenges in computer science – combining the scalability advantages of NoSQL with the structure, consistency, performance and transactional support provided by the relational data model (there's a reason why the speed of light can be considered a competitor when it comes to distributed transactions).
    • The next generation of distributed relational database vendors has fared better, however. Cockroach Labs has more than 300 customers and a $2bn valuation; PlanetScale recently announced the launch of its developer-focused database service; and PingCAP launched the public preview of its TiDB Cloud managed service, having launched version 2.0 of its database in 2020 and raised a $270m series D funding round.
    • 5 Database technologies used by 2000 Wix microservices
    • Announcing the new Timescale Cloud, and a new vision for the future of database services in the cloud

  • Herding elephants: Lessons learned from sharding Postgres at Notion:
    • Shard earlier. Aim for a zero-downtime migration.
    • jordanthoms: I looked at a bunch of options but ultimately managing a sharded database seemed like too much overhead for our small (but expanding!) dev team - so we decided to move our heavily loaded tables to CockroachDB instead, as since it's mostly Postgres compatible our transition would be easier, and it would be much easier to manage as it automatically balances load and heals. We're running cockroach on GKE and it's a really nice fit for running on kubernetes. Ended up working well for us - we still have our smaller tables on PG but we want to move it all to Cockroach over time.
    • jandrewrogers: Postgres starts to struggle when you push it past 10TB, the details of which will depend on your specific data model and workload. I've seen it pushed to 50TB but I would not recommend it. The architecture simply isn't designed for that kind of scale.

  • Videos from Dotnetos Conference 2021 are now available. You might like Reuben Bond - Orleans under the hood. It’s a deep dive on efficient RPC and serialization.

  • You can do magic when you control your own hardware. Multicast and the Markets. Why would I prefer multicast over unicast?
    • The switches are very fast and deterministic about how they do that. And because of, I think their usage in the industry, they’ve gotten faster and more deterministic. So they can just electrically repeat those bits simultaneously to 48 ports, or whatever, that that switch might have. And that’s just going to be much faster and more regular than you trying to do it on a general purpose server, where you might be writing multiple copies writing to multiple places, you really can’t compare the two.
    • One of the key advantages of using switches is that the switches are doing the copying in specialized hardware, which is just fundamentally faster than you can do on your own machine. Also there’s a distributed component of this. When you make available a multicast stream, there’s this distributed algorithm that runs on the switches where it learns essentially what we call the “multicast tree.” At each layer, each switch knows to what other switches it needs to forward packets, and then those switches know which port they need to forward packets to. That gives you the ability to kind of distribute the job of doing the copying. So if you have like 12 recipients in some distant network, you can send one to the local switch, and then the final copying happens at the last layer, at the place where it’s most efficient. That’s the fundamental magic trick that multicast is providing for you.
    • Multicast on the open Internet doesn’t work. Multicast in the cloud basically doesn’t work. But multicast in trading environments is a dominant technology. And one of the reasons I think it’s a dominant technology is because it turns out there are a small number of videos that we all want to watch at the same time.
    • This is, in fact, in some ways, a general story about optimizing many different kinds of systems: specialization, understanding the value system of your domain, and being able to optimize for those values.
    • People who are running exchanges, who are disseminating data, care about getting data out quickly and fairly, but they care more about getting data to almost everyone in a clean way than they do making sure that everyone can keep up. So you’d much rather just kind of pile forward obliviously and keep on pushing the data out, and then if people are behind, well, you know, they need to think about how to engineer their systems differently so they’re going to be able to keep up. You worry about the bulk of the herd, but not about everyone in the herd.
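
The "12 recipients in some distant network" example can be counted out. The sketch below is a toy model, not a network simulator: it takes a forwarding tree and counts how many packet copies cross links when the sender unicasts to each recipient versus when switches replicate along a multicast tree. The topology and node names are invented for the example.

```python
def count_transmissions(tree: dict, root: str, recipients) -> tuple:
    """Return (unicast_link_sends, multicast_link_sends) for one message.

    tree maps each node to the list of its children in the forwarding tree.
    """
    recipients = set(recipients)
    unicast = 0
    multicast = 0

    def visit(node: str) -> int:
        """Return how many recipients sit at or below this node."""
        nonlocal unicast, multicast
        count = 1 if node in recipients else 0
        for child in tree.get(node, []):
            below = visit(child)
            if below:
                unicast += below  # every recipient below needs its own copy on this link
                multicast += 1    # one copy crosses this link; replication happens further down
            count += below
        return count

    visit(root)
    return unicast, multicast


if __name__ == "__main__":
    hosts = [f"host{i}" for i in range(12)]
    # Hypothetical topology: sender -> local switch -> core -> remote switch -> 12 hosts.
    tree = {
        "sender": ["local_switch"],
        "local_switch": ["core"],
        "core": ["remote_switch"],
        "remote_switch": hosts,
    }
    print(count_transmissions(tree, "sender", hosts))
```

In this toy topology unicast pushes 48 copies across links, 12 of them out of the sender's own NIC, while multicast pushes 15, with the sender emitting just one. The copying has moved to the last switch, which is the place the quote says it is most efficient.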

Soft Stuff:

  • Time Stamp Authority: freeTSA.org provides a free Time Stamp Authority. Adding a trusted timestamp to code or to an electronic signature provides a digital seal of data integrity and a trusted date and time of when the transaction took place.
  • pinterest/memq (article: https://medium.com/pinterest-engineering/memq-an-efficient-scalable-cloud-native-pubsub-system-4402695dd4e7): a new PubSub system that augments Kafka at Pinterest. It uses a decoupled storage and serving architecture similar to Apache Pulsar and Facebook Logdevice. Is 90% more cost effective than our Kafka footprint.
  • Netflix/cachemover (article): This project provides the ability to dump memcached data (all KV pairs) to disk, and populate a memcached process on a different server. Netflix reduced total warm up times by ~90% compared to its previous architecture.
  • spotify/XCRemoteCache (article): a remote cache tool for Xcode projects. It reuses target artifacts generated on a remote machine, served from a simple REST server. Reduces build times by 70%.

Video Stuff:

Pub Stuff:

  • Version 2.0 of Architecting for Scale: How to Maintain High Availability and Manage Risk in the Cloud.
  • Log-structured Protocols in Delos: We show via experiments and production data that log-structured protocols impose low overhead, while allowing optimizations that can improve latency by up to 100X (e.g., via leasing) and throughput by up to 2X (e.g., via batching)
  • PI-Terminal Planetary Defense: Our best option, according to a new analysis (detailed pdf), is to launch an array of rods (e.g., a 10x10 array of multi-meter long hardened penetrators, some containing explosives) into the asteroid’s path, using its relative velocity to pulverize it into a cloud of fragments that can then burn up in the atmosphere, causing a bunch of uncorrelated smaller airbursts instead of a single large catastrophic one. The paper found this to be surprisingly effective. Existing launchers could easily deliver 100 penetrators, or 10 tons, into an asteroid’s path, and Starship with refueling could deliver 100 tons, enough to “take on asteroids well in excess of 100 m diameter with a goal of mitigating an Apophis-class (370m diameter) asteroid.”
  • Hybrid Networking Lens AWS Well-Architected Framework: This whitepaper describes the Hybrid Networking Lens for the AWS Well-Architected Framework, which helps customers review and improve their cloud-based architectures and better understand the business impact of their design decisions. The document describes general design principles, as well as specific best practices and guidance for the five pillars of the Well-Architected Framework.