Stuff The Internet Says On Scalability For Sep 18th, 2020

Hey, it's HighScalability time!

I can't wait for the duel. Just don't shoot into the air.

Do you like this sort of Stuff? Without your support on Patreon this kind of Stuff won't happen.

Know someone who could benefit from becoming one with the cloud? Of course you do. I wrote Explain the Cloud Like I'm 10 just for them. On Amazon it has 167 mostly 5 star reviews. Here's a 100% lectin-free review:

Number Stuff:

  • 1/8th: failure rate for Microsoft's underwater data center—after two years—when compared to conventional data centers. Only 8 out of 855 servers failed. Why? No humans to mess things up? Shielding from cosmic rays? Nitrogen atmosphere? Cooler?
  • 100,000+: requests per second handled by Shopify.
  • 22.2%: US electricity supplied by renewable energy.
  • 40,000: age in years of the oldest technical system ever built by humans, a series of fish traps in Australia. It was last used in 1915.
  • ~330M: traces and ~8.5B spans per day at Slack.
  • ~60%: of organizations run a mix of SQL and NoSQL databases. Only 14% of organizations run exclusively NoSQL databases.
  • $1 million: not tempting enough to hack Tesla. Other companies were less fortunate.
  • 0.81%: Backblaze's Annualized Failure Rate (AFR) for Q2 2020, down from 1.07% in Q1 2020. One year ago (Q2 2019), the quarterly AFR was 1.8%. Three drive models had 0 drive failures: the Toshiba 4TB, the Seagate 6TB, and the HGST 8TB.
  • 22%: ecommerce penetration in the US, up from 17% a few months ago. That's five years of growth in three months.
  • 5%: of the world's websites are hosted on Wix, which serves 700 million uniques per month.
  • 43%-67%: faster SQL Server backup by writing to multiple files.
  • 0: score for the meat puppet F-16 fighter pilot against an AI. DeepMind becomes DeepDeath.
  • $720 billion: wasted on failed IT replacement efforts.
  • $734.38: Joe Emison's itemized monthly bill for a full-stack insurance company running on serverless.
  • 11 to 60 Mbps: Starlink download speeds. Ping ranges from 31ms to 94ms. Not bad, but not stratospheric either.
  • £200m: British companies paid in ransomware last year.
  • 10x: more outages by ISPs when compared to cloud providers.
  • 50x: chip cooling improvement over typical microchannel cooling approaches. It uses an optimized 3D structure to extract heat before it propagates, rather than waiting for it to spread as a heat sink does. By increasing the heat flux that can be managed, many more devices can be integrated on a chip. Not only that, we can start having integrated power chips. This is a new thing. It could have an impact similar to that of silicon microchips.

Quotable Stuff:

  • Iain Banks: “It happens.” [being hacked] Hippinse sighed. “Not to Culture ships, as a rule; they write their own individual OS as they grow up, so it’s like every human in a population being slightly different, almost their own individual species despite appearances; bugs can’t spread."
  • @TimSweeneyEpic: Two facts about Apple: 1) Apple is #3 in the world in game revenue 2) Apple doesn’t make games
  • @RituFM: F.R.I.E.N.D.S. OF PRODUCT 1. Sales: Joey 2. Marketing: Rachel 3. CEO: Monica 4. Engineering: Ross  5. Support: Chandler 6. Rest: Phoebe Me: Paul Rudd ;)
  • iRobot CEO: Thinking that autonomy was the destination was where I was just completely wrong
  • @raganwald: “Why are you asking for this intellectual property and non-compete agreement?” We don’t want you taking what you learned here, somewhere else. “Why are you offering me a job in the first place?” We want to take advantage of what you learned somewhere else. “Siri, playback.”
  • tehlike: As an engineer that worked both at Google and Facebook, I vastly prefer Google's monorepo on Perforce. Combining that with CitC was a pretty solid way to develop your project. Hg is a bit of a nightmare in the WFH situation. Really slow, hangs for a long time if you haven't synced in a few days. Yes, I'm sure there are ways to tweak it, but not sure if you can tweak it enough!
  • 45nshukla: Moving 25TB of data from one S3 bucket to another took 7 engineers, 4 parallel sessions each, and 2 full days
  • redm: I really love this offering and I don't think it gets enough attention. We are an on-prem company and we use Cloudflare. Our users pay for that latency (in time) for us to traverse our IP providers to get to a CF pop. Since all our traffic goes over CF, directly connecting makes a lot more sense. I'm going to investigate further for the latency benefits. I've also backhauled lots of IP over the years and it can be a real pain. Fiber cuts are common, keeping redundant wave service or dark fiber drives up the cost, and in the end, it's often cheaper to hand off to an IP provider's meshed network than to backhaul any distance for latency.
  • Evan Ackerman: iRobot is announcing a major new software update that represents a significant shift of its overall approach to home robot autonomy. Humans are being brought back into the loop through software that tries to learn when, where, and how you clean so that your Roomba can adapt itself to your life rather than the other way around.
  • SpectralCoding: The correct answer is S3 Batch Operations, using the PUT object copy functionality. We had to move a large on-premise backup destination bucket from us-east-1 to us-east-2. It resulted in my post here Cross Region S3 Transfer Speed 50Gb/s? (moving 122TB in about 5 hours).
  • manigandham: GCP and Azure both have much better built-in tooling that would make this a few clicks. Their storage system design is also much better. It's unfortunate that the industry has standardized around S3 just because it's a first mover rather than pushing Amazon's product to get better.
  • pythonpoole: A key difference is that Cosmos DB is designed so that it can be effectively used as a drop-in replacement for a traditional relational database supporting SQL queries whereas DynamoDB is not designed this way. DynamoDB does not support SQL-like queries. DynamoDB instead has a proprietary API which is not designed to work with relational data. While secondary indexes are supported, it's not like Cosmos DB where all properties are auto-indexed. Ultimately, you won't get the same degree of query flexibility that you get with CosmosDB. With DynamoDB you generally first have to organize your data in a way that is optimized for the types of queries you want to perform.
  • shakezula: I used to work in the music industry professionally, on the ground level doing booking and management. This trend has been happening slowly for nearly a decade but it’s finally here. Rap and Hip Hop figured out a long time before most other genres that rapid small releases was a far better way to keep hype and sales up. Before Spotify was a thing, the shift was happening with YouTube but it wasn’t as predominant. Now it’s basically assumed you’ll be releasing singles every month. The music isn’t your product, the music is your marketing. The shows, the merch, your influence - that’s your product.
  • zelly: Innovation happens on the factory floor. Pretty soon the countries we outsourced to will come up with better designs too. The U.S. thesis for outsourcing is that the manufacturing countries are filled with braindead automatons who can't compete with our "Designed in California". That may have been true in the 20th century when the U.S. brain drained all the top talent, but now other nations are in a position to pay their engineers more than U.S. companies. The "Designed in California" cope can only last so long.
  • @Rainmaker1973: NASA only uses 15 digits of π for calculating interplanetary travel. At 40 digits, you could calculate the circumference of a circle the size of the visible universe with an accuracy that'd fall off by less than the diameter of a single hydrogen atom. (A quick mpmath check of this claim appears at the end of this list.)
  • @_ericelliott: 2 months into TDD: Tests are hard to write & brittle. 2 years in: Tests taught me better code patterns, reduce bugs 40%-80% and eliminate fear of change. 10 years in: TDD changed my life.
  • Rich Miller: Consultant and futurist Chetan Sharma projects the edge economy will reach $4.1 trillion by 2030. The State of the Edge 2020 report from the Linux Foundation projects that edge investment will accelerate after 2024, with the deployed global power footprint of edge IT and data center facilities forecast to reach 102,000 megawatts by 2028, with annual capital expenditures of $146 billion.
  • Slack: To address these limitations and to easily enable querying raw trace data, we model our traces at Slack as Causal Graphs, which is a Directed Acyclic graph of a new data structure called a SpanEvent.
  • gcommer: tl;dr of confidential computing: In normal cloud computing you are effectively trusting the cloud provider not to look at or modify your code and data. Confidential computing uses built in CPU features to prevent anyone from seeing what is going on in (a few cores of) the CPU (and in EPYC's case, encrypt all RAM accesses). Very roughly: These CPU mechanisms include the ability to provide a digital signature of the current state of the CPU and memory, signed by private keys baked into the CPU by the manufacturer. The CPU only emits this signature when in the special "secure mode", so if you receive the signature and validate it you know the exact state of the machine being run by the CPU in secure mode. You can, for example: start a minimal bootloader, remotely validate it is running securely, and only then send it a key over the network to decrypt your proprietary code. Effectively, it increases your trust in the cloud from P(cloud provider is screwing me over) to P((cloud provider AND CPU manufacturer are both working together to screw me over) ∪ (cloud provider has found and is exploiting a vulnerability in the CPU)).
  • rkangel: We mostly use C rather than C++, but the same two big reasons get in the way of using Rust for everything: compiler support, and availability of suitable engineers.
  • Mattman: Owner insisted on 0 downtime. Moved the server 700 feet on a cart with 2 UPSs and a chain of (3) gigabit switches. Should have been a 5 minute job if done correctly. Owner ended up paying for over 10 hours of work.
  • @Carnage4Life: Then iOS happened. It turns out what actually wins is user experience not openness. We all got it wrong.
  • @ben11kehoe: Today, people's code is local, and so they think "how can I move the cloud down here to test with my local code?" I believe the question should be "how can my local dev environment be better and more quickly manifested in the cloud?"
  • Mikael Ronstrom: Just a fun image from running a benchmark in the Oracle Cloud. The image above shows 6 hours of benchmark run in a data node on a Bare Metal Server. First creating the disk data tablespaces, next loading the data, and finally running the benchmark. During loading the network was loaded to 1.8 GByte per second and the disks were writing 4 GByte per second. During the benchmark run the disks were writing 5 GByte per second in addition to reading 1.5 GByte per second. All this while the CPUs were never loaded to more than 20 percent.
  • donatj: We actually restructured our entire product to win the sale of a very large customer whose users didn’t fit perfectly into our metaphor. It was unwieldy and we basically rolled the whole thing back several years later
  • @etherealmind: Chuck Robbins, Cisco CEO:  "I think this pandemic is basically just – it's just giving us the air cover to accelerate the transition of R&D expense into cloud security, cloud collab, away from the on-prem aspects of the portfolio." Router huggers won't be happy.
  • @jvthing: Our entire platform at @thingco is built on one DynamoDB table and lots of Go lambda functions! Perfect combo for speed, stability and cost
  • Business Insider: She earns money by placing ads on her YouTube channel, and promoting products on her Instagram page (176,000 followers) and her podcast "Thick & Thin." On average, Bellotte earns between $2,400 and $5,000 for a sponsored Instagram post, she said. For an Instagram Story slide, she asks for $500 per frame.
  • @nathankpeck: My DynamoDB usage last quarter: - 1 TB of data, >10 billion rows - 1 billion on demand read units per month - 200 million on demand write units per month - Consistent 2.5ms query latency the entire time - Cost is about $6k per month. Having such a worry free DB is priceless
  • @swardley: ... the best way to think of China Gov is the world's largest venture capital firm but a really good one with high levels of situational awareness and gameplay. It's tough for US policy makers to cope with because it doesn't fit into a more US view of economics.
  • @emollick: Over 2 billion years ago, 17 nuclear reactors started up in Gabon, Africa. They ran for almost a million years, off & on, producing 100 kilowatts of power at a time. They were entirely natural, because uranium was common.
  • @mipsytipsy: Metrics scale up linearly in terms of write amplification, storage and cost. Double the metrics you capture and store, and you've doubled your bill. This sucks because the only way you can really ask new questions using metrics is by defining new custom metrics upfront.
  • @sheeshee: today we released our new #kubernetes clusters based on #rancher on bare metal. \o/ :) sdn is #calico, new ci/cd/workflows integrated the #argo family. all metrics goes to our new tsdb #victoriametrics. storage from our new #ceph cluster. 15/10 would do own infra again. :)
  • @addisonsnell: A one-slide summary of @Google #TPU v3 over v2: 4x nodes; 2x matrix-multiply;+30% freq; +30% memory bw; 2x HBM memory capacity; +30% interconnect bw #HotChips2020 #AI
  • @chrismcatackney: 8 year old just experienced his first serious production issue. He created a block in Minecraft that started spawning dragons in an infinite loop. He ran in crying that all the megabytes were getting used up and the computer would go on fire. Welcome to software, son.
  • @jpietsch: At Amazon we built very large datacenter networks with OSPF almost a decade ago. After leaving last year, I've learned from the industry that you can't do that with OSPF. hmmm.
  • @phaeria: Even Amazon has outages; this morning AWS EC2 in the London region was down. One of our clients was affected. The system was built to be scalable, so we redeployed and went live in 30 minutes instead of 2 hours of potential downtime.
  • @sbyrnes: An interesting observation about Google is that it succeeded in eating a lot of the Web value chain. In comparison, Facebook has slowly devoured the social media ecosystem until it was the only one left. It hasn't succeeded in branching out to other parts of the value chain. As a result, Facebook finds itself very vulnerable to regulation and likely cannot simply acquire their competition as they have in the past. For now they scale based on how much their social networks scale, but eventually are going to be vulnerable to new competition (TikTok).
  • @mathiasverraes: "Monolith" is the word you use when you want to blame the brokenness of your system on its size, instead of on 15 years of bad development practices.
  • @RosaCtrl: ”So why does NTP's support hinge so much on the shaky finances of one 59-year-old developer?” Because open source lacks political motivation besides free labour
  • Ryan Warrender: Relapse is real. The majority of the companies we worked with saw significant improvements in site speed and/or user engagement. However, 30–60 days post consultation (when we were no longer looking over their shoulder) we would see bad habits resurface. To avoid this pitfall, use a performance budget.
  • readingnews: As a systems admin of some 30 years, I have read this over and over, and even seen it firsthand. I think the real root cause of this problem (at least in the cases I have seen) is management. Leaders do not want to spend now. Leaders do not want to make a progressive plan to keep up with technology and move ahead. It costs money, and is very difficult to talk to upper-upper management about. Telling the boss to tell his boss to spend money on something "that works now" is very difficult. Few want to invest in the future when it costs now and is working now (unless it's the stock market).
  • Andreas Zwinkau: Software that failed based on the phase of the Moon at CERN: “A few desperate engineers discovered the truth; the error turned out to be the result of a tiny change in the geometry of the 27km circumference ring, physically caused by the deformation of the Earth by the passage of the Moon!”
  • @BrianRoemmele: Amazon said it had received more than 3,000 requests for smart speaker user data from police earlier this year. Amazon complied with the police's requests on more than 2,000 occasions. This number marks a 72% increase from the same period in 2016—up 24% year over year.
  • rubiquity: Putting hubris aside, I think it's great that a decade or so into large scale computing we're starting to see patterns emerge for scaling stateful systems and be able to build good generic solutions to them. This is sorely needed especially on the control plane side which historically hasn't gotten the attention that data planes have.
  • @SteveSmith_81: In the past 3 years I've worked with 3 different clients using #kubernetes: Kops + AWS and GKE. In all 3 cases k8s has successfully managed workloads at scale, yet I wouldn't recommend it to any future clients. The upfront and ongoing investment is astonishing
  • @mweagle: “Looking back from 2020, Go has succeeded in both ways: it is widely used both inside and outside Google, and its approaches to network concurrency and software engineering have had a noticeable effect on other languages and their tools.”
  • msadowski: In the two years working as a Robotics Consultant, I’ve noticed a pattern: the more a potential client bargains before starting the job, the more bargaining and complaining will follow, resulting in an unpleasant experience for everyone involved.
  • Jana Iyengar: So where does this leave us in terms of final QUIC and HTTP/3 deployment in the world? I’ll venture to make a few predictions; note that standard disclaimers apply about any such forecasting. Looking at the landscape, I expect that we will see rapidly increased rollouts of QUIC and HTTP/3 by clients this year, as well as higher volume testing on pre-release channels first, followed eventually by clients turning QUIC and HTTP/3 on in their stable releases. Going a step further, I believe that QUIC and HTTP/3 will become the de-facto mainstream web protocol stack in 2021.
  • Zdenek Prikryl: China is really active right now. In fact, it is the most active territory at the moment. With RISC-V, we see a lot of traction at the universities and in the companies. Pretty much every company has some kind of RISC-V strategy. Either they have adopted RISC-V already, or plan to do that quite soon. The next one from a geographical point of view is North America. The U.S. is quite active. You can see startups working with RISC-V in AI domains because you need to have some kind of customization, and RISC-V is very well positioned for that.
  • Science Daily: "It's not the first step of attributing a mind to an android but the next step of 'dehumanizing' it by subtracting the idea of it having a mind that leads to the uncanny valley. Instead of just a one-shot process, it's a dynamic one."
  • @DrQz: For reasons known only to those who use it, the term "latency" has become regurgitated ad nauseam. In performance analysis it is a useless generic word. Successful analysis requires that it immediately be decomposed into service time, waiting time, etc.
  • @migueldeicaza: I bet Fortnite could work in Safari without going through the AppStore. Like Confucius famously said in 500 BC: “When there is a billion dollar budget there is a way to compile the code to WebAssembly” Socrates famously retorted “the bleeding can end when the man drops his knife and stops stabbing thyself” To this day, philosophers debate whether “the man” in the quote  refers to an adult or a precocious toddler.  “The Republic” left enough room for interpretation.
  • @vgill: But the perception still sticks. I remember a few years ago a super smart VC claimed in front of 30+ people that AWS had reduced pricing 40+ times while HP and Dell had not. The entire crowd was nodding along until I mentioned that if you looked at IOPS, storage, and CPU, AWS was much further behind on price cutting than you would expect based on memory and Moore's observation. Awkward. An entire set of ostensibly smart decision makers hand-picked to fly to Hawaii and thought-lead could not draw the lines on something this simple.
  • @jordannovet: Snowflake, which offers cloud-based data warehousing software, is worth about $70 billion. Teradata, which sells traditional data warehousing hardware and software, is worth $2.5 billion
  • @jks: I remember a database professor telling us how he had the opportunity to ask an airline systems programmer how they do distributed locking for seat allocation. "Sometimes two people get the same seat, then we might have to bump or upgrade one."
  • @emilyst: "your API is writing checks your program can't cache"
  • Jeffrey Burt: Alibaba in July introduced its first RISC-V-based product, the XT910 (the XT stands for Xuantie, which is a heavy sword made using dark iron), a 16-core design that runs between 2.0 GHz and 2.5 GHz, is etched in 12 nanometer processes, and includes 16-bit instructions. Alibaba claims the XT910 is the most powerful RISC-V processor to date. The company spoke more about the processor at this week’s virtual Hot Chips 2020 conference, giving an overview of the processor and an idea of how it stacks up against Arm’s Cortex-A73.
  • Pat George: At some point we realized one’s ability to solve the Boggle challenge didn’t correspond to the person’s success here.  After thinking about why there seemed to be no correlation we determined that of all the things the algorithm-type questions can tell you about a candidate, we cared about slightly different things.  Additionally we decided that if none of us liked doing them during our own interviews, why would we subject our future colleagues to them?
  • Qnovo: The combination of new cathode materials with silicon-graphite composite anodes promises to deliver energy densities around 900-1,000 Wh/l. Yet the vast majority of lithium-ion batteries continue to ship today with graphite anodes, highlighting the difficulties and the long durations needed for bringing new materials to market.
  • Netflix: Logs, metrics, and traces are the three pillars of observability. Metrics communicate what’s happening on a macro scale, traces illustrate the ecosystem of an isolated request, and the logs provide a detail-rich snapshot into what happened within a service.
  • Ed Sperling: For high-performance applications, chips are being designed based upon much more limited data movement and near-memory computing. This can be seen in floor plans where I/Os are on the perimeter of the chip rather than in the center, an approach that will increase performance by reducing the distance that data needs to travel, and consequently lower the overall power consumption....Scaling of digital logic will continue beyond 3nm using high-NA EUV, a variety of gate-all-around FETs (CFETs, nanosheet/nanowire FETs), and carbon nanotube devices...Designs are becoming both more modular and more heterogeneous, setting the stage for more customization and faster time to market. All of the major foundries and OSATs are now endorsing a chiplet strategy
  • Sabine Hossenfelder: This path-dependence is also why magnets can be used to store information. Path-dependence basically means that the system has a memory.
  • Memory Guy: In fact, if you estimate that Intel’s NSG group’s NAND profit was equal to the average of its competitors, then you can calculate XPoint losses of about $2 billion for 2017, another $2 billion for 2018, and $1.5 billion for 2019!
  • @MaxEpstein5: My least favorite part of programming is that triumphant feeling during debugging of "FINALLY found the bug" followed by the soul-crushing "oh wait that's not THE bug, that's just a(nother) bug"
  • Mark Callaghan: The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower-cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution is to use NVM for the top levels (L0, L1), TLC for the middle levels, and QLC for the max level of the LSM tree. Alas, that doesn't always work out great, as the paper shows.
  • @dotpem: We're experimenting with the new Graviton instance types at @honeycombio and they're saving us an ARM and a leg.
  • @tmclaughbos: How long has this idea of treating your entire cloud infrastructure as a monolith been around? I'm shocked at the number of times I've talked to people that use a cloudformation stack with many many nested stacks to manage their entire AWS infrastructure and applications in it.
  • @davidcrawshaw: Use of Reed-Solomon error correcting codes in COVID-19 pooled testing. Each sample is assigned to more than one pool, and if positivity is low enough the result is 10x throughput.
  • jeffffff: yeah i've learned the hard way not to give a customer an sla without a rate limit built into it
  • Bruce Dawson: If everyone on a project spends all of their time heads-down working on the features and known bugs then there are probably some easy bugs hiding in plain sight. Take some time to look through the logs, clean up compiler warnings (although, really, if you have compiler warnings you need to rethink your life choices), and spend a few minutes running a profiler. Extra points if you add custom logging, enable some new warnings, or use a profiler that nobody else does.
  • Frank Schirrmeister: Where is all of this going? Networks will have to become faster, storage latencies will have to go down and storage volumes will have to go up. Compute-domain specificity will increase even more. The Next Platform’s Timothy Prickett Morgan’s discussion with NVIDIA’s Jensen Huang nicely illustrates how the transformation of the data center goes well beyond just changes in the semiconductor industry design chain and the dynamics in the processor ecosystems and Tier 1 hyperscale companies who are doing their own chip design. The data center will become fully programmable. Design for power efficiency, thermal optimization and integrating multiple chiplets using 3D-IC technologies will be key enabling technologies, worthy of their own future blog.
  • trevor-e: There was a three-week gap since the last Xcode update and Apple is notorious for breaking stuff between Betas. We were given a single day to get everything set up (CI/CD, signing certificates, provisioning profiles, etc), update our codebase (yes there were source changes made in the GM update), and test that everything works (three weeks worth of Xcode changes), so no it's not really how you picture it to be. Having iOS14 features ready to be in sync with the iOS14 launch is pretty crucial for apps.
  • Atlantic Council Report: Software supply chain security remains an under-appreciated domain of national security policymaking. Working to improve the security of software supporting private sector enterprise as well as sensitive Defense and Intelligence organizations requires a more coherent policy response developed together with industry and open source communities. This report profiles 115 attacks and disclosures against the software supply chain from the past decade to highlight the need for action and presents recommendations to both raise the cost of these attacks and limit their harm.
  • @mountain_ghosts: I love how bank transfers are the canonical DB transaction example when banks take days to execute them and allow dirty reads the entire time.
  • @shreyas: Don’t hide behind the data. Don’t wait for it to tell you what to do. You can actually validate promising ideas to death. At some point you have to go with it. Put an MVE or MVP out there. Risk something, but not much. Then measure and take the next decision.
  • jwr: Back when I worked at a supercomputing center, we had "operators" on duty, who were supposed to visit the machine room every 2-3h or so and check several things. It turned out that they were the major cause of hangs and reboots of our SunSITE server (a large FTP archive) — walking on the raised datacenter floor caused vibrations which were enough to disturb the (terrible) external SCSI connectors to multiple drive arrays.
  • pier25: If it was closer to 5%, or even a fixed fee (eg: $1 per app sold), I would accept the narrative that it's a fee. 30% of your business is not a fee, it's more like a partnership. Apple is, in practice, a business partner to each and every iOS developer. Except that they have total and absolute control of the business. If they shut you down on the App Store, you're done. Your iOS app is worthless on any other platform. I shit you not, I've personally had apps rejected by the review board because they didn't like the screenshots. Those were screenshots of the app itself.
  • teh_klev: I speak as someone who's worked for 30+ years on data modelling. Every time I encounter some mongo or other non-relational DB where the company jewels (the data) are stored with no documentation, no data model etc and stuff is just shoved into these stores willy nilly it makes me weep. Start off with relational, if perf is a problem then look at denormalising, after that then consider other alternatives for special cases. But to see run-of-the-mill apps with no near future scalability issues jumping right into mongo et al from day one makes me want to run away.
  • mojomark: You're correct, as a marine engineer, I can tell you that the biofouling problem is far from solved. To date, copper biocide works best, but is terrible for the environment. A lot of the new coatings are also 'speed-release', meaning the hull must travel at a certain speed in the water before the biofouling simply falls off. Obviously for a hull that sits stagnant in the water, like Natick, this type of coating won't help. Many people have tried to solve the issue passively, and even tried mimicking shark skin (tiny chevron-shaped scales) at the microscale to minimize biological growth's ability to "latch" to the material surface. However, I haven't seen any good commercially viable progress in practice.
  • distantskeptic: Open source CPU design is fundamentally different than open source software design. In the latter costs are extremely low - just the cost of a computer per developer. That developer's computer need not be replaced for years. There is no significant incremental cost for a software bug - just recompile and in a few minutes you're off to the races with a new executable which can be distributed over the internet for next to nothing. Contrast that with CPU design - every time a hardware bug is found you'd have to fix the design, verify it in software simulations, fabricate a new wafer, package it, install it in test hardware, and then perform hardware verification. This is 5 to 6 orders of magnitude slower and more expensive than software. Sure, corporations can perform this open CPU design, verification and manufacturing function. But in the end for a CPU to have a certain level of speed and reliability, you'd have to spend at least the same amount of money as the commercial CPU makers. Companies that produce an open source CPU chip are incurring huge monetary risks - and would have to be compensated for this risk if their chips have bugs and cannot be sold. The only way for an open source chip design to be remotely competitive would be if they were to embrace FPGA technology. But FPGAs run 4 times slower than purpose built ASICs and are at least 10 times more expensive per unit in volume.
  • snuxoll: BGP is a path-vector routing protocol, every router on the internet is constantly updating its routing tables based on paths provided by its peers to get the shortest distance to an advertised prefix. When a new route is announced it takes time to propagate through the network and for all routers in the chain to “converge” into a single coherent view. If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available. This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them. Convergence time is a known bugbear of BGP.
  • Tweedy et al.: Migration of cells through tissues and embryos is often steered by gradients of attractive chemicals in a process called chemotaxis. Cells are best at navigating complex routes, for which they use “self-generated chemotaxis” and create their own attractant gradients. An example of this is when neutrophils migrate into tissues to attack infection. Using modeling and live-cell data, Tweedy et al. found that self-generated chemotaxis allows cells to obtain surprising amounts of information about their environment. Cells of the slime mold Dictyostelium discoideum and mouse pancreatic cancer–derived cells were able to use the diffusion of attractants to identify the best route through complex mazes, even when the correct path was long and twisted, without ever entering incorrect paths.
  • Justin Pietsch: I learned how to think about scaling networks and how critical it is to make things simpler. How to think about tradeoffs around magic abstractions and understandability. Infrastructure that is magic is often too good to be true, at least when you are scaling and growing very quickly. It requires deep introspection, understanding of what happens under failure, and some great monitoring...Running out of capacity on a shared resource is about the worst sin you can perform in a network. And we [Amazon] ran out of capacity a lot.
  • @__steele: Lambda continues to impress me. I was about to throw away some broken data, but then I decided against it. I spent 40 minutes writing some code, uploaded it, pointed it at a bucket and hit go.  It downloaded, processed and reuploaded ~600GB in under two minutes. 4,000 files.
  • jrockway: I think relational databases have largely failed developers because they don't provide the features they actually need. A common question that comes up is how to do zero-downtime schema changes. The answer is that there isn't one. A correct implementation would store each schema version in the database and when an application connects, it would specify which version it's speaking. The developer would supply a mapping on how to make vX data available to a vY program. But no relational database supports such a feature, so people are forced to tread carefully -- look at all the deployment software that exists to attempt to find changes with database migrations and treat them differently. Look at all the software people have written to even apply those migrations. It's staggering, all because in the 70s when these systems were designed, the thought of deploying your code multiple times a day was unheard of. Another problem that comes up is transactional isolation. Most engineers, and even casual practitioners, "know" that transactions exist for cases where you want to perform multiple operations atomically. But very few of these people are running the transaction with an isolation level that provides those guarantees.
  • Kevin Mitchell: In short, brain is not like muscle. Bits of brain don’t just grow with experience – they mainly change by reorganising their internal connectivity. This is just as well because if the brain did continue to grow with use, all of our brains would be busting out of our skulls. I don’t mean to be too sarcastic (just the right amount), but I’ve been going around seeing things like crazy – really intensely using my visual system for many years now – without causing massive growth of my visual cortex.
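  • That π claim above is easy to sanity check. Here's a minimal sketch using Python's mpmath; the ~8.8e26 m diameter for the observable universe is an assumed round figure, and "15 digits" is treated as 16 significant digits (15 decimal places):

```python
# Hedged check of the pi-digits claim (pip install mpmath).
# Assumptions: observable universe diameter ~8.8e26 m,
# hydrogen atom diameter ~1e-10 m.
from mpmath import mp, mpf

mp.dps = 60                # work at 60 significant digits
PI = +mp.pi                # pi evaluated at that precision
DIAMETER = mpf("8.8e26")   # metres, assumed round figure

def circumference_error(digits):
    """Error in C = pi * d when pi is rounded to `digits` significant digits."""
    rounded_pi = mpf(mp.nstr(PI, digits))
    return abs(PI - rounded_pi) * DIAMETER

print(circumference_error(16))  # ~2e11 m: big error at NASA's 15 decimal places
print(circumference_error(40))  # ~1e-13 m: well below a hydrogen atom's ~1e-10 m
```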

Useful Stuff:

  • Why is it always snakes? Facebook on Optimal Workload Placement
    • A large snake once caused a short in one of Facebook's main switchboards. All the servers connected to the device went down, as did the snake. The loss of the switchboard caused cascading failures for some major services. This caused all user traffic in that datacenter to switch to other datacenters. 
    • Servers powered by one main switchboard are grouped into a fault domain. In a region there are many fault domains. The server capacity of the downed fault domain was less than 3% of the server capacity in the datacenter. Why then did it cause such a big disruption? Unfortunate workload placement. This one fault domain contained a large proportion of the capacity of some major services. Some specific services lost over 50% of their capacity, which caused cascading failures, which caused user traffic to be drained from the entire datacenter.
    • There can be many other causes of fault-domain-level failures: fire, lightning, water leaks, routine maintenance. The number of incidents will increase 9x as the number and size of regions increase. 
    • The goal is to be able to lose a fault domain without losing the whole datacenter. The key is the placement of hardware, services, and data. You can't seamlessly lose a fault domain if over 50% of the capacity for a service is in that fault domain. It's not an isolated issue. Services are poorly spread across fault domains because fault domains were not taken into consideration when placing services. Hardware was placed wherever space and power were available, and services were placed on whatever servers were available at the time. 
    • Services need to be spread better across fault domains. This is where the push for optimal placement comes in. Optimal placement means hardware, services, and data are well spread across the fault domains within a datacenter, so that if one fault domain goes down they lose as small a proportion of capacity for each hardware type and service as possible.
    • Capacity lost still means service problems, so they install buffer capacity. Buffer capacity elsewhere in the region can handle the failed-over traffic. By evenly spreading capacity they can significantly reduce the amount of buffer needed, as the back-of-the-envelope sketch below shows.
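    • A hedged sketch of why even spreading shrinks the buffer (illustrative numbers, not Facebook's): the buffer you must reserve to survive the loss of one fault domain is set by the largest share of a service's capacity sitting in any single domain, so an even spread minimizes it.

```python
# Hedged sketch: buffer needed to survive the loss of one fault domain.
# All numbers are illustrative assumptions, not Facebook's actual figures.

def buffer_needed(shares):
    """shares: fraction of a service's capacity in each fault domain.
    To survive the worst single-domain loss, you need spare capacity
    equal to the largest single share."""
    return max(shares)

skewed = [0.50, 0.20, 0.15, 0.10, 0.05]   # badly placed service
even   = [0.20] * 5                        # evenly spread across 5 domains

print(buffer_needed(skewed))  # 0.50 -> must reserve 50% extra capacity
print(buffer_needed(even))    # 0.20 -> only 20% extra capacity
```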

  • RustConf 2020 videos are now available.
    • NoraCodes: Right now is a very strange time for Rust. The language’s core set of features is done, and the team is moving on to higher-level concerns and polish points like implementing const generics (items that are generic over values rather than types) and improving support for async/await. Tons of interesting software is being written (or re-written, if you believe the memes) in Rust, from a new generation of command-line tools to highly correct game engines and web software to (ugh) blockchain. At the same time, the recent Mozilla firings shook the community in a real way. Watching team members at the conference who recorded their talks or segments before being laid off was really jarring and quite sad. Rust is a beautiful thing in the sense that it opens up the domain of systems programming in a real way, and brings a lot of the power of algebraic types and high-performance personal computing to the mainstream, but it’s also a very fragile community because it’s still so small. I hope that Rust moves forward into an era of stability, in terms of finances, contributions, features, and ecosystem. I think the existing community is doing a great job of that, and that this Rust Conf is a good example of its best properties. At the same time, the world is changing and nobody quite knows what comes next.

  • AnandTech has their typically amazing wall-to-wall coverage of Hot Chips 2020. You might like Hot Chips 2020 Live Blog: Silicon Photonics for AI

  • 2020 Eckert-Mauchly Award Lecture - Luiz André Barroso. Three lessons:
    • Don't pigeonhole yourself. Don't let ignorance of a topic stop you. Risky, but rewards can be great. Attempt career suicide every once in a while.
    • Learn to respect the obvious. The problems worth solving are obvious. Really easy to explain to anyone. If you can't explain the problem you are solving in a short sentence, it may not be important enough. Don't let go of that problem. Keep working it.
    • Detach yourself from your past, even the things you've done before that arguably have been the foundation of your success.

  • This is disappointing. Magic did not happen. It requires a completely new database stack. We [MongoDB] Replaced an SSD with Storage Class Memory. Here is What We Learned.
    • One question we wanted to answer is whether a storage device sitting on a memory bus can deliver better throughput than an equivalent storage device sitting on a PCI-e.
    • Our experiments revealed that the modest latency advantage that SCM provides over a technology-equivalent SSD does not translate into performance advantages for realistic workloads. The storage engine effectively masks these latency differences. SCM will shine when it is used for latency-sensitive operations that cannot be hidden with batching and caching, such as logging. 
    • georgewfraser: Andy Pavlo talks about this in his class at CMU. You shouldn’t expect to get better performance by running a disk-optimized storage engine on memory, because you’re still paying all the overhead of locks and pages to work around the latency of disk, even though that latency no longer exists. Instead, you have to build a new, simpler storage engine that skips all the bookkeeping of a disk-oriented storage engine.

  • @MarkSailes3: Before I worked for @awscloud I was a customer building a near real time system in Lambda, SQS, SNS, DynamoDB. There were a few moments using serverless which really woke me up to the step change in technology.
    • Security. When the Intel Spectre vulnerability hit, our team didn't have to do anything. Lambda had already been patched.
    • Agility. Wrong architectural decisions are hugely less problematic in AWS. We changed our system from using Kinesis to SQS in an afternoon.
    • Cost. For the first time we knew the total cost of our system. Which meant we could work out the cost per business function.
    • Scaling. Our system had periods of absolutely no traffic. During those periods we didn't incur any costs.
    • Ownership. DevOps is hard. AWS reduces the sum of all the different things you need to learn.

  • Facebook's Systems @Scale 2020 videos are now available. Always interesting stuff. You might like: Optimal Workload Placement; Containerizing Zookeeper: Powering container orchestration from within; Asynchronous computing @Facebook; Scaling Facebook’s Data Center Infrastructure.

  • Three Basecamp outages. One week. What happened? Redundancy usually isn't. These are not Black Swan events at all. Failovers need to be tested constantly.
    • Due to a surprise interdependency between our network providers, we lost the redundant link as well, resulting in a brief disconnect between our datacenters. This led to a failure in our cross-datacenter Redis replication when we exceeded the maximum replication buffer size, triggering a catastrophic replication resync loop that overloaded the primary Redis server, causing very slow responses. 
    • Our network links went offline, taking down Basecamp 3 Campfire chats and Pings again. While recovering from this, one of our load balancers (a hardware device that directs Internet traffic to Basecamp servers) crashed. A standby load balancer picked up operations immediately, but that triggered a third issue: our network routers failed to automatically synchronize with the new load balancer. 
    • Earlier in the morning, the primary load balancer in our Virginia datacenter crashed again. Failover to its secondary load balancer proceeded as expected. Later that morning, the secondary load balancer also crashed and failed back to the former primary. This led to the same desynchronization issue as the day before.
    • Basecamp 3 and some supporting services (login, billing, and the like) are not in the cloud, however. They’re hosted exclusively in our on-premises datacenters.

  • Julia is production ready. Videos from JuliaCon 2020 are now available.

  • Scaling is specialization. Cockroach Labs is moving away from RocksDB and to their own low level key-value store called Pebble. Eventually companies always take over the parts of the system that are their key differentiators.
    • Pebble Wins: it's Go native, so you save all the C++ compatibility-layer cruft; faster reverse iteration via backwards links in the memtable's skiplist; a faster commit pipeline that achieves better concurrency; seamless merged iteration of indexed batches; a smaller, more approachable code base (45k+ lines of code); forward compatibility with RocksDB 6.2.1; efficient range deletion support; passes the metamorphic test suite; meets or exceeds RocksDB on the 6 standard YCSB workloads.
    • RocksDB Losses: the code base has sprawled over time, growing from LevelDB’s original 30k lines of code to a current 350k+ lines; serious bugs (while the absolute number of bugs we’ve encountered in RocksDB is modest, their severity is often high); CockroachDB is primarily a Go code base; significant performance problems; features have deficiencies.

  • Impressive, but this is why a managed service is easier for most users. Just upgrading can be a major undertaking. How we upgraded PostgreSQL at GitLab.com: We needed a rollback plan that optimized our Recovery Time Objective (RTO) while keeping a 12-node cluster’s 6TB of data consistent and serving 300,000 aggregated transactions per second from around six million users.

  • Attack of the week: Voice calls in LTE. You know all those giant specs humans create? They are full of bugs. It turns out if you let an AI churn on products of the human mind, they will own you in a heartbeat. Humans have to get out of the spec writing business. It's time to let machines do what they do.

  • Evernote went for the big rewrite. Having everyone work together instead of on separate teams is the right idea. At the end they mention splitting up again, so they can release on different tracks. That's the wrong idea. They'll just get in the same mess again. Cross-functional teams work for Apple. Evernote’s CEO on the company’s long, tricky journey to fix itself
    • When Ian Small joined Evernote, he decided the only way forward was to rebuild everything almost from scratch. Eighteen months later, a new Evernote is here.
    • Still has 250 million users.
    • Evernote had five different apps run by five different teams for five different platforms, and each had its own set of features, design touches and technical issues. Internally, Evernote employees called the app's codebase "the monolith," and that monolith had grown so big and complex, it was preventing the company from shipping cross-platform features or doing much of anything in a short time. While other apps were becoming faster, more useful and more powerful, Evernote was slowly becoming a crufty, complex relic of a once-great app.
    • They would stop building new features and new products for as long as it took to fix the core of Evernote from the ground up in a way that would work better going forward. The note editor, the apps, the search, the cloud infrastructure, everything had to change. 
    • Evernote, in this new launch, will have none of those things. Small said he's been fighting the urge to bake loads of shiny new features into the first iteration of the new Evernote, and he's helping his team do the same.
    • "You know, if you go back eight years or something, the Evernote architecture made all the sense in the world!" Going forward, the team needs to continually invest in solving its tech debt problems, rather than papering over them to build something new.
    • The first step, Small said, was to get everyone working toward a single goal, rather than each split off on their own product. He reorganized the company to focus teams on features and functions, rather than platforms, in an attempt to keep everyone thinking about Evernote as a whole rather than a single platform.

  • Malware writers love Golang too. FRITZFROG: A NEW GENERATION OF PEER-TO-PEER BOTNETS

  • If you're into optimal server performance this is the story for you. VALORANT'S 128-TICK SERVERS. A fun and detailed read. 
    • Riot Games wanted to offer a free game to users, but at the current performance level it would be too costly. So their goal was to radically improve server performance so they could run more games on a server, which would make the game cost-effective enough to run. And they succeeded: a server frame took 50ms at the start, and by the end they reached sub-2ms per frame, all by looking at code optimization, hardware tweaks, and OS tunings. 
    • The most important strategy was measuring so they could see where time was being spent. They also learned it's vitally important to measure performance in a configuration that matches your production environment. That allowed them to drill down to micro optimizations. 
    • We found in many cases changing from a replicated variable to an RPC offered a 100x to 10000x performance improvement!
    • Players are just in the idle pose during the buy phase. This helped reduce costs of the animation system over the course of a round by another 33%!
    • Moving to the more modern Xeon Scalable processors showed major performance gains for our server application. We still see the effects of L3 contention but we saw roughly a 30% increase in performance, even using similar clock speeds.
    • We turned our memory access from about 50% NUMA local to 97-99% NUMA local! In addition to the ~5% performance gain, performance was much more consistent between server instances.
    • During our time monitoring the game server host, we saw an interesting pattern where cores would hover at around 90-96% usage but never reach 100%. Through our investigations, we learned that modern Linux uses the Completely Fair Scheduler (CFS). By lowering the migration cost setting to 0, we guarantee that the scheduler immediately migrates a game server that needs to run to any available core on the system. Doing this lets us make much better use of CPU resources on the system and granted another 4% performance boost (see the sketch after this list).
    • By limiting our process to the higher C-States (C0, C1 and C1E), we were able to stably host 1-3% more games. It particularly stabilized performance of 60-90% loaded servers, where the reduced workload was allowing many cores to frequently idle.
    • When we flipped hyperthreading back on, we saw performance increase by 25%. 
    • We changed to the tsc clocksource which is provided by the CPU instruction rdtsc. For our game servers, we were able to get about a 1-3% boost in performance by moving our clocksource.
    • Erlang has a configurable setting for how many threads its scheduler is allowed to spawn. The default is one thread per core on the system, hence the 72 processes. Once we set this to a more reasonable default like 4, the problem disappeared overnight. 
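    • For the curious, the CFS and clocksource tweaks above map to ordinary Linux knobs. A minimal sketch, assuming root and a kernel that still exposes sched_migration_cost_ns under /proc/sys (newer kernels moved it to debugfs); the values are the ones the article describes, not a recommendation:

```python
# Hedged sketch of the Linux tunings described above (requires root).
# Paths are the standard procfs/sysfs knobs; values follow the article.
from pathlib import Path

SCHED = Path("/proc/sys/kernel/sched_migration_cost_ns")
CLOCK = Path("/sys/devices/system/clocksource/clocksource0")

# 1. Let CFS migrate a runnable game server to any idle core immediately.
SCHED.write_text("0\n")

# 2. Switch to the TSC clocksource if the CPU offers it.
available = (CLOCK / "available_clocksource").read_text().split()
if "tsc" in available:
    (CLOCK / "current_clocksource").write_text("tsc\n")

print("migration cost:", SCHED.read_text().strip())
print("clocksource:", (CLOCK / "current_clocksource").read_text().strip())
```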

  • Costs of running a Python webapp for 55k monthly users: $172/month, covering servers and a database on DigitalOcean plus AWS, Google Cloud, DNS, and Disqus. Good HN discussion of why this costs too much money, but the developer is fine with it because they just want it done.

  • Support may be one of those things where you get what you ask for and if you don't ask they won't tell. 
    • Here's my default attitude towards support. postmodernchicken: My company [support] would cost us 200k a month for the cheapest support option. We basically only need support when it looks like there’s an AWS issue. So we don’t have a support contract.
    • Here's an example that's an experience I've never had anywhere. abakedcarrot: we had a big ai/ml project and they've spent many hours giving us demos and meetings and helping us plan out the infrastructure and pipelines for it; we get a lot of pre-release info under NDA; if we want random full-day training sessions on something they will come up with a curriculum and free labs for us; they regularly go through our infra and point out places we can save cost; they are on top of our savings plan/RI renewals; if we need help building anything I can mention it in one of our weekly or biweekly meetings and our TAM will find a person to assist us in planning; our TAM will follow up on our tickets and escalate things for us if necessary.

  • Some architecture examples. 
    • How We Built a Serverless E-Commerce Website on AWS to Combat COVID-19; Handling webhooks with EventBridge, SAM and SAR; AWS Serverless Ecommerce Platform. Not much to talk about. A lot of connecting things to other things.
    • Dream11: Scaling a Fantasy Sports Platform with 5M Daily Active Users
      • Dream11 uses Amazon Aurora with Amazon ElastiCache to serve 1 million concurrent users within 50ms response time, serving at an average 3 million requests per minute (rpm), which can surge to 3X in a 30-second time span.
      • Broke down a monolith into microservices. Traffic comes in through CloudFront, WAF is used for some security, and enters an intelligent orchestration service they call Coordinator. Coordinator routes traffic to other microservices or to Contest. Aurora is used for Contest because it can handle 1 million transactions per second and 2 million read requests per minute. Unlike MySQL, read replica lag is less than 20 ms, so they can direct write traffic to the master and read traffic to the replicas. 
      • To get around locking they use Redis to hold tokens that mediate access to the database (a hedged sketch of the pattern follows). 
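      • The write-up doesn't spell out the token mechanism, but it resembles a standard Redis lease: SET with NX and EX to acquire a short-lived token before touching a hot row, and a compare-and-delete to release it. A minimal sketch using redis-py; the key names, TTL, and contest_id parameter are invented for illustration:

```python
# Hedged sketch of a Redis-held token gating database writes.
# Pattern: SET key value NX EX (acquire) / compare-and-delete (release).
# Key names and timeouts are invented for illustration.
import uuid
import redis

r = redis.Redis()

def acquire(contest_id, ttl=5):
    token = str(uuid.uuid4())
    # NX: only set if absent; EX: expire so a crashed holder can't wedge the row.
    if r.set(f"lock:contest:{contest_id}", token, nx=True, ex=ttl):
        return token
    return None

def release(contest_id, token):
    # Delete only if we still hold the token (atomic via Lua).
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    r.eval(script, 1, f"lock:contest:{contest_id}", token)

token = acquire(42)
if token:
    try:
        pass  # write to the database (e.g. Aurora) here
    finally:
        release(42, token)
```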

  • Uber charts the evolution of their API Gateway. Designing Edge Gateway, Uber’s API Lifecycle Management Platform
    • Uber started as a not so mild mannered taxi company and has evolved into a sprawling service company attacking numerous domains at once. When that happens you have to change a few things, especially when handling 800,000 peak req/s, multiple languages, multiple message formats, 100s of teams developing in parallel, 50,000 tests, 40+ services, 1,500 engineers, 2,500 npm libraries, performance regressions in a few APIs, and an ever increasing number of business lines.
    • The third generation of their API gateway followed a tiered approach that allowed for a separation of concerns: edge layer; presentation layer; product layer; domain layer. 
    • We built an in-house system called Control Flow Framework (CFF) in Golang that allows engineers to develop complex stateless workflows for business logic orchestration in service handlers.  
    • Moving to a golang based system has improved our resource utilization and request/core metrics significantly. The latency numbers on most of our APIs have decreased significantly. 
    • Lessons: stick with a single protocol for your mobile applications and internal services. Adopting multiple protocols and serialization formats ultimately results in a huge overhead; design your gateway system to scale horizontally—a single binary would have been too large to run 1600 complex APIs; It is critical to have continuous investment as engineers transition in and out of the project; Making conscious choices on dropping support is critical for long term sustainability.

  • I'm not exactly sure what this was. Chaos Community Broadcast Episode 4: The one with John Allspaw. Humans are the adaptive part of the system because they can handle unforeseen problems. 

  • Estimating systems with napkin math with Simon Eskildsen @ Shopify.
    • The Shopify journey: on-prem to the cloud; sharding; moving shops between shards; running out of multiple regions; running out of multiple continents; running shops out of multiple regions; rewrite of store front to go from a monolith to the modular monolith. 
    • On Black Friday, Thursday is the last day changes can go in. There's a plan about what risks you can take when. For example, you can't upgrade the MySQL version the week of Black Friday. They probably won't even do that in November given the problems that could occur. It's a good internal deadline to make sure things get in. On Black Friday you can respond to things. You monitor, but don't change things. They watch dashboards all day. They try not to do anything hard all day so they have the energy to respond to any problems. They need maximum energy available. 
    • COVID was an unexpected Black Friday because people shopped online. It's not clear what that means for Black Friday this year. Will it be smaller because the steady state has gone up? Or will it be larger?
    • If you can't do the napkin math, it's probably too early to build the system. This is called Programming through the Wall: you keep writing code when you should step back, think about the system, and learn about it.
    • It's worse when a database is slow than when it's down. A slow DB can clog up queues and cause cascading failures.
    • A big success has been load shedding. When a system becomes overloaded you want to start prioritizing the type of traffic you accept. If a store is experiencing a DDoS attack, that traffic should be dropped before the traffic of other merchants. They classify traffic at the edge so the load balancer can prioritize it, giving merchants as much uptime as possible (see the first sketch after this list). 
    • They want to provide load shedding at the database level, but databases do not provide this kind of functionality. A lot of companies these days are multi-tenant SaaS companies, but databases do not directly support this model. You should be able to restrict the effect one customer has on another customer, and you should be able to prioritize traffic by merchant. By default, the way databases are designed does not support multi-tenancy at all. 
    • Competitive programming is a good start to a programming career. That was his motivation for understanding napkin math in the first place. You need to implement the correct solution to a problem the first time. So you need to make estimates to judge how well your solution will score.
    • Napkin math is a way of understanding how a system should perform from first principles. Understanding is built from the bottom up. This is useful when designing systems and during tech reviews. When someone says maybe we could do X, instead of taking two weeks to build a prototype, you can run a thought experiment, make some basic assumptions, run the calculations, and see if the solution is at least in the ballpark of working. This can save a lot of time traveling down dead ends.
    • The Napkin Math Newsletter talks about these kinds of problems. For example, how many transactions can MySQL fundamentally do every second? Think about it from the bottom up. Parsing a SQL query is probably fast. The ACID guarantees require appending each insert to the end of a file, which means the file must be flushed to disk. A flush/fsync takes about 1 msec, so the maximum should be around 1,000 transactions per second. 
    • But MySQL can really do about 7K tps. This is the First Principles Gap: your simple bottom-up model of how the system works is not correct. What don't you know? It turns out MySQL batches transactions so they are applied during the same fsync (the second sketch after this list works through the numbers). 
    • Also, on multi-tenant architectures: Episode 45: Multitenant Architecture with Ian Varley
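The talk describes merchant-priority load shedding conceptually rather than in code; here's a minimal sketch of the idea, assuming invented priority classes and utilization thresholds (this is not Shopify's implementation):

```python
# Illustrative sketch of priority-based load shedding at the edge, in the
# spirit of what's described above (classes and thresholds are invented).
from enum import IntEnum

class Priority(IntEnum):
    CHECKOUT = 0                 # shed last
    STOREFRONT = 1
    SUSPECTED_DDOS = 2           # shed first

def classify(request: dict) -> Priority:
    # A real classifier would use rate history and per-merchant state.
    if request.get("suspected_ddos"):
        return Priority.SUSPECTED_DDOS
    if request.get("path", "").startswith("/checkout"):
        return Priority.CHECKOUT
    return Priority.STOREFRONT

def admit(request: dict, current_load: float, capacity: float) -> bool:
    """Drop the lowest-priority traffic first as utilization climbs."""
    utilization = current_load / capacity
    if utilization > 0.95:
        return classify(request) == Priority.CHECKOUT
    if utilization > 0.80:
        return classify(request) <= Priority.STOREFRONT
    return True
```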
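And the MySQL napkin calculation worked through, using the talk's rough figures (~1 ms per fsync, ~7K observed tps):

```python
# The talk's MySQL napkin math. Figures are rough, not measurements.
fsync_latency_s = 1e-3                    # ~1 ms per fsync
naive_tps = 1 / fsync_latency_s           # one fsync per transaction
print(f"naive estimate: {naive_tps:,.0f} tps")          # ~1,000

# Observed reality is ~7,000 tps: the First Principles Gap. Group commit
# explains it: MySQL batches several transactions into one fsync.
observed_tps = 7_000
print(f"implied batching: ~{observed_tps / naive_tps:.0f} txns per fsync")
```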

  • Inside TikTok's killer algorithm. Nothing unexpected. Maybe TikTok users just generate better content?

  • Shopify has an article on How Shopify Reduced Storefront Response Times with a Rewrite. Like Evernote it did a rewrite, but they only rewrote part of the system, not the whole thing.
    • Across all shops, average server response times for requests served by the new implementation are 4x to 6x faster than the legacy implementation.
    • The Rails monolith still handles checkout, admin, and API traffic, but storefront traffic is handled by the new implementation.
    • Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith: it has much stricter performance requirements and can accept more complexity implementation-wise to improve performance, whereas other components (such as payment processing) need to favour correctness and readability.
    • Designing the new storefront implementation from the ground up allowed us to think about the guarantees we could provide. They decided to build the new implementation on top of an active-active replication setup. As a result, the new implementation always reads from dedicated read replicas, improving performance and reducing load on the primary writers.
    • By rebuilding and extracting the storefront-related code in a dedicated application, we took the opportunity to think about building the best developer experience possible: great debugging tools, simple onboarding setup, welcoming documentation, and so on.
    • With improving performance as a priority, we work to increase resilience and capacity in high load scenarios (think flash sales: events where a large number of buyers suddenly start shopping on a specific online storefront), and invest in the future of storefront development.
    • Optimizing Data Access Patterns. The new implementation uses optimized, handcrafted SQL multi-select statements maximizing the amount of data transferred in a single round trip. We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries (see the first sketch after this list).
    • Reducing Memory Allocations. We reduce the number of memory allocations as much as possible so Ruby spends less time in garbage collection. We use methods that apply modifications in place (such as #map!) rather than those that allocate more memory space (like #map).
    • Implementing Efficient Caching Layers. We implemented various layers of caching throughout the application to reduce expensive calls. Frequent database queries are partitioned and cached to optimize for subsequent reads in a key-value store, and in the case of extremely frequent queries, those are cached directly in application memory to reduce I/O latency. Finally, the results of full page renders are cached too, so we can simply serve a full HTTP response directly from cache if possible (the second sketch after this list shows the layering idea).
    • To ensure that the new implementation responds well to flash sales, we implemented and tweaked two mechanisms. The first one is an automatic scaling mechanism that adds or removes computing capacity in response to the amount of load on the current swarm of computers that serve traffic. If load increases as a result of an increase in traffic, the autoscaler will detect this increase and start provisioning more compute capacity to handle it. Additionally, we introduced an in-memory cache to reduce load on external data stores for storefronts that put a lot of pressure on the platform’s resources.
    • When an external data store isn’t available, we don’t want to serve buyers an error page. If possible, we’ll try to gracefully fall back to a safe way to serve the request. 
    • We implemented circuit breakers on external datastores using Semian, a Shopify-developed Ruby gem that controls access to slow or unresponsive external services, avoiding cascading failures and making the new implementation more resilient to failure.
    • Similarly, if a cache store isn’t available, we’ll quickly consider the timeout as a cache miss, so instead of failing the entire request because the cache store wasn’t available, we’ll simply fetch the data from the canonical data store instead.
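The post doesn't show its handcrafted queries, but the N+1-versus-single-round-trip idea can be sketched with a sqlite3 stand-in for Shopify's MySQL (schema and data are invented for illustration):

```python
# Sketch of the N+1 problem and the single-round-trip fix.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, shop_id INTEGER, title TEXT);
    INSERT INTO products VALUES (1, 10, 'mug'), (2, 10, 'hat'), (3, 10, 'pin');
""")
product_ids = [1, 2, 3]

# N+1 pattern: one query per product, so N round trips to the database.
slow = [db.execute("SELECT title FROM products WHERE id = ?", (pid,)).fetchone()
        for pid in product_ids]

# Handcrafted multi-select: one round trip fetches everything needed.
placeholders = ",".join("?" * len(product_ids))
fast = db.execute(
    f"SELECT id, title FROM products WHERE id IN ({placeholders})",
    product_ids,
).fetchall()
```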
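And a generic sketch of the cache layering plus timeout-as-cache-miss behavior described above (not Shopify's code, and it deliberately avoids guessing at Semian's API; the shared-cache and database calls are caller-supplied stand-ins):

```python
# Layered reads with "timeout == cache miss" semantics.
import time

local_cache = {}        # in-process layer: fastest, per-instance
LOCAL_TTL_S = 1.0

def fetch(key, shared_cache_get, db_get):
    """shared_cache_get and db_get are stand-ins for the key-value store
    and the canonical database."""
    hit = local_cache.get(key)
    if hit and hit[1] > time.monotonic():
        return hit[0]
    try:
        value = shared_cache_get(key, timeout=0.01)
    except Exception:
        value = None                     # timeout or outage is just a miss
    if value is None:
        value = db_get(key)              # fall back to the canonical store
    local_cache[key] = (value, time.monotonic() + LOCAL_TTL_S)
    return value
```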

  • Episode #65: Serverless Transformation at AWS with Holly Mesrobian. I think Holly has a future in politics. She did recommend having one microservice per account.

  • Even in the cloud, spikes can cause a backlog unless the underlying system can handle the load. RDS may surprise you in a bad way. What happened to our infrastructure when a customer got over 10 million page views in a few hours?
    • I woke up at 6 AM to a 1.5 million job backlog in our queue and immediately jumped out of bed.
    • One of our customers, who were no strangers to viral content, had launched a new project that had gone unbelievably viral. 
    • But why was the queue piling up? We’d built Fathom to handle infinite scale, running on serverless infrastructure, auto-scaling as needed. This didn’t make any sense.
    • We’re moving from MySQL to DynamoDB. I don’t like servers. I don’t want to have to worry about IOPS, slow scaling and all that nonsense. The fixed MySQL instance was one of the only things in our infrastructure that wasn’t fully serverless, and I don’t like that.
    • Stats are getting moved to a managed Elasticsearch service
    • We’re going to be introducing Pusher. We’re moving onto WebSockets for Version 3. This will mean that instead of queuing up a pageview & reading from the database, we will send data (stripped of any PII, of course) to Pusher and then the WebSocket will read it back. 

  • Do you absolutely need Kubernetes? Prepare for a transformation of the soul. 3 Years of Kubernetes in Production–Here’s What We Learned
    • Kubernetes eventually made our lives easier, but the journey was a hard one, a paradigm shift. There was a complete transformation in not just our skillset and tools, but also our design and thinking. We had to embrace multiple new technologies and invest massively to upscale and upskill our teams and infrastructure.
    • Today, if we have to choose Java, we ensure that it’s version 11 or above. And our Kubernetes memory limits are set to 1GB on top of JVM max heap memory (-Xmx) for headroom. That is, if the JVM uses 8GB for heap memory, our Kubernetes resource limits for the app would be 9GB.
    • Building and running the cluster is the relatively easy part; lifecycle maintenance is a whole new game with multiple moving parts.
    • Be prepared to redesign your entire build and deployment pipelines. Our build process and deployment had to go through a complete transformation for the Kubernetes world. 
    • We learned that exposing services using static external IP takes a huge toll on your kernel’s connection tracking mechanism. It simply breaks down at scale unless planned thoroughly.
    • Kubernetes transformation is not cheap. The price you pay for it must really justify ‘your’ use case and how it leverages the platform. If it does, then Kubernetes can immensely boost your productivity.

  • Things I Learned to Become a Senior Software Engineer. Thoughtful list and good HN discussion.

  • If you're going to shard you probably want it to work like this. Teams should use one shard manager rather than develop their own. But to do that it must be flexible and full featured. Facebook on Scaling services with Shard Manager (video). Convergent evolution is a common pattern in large organizations.
    • to the best of our knowledge, we are the only generic sharding platform in the industry that achieves wide adoption at our scale. Shard Manager manages tens of millions of shards hosted on hundreds of thousands of servers across hundreds of applications in production.
    • Over the years, hundreds of sharded applications have been built or migrated onto Shard Manager, totaling upper tens of millions of shard replicas on upper hundreds of thousands of servers with historical hypergrowth, as shown in the figure below.
    • use cases vary significantly in both complexity and scale, ranging from simple counter service with high tens of servers to complex, Paxos-based global storage service with tens of thousands of servers
    • Various factors contribute to the wide adoption. First, integrating with Shard Manager means simply implementing a small, straightforward interface consisting of add_shard and drop_shard primitives. Second, each application can declare its reliability and efficiency requirements via intent-based specification. Third, the use of a generic constrained optimization solver enables Shard Manager to provide versatile load-balancing capability and easily add support for new balancing strategies. Last but not least, by fully integrating into the overall infrastructure ecosystem, including capacity and container management, Shard Manager supports not only efficient development but also safe operation of sharded applications, and therefore provides an end-to-end solution, which no similar platforms provide.
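The real Shard Manager SDK isn't public, but since the post names the integration surface as add_shard and drop_shard primitives, a sketch of what an application might implement could look like this (the class shape and storage helpers are guesses, not Facebook's API):

```python
# Hypothetical sketch of the add_shard/drop_shard integration surface.
def load_state_from_storage(shard_id):
    return {}    # stand-in: read the shard's data from durable storage

def flush_state_to_storage(shard_id, state):
    pass         # stand-in: persist the shard's data before handoff

class CounterService:
    """A sharded app; the platform decides which shards each server owns."""

    def __init__(self):
        self.shards = {}   # shard_id -> in-memory state for that shard

    def add_shard(self, shard_id: str) -> None:
        # Called when this server is assigned a shard: load it and serve it.
        self.shards[shard_id] = load_state_from_storage(shard_id)

    def drop_shard(self, shard_id: str) -> None:
        # Called before the shard moves elsewhere: flush and release it.
        flush_state_to_storage(shard_id, self.shards.pop(shard_id))
```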

  • Should you or shouldn't you put everything in the database first? It makes life a lot simpler when you do. The Automated CIO: We also get the freedom of caching all our data in a database that we own so we can access it even when services are down. Previously, we called out to each service’s API for every script, bot, or whatever, which can get expensive, slow, and potentially be riddled with rate limits, or worse, downtime. Also, Building storage-first serverless applications with HTTP APIs service integrations

  • Building a Multiplayer Game with API Gateway+Websockets, Go and DynamoDB. Nice because it contains code, design motivation, SAM files, and cost analysis: for $1, you can play about 6,000 games. For a lot of use cases, this is going to be far more attractive than even the cheapest, smallest t-type instance. A t3a.nano costs $3.38/mo plus whatever you spend on EBS, and that only includes a 1h 12m burst. So if you were hosting over 20,000 games a month, you might be able to save some money by doing a tiny instance – assuming it didn't go down or run out of resources during a period of load. I love how dependable the serverless stack is.  I was working on this app off and on for a few months, and then left it in the corner since about May. It was impressive, but not surprising, to be able to open the web app last night and have it be working perfectly after being left alone for 4 months – accruing exactly $0/month in costs.
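Running the article's own figures as a quick break-even check (the numbers are the ones quoted above):

```python
# Break-even arithmetic using the article's figures.
cost_per_game = 1 / 6_000        # ~$1 per 6,000 games on the serverless stack
t3a_nano_monthly = 3.38          # $/month for the small always-on instance

breakeven = t3a_nano_monthly / cost_per_game
print(f"break-even: ~{breakeven:,.0f} games/month")   # ~20,000
```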

  • Fast can make customers furious. Production testing with dark canaries.
    • it is not uncommon for new code to be committed and pushed to production within an hour, multiple times a day. While this has been incredibly successful—it has improved developer productivity by allowing us to make changes more quickly—it has also caused problems such as site or service outages when bad code, configuration, or AI models have been pushed to production. Even with complete unit and integration tests, sometimes there is no substitute for production-level testing using both production data and production volume, especially with regards to system performance metrics, such as memory consumption, CPU/GPU usage, concurrency, and latency.
    • A “dark” canary is an instance of a service that takes duplicated traffic from a real service instance, but where the response from the dark canary is discarded by default. This means the end user is never impacted even if something goes wrong in the dark canary, such as errors, higher latency, higher CPU/memory consumption, etc. Only read requests without side effects, such as tracking events that might pollute business metrics, should be duplicated. For instance, if there is a request to read a profile, a side effect would be to write to a database counting the number of profile reads. You could mitigate this side effect by detecting that this was a dark canary request and using a separate database counter, or not writing to the database in this case. 
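A minimal sketch of the pattern as described, assuming an invented canary host and helper predicates (this is illustrative, not LinkedIn's infrastructure):

```python
# Duplicate side-effect-free reads to a dark canary and discard its
# response, so users only ever see the production reply.
import threading
import urllib.request

DARK_CANARY = "http://dark-canary.internal:8080"   # hypothetical host

def is_side_effect_free(path: str) -> bool:
    return path.startswith("/read/")   # hypothetical predicate: reads only

def production_backend(path: str) -> str:
    return "real response"             # stand-in for the actual service

def mirror_to_dark_canary(path: str) -> None:
    def fire_and_forget():
        try:
            urllib.request.urlopen(DARK_CANARY + path, timeout=1).read()
        except Exception:
            pass   # canary failures must never affect the real request
    threading.Thread(target=fire_and_forget, daemon=True).start()

def handle_request(path: str) -> str:
    if is_side_effect_free(path):
        mirror_to_dark_canary(path)    # response is discarded by default
    return production_backend(path)
```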

  • Dropbox on Keeping sync fast with automated performance regression detection
    • With all the work above, we were able to get the variability down from 60% in the worst case to less than 5% for all of our tests. This means we can confidently detect regressions of 5% or greater (a simplified version of the detection check appears after this list).
    • Caught regressions: 
      • Slowdown coming from using a third party hash map implementation, which internally used a weak hashing scheme. This led to correlated entries getting bucketed together and thereby increasing insertion and lookup cost from constant time to linear time.
      • Using unbounded concurrency leading to a ~50% regression in sync duration coupled with excessive memory and CPU usage
      • A sequence of events involving iterating on a set (seeded with an RNG) and removing the element from a map (using the same seed) led to inefficient removal, triggering collisions when the map was resized, ultimately yielding high latency lookups.
      • Using incompatible window and frame sizes for high-bandwidth, high-latency connections when we first switched to a gRPC-powered network stack led to request cancellations under high load
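Once variability is below 5%, the detection step itself reduces to a threshold comparison; a simplified illustration (not Dropbox's actual system, and the run times are invented):

```python
# Flag any candidate whose mean sync time is 5% or more above baseline.
from statistics import mean

REGRESSION_THRESHOLD = 0.05    # 5%, matching the variability bound above

def is_regression(baseline_runs, candidate_runs) -> bool:
    base, cand = mean(baseline_runs), mean(candidate_runs)
    return (cand - base) / base >= REGRESSION_THRESHOLD

print(is_regression([10.1, 10.3, 9.9], [11.0, 11.2, 10.9]))   # True, ~9% slower
```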

  • How AWS Customers Are Using Local Zones
    • Roberts said customers who combine Local Zones with AWS Direct Connect are achieving sub-1.5ms latency between their AWS infrastructure and applications in on-premises data centers in the LA metro area.
    • Edge computing can perform “data thinning” to distill large datasets down to smaller files to be sent across the network for review. In TV and film production, that means transcoding, which converts large files into a format more suitable for digital transport. Edge computing can bring processing power onto remote sets, allowing transcoding to take place on location.
    • “We are shooting content at 8K, and that’s about 12.5 gigabytes a minute compressed,” said Dave Temkin, Vice President of Networks at Netflix, in a January presentation at PTC. “The ability to transcode something quickly on location, to go from 8K down to a daily – which doesn’t even need to be HD – that  you can get off the set or off location very quickly, that’s really important to us.”

  • WhatsApp:
    • WhatsApp scaled to 900M users with 50 engineers
    • Erlang strength: very efficient architecture
    • Core design hasn’t changed in 8 years
    • 2B+ users, multi-data centers, containers
    • Developer productivity for larger teams becomes critical, not merely important.
    • Things that help optimize development cycle matter a lot.
    • Working on making Erlang statically typed.

  • Engineering For Failure. Basic ways Riskified handled failures in order to provide maximal uptime and optimal service to customers:
    • Retrying. Enough said.
    • Prefetching — Fail outside of the main flow. In cases of fairly static data, we can easily pre-fetch all (or some) user details from the user service in a background process.
    • Best efforting. In some cases, we should just embrace failure, and continue processing without the data we were trying to get.
    • Falling back to previous or estimated results. In some cases, you may be able to use previous results or sub-optimal estimations to handle a request while other services are unavailable.
    • Delaying a response. If the business of the product allows it, it’s possible to delay the processing of the request until the problem with the external resource is solved.
    • Implement simplified fallback logic. One of the solutions we devised for such critical external resources, is to use “simplified” in-process versions of them. In other words, we’re re-implementing a simplified version of the external service as a fallback within our service, so that in the event the external service fails, we still have some data to work with, and can successfully process the request. 
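A compressed sketch combining three of these patterns (retrying with backoff, falling back to a previous result, and best-efforting), with the external call supplied by the caller; this is illustrative, not Riskified's code:

```python
# Retry the external call, fall back to the last known result, and
# finally proceed best-effort without the data.
import time

last_known = {}   # user_id -> last successful response ("previous result")

def get_user_details(user_id, fetch, retries=3):
    for attempt in range(retries):
        try:
            result = fetch(user_id)            # external user-service call
            last_known[user_id] = result
            return result
        except Exception:
            time.sleep(0.05 * 2 ** attempt)    # retrying, with backoff
    if user_id in last_known:
        return last_known[user_id]             # fall back to previous result
    return None                                # best effort: proceed without
```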

  • Why We Chose a Distributed SQL Database to Complement MySQL
    • Previously, we used MySQL as our backend database. But as our data size sharply increased, a standalone MySQL database had limited storage capacity and couldn't provide services for us.
    • TiDB is an open-source, distributed SQL database built by PingCAP and its open-source community. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. It's a one-stop solution for both Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) workloads. 
    • How we use TiDB at VIPKid: Working with large amounts of data and highly concurrent writes; Multi-dimensional queries for sharded core applications; Data life cycle management; Real-time data analytics
    • The system's current peak workload is about 10,000 transactions per second (TPS), and about 18 million rows of data are added each day. The table only retains data for the last two months; currently, a single table has about 1.2 billion rows.
    • Thanks to TiDB's support for multi-dimensional SQL queries on sharded tables, we don't need to worry about multi-dimensional queries across shards.
    • Deploying a new cluster with TiFlash has reduced our overall costs by 35%.
    • When I queried a single TiFlash node, it took only 10 seconds. If I added more TiFlash nodes, the speed would further improve. 

Soft Stuff:

Vid Stuff:

Pub Stuff:

  • Murat has a lot of stuff for you to read. My Distributed Systems Seminar's reading list for Fall 2020
  • Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design: We find that a large fraction of DRAM errors in the field can be attributed to hard errors and we provide a detailed analytical study of their characteristics
  • WebRTC for the Curious: This book was created by WebRTC implementers to share their hard-earned knowledge with the world. WebRTC for the Curious is an Open Source book written for those that are always looking for more. This book doesn’t settle for abstraction.
  • Patterns of Distributed Systems: What follows is a first set of patterns observed in mainstream open source distributed systems. I hope this set of patterns will be useful to all developers.
  • Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes: We introduce Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex ethical situation, often posing moral dilemmas, paired with a distribution of judgments contributed by the community members. Our dataset presents a major challenge to state-of-the-art neural language models, leaving significant room for improvement. However, when presented with simplified moral situations, the results are considerably more promising, suggesting that neural models can effectively learn simpler ethical building blocks.
  • Scalability! But at what COST?: We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.
  • Effectively Prefetching Remote Memory with Leap: Using Leap to fix the system’s data paths provided single-microsecond latency at worst in 95% of tasks, with an average of 5 or 6 microseconds. Including the prefetcher in their tests, Leap provided sub-microsecond latency, or latency in nanoseconds, on 85% of tasks.