Stuff The Internet Says On Scalability For March 27th, 2020

Hey, it's HighScalability time!

Awesome explanation of how to build a PID controller to fly a rocket! (BPS.space via Orbital Index)
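A PID controller computes its correction from the proportional, integral, and derivative of the error between a setpoint and a measurement. Here's a minimal sketch; the gains and the toy gimbal-angle loop are made up for illustration, not taken from the video:

```python
class PID:
    """Minimal discrete PID controller (illustrative only)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                  # accumulated error
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy closed loop: steer a made-up gimbal angle back to zero.
pid = PID(kp=2.0, ki=0.1, kd=1.5, dt=0.02)
angle, rate = 5.0, 0.0
for step in range(500):
    u = pid.update(setpoint=0.0, measurement=angle)
    rate += u * 0.02          # pretend the correction is an angular acceleration
    angle += rate * 0.02
    if step % 100 == 0:
        print(f"t={step * 0.02:4.1f}s  angle={angle:7.3f}")
```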

Do you like this sort of Stuff? Without your support on Patreon this kind of Stuff can't happen. You are that important to the fate of the intelligent world.

Know someone who wants to understand the cloud? I wrote Explain the Cloud Like I'm 10 just for them. On Amazon it has 103 mostly 5 star reviews. Here's a recent authentic unfaked review.

Number Stuff:

  • 667%: spike in malicious phishing emails exploiting concerns over COVID-19 since the end of February
  • 1,500,000,000,000,000,000: Folding@Home Reaches Exascale Operations Per Second for COVID-19.
  • 88: (44%) of the 200 cities we analyzed have experienced some degree of network degradation over the past week compared to the 10 weeks prior. However, only 27 (13.5%) cities are experiencing dips of 20% below range or greater.
  • 90%: of all ZipRecruiter-advertised jobs that required advanced-A.I. skills are in California, Washington, New York and Massachusetts.
  • 1 exabyte: customer data stored by Backblaze.
  • 44 million: use Microsoft Teams. It grew by 12 million users in one week. Slack has 12 million users. Slack added 7k customers in 7 weeks.
  • 70%: ride reduction at Uber.
  • $5.7 billion: invested in space startups in 2019. 63% increase over the $3.5 billion invested in 2018.
  • $483 million: Samsung Electronics's AWS spend.  And they own Joyent.
  • 1 billion: eBay HDFS file system objects.
  • 7%: COVID related reduction in Overcast usage on weekdays. 18% on weekends.
  • 20 million: record number of active users playing games on Steam.
  • 6,100: the number of reported open-source vulnerabilities, up from 4,100 last year.
  • -455: temperature in space (degrees Fahrenheit).

Quotable Stuff:

  • @dabit3: In light of the Coronavirus outbreak, many companies are moving away from REST and GraphQL and back to SOAP
  • @benthompson: The tech industry warned about the impact of the coronavirus in January, closed its offices in WA and CA in early March (to great effect), and has enabled millions to work from home with basically zero hiccups.
  • @bencurthoys: Hey @AWS, any plans to let people cancel reserved instances in the current pandemic crisis? No one can use my services, but I'm still paying you full whack for them whether I shut the servers down or not.
  • Google: Affected customers may experience elevated error rates that surface across multiple Google Cloud Platform services. Currently known products that are impacted are: Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud MemoryStore, Cloud Spanner, Cloud Storage, Cloud Composer, Cloud Dataproc, Cloud KMS, Cloud Container Registry, Compute Engine, Cloud IAM, Cloud SQL, Firebase Storage, Cloud Healthcare API, Cloud AI, Firebase Machine Learning, Data Catalog and Cloud Console.
  • Verizon CEO: Web traffic spiked 20% in one week amid coronavirus shutdown
  • Corey Quinn~ In the world of cloud you aren't billed for what you use, but rather what you forget to turn off.
  • @benedictevans: Mobile traffic down slightly [in the UK] (people are using WiFi at home) and UK roaming traffic is down 55% in the last 5 days.
  • @benedictevans: MTN (250m mobile users, mostly in Sub-Saharan Africa) ~50% smartphone penetration (entry price is now $20) But only ~40% of the base are active data users.  Average data use is 2.8 gig/month, growing 46% Y-on-Y
  • @AmyZenunim: i'm programming in Go now it feels like they've mashed the worst parts of C and Javascript syntax together, with the package dependency hell of early-2010s Ruby, and the elitist community of early-2000s Linux fanboys. oh, and Google owns it
  • Mendel Rosenblum: a colleague, Balaji Prabhakar, and I have been working on a new clock sync algorithm that can synchronize clocks down into the single-digit nanoseconds
  • tptacek: As a threshold concern, a 2020 web security class needs to be teaching about SSRF, the most important current web bug class. OAuth flows would be another thing I'd hope to see covered.
  • @QuinnyPig: An all-upfront reserved instance for a db.r5.24xlarge Enterprise Multi-AZ Microsoft SQL server in Bahrain is $3,118,367. I challenge you to find a more expensive single @awscloud API call.
  • ZEYNEP TUFEKCI: The phrase flatten the curve is an example of systems thinking. It calls for isolation and distancing not because one is necessarily at great risk from COVID-19, but because we need to not overwhelm hospitals with infections in the aggregate. Also, R0 is not a fixed number: If we isolate ourselves, infectiousness decreases. If we keep traveling and congregating, it increases. Flattening the curve is a system’s response to try to avoid a cascading failure, by decreasing R0 as well as the case-fatality rate by understanding how systems work.
  • blowski: In my experience, many websites are an extension of their CEO's ego. They don't have the money to build a website that is both unique and usable, but there's no way they want to look like everyone else - so they drop the requirement to be usable.
  • @houlihan_rick: Avoid slow transactions in Lambda by setting @DynamoDB connect timeout to 100ms. Default is 60 secs, and Lambda functions can hang for awhile waiting if a connection request gets dropped en route to the table. I have seen this twice lately, dropping connect timeout fixed both. (A hedged config sketch follows at the end of this list.)
  • @cliff_click: Actually, GC CAN be slower than malloc.  In high churn rate apps, malloc can be recycling memory in-cache whereas GC typically burns thru a generation before repeating addresses... all out of cache.  I can totally show ~5x speedups on streaming java using object pools vs GC. :-(
  • mgraczyk: No, having worked on explore sourcing and ranking I can tell you with certainty Instagram does not do the same thing [suppress ugly people].
  • @rafalwilinski: Short story why #GraphQL is not always a great choice for Single-Table Design in @dynamodb - you can't really predict all access patterns. Our devs started requesting _everything_ on login with 7 or even 8 levels of nesting.
  • Alan Kay: The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free? The Web, in comparison, is a joke. The Web was done by amateurs.
  • Memory Guy: What this leads us to is an expectation that 2020 will be a down year in the chip market, yet this is something that Objective Analysis was already predicting based on excessive capital spending in 2018.  The anticipated CapEx-driven oversupply will be accompanied by a demand downturn that will cause more immediate damage to semiconductor revenues. This situation will not last.  Since demand is likely to rise back to the trend line, then the future shortage that we have already been predicting is likely to happen on time, driven by insufficient capital spending.  The net impact of COVID-19 will be to cause an earlier downturn in 2020 than would have otherwise occurred, but the impact is unlikely to go beyond that.
  • f_fat: I’ve reimplemented a JSONEncoder and JSONDecoder in pure Swift. That means no third party libraries, no use of Foundation. The encoding/decoding is about 1.5-2x faster on macOS and 8-10x faster on Linux.
  • antoineMoPa: There is another [open source] model: - Write some software and share it in 1970 - Now everyone is using it without knowing it
  • Arthur Holland Michel: ARGUS had 1,854,296,064 pixels, enough imaging power to spot an object six inches wide from an altitude of 25,000 feet in a frame twice the width of Manhattan. It generated 27.8 gigabytes of raw pixel data, enough to fill six DVDs, every second. Downloading the raw data in real time would require an internet connection 16,000 times faster than the fastest wireless internet service available in the United States in 2017. Just processing all the pixels put the Xbox-styled computer’s 33,000 processing elements through 70 trillion operations each second.
  • @ZLevyMD: We are now a monolith.  We started with a medical ICU, surgical ICU, cardiac ICU, neurosurgical ICU, and a cardiothoracic ICU, plus a dozen mixed specialty floors.  Every floor and unit is becoming a COVID unit.  There is no more specialization — we’re all treating one thing.
  • @mipsytipsy: I am convinced there are many, many (most?) software companies out there with 2x, 3x, 4x+ the headcount they really need to build and support their core product. But they never understood their breaky, flaky systems, so they had to plaster over the problems with people.
  • @bodil: There is no emotion, there is peace. There is no passion, there is serenity. There is no ping, there is panic. There is no response from the DHCP server, there is despair.
  • Wisen Tanasa: Lock-in cost = Migration cost - Opportunity Gain
  • @mjasay: love @timbray's comment: "In AWS engineering, we develop stuff and we operate stuff. I think the second is more important" (But even if not *more* important, it's at least equally important)
  • Lock Picking Lawyer~ I see this paradigm so often. It seems like every lock maker met up 20 years ago and came up with this terrible design and agreed that would be the industry standard.
  • nikhilsimha: Used to work at fb in an infra team. Their abstraction for job scheduling (Tupperware) is about 5 years behind - something like Borg or EC2/EMR. Something like that is a fundamental reason why fb can’t do cloud as it is right now. Plus, the infra teams do operate like product - impact at all costs. Which to most management translates to short-term impact over technical quality. It would be a true 180 in terms of eng culture if they could pull off a cloud platform. An example is how they bought Parse and killed it, while firebase at google is doing extremely well. Having said all that, I think focusing on impact over technical quality is probably the right business decision for what they were trying to do at the time - drive engagement and revenue.
  • SABINE HOSSENFELDER: The core idea of Superdeterminism is that everything in the universe is related to everything else because the laws of nature prohibit certain configurations of particles (or make them so unlikely that for all practical purposes they never occur). If you had an empty universe and placed one particle in it, then you could not place the other ones arbitrarily. They’d have to obey certain relations to the first. This universal relatedness means in particular that if you want to measure the properties of a quantum particle, then this particle was never independent of the measurement apparatus. This is not because there is any interaction happening between the apparatus and the particle. The dependence between both is simply a property of nature that, however, goes unnoticed if one deals only with large devices. If this was so, quantum measurements had definite outcomes—hence solving the measurement problem—while still giving rise to violations of Bell’s bound. Suddenly it all makes sense!
  • Mikael Ronstrom: Using numbers produced already with MySQL Cluster 7.6.10 we have shown that NDB Cluster is the world's fastest Key-Value store using the Yahoo Cloud Serving Benchmark (YCSB) Workload A. We reached 1.4M operations using 2 Data Nodes and 2.8M operations using a 4 Data Node setup. All this using a standard JDBC driver.
  • Marc Andreessen: SpaceX and Tesla were not lean startups. They were very big, ambitious. They raised a lot of money. The big question, the question I’m noodling around, is what about the efforts where you have to say, “This thing is going to take $300 million?” It just is. There’s no shortcut and there’s no minimum viable product. It is going to take $300 million, and that $300 million has to be reserved ahead of time.
  • Geoff Huston: The situation points to the uncomfortable conclusion that as far as the security of the Internet is concerned, we are placing undue reliance on a security framework that at best offers same week service in a nanosecond world
  • FireEye: Beginning this year, FireEye observed Chinese actor APT41 carry out one of the broadest campaigns by a Chinese cyber espionage actor we have observed in recent years. Between January 20 and March 11, FireEye observed APT41 attempt to exploit vulnerabilities in Citrix NetScaler/ADC, Cisco routers, and Zoho ManageEngine Desktop Central at over 75 FireEye customers. Countries we’ve seen targeted include Australia, Canada, Denmark, Finland, France, India, Italy, Japan, Malaysia, Mexico, Philippines, Poland, Qatar, Saudi Arabia, Singapore, Sweden, Switzerland, UAE, UK and USA. The following industries were targeted: Banking/Finance, Construction, Defense Industrial Base, Government, Healthcare, High Technology, Higher Education, Legal, Manufacturing, Media, Non-profit, Oil & Gas, Petrochemical, Pharmaceutical, Real Estate, Telecommunications, Transportation, Travel, and Utility.
  • Stanislas Dehaene: Most of the computation in the brain is unconscious. Whether it is face recognition or word recognition or understanding the meaning of a sentence, this is something whose end result is conscious but the process is not. Learning is the construction of a mental model.
  • ridiculous_fish: Xoogler here, circa 2015. My reaction is that this is a very google3 (i.e. web services) centered book. But Google contains multitudes and it feels wrong to ignore them. For example the book has a section called "How Code Review Works At Google." And it goes on to describe strictly the google3 process. But Chrome, ChromeOS, GoogleX, others have different processes. If Google has a proven model, why do so many of its projects deviate from it? During my time there, the Android team was recruiting internally, advertising "come work on Android, we don't require Readability." It was seen as an internal competitive advantage to reject these processes! What does that say about how they are perceived internally? I speculate that Android and Chrome and others have distinct processes for a good reason, and that the book is unknowingly slanted towards web-service style engineering.
  • Sacha Altay: Mercier and Sperber say that reason is a tool that evolved to solve particular problems related to communication, like evaluating information provided by others, convincing family or tribe members with arguments, and justifying one’s behavior to protect and improve one’s reputation in a complex social world. Their theory makes novel and testable hypotheses, like that reason works best when people argue with each other rather than reason alone, and that we evaluate arguments more objectively than we make them.
  • Stathis Maneas: When choosing among drive types/models, our results indicate that from a reliability point of view, flash type (i.e., eMLC versus 3D-TLC) seems to play a smaller role than lithography (i.e., 1xnm versus 2xnm eMLC) or capacity
  • Steven Swanson: Memory systems are on the verge of a renaissance:  Scalable, persistent main memories (e.g., Intel’s 3DXPoint) are the first new technology to enter the upper layers of the memory hierarchy in 50 years.  They bring a fundamentally new capability (i.e., persistence), a dramatic increase in capacity, and an array of complications (e.g., asymmetric read and write performance, power limitations, and wear out).  This combination of characteristics raises a deceptively simple but fundamental question:  What should we do with persistent main memory
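
Rick Houlihan's connect-timeout tip above is easy to apply from a Python Lambda. A hedged boto3 sketch (botocore's default connect and read timeouts are 60 seconds; the table name and retry settings here are placeholders, not his exact configuration):

```python
import boto3
from botocore.config import Config

# Tighten botocore's default 60s timeouts so a dropped connection attempt
# fails fast and gets retried instead of hanging the Lambda invocation.
ddb_config = Config(
    connect_timeout=0.1,          # 100 ms, per the tip above
    read_timeout=1.0,             # keep reads short on latency-sensitive paths
    retries={"max_attempts": 3, "mode": "standard"},
)

dynamodb = boto3.resource("dynamodb", config=ddb_config)
table = dynamodb.Table("example-table")   # placeholder table name

def handler(event, context):
    return table.get_item(Key={"pk": event["pk"]}).get("Item")
```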

Useful Stuff:

  • Honeycomb.io with Observations on ARM64 & AWS’s Amazon EC2 M6g Instances. Some big wins here.
    • For our use case, M6g is superior in every aspect to C5: it costs less on-demand, has more RAM, exhibits lower median and significantly narrower tail latency, and runs cooler with the same proportional workload per host. 
    • @lizthegrey: We've been able to decrease our instance count for one workload by 30%, with each instance costing 10% less (C5 vs M6g). All it took was a recompile with GOARCH=arm64 (plus some infra wrangling). 
    • A postscript: the outcome of our week-long bake running entirely on Graviton2. Notice the 27%+ decrease in instance count peak-to-peak, and 37.5% fewer instances for baseline load, which averages out to >30% fewer instances overall.
    • Saving 40% on the EC2 instance bill for this service once we’re able to fully convert our instances to Graviton2 is well worth the investment
    • Charlie Demerjian: On Saturday SemiAccurate brought you exclusive news about Intel’s latest server cancellation. This mainstream platform, not a niche variant, shows exactly how untenable Intel’s server roadmap is. For years now we have been saying this roadmap is folly, it can not hold, and now it has happened. Lets take a look at how much worse Intel’s position is since our ‘rosy’ article entitled, “Intel has no chance in servers and they know it“. We said Intel knows they have no chance to beat AMD until after their 2022 platform so how can things get worse? 

  • Videos from !!Con West 2020 are now available.

  • Azure appears to be full. A cloud is a bunch of servers, so when a pandemic disrupts the server supply chain even the infinite capacity of the cloud turns out to be finite.
    • Ragishar: I'm reading a lot of these comments and a lot of y'all are missing the point. All these cloud providers are constantly growing at a high rate, and there is not a problem when they can just order more racks. Now the supply is impacted and it's harder to get the same capacity in the same amount of time, so you start to fall behind. Fall behind too much and you get the 'azure is full' headlines. People working from home is only loosely related because there is an increase in demand, but it wouldn't be a problem if the supply chain to order more capacity was healthy. Source: I do this stuff for a living
    • angrycat: in our experience Azure does weird shit when it hits capacity. We had to go through some MS back channels to find out one of their data centers we use is over capacity which explained why only our nodes at that location were failing inconsistently in bizarre ways. Moved our resources to another location and all the issues went away. They are certainly not open to making their capacity known.
    • Nyefan: I work in videoconferencing. Our traffic has gone up 20x in the last 3 weeks, and we initially had some difficulty getting the servers we need from Amazon. They've added us to a higher priority lane for support and compute capacity because we're helping maintain the quarantine. They also helped us move traffic away from some of their smaller and more highly impacted regions by letting us know which regions had the most spare capacity of the instance types we were using. So this announcement from Azure rings true to me.
    • cdreid: I'm a trucker; you would be very, very surprised at how big our stockpiles of everything are. But of course you're right, prices are going to go up. And I have zero criticism of Microsoft for not being able to handle the traffic. None.
    • drysart: You'd be surprised. Right now I'm working with a major healthcare-related application used at numerous hospitals and practices that is 100% Azure hosted; and as a trend, the entire healthcare industry is only getting more reliant on cloud-based technology (from all cloud vendors), not less. And as everyone in healthcare is scaling up enormously right now, their needs in terms of cloud footprint is also growing. And the major cloud vendors were already having scaling issues due to supply chain disruptions in general before the coronavirus suddenly made healthcare's infrastructural needs grow. There were times when you couldn't provision new Azure resources last fall, and the world's supply chain situation certainly isn't better today than it was then. So no, a doctor or a nurse might not need technology to stick a stethoscope to someone's chest to listen to their lungs; but they certainly need their technology stack working to schedule their appointments, reserve rooms and ventilators, order prescriptions, and make entries into patients' charts because all of those things are done electronically. And they need bandwidth to do all the video telehealth appointments that everyone's pretty much doing exclusively for non-coronavirus, non-emergency issues right now.
    • Just1689: I think this introduces some interesting points to the DR and BCP conversation. Is it a safe bet that we can rely on the cloud to have capacity? Normally I wouldn't doubt it, but in this sort of situation it becomes more likely they will be put under capacity stress. Will the cloud vendors learn and build slack in?

  • Need good talks? Take a look at reddit/ConTalks

  • Reactive systems are a good match for distributed real-time IoT problems. This must have been fun to build. Tesla Virtual Power Plant
    • Software is really the key to enabling these diverse components to act in concert. One of the things we can do is bring together thousands of small batteries in people's homes to create virtual power plants, providing value to both the electrical grid as well as to the home or business owner. This marries some of the most interesting and challenging problems in distributed computing, with some of the most important and challenging problems in distributed renewable energy.
    • The tricky thing about the power grid is that supply and demand have to match in real-time or else frequency and voltage can deviate, and this can damage devices and lead to blackouts. The grid itself has no ability to store power, so the incoming power supply and outgoing power consumption need to be controlled in a way that maintains the balance.
    • It runs a full Linux operating system and provides local data storage, computation, and control, while also maintaining bi-directional streaming communication with the cloud over WebSocket so that it can regularly send measurements to the cloud for some applications as frequently as once a second. It can also be commanded on-demand from the cloud. 
    • The foundation of this platform is this linearly scalable WebSocket front end that handles connectivity as well as security. It has a Kafka cluster behind it for ingesting large volumes of telemetry from millions of IoT devices. This provides messaging durability, decouples publishers of data from consumers of data, and it allows for sharing this telemetry across many downstream services. The platform also has a service for publish-subscribe messaging, enabling bi-directional command and control of IoT devices. These three services together are offered as a shared infrastructure throughout Tesla on which we build higher-order services. On the other side of the equation are these customer-facing applications supporting the products that I just highlighted.
    • The APIs for energy products are organized broadly into three domains. The first are APIs for querying telemetry alerts and events from devices or streaming these as they happen. Second are APIs for describing energy assets and the relationships among these assets, and lastly, APIs for commanding and controlling energy devices like batteries. Now, the backing services for these APIs are composed of approximately 150 polyglot microservices, far too many to detail in this presentation. I'll just provide a high-level understanding of the microservices in each domain.
    • It relies heavily on Postgres database to describe these relationships. We use Kafka to integrate changes as they happen for many of these different business systems where we stream the changes directly from IoT devices. At scale, actually, this is a lot more reliable. Devices are often the most reliable source of truth self-reporting their configuration, state, and relationships. Digital twin is the representation of a physical IoT device, a battery, an inverter, a charger in software modeled virtually, and we do a lot of digital twin modeling to represent the current state and relationships of various assets.
    • Akka has been an essential tool for us for building these microservices. Akka is a toolkit for distributed computing. It also supports actor-model programming, which is great for modeling the state of individual entities like a battery, while also providing a model for concurrency and distribution based on asynchronous and immutable message passing. It's a really great model for IoT, and I'll provide some specific examples later in the presentation. Another part of the Akka toolkit that we use extensively is the Reactive Streams component called Akka Streams. Akka Streams provides sophisticated primitives for flow control, concurrency, and data management, all with backpressure under the hood, ensuring that the services have bounded resource constraints. 
    • Like any large platform, there is a mix of languages, but our primary programming language is Scala. The reason we came to Scala was through Akka because it's really the first-class way to use Akka. Then we really fell in love with Scala's rich type system, and we become big fans of functional programming for building large, complex distributed systems. We like things like the compile-time safety, immutability, pure functions, composition, and doing things like modeling errors as data rather than throwing exceptions.
    • Majority of our microservices run in Kubernetes, and the pairing of Akka and Kubernetes is really fantastic. Kubernetes can handle coarse-grained failures in scaling, so that would be things like scaling pods up or down, running liveness probes, or restarting a failed pod with an exponential back off.
    • Then we use Akka for handling fine-grained failures like circuit breaking or retrying an individual request and modeling the state of individual entities like the fact that a battery is charging or discharging. Then we use Akka Streams for handling the system dynamics and these message-based real-time streaming systems. The initial platform was built with traditional HTTP APIs and JSON that allowed rapid development of the initial platform. Over the past year, we've invested much more in gRPC. It's been a big win.
    • What's needed to build these complex services, especially in IoT is a toolkit for distributed computing. For us, that's been Akka.
    • Tesla's unique vertical hardware, firmware, software integration enables this distributed algorithm. The vertical integration lets us build a better overall solution. This distributed algorithm makes the virtual power plant more resilient. Devices are able to behave in a reasonable way during the inevitable communications failures of a distributed system.
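    • Tesla's services are Scala and Akka; the sketch below is not their code, just a minimal Python asyncio rendering of the digital-twin-as-actor idea: one task owns one hypothetical battery's state and processes telemetry and commands one message at a time from a bounded mailbox (a crude stand-in for backpressure).

```python
import asyncio

class BatteryTwin:
    """Toy digital twin: a single task owns the state of one (hypothetical) battery."""
    def __init__(self, device_id, mailbox_size=100):
        self.device_id = device_id
        self.soc = None                                      # state of charge, unknown until reported
        self.mailbox = asyncio.Queue(maxsize=mailbox_size)   # bounded queue = crude backpressure

    async def send(self, msg):
        await self.mailbox.put(msg)      # suspends the producer if the twin falls behind

    async def run(self):
        while True:
            msg = await self.mailbox.get()
            if msg["type"] == "telemetry":                   # device self-reports its state
                self.soc = msg["soc"]
            elif msg["type"] == "command":                   # e.g. a charge/discharge setpoint
                print(f"{self.device_id}: applying {msg['watts']}W at soc={self.soc}%")

async def main():
    twin = BatteryTwin("battery-001")
    runner = asyncio.create_task(twin.run())
    await twin.send({"type": "telemetry", "soc": 72})
    await twin.send({"type": "command", "watts": -2000})
    await asyncio.sleep(0.1)
    runner.cancel()

asyncio.run(main())
```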

  • Syncing almost never works. Is Dropbox the exception? Sounds like they tried to be. Rewriting the heart of our sync engine
    • It's a hard problem: Sync at scale is hard. Distributed systems are hard. Durability everywhere is hard. Testing file sync is hard. Specifying sync behavior is hard. 
    • We wrote Nucleus in Rust! Rust has been a force multiplier for our team, and betting on Rust was one of the best decisions we made. More than performance, its ergonomics and focus on correctness has helped us tame sync’s complexity. We can encode complex invariants about our system in the type system and have the compiler check them for us.
    • We redesigned the client-server protocol to have strong consistency. The protocol guarantees the server and client have the same view of the remote filesystem before considering a mutation. Shared folders and files have globally unique identifiers, and clients never observe them in transiently duplicated or missing states
    • sujayakar: yeah, the deterministic simulation is my favorite tech in the whole project. it’s caught all types of bugs, from simple logic errors to complicated race conditions that we would have never thought to test. I think there’s some interesting work out there to bring more of this “test case generation” style of testing to a larger audience…
    • sujayakar: we broke the sync protocol down into two subproblems: 1) syncing a view of the remote filesystem to the clients and 2) allowing clients to propose new changes to the remote filesystem. then, the idea is that we’d solve these two problems with strong consistency guarantees, and then we’d use these protocols for building a more operational transform flavored protocol on top. we took this approach since protocol-level inconsistencies were very common with sync engine classic’s protocol. we spent a ton of time debugging how a client’s view of the remote filesystem got into a bizarre state or why they sent up a nonsensical filesystem modification. so, it’d be possible to build a serializable system on our core protocol, even though we don’t, and that strength at the lowest layer is still really useful.
    • sujayakar: we write almost all of our logic on a single thread, using futures to multiplex concurrent operations on a single thread. then, we make sure all of the code on that thread is deterministic with fixed inputs. there’s lots of ways code can sneak in a dependency on a global random number generator or time. have traits for the interfaces between the control thread and other threads. we also mock out external time behind a trait too. then, wrap each real component in a mock component that pauses all requests and puts them into a wait queue.
    • rbtying: Rust was adopted at Dropbox for some serving infrastructure use cases more than a year before the sync rewrite was started, which was about four years ago. I'd say we solidly predated the "rewrite it in Rust" meme. I believe that this rewrite was only successful because of Rust's ability to both interact safely/efficiently with underlying OS APIs (they're pretty much all C-like) and to encode complex concepts into the type system and the compiler. Rust isn't the only language with these properties, but it is one of the few -- and it's one that we really enjoyed using.
    • sujayakar314: we use an in-house RPC library that uses protobuf as its interface description language. it's very similar to using gRPC but for in-process communication rather than networked RPC.
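    • Dropbox's harness is Rust and far more sophisticated; the sketch below only illustrates the idea sujayakar describes, in Python: route all nondeterminism (ordering, time) through a seeded RNG and a mock clock, so any failing run can be replayed exactly from its seed.

```python
import random

class SimClock:
    """Mock time source: advances only when the test says so."""
    def __init__(self):
        self.now = 0.0
    def advance(self, dt):
        self.now += dt

def sync_round(rng, local, remote):
    """Toy sync step: the order of operations is driven entirely by the seeded RNG."""
    paths = sorted(local.keys() | remote.keys(), key=lambda _: rng.random())
    for path in paths:
        if local.get(path) != remote.get(path):
            remote[path] = local.get(path)    # naive one-way reconciliation
    return remote

def run_simulation(seed):
    rng = random.Random(seed)                 # every source of nondeterminism flows from this seed
    clock = SimClock()
    local = {"/a": 1, "/b": 2, "/c": 3}
    remote = {"/a": 1}
    for _ in range(3):
        clock.advance(rng.uniform(0.1, 1.0))  # mocked time, not wall-clock time
        sync_round(rng, local, remote)
    return remote

# Same seed -> same interleaving -> same result, so a failure is replayable from its seed.
assert run_simulation(42) == run_simulation(42)
print(run_simulation(42))
```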

  • Moving Messages in AWS: Super-Fast Lambdas Use Batches: Using the individual method send_message, sending ten messages consistently takes 700-800ms...Doing the same thing with the batch method, send_message_batch, we see much quicker processing time. All ten messages are consistently broadcast in ~300ms — about 60% less!
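    • A hedged boto3 sketch of the two approaches being compared; the queue URL is a placeholder and the timings will obviously vary:

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder
messages = [{"order_id": i} for i in range(10)]

# Individual sends: ten round trips to SQS.
start = time.perf_counter()
for m in messages:
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(m))
print("send_message x10:", time.perf_counter() - start)

# Batched send: one round trip for up to 10 messages (the SQS batch limit).
start = time.perf_counter()
sqs.send_message_batch(
    QueueUrl=QUEUE_URL,
    Entries=[{"Id": str(i), "MessageBody": json.dumps(m)} for i, m in enumerate(messages)],
)
print("send_message_batch x1:", time.perf_counter() - start)
```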

  • Lots of good info. AWS Cost Optimization Q&A with Corey Quinn
    • So you think you spend a lot of money on AWS. When do you hit enterprise territory? When you spend about $1 million a year. At that point they want to invoice you instead of paying by credit card.
    • Figure out where the money is going. You don't want to spend time optimizing something that's not a significant portion of spend. People tell themselves a lot of lies that aren't backed by data. EC2 is usually the big expense. Then data transfer, EBS, RDS, S3, DynamoDB. Managed NAT Gateway can be really expensive. Use spot. If you aren't sure if you'll need reserved instances buy half instead. Get rid of data you don't need. Put all that old data in Glacier deep archive.
    • For most architectures your Lambda costs will be miniscule. Don't spend a lot of time trying to optimize serverless costs.
    • Time people spend on your AWS bill is time they are not spending on your product. You're not going to cost optimize yourself to the next milestone.
    • When the bill is expensive that's never actually the problem. What everyone cares about is not wasting money. Have a normalized metric, like unit cost to service a user; taking into account that costs are rising because traffic is rising prevents people from focusing just on the cost.
    • Does everyone need to be aware of cost? Most engineers do not. What you need to know would fit on an index card. Big things cost more than small things. Data transfer is super weird; be careful with it. If you don't turn something off you pay for it forever. More ephemerality is better. Move data from the most expensive storage to cheaper storage.

  • Serverless in the wild: characterizing and optimising the serverless workload at [Azure] a large cloud provider
    • Over half of all apps have only one function, 95% have at most 10 functions, and 0.04% have more than 100 functions. Looking at invocations overall, the number of functions within an application turns out not to be a useful predictor of invocation frequency. Applications with more functions show only a very slight tendency for those functions to be invoked more often. The most popular way to trigger a function execution is an HTTP request. Only 2.2% of functions have an event-based trigger, but these represent 27.4% of all invocations. Meanwhile many functions have a timer-based trigger, but they only account for 2% of all invocations. I.e., events happen much more frequently than timer expiry!
    • The 10-minute fixed keep-alive policy [current cloud provider approach] involves ~2.5x more cold starts at the 75th percentile while using the same amount of memory as our histogram with a range of 4 hours… Overall, the hybrid policies form a parallel, more optimal Pareto frontier (green curve) than the fixed policies (red curve).
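    • A hedged sketch of the hybrid policy's shape as I read the paper: keep a per-app histogram of idle times, pre-warm shortly before the next invocation is likely, keep instances alive out to a high percentile of observed gaps, and fall back to a fixed keep-alive when history is thin. The percentile choices below are illustrative, not the paper's tuned values.

```python
from bisect import insort

class HybridKeepAlive:
    """Per-app idle-time tracking for pre-warm / keep-alive decisions (illustrative)."""
    def __init__(self, fixed_keep_alive=600.0, min_samples=20):
        self.idle_times = []                       # sorted gaps between invocations, seconds
        self.fixed_keep_alive = fixed_keep_alive   # fall back to the classic 10-minute policy
        self.min_samples = min_samples

    def record_idle(self, seconds):
        insort(self.idle_times, seconds)

    def _percentile(self, p):
        idx = min(len(self.idle_times) - 1, int(p * len(self.idle_times)))
        return self.idle_times[idx]

    def policy(self):
        if len(self.idle_times) < self.min_samples:            # not enough history yet
            return {"prewarm_after": 0.0, "keep_alive_for": self.fixed_keep_alive}
        return {
            "prewarm_after": 0.9 * self._percentile(0.05),     # wake just before the likely next call
            "keep_alive_for": self._percentile(0.99),          # stay warm for almost all observed gaps
        }

ka = HybridKeepAlive()
for gap in [30, 35, 40, 33, 37, 31, 45, 36, 34, 38] * 3:
    ka.record_idle(gap)
print(ka.policy())
```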

  • With everyone needing remote access these days you may need to set up a VPN. You can do that in a cloud. Deploying a 10000 user VPN in a Month

  • Every system evolves until it supports distributed transactions. Solving Serverless Computing’s Fault Tolerance Problem 
    • By default, FaaS systems like AWS Lambda or Google Cloud Functions require programmers to worry about failed executions. All that they guarantee is this: function executions that fail — whether because of an application error or infrastructure failure — will be retried. This means that your function may run 2 times. Or 3 times.
    • Here’s the most painful issue: most FaaS systems provide no guarantees that a failed execution that reaches out to shared resources like databases or files will be “cleaned up”. In the process of failures and retries, applications that modify shared state can unwittingly expose partial results.
    • we’ve built a system called AFT (Atomicity for Fault Tolerance) that is a shim layer sitting between any serverless compute layer (e.g., AWS Lambda, Google Cloud Functions) and storage layer (e.g., AWS DynamoDB, Redis). Each logical request at the compute layer (which may be composed of multiple functions) is treated as a transaction. AFT guarantees that all of the updates made by a transaction are atomically installed at the storage layer.
    • We implemented the AFT and its protocols in a couple thousand lines of Go over three storage backends — AWS DynamoDB, AWS S3, and Redis (AWS ElastiCache). We’re able to minimize the overheads relative to doing IOs directly to/from the underlying storage engines (see section 6 of the paper for concrete results), and AFT scales smoothly to hundreds of parallel clients and thousands of transactions per second
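    • A single-process toy illustrating the shim idea (not AFT's actual protocol): buffer a request's writes and install them only on commit, so a failed or retried function never leaks partial state. In the real system the install step itself must be made crash-atomic against shared storage, which is the hard part the paper addresses.

```python
import uuid

class AtomicShim:
    """Toy shim: buffer a request's writes and expose them only after commit."""
    def __init__(self, backing_store):
        self.store = backing_store              # committed key -> value
        self.staged = {}                        # txn_id -> {key: value}

    def begin(self):
        txn_id = str(uuid.uuid4())
        self.staged[txn_id] = {}
        return txn_id

    def put(self, txn_id, key, value):
        self.staged[txn_id][key] = value        # nothing is visible to readers yet

    def get(self, key):
        return self.store.get(key)              # readers only ever see committed data

    def commit(self, txn_id):
        self.store.update(self.staged.pop(txn_id))   # install all of the writes together

    def abort(self, txn_id):
        self.staged.pop(txn_id, None)           # a failed/retried attempt leaves no partial state

store = {}
shim = AtomicShim(store)
txn = shim.begin()
shim.put(txn, "order:1", {"status": "paid"})
shim.put(txn, "inventory:42", {"count": 9})
shim.abort(txn)                                 # simulate a failed Lambda attempt
print(store)                                    # {} -- no partial results leaked
```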

  • Whatya gonna do when they come for you? GitHub Actions and the DevOps Lifecycle. GitHub is a platform. Like AWS, it sounds like if you're building a service on top of GitHub you may find GitHub competing with you in the future. GitHub wants to own the entire developer value chain. 

  • Halodoc: Building the Future of Tele-Health One Microservice at a Time. This is one of those videos with high production values and not a lot of technical details, but the key takeaway is the sheer scale of problems you can solve these days with relatively little resources. It's amazing.
    • The system connects 22,000 doctors and 1,500 pharmacies across 50 cities in South East Asia. An app is the face of the service, but it hides a big complicated backend that supports 5 million users. 
    • Started with a monolith in EC2. Then they started launching new business lines, so they went to 40-50 microservices. Each microservice is kind of a pod, with an EC2 cluster and RDS. DynamoDB is used for cross-cutting data like user data. 
    • Microservices were getting tightly coupled so they built an event bus using SNS. Example: a patient talks to a doctor, the doctor issues a prescription, and the event goes to SNS, SNS triggers Lambda, Lambda pulls data and generates a PDF, and the PDF is then available to the user to show to the pharmacist. Prescriptions are uploaded to pharmacists for pickup. 
    • Lambda is used to enrich metrics which are presented to Kafka. Data is moved out of RDS using DMS to Redshift to support business queries. Data is moved to Glacier for safe keeping. 
    • They also built a realtime pipeline to answer realtime questions, like how many doctors are available right now? Data goes from Kafka to Elastic Search and Grafana so it can be accessed by analysts. 
    • Users can upload photos which go to S3. Rekognition is used to filter out inappropriate images. 
    • Investigating containers, k8s, and the use of ML for health.
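    • A hedged sketch of the prescription flow described above as an SNS-triggered Python Lambda; the bucket name, payload fields, and PDF rendering are placeholders, not Halodoc's code.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "prescriptions-example-bucket"          # placeholder bucket

def render_pdf(prescription):
    """Placeholder for whatever PDF library the real service uses."""
    return f"PRESCRIPTION for {prescription['patient_id']}".encode()

def handler(event, context):
    # SNS delivers one record per published prescription event.
    for record in event["Records"]:
        prescription = json.loads(record["Sns"]["Message"])
        key = f"prescriptions/{prescription['prescription_id']}.pdf"
        s3.put_object(Bucket=BUCKET, Key=key, Body=render_pdf(prescription),
                      ContentType="application/pdf")
        # The app or pharmacist can now fetch the PDF, e.g. via a pre-signed URL.
    return {"status": "ok"}
```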

  • Sensible approach. It will be difficult to outcompete by building datacenters. Cloudflare’s Current Expansion Is Different from the Others
    • Cloudflare’s network already spans some 200 cities in more than 90 countries, but this round of expansion is different from any infrastructure deployment the company’s done before. Vapor IO and EdgeMicro are both startups building out a new kind of digital infrastructure: small data center sites designed specifically for edge computing. 
    • We are within 100 milliseconds of 95 percent of the world's population, or 99 percent if you look at internet users in the developed world. Cloudflare and EdgeMicro said they would disclose official network throughput later, but Bourg said the company saw promising initial roundtrip statistics of between 50 and 75 milliseconds. “We believe we can take that from 75 milliseconds down to sub 20, maybe even sub 10 milliseconds.”
    • The catalyst for edge computing – in this particular case – is the ability to connect from the edge of the access networks all the way to the centralized cloud, Trifiro told us. As far as low-latency workloads are concerned, the weakest link today is the “middle mile,” the section of the network between those access networks and the interconnection point for regional data centers.
    • “All the internet traffic in Austin, Texas, whether that’s the enterprise ISP or the cable networks, the MSOs, it’s all backhauled to Dallas,” Bourg explained. It’s the same elsewhere: traffic goes through the seven or nine internet “drains” in major metro areas like Chicago, New York, New Jersey, Miami, Atlanta, and Los Angeles. That’s not good for low-latency streaming applications like Cloudflare Workers (its serverless cloud computing service) or its content delivery network.

  • Why Is Facebook Not in the Cloud Business?  Why would they? Facebook already tried the platform thing. They didn't make money. You know what makes money? Advertising. That's Facebook's core competency. They would be just another cloud vendor and that's not a place you want to be.

  • Nice write up. Leaky abstractions for the loss. Server Outages and Increased API Errors: Code within our service discovery system was not resilient to this type of failure - as it was not within our expectations that a key could be announced without a value due to a transient network error. Our service discovery system is resilient to various failure modes within etcd, however, we did not anticipate a client being able to write a corrupt record due to improper HTTP handling. golang's net/http module — which is in violation of the HTTP 1.1 specification in this specific case — introduced an additional failure case which we did not anticipate. 

  • Reducing UDP latency: Eventually we have found out that high latency was caused by low link speed: we used 100 Mbit/s USB-to-ethernet adapter; net driver didn’t support 1Gbit/s link too...Another solution is to change scheduling policy to round-robin. You can do it with chrt command like this: chrt --rr 99 ./client Finally, it worked! Number of “slow” responses has decreased dramatically. 
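    • The chrt trick can also be applied from inside the process; a sketch using Python's os.sched_setscheduler (Linux only, needs root or CAP_SYS_NICE), which should be the programmatic equivalent of chrt --rr 99:

```python
import os

# Equivalent of `chrt --rr 99 ./client`, applied to the current process (pid 0 = self).
# Requires Linux and CAP_SYS_NICE / root; 99 is the highest real-time priority.
try:
    os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(99))
    print("now running under SCHED_RR, priority 99")
except PermissionError:
    print("need CAP_SYS_NICE or root to switch to a real-time scheduling class")

# ...the latency-sensitive UDP send/receive loop would go here...
```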

  • Centaurs — Kevin Kelly at The Interval. Which is better at chess: AI, human, or AI + human? The best is a centaur, which is the AI + human option, by a long shot. Where we're going is AI + humans. You will be paid in the future by how well you work with AIs, not working against them. AIs think differently than humans, so we make a good team.

  • Don't you hate it when your requirements change? What if they change and you're on Mars? You've got to be clever. We need to get better at this robot thing if robots are going to explore the galaxy. At long last, NASA’s probe finally digs in on Mars: the probe has found soil that seems more dirt-like than sand-like; It sticks together and doesn’t collapse around the mole to give it enough friction to dig. What the mole needs is a little nudge...In late February, the team moved on to what Spohn calls “plan C.” They positioned the scoop above the mole’s tail and pushed it straight down into the dirt. The move is risky, because a delicate tether that provides power and communications from the lander attaches to the back part of the mole, and a hard whack could damage it. “This is our last resort,” Spohn said in an interview last fall.

  • NASA to launch 247 petabytes of data into AWS – but forgot about eye-watering cloudy egress costs before lift-off. I think they missed an opportunity to use Data Gravity in the title. 

  • Keep in mind JavaScript can't exactly represent integers bigger than 53 bits, so when you want to make your ids 64 bits, which you will, you'll be screwed. So just use a string for ids. Less cleverness now saves a lot of pain later. Twitter IDs (snowflake). Tumblr New, Bigger Post IDs.
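    • A quick way to see the limit (JavaScript numbers are IEEE-754 doubles, the same representation as Python floats); the snowflake-style ID below is a hypothetical value:

```python
# A double has a 53-bit significand, so adjacent integers above 2**53 collapse.
print(float(2**53) == float(2**53 + 1))                 # True

# A 64-bit snowflake-style ID is well past 2**53...
snowflake_id = 1241063106174087168                      # hypothetical 64-bit ID
print(float(snowflake_id) == float(snowflake_id + 1))   # True: a JSON-number round trip can corrupt it

# ...so APIs ship it as a string instead.
payload = {"id": str(snowflake_id)}
print(payload)
```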

  • Good explanation of why average response times are not what you should be looking at. Lies, Damned Lies, and Averages: Perc50, Perc95 explained for Programmers: Previously I mentioned “perc50” and “perc95”, what exactly do those terms mean? The term “perc” stands for percentile, and the number indicates what percentile. The term “perc50” indicates you’re looking at a number where 50% of requests are at or below that number.
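    • A hedged sketch of perc50/perc95 using the nearest-rank method (production monitoring systems usually use histograms or sketches rather than sorting every sample), and of why the average misleads:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Simulated request latencies: mostly fast, with a slow tail that the average hides.
random.seed(1)
latencies_ms = [random.gauss(120, 15) for _ in range(950)] + \
               [random.gauss(900, 200) for _ in range(50)]

print("average:", round(sum(latencies_ms) / len(latencies_ms)))   # dragged up by the tail
print("perc50 :", round(percentile(latencies_ms, 50)))            # the typical request
print("perc95 :", round(percentile(latencies_ms, 95)))            # the slowest 5% see this or worse
```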

  • Worse is better has some theory behind it. On The Hourglass Model
    • A system with a specific spanning layer has a tendency to grow virally. Weakness is an important part of deployment scalability. 
    • The hourglass model of layered systems architecture is a visual and conceptual representation of an approach to design that seeks to support a great diversity of applications and allow implementation using a great diversity of supporting services. At the center of the hourglass model is a distinguished layer in a stack of abstractions that is chosen as the sole means of accessing the lower-level resources of the system. This distinguished layer can be implemented using services that are considered as lying below it in the stack as well as other services and applications that are considered as lying above it. However, the components that lie above the distinguished layer cannot directly access the services that lie below it.
    • We would then design our spanning layer to be as weak, simple, general and resource limited as possible while supporting this set of applications. The reasons for including these four characteristics are different. The weaker the spanning layer the greater the number of possible underlying services that can support it. The simpler and more general the spanning layer the greater the number of possible applications that can be implemented using it.

  • People don't think much about the timing plane in their system, but it's always fascinating to see how it works. Building a more accurate time service at Facebook scale: Facebook NTP service is designed in four layers, or strata:  Stratum 0 is a layer of satellites with extremely precise atomic clocks from a global navigation satellite system (GNSS), such as GPS, GLONASS, or Galileo. Stratum 1 is Facebook atomic clock synchronizing with a GNSS. Stratum 2 is a pool of NTP servers synchronizing to Stratum 1 devices. Leap-second smearing is happening at this stage. Stratum 3 is a tier of servers configured for a larger scale. They receive smeared time and are ignorant of leap seconds.
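    • Leap-second smearing at Stratum 2 can be pictured with a toy function; the linear ramp and 24-hour window below are common conventions used for illustration, not necessarily Facebook's exact smear:

```python
def smear_fraction(t, leap_epoch, window=86400.0):
    """Fraction of a +1s leap second already absorbed by a linear smear that
    runs over `window` seconds and finishes at the leap (illustrative parameters)."""
    start = leap_epoch - window
    if t <= start:
        return 0.0
    if t >= leap_epoch:
        return 1.0
    return (t - start) / window

leap = 1483228800                                # leap second at 2017-01-01T00:00:00Z
for hours_before in (24, 12, 1, 0):
    t = leap - hours_before * 3600
    print(f"{hours_before:>2}h before: {smear_fraction(t, leap):.4f}s of the leap absorbed")
```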

  • Don't just cache, prefetch! Prefetch Caching of eBay Items: When a user accesses the prefetched cached item data, the response time of item service decreases by several hundred milliseconds (~400ms). Since prefetch is done for the top 5-10 items allowing the optimum balance of impact and cost, it covers 70% of search to item page traffic. Taking this ratio into consideration, the overall speed for above the fold rendering time has improved by 10%.
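    • A toy sketch of the pattern (not eBay's implementation): after a search, warm a short-TTL cache with details for the top few results so the item page can skip the slow service call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CACHE_TTL = 30.0                                 # illustrative TTL, not eBay's number
item_cache = {}                                  # item_id -> (expires_at, details)
prefetch_pool = ThreadPoolExecutor(max_workers=4)

def fetch_item_details(item_id):
    """Stand-in for the real, slow (~hundreds of ms) item service call."""
    time.sleep(0.4)
    return {"id": item_id, "title": f"item {item_id}"}

def prefetch(item_id):
    item_cache[item_id] = (time.time() + CACHE_TTL, fetch_item_details(item_id))

def handle_search(query, result_ids):
    # Prefetch only the top results: balances hit rate against wasted work.
    for item_id in result_ids[:5]:
        prefetch_pool.submit(prefetch, item_id)
    return result_ids

def handle_item_page(item_id):
    entry = item_cache.get(item_id)
    if entry and entry[0] > time.time():
        return entry[1]                          # prefetched: skips the slow call
    return fetch_item_details(item_id)           # cache miss: pay full latency

handle_search("gpu", ["A1", "A2", "A3", "A4", "A5", "A6"])
time.sleep(0.5)                                  # give the prefetch a moment in this toy example
print(handle_item_page("A1"))
```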

Soft Stuff:

  • microsoft/coyote (article):  a programming framework for building reliable asynchronous software. Coyote ensures design and code remain in sync, dramatically simplifying the addition of new features. Coyote comes with a systematic testing engine that allows finding and deterministically reproducing hard-to-find safety and liveness bugs. Coyote is used by several teams in Azure to design, implement and systematically test production distributed systems and services.  
  • googleforgames/agones (article): a library for hosting, running and scaling dedicated game servers on Kubernetes.
  • google/oss-fuzz: continuous fuzzing of open source software.

Pub Stuff:

  • Private Kit: Safe Paths- Can we slow the spread without giving up individual privacy?: first-generation contact-tracing tools, deployed against the current 2019 novel Coronavirus (COVID-19) crisis, can also be – and have been – used to expand mass surveillance, limit individual freedoms and expose the most private details about individuals. Citizen-centric, privacy-first solutions that are open-source, secure, and decentralized (such as MIT Private Kit: Safe Paths) represent the next generation of tools for disease containment in an epidemic or pandemic.
  • Stagnation and Scientific Incentives:  We demonstrate empirically that measures of novelty are correlated with but distinct from measures of scientific impact, which suggests that if also novelty metrics were utilized in scientist evaluation, scientists might pursue more innovative, riskier, projects.
  • INFINICACHE: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache: InfiniCache is one to two orders of magnitude cheaper than ElastiCache (managed Redis cluster) on some workloads. Compared to S3, InfiniCache achieves superior performance improvement for large objects: it’s at least 100x for about 60% of all large requests. This trend demonstrates the efficacy of the idea of using a distributed in-memory cache in front of a cloud object store. For the large object workload, a production-like test resulted in the availability of 95.4%. InfiniCache without backups sees a substantially lower availability of just 81.4%.
  • Millions of tiny databases: In other words this paper is about the second order effects of the EBS replication system. But these second order effects still become very important at the AWS scale. If you have millions of nodes in EBS that need configuration boxes, you cannot rely on a single configuration box. Secondly, yes, the configuration box should not see much traffic normally, but when it does see traffic, it is bursty traffic because things went wrong. And if the configuration box layer also caves in, things will get much much worse. The paper gives an account of "21 April 2011 cascading failure and loss of availability" as an example of this.
  • PigPaxos: Devouring the communication bottlenecks in distributed consensus: The central idea in PigPaxos is to decouple the communication from the decision-making at the leader. PigPaxos revises the communication flow to replace the direct communication between the leader and followers in Paxos with a relay based communication flow. PigPaxos chooses relays randomly from follower clusters at each communication round to reduce contention and improve scalability of throughput.
  • Software Engineering at Google: Lessons Learned from Programming Over Time.
  • Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook: These characterizations reveal several interesting findings: first, that the distribution of key and value sizes are highly related to the use cases/applications; second, that the accesses to key-value pairs have a good locality and follow certain special patterns; and third, that the collected performance metrics show a strong diurnal pattern in the UDB, but not the other two.