Stuff The Internet Says On Scalability For December 14th, 2018

Wake up! It's HighScalability time:

We've come a long way in 50 years. Or have we?

Alan Kay: I believe ARPA spent $175,000 of 1968 money for that one demo. That’s probably like a million bucks today.

Bill English: What we did was lease two video circuits from the phone company. They set up a microwave link: two transmitters on the top of the building at SRI, receiver/transmitters up on Skyline Boulevard on a truck, and two receivers at the Civic Center. Cables of course going down into the room at both ends. That was our video link. Going back we had two dedicated 1,200-baud lines: high-speed lines at the time. Homemade modems.

Doug Engelbart: It was the very first time the world had ever seen a mouse, seen outline processing, seen hypertext, seen mixed text and graphics, seen real-time videoconferencing.

Alan Kay: We could actually see that ideas could be organized in a different way, that they could be filtered in a different way, that what we were looking at was not something that was trying to automate current modes of thought, but that there should be an amplification relationship between us and this new technology.

Do you like this sort of Stuff? Please support me on Patreon. I'd really appreciate it. Know anyone looking for a simple book explaining the cloud? Then please recommend my well reviewed (32 reviews on Amazon and 77 on Goodreads!) book: Explain the Cloud Like I'm 10. They'll love it and you'll be their hero forever. And if you know someone with hearing problems they might find Live CC very useful.

  • 50,000: images in a National Geographic shoot; 11 billion: Voyager 2 miles traveled; 83%: AI papers originate outside the US; 80%: network partitions lead to catastrophic failures; 97%: large AWS customers use auto scaling; 82%: startup failures due to cashflow problems; $115.7 billion: robot and drone spending in 2019; 

  • Quotable Quotes:
    • @jmarhee: "I went to Kubecon and all I got were these 300 service meshes built by companies that went under by time I got home"
    • @ReformedBroker: “Retention is the new growth.” - CEO of Adobe ($ADBE), which has gained 793% since moving Photoshop etc to a subscription-based service from a one-time software sale in 2011.
    • @mattray: OH: "Cloud native is pretty simple. You just need to know Kubernetes, Prometheus, Fluentd, Jaeger, Envoy, Core DNS, Linkerd, Rook, Vitess, Etcd and Raft."
    • @davidfrum: In order to lose 80% of its value, the bitcoin network expended more electricity than 150 of the world's 195 countries
    • @ChappellTracker: So here's something. Users are testing the limits of Tumblr's new algorithm that flags adult content (aka "censorbot"). This one found that a man's chest was flagged, but a man's chest next to a 50% scale owl went unnoticed.
    • Jimmy Chin: Great editors are brutal.
    • @danbri: intro slide: "Main message is that biology has been computing long before brains evolved. Somatic decision-making and memory are mediated by ancient pre-neural bioelectric networks across all cells. Exploring non-neural Cognition is [an untapped frontier for AI...]"
    • @swardley: X : Large companies are leaving cloud and building internally. If they use it, then it will be hybrid. Serverless is for startups who can't afford their own. 
      Me : Hmmm. The only words of comfort I have are ... if you don't like change, you're going to hate irrelevance.
    • @GossiTheDog: "Equifax did not see the data exfiltration because the device used to monitor ACIS network traffic had been inactive for 19 months due to an expired security certificate. On July 29, 2017, Equifax updated the expired certificate and immediately noticed suspicious web traffic."
    • Kinsta: PHP 7.3 was the fastest engine in 14 out of the 16 configurations tested above. 
    • @mipsytipsy: 2015: <me> we're different bc ... we handle arbitrarily wide events with no fixed schema and high cardinality dimensions <adviser> use none of those words please <me> it's a game changer <adviser> you can't sell high cardinality <me> no we just need to explain it <adviser> 🥺🤯😓
    • @codeboten: “A system is observable if the behavior of the entire system can be determined by only looking at its inputs and outputs” - kalman 1961  @adrianco kicking things off #o11y #KubeCon #observability
    • James T. Areddy: China is big, messy and complicated, he said. “We have been out there in the trenches for many years.”
    • @adrianco: I’ve been saying using serverless+saas+dbaas first as a rapid prototype then optimize with custom container based services as needed.
    • @BrianRoemmele: All of the machines pictured below, except for the printers are now in your pocket. This is IBM System/360 released in 1965 and sold until 1978 with 1024 KB RAM. and 8 MB of slow Large Capacity Storage. It fit nicely in a small air conditioned house with ~20 folks operating it.
    • andrewmcwatters: I've written UIs my entire career. Entire career. This isn't progress. It's why teenagers are having so much fun with web technologies, and in my adult years I've been dying to see something come out and replace them.
    • @jim_scharf: Why does it matter? I know customers used to need to think about partitions, but with on demand and adaptive capacity, we’re really taking big steps towards this being an unnecessary detail for customers. There are good Reinvent talks on this. See DB blog for listing.
    • Jeff Dorsch: The Houston-Galveston Area Council’s website recently divulged contract figures with two startups, Drive.ai and EasyMile. For Silicon Valley-based Drive.ai, the company charges $14,000 a month, which works out to $168,000 a year, to provide one van vehicle, with the company assuming operation and maintenance costs.
    • @roundtrip: Original price (typical 360/65 1MB system without LCS) Rental: US$200,000 per month (US$1,400,000 in 2014-dollars) Buy: US$4,000,000 (US$27,000,000 in 2014-dollars)
    • @esh: The free http://timercheck.io  service had 8,634,255 hits in November for $30.22 in API Gateway costs. I was hoping ALB to Lambda could cut costs significantly, but I'm not sure it would. The cheapest ALB cost looks to be about $22.28/mo and goes up based on complex factors.
    • Helene Stapinski: That night, I got up the nerve to ask Paulina if I could follow her on Instagram. Miraculously, she said yes, shrugging as she walked up the stairs to her room. I grabbed my phone, and suddenly, there it was: Paulina’s life. In black-and-white and full color.
    • Janakiram MSV: Firecracker takes a radically different approach to isolation. It takes advantage of the acceleration from KVM, which is built into every Linux Kernel with version 4.14 or above. KVM, the Kernel Virtual Machine, is a type-1 hypervisor that works in tandem with the hardware virtualization capabilities exposed by Intel and AMD...This, along with a streamlined kernel loading process enables a < 125 ms startup time and a reduced memory footprint.
    • Benjamin Raskin: A major component of the M3 platform is its query engine, which we built from the ground up and have been using internally for several years. As of November 2018, our metrics query engine handles around 2,500 queries per second (Figure 1), about 8.5 billion data points per second (Figure 2), and approximately 35 Gbps (Figure 3). These numbers have been constantly trending upwards at a rate much higher than Uber’s organic growth due to the increased adoption of metrics across various parts of our stack.
    • @fuesunw: Great insight. Can be also reworded as: holding on to the Data Centers by CIO is either the fear of failure or the lack of interest in the future of the company (because they are leaving soon anyway). usually both. We need CEOs who can recognize it and act in those situations.
    • Randy Rowland: By 2025, 80% of enterprises will have shut down their traditional data center, versus 10% today. This is seen in the shift many organizations are making to “cloud-first” IT strategies. 
    • Hrishikesh Barua: In the future, Pinterest aims to explore Kubernetes as an abstraction layer for their Kafka deployments, which some organizations are already doing. Some services at Pinterest have already moved to containers. Another goal is to explore EBS again for storage as the newer EBS offerings are better optimized.
    • Ahmed Alquraan: Our analysis highlights that the current network maintenance practice of assigning a low priority to ToR switch failure is ill founded and aggravates the problem.
    • Rachel Traylor: The danger of using Little’s Law as your silver bullet (much the way the Central Limit Theorem is commonly used as a silver bullet) is that you risk losing valuable information about the variation in the random variables that make up your queuing system, which can wreak havoc on the best-laid plans.
    • talkingtab: DO's kubernetes release is an example of why I am a big fan. As a sole developer, I can't afford high technical debt, but DO packages tech in a way I can manage. I hope they keep on and wish other services (here's looking at you AWS) would package their services as well.
    • sonnyblarney: AWS used to be easy, but over the last decade it's become a specialization. Every time I wander back to it, there's another layer of complexity in the way towards doing something simple.
    • Barbara Liskov: The idea that software programming was an intellectual activity that required a great deal of thought was kind of controversial 
    • dudul: I really like Erlang and Elixir, but I'm really worried when I see a tech stack that is still trying to figure out deployment and runtime configuration of systems. I'm not trying to be inflammatory here. As per many resources such as "Phoenix in Action" deployment is still a major culprit. And Dockyard themselves (one of the major Elixir shops, where McCord works actually) have a full time guy on the payroll to try to solve the problem of runtime configuration. It just sucks, because these are 2 major concerns of any production-ready stack.
    • @leppert: Periodic reminder: Slack is a software company gunning for a $10B IPO and they can’t be bothered to write native desktop clients that optimize for the user’s platform of choice.
    • Barbara Liskov: What we desire from an abstraction is a mechanism which permits the expression of relevant details and the suppression of irrelevant details. In the case of programming, the use which may be made of an abstraction is relevant; the way in which the abstraction is implemented is irrelevant. — Programming With Abstract Data Types
    • Rob May: For us, our worst performing ads were around things like "Close Support Tickets Faster" or "Clone Your Best Reps" or "Do Blah Blah in Slack" or "Share Knowledge More Effectively". VCs will tell you to focus on a specific pain point, but, automation is like this meta pain point that is working way better for us, and for most of the AI companies I know.
    • @virdeechapman: Since getting my head around @swardley Mapping these past few weeks, I have found myself defaulting to “Let’s map that out” in meetings. What surprised me the most? It worked in practice—Decisions occurred dispassionately, rather than pitched via compelling ‘punditry’.
    • theredbox: These people are really funny. Especially the old german manufacturers are so so similar to Nokia's leadership. Tesla is not some kind of a new car. It is not even a car in consumer's perspective! It's a gadget. A very expensive gadget but still a gadget. Tesla is what iPhone was to Nokia/Symbians. Tesla is so much ahead in the markets where it matters that the traditional car manufacturers should really be scared. They are good at making cars that's undeniable truth but are they good enough to produce gadgets ? The car of tomorrow is the new iphone. The traditional middle class dudes are dying out, the ones that will be able to afford cars at this price point will rarely shop traditional cars with shitty infotainments and no ecosystem.
    • Grawprog: The whole point of the article was that teaching your child the syntax of a programming language will not necessarily teach them to be master programmers, they need to also be able to think logically about how things are made and put together and you need to teach them to think this way. The only way for them to do that is to learn to think about things in the world that way.
    • Johnny: We’re approaching the seventy year mark of the Highway Era. The cost of maintaining the entire system is beginning to outstrip the benefits. Sooner or later triage will set in as people search for alternatives. Our collective willingness to keep each rarely used stretch of road paved will evaporate in due course. Only the truly productive bits will survive the fullness of time.
    • The Information: Within Google Cloud, some employees believe Ms. Greene isn’t a strong public speaker, which they worried caused her presentation of Google Cloud projects to fall flat.
    • @mjmoriarity: Every NoSQL database I've used so far has been based around a similar idea: what if you never had to think about how you put data in, and instead devoted all of your energy into figuring out how you're ever going to get it back out.
    • James Urquhart: What would speed this loop up even further for business? I believe the answer is an ecosystem of event sources, published via standard mechanisms, with easily interpreted payloads, shared (for free or under some payment model) for all to consume. Every business finding the value in the events they generate, and building new business on events published by others. Literally, simplifying and accelerating the automation of the flow of money and information across the business landscape.
    • Kathryn Jonas: Overall, it's been a positive experience and Go is one of the critical elements that has allowed the Content Platform to scale. Go will not always be the right tool, and that's fine. The Economist has a polyglot platform and uses different languages where it makes sense. Go is likely never going to be a top choice when messing around with text blobs and dynamic content, so Javascript is in the toolset. However, Go's strengths are the backbone that allows the system to scale and evolve. 
    • Bryan Cantrill: The beauty of Rust is that it shifts cognitive load back from software when it's running in production to the developer in development. 
    • Peter Norvig: But I figured that in the course of a transcontinental plane ride I could write and explain a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second in about half a page of code.
    • Peter Norvig~ Doesn't think we'll reach the singularity. The singularity is the idea that progress is accelerating and will continue to accelerate until all things are possible and who knows what will happen after that. Not a believer in that. Part of the problem is people look at a small class of problems and look at progress in those problems, and some of those are accelerating on those types of curves, and think that all problems are like that. Some problems are like route finding. We can get better and better at that. We will have lots of progress on those types of problems. A lot of life is not like that. If you want to solve the Middle East crisis having a faster computer or better machine learning algorithms isn't going to get you very far. More of life is these messy kinds of problems that are not amenable to progress in automation, so the curves for the easy parts will continue to go up and up and up and the curves for the hard messy wicked problems will not go up. The net curve will not be exponential off to infinity.
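
Rachel Traylor's caution about Little's Law in the quotes above is easy to see with a little arithmetic. The sketch below is not from her article. It assumes a single-server queue with Poisson arrivals (M/G/1) and uses the standard Pollaczek-Khinchine formula for the mean queueing delay; the function names and the example numbers are made up for illustration. Two workloads with the same arrival rate and the same mean service time end up with very different averages once the service-time variance changes.

```python
def mean_wait_in_queue(arrival_rate, mean_service, service_variance):
    """Pollaczek-Khinchine mean queueing delay for an M/G/1 queue (assumed model)."""
    rho = arrival_rate * mean_service  # server utilization
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be below 1")
    second_moment = service_variance + mean_service ** 2
    return arrival_rate * second_moment / (2 * (1 - rho))


def mean_in_system(arrival_rate, mean_service, service_variance):
    """Little's Law: L = lambda * W, where W is queueing delay plus service time."""
    w = mean_wait_in_queue(arrival_rate, mean_service, service_variance) + mean_service
    return arrival_rate * w


# Hypothetical workload: 8 requests/s, 100 ms mean service time (80% utilization).
steady = mean_in_system(8, 0.1, 0.0)    # constant service time, about 2.4 in system
bursty = mean_in_system(8, 0.1, 0.01)   # exponential service time, about 4.0 in system
```

Little's Law holds in both cases, but it only relates the averages. The variance is what decides which average you actually get.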

  • Colm MacCárthaigh glossed his own epic talk in an equally epic Twitter thread. Find the talk here: AWS re:Invent 2018: Close Loops & Opening Minds: How to Take Control of Systems, Big & Small. Highly recommended. You'll learn a thing or two because he's seen a thing or two.
    • Like most systems architects, we divide our services into data planes ... the pieces that do the work customers are most interested in, like running an instance, serving content, or storing bits durably ... and control planes, which manage everything behind the scenes.
    • Control Planes are all about taking intent and translating it into real world action in the universe. Just like your TV remote. You tell it what you want, and its job is to make that happen. But it's harder than it seems!
    • Have you ever used a universal remote control, and had it turn your TV on, but not your audio system? Super common problem! There's two things going on: one is that there's a network partition ... your remote can't reach everything, ok fine, so you move it around and press again.
    • But more deeply, the real problem is that the control has no idea whether it achieved success or not. It has no feedback mechanism! This is the most common design problem for control planes. A system like that can never be stable!
    • I see this all the time in customer designs. For example: users change settings, but sometimes they don't take, because the update doesn't make it to all the servers. Often they end up with support processes to push everything again, on demand, or tell users to try resetting.
    • And lots more.
    • See also, Eat your own dog food: how AWS leverages Serverless. AWS uses API Gateway and Lambda to implement control planes.
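
The "no feedback mechanism" point from the thread can be sketched in a few lines. This is not AWS code: `Device`, `reconcile`, and the `before_round` hook are hypothetical names, and reading `power` directly stands in for whatever status check a real control plane would use. The idea is that the controller measures the actual state after each push and retries only what has not converged, instead of assuming the first push worked.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device; `reachable` simulates a network partition."""
    name: str
    power: str = "off"
    reachable: bool = True

    def apply(self, power: str) -> None:
        if not self.reachable:
            raise ConnectionError(f"{self.name} unreachable")
        self.power = power


def reconcile(devices, desired, max_rounds=5, before_round=None):
    """Push the desired state, read back the actual state, retry what differs.

    Returns the names of devices still out of sync after max_rounds.
    """
    pending = list(devices)
    for round_no in range(max_rounds):
        if before_round:
            before_round(round_no)
        for dev in pending:
            try:
                dev.apply(desired)
            except ConnectionError:
                pass  # no acknowledgement; the check below will notice
        # Feedback step: compare measured state with intent.
        pending = [d for d in pending if d.power != desired]
        if not pending:
            break
    return [d.name for d in pending]


tv = Device("tv")
audio = Device("audio", reachable=False)


def heal(round_no):
    # Simulate the partition healing before the third attempt.
    if round_no == 2:
        audio.reachable = True


out_of_sync = reconcile([tv, audio], "on", before_round=heal)
```

An open-loop remote would have stopped after the first press and left the audio system off. The closed loop keeps going until intent and reality match, or reports exactly which devices never converged.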

  • My favorite AI generated cookie names: Hersel Pump Spritters. A close runner up: Low Fuzzy Feats. Least creative: Bars.

  • Everyone falls into creative ruts, but two people rarely do so at the same time. Jeff Dean and Sanjay Ghemawat pair programmed Google into existence. The Friendship That Made Google Huge: On their fifth day in the war room, Jeff and Sanjay began to suspect that the problem they were looking for was not logical but physical. They converted the jumbled index file to its rawest form of representation: binary code. They wanted to see what their machines were seeing. On Sanjay’s monitor, a thick column of 1s and 0s appeared, each row representing an indexed word. Sanjay pointed: a digit that should have been a 0 was a 1. When Jeff and Sanjay put all the missorted words together, they saw a pattern—the same sort of glitch in every word. Their machines’ memory chips had somehow been corrupted...Together, Jeff and Sanjay wrote code to compensate for the offending machines. Shortly afterward, the new index was completed, and the war room disbanded...Failures, which occurred seemingly at random, kept breaking the system. To survive, Google would have to unite its computers into a seamless, resilient whole. Side by side, Jeff and Sanjay took charge of this effort...“We were writing things by hand,” Sanjay said. His glasses darkened in the sun. “We’d rewrite it, and it was, like, ‘Oh, that seems near to what we wrote last month.’ ” “Or a slightly different pass in our indexing data,” Jeff added...“I don’t know why more people don’t do it,” Sanjay said, of programming with a partner. “You need to find someone that you’re gonna pair-program with who’s compatible with your way of thinking, so that the two of you together are a complementary force,” Jeff said.

  • Videos from the Whole Earth 50th Anniversary are now available.

  • The NFL Truly Wraps Its Arms Around Vegas! Sure, the internet spawned whole new industries, my dog knows that, but the internet has also revolutionized one of the oldest professions. No, not that one, we're talking sports betting. Now that the US Supreme Court has legalized sports betting, we're seeing even once gambling-phobic monopolies like the NFL look to grow revenues by getting a taste of the action. The NFL first put a toe in the gambling waters with fantasy football. And it was good. Then in an unprecedented step the NFL let the Oakland Raiders become the Las Vegas Raiders. So the NFL now says Viva Las Vegas! The NFL controls all game data and they control all broadcast rights. Just imagine what you could do with all that lovely lovely data and content! Noted gambling expert RJ Bell thinks the first type of sports betting the NFL will adopt is real-time in-game proposition betting or prop bets. A perfect internet data play. Imagine it's fourth and one on the goal line and the following pops up on your official NFL app: -170 touchdown by pass, -210 touchdown by run, +120 loss of possession. You can think of every game as simply an event stream of betting opportunities. RJ Bell thinks the NFL will need to partner with Apple TV, Roku, etc, but that's old style thinking. The NFL doesn't need anyone. They own everything so they can cut out the middleman. Isn't the internet all about disintermediation? The NFL should create their own apps, hire their own odds makers, AI the hell out of it, manage the money, and walk all the way to the bank. It will make TV contracts look like pocket change. That is, if they don't blow it. Which they probably will.
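
For anyone who doesn't read moneylines for fun, here is a small sketch of what those prop bet numbers mean. It uses the standard conversion from American odds to implied probability; the function name is made up, and the three lines are simply the illustrative ones from the paragraph above.

```python
def implied_probability(american_odds: int) -> float:
    """Convert American (moneyline) odds to the implied probability of the outcome."""
    if american_odds < 0:
        stake = -american_odds  # amount you must risk to win 100
        return stake / (stake + 100)
    return 100 / (american_odds + 100)  # positive odds: amount won on a 100 stake


lines = {
    "touchdown by pass": -170,
    "touchdown by run": -210,
    "loss of possession": 120,
}
probs = {name: implied_probability(odds) for name, odds in lines.items()}
total = sum(probs.values())
```

The implied probabilities come out to roughly 0.63, 0.68, and 0.45. They sum to well over 1, and the excess is the bookmaker's margin. A margin that large suggests the numbers in the paragraph are meant as color rather than a real market, but the event-stream idea is the same: every snap produces a fresh set of lines to price.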

  • Excellent explanation. How to scale your Node.js server using clustering

  • Pivotal announces new serverless framework. Great discussion on HN. erikb: Why is it so hard to understand? a) re FaaS. Of course if you have an on-prem component, someone needs to take care of it. That's why it's on-prem. But still you can separate administration of that from the developers and have the additional new feature that you as admin don't need to care which software runs in these clusters. (In reality it's never that simple since specific hardware for the task outperforms the standard hardware by more than a margin, but at least you have to worry less about getting from cool-product:v1.0 to cool-product:v2.0 as admin anymore.) richards: Honestly, this rarely seems to come down to cloud pricing considerations. Rather, the goal for most shops who choose something like PCF is to (a) simplify onboarding and ongoing work of devs since they don't need to learn the nuances of each cloud IaaS, (b) stripe a single ops model across each infrastructure pool, which matters since no enterprise of size is using only one. People buy because they struggle to ship, and PCF helps these big companies put their focus back onto shipping software, not where or how it runs.

  • The secret to saving money on AWS? Blend spot, RI, and OD using Launch Templates, EC2 Fleet and Auto Scaling. AWS re:Invent 2018: Better, Faster, Cheaper Cost Optimizing Compute with Amazon EC2 Fleet. Use custom tags. Each resource can have up to 50 tags. Helps identify which applications are using which resources. Tags are free. Can tag an instance as part of a website, or dev, or prod, or launched by Engineer X. Then using cost explorer you can answer questions like: how much does my website cost? Are my dev or test environments being run all the time? Are people using the RIs that we purchased or are they spinning up new instances? Three purchase options: on-demand, reserved instances, spot instances. These are just financial constructs. All instance types are available under each option. 95% of spot instances are terminated by the customer. Hibernate saves money because it's faster to start. 175 instance types, rate of instance type growth is increasing because custom hardware (Nitro) makes it easier to manage. Experiment with different instance types to save money. Could save 10-15% on AMD. On a LAMP stack can save 45% by scaling out using a larger number of small instances with 1 or 2 CPUs. You need to combine purchasing options. Use RIs for known, steady-state workloads. On demand for work that can't be interrupted. Use spot for stateless workloads. Illumina cut gene sequencing costs by 75%. AdRoll handles 100 billion requests a day using spot at a saving of $3 million per year. Three core technologies for how to save money: Launch Templates, EC2 Fleet, and Auto Scaling groups. Launch Templates capture all launch parameters in a template. EC2 Fleet uses all three purchase options to optimize costs. You say how much you want and whether you want to optimize for cost or diversification. Zillow saved 30% using a blended environment compared to all RIs, and saved 70% compared to on demand. Task nodes, worker nodes, and processors can be converted to EC2 Fleet using spot. Lyft runs a lot on spot: restartable production workloads, non-production compute resources, offline operations, custom batch workloads. Lyft runs on mixed spot, RI, OD, Lambda: Horizontally scalable services, streaming compute. Lyft has 3 people on a capacity team to manage AWS costs. Containers are great for spot. Optimize across all three purchasing options using Fleet. Right size and scale based on demand using EC2 Auto Scaling, with both predictive scaling and dynamic scaling. Don't forget to scale down and turn stuff off. Use Launch Templates to streamline and simplify the launch process. Look into hibernate. Architect workloads with both performance and cost in mind.
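
To make the "blend your purchase options" advice concrete, here is a back-of-the-envelope sketch. Every number in it is hypothetical: the on-demand price, the RI and spot discounts, and the fleet mix are placeholders, not AWS pricing and not figures from the talk. `Pricing` and `hourly_cost` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    on_demand: float      # $/instance-hour (hypothetical)
    ri_discount: float    # fraction off on-demand for reserved instances (hypothetical)
    spot_discount: float  # fraction off on-demand for spot (hypothetical)


def hourly_cost(instances_by_option, pricing):
    """Total $/hour for a fleet described as {"od": n, "ri": n, "spot": n}."""
    rate = {
        "od": pricing.on_demand,
        "ri": pricing.on_demand * (1 - pricing.ri_discount),
        "spot": pricing.on_demand * (1 - pricing.spot_discount),
    }
    return sum(rate[option] * count for option, count in instances_by_option.items())


pricing = Pricing(on_demand=0.10, ri_discount=0.40, spot_discount=0.70)

all_on_demand = hourly_cost({"od": 100}, pricing)
# Steady-state load on RIs, uninterruptible burst on OD, stateless work on spot.
blended = hourly_cost({"ri": 40, "od": 10, "spot": 50}, pricing)
savings = 1 - blended / all_on_demand
```

With these placeholder numbers the blended fleet costs about half of the all on-demand fleet. Plug in your own rates and mix; the shape of the calculation is the point, not the result.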

  • Scale By The Bay 2018 videos are now available. You might like: Pat Helland, Keynote III: Mind Your State for Your State of Mind

  • An awesome Recap of Frontend Development in 2018. WebAssembly is poised to do whatever it does. React leads, but Vue is a bigger star. GraphQL will be big next year, if people can get over that "looks like a lot of work compared to REST" hump. Never even heard of CSS-in-JS. In response to framework hell there's a back-to-the-land movement with increased adoption of static site generation. TypeScript is entering widespread adoption so you can count on looking at even more files where you go "that looks like Javascript, but I have no idea what's going on." I learned I'm using the JAMstack. What does 2019 hold? More of the same, but who really believes that?

  • The great Jeremy Daly explored how to contain Lambda costs. Serverless Tip: Don’t overpay when waiting on remote API calls. He found something that goes against common wisdom: my experiments show there is little to no effect [of memory] on total Lambda execution time since it is simply waiting for a response and not doing any processing of its own. So that means: Functions that make remote API calls can be broken down into small, asynchronous components with low memory settings. We get the same performance and significantly reduce our costs, especially at scale.
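
The arithmetic behind that tip is worth seeing once. The sketch below is not from Jeremy's post. The price constants and the 100 ms billing increment are assumptions about Lambda's billing model at the time of writing, so check the current pricing page before trusting the absolute numbers. `invocation_cost` is a made-up helper.

```python
import math

# Assumptions, not authoritative pricing: verify against the AWS pricing page.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000
BILLING_INCREMENT_MS = 100


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Estimated cost of one invocation: billed GB-seconds plus the request fee."""
    increments = math.ceil(duration_ms / BILLING_INCREMENT_MS)
    billed_seconds = increments * BILLING_INCREMENT_MS / 1000
    gb_seconds = (memory_mb / 1024) * billed_seconds
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


# A function that spends one second waiting on a remote API.
small = invocation_cost(128, 1000)
large = invocation_cost(1024, 1000)
compute_ratio = (large - PRICE_PER_REQUEST) / (small - PRICE_PER_REQUEST)
```

If the wall-clock time is dominated by waiting, the duration is the same at either setting, so the compute charge scales straight with memory: 8x more for 1024 MB than for 128 MB under these assumptions, for the same response time.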

  • A good technique to test message based systems is to blast protocols with random state machines and random message formats. There are also very clever ways to do the same with code. Adventures in Video Conferencing Part 3: The Even Wilder World of WhatsApp: I then wrote a library for Android which had the same parameters as memcpy, but fuzzed and copied the buffer instead of just copying it, and put it on the filesystem where it would be loaded by dlopen. I then tried making a WhatsApp call with this setup. The video call looked like it was being fuzzed and crashed in roughly fifteen minutes...Using this setup, I reported one heap corruption issue on WhatsApp, CVE-2018-6344. This issue has since been fixed. 
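
The memcpy trick is simple enough to sketch. The original was a native Android library swapped in via dlopen; what follows is only the mutation idea in Python, with a hypothetical `fuzzing_copy` function and an arbitrary default flip probability. It keeps the contract of a plain copy (same length out as in) while occasionally flipping a bit.

```python
import random


def fuzzing_copy(src: bytes, rng: random.Random, flip_probability: float = 0.01) -> bytes:
    """Copy `src`, flipping one random bit in each byte with the given probability."""
    out = bytearray(src)
    for i in range(len(out)):
        if rng.random() < flip_probability:
            out[i] ^= 1 << rng.randrange(8)  # flip exactly one of the 8 bits
    return bytes(out)


rng = random.Random(1234)  # seeded so a crashing input can be reproduced
frame = bytes(range(256))
mutated = fuzzing_copy(frame, rng, flip_probability=0.05)
```

Because the mutation sits at the copy boundary, every code path that moves a buffer gets fuzzed without the fuzzer needing to understand the protocol. Seeding the generator matters: when something finally crashes you want to replay the exact same corruption.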

  • Ants and software organizations with developed cultures have a lot in common. An ant colony has memories that its individual members don’t have: Colonies live for 20-30 years, the lifetime of the single queen who produces all the ants, but individual ants live at most a year. In response to perturbations, the behaviour of older, larger colonies is more stable than that of younger ones. It is also more homeostatic: the larger the magnitude of the disturbance, the more likely older colonies were to focus on foraging than on responding to the hassles I had created; while, the worse it got, the more the younger colonies reacted. In short, older, larger colonies grow up to act more wisely than younger smaller ones, even though the older colony does not have older, wiser ants.

  • Former Google CEO Eric Schmidt listed the ’3 big failures’ he sees in tech startups today: People stick to who and what they know; Too much focus on product and not on platforms; Companies aren’t partnering up early enough. 

  • It's common wisdom software can't reliably be built using a big bang process. Working software evolves from smaller working systems, no matter how imperfect. That's how nature works too. As perfect as ice can appear, it always starts with a defect: "Without a speck of dust or soot to act as a seed, supercooled water simply will not freeze. But these imperfections can lead to beauty. In “Ice Formations,” photographer Ryota Kajita captures some of the oddities of ice in Alaska’s interior swamps and ponds. In Kajita’s images bubbles are frozen in suspension, plates of ice form strange shapes, and star-shaped cracks peek through the snow. Whether the ice formed too quickly or too slowly, there are interesting signatures left behind." In the same wabi-sabi way, software can be beautiful too.

  • We have been running GraphQL on NodeJS for about 6 months, and it has proven to significantly increase our development velocity and overall page load performance. [Netflix] learnings from adopting GraphQL: The difference now is that the majority of the data is flowing between servers within the same data center. These server to server calls are of very low latency and high bandwidth, which gives us about an 8x performance boost compared to direct network calls from the browser. The last mile of the data transfer from the GraphQL server to the client browser, although still a slow point, is now reduced to a single network call...With GraphQL, we define each piece of data once and define how it relates to other data in our system. When the consumer application fetches data from multiple sources, it no longer needs to worry about the complex business logic associated with data join operations.... Once we defined the entities in our GraphQL server, we use auto codegen tools to generate TypeScript types for the client application...With a GraphQL query wrapper, each React component only needs to describe the data it needs, and the wrapper takes care of all of these concerns...Most of the infrastructure we built to make network requests and transform data was easily transferable from our React application to our NodeJS server without any code changes. We even ended up deleting more code than we added...Passing around objects is what OOP is all about, but unfortunately, GraphQL throws a wrench into this paradigm. When we fetch partial objects, this data cannot be used in methods and components that require the full object.

  • Renovating your house always costs more and takes longer than you think. Building Services at Airbnb, Part 3: Airbnb is moving its infrastructure towards a Service Oriented Architecture. A reliable, performant, and developer-friendly polyglot service platform is an underpinning component in Airbnb’s architectural evolution...We saw multiple waves of increased traffic (external as well as compounded by retries upon request failures)...As each service box reached its limit, it started to timeout more often and caused the upstream P2 service to retry...The service uses a hashing algorithm that is computationally intensive, and we think the root cause was the spike of requests caused a sharp increase of CPU utilization, which starved resources for other requests and caused health reporting to mark individual service nodes as unhealthy...Readers who handled similar systems and incidents must have noticed some common reliability issues: request spikes, system overload, server resource exhaustion, aggressive retry, cascading failures...The resilience measures that we implemented are well-known patterns and have already prevented downtime in the core booking flow...Async Server Request Processing...Request Queuing...Load Shedding...Dependency Isolation and Graceful Degradation...Outlier Server Host Detection...Individual services should maintain their SLOs (service level objectives) during all circumstances, whether it’s deploys, traffic surges, transient network failures, or persistent host failures.
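
Two of the patterns named above, request queuing and load shedding, fit in a few lines. This is a generic sketch and not Airbnb's implementation; `SheddingQueue` and its limit are hypothetical. The queue is bounded, and once it is full new requests are rejected immediately instead of piling up behind work the server cannot finish in time.

```python
import collections


class SheddingQueue:
    """Bounded request queue that rejects new work when full (load shedding)."""

    def __init__(self, max_pending: int):
        self.max_pending = max_pending
        self.pending = collections.deque()
        self.shed = 0  # count of rejected requests, useful as an overload metric

    def offer(self, request) -> bool:
        """Enqueue a request; return False (fail fast) if the queue is full."""
        if len(self.pending) >= self.max_pending:
            self.shed += 1
            return False
        self.pending.append(request)
        return True

    def take(self):
        """Hand the oldest pending request to a worker, or None if idle."""
        return self.pending.popleft() if self.pending else None


queue = SheddingQueue(max_pending=3)
accepted = [queue.offer(f"req-{i}") for i in range(5)]
```

Failing fast keeps latency bounded for the requests that are accepted. It only helps if callers back off rather than retry aggressively, which is the other half of the cascading-failure story in the post.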

  • What makes this work especially exciting to us is that it represents one of the first data structures discovered in the brain, along with a simple algorithm for how the brain may actually perform novelty detection. To detect new odors, fruit fly brains improve on a well-known computer algorithm: Based on the fruit fly’s Bloom filter variant, the team created a new algorithmic framework to predict fruit flies’ novelty responses. They tested their framework on research data collected as flies were presented pairs of odors in succession. The team’s novelty predictions turned out to closely match the actual novelty response of the mushroom body neurons, which validated their framework’s accuracy. Navlakha’s team then tested the framework on several machine learning data sets and found that the fly’s Bloom filter improved the accuracy of novelty detection compared to other types of novelty detection filters.
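
For readers who have not met the "well-known computer algorithm" mentioned here, the following is a minimal Python sketch of an ordinary Bloom filter used as a novelty detector. It is not the fly's variant or the team's framework. The class name `NoveltyBloomFilter`, its parameters, and the graded score (the fraction of an item's hash positions not yet set) are assumptions made for illustration.

```python
import hashlib


class NoveltyBloomFilter:
    """Plain Bloom filter with a graded novelty score (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def novelty(self, item):
        """Fraction of the item's positions that are still unset (0.0 to 1.0)."""
        positions = list(self._positions(item))
        unset = sum(1 for p in positions if not self.bits[p])
        return unset / len(positions)

    def observe(self, item):
        """Score the item's novelty, then remember it by setting its bits."""
        score = self.novelty(item)
        for p in self._positions(item):
            self.bits[p] = 1
        return score
```

A score of 0.0 means every position was already set, so the item was either seen before or is a false positive. A classic Bloom filter never reports a previously observed item as novel.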

  • An Analysis of Network-Partitioning Failures in Cloud Systems: We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: They require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection.

  • Video codec comparison using the dynamic optimizer framework (article): We present a new methodology that allows for more objective comparison of video codecs, using the recently published Dynamic Optimizer framework. We show how this methodology is relevant primarily to non-real time encoding for adaptive streaming applications and can be applied to any existing and future video codecs. By using VMAF, Netflix’s open-source perceptual video quality metric, in the dynamic optimizer, we offer the possibility to do visual perceptual optimization of any video codec and thus produce optimal results in terms of PSNR and VMAF. 

  • The Distributional Little's Law and Its Applications: This paper discusses the distributional Little's law and examines its applications in a variety of queueing systems. The distributional law relates the steady-state distributions of the number in the system or in the queue and the time spent in the system or in the queue in a queueing system under FIFO. We provide a new proof of the distributional law and in the process we generalize a well known theorem of Burke on the equality of pre-arrival and postdeparture probabilities. More importantly, we demonstrate that the distributional law has important algorithmic and structural applications and can be used to derive various performance characteristics of several queueing systems which admit distributional laws. As a result, we believe that the distributional law is a powerful tool for the derivation of performance measures in queueing systems and can lead to a certain unification of queueing theory.
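
As background, the classic mean-value Little's law, L = λW, can be checked numerically. The paper's distributional law relates whole distributions and is not shown here. The Python sketch below is an illustration under stated assumptions: a single-server FIFO queue that starts empty at time 0, hypothetical helper names `fifo_sojourns` and `littles_law_terms`, and made-up arrival and service times.

```python
def fifo_sojourns(arrivals, services):
    """Return (arrival, departure) pairs for a single-server FIFO queue."""
    jobs = []
    server_free_at = 0.0
    for arrival, service in zip(arrivals, services):
        start = max(arrival, server_free_at)  # wait if the server is busy
        server_free_at = start + service
        jobs.append((arrival, server_free_at))
    return jobs


def littles_law_terms(jobs):
    """Return (L, lam, W) measured from time 0 to the last departure."""
    horizon = max(departure for _, departure in jobs)
    events = sorted(
        [(a, 1) for a, _ in jobs] + [(d, -1) for _, d in jobs]
    )
    area = 0.0       # integral of the number in system over time
    in_system = 0
    prev = 0.0
    for t, delta in events:
        area += in_system * (t - prev)
        in_system += delta
        prev = t
    L = area / horizon                               # time-average number in system
    lam = len(jobs) / horizon                        # arrival rate
    W = sum(d - a for a, d in jobs) / len(jobs)      # mean time in system
    return L, lam, W


jobs = fifo_sojourns([0.0, 1.0, 2.0, 6.0], [2.0, 2.0, 2.0, 1.0])
L, lam, W = littles_law_terms(jobs)
print(L, lam * W)  # the two values agree up to floating-point rounding
```

The equality holds exactly on any interval where the system starts and ends empty, because the area under the number-in-system curve equals the sum of the individual sojourn times.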

  • The dry history of liquid computers: A liquid can be used to represent signals, actuate mechanical computing devices and to modify signals via chemical reactions. We give a brief overview of liquid based computing devices developed over hundreds of years. These include hydraulic calculators, fluidic computers, micro-fluidic devices, droplets, liquid marbles and reaction-diffusion chemical computers. See also, Your whole office could be a computer thanks to sculpted Wi-Fi waves.