Stuff The Internet Says On Scalability For February 15th, 2019

Wake up! It's HighScalability time:

Opportunity crossed over the rainbow bridge after 15 years of loyal service. "Our beloved Opportunity remains silent."

Do you like this sort of Stuff? I'd greatly appreciate your support on Patreon. Know anyone who needs cloud? I wrote Explain the Cloud Like I'm 10 just for them. It has 39 mostly 5 star reviews. They'll learn a lot and love you forever.

  • 200 million: YouTube videos recommended on the home page per day; $9.3 billion: AI funding, a 27% increase; 70%: Microsoft security bugs that are memory safety issues; 11: new version of Perl; 24%: serverless users who are new to cloud computing; 1 million: SpaceX satellite uplinks; $500K: ticket to Mars; $13 billion: Google's new datacenter construction; 59%: increase in Tesla Autosteer accidents; $0.30: Reddit's per-user revenue; 38%: Airbnb bugs preventable by using types; 60K: data breaches reported since GDPR; 350: theoretical max stone skips.

  • Quotable Quotes:
    • @gchaslot: Brian's hyper-engagement slowly biases YouTube: 1/ People who spend their lives on YT affect recommendations more 2/ So the content they watch gets more views 3/ Then youtubers notice and create more of it 4/ And people spend even more time on that content. And back at 1. This vicious circle was also observed with http://tay.ai , and it explains why the bot became racist in less than 24 hours. Example of YT vicious circle: two years ago I found out that many conspiracies were promoted by the AI much more than truth, for instance flat earth videos were promoted ~10x more than round earth ones 🌎🤯 I was not the only one to notice AI harms. @tristanharris talked about addiction. @zeynep talked about radicalization. @noUpside, political abuse and conspiracies. @jamesbridle, disgusting kids videos. @google's @fchollet, the danger of AI propaganda. There are 2 ways to fix vicious circles like with "flat earth" 1) make people spend more time on round earth videos 2) change the AI YouTube’s economic incentive is for solution 1). After 13 years, YouTube made the historic choice to go towards 2) Will this fix work?
    • crazyforbytes: Does anyone else not feel so good about a future of computing where everything but the application layer is rented from Jeff Bezos?
    • Scott Aaronson: Now what they're trying to do with every quantum algorithm is choreograph things in such a way that for each wrong answer to your computational problem, some of the paths leading to that answer have positive amplitude and others have negative amplitude, so on the whole, they cancel each other out, whereas the paths leading to the right answer should all be ‘in phase’ with each other. They should all have amplitudes of the same sign, say all positive or all negative. If you can mostly arrange for that to happen, then when you measure the state of your quantum computer, you will see the right answer with a large probability. If you don't see the right answer, you can simply keep repeating the computation until you do.
    • @dhh: The last thing podcasting needs is another 800-lbs gorilla aggregating, dominating, and blowing up the free, open market with exclusives. Regulators, please learn from your recent mistakes aiding and abetting Facebook’s monopolist moves. You still haven’t unrolled those!
    • Gojko Adzic: cloud operators are now mining meta-data about platform usage, trying to sell even more convenience at a higher price. Want an EMR cluster and a database to create sales dashboards? Why not just subscribe to a dashboards service and not worry about databases, indexes and map-reduce any more? By moving up the stack, providers are now starting to offer business-application components, coming back full circle to original SaaS, but with an interesting twist...Moving from SaaS over IaaS to PaaS, by using FaaS, we got to BaDaaS: Business action deployment as a service.
    • @samphippen: “Does it scale?” Is a boring question. “Does it scale enough for our users right now in a way that lets us continue to work on what’s most important” is a better one.
    • @zebulgar: First identify whether the open role is value creating or value protecting. If it's value creating, focus on ideas they've taken from start to finish even if they aren't related to the role. If it's value protecting, focus on whether they have lots of prior domain expertise related to the value you expect this role to protect.
    • @ManishEarth: I was today years old when I realized the name "Raft" comes from it being a distributed *log* protocol
    • @hichaelmart: AWS Lambda blows my mind every time I flex it. Currently executing 2000 parallel 15min functions doing a hyperparameter search with fasttext. 🤯 Would take me frikken hours to do this any other way.
    • Anonymous: Some of the most prominent hedge fund managers of the last few decades—Steve Cohen, Paul Tudor Jones—are going against type and launching technology-driven quantitative investment funds. They employ physicists and computer scientists to write algorithms to invest money, because that’s what investors want. You’re seeing a massive arms race across hedge funds to rebrand themselves in that direction.
    • Michael Wittig: There are cases where on-demand is significantly more expensive compared to provisioned with Auto Scaling. My rule of thumb: The spikier your workload, the higher the savings with on-demand. Workloads with zero requests also benefit. The reason why we saw such significant savings with marbot is that our workload goes down to almost zero requests/second for half of the day in production and is mostly zero 24/7 in our test environment. My suggestion is to switch to on-demand for one day and compare your costs with the day before.
    • icholy: Saying you don't need types because you're an expert is the same as saying you don't need tests because you don't write buggy code.
    • alpha_squared: The way I've generally thought about it and have seen it done successfully in practice is to create a microservice for each component that scales independently. Auth and payments make sense because they scale independently. You may get authorization requests and financial transactions at a different rate than traffic to your application itself.
    • jedberg: You probably won't find much that will help you because there really is no "right time". I've done a lot of surveys, and what I've found is that if you're running microservices, about 25% of your engineering time will be spent managing the platform/DevOps/Not writing product code. That time can either be 25% of every engineer, or 25% of your engineering staff. In either case, the best time to do it would be when you feel the increased speed and agility is outweighed by the added overhead. The speed and agility comes from small teams being able to operate mostly independently.
    • @Havokmon: Yes, @VFEmail is effectively gone. It will likely not return. I never thought anyone would care about my labor of love so much that they'd want to completely and thoroughly destroy it.
    • EvanAnderson: I will continue to stand on my soapbox and proclaim that backup has to include an offline component (preferably one that's verified independently). An attacker has to be willing to bring "kinetic" means to bear if you have offline backup media.
    • Brent Ozar: Azure SQL DB’s log throughput seems kinda slow for $21,468 per month.
    • spricket: I agree with the author completely. I worked on a fairly large system using event sourcing, it was a never-ending nightmare. Maybe with better tooling someday it will be usable, but not now.
    • @QuinnyPig: The reason I'm bearish on Kubernetes in the long term comes in two parts. The first is that it too will slip beneath the waves of "plumbing I don't have to care about." That's the easy prediction. The second, and much likelier to result in hate mail, is that I've yet to have a conversation that answers the question "what is the business value of Kubernetes?"
    • vidarh: I posted a comment elsewhere in this thread about our use of events, and it boils down to selectively picking the entities we need to be able to reason about past states of, and storing the new states of those, and then deriving views from that state.
    • Jens: Developers Are The Problem, Not Monoliths
    • sametmax: I believe the most amazing feature of JS is the ability to make their devs think they got it all figured out.
    • @danielpearson: Multiple VC-backed founders have told me that they will bootstrap their next thing b/c VC path is crazy pressure. Multiple bootstrapped founders have commiserated with me about how brutal it is to have no capital or margin for error. No matter the path, the path is hard!
    • @davecheney: In the brave new world of Cloud Native®, every I/O is network I/O, every network I/O a HTTP request, and every network port is port 443.
    • Matt Reynolds: At the heart of the Directive on Copyright are two divisive articles – Article 11 and Article 13 – that have been dubbed the “link tax” and “meme ban” articles respectively. Critics of the Directive on Copyright argue that these articles mean that platforms will have to pay a fee to share a link to a news article and have to start filtering and removing memes.
    • xrd: I'm thinking of Firebase. For example, with Firebase, my clients (web or Android, for example) don't ever talk to my backend directly. I have a serverless firebase function that accepts a trigger when there is an update to the firestore (cloud database) or storage (like AWS S3). I get a JSON object (which is declarative) that I can use to send the file or data on to a new process. This decoupling makes my firebase function really small: it only parses JSON and sends an event, perhaps to another Google service. Compare this to a large application where you are marshalling objects from the HTTP POST request into temp objects inside your application language. Making that work with all clients, within all server contexts, is a lot of code which I hated to write.
    • Jeremy Daly: If there is high-utilization and you are running this constantly (24 hours a day), a container might be a good fit. You would likely need to coordinate several containers, however, just in case there was a failure and the container crashed. You wouldn’t have this issue with Lambda, except for potentially a minute to restart the service automatically. If you had bursts throughout the day, such as peak times, running a container 24 hours a day might be overkill.
    • basilica: Unless you have literally the largest dataset in the world, training a neural network on it from scratch will probably give worse results than using a huge pretrained net as a feature extractor and training a simple linear model on that. This is absolutely remarkable, because training a linear model is way easier. We're talking ten lines of Python to implement, and training times measured in minutes...embeddings are going to be the main way most data scientists get value out of deep learning, which is really exciting.
    • simionescu: The problem is also that Lambda works really poorly with anything that is NOT Cognito and DynamoDB. For example, we were using some CouchBase we deployed ourselves - however, that meant that the Lambda now had a network interface in a private subnet, and start times skyrocketed, especially for concurrent access. When we also started doing our own authentication with yet another Lambda, request times for cold lambdas (which also means every new concurrent connection, remember) almost doubled.
    • Sarath Balachandran: we find that U.S. VCs that invest in Indian immigrant entrepreneurs in the U.S. subsequently invest more in India—in the exact region where the immigrants are from and with a lower likelihood of using a local co-investor. These effects are stronger for VCs with ethnic Indian partners. We find no such effects from investing in ethnic (non-immigrant) Indian entrepreneurs, who lack direct knowledge and connections in India.
    • Ory Segal: This is a classic case of application-layer denial of service, where instead of bombarding with huge volumes of traffic, you hog system resources by exploiting inefficient application code. The attack was run from a simple Macbook Pro with a terrible keyboard, connected to a DSL line, and it only took a few seconds to run. 
    • Byron Reese: How is it that for 200 years we have lost half of all jobs every half century, but never has this process caused unemployment? Not only has it not caused unemployment, but during that time, we have had full employment against the backdrop of rising wages.
    • @kellabyte: Oh my god our industry makes no sense. Bootstrap just spent a gigantic effort to remove jQuery only to break everything and implement its own using the same shadow dom. Baffles me so much what us engineers will do to completely toss working mature code for very little benefits.
    • Greg Ferro: As a general principle, the biggest challenge is skills transformation. Moving from the cost management business model of today to a core business activity requires new skills to be added. It's not enough to spend 5 weeks in a classroom plus 5 weeks of self study to become a vendor professional. People who want to call themselves professionals must have more skills than the ability to operate a vendor product or be narrowly focussed on a specific area.
    • Stanislav Costiuc: a meta-game is a game beyond the game - a loop wrapping itself around the core gameplay experience. It can take the form of systems created by developers, or strategies used by players, and the latter needs regular shake-ups to prevent the meta from becoming stale, which can lead to loss of interest in the game among a hefty part of the player base.
    • @Carnage4Life: Apple wants to keep 50% of revenue from its subscription news service. As iPhone sales growth ends and company narrative shifts to service revenue growth, expect more user and partner hostile cash grabs.
    • Charlie Demerjian: One of the most interesting technologies shown in 2018 was Intel's Foveros chip stacking. It may not have been the holy grail of semiconductors but it ranks as at least an anointed stein. Let's be clear about this, Foveros is something between a very advanced iteration of what was and a big step toward the end goal of arbitrary chip stacking, but it doesn't quite make it all the way.
    • Jim Witham: With each data center using about the same amount of energy as a mid-size city, this adds up to an astronomical 416 terawatt-hours of electricity used by the industry.
    • Kevin Kelly: We are now at the dawn of the third platform, which will digitize the rest of the world. On this platform, all things and places will be machine-­readable, subject to the power of algorithms. Whoever dominates this grand third platform will become among the wealthiest and most powerful people and companies in history, just as those who now dominate the first two platforms have. Also, like its predecessors, this new platform will unleash the prosperity of thousands more companies in its ecosystem, and a million new ideas—and problems—that weren’t possible before machines could read the world.
    • commandlinefan: After 25 years working I've come to terms with the realization that, no matter what I do - even if it's 10 times more than the people around me - they'll _always_ ask for more. They're not telling me that I'm not working hard enough because they don't actually think I'm not working hard enough, they tell me I'm not working hard enough because they're soulless bloodsuckers incapable of human emotion or empathy who are just idly curious if saying "you're not working hard enough" will magically result in even more.
    • @brandur: Getting a patch reviewed on the Postgres hackers mailing list floors me every time — just at the next level in terms of attention to detail, effort invested, and thoughtfulness. I'm 10+ years into working in software professionally and I've never seen anything else like it.
    • Alex Petrov: The RUM conjecture that our DBLab folks came up with states that setting an upper bound for two of the dimension overheads also sets a lower bound for the third one. In other words, the three parameters [optimized for either reads, updates or memory overhead] form a competing triangle: an improvement on one side means compromises on the other two.
    • Jordana Cepelewicz: Tsao and his colleagues were excited because, they posited, they had begun to tease out a mechanism behind subjective time in the brain, one that allowed memories to be distinctly tagged. “It shows how our perception of time is so elastic,” Shapiro said. “A second can last forever. Days can vanish. It’s this coding by parsing episodes that, to me, makes a very neat explanation for the way we see time. We’re processing things that happen in sequences, and what happens in those sequences can determine the subjective estimate for how much time passes.”
    • ryg: In conventional hash tables, we have different probing schemes, or separate chaining with different data structures. For a cache table, we have our strict bound P on the number of entries we allow in the same bucket, and practical choices of P are relatively small, in the single or low double digits. That makes the most natural representation a simple 2D array: N rows of hash buckets with P columns, each with space for one entry, forming an N×P grid. Having storage for all entries in the same row right next to each other leads to favorable memory access patterns, better than most probe sequences used in open addressing or the pointer-heavy data structures typically used in separate chaining. [A toy Python sketch of this layout follows the quotes list.]
    • Paraison: What we are providing is a low latency, deterministic interconnect based on PCI-Express that can solve a lot of different problems. One of the newest things we are doing is called device lending, which we are calling Smart I/O. With device lending, you have some systems with GPUs, in others you have NVM-Express drives, and maybe on others you have an Intel network interface with SR-IOV virtual networking support. With device lending, you can treat those PCI-Express devices as a pool of resources and make them available to any node in the PCI-Express cluster, and to any node it will look, to the operating system and to the firmware, like those devices are installed locally in that node.
    • Taeer Bar-Yam: The development of the Multiscale Law of Requisite Variety leads to the notion of a trade-off between coordination and flexibility. A system can have many parts coordinated to allow large scale responses, or it can have those parts independent to allow independent and varied responses. For example, if people have the capacity to know how to construct up to ten different things, then five people can each learn ten ways of making a small hut by themselves, so that there are fifty different small buildings that can be built, or they can coordinate to learn to construct ten types of larger buildings.
    • Tim O'Reilly: Consider these companies: Mailchimp, funded by $490 million in annual revenue from its 12 million customers, profitable from day one without a penny of venture capital; Atlassian, bootstrapped for eight years before raising capital in 2010 after it had reached nearly $75 million in self-funded annual revenue; and Shutterstock, which took in venture capital only after it had already bootstrapped its way to becoming the largest subscription-based stock photo agency in the world. (In the case of both Atlassian and Shutterstock, outside financing was a step toward liquidity through a public offering, rather than strictly necessary to fund company growth.) All of these companies made their millions through consumer-focused products and patience, not blitzscaling.
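
    Following up on ryg's cache-table quote above: the N×P bounded-bucket layout is simple enough to sketch. A toy Python version (the class name, parameters, and random-eviction policy are invented for illustration; a real implementation would back the grid with one flat array for cache-friendly row scans):

```python
import random

class CacheTable:
    """N buckets of at most P entries each; full buckets evict, never grow."""

    def __init__(self, n_buckets=1024, p=8):
        self.p = p
        self.rows = [[] for _ in range(n_buckets)]  # row = one bucket of <= p entries

    def _row(self, key):
        return self.rows[hash(key) % len(self.rows)]

    def get(self, key):
        # Probe only the P slots of one bucket: a single short, contiguous scan.
        for k, v in self._row(key):
            if k == key:
                return v
        return None

    def put(self, key, value):
        row = self._row(key)
        for i, (k, _) in enumerate(row):
            if k == key:
                row[i] = (key, value)  # update in place
                return
        if len(row) >= self.p:
            # Bucket full: evict an existing entry instead of rehashing or chaining.
            row.pop(random.randrange(self.p))
        row.append((key, value))
```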

  • DigitalOcean builds out their cloud with a fully managed PostgreSQL option starting at $15/month. Worry-free database hosting. There's a lot of positive support. Some concerns. gr2020: Love this new offering. What I don’t love is they are charging for egress bandwidth ($0.01/GB), even in the same data center. I can understand it for outbound to internet or other data centers, but this is hard to swallow for the same facility. danpalmer: Our database primary is around the same price as the highest spec DO are offering, but for that we get 3x memory, 6x CPU, and 2x the disk. kyledrake: It costs $100/mo minimum to get failover support, pretty high for a starter package. eddiezane: Managed Databases are built on top of our core compute platform which use local SSD's. You should get that same awesome performance!

  • What happens when memcached can make use of high-speed flash SSDs and Optane NVMe devices? Caching beyond RAM: Riding the cliff: The extra tests we show here demonstrate a baseline of a worst-case scenario for the performance of a single machine with one or more devices. With request rates around 500,000 per second with under a millisecond average latency, most workloads fit comfortably. While expanding disk space works well, further development is needed to improve throughput with multiple high performance devices.

  • Great OpenFaaS example with source code. How to build a Serverless Single Page App: building a Single Page App (SPA) with serverless functions using Vue.js for the front-end, Postgres for storage, Go for the backend, and OpenFaaS with Kubernetes for a resilient, scalable compute platform. It uses Postgres 10 hosted on DigitalOcean using their new DBaaS service. It costs around 15 USD a month at the time of writing and gives a node which can accept 22 concurrent connections and has 1GB RAM and 10GB storage.
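
    If you haven't seen an OpenFaaS function before, the shape is pleasantly small. A minimal sketch using the stock python3 template (the article's actual functions are in Go; the JSON fields here are invented):

```python
# handler.py: the entrypoint generated by `faas-cli new my-fn --lang python3`.
# The OpenFaaS watchdog invokes handle() with the raw request body and
# returns whatever string comes back as the HTTP response body.
import json

def handle(req):
    payload = json.loads(req) if req else {}   # assume a JSON body
    name = payload.get("name", "world")
    return json.dumps({"message": f"Hello, {name}!"})
```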

  • Benchmarking Lambda’s Idle Timeout Before A Cold Start: Lambdas in us-west-2 have a consistent idle timeout of 26min for all Lambda sizes; Lambdas in us-east-1 have varying idle timeouts, ranging from 22min to 65min; Lambdas in us-east-1 with smaller memory sizes (under 1536MB) have the highest upper idle timeouts; the idle timeout differs between AWS regions, so you should not assume that benchmarks done in one region stay consistent across regions.
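
    You can reproduce this kind of benchmark cheaply: module-level state in a Lambda survives warm invocations but not cold starts. A sketch of the probe function (field names are mine; invoke it at increasing intervals and note when cold flips back to true):

```python
# lambda_function.py: reports whether this invocation hit a cold start.
# Module-level code runs once per container; a reaped (idled-out) container
# means the next invocation re-runs it, so invocation #1 == cold start.
import time

_started = time.time()
_invocations = 0

def lambda_handler(event, context):
    global _invocations
    _invocations += 1
    return {
        "cold": _invocations == 1,                    # first call in this container
        "container_age_s": round(time.time() - _started, 1),
    }
```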

  • This example is almost crazy to the point of satire, but it's the brilliant kind of crazy you'll appreciate if you're in an open frame of mind. GoogleCloudPlatform/microservices-demo. It's crazy because it shows how to build a website using a 10-tier microservices architecture—Kubernetes/GKE, Istio, Stackdriver, Skaffold, gRPC and OpenCensus—where each tier is written in a different language—Go, C#, Node, Python, and Java—all connected using gRPC and deployable with little or no configuration. Who would do such a thing? Nobody. But that's not the point. If you've decided on a microservices architecture you've already decided to support a high level of complexity. This just shows you how to do it using Google tools. As is common these days, most of the instructions are about the 99 different ways to package the same stuff together. Use any webstack these days and most of the instructions are on packaging. We won't get anywhere unless we can escape packaging hell and just build stuff again. Microsoft has a similar example.

  • The secret is pretending Go is C. Going Infinite, handling 1M websockets connections in Go. Use epoll. Avoid goroutines. Worry about memory. Worry about limits. But it works.
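
    The same shape translates to other runtimes. A minimal, Linux-only Python sketch of the pattern (one epoll loop, no thread per connection; it echoes raw bytes rather than speaking the websocket protocol):

```python
# Single-threaded epoll echo server: one poller owns every connection.
import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9000))
server.listen(1024)
server.setblocking(False)

poller = select.epoll()
poller.register(server.fileno(), select.EPOLLIN)
conns = {}  # fd -> socket

while True:
    for fd, events in poller.poll(timeout=1):
        if fd == server.fileno():
            conn, _ = server.accept()                  # new connection
            conn.setblocking(False)
            poller.register(conn.fileno(), select.EPOLLIN)
            conns[conn.fileno()] = conn
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if data:
                conns[fd].send(data)                   # echo; real code parses frames
            else:                                      # empty read: peer closed
                poller.unregister(fd)
                conns.pop(fd).close()
```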

  • Don't worry too much about code quality; what is good today might be trashed tomorrow. Startup Pre-series A tech choices you can't compromise on: We embraced serverless and it's a choice that has been paying off. The learning curve is flat, and the speed of going to production without worrying about the necessary infrastructure or service discovery beats any other technology I've used before; however, some third-party integrations require 'real servers'...I've asked myself a few times, would we have been better off with a monolith? What if instead of ~150 lambdas we'd be working on a Rails/Django monolithic application?...We copied and pasted, we didn't write many tests; the good thing about lambdas (as with microservices, but here we are at the 'nano-services' size) is that it's easy to trash something and rewrite it later. We limited homegrown npm packages (not enough!) and architectural dependencies between lambdas...For us this implies using IOPipe on all our lambdas, and always setting up at least one alert in both non-prod and prod environments for every new lambda....We invested in building an event-driven system, so we can replay events and rebuild our data if an unexpected condition happens...As much as we value freedom and empowerment, we value overall simplicity more, which also implies no polyglot, no multi-cloud: we love AWS, Node.JS and React. Also, How We Moved Towards Serverless Architecture

  • Where has all the productivity gone? In hard to see places. Big Data-Driven Decision-Making At Domino's Pizza: We're becoming more of a digital e-commerce operation, rather than a traditional quick service restaurant business. We're leading the way, I think, in what we're doing with social and what we're doing with our digital platforms. And importantly, I think, it's structuring the digital service with the insights we have on customers and practices that's leading our customers to a better, faster and more quality-based experience on our digital platforms. It has surprised us how quickly we've transitioned from a traditional ordering mechanism into an e-commerce based profile. Also, IT Improves Productivity!

  • Isn't everything hard? It's just a matter of it having enough benefit to be worth the work. Don't Let the Internet Dupe You, Event Sourcing is Hard: The big Event Sourcing "sell" is the idea that any interested sub-systems can just subscribe to an event stream and happily listen away and do its work...In practice, this manages to somehow simultaneously be both extremely coupled and yet excruciatingly opaque. The idea of keeping a central log against which multiple services can subscribe and publish is insane. You wouldn't let two separate services reach directly into each other's data storage when not event sourcing – you'd pump them through a layer of abstraction to avoid breaking every consumer of your service when it needs to change its data – However, with the event log, we pretend this isn't the case...in effect, the raw event stream subscription setup kills the ability to locally reason about the boundaries of a service...Event Sourcing is not a "Move Fast and Break Things" kind of setup when you're a green field application. It's more of a "Let's all Move Slow and Try Not to Die" sort of setup...A super common piece of advice in the ES world is that you don't event source everywhere *. This is all well and good at the conceptual level, but actually figuring out where and when to draw those architectural boundaries through your system is quite tough in practice...Software changes, requirements change, focuses shift. Those immutable "facts," along with your ability to process them, won't last as long as you expect...A good ol' fashioned history table gets you 80% of the value of a ledger with essentially none of the cost. Also, Serverless Event Sourcing in AWS (Lambda, DynamoDB, SQS)
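
    The history-table alternative the author favors really is small. A sketch using SQLite (schema and trigger are illustrative, not from the article):

```python
# A plain history table: every write to `accounts` also appends the new row
# state to `accounts_history`, so past states stay queryable with ordinary SQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE accounts_history (
    id INTEGER, balance INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER accounts_audit AFTER UPDATE ON accounts
BEGIN
    INSERT INTO accounts_history (id, balance) VALUES (NEW.id, NEW.balance);
END;
""")
db.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
db.execute("UPDATE accounts SET balance = 50 WHERE id = 1")
db.execute("UPDATE accounts SET balance = 75 WHERE id = 1")
# Past states are queryable without replaying an event log:
print(db.execute("SELECT balance, changed_at FROM accounts_history").fetchall())
```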

  • Dozens of FOSDEM (Free and Open source Software Developers' European Meeting) 2019 videos are now available.

  • Good series of articles on Database Internals: Part 1: DistSys Vocabulary, Part 2: Path to Atomic Broadcast, Part 1: Flavors of IO, Part 2: More Flavours of IO, Part 3: LSM Trees, Part 4: B-Trees and RUM Conjecture, Part 5: Access Patterns in LSM Trees

  • After reading this fantastic interview—Quantum Computing, Capabilities and Limits—I'm in a superposition of understanding and not understanding quantum computing. There are no shortcuts: a qubit, which is the quantum version of a bit, can be in what we call a superposition of the zero and one state. So it's neither definitely zero nor definitely one. And the main problem is that people always want to round this down to something that they already know...when you measure a qubit, you only see one result. You see a zero or a one; you don't see both, and what quantum mechanics is, at the core, is a way of calculating the probability that you're going to see one outcome or another one when you make an observation. Now, the key point is that quantum states don't obey the normal rules of probability that we know. So a probability is a number from zero to one, so you could have a 30% chance of rain tomorrow, but you never have a -30% chance, right, that would be nonsense, okay? But quantum mechanics is based on numbers called amplitudes, which can be positive or negative. In fact, they can even be complex numbers. So when you make a measurement, these amplitudes turn into probabilities, and a larger amplitude becomes a larger probability of seeing something, but when a system is isolated, the amplitudes can evolve by rules that are very unfamiliar to everyday experience. That is what pretty much everything you've ever heard about the weirdness of the quantum world boils down to.
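
    The interference point is easy to check numerically. A toy sketch (two paths per outcome, amplitudes picked by hand):

```python
# Amplitudes can be negative and cancel; probabilities are |amplitude|^2.
# Two computational paths lead to each answer; amplitudes sum per answer.
paths = {
    "wrong answer": [0.5, -0.5],   # opposite signs: destructive interference
    "right answer": [0.5, 0.5],    # same sign ("in phase"): constructive
}

amps = {answer: sum(a) for answer, a in paths.items()}
probs = {answer: abs(a) ** 2 for answer, a in amps.items()}
print(probs)  # {'wrong answer': 0.0, 'right answer': 1.0}
```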

  • Maybe it's time? Database as Filesystem. No, it's not. The first try was 450x slower than XFS because log files were written to a slow disk. On an SSD it was 100x slower. Using a newer version of FUSE and other changes, it was only 9x slower than XFS and 7x slower than NFS, but CPU utilization was 2.5x higher. You do get replication, find is as fast as a DB query, you can do full-text queries, and adding new features is relatively easy.

  • Moving from the old hotness to the new hotness. Moving from Ruby to Rust: When moving larger chunks of code into Rust, we noticed increased performance improvements, which we were carefully monitoring. When moving smaller modules to Rust, we didn't expect much improvement: in fact, some code became slower because it was being called in tight loops, and there was a small overhead to calling Rust code from the Ruby application...Our project was successful: moving from Ruby to Rust dramatically sped up our dispatch process and gave us more head-room in which we could try implementing more advanced algorithms. Also, Rust at speed — building a fast concurrent database

  • Licenses have consequences. GNU Health Federation Information System moves from MongoDB to PostgreSQL. Also, Red Hat Satellite to standardize on PostgreSQL backend.

  • Swift is a fun language. It's just for iOS, right? Not exactly. Open-Sourcing The Swift Talk Backend: For us, the biggest downside is that Swift on Linux isn’t battle-tested. We ran into a few bugs (most notably around URLSession/URLRequest behavior), but they were easily solved or worked around...We aren’t the first to build a project like this in Swift. Point-Free was written in Swift at launch, and it continues to be an inspiration. Also, server-side swift.

  • Code in Java? Have long running threads? This debug tale is for you: SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue

  • An opinionated overview of the Cloud Native Application Architecture. Design principles: Designed As Loosely Coupled Microservices; Developed With Best-of-breed Languages And Frameworks; Centred Around APIs For Interaction And Collaboration; Stateless And Massively Scalable; Resiliency At The Core Of the Architecture; Packaged As Lightweight Containers And Orchestrated; Agile DevOps & Automation Using CI/CD; Elastic — Dynamic scale-up/down. Strategies for implementing resiliency: Retry transient failures; Load balance across instances; Degrade gracefully; Throttle high-volume tenants/users; Use a circuit breaker; Apply compensating transactions.
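
    Two of those resiliency strategies fit in a few lines each. A hedged Python sketch combining retry-with-backoff and a trivial circuit breaker (thresholds and names are invented; production code would track half-open probes more carefully):

```python
import random
import time

class CircuitBreaker:
    """Trips open after `max_failures` consecutive failed calls, then fails
    fast for `reset_after` seconds before letting one probe call through."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, retries=3, base_delay=0.1):
        if self.opened_at and time.time() - self.opened_at < self.reset_after:
            raise RuntimeError("circuit open; failing fast")
        for attempt in range(retries):
            try:
                result = fn(*args)
                self.failures, self.opened_at = 0, None  # success resets the breaker
                return result
            except Exception:
                if attempt < retries - 1:
                    # Exponential backoff with jitter before the next attempt.
                    time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()  # trip the breaker
        raise RuntimeError("call failed after retries")
```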

  • As consumers the internet allows us to connect in thousands of different ways—yet we use a few different social networks. As developers the internet allows us to develop on thousands of different platforms—yet we pick a few different clouds. And The Unreasonable Effectiveness of Deep Feature Extraction: There's a growing consensus that deep learning is going to be a centralizing technology rather than a decentralizing one. We seem to be headed toward a world where the only people with enough data and compute to train truly state-of-the-art networks are a handful of large tech companies.
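
    basilica's "ten lines of Python" estimate in the quotes above is roughly literal. A sketch of the feature-extraction pattern, assuming torchvision and scikit-learn are available; the data is a placeholder:

```python
# Frozen pretrained net as a feature extractor, simple linear model on top.
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

resnet = models.resnet50(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
resnet.fc = torch.nn.Identity()                    # drop the head; emit 2048-d features
resnet.eval()

def embed(images):  # images: float tensor of shape (N, 3, 224, 224)
    with torch.no_grad():
        return resnet(images).numpy()

# Placeholder data; in practice these are your preprocessed images and labels.
X, y = torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,)).numpy()
clf = LogisticRegression(max_iter=1000).fit(embed(X), y)
```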

  • The Curious Case of BEAM CPU Usage: Turns out, busy waiting in BEAM is an optimization that ensures maximum responsiveness. In essence, when waiting for a certain event, the virtual machine first enters a CPU-intensive tight loop, where it continuously checks to see if the event in question has occurred...The highest impact was observed on the instance with the most available CPU capacity. At the same time, we did not observe any meaningful difference in performance between VMs with busy waiting enabled and disabled.
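
    The spin-then-block pattern itself is tiny. A schematic Python version (the spin window is an invented parameter; BEAM's scheduler is far more sophisticated):

```python
import threading
import time

def wait_for(event: threading.Event, spin_for=0.001):
    """Busy-wait briefly for low latency, then fall back to a blocking wait.
    Trades CPU for responsiveness, the same trade-off BEAM makes."""
    deadline = time.monotonic() + spin_for
    while time.monotonic() < deadline:   # hot loop: checks as fast as possible
        if event.is_set():
            return True
    return event.wait()                  # give the CPU back to the OS
```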

  • Paper review. Sharding the Shards: Managing Datastore Locality at Scale with Akkio: The paper advocates for µ-shards (micro-shards), very fine grained datasets (from ~1KB to ~1MB), to serve as the unit of migration and the abstraction for managing locality. A µ-shard contains multiple key-value pairs or database table rows, and should be chosen such that it exhibits strong access locality. Examples could be Facebook viewing history to inform subsequent content, user profile information, Instagram messaging queues, etc. Why not shards, but µ-shards? Shards are for datastores; µ-shards are for applications. Shard sizes are set by administrators to balance shard overhead, load balancing, and failure recovery, and they tend to be on the order of gigabytes. Since µ-shards are formed by the client application to refer to a working data set, they capture access locality better. They are also more amenable to migration: µ-shard migration has an overhead many orders of magnitude lower than that of shard migration, and its utility is far higher. There is no need to migrate a 1GB partition when the access is to a 1MB portion.
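
    The key indirection is small: clients address µ-shards, and a location service maps each µ-shard to its current home, so migration is a data copy plus a pointer flip. A toy sketch (names invented):

```python
# Toy location service: clients look up a mu-shard's placement, and
# "migration" is an atomic update of the mapping after the data copy.
class LocationService:
    def __init__(self):
        self.placement = {}  # mu_shard_id -> datacenter/shard name

    def locate(self, mu_shard_id, default_dc="us-east"):
        return self.placement.get(mu_shard_id, default_dc)

    def migrate(self, mu_shard_id, dest_dc):
        # Real Akkio copies the ~1KB-1MB mu-shard, then flips the pointer;
        # moving megabytes beats rebalancing a multi-gigabyte shard.
        self.placement[mu_shard_id] = dest_dc

locator = LocationService()
locator.migrate("user:42:viewing_history", "eu-west")
print(locator.locate("user:42:viewing_history"))  # -> eu-west
```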

  • Cell-based Reference Architecture: This document describes a reference architecture for modern agile digital enterprises. This reference architecture offers a logical architecture based on a disaggregated cloud-based model that can be instantiated to create an effective and agile approach for digital enterprises, deployed in private, public or hybrid cloud environments. In this paper we present the architecture, the approach to applying this architecture, and existing approaches that fit into this architecture. The architecture defined in this paper can be mapped to current architectures as well as used to define new architectures. It is designed to help move from the “as-is” towards the “to-be”.

  • Massively Parallel Hyperparameter Tuning (article): In this work, we tackle this challenge by introducing ASHA, a simple and robust hyperparameter tuning algorithm with solid theoretical underpinnings that exploits parallelism and aggressive early-stopping. Our extensive empirical results show that ASHA outperforms state-of-the-art hyperparameter tuning methods; scales linearly with the number of workers in distributed settings; converges to a high quality configuration in half the time taken by Vizier (Google’s internal hyperparameter tuning service) in an experiment with 500 workers; and beats the published result for a near state-of-the-art LSTM architecture in under 2× the time to train a single model.
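
    The core loop, synchronous successive halving, fits in a few lines; ASHA's contribution is running the promotions asynchronously across workers. A sketch with an invented objective:

```python
import random

def successive_halving(configs, evaluate, budget=1, rounds=3, keep=0.5):
    """Evaluate all configs on a small budget, keep the best fraction,
    double the budget, repeat: the aggressive early-stopping at the heart
    of ASHA (which promotes configs asynchronously instead of per-rung)."""
    survivors = list(configs)
    for _ in range(rounds):
        survivors.sort(key=lambda c: evaluate(c, budget), reverse=True)
        survivors = survivors[: max(1, int(len(survivors) * keep))]
        budget *= 2  # survivors earn more training time at the next rung
    return survivors[0]

# Made-up objective: pretend higher learning rates score better, plus noise.
best = successive_halving(
    configs=[{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)],
    evaluate=lambda cfg, b: cfg["lr"] * b + random.random() * 0.01,
)
print(best)
```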

  • Cloud Programming Simplified: A Berkeley View on Serverless Computing: Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud. It provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. This paper gives a quick history of cloud computing, including an accounting of the predictions of the 2009 Berkeley View of Cloud Computing paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential.