Stuff The Internet Says On Scalability For February 8th, 2019

Wake up! It's HighScalability time:

Change is always changing. What will the next 5 years look like?

Do you like this sort of Stuff? I'd greatly appreciate your support on Patreon. Know anyone who needs cloud? I wrote Explain the Cloud Like I'm 10 just for them. It has 35 mostly 5-star reviews. They'll learn a lot and love you forever.

  • 16,000: Chrome bugs found with ClusterFuzz;  $2,000,000: for Apple iOS remote jailbreak; $1 million: think twice when profiting from a bug; 0: clicks to over-the-air exploitation of Marvell Avastar Wi-Fi; $300: cost for a bounty hunter to track your phone's location; 321M: Twitter MAUs; 3: years of falling smartphone shipments; 50%: new development uses microservices; 8 inches: big difference in cell phone radiation; 100 million: extra spam messages blocked per day with TensorFlow; 29%: drop in Google's charge per click; 10 million: concurrent players in Fortnite; 185 miles: fiber-optic cable in Summit supercomputer; 40-50: free Buzzfeed quizzes produced per month; 9237: papers with code; 100,000,000x: improvement needed for AI to be more efficient than our good old meat brain; 7.2%: drop in US hiring

  • Quotable Quotes:
    • @pczarkowski: As I keep telling people, if you have a kubernetes strategy you've already failed. Kubernetes should be an implementation detail at the tactical level to deal with the strategic imperative of solving the problems that are halting the flow of money.
    • EFF: EU countries that do not have zero rating practices enjoyed a double digit drop in the price of wireless data after a year. In comparison, the countries with prevalent zero rating practices from their wireless carriers consistently saw data prices increase. 
    • @samred: Jorgensen didn't mince words: he blamed the drop in the [EA] series' uptake on the developers' focus on a single-player campaign, as opposed to having a promised battle royale mode ready for fans in time for the game's launch.
    • Newzoo: The games market took more than 35 years to grow to a $35 billion business in 2007. This year, that same market is expected to generate $137.9 billion in revenues. In only 11 years, an astounding $100 billion of additional value was created.
    • Mark Fontecchio: we find that more companies are turning to HR software and the data it contains for strategic insights. According to 451 Research’s Voice of the Enterprise: Data & Analytics, 28% of businesses run analytics on their employee behavior data, roughly the same number that analyze IT infrastructure data.
    • Anonymous: it’s hard to compete with the sheer quantity of data that tech firms have, or the scale of their integration into people’s lives. Retail investors have to put their money somewhere. They’re currently putting it into traditional financial firms. But there’s no reason that Google and Facebook shouldn’t be accepting deposits, facilitating payments, making loans, managing assets, running quantitative investment funds.
    • Legogris: Having used ECS quite a bit, I do not recommend anyone building a new stack based on it. Kubernetes solves everything ECS solves, but usually better and without several of the issues mentioned here. Last time I checked, AWS was still lagging behind Azure and GCP on Kubernetes, but I have a strong feeling they're prioritizing improving EKS over ECS.
    • Tom Vanderbilt: Oral rehydration therapy is a classic case of what has been called “reverse innovation”: taking a technology or solution born of the resource constraints in developing countries and adopting it in wealthier ones.
    • @angela_walch: If this process doesn't demonstrate both centralization (4 people kept critical bug secret for MONTHS) and fiduciary-level trust reposed in these same people, I don't know what would. #crypto #zcash #blockchain #veilofdecentralization #codersasfiduciaries
    • Sean Mallon: Big data is going to make it easier for strategists to develop successful loyalty programs. Many brands are using big data to create loyalty programs already. Since all online gaming providers are partnered with a brick-and-mortar casino with decades of experience in their jurisdiction, they can pull data on the success of loyalty programs from their land-based operations. They can also use customer engagement data to determine the preferences of their patrons. All of this data will help them create the most effective rewards program possible.
    • Kaelin Rooney: What is most surprising about the development of microtransactions is that the microtransactions themselves are already more profitable than game sales (Activision Blizzard and Ubisoft have already reported as such)! (Sillicur, 2018) (Henry, 2017) If this trend continues, there are no surface reasons for them not to become the norm across the board, the profit margins are just incredible.
    • titzer: As for the history. The history is wrong. The first iteration of Wasm was in fact a pre-order encoded AST. No stack. The second iteration was a post-order encoded AST, which we found through microbenchmarks, actually decoded considerably faster. The rub was how to support multi-value returns of function calls, since multi-value local constructs can be flattened by a producer. We considered a number of alternatives that preserved the AST-like structure before settling on that a structured stack machine is actually the best design solution, since it allowed the straightforward extension to multi-values that is there now (and will ship by default when we reach the two-engine implementation status). As for the present. Wasm blocks and loops absolutely can take parameters; it's part of the multi-value extension which V8 implemented already a year ago. Block and loop parameters subsume SSA form and make locals wholly unnecessary (if that's your thing). 
    • Stephen Kuenzli: Conway’s Law rules AWS Organizations, too. A good place to start for many organizations is to create a set of accounts partitioned by delivery phase for each department or business unit. 
    • shmerl: [Zero rating] is an ugly anti-competitive symptom of the actual disease: data caps. Curing the symptom helps marginally. What needs fixing is the disease.
    • Mitch Wagner: For Amazon, the cloud is the little engine that could. Amazon Web Services comprised just 11% of the company's overall sales in 2018, but delivered more operating income than all other business units combined. For calendar 2018, AWS sales were $25.65 billion, up 47% from $17.45 billion year-over-year. Sales growth accelerated in 2018 over 2017 compared with 2017 over 2016, which saw 43% sales growth.
    • Leo Laporte: Podcasts are the 17th century English coffee houses of the 21st century.
    • Gerald Jay Sussman: We're in real trouble. We haven't the foggiest idea how to compute very well. 
    • James Beswick: Right now in cloud, the real jobs are available at large companies who have either been early adopters or have new projects that were started in the cloud while the rest of their infrastructure is not. Anyone in their immediate orbit will see roles appearing in the cloud too — principally consulting firms and support vendors. These are where the best cloud jobs exist in 2019.
    • Hobson Lane: A lot of genius ideas for how to construct programs as graphs of interdependent, incrementally improving approximations. Does an end run around Big O. To my untrained eye these look a lot like neural nets where each node is a complex computer in its own right.
    • Rich Miller: Hyperscale players are already investing in reserving huge chunks of future capacity through “right of first offer” (ROFO) deals with developers, who offer existing tenants first shot at negotiating deals for new space before it is offered to the broader market. “We now have the hyperscale guys wanting to control the vacancy in their building to control growth,” 
    • @davidgerard: "Nerds have a complicated relationship to change; it's awesome when we are the ones creating the change, but it's untrustworthy when it comes from outside." @jeamland on systemd as an example.
    • Carlos Jones: Summit, which occupies an area equivalent to two tennis courts, used more than 27,000 powerful graphics processors in the project. It tapped their power to train deep-learning algorithms, the technology driving AI’s frontier, chewing through the exercise at a rate of a billion billion operations per second, a pace known in supercomputing circles as an exaflop. “Deep learning has never been scaled to such levels of performance before,” says Prabhat
    • Shoshana Zuboff: Surveillance capitalism in general has been so successful because most of us feel so beleaguered, so unsupported by our real-world institutions, whether it’s health care, the educational system, the bank … It’s just a tale of woe wherever you go. The economic and political institutions right now leave us feeling so frustrated. We’ve all been driven in this way toward the internet, toward these services, because we need help. And no one else is helping us. That’s how we got hooked
    • stratechery: All of this is critical context for understanding Spotify’s strong interest in the podcasting space. Spotify needs (1) a way to differentiate its service from Apple Music in particular, and (2) content that it does not have to pay for on a marginal cost basis. Gimlet Media fits the bill in both cases.
    • @bcantrill: This has occurred to me so many times when reading the [Google SRE] book: given how nauseatingly arrogant it is as printed, I can't even imagine how obscenely arrogant it must have been before being edited...
    • Google: ClusterFuzz has found more than 16,000 bugs in Chrome and more than 11,000 bugs in over 160 open source projects integrated with OSS-Fuzz. It is an integral part of the development process of Chrome and many other open source projects. ClusterFuzz is often able to detect bugs hours after they are introduced and verify the fix within a day.
    • @Lukasaoz: BTC has failed at all of its stated design goals. It has the worst carbon/utility tradeoff in the history of mankind. Paying for coffee using BTC emits 205kg of CO2. This is equivalent to 205kWh of coal powered mains usage in the UK.
    • @nathanielpopper: We lost the $150 million in cryptocurrencies that we held for customers because our founder died and he was the only one with the passwords to the wallets. Welcome to the financial future!
    • Masha Borak: The world’s most watched TV show is about to get streamed over a 5G network in 4K ultra high-definition. 
    • ssivark: I would expect experienced folk to better understand people, have figured out how to work effectively in a team, strategically prioritize what needs to be done, nudge meetings/conversations in the right direction, and mentor younger team members and help them out with advice in tricky situations. If your mental model does not value that, of course you would not value experience. You need a critical mass of experience in an organization to do this effectively -- it can't just fall on the shoulders of one person in middle management with a few years of technical experience and an MBA. If you step back and defocus, the above qualities look a lot like leadership pixie dust sprinkled liberally throughout the organization -- the lack of which correlates with several repeated failure modes one might see in young Silicon Valley companies today.
    • @johncutlefish: Most ppl don't realize how deep you need to go to make something "simple".  They think it just pops out that way. And if you show your work it scares them.
    • James Beswick: For most in IT, the paradox of the cloud is going to heavily influence our roles going forward. You will have to keep learning new skills at a much faster rate than peers in other industries, but you’re running up a down escalator that’s speeding up. The rate at which cloud is gobbling up the entire ecosystem creates opportunities for the experienced but ultimately will make it much harder for those switching careers or entering at a junior level.
    • @mipsytipsy: People act like "Why did X happen?" is some deep existential question. But honestly, most of the time it's shorthand for "What was the context when X happened?" And event context is the one thing metrics-based tools cannot tell you. THIS IS WHY YOUR PROBLEMS SEEM HARD
    • Arthur Herman: In the twenty-first century, supremacy will belong to the nation that controls the future of information technology, which is quantum. As we will see, it would be a mistake to assume that the United States is destined to be in this position. In the topsy-turvy, counterintuitive world of quantum mechanics and quantum computing, decades-long dominance in IT doesn’t automatically translate into dominance in the coming era. But strategy and commitment of resources, including funding, almost certainly will—and with it, the balance of the future.
    • Neal Ford: Mature microservices architecture requires at least some maturity in DevOps practices...synergy between architecture and DevOps is one of the super powers of the microservices architectural style because it delegates responsibility more intelligently
    • @Obdurodon: One of the dumbest things you can do in high-scale systems is assume that any of your inputs will be randomly or evenly distributed. They'll clump, in both space and time, so you'd best prepare for every unfortunate (but not unpredictable) convergence.
    • Daniel Lemire: A Canadian startup built around electric taxis failed. One of their core findings is that electric cars must be recharged several times a day, especially during the winter months. Evidently, the need to constantly recharge the cars increases the costs. I think that this need to constantly recharge cars challenges the view that once we have electric self-driving cars, we can just send our cars roaming the streets, looking for new passengers, at least in the cold of winter in Canada.
    • Eric Holloway: From this off-the-cuff analysis, we see there is a huge performance gap between AI and the human mind, even though the AI may outperform a human on the task. Thus, while it is correct to say AI can outperform humans when we are measuring only task accomplishment, it is comparing apples and oranges to say that AlphaGo Zero is outperforming the human mind. From this cursory analysis, there appears to be a stark quantitative line between the performance of minds and machines.
    • sosilkj: My takeaway, based on observations over the past few years, is that for many knowledge-work professions today, particularly software engineering, there's a rather steep discount curve applied to one's experience. In many cases, your experience simply doesn't matter at all -- or, worst case, it counts against you. This obviously depends heavily on what 'segment' of the job market you're in -- nonprofits and the public sector come to mind as probably exceptions -- but I think it holds for a good portion of the software job market today.
    • ARL: U.S. Army researchers and colleagues at the University of California, Santa Barbara have found a persistent gap in basic knowledge about the use of artificial intelligence (AI), and it is unknown which AI aspects will or will not help military decision-making. The researchers developed an online game in which players obtained points by making good decisions in each round; an AI was used to generate advice in each round, which was shown alongside the game interface. The AI also made suggestions about which decisions players should make, which they were free to accept or ignore. About 66% of human decisions disagreed with the AI, regardless of the number of errors in the suggestions. The researchers concluded these findings present a catch-22 for system designers: incompetent users need AI the most, but are least likely to be swayed by their rational justifications.
    • WIPO: Since artificial intelligence emerged in the 1950s, innovators and researchers have filed applications for nearly 340,000 AI-related inventions and published over 1.6 million scientific publications. Notably, AI-related patenting is growing rapidly: over half of the identified inventions have been published since 2013.
    • Will Knight: The US government appears to have decided that it’s simply too risky for a Chinese company to control too much 5G infrastructure.
    • James Hamilton: There were many errors and poor decisions that led to this accident. Many were made at the most senior levels of the 7th fleet and the naval leaders above them, but some of the lessons apply to all ships operating at sea. Some of these on-ship lessons that stood out for me were: 1) Make sure that there is an adequate visual watch; 2) Assume the worst when nearing other vessels and take early and decisive action to avoid a collision. It's remarkable how fast "normal" distance can become an unavoidable collision; 3) Use the RADAR. Naval RADARs do an excellent job of delivering weapons and avoiding close encounters with enemy ships, but they aren't always excellent when operating in very close range and they are difficult to use. Adding a commercial RADAR as was done on the USS California seems like a prudent safety decision; and 4) AIS data should be used as a primary source of anti-collision data right up there with visual and RADAR. Given the risk of collision when operating near friendly ports, it probably makes sense for naval vessels to broadcast AIS data just as the commercial traffic does. This might allow a commercial crew to make better informed decisions when operating near naval vessels.

  • We don't know what's up at Wells Fargo. They're down—hard—due to a fire, an attack, a hyper-active fire suppression system, or some other horrible combination of unlikely events. The mystery is why they haven't failed over. Let's wish them luck. This can't be an easy time for them. But this is a good time to reference how Netflix routinely practices evacuating entire regions: Project Nimble: Region Evacuation Reimagined. Netflix can fail over to a different region in 8 minutes! When haters talk about the expense of the public cloud, ask yourself, how much will this downtime cost Wells Fargo both in fiat currency and reputation points?

  • Autoscaling is a key architecture innovation of IaaS. The odd thing is it is still way harder to get right than it should be. 
    • Segment details their problems in When AWS Autoscale Doesn’t: Surprise 1: Scaling Speed Is Limited; Surprise 2: Short Cooldowns Can Cause Over-Scaling / Under-Scaling; Surprise 3: Custom CloudWatch Metric Support is Undocumented; However, like all AWS conveniences, target tracking autoscaling also brings with it a hidden layer of additional complexity. 
    • Lessons: The best way to find the right autoscaling strategy is to test it in your specific environment and against your specific load patterns; It's amazing how complex things get on the configuration side as you try to get smarter about autoscaling. The big downside of trying to be too clever here is that you wind up with a wildly brittle autoscaling setup that falls over as soon as the underlying assumptions around the relationship between your metrics and your scaling needs change. As with many things in engineering, we've found that it's best to keep the configuration as simple as possible and use a solid foundation of reporting / alerting to give you an early heads up that you need to revisit and update your autoscaling strategy.
    • There's a good HN discussion where others share their pain as well. samstave: We were doing exactly this - but we had a flaw: we didn't handle the case when the AWS API was actually down. So we were constantly monitoring for how many running instances we had - but when the API went down, just as we were ramping up for our peak traffic - the system thought that none were running because the API was down - so it just kept continually launching instances. cdoxsey: This is how the cluster-autoscaler works in kubernetes. It sets the desired capacity based on the number of pods needing to be scheduled. Coupled with a horizontal pod autoscaler (which sets the replica count based on a metric) you get the best of both worlds. avitzurel: There are many limitations that you need to "read between the lines" with AWS auto scaling. For example, we have daemons reading messages from SQS, if you try to use auto scaling based on SQS metrics, you come to realize pretty quickly that CloudWatch is updated every 5 minutes. For most messages, this is simply too late.
    • jcrites: The effectiveness of dynamic scaling significantly depends on what metrics you use to scale. My recommendation for that sort of system is to auto-scale based on the percent of capacity in use.
    • NathanKP: AWS employee here. If you are able to achieve consistent greater than 50% utilization of your EC2 instances or have a high percentage of spot or reserved instances then ECS on EC2 is still cheaper than Fargate. If your workload is very large, requiring many instances, this may make the economics of ECS on EC2 more attractive than using Fargate. (Almost never the case for small workloads though). Additionally, a major use case for ECS is machine learning workloads powered by GPUs, and Fargate does not yet have this support. With ECS you can run p2 or p3 instances and orchestrate machine learning containers across them with even GPU reservation and GPU pinning.
    • jedberg: Auto-scaling is depending on startup time. If your startup time for a new instance/container is 5 seconds, then you need to predict what your traffic will be in 5 seconds. If your startup time is 10 minutes, then you need to predict your traffic in 10 minutes. The choice of metric is important, but it needs to be a metric that predicts future traffic if you want to autoscale user facing services. CPU load is not that metric. The best way to do autoscaling is to build a system that is unique to your business to predict your traffic, and then use AWS's autoscaling as your backup for when you get your prediction wrong.
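    jedberg's point above is worth making concrete. A minimal sketch of predictive scaling, where everything is hypothetical: the naive linear forecast, the per-instance throughput, and the numbers. A real system would forecast from its own business seasonality (time of day, launches, and so on) and keep the platform's reactive autoscaling as the backup for when the prediction is wrong:

    ```python
    import math

    # Traffic history is requests/sec, sampled once per minute.
    def forecast_rps(history, horizon_min):
        """Naive linear extrapolation of traffic horizon_min minutes ahead."""
        if len(history) < 2:
            return history[-1] if history else 0.0
        slope = history[-1] - history[-2]          # rps change per minute
        return max(0.0, history[-1] + slope * horizon_min)

    def desired_instances(history, rps_per_instance, startup_min, headroom=1.2):
        """Size the fleet for predicted traffic at now + instance startup time."""
        predicted = forecast_rps(history, startup_min)
        return max(1, math.ceil(predicted * headroom / rps_per_instance))

    # Instances take 10 minutes to boot; each handles ~500 rps.
    print(desired_instances([4000, 4200, 4400], rps_per_instance=500,
                            startup_min=10))
    # → 16
    ```

    The key design point is that the forecast horizon equals the startup time: a metric like current CPU load tells you what capacity you needed minutes ago, not what you will need when the new instances finally come online.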

  • How can you invent the future when you don't even know you live in the past? Fiber: The Coming Tech Revolution―And Why America Might Miss It (interview). The US has very little fiber to the home. South Koreans feel coming to the US is like taking a peaceful rural vacation because life is so unconnected here. Communication just works there. They never have to worry about the expense or availability of communications. They've made sure there's fiber to the home and they've made sure there's lots of competition. The result is you can buy nearly unlimited capacity for $35 or $45 a month. In the US internet access is a commodity. The free market is assumed to compete. That's not the case. Companies pick the richest areas and divide up markets so they don't have to compete and never have to upgrade. Large parts of the US are left behind. China plans to have 80% connected to fiber in the next 5 years. New ways of living won't come from the US because we trail behind in communications. Leaders no more. The US is sinking into a darkness without recognizing the world is far ahead. Also, read Voltaire's Candide for insight into the dangers of "we live in the best of all possible worlds" thinking.

  • Looking for a different way to FaaS? Yes, you need to set up a server, but that's not really the point of serverless, is it? Ride the Serverless Wave with DigitalOcean's One-click Droplet. OpenFaaS is a framework for building serverless functions with Docker and Kubernetes which has first class support for metrics. You can run OpenFaaS on the cheapest node available for just $5 per month. 

  • Can Kubernetes Work at CERN? (slides) CERN is big. Sensors generate 1 PB/sec of data that they filter down to a more manageable 1-10 GB/s. All the data is stored on tape. They have a big OpenStack cloud deployed on 320,000 cores, 10,000 hypervisors, with 250 PB of data. 60-70PB of data is added per year. They already have 210 Kubernetes clusters, spanning 10-1000 nodes. This isn't enough, so they set up a large distributed computing system that links over 200 sites around the world divided into tiers. CERN is tier 0. A set of tier 1 sites receive the data and start reconstruction. Smaller sites and universities link to the tier 1 sites to perform reconstruction, calibration, simulation, and analysis. At any moment ~400,000 jobs are run in parallel on 700,000 cores. The end result is plots like those used to reconstruct a Higgs event in 2012. Even all this power is not enough. There are load spikes, especially before big conferences. They want to use Kubernetes federation so they can easily add new clusters through a single entry point. They like the k8s tools, monitoring, and uniform API, and it makes it easy to run the same applications internally at CERN and externally at other sites. They would like to operate on bare metal for efficiency reasons. They have clusters internally, at other sites, on GKE, and on Azure (7 public clouds total). Clusters can be joined to the control plane with the kubefed join command. Imagine CERN receives new machines; how do they start using them? By joining them to the k8s federation they can run production workloads quickly. They use Yadage (Declarative Workflow Spec and Engine) to run workflows across the federation. Each step is a k8s job. Logs give feedback on how the jobs are doing. This allows physicists to reuse analysis. Analysis in physics is quite complex to define and run. Using Yadage supports sharing at a very high level on a common federated infrastructure. The output is usually a very innocent looking plot, but a spike in the right place can mean Eureka! Federated k8s makes life a lot easier: there's less code to maintain, and the focus moves to the actual physics.

  • In short: the Golden Mean. The Hard Truth About Innovative Cultures: Tolerance for Failure but No Tolerance for Incompetence; Willingness to Experiment but Highly Disciplined; Psychologically Safe but Brutally Candid; Collaboration but with Individual Accountability; Flat but Strong Leadership.

  • Do you need to store a lot of data? David Rosenthal does a deep dive on the storage, ingress, and egress costs of various providers at varying levels of coldness and lock-in. Cloud For Preservation
    • An interruption in money supply is one of the key worries you have with long term preservation. So you may want to balance higher cost geographic redundancy against lower cost, less redundant options. Another risk is the account closure risk. Google, Amazon, and other companies have been known to simply shutdown accounts because some faceless algorithm flipped the screw-you bit on. Another consideration is cheaper platforms suffer from data gravity problems. They don't have compute options close to the storage. And of course you have to worry about jurisdiction issues. As you might expect there is no one solution, but as always with DSHR you're benefiting from his great logic, data, and insight that you can use to make your own decisions. 
    • His rule of thumb: My guess is that more conservative institutions would need to operate at the 100PB scale in on-premise data center space before they could compete on raw cost with the storage-only services for preservation on nearline disk. The advantages of on-demand scaling are so large that institutions lacking the Internet Archive's audience cannot compete with the major cloud platforms for access, even ignoring the demands of "big data" access.
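    The magnitudes behind that rule of thumb are easy to sanity-check with back-of-envelope arithmetic. The per-GB price below is a hypothetical placeholder for a nearline-class tier, not any provider's actual rate card:

    ```python
    # Rough monthly storage cost at preservation scale.
    # All prices are assumed placeholders, not real provider rates.

    GB_PER_PB = 1_000_000  # decimal GB per PB, as storage is usually billed

    def monthly_cost(pb_stored, price_per_gb_month):
        """Monthly storage-only cost in dollars; ignores ingress/egress."""
        return pb_stored * GB_PER_PB * price_per_gb_month

    # 100 PB on a nearline-class tier at an assumed $0.004/GB-month:
    print(f"${monthly_cost(100, 0.004):,.0f}/month")   # → $400,000/month
    ```

    At that scale the storage line item alone reaches hundreds of thousands of dollars a month, which is why the on-premise comparison only starts to make sense for institutions operating around the 100PB mark.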

  • Is Google feeling Apple's privacy brand pinch? Towards Federated Learning at Scale: System Design: Our system enables one to train a deep neural network, using TensorFlow (Abadi et al., 2016), on data stored on the phone which will never leave the device. The weights are combined in the cloud with Federated Averaging, constructing a global model which is pushed back to phones for inference. An implementation of Secure Aggregation (Bonawitz et al., 2017) ensures that on a global level individual updates from phones are uninspectable. The system has been applied in large scale applications, for instance in the realm of a phone keyboard.
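    The Federated Averaging step at the heart of that design reduces to a weighted mean of client updates, weighted by how many local examples each phone trained on. A minimal sketch, using plain Python lists in place of TensorFlow weight tensors and omitting the Secure Aggregation layer entirely:

    ```python
    # Sketch of Federated Averaging: the server never sees raw data,
    # only weight vectors uploaded by clients after local training.

    def federated_average(client_weights, client_num_examples):
        """Combine per-client weight vectors into a global model,
        weighting each client by its number of local training examples."""
        total = sum(client_num_examples)
        dim = len(client_weights[0])
        global_w = [0.0] * dim
        for w, n in zip(client_weights, client_num_examples):
            for i in range(dim):
                global_w[i] += (n / total) * w[i]
        return global_w  # pushed back to phones for inference

    # Two clients; the second trained on 3x the examples of the first.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
    # → [2.5, 3.5]
    ```

    In the real system this averaging happens over cryptographically masked updates, so the server can compute the sum without being able to inspect any individual phone's contribution.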

  • How do you produce a low-jitter 350 microsecond delay? Using physics. Let the signal run through 38 miles of cable ($27,000). Slowing Down A Stock Exchange. Why? Fairness in pricing. Humans and HFT traders experience a consistent delay. get_salled: IEX introduced this speed bump to allow institutional investors to not be forced to compete with the arbitrage algos. If your strategy was to arb between the US exchanges, you likely hit IEX less often (your algo will show the paths to IEX being slower so there will be less opportunity there); it's a space where you really can't compete without a ridiculous amount of capital and infrastructure.
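    The physics checks out on the back of an envelope: light in silica fiber travels at roughly c divided by the glass's refractive index, so 38 miles of coiled cable buys you a few hundred microseconds. A quick calculation with assumed constants (a typical index of 1.47, ignoring connectors and electronics, which add to the quoted figure):

    ```python
    # Propagation delay through a coiled-fiber speed bump.
    # Constants are textbook values, not IEX's actual specification.

    C = 299_792_458        # m/s, speed of light in vacuum
    N_FIBER = 1.47         # typical refractive index of silica fiber
    MILES_TO_M = 1609.344

    delay_s = 38 * MILES_TO_M / (C / N_FIBER)
    print(f"{delay_s * 1e6:.0f} microseconds")   # → 300 microseconds
    ```

    That lands in the same ballpark as the quoted 350 microseconds, and because the delay comes from physics rather than a software timer, it is essentially jitter-free.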

  • If you love debug torture porn you'll love Understanding the Network Interface (NI) Card. It's a doozy.

  • Goto Chicago 2019 is on again this year. It starts April 28. 

  • google/clusterfuzz (article): scalable fuzzing infrastructure which finds security and stability issues in software. It is used by Google for fuzzing the Chrome Browser, and serves as the fuzzing backend for OSS-Fuzz. Runs on over 25,000 cores.

  • A First Look at the Crypto-Mining Malware Ecosystem: A Decade of Unrestricted Wealth: Our profit analysis reveals campaigns with multi-million earnings, associating over 4.3% of Monero with illicit mining. We analyze the infrastructure related with the different campaigns, showing that a high proportion of this ecosystem is supported by underground economies such as Pay-Per-Install services. We also uncover novel techniques that allow criminals to run successful campaigns.