Stuff The Internet Says On Scalability For September 7th, 2018

Hey, it's HighScalability time:

Get antsy waiting 60 seconds for a shot? Imagine taking over 300,000 photos over 14 years, waiting for Mount Colima to erupt. Sergio Tapiro studied, waited, and snapped.

Do you like this sort of Stuff? Please lend me your support on Patreon. It would mean a great deal to me. And if you know anyone looking for a simple book that uses lots of pictures and lots of examples to explain the cloud, then please recommend my new book: Explain the Cloud Like I'm 10. They'll love you even more.

  • 3.5 Pflop/s: fully synchronous TensorFlow data-parallel training; 3.3 million: image/caption pairs in a new training set; 32,408,715: queries sent to Pwned Passwords; 53%: memory ICs' share of total 2018 semiconductor capex; 11: stories in Facebook's prison-like datacenter in Singapore; $740,357: average cost of network downtime

  • Quotable Quotes:
    • @BenedictEvans: Recorded music: $18 billion. Cars: $1 trillion. Retail: $20 trillion.
    • @JoeEmison: Lies that developers tell (themselves): (1) This is the best stack/IaaS for us to use [reality: I know it and want to start now] (2) DevOps doesn’t matter until scaling [you’ll spend 30% of your time dealing with ops then] (3) We’ll just rebuild it if we get traction [hahahaha]
    • @sapessi: Lambda simplifies concurrency at the frontend, enforcing one event per function at a time. This makes it easy to reason about complex distributed systems. Once inside the function, there's nothing wrong with multi-threading to do the work as efficiently as possible
    • @JonErlichman: Comparing valuations: Amazon: $1 trillion; Combined $960 billion: Best Buy, Macy's, Target, Costco, Nike, Sears, Home Depot, Starbucks, McDonald's, Barnes & Noble, J.C. Penney, Dollar Tree, Office Depot, Nordstrom, Kroger, Kohl's
    • Kevin Kelly: The biggest invention in Silicon Valley was not the transistor but the start-up model, the culture of the entrepreneurial start-up.
    • Dare Obasanjo: Amazon made $2.2B from search ads last quarter. This is twice as much as Snapchat ($262M) and Twitter ($711M) combined. However it's still far from Google ($28B) and Facebook ($13.2B). Expect the next step is for Amazon ads to start showing scale
    • Matthew Dillon: This is *very* impressive efficiency. Who'd a thought that one would be able to run an 8-core/16-thread CPU at full load at only 85W and still reap most of the benefit in a memory-heavy workload!
    • @laurencetratt: When we set out to look at how long VMs take to warm up, we didn’t expect to discover that they often don’t warm up. But, alas, the evidence that they frequently don’t warm up is hard to argue with. Of course, in some cases the difference to performance is small enough that one can live with it, but it’s often bad enough to be a problem. We fairly frequently see performance get 5% or more worse over time in a single process execution. 5% might not sound like much, but it’s a huge figure when you consider that many VM optimisations aim to speed things up by 1% at most. It means that many optimisations that VM developers have slaved away on may have been incorrectly judged to speed things up or slow things down, because the optimisation is well within the variance that VMs exhibit. 
    • Eli Bendersky: Just for fun, I rewrote the same benchmark in Go; two goroutines ping-ponging short messages between themselves over a channel. The throughput this achieves is dramatically higher - around 2.8 million iterations per second, which leads to an estimate of ~170 ns switching between goroutines [3]. Since switching between goroutines doesn't require an actual kernel context switch (or even a system call), this isn't too surprising. For comparison, Google's fibers use a new Linux system call that can switch between two tasks in about the same time, including the kernel time. (A minimal Go reconstruction of this benchmark appears after this list.)
    • @ben11kehoe: So instead of a processor [for @iRobot] that is maxed out by the features it supports at launch (which is the route to the lowest build cost), we've got headroom to keep pace even with software features we're planning for the generation of robots after this.
    • @NikolaiHampton: A friend of ours wrote a reasonably popular Chrome extension (~70k users). As soon as it started to take off he got offers from scammers to buy the extension. An easy way for scammers to hijack your browser: find an extension with interesting permissions, buy it, push an “update”.
    • @mark76stewart: I think it massively over-states the size of a tiny Kanye I’m afraid. No way is 50% “tiny”. That’s a Medium-sized Kanye. A tiny Kanye would have to be max 1ft tall to meet international standards
    • @auastro: Hi I'm Andy, I started with some ActiveRecord. Then dabbled in Hibernate. Before I knew it I was writing javascript stored procedures for BSON mongo queries just to feel normal. It was a dark time. I've been NoSQL free for almost 2 years now. I'm not proud of what I did, but I'm about healing now. I just wish I could undo the damage I did to my coworkers and end-users. Stay in CS theory class kids. ORMs and DSLs may seem cool and edgy, maybe all your friends say they are doing them. Just say no.
    • @leftoblique: This is an easy statistical mistake for tech companies to make - and one we try to avoid in Chrome: It's easy to look at pregnancy (for example) as something which affects only 2% of your users. But it's really something that affects 40% of your users 5% of the time.
    • Tim Bray: When you’re pumping messages around the Internet between heterogeneous codebases built by people who don’t know each other, shit is gonna happen. That’s the whole basis of the Web: You can safely ignore an HTTP header or HTML tag you don’t understand, and nothing breaks. It’s great because it allows people to just try stuff out, and the useful stuff catches on while the bad ideas don’t break anything.
    • sonniesedge: Google has a monopoly on search rankings. We can't let them obtain a monopoly on websites.
    • tdammers: So here's what I would do: Start with a single server; But design your system such that splitting it up into separate services remains a possibility - that is, write modular code, and possibly even split things up into separate processes where that is feasible right now; Keep an eye on usage statistics and server load, and start preparing a scaling strategy when things approach the edge of your comfort zone. IIRC Google has a "10x rule" that says all systems should be able to handle 10x the current load - that's quite drastic, but being able to scale by a factor of at least 2 is the minimum headroom I'd want.
    • Martin Thompson: More Intel CPU microcode updates and again I see another reduction in CPU performance. The accumulation over this year is starting to get quite concerning.
    • Sandi Miller: Julia is the only high-level dynamic programming language in the “petaflop club,” having achieved 1.5 petaflop/s using 1.3 million threads, 650,000 cores and 9,300 Knights Landing (KNL) nodes to catalogue 188 million stars, galaxies, and other astronomical objects in 14.6 minutes on the world’s sixth-most powerful supercomputer. Julia is also used to power self-driving cars and 3-D printers, as well as applications in precision medicine, augmented reality, genomics, machine learning, and risk management.
    • Alexander Rubin: It is possible to create 40 million tables with MySQL 8.0 using shared tablespaces. ZFS provides an excellent compression ratio (with gzip) which can help by reducing the overhead of “schema per customer” architecture. Unfortunately, the new data dictionary in MySQL 8.0.12 suffers from the DICT_SYS mutex contention and causes constant “stalls”.
    • @KingTherapy: Combine this effort with AMP? Google wants to take the ball we all made together off the field and move it to their own park. A Google-curated replacement for the imperfect and inconsistent patchwork of protocols and methods that are today shepherded by a commons.
    • @rmogull: As I get deeper into Azure I get… sadder. It’s a very capable platform but in Microsoft’s eagerness to support customers' existing needs it leads them down dark paths of anti-patterns. You *can* do cloud right on it, but the default is to do it wrong...Azure drives more towards traditional patterns (e.g. use of virtual appliances and IP-based security group rules). Azure fully supports the “right” way, but the documentation and UX steer more towards the anti-patterns.
    • Memory Guy: Many readers have probably wondered why NAND flash fabs are so enormous. Although DRAM fabs used to be the largest, running around 60,000 wafers per month, NAND flash fabs now put that number to shame, running anywhere from 100,000-300,000 wafers per month. Why are they so huge? The reason is that you need to run that many wafers to reach the optimum equipment balance.  The equipment must be balanced or some of it will be sitting idle, and with some tools costing $50 million (immersion scanners) you want to minimize their idle time to the smallest possible number.
    • Dale Markowitz: Only seven fellow engineers and I maintained all the code running on our servers—and making tests work was time-consuming and error-prone. “We can’t sacrifice forward momentum for technical debt,” then-CEO Mike Maxim told me, referring to the cost of engineers building behind-the-scenes tech instead of user-facing features. “Users don’t care.” He thought of testing frameworks as somewhat academic, more lofty than practical.
    • Erik Bernhardsson: This is a bit of a rant but I really don’t like software that invents its own query language. There’s a trillion different ORMs out there. Another trillion databases with their own query language. Another trillion SaaS products where the only way to query is to learn some random query DSL they made up. I just want my SQL back. It’s a language everyone understands, it’s been around since the seventies, and it’s reasonably standardized. It’s easy to read, and can be used by anyone, from business people to engineers. Instead, I have to learn a bunch of garbage query languages because everyone keeps trying to reinvent the wheel.
    • @mipsytipsy: You can monitor for the known unknowns. For everything else, you rely on first principles, instrumentation, exploration & curiosity. There's no shame in either! But as @lizthegrey says, a mature team will have remediated the known unknowns. All that's left are new mysteries.
    • Rob Aitken: One of the interesting architectural innovations in that kind of processing is to do what Stanford did with a pixel-based processing system. In a system like that, the pixels are relatively independent of each other and exist in a 2D surface. All the yield problems you would get gluing two wafers together don’t affect you nearly as much as they do if you have a case of, ‘This wafer gets 75% yield and that wafer gets 75% yield, and when I put them together they get 30% yield.’ You have to build systems where the redundancy implicit in the 3D stacking works with you, not against you. But even if you don’t go to monolithic 3D, and you want to do compute in memory, or near memory, that gets into the data movement problem. If your system requires moving data from here to there, it doesn’t matter how clever your processor is or how fast it is because that’s not your limiting factor.
    • Benedict Evans: So: it’s possible that Tesla gets SLAM working with vision, and gets the rest of autonomy working as well, and its data and its fleet makes it hard for anyone else to catch up for years. But it’s also possible that Waymo gets this working and decides to sell it to everyone. It’s possible that by the time this starts to go mainstream, 5 or 10 companies get it working, and autonomy looks more like ABS than it looks like x86 or Windows. It’s possible that Elon Musk’s assertion that it should work with vision alone is correct, and 10 other companies then get it working. All of these are possible, but, to repeat, this answer is not a question of disruption, and this is not a matter of whether software people will beat non-software people - these are all software people. 
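
    A minimal Go sketch of the goroutine ping-pong benchmark Eli Bendersky describes above, with an illustrative iteration count: two goroutines bounce a value over a pair of unbuffered channels, and since each round trip is two hand-offs, the per-switch estimate is roughly half the measured per-iteration time.

```go
// A reconstruction of the ping-pong benchmark: two goroutines bounce a
// value over unbuffered channels; throughput bounds the switch cost.
package main

import (
	"fmt"
	"time"
)

func main() {
	const iterations = 1000000 // illustrative; not the original's count
	ping := make(chan int)
	pong := make(chan int)

	// Echoer goroutine: receive on ping, reply on pong.
	go func() {
		for v := range ping {
			pong <- v
		}
	}()

	start := time.Now()
	for i := 0; i < iterations; i++ {
		ping <- i // hand off to the echoer...
		<-pong    // ...and wait for the reply
	}
	elapsed := time.Since(start)

	// Each iteration is two goroutine hand-offs, so the per-switch
	// estimate is roughly half the per-iteration time.
	fmt.Printf("%d iterations in %v (%v per iteration)\n",
		iterations, elapsed, elapsed/iterations)
}
```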

  • The fastest, most cost-effective way to serve requests is to respond immediately from the edge. Don't hit the origin server unless you absolutely have to! Run serverless on the origin to keep cost down and performance up. Getting away from the constraints of logical infrastructure boundaries means you can do amazing things. Serverless to the Max: Doing Big Things for Small Dollars with Cloudflare Workers and Azure Functions: The mechanics of querying Pwned Passwords via k-anonymity essentially boils down to there being 16^5 (just over 1 million) different possible queries that can be run with each one returning an average of 493 records (because there's 517M in total)...essentially it's a combination of Cloudflare cache, Azure Functions and Blob Storage as the underlying data structure... 99.62% of all requests were served directly from their infrastructure and never even hit Azure...As of today, there's 152 Cloudflare edge nodes around the world...The point being that massively high cache hit ratios delight customers! This level of performance is what makes my service viable for them in the way they're using it today...The second awesome thing is that the 99.62% cache hit ratio over that week led directly to a 99.62% reduction in requests to the origin website compared to not having Cloudflare in the picture. Or to put it another way, without Cloudflare I'd need to support 264 times more requests to support exactly the same traffic volumes...Last thing on Cloudflare - it's not just the requests to the origin that they help dramatically reduce, it's the egress bandwidth from Azure as well...That's 476.68GB worth of data I haven't had to pay Microsoft for.
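
    Troy Hunt's k-anonymity scheme is simple enough to sketch. A hedged Go version of a client query (error handling kept minimal; the User-Agent string is a placeholder): SHA-1 the candidate password, send only the first 5 hex characters to the range API, and compare the returned suffixes locally so the full hash never leaves the client.

```go
// A sketch of a k-anonymity lookup against the Pwned Passwords range API.
package main

import (
	"bufio"
	"crypto/sha1"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// pwnedCount returns how many times the password appears in known
// breaches, or 0 if it has never been seen.
func pwnedCount(password string) (int, error) {
	sum := sha1.Sum([]byte(password))
	hash := strings.ToUpper(fmt.Sprintf("%x", sum))
	prefix, suffix := hash[:5], hash[5:]

	// One of the 16^5 possible range queries; the response lists
	// "SUFFIX:COUNT" for every breached hash sharing this prefix.
	req, err := http.NewRequest("GET", "https://api.pwnedpasswords.com/range/"+prefix, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("User-Agent", "pwned-check-example") // placeholder client identifier
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), ":", 2)
		if len(parts) == 2 && strings.TrimSpace(parts[0]) == suffix {
			return strconv.Atoi(strings.TrimSpace(parts[1]))
		}
	}
	return 0, scanner.Err()
}

func main() {
	n, err := pwnedCount("password123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("seen in %d breaches\n", n)
}
```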

  • Videos from RustConf 2018 are now available.

  • Why might you need a complex system like Kubernetes? What is Kubernetes? Optimise your hosting costs and efficiency: Or in other words, for every fifth virtual machine the overhead adds up to a full virtual machine. You pay for five but can use only four. You can’t escape from it, even if you’re on bare metal...You pay $1000 in EC2 instances on Amazon, you only actually use $100 of it...Having Kubernetes efficiently packing your infrastructure means that you get more computing for your money. You can do a lot more with a lot less...Netlify managed to migrate to Kubernetes, double their user base, and still keep their costs unchanged...Qbox — a company that focuses on hosted Elasticsearch — managed to save 50% per month on AWS bills!...OpenAI is a non-profit research company that focuses on artificial intelligence and machine learning. And they used Kubernetes to scale their machine learning model in the cloud. Wondering about the details of their cluster? 128,000 vCPUs. That’s about 16,000 MacBook Pros. Only $1280/hr for 128,000 vCPUs and $400 for the 256 Nvidia P100s.
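
    The packing argument above is classic bin packing. A toy Go sketch with made-up numbers (none of them from the article): first-fit packing of pod CPU requests onto shared nodes versus one VM per workload.

```go
// An illustrative (not Kubernetes') first-fit bin-packing sketch:
// packing many small workloads onto shared nodes needs far fewer
// machines than running one VM per workload.
package main

import "fmt"

// firstFit packs CPU requests (in millicores) onto nodes of the given
// capacity and returns how many nodes were needed.
func firstFit(requests []int, nodeCapacity int) int {
	var free []int // remaining capacity of each allocated node
	for _, r := range requests {
		placed := false
		for i := range free {
			if free[i] >= r {
				free[i] -= r
				placed = true
				break
			}
		}
		if !placed {
			free = append(free, nodeCapacity-r)
		}
	}
	return len(free)
}

func main() {
	// 40 services, each requesting 500 millicores (made-up numbers).
	requests := make([]int, 40)
	for i := range requests {
		requests[i] = 500
	}
	// A 16-core node (16000m) holds 32 such pods, so 40 pods fit on 2 nodes...
	fmt.Println("nodes with bin packing:", firstFit(requests, 16000))
	// ...versus one VM per service without packing.
	fmt.Println("VMs without packing:  ", len(requests))
}
```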

  • Don't let architecture aesthetics blind you to being in sympathy with the machine. When you're paying for compute, doing work in parallel gives you higher utilization, which costs less. How We Massively Reduced Our AWS Lambda Bill With Go: just resource discovery alone was going to cost us over $55 per month per customer - something that clearly wasn’t going to work with our planned pricing...In a single morning we refactored the code to use a single Lambda invocation every minute, operating on 20% of the customer base each time. This Lambda spawns a Goroutine per customer, which spawns a Goroutine per region, which spawns a Goroutine per AWS service. The Lambda execution time hasn’t increased significantly, because as before we’re mainly waiting on network I/O - we’re just waiting on a lot more responses at the same time in a single Lambda function. Cost per customer is now much more manageable, becoming lower and lower with every sign-up. Since refactoring in this way, we’ve started using Goroutines throughout the code base. In particular, when calling DynamoDB to fetch multiple items across different partition keys, doing this concurrently has brought a significant speed up. Also, Serverless Architecture Language
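
    The fan-out they describe is straightforward in Go. A hedged reconstruction, not the authors' code: Customer and discoverService are hypothetical stand-ins, and each layer of goroutines is joined with a sync.WaitGroup so the function returns only when every call has finished.

```go
// A sketch of the goroutine-per-customer/region/service fan-out. The
// work is almost entirely network I/O, so goroutines overlap the waiting.
package main

import (
	"fmt"
	"sync"
)

type Customer struct{ ID string } // hypothetical stand-in

// discoverService stands in for the real per-service API calls.
func discoverService(c Customer, region, service string) {
	fmt.Printf("discovering %s/%s for %s\n", region, service, c.ID)
}

func discoverAll(customers []Customer, regions, services []string) {
	var wg sync.WaitGroup
	for _, c := range customers {
		wg.Add(1)
		go func(c Customer) { // goroutine per customer
			defer wg.Done()
			var rwg sync.WaitGroup
			for _, region := range regions {
				rwg.Add(1)
				go func(region string) { // goroutine per region
					defer rwg.Done()
					var swg sync.WaitGroup
					for _, svc := range services {
						swg.Add(1)
						go func(svc string) { // goroutine per AWS service
							defer swg.Done()
							discoverService(c, region, svc)
						}(svc)
					}
					swg.Wait()
				}(region)
			}
			rwg.Wait()
		}(c)
	}
	wg.Wait() // the Lambda returns only after every call completes
}

func main() {
	discoverAll(
		[]Customer{{"cust-1"}, {"cust-2"}},
		[]string{"us-east-1", "eu-west-1"},
		[]string{"ec2", "s3", "rds"},
	)
}
```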

  • Shopify serves millions of requests every minute in support of over 600,000 merchants on Kubernetes. Iterating Towards a More Scalable Ingress. At first they used the Google Cloud Load Balancer Controller (glbc) to route incoming requests to the corresponding services (cluster ingress). Shopify deploys software updates around 40 times per day, and glbc underperformed in that scenario because it wasn't endpoint aware. They moved to ingress-nginx. To reduce the NGINX reload overhead that occurs when services are updated, they implemented dynamic configuration. Results: Up until the 99.9th percentile of request latencies both ingresses are very similar, but when we reach the 99.99th percentile or greater, ingress-nginx outperforms glbc by multiple orders of magnitude.

  • By building Pushman we were able to drop third-party integrations from our platform and cut our costs by 20x. Pushman has a message throughput of 32M with 4 production servers. PushMan: The Koinex standard for realtime experience: we have delivered more than 20 billion realtime messages to our users in a single day...Pushman is only optimized for delivering the realtime value of an attribute with minimum latency to a large pool of online subscribers...Publishers send the messages to the pushman-pub component over HTTP, which consequently pushes them to a low latency store. We use Redis as a transient store for published messages. The pushman-sub component retrieves the messages from Redis over a TCP connection, over which multiple Redis PubSub subscriptions are multiplexed, and subsequently sends them to subscribers over the WebSocket protocol...Golang comes out to be the real winner here. With the help of goroutines, Golang can create 1M concurrent execution units with just 4–8 GB of memory. And to no surprise we are running 1.5M goroutines on an AWS m4.2xlarge production instance...We made use of the excellent netpoll library to [handle the C1M problem]...we decided to take a full in-memory approach and stored everything in RAM. It was also supported by the realtime nature of messages Pushman delivers...We then decided to shard Pushman’s internal data structures to decrease the lock contention...With sharding, the code can be written in a more concurrent way and runs over all CPU cores. Hence, the latency of the message sending operation was reduced dramatically...Since there is a hard cap of 65K on the number of TCP connections to Redis, we had to multiplex multiple Redis PubSub subscriptions over one TCP connection...An application deployed over a single server will always have limited CPU and memory to use. E.g. ~1M instructions can be executed per second on a standard CPU core. This implies that if you have a server instance of 8 cores, you can at max send 8M messages/sec
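
    A minimal sketch of the sharding idea (not Pushman's actual code): split one big mutex-guarded map into independently locked shards keyed by a hash of the topic, so goroutines touching different topics rarely contend on the same lock.

```go
// A sharded subscription table: each shard has its own RWMutex, so
// lock contention is spread across shards instead of one global lock.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 32

type shard struct {
	mu   sync.RWMutex
	subs map[string][]string // topic -> subscriber IDs (illustrative)
}

type shardedSubs struct{ shards [shardCount]*shard }

func newShardedSubs() *shardedSubs {
	s := &shardedSubs{}
	for i := range s.shards {
		s.shards[i] = &shard{subs: make(map[string][]string)}
	}
	return s
}

// shardFor hashes the topic to pick which shard owns it.
func (s *shardedSubs) shardFor(topic string) *shard {
	h := fnv.New32a()
	h.Write([]byte(topic))
	return s.shards[h.Sum32()%shardCount]
}

func (s *shardedSubs) Subscribe(topic, subscriber string) {
	sh := s.shardFor(topic)
	sh.mu.Lock() // only this shard is locked, not the whole table
	sh.subs[topic] = append(sh.subs[topic], subscriber)
	sh.mu.Unlock()
}

func (s *shardedSubs) Subscribers(topic string) []string {
	sh := s.shardFor(topic)
	sh.mu.RLock()
	defer sh.mu.RUnlock()
	return sh.subs[topic]
}

func main() {
	subs := newShardedSubs()
	subs.Subscribe("BTC-INR", "conn-42") // hypothetical topic/connection
	fmt.Println(subs.Subscribers("BTC-INR"))
}
```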

  • We are used to compute clouds, so how about a swarm robot cloud? Robotarium: A Robotics Lab Accessible to All. It's a remotely accessible swarm robotics lab to which you can upload your own code. The idea is that you don't have to buy your own 100 robots. It's now used by 500 research groups. Researchers can use it for free; for-profits pay. An interesting thing they learned is that researchers want to program it using MATLAB.

  • Test your disaster recovery plans. An Azure failure in the South Central US region can impact other regions. @imaterek: Hey there. At some level all of our Data centers are connected. So if one fails it will fail over to the other data centers. Also customers in Europe might have some resources hosted in the affected Data Center ^NW. Also, How Many Data Centers Needed World-Wide

  • Free is not always free. Let’s Encrypt must verify that you “own” a domain by checking that you have control of it. Easy for one or two domains, but what if you have over 3000 domains? That's the problem AutoTrader solved in Let's Encrypt at Scale. Also, How Etsy Manages HTTPS and SSL Certificates for Custom Domains on Pattern
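
    For a sense of the machinery involved, Go's golang.org/x/crypto/acme/autocert package automates the single-server version of what AutoTrader had to industrialize across 3000+ domains. A minimal sketch with placeholder domains and cache path; it illustrates the general ACME approach, not AutoTrader's pipeline.

```go
// Automated Let's Encrypt issuance with autocert: certificates are
// obtained and renewed on demand as TLS handshakes arrive.
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/crypto/acme/autocert"
)

func main() {
	domains := []string{"example.com", "www.example.com"} // placeholder list

	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		Cache:      autocert.DirCache("certs"), // persists issued certs across restarts
		HostPolicy: autocert.HostWhitelist(domains...),
	}

	srv := &http.Server{
		Addr:      ":443",
		TLSConfig: m.TLSConfig(), // solves ACME challenges during the handshake
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "hello over HTTPS")
		}),
	}
	panic(srv.ListenAndServeTLS("", "")) // certs come from the Manager, not files
}
```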

  • github/blockstack: A new decentralized internet. 

  • joshsharp/braid: A functional language with Reason-like syntax that compiles to Go. I’m working on a language I’m calling Braid, an ML-like language that compiles to Go. Braid’s syntax is heavily inspired by Reason, itself a more C-like syntax on top of OCaml. So really I’m writing a language that aims to be fairly similar to OCaml in what it can do, but visually a bit closer to Go. I’m not trying to reimplement OCaml or Reason 1:1 on top of Go, but build something sharing many of the same concepts.

  • AKSHAYUBHAT/DeepVideoAnalytics: A distributed visual search and visual data analytics platform. Deep Video Analytics is a platform for indexing and extracting information from videos and images. With the latest version of Docker installed correctly, you can run Deep Video Analytics in minutes locally (even without a GPU) using a single command.

  • Scipy Lecture Notes: Tutorials on the scientific Python ecosystem: a quick introduction to central tools and techniques. The different chapters each correspond to a 1- to 2-hour course, with increasing levels of expertise from beginner to expert.

  • Facebook: FBOSS: Building Switch Software at Scale: In this paper, we present our ongoing experiences in overcoming the complexity and scaling issues that we face when designing, developing, deploying, and operating in-house software built to manage and support a set of features required for the data center switches of a large scale Internet content provider. We present FBOSS, our own data center switch software, designed on the basis of our switch-as-a-server and deploy-early-and-iterate principles. We treat software running on data center switches like any other software service that runs on a commodity server. We also build and deploy only a minimal number of features and iterate on it. These principles allow us to rapidly iterate, test, deploy and manage FBOSS at scale. Over the last five years, our experiences show that FBOSS’s design principles allow us to quickly build a stable and scalable network. As evidence, we have successfully grown the number of FBOSS instances running in our data center by over 30x over a two year period.