An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems

The Morning Paper

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems Gan et al., Hardware implications. The top line shows the change in tail latency across a set of monolithic applications as operating frequency decreases.

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

John McCalpin

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode. It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. MCDRAM maximum latency: 156.1 ns.

This spring: High-Performance and Low-Latency C++ (Stockholm) and ACCU (Bristol)

Sutter's Mill

Tue-Thu Apr 25-27: High-Performance and Low-Latency C++ (Stockholm). On April 25-27, I’ll be in Stockholm (Kista) giving a three-day seminar on “High-Performance and Low-Latency C++.” This contains updated and new material that reflects the latest C++ standards and compilers, with a focus on using modern C++11/14/17 effectively on modern hardware and memory architectures.

Latency: Will it undermine the most interesting 5G use cases?

VoltDB

Unfortunately, this means that the age-old Telco bugbears will rear their ugly heads again, including latency. 5G, as a fundamental requirement, mandates a 1 millisecond latency from the datasource to its destination. With the 5G revolution, operators will need to manage hundreds of edge deployments, and maintain the physical space and hardware to achieve 1ms of latency. This requires 1 ms network latency.

Invited Talk at SuperComputing 2016!

John McCalpin

“Memory Bandwidth and System Balance in HPC Systems.” If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

An empirical guide to the behavior and use of scalable persistent memory

The Morning Paper

higher latency and lower bandwidth)… We have found the actual behavior of Optane DIMMs to be more complicated and nuanced than the "slower, persistent DRAM" label would suggest. The read latency for Optane is 2x-3x higher than DRAM.

Boosted race trees for low energy classification

The Morning Paper

The goal is to produce a low-energy hardware classifier for embedded applications doing local processing of sensor data. Race logic has four primary operations that are easy to implement in hardware: MAX, MIN, ADD-CONSTANT, and INHIBIT.
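The four race-logic operations named above can be sketched in software by treating each value as the arrival time of a signal edge. This is only an illustrative model, not the paper's implementation: the function names are hypothetical, and the INHIBIT semantics (the data signal passes only if it arrives strictly before the inhibiting signal) and the use of infinity for "never arrives" are assumptions not stated in the excerpt.

```python
import math

# Assumption: a signal that never arrives is modelled as an infinite arrival time.
NEVER = math.inf


def race_min(a, b):
    """MIN: the output fires when the first input arrives."""
    return min(a, b)


def race_max(a, b):
    """MAX: the output fires when the last input arrives."""
    return max(a, b)


def add_constant(a, delay):
    """ADD-CONSTANT: delay a signal by a fixed amount of time."""
    return a + delay


def inhibit(inhibitor, data):
    """INHIBIT (assumed semantics): pass `data` only if it beats `inhibitor`."""
    return data if data < inhibitor else NEVER


if __name__ == "__main__":
    print(race_min(3, 5), race_max(3, 5), add_constant(3, 2))
    print(inhibit(4, 2), inhibit(2, 4))
```

In this model a decision-tree threshold test could be expressed as a race between a feature signal and a delayed reference signal, which is one plausible reason these operations suit a classifier.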

Why I hate MPI (from a performance analysis perspective)

John McCalpin

According to Dr. Bandwidth, performance analysis has two recurring themes: How fast should this code (or “simple” variations on this code) run on this hardware? This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches — iterated until the interactions between all the components are understood. The networking hardware.

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

John McCalpin

Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency. Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e.,

Time to First Byte: What It Is and Why It Matters

CSS Wizardry

The first—and often most surprising for people to learn—thing that I want to draw your attention to is that TTFB counts one whole round trip of latency. The reason is because mobile networks are, as a rule, high latency connections.

Intel discloses “vector+SIMD” instructions for future processors

John McCalpin

These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: [link] ). With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls. But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.
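The 2*5=10 figure above follows from multiplying the number of FMA units by the FMA latency. A minimal sketch of that arithmetic, assuming each unit can start one FMA per cycle and each accumulator chain can issue one FMA per latency period (the function names are hypothetical):

```python
def min_independent_accumulators(fma_units: int, latency_cycles: int) -> int:
    """Independent dependency chains needed to keep every FMA unit busy each cycle."""
    return fma_units * latency_cycles


def sustained_fma_per_cycle(fma_units: int, latency_cycles: int, accumulators: int) -> float:
    """Simplified model: each chain issues one FMA per latency period, capped by the units."""
    return min(fma_units, accumulators / latency_cycles)


if __name__ == "__main__":
    # Haswell/Broadwell figures from the text: 2 units, 5-cycle latency.
    print(min_independent_accumulators(2, 5))
    # With only 5 accumulators, this model predicts half of peak throughput.
    print(sustained_fma_per_cycle(2, 5, 5))
```

Under the same model, the 4-cycle latency quoted for Skylake would need 2*4=8 accumulators. Real code may need more to cover load latency and other pipeline effects that this sketch ignores.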

RPCValet: NI-driven tail-aware balancing of µs-scale RPCs

The Morning Paper

Last week we learned about the increased tail-latency sensitivity of microservices-based applications with high RPC fan-outs. Seer uses estimates of queue depths to mitigate latency spikes on the order of 10-100ms, in conjunction with a cluster manager.

Talk Video: Welcome to the Jungle (60 min version + Q&A)

Sutter's Mill

Now welcome to the hardware jungle. One of the slides I omitted to shorten this version of the talk highlighted that there are actually two issues when you go from “Disjoint (tightly coupled)” to “Disjoint (loosely coupled)”: reliability and latency, and both are important. (I

A case for managed and model-less inference serving

The Morning Paper

As we saw with the SOAP paper last time out, even with a fixed model variant and hardware there are a lot of different ways to map a training workload over the available hardware. Expectation 2: The choice of hardware should be hidden behind the same high level API for users.

Snap: a microkernel approach to host networking

The Morning Paper

Here are the bombshell paragraphs: Our datacenter applications seek ever more CPU-efficient and lower-latency communication, which Pony Express delivers. The desire for CPU efficiency and lower latencies is easy to understand. Snap: a microkernel approach to host networking Marty et al.,

Why Traditional Monitoring Isn’t Enough for Modern Web Applications

Dotcom-Monitor

Network latency. Hardware resources. Network Latency. With the evolution of cloud technologies, such as Single Page Applications (SPAs), Web APIs, and Model View Controller (MVC), network latency has become a crucial factor to be monitored. Hardware Resources.

Updated Azure SQL Database Tier Options

SQL Performance

Gen 5 is the primary hardware option now for most regions since Gen 4 is aging out. New Hardware Configuration for Provisioned Compute Tier. The Gen4/Gen5 hardware configurations are considered "balanced memory and compute."

Predictive CPU isolation of containers at Netflix

The Netflix TechBlog

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains.

Bring Your Own Cloud (BYOC) vs. Dedicated Hosting at ScaleGrid

Scalegrid

This is why our BYOC pricing is less than our Dedicated Hosting pricing, as the costs listed for BYOC are only what you pay for ScaleGrid and don’t include your hardware costs. Deploying your application and database on the same VPC also provides the lowest possible latency path.

How to maximize CPU performance for PostgreSQL 12.0 benchmarks on Linux

HammerDB

cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1000 MHz - 4.00

Time protection: the missing OS abstraction

The Morning Paper

The paper sets out what we can do in software given today’s hardware, and along the way also highlights areas where cooperation from hardware will be needed in the future. Even when we do flush caches, the latency of flushing can itself be used as a channel!

A tale of two abstractions: the case for object space

The Morning Paper

Both abstractions must be implemented in a way that is efficient using existing hardware. The hardware perspective. At the hardware level we want to be able to move data into, out of, and around memory, without consideration for persistence and reference lifetimes.

The Three Types of Performance Testing

CSS Wizardry

Things always always feel fast when we’re developing because, more often than not, we’re working on high-spec machines on dedicated networks, and also serving from localhost which removes the bulk of the latency and bandwidth issues that a real user would suffer.

Performance Tuning Java Applications in Linux

DZone

When performance tuning an application, both the code and the hardware running it should be accounted for. For low latency, applications use the Concurrent Mark Sweep (CMS) or G1 garbage collector. Learn how to make your Java applications perform perfectly.

Seamless offloading of web app computations from mobile device to edge clouds via HTML5 Web Worker migration

The Morning Paper

Edge servers are the middle ground – more compute power than a mobile device, but with latency of just a few ms. These use their regression models to estimate processing time (which will depend on the hardware available, current load, etc.).

Why OpenStack is like a Crowdfunded Viking Movie

VoltDB

“Hardware Optimizers” want to get the maximum utilization out of hardware. These systems were designed to have a lifetime of half a decade or more, and rapidly changing hardware meant that the initial deployment had to be sized for 5-7 years out.

File systems unfit as distributed storage backends: lessons from ten years of Ceph evolution

The Morning Paper

Breaking that assumption allowed Ceph to introduce a new storage backend called BlueStore with much better performance and predictability, and the ability to support the changing storage hardware landscape. Supporting new storage hardware. Supporting modern hardware.

Automating chaos experiments in production

The Morning Paper

degraded hardware, transient networking problem) or, more often, because of some change deployed by Netflix engineers that did not have the intended effect. Automating chaos experiments in production Basiri et al., ICSE 2019.

Narrowing the gap between serverless and its state with storage functions

The Morning Paper

Shredder is "a low-latency multi-tenant cloud store that allows small units of computation to be performed directly within storage nodes." Narrowing the gap between serverless and its state with storage functions, Zhang et al., SoCC’19.

What is Serverless Architecture?

cdemi

Let's talk about the elephant in the room; Serverless doesn't really mean that there are no Software or Hardware servers. Performance - Serverless Functions that are used less frequently may suffer from warmup response latency, where the infrastructure needs some time to deploy the function.

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

The Morning Paper

Last time around we looked at the DeathStarBench suite of microservices-based benchmark applications and learned that microservices systems can be especially latency sensitive, and that hotspots can propagate through a microservices architecture in interesting ways.

Välkommen till Stockholm – An AWS Region is coming to the Nordics

All Things Distributed

This enables customers to serve content to their end users with low latency, giving them the best application experience. In 2011, AWS opened a Point of Presence (PoP) in Stockholm to enable customers to serve content to their end users with low latency.

Taiji: managing global user traffic for large-scale Internet services at the edge

The Morning Paper

The ability of a datacenter to handle traffic changes over time as capacity is added or removed, and hardware upgraded. The routing needs to be able to tolerate failures without making the situation worse. Backend processing latency stays fairly constant during this operation.

Millions of tiny databases

The Morning Paper

This work is latency critical, because volume IO is blocked until it is complete. Larger cells have better tolerance of tail latency (e.g. Studies across three decades have found that software, operations, and scale drive downtime in systems designed to tolerate hardware faults.

Distributed Algorithms in NoSQL Databases

Highly Scalable

Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Read/Write latency. Read/Write requests are processed with minimal latency.

SQL Server 2016 – It Just Runs Faster: Always On Availability Groups Turbocharged

SQL Server According to Bob

When we released Always On Availability Groups in SQL Server 2012 as a new and powerful way to achieve high availability, hardware environments included NUMA machines with low-end multi-core processors and SATA and SAN drives for storage (some SSDs).

Software-defined far memory in warehouse scale computers

The Morning Paper

This boils down to a single digit µs latency toleration in the tail for far memory, and in addition to security and privacy concerns, rules out remote memory solutions. Thus we’re fundamentally trading (de)-compression latency at access time for the ability to pack more data in memory.

Keeping up with Header Bidding’s performance requirements

VoltDB

Most existing adtech infrastructure simply cannot achieve the required latency. VoltDB provides the necessary technology to achieve the latency required by header bidding. However, increasing hardware capacity doesn’t really solve the problem, and it introduces new ones. If increasing hardware is the “work harder” answer to header bidding, then “work smarter” is the better option.

Can You Afford It?: Real-world Web Performance Budgets

Alex Russell

It simulates a link with a 400ms RTT and 400-600Kbps of throughput (plus latency variability and simulated packet loss). Simulated packet loss and variable latency, however, can make benchmarking extremely difficult and slow.
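To get a feel for what a link like this means for page weight, a rough back-of-the-envelope model adds a fixed number of round trips to the time needed to push the bytes through the pipe. This is a deliberately simplified sketch: the function name, the 100 KB payload and the count of two round trips are hypothetical, and it ignores TCP slow start, packet loss and latency variability.

```python
def estimate_transfer_seconds(payload_bytes: int, rtt_seconds: float,
                              bandwidth_bps: float, round_trips: int = 1) -> float:
    """Rough lower bound: latency cost of the round trips plus serialization time."""
    latency_cost = round_trips * rtt_seconds
    serialization_cost = (payload_bytes * 8) / bandwidth_bps
    return latency_cost + serialization_cost


if __name__ == "__main__":
    # Link parameters from the text: 400 ms RTT, 400-600 Kbps throughput.
    for kbps in (400, 600):
        seconds = estimate_transfer_seconds(
            payload_bytes=100_000,      # hypothetical 100 KB payload
            rtt_seconds=0.4,
            bandwidth_bps=kbps * 1000,
            round_trips=2,              # hypothetical: connection setup + request
        )
        print(f"{kbps} Kbps: {seconds:.2f} s")
```

Even under these optimistic assumptions, the result is measured in seconds, not milliseconds, which suggests why budgets on such links end up being expressed in kilobytes.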

How Google PageSpeed Works: Improve Your Score and Search Engine Ranking

CSS-Tricks

Estimated Input Latency. It’s time to come to terms with the fact that your customers aren’t using the same powerful hardware as you. An excellent substitute for using a real device is to use Chrome DevTools hardware emulation mode.