article thumbnail

Supporting Diverse ML Systems at Netflix

The Netflix TechBlog

The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow , an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems.

Systems 226
article thumbnail

Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries

DZone

TiDB is an open-source, distributed SQL database that supports Hybrid Transactional/Analytical Processing (HTAP) workloads. For external reasons, application traffic may surge and increase the pressure on the cluster. Ideally, a TiDB cluster should always be efficient and problem-free. However, reality is often unsatisfactory.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

Building Netflix’s Distributed Tracing Infrastructure

The Netflix TechBlog

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. Using simple lookup indices in Cassandra gives us the ability to maintain acceptable read latencies while doing heavy writes.

article thumbnail

MySQL Key Performance Indicators (KPI) With PMM

Percona

A monitoring tool like Percona Monitoring and Management (PMM) is a popular choice among open source options for effectively monitoring MySQL performance. That said, it should also be monitored for usage, which will exhibit the traffic pressuring them. This is not an exhaustive list but an example of what we can watch for.

article thumbnail

Netflix at AWS re:Invent 2019

The Netflix TechBlog

Netflix shares how Amazon EC2 Auto Scaling allows its infrastructure to automatically adapt to changing traffic patterns in order to keep its audience entertained and its costs on target. Netflix runs dozens of stateful services on AWS under strict sub-millisecond tail-latency requirements, which brings unique challenges. Wednesday?—?December

AWS 100
article thumbnail

Netflix at AWS re:Invent 2019

The Netflix TechBlog

Netflix shares how Amazon EC2 Auto Scaling allows its infrastructure to automatically adapt to changing traffic patterns in order to keep its audience entertained and its costs on target. Netflix runs dozens of stateful services on AWS under strict sub-millisecond tail-latency requirements, which brings unique challenges. Wednesday?—?December

AWS 100
article thumbnail

Predictive CPU isolation of containers at Netflix

The Netflix TechBlog

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Cache 251