Presentation: Modern Compute Stack for Scaling Large AI/ML/LLM Workloads

InfoQ

Jules Damji discusses which infrastructure should be used for distributed fine-tuning and training, how to scale ML workloads, how to accommodate large models, and how CPUs and GPUs can be utilized. By Jules Damji

Tuning 89

Why applying chaos engineering to data-intensive applications matters

Dynatrace

Kafka Streams takes longer to recover from failures and presents volatile behavior. Optimized fault recovery: We're also interested in exploring the potential of tuning configurations to improve recovery speed and performance after failures and avoid the demand for additional computing resources. Recovery time of the latency p90.

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

The Netflix TechBlog

Operational automation (including but not limited to auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing) is key to the success of modern data platforms. We have also noted great potential for further improvement through model tuning (see the section of Rollout in Production).

Tuning 215

Machine Learning for Fraud Detection in Streaming Services

The Netflix TechBlog

We present a systematic overview of the unexpected streaming behaviors together with a set of model-based and data-driven anomaly detection strategies to identify them. Data Featurization: A complete list of features used in this work is presented in Table 1. The features mainly belong to two distinct classes.

C++ 318

AWS re:Invent 2017: How Netflix Tunes EC2

Brendan Gregg

My last talk for 2017 was at AWS re:Invent, on "How Netflix Tunes EC2 Instances for Performance," an updated version of my [2014] talk. Our team looks after the BaseAMI, kernel tuning, OS performance tools and profilers, and self-service tools like Vector. We help where we can. Many other Netflix staff spoke at re:Invent (list here).

Tuning 59

Reinventing our Dynatrace Core Values

Dynatrace

“It was an iterative process allowing us to reflect on the past, present, and future of Dynatrace, discuss our findings, and explore the results received in the employee experience discovery.” Stay tuned. Dynatrace CEO, Rick McConnell. Now, our refreshed Dynatrace Core Values are established.
