Sat.Oct 12, 2019 - Fri.Oct 18, 2019

Evolving Michelangelo Model Representation for Flexibility at Scale

Uber Engineering

Michelangelo , Uber’s machine learning (ML) platform, supports the training and serving of thousands of models in production across the company.

Digital Twins and Real-Time Digital Twins: What’s the Difference?

ScaleOut Software

Digital twins are typically used in the field of product life-cycle management (PLM) to model the behavior of individual devices or components within a system. This assists in their design and development and helps lower costs.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Everything as Code


Dynatrace news. We’re no longer living in an age where large companies require only physical servers, with similar and rarely changing configurations, that could be manually maintained in a single Datacenter.

Code 185

Delta: A Data Synchronization and Enrichment Platform

The Netflix TechBlog

Part I: Overview Andreas Andreakis , Falguni Jhaveri , Ioannis Papapanagiotou , Mark Cho , Poorna Reddy , Tongliang Liu Overview It is a commonly observed pattern for applications to utilize multiple datastores where each is used to serve a specific need such as storing the canonical form of data (MySQL etc.), providing advanced search capabilities (ElasticSearch etc.), caching (Memcached etc.), and more. Typically when using multiple datastores, one of them acts as the primary store, and the others as derived stores. Now the challenge becomes how to keep these datastores in sync. We have observed a series of distinct patterns which have tried to address multi-datastore synchronization, such as dual writes, distributed transactions, etc. However, these approaches have limitations in regards to feasibility, robustness, and maintenance. Beyond data synchronization, some applications also need to enrich their data by calling external services. To address these challenges, we developed Delta. Delta is an eventual consistent, event driven, data synchronization and enrichment platform. Existing Solutions Dual Writes In order to keep two datastores in sync, one could perform a dual write, which is executing a write to one datastore following a second write to the other. The first write can be retried, and the second can be aborted should the first fail after exhausting retries. However, the two datastores can get out of sync if the write to the second datastore fails. A common solution is to build a repair routine, which can periodically re-apply data from the first to the second store, or does so only if differences are detected. Issues: Implementing the repair routine typically is tailored work which may not be reusable. Also, data between the stores remain out of sync until the repair routine is applied. The solution can become increasingly complicated if more than two datastores are involved. Finally, the repair routine can add substantial stress to the primary data source during its activity. Change Log Table When mutations (like an insert, update and delete) occur on a set of tables, entries for the changes are added to the log table as part of the same transaction. Another thread or process is constantly polling events from the log table and writes them to one or multiple datastores, optionally removing events from the log table after acknowledged by all datastores. Issues: This needs to be implemented as a library and ideally without requiring code changes for the application using it. In a polyglot environment this library implementation needs to be repeated for each supported language and it is challenging to ensure consistent features and behavior across languages. Another issue exists for the capture of schema changes, where some systems, like MySQL, don’t support transactional schema changes [1][2]. Therefore, the pattern to execute a change (like a schema change) and to transactionally write it to the change log table does not always work. Distributed Transactions Distributed transactions can be used to span a transaction across multiple heterogeneous datastores so that a write operation is either committed to all involved stores or to none. Issues: Distributed transactions have proven to be problematic across heterogeneous datastores. By their nature, they can only rely on the lowest common denominator of participating systems. For example, XA transactions block execution if the application process fails during the prepare phase; moreover, XA provides no deadlock detection and no support for optimistic concurrency-control schemes. Also, certain systems like ElasticSearch, do not support XA or any other heterogeneous transaction model. Thus, ensuring the atomicity of writes across different storage technologies remains a challenging problem for applications [3]. Delta Delta has been developed to address the limitations of existing solutions for data synchronization, and also allows to enrich data on the fly. Our goal was to abstract those complexities from application developers so they can focus on implementing business features. In the following, we are describing “Movie Search”, an actual use case within Netflix that leverages Delta. In Netflix the microservice architecture is widely adopted and each microservice typically handles only one type of data. The core movie data resides in a microservice called Movie Service, and related data such as movie deals, talents, vendors and so on are managed by multiple other microservices (e.g Deal Service, Talent Service and Vendor Service). Business users in Netflix Studios often need to search by various criteria for movies in order to keep track of productions, therefore, it is crucial for them to be able to search across all data that are related to movies. Prior to Delta, the movie search team had to fetch data from multiple other microservices before indexing the movie data. Moreover, the team had to build a system that periodically updated their search index by querying others for changes, even if there was no change at all. That system quickly grew very complex and became difficult to maintain. Figure 1. Polling System Prior to Delta After on-boarding to Delta, the system is simplified into an event driven system, as depicted in the following diagram. CDC (Change-Data-Capture) events are sent by the Delta-Connector to a Keystone Kafka topic. A Delta application built using the Delta Stream Processing Framework consumes the CDC events from the topic, enriches each of them by calling other microservices, and finally sinks the enriched data to the search index in Elasticsearch. The whole process is nearly real-time, meaning as soon as the changes are committed to the datastore, the search indexes are updated. Figure 2. Data Pipeline using Delta In the following sections, we are going to describe the Delta-Connector that connects to a datastore and publishes CDC events to the Transport Layer, which is a real-time data transportation infrastructure routing CDC events to Kafka topics. And lastly we are going to describe the Delta Stream Processing Framework that application developers can use to build their data processing and enrichment logics. CDC (Change-Data-Capture) We have developed a CDC service named Delta-Connector, which is able to capture committed changes from a datastore in real-time and write them to a stream. Real-time changes are captured from the datastore’s transaction log and dumps. Dumps are taken because transaction logs typically do not contain the full history of changes. Changes are commonly serialized as Delta events so that a consumer does not need to be concerned if a change originates from the transaction log or a dump. Delta-Connector offers multiple advanced features such as: Ability to write into custom outputs beyond Kafka. Ability to trigger manual dumps at any time, for all tables, a specific table, or for specific primary keys. Dumps can be taken in chunks, so that there is no need to repeat from scratch in case of failure. No need to acquire locks on tables, which is essential to ensure that the write traffic on the database is never blocked by our service. High availability, via standby instances across AWS Availability Zones. We currently support MySQL and Postgres, including when deployed in AWS RDS and its Aurora flavor. In addition, we support Cassandra (multi-master). We will cover the Delta-Connector in more detail in upcoming blog posts. Kafka & Transport Layer The transport layer of Delta events were built on top of the Messaging Service in our Keystone platform. Historically, message publishing at Netflix is optimized for availability instead of durability (see a previous blog ). The tradeoff is potential broker data inconsistencies in various edge scenarios. For example, unclean leader election will result in consumer to potentially duplicate or lose events. For Delta, we want stronger durability guarantees in order to make sure CDC events can be guaranteed to arrive to derived stores. To enable this, we offered special purpose built Kafka cluster as a first class citizen. Some broker configuration looks like below. In Keystone Kafka clusters, unclean leader election is usually enabled to favor producer availability. This can result in messages being lost when an out-of-sync replica is elected as a leader. For the new high durability Kafka cluster, unclean leader election is disabled to prevent these messages getting lost. We’ve also increased the replication factor from 2 to 3 and the minimum insync replicas from 1 to 2. Producers writing to this cluster require acks from all, to guarantee that 2 out of 3 replicas have the latest messages that were written by the producers. When a broker instance gets terminated, a new instance replaces the terminated broker. However, this new broker will need to catch up on out-of-sync replicas, which may take hours. To improve the recovery time for this scenario, we started using block storage volumes (Amazon Elastic Block Store) instead of local disks on the brokers. When a new instance replaces the terminated broker, it now attaches the EBS volume that the terminated instance had and starts catching up on new messages. This process reduces the catch up time from hours to minutes since the new instance no longer have to replicate from a blank state. In general, the separate life cycles of storage and broker greatly reduce the impact of broker replacement. To further maximize our delivery guarantee, we used the message tracing system to detect any message loss due to extreme conditions (e.g clock drift on the partition leader). Stream Processing Framework The processing layer of Delta is built on top of Netflix SPaaS platform, which provides Apache Flink integration with the Netflix ecosystem. The platform provides a self-service UI which manages Flink job deployments and Flink cluster orchestration on top of our container management platform Titus. The self-service UI also manages job configurations and allows users to make dynamic configuration changes without having to recompile the Flink job. Delta provides a stream processing framework on top of Flink and SPaaS that uses an annotation driven DSL (Domain Specific Language) to abstract technical details further away. For example, to define a step that enriches events by calling external services, users only need to write the following DSL and the framework will translate it into a model which is executed by Flink. Figure 3. Enrichment DSL Example in a Delta Application The processing framework not only reduces the learning curve, but also provides common stream processing functionalities like deduplication, schematization, as well as resilience and fault tolerance to address general operational concerns. Delta Stream Processing Framework consists of two key modules, the DSL & API module and Runtime module. The DSL & API module provides the annotation based DSL and UDF (User-Defined-Function) APIs for users to write custom processing logic (e.g filter and transformation). The Runtime module provides DSL parser implementation that builds an internal representation of the processing steps in DAG models. The Execution component interprets the DAG models to initialize the actual Flink operators and eventually run the Flink app. The architecture of the framework is illustrated in the following Chart. Figure 4. Delta Stream Processing Framework Architecture This approach has several benefits: Users can focus on their business logic without the need of learning the specifics of Flink or the SPaaS framework. Optimization can be made in a way that is transparent to users, and bugs can be fixed without requiring any changes to user code (UDFs). Operating Delta applications is made simple for users as the framework provides resilience and failure tolerance out of the box and collects many granular metrics that can be used for alerts. Production Usages Delta has been running in production for over a year and has been playing a crucial role in many Netflix Studio applications. It has helped teams implement use cases such as search indexing, data warehousing, and event driven workflows. Below is a view of the high level architecture of the Delta platform. Figure 5. High Level Architecture of Delta Stay Tuned We will publish follow-up blogs about technical details of the key components such as Delta-Connector and Delta Stream Processing Framework. Please stay tuned. Also feel free to reach out to the authors for any questions you may have. Credits We would like to thank the following persons that have been involved in making Delta successful at Netflix: Allen Wang , Charles Zhao , Jaebin Yoon , Josh Snyder , Kasturi Chatterjee , Mark Cho , Olof Johansson , Piyush Goyal , Prashanth Ramdas , Raghuram Onti Srinivasan , Sandeep Gupta , Steven Wu , Tharanga Gamaethige , Yun Wang , and Zhenzhong Xu. References [link] [link] Martin Kleppmann, Alastair R. Beresford, and Boerge Svingen. 2019. Online Event Processing. Queue 17, 1, pages 40 (February 2019), 21 pages. DOI: [link] Delta: A Data Synchronization and Enrichment Platform was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story. change-data-capture stream-processing data-synchronization big-data event-driven-systems

PostgreSQL Connection Pooling: Part 1 – Pros & Cons


A long time ago, in a galaxy far far away, ‘threads’ were a programming novelty rarely used and seldom trusted. In that environment, the first PostgreSQL developers decided forking a process for each connection to the database is the safest choice.

Why I Returned to Windows


[link]. I’m a Unix guy. Note that I did not say Linux. When I started my career, many small to mid-sized companies were running on minicomputers from companies such as IBM, Digital Equipment Corporation (DEC), PR1ME Computer, and others.

AI-powered custom log metrics for faster troubleshooting


Dynatrace news. You might already use Dynatrace Log Monitoring to gain direct access to the log content of your system’s mission-critical processes. Log Monitoring is a great way to search for text patterns like log files with errors or exceptions.

More Trending

Invisible mask: practical attacks on face recognition with infrared

The Morning Paper

Invisible mask: practical attacks on face recognition with infrared Zhou et al., arXiv’18. You might have seen selected write-ups from The Morning Paper appearing in ACM Queue.

Tuning 114

The Marvel of Observability


Marvel at this article! You may also like: The Observability Pipeline. “[You’ve] You’ve] been fighting with one arm behind your back. What happens when [you’re] finally set free?” — Paraphrasing Carol Danvers, a.k.a. Captain Marvel.

Improve Cloud Foundry observability with the immutable OneAgent BOSH release


Dynatrace news. Cloud Foundry BOSH is a powerful tool that combines release engineering, deployment, and life cycle management of distributed software in the cloud.

Cloud 163

ML Platform Meetup: Infra for Contextual Bandits and Reinforcement Learning

The Netflix TechBlog

Faisal Siddiqi Infrastructure for Contextual Bandits and Reinforcement Learning?—? theme of the ML Platform meetup hosted at Netflix, Los Gatos on Sep 12, 2019. Contextual and Multi-armed Bandits enable faster and adaptive alternatives to traditional A/B Testing.

“I was told to buy a software or lose my computer: I ignored it.” A study of ransomware

The Morning Paper

“I was told to buy a software or lose my computer. I ignored it”: a study of ransomware Simoiu et al., SOUPS 2019. This is a very easy to digest paper shedding light on the prevalence of ransomware and the characteristics of those most likely to be vulnerable to it.

Writing About Performance [Prompts]


Prompts to banish writer's block. Trying to write an article but have nothing to write about? You're in the right place! This is the solution to all your writer's block needs! No more excuses, just solutions.

How to maximize CPU performance for PostgreSQL 12.0 benchmarks on Linux


HammerDB doesn’t publish competitive database benchmarks, instead we always encourage people to be better informed by running their own.

ML Platform Meetup: Infra for Contextual Bandits and Reinforcement Learning

The Netflix TechBlog

Faisal Siddiqi Infrastructure for Contextual Bandits and Reinforcement Learning?—? theme of the ML Platform meetup hosted at Netflix, Los Gatos on Sep 12, 2019. Contextual and Multi-armed Bandits enable faster and adaptive alternatives to traditional A/B Testing.

HackPPL: a universal probabilistic programming language

The Morning Paper

HackPPL: a universal probabilistic programming language Ai et al., MAPL’19. The Hack programming language, as the authors proudly tell us, is “ a dominant web development language across large technology firms with over 100 million lines of production code.” ” Nail that niche!

Monitoring Prow Resources With Prometheus and Grafana


She is monitoring Prow resources very closely. At Loodse we’re making extensive use of Prow , Kubernetes’ own CI/CD framework , for our public and private projects.

The Flow Framework™ – Treating your software features as business assets


If you who haven’t read Project to Product yet or any of my previous posts on the four key flow items from the Flow Framework , let me give you a bit of background.

Business Models and Trust

Edge Perspectives

Three years ago I sketched out three dimensions of business model evolution in response to the mounting performance pressure of the Big Shift. In this blog post, I want to highlight the role of this business model evolution in restoring trust in our corporations.

Ten-Ton Widgets

CSS - Tricks

At a recent conference talk (sorry, I forget which one), there was a quick example of poor web performance in the form of a third-party widget. The example showed a site that installed the widget in order add a "email us" button fixed to the bottom right of the viewport.

Efficient Enterprise Testing — Integration Tests (Part Three)


Efficiency is everything! This part of the series will show how to verify our applications with code-level as well as system-level integration tests. performance junit integration testing system testing enterprise testing

Two kernel mysteries and the most technical talk I've ever seen

Brendan Gregg

If you start digging into Linux kernel internals, like function disassembly and profiling, you may run into two mysteries of kernel engineering, as I did: 1. What is this "__fentry__" stuff at the start of _every_ kernel function? Profiling? Why doesn't it massively slow down the Linux kernel?

C++ 83

Netlify Build Plugin for SpeedCurve

Tim Kadlec

Netlify hosted their JAMStack Conf in San Francisco this past Wednesday. Quibbles with the JAMStack name aside, there were some great talks in the schedule and they’ve started to fill up my watch later list.

How It Works: SQL Server Lock Iteration / Enumeration

SQL Server According to Bob

When executing a query to enumerate the locks, such as select * from sys.dm_tran_locks, tran_locks, how does SQL Server scan the locks and avoid impacting the overall concurrency?

How Much Did Poor Quality Software Cost in 2018?


Quality over quantity. Poor-quality software has huge and growing economic consequences for organizations in the United States. But what are the actual monetary costs — and how can your technology company mitigate them?

Understanding Event Loss with Extended Events

SQL Performance

My colleague, Erin Stellato, recently asked me a question about where and why event loss could happen with Extended Events.

SpaceX Spending $10 Billion to Make the Internet 20ms Faster


Elon Musk’s need for speed. As the media coverage of the project ramps up, you may have heard of Starlink – SpaceX’s new satellite constellation project.

Workflow Considerations for Using an Image Management Service

CSS - Tricks

There are all these sites out there that want to help you with your images. They do things like optimize your images and help you serve them performantly. Here's the type of service I mean. Cloudinary. ImageEngine. imgix. Akami Image Manager. KeyCDN Image Processing. CloudImage. ImageOptim API. Netlify Image Transformation. That's a very good thing.

Media 45

Why Are Bug Tracking Tools so Important for Testing Teams?


We've got to keep track of these bugs! Identifying bugs is one of the crucial phases in the software development lifecycle. Tracking the bug ensures quality assurance of software as well as eliminates the risk of post-release glitches.

Act locally, connect globally with IoT and edge computing

All Things Distributed

There are places so remote, so harsh that humans can't safely explore them (for example, hundreds of miles below the earth, areas that experience extreme temperatures, or on other planets).

IoT 62

IoT Monitoring for Today and Tomorrow


The two buzz words this year have been the “Connected Car” and “IoT Device.” ” The automotive industry has taken the Read More. The post IoT Monitoring for Today and Tomorrow appeared first on Apica. IoT Monitoring

IoT 40

Network Throttling: Monitor the User Experience


Network Throttling When it comes to monitoring web application performance, not only is it necessary emulate user actions, but also network conditions of end-user devices. Network throttling allows you to control connection speeds to better match the experience of real users, allowing you to see web application behavior in specific network conditions. Network connections can… The post Network Throttling: Monitor the User Experience appeared first on Dotcom-Monitor Web Performance Blog.

How to Make Your Website Load Faster


Faster loading? Sign me up! If you implement testing your landing page you will get to know that you are wasting your money by just throwing them inside pits. The reason behind this is that your pages are simply not loading as expected as fast enough.

AV1 Image File Format (AVIF)


The major application for high quality compression of photos is Internet speed. While bandwidth does continue to increase each year, so does the quality and size of most Internet media.

Media 52

IoT Monitoring for Today and Tomorrow


The two buzz words this year have been the “Connected Car” and “IoT Device.” ” The automotive industry has taken the Read More. The post IoT Monitoring for Today and Tomorrow appeared first on Apica. IoT Monitoring

IoT 40

The new MongoDB support for NServiceBus

Particular Software

If you're using (or considering) MongoDB with NServiceBus, we've got good news: we've just released our official MongoDB persister. This replaces the previous community-created (but now abandoned) MongoDB persistence options and is now a fully supported part of the Particular Service Platform.

Improving Neo4J OGM Performance


We'll help you improve performance! You may also like: 7 Simple Ways to Improve Website and Database Performance. Overview. I've been investigating US federal lobbying using Open Data published by the US government.