article thumbnail

Offline Data Pipeline Best Practices Part 1:Optimizing Airflow Job Parameters for Apache Hive

DZone

Welcome to the first post in our exciting series on mastering offline data pipeline's best practices, focusing on the potent combination of Apache Airflow and data processing engines like Hive and Spark. Working together, they form the backbone of many modern data engineering solutions.

article thumbnail

What is IT automation?

Dynatrace

And what are the best strategies to reduce manual labor so your team can focus on more mission-critical issues? This requires significant data engineering efforts, as well as work to build machine-learning models. Big data automation tools. Creating a sound IT automation strategy. So, what is IT automation?

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

A Day in the Life of an Experimentation and Causal Inference Scientist @ Netflix

The Netflix TechBlog

At Netflix, our data scientists span many areas of technical specialization, including experimentation, causal inference, machine learning, NLP, modeling, and optimization. Together with data analytics and data engineering, we comprise the larger, centralized Data Science and Engineering group.

Analytics 207
article thumbnail

A case for ELT

Abhishek Tiwari

Cheap storage and on-demand compute in the cloud coupled with the emergence of new big data frameworks and tools are forcing us to rethink the whole ETL and data warehousing architecture. There is a strong argument for ELT i.e. extract, load, and transform model. Classic ETL. Late transformation. Challenges.

article thumbnail

Formulating ‘Out of Memory Kill’ Prediction on the Netflix App as a Machine Learning Problem

The Netflix TechBlog

We at Netflix, as a streaming service running on millions of devices, have a tremendous amount of data about device capabilities/characteristics and runtime data in our big data platform. With large data, comes the opportunity to leverage the data for predictive and classification based analysis.

Big Data 179
article thumbnail

Incremental Processing using Netflix Maestro and Apache Iceberg

The Netflix TechBlog

For example, a job would reprocess aggregates for the past 3 days because it assumes that there would be late arriving data, but data prior to 3 days isn’t worth the cost of reprocessing. Backfill: Backfilling datasets is a common operation in big data processing. data arrives too late to be useful).

article thumbnail

Optimizing data warehouse storage

The Netflix TechBlog

Some of the optimizations are prerequisites for a high-performance data warehouse. Sometimes Data Engineers write downstream ETLs on ingested data to optimize the data/metadata layouts to make other ETL processes cheaper and faster.

Storage 203