Big Data and Data - Technology Performance Pulse

Cutting Big Data Costs: Effective Data Processing With Apache Spark

DZone

SEPTEMBER 14, 2023

In today's data-driven world, efficient data processing plays a pivotal role in the success of any project. Apache Spark , a robust open-source data processing framework, has emerged as a game-changer in this domain.

Big Data

Big Data Processing Games Open Source

3 Performance Tricks for Dealing With Big Data Sets

DZone

AUGUST 21, 2021

This article describes 3 different tricks that I used in dealing with big data sets (order of 10 million records) and that proved to enhance performance dramatically. Trick 1: CLOB Instead of Result Set.

Big Data

Big Data Performance Tuning Mobile

What is Greenplum Database? Intro to the Big Data Database

Scalegrid

MAY 13, 2020

It can scale towards a multi-petabyte level data workload without a single issue, and it allows access to a cluster of powerful servers that will work together within a single SQL interface where you can view all of the data. This feature-packed database provides powerful and rapid analytics on data that scales up to petabyte volumes.

Big Data

Big Data Database Artificial Intelligence Open Source

Write Optimized Spark Code for Big Data Applications

DZone

MARCH 7, 2023

Apache Spark is a powerful open-source distributed computing framework that provides a variety of APIs to support big data processing. Broadcast variables can be used to efficiently distribute large read-only data structures, such as lookup tables, to worker nodes.

Big Data

Big Data Code Tuning Open Source

ScyllaDB Trends – How Users Deploy The Real-Time Big Data Database

Scalegrid

NOVEMBER 25, 2019

ScyllaDB is an open-source distributed NoSQL data store, reimplemented from the popular Apache Cassandra database. ScyllaDB offers significantly lower latency which allows you to process a high volume of data with minimal delay. So what are some of the reasons why users would pick ScyllaDB vs. Cassandra? Google Cloud.

Big Data

Big Data Database Open Source Azure

How Amazon is solving big-data challenges with data lakes

All Things Distributed

JANUARY 20, 2020

Amazon's worldwide financial operations team has the incredible task of tracking all of that data (think petabytes). At Amazon's scale, a miscalculated metric, like cost per unit, or delayed data can have a huge impact (think millions of dollars). The team is constantly looking for ways to get more accurate data, faster.

Big Data

Big Data Logistics Retail Government

In-Stream Big Data Processing

Highly Scalable

AUGUST 20, 2013

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. This system has been designed to supplement and succeed the existing Hadoop-based system that had too high latency of data processing and too high maintenance costs.

Big Data

Big Data Processing Lambda Database

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

Uber Engineering

OCTOBER 17, 2018

To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks … The post Uber’s Big Data Platform: 100+ Petabytes with Minute Latency appeared first on Uber Engineering Blog.

Big Data

Big Data Latency Transportation Traffic

Introduction to Azure Data Lake Storage Gen2

DZone

FEBRUARY 1, 2023

Built on Azure Blob Storage, Azure Data Lake Storage Gen2 is a suite of features for big data analytics. Azure Data Lake Storage Gen1 and Azure Blob Storage's capabilities are combined in Data Lake Storage Gen2. For instance, Data Lake Storage Gen2 offers scale, file-level security, and file system semantics.

Azure

Azure Storage Big Data Analytics

What is IT operations analytics? Extract more data insights from more sources

Dynatrace

MAY 1, 2023

IT operations analytics is the process of unifying, storing, and contextually analyzing operational data to understand the health of applications, infrastructure, and environments and streamline everyday operations. ITOA collects operational data to identify patterns and anomalies for faster incident management and near-real-time insights.

Analytics

Analytics Artificial Intelligence Big Data Open Source

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

AUGUST 3, 2018

From driver and rider locations and destinations, to restaurant orders and payment transactions, every interaction on Uber’s transportation platform is driven by data.

Big Data

Big Data Transportation Engineering Storage

Kubernetes for Big Data Workloads

Abhishek Tiwari

DECEMBER 27, 2017

Kubernetes has emerged as go to container orchestration platform for data engineering teams. In 2018, a widespread adaptation of Kubernetes for big data processing is anitcipated. Organisations are already using Kubernetes for a variety of workloads [1] [2] and data workloads are up next. Key challenges. Performance.

Big Data

Big Data Storage Benchmarking Hardware

Auto-Diagnosis and Remediation in Netflix Data Platform

The Netflix TechBlog

JANUARY 13, 2022

By Vikram Srivastava and Marcelo Mayworm Netflix has one of the most complex data platforms in the cloud on which our data scientists and engineers run batch and streaming workloads. As our subscribers grow worldwide and Netflix enters the world of gaming , the number of batch workflows and real-time data pipelines increases rapidly.

Big Data

Big Data Infrastructure Metrics Hardware

An overview of end-to-end entity resolution for big data

The Morning Paper

DECEMBER 13, 2020

An overview of end-to-end entity resolution for big data , Christophides et al., It’s an important part of many modern data workflows, and an area I’ve been wrestling with in one of my own projects. ACM Computing Surveys, Dec. 2020, Article No.

Big Data

Big Data Open Source Processing Analytics

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

The Netflix TechBlog

MARCH 4, 2024

Operational automation–including but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing–is key to the success of modern data platforms. We have also noted a great potential for further improvement by model tuning (see the section of Rollout in Production).

Tuning

Tuning Efficiency Big Data Engineering

What is a data lakehouse? Combining data lakes and warehouses for the best of both worlds

Dynatrace

OCTOBER 4, 2022

While data lakes and data warehousing architectures are commonly used modes for storing and analyzing data, a data lakehouse is an efficient third way to store and analyze data that unifies the two architectures while preserving the benefits of both. What is a data lakehouse? How does a data lakehouse work?

Artificial Intelligence

Artificial Intelligence Analytics Storage Government

Data Engineers of Netflix?—?Interview with Pallavi Phadnis

The Netflix TechBlog

OCTOBER 28, 2021

Data Engineers of Netflix?—?Interview Interview with Pallavi Phadnis This post is part of our “ Data Engineers of Netflix ” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Pallavi Phadnis is a Senior Software Engineer on the Product Data Science and Engineering team.

Data Engineering

Data Engineering Engineering Big Data Software Engineering

Data Engineers of Netflix?—?Interview with Kevin Wylie

The Netflix TechBlog

JULY 15, 2021

Data Engineers of Netflix?—?Interview Interview with Kevin Wylie This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Kevin Wylie is a Data Engineer on the Content Data Science and Engineering team.

Data Engineering

Data Engineering Engineering Entertainment Big Data

Performance Monitoring Dashboards in the Age of Big Data Pollution

Rigor

MAY 22, 2019

Big data is like the pollution of the information age. The Big Data Struggle and Performance Reporting. Alternatively, a number of organizations have created their own internal home-grown systems for managing and distilling web performance and monitoring data. Insights At Any Level. Automatic Updates.

Big Data

Big Data Monitoring Performance Metrics

Offline Data Pipeline Best Practices Part 1:Optimizing Airflow Job Parameters for Apache Hive

DZone

DECEMBER 27, 2023

Welcome to the first post in our exciting series on mastering offline data pipeline's best practices, focusing on the potent combination of Apache Airflow and data processing engines like Hive and Spark. Working together, they form the backbone of many modern data engineering solutions.

Best Practices

Best Practices Data Engineering Big Data Games

Experiences with approximating queries in Microsoft’s production big-data clusters

The Morning Paper

SEPTEMBER 8, 2019

Experiences with approximating queries in Microsoft’s production big-data clusters Kandula et al., Microsoft’s big data clusters have 10s of thousands of machines, and are used by thousands of users to run some pretty complex queries. VLDB’19. A sizable fraction of the jobs are much larger. Implementation.

Big Data

Big Data Analytics Latency Azure

Conducting log analysis with an observability platform and full data context

Dynatrace

APRIL 20, 2023

Modern organizations ingest petabytes of data daily, but legacy approaches to log analysis and management cannot accommodate this volume of data. based financial services group, discussed how the bank uses log monitoring on the Dynatrace platform with an emphasis on observability and security data.

Analytics

Analytics Infrastructure Storage Efficiency

Data Democratization and How to Get Started?

DZone

JUNE 29, 2020

Today data is an important factor for business success. In every business, it has been observed that data is playing a game-changing moment to improve business performance. Data is important and necessary in this increasingly competitive world. Data comes in large volumes everywhere and in complex structures.

Games

Games Performance Big Data Analytics

Data lakehouse innovations advance the three pillars of observability for more collaborative analytics

Dynatrace

FEBRUARY 16, 2023

How do you get more value from petabytes of exponentially exploding, increasingly heterogeneous data? The short answer: The three pillars of observability—logs, metrics, and traces—converging on a data lakehouse. To solve this problem, Dynatrace launched Grail, its causational data lakehouse , in 2022.

Analytics

Analytics Innovation Metrics Database

DynatraceGo! APAC 2021: Lessons in thick data and keeping pace with the market

Dynatrace

AUGUST 10, 2021

This year’s conference agenda was packed full of choices, including: Keynotes : Topics included accelerating digital transformation, with Dynatrace CIO Mike Maciag, and Spatial Collapse: The Great Acceleration of Turning Data Into an Asset, with Tricia Wang from Sudden Compass. We’ve all heard it: data is one of your biggest assets.

DevOps

DevOps Innovation Big Data Cloud

Giving data a heartbeat

Dynatrace

SEPTEMBER 9, 2019

I love data. I have spent virtually my entire career looking at data. Synthetic data, network data, system data, and the list goes on. As much as I love data, data is cold, it lacks emotion. As much as I love data, data is cold, it lacks emotion. Often, 4s is too slow.

Big Data

Big Data Metrics Virtualization Monitoring

Big / Bug Data: Analyzing the Apache Flink Source Code

DZone

DECEMBER 21, 2020

Applications used in the field of Big Data process huge amounts of information, and this often happens in real time. Naturally, such applications must be highly reliable so that no error in the code can interfere with data processing. It is an open-source framework for distributed processing of large amounts of data.

Code

Code Java Big Data Open Source

Data Engineers of Netflix?—?Interview with Dhevi Rajendran

The Netflix TechBlog

JUNE 1, 2021

Data Engineers of Netflix?—?Interview Interview with Dhevi Rajendran Dhevi Rajendran This post is part of our “Data Engineers of Netflix” interview series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix.

Data Engineering

Data Engineering Engineering Software Engineering Big Data

Data Engineers of Netflix?—?Interview with Samuel Setegne

The Netflix TechBlog

JUNE 1, 2021

Data Engineers of Netflix?—?Interview Interview with Samuel Setegne Samuel Setegne This post is part of our “Data Engineers of Netflix” interview series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. For example?—?clinical

Data Engineering

Data Engineering Engineering Big Data Healthcare

QCon London: Lessons Learned From Building LinkedIn’s AI/ML Data Platform

InfoQ

APRIL 15, 2024

He specifically delved into Venice DB, the NoSQL data store used for feature persistence. At the QCon London 2024 conference, Félix GV from LinkedIn discussed the AI/ML platform powering the company’s products. By Rafal Gancarz

Artificial Intelligence

Artificial Intelligence Big Data Data Engineering Latency

How Our Paths Brought Us to Data and Netflix

The Netflix TechBlog

SEPTEMBER 18, 2020

and what the role entails by Julie Beckley & Chris Pham This Q&A provides insights into the diverse set of skills, projects, and culture within Data Science and Engineering (DSE) at Netflix through the eyes of two team members: Chris Pham and Julie Beckley. What was your path to working in data? There’s us to the right!

Storage Systems Big Data Azure

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

The Morning Paper

MAY 14, 2019

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices Gan et al., For the services under study, Seer has a sweet spot when trained with around 100GB of data and a 100ms sampling interval for measuring queue depths. ASPLOS’19. Seer in action.

Big Data

Big Data Cloud Performance Hardware

Delta: A Data Synchronization and Enrichment Platform

The Netflix TechBlog

OCTOBER 15, 2019

Part I: Overview Andreas Andreakis , Falguni Jhaveri , Ioannis Papapanagiotou , Mark Cho , Poorna Reddy , Tongliang Liu Overview It is a commonly observed pattern for applications to utilize multiple datastores where each is used to serve a specific need such as storing the canonical form of data (MySQL etc.), caching (Memcached etc.),

Transportation

Transportation Architecture Storage Processing

What Should You Know About Graph Database’s Scalability?

DZone

JANUARY 20, 2023

There is a countless number of enterprises, particularly Internet giants, that have explored ways to make graph data processing scalable. It has been a norm to perceive that distributed databases use the method of adding cheap PC(s) to achieve scalability (storage and computing) and attempt to store data once and for all on demand.

Scalability

Scalability Big Data Hardware Internet

Cutting Big Data Costs: Effective Data Processing With Apache Spark

3 Performance Tricks for Dealing With Big Data Sets

Trending Sources

What is Greenplum Database? Intro to the Big Data Database

Write Optimized Spark Code for Big Data Applications

ScyllaDB Trends – How Users Deploy The Real-Time Big Data Database

How Amazon is solving big-data challenges with data lakes

In-Stream Big Data Processing

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

Introduction to Azure Data Lake Storage Gen2

What is IT operations analytics? Extract more data insights from more sources

Databook: Turning Big Data into Knowledge with Metadata at Uber

Kubernetes for Big Data Workloads

Auto-Diagnosis and Remediation in Netflix Data Platform

An overview of end-to-end entity resolution for big data

Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…

What is a data lakehouse? Combining data lakes and warehouses for the best of both worlds

Data Engineers of Netflix?—?Interview with Pallavi Phadnis

Data Engineers of Netflix?—?Interview with Kevin Wylie

Performance Monitoring Dashboards in the Age of Big Data Pollution

Offline Data Pipeline Best Practices Part 1:Optimizing Airflow Job Parameters for Apache Hive

Experiences with approximating queries in Microsoft’s production big-data clusters

Conducting log analysis with an observability platform and full data context

Data Democratization and How to Get Started?

Data lakehouse innovations advance the three pillars of observability for more collaborative analytics

DynatraceGo! APAC 2021: Lessons in thick data and keeping pace with the market

Giving data a heartbeat

Big / Bug Data: Analyzing the Apache Flink Source Code

Data Engineers of Netflix?—?Interview with Dhevi Rajendran

Data Engineers of Netflix?—?Interview with Samuel Setegne

QCon London: Lessons Learned From Building LinkedIn’s AI/ML Data Platform

How Our Paths Brought Us to Data and Netflix

Driving down the cost of Big-Data analytics - All Things Distributed

Microsoft Azure Event Hubs

Scaling for Success: Why Scalability Is the Forefront of Modern Applications

Snowflake Workload Optimization

Introduction to Grafana, Prometheus, and Zabbix

What is software automation? Optimize the software lifecycle with intelligent automation

Understanding gRPC Concepts, Use Cases, and Best Practices

Turbocharge Your Apache Spark Jobs for Unmatched Performance

Stream Processing vs. Batch Processing: What to Know

What is a Distributed Storage System

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

Delta: A Data Synchronization and Enrichment Platform

What Should You Know About Graph Database’s Scalability?

Stay Connected