In-Stream Big Data Processing

Highly Scalable

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. The engine can also involve relatively static data (admixtures) loaded from the stores of Aggregated Data.

Kubernetes for Big Data Workloads

Abhishek Tiwari

Kubernetes has emerged as the go-to container orchestration platform for data engineering teams. In 2018, widespread adoption of Kubernetes for big data processing is anticipated. Obviously, containerized data workloads have their own challenges.

Performance Monitoring Dashboards in the Age of Big Data Pollution

Rigor

Big data is like the pollution of the information age. The Big Data Struggle and Performance Reporting. Alternatively, a number of organizations have created their own internal home-grown systems for managing and distilling web performance and monitoring data.

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

Uber Engineering

To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks … Uber is committed to delivering safer and more reliable transportation across our global markets.

Experiences with approximating queries in Microsoft’s production big-data clusters

The Morning Paper

Experiences with approximating queries in Microsoft’s production big-data clusters, Kandula et al. Microsoft’s big data clusters have tens of thousands of machines, and are used by thousands of users to run some pretty complex queries.

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

The Morning Paper

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices, Gan et al. For the services under study, Seer has a sweet spot when trained with around 100GB of data and a 100ms sampling interval for measuring queue depths.

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

From driver and rider locations and destinations, to restaurant orders and payment transactions, every interaction on Uber’s transportation platform is driven by data. Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, …

How to Optimize Elasticsearch for Better Search Performance

DZone

In today's world, data is generated in high volumes, and to make something out of it, the extracted data needs to be transformed, stored, maintained, governed, and analyzed. One of the top trending open-source data stores that fits most of these use cases is Elasticsearch.

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

The Netflix TechBlog

“Can I run a check myself to understand what data is behind this metric?” Now, imagine yourself in the role of a software engineer responsible for a micro-service which publishes data consumed by a few critical customer-facing services (e.g. Design a flexible data model - Represent

Advancing Application Performance with NVMe Storage, Part 3

DZone

NVMe storage's strong performance, combined with the capacity and data availability benefits of shared NVMe storage over local SSD, makes it a strong solution for AI/ML infrastructures of any size. NVMe Storage Use Cases.

Advancing Application Performance With NVMe Storage, Part 2

DZone

Using local SSDs inside of the GPU node delivers fast access to data during training, but introduces challenges that impact the overall solution in terms of scalability, data access, and data protection.

Probabilistic Data Structures for Web Analytics and Data Mining

Highly Scalable

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. The picture above depicts the fact that this data set basically occupies 40MB of memory (10 million 4-byte elements).
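The 40MB figure in the excerpt follows from simple arithmetic. A minimal sketch in Python, assuming decimal megabytes (1 MB = 1,000,000 bytes); the function and variable names are illustrative and do not come from the original article:

```python
def raw_set_size_bytes(num_elements, bytes_per_element):
    """Memory needed to store every element verbatim, ignoring container overhead."""
    return num_elements * bytes_per_element


# 10 million elements of 4 bytes each, as quoted in the excerpt.
size_bytes = raw_set_size_bytes(10_000_000, 4)

# Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
size_mb = size_bytes / 1_000_000
print(f"{size_mb:.0f} MB")
```

This raw footprint is the baseline that probabilistic data structures aim to undercut by trading exactness for space.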

Less is More: Engineering Data Warehouse Efficiency with Minimalist Design

Uber Engineering

Maintaining Uber’s large-scale data warehouse comes with an operational cost in terms of ETL functions and storage. Once identified, … The post Less is More: Engineering Data Warehouse Efficiency with Minimalist Design appeared first on Uber Engineering Blog.

Scaling Uber’s Apache Hadoop Distributed File System for Growth

Uber Engineering

Three years ago, Uber Engineering adopted Hadoop as the storage (HDFS) and compute (YARN) infrastructure for our organization’s big data analysis.

Driving down the cost of Big-Data analytics - All Things Distributed

All Things Distributed

Driving down the cost of Big-Data analytics. The Amazon Elastic MapReduce (EMR) team announced today the ability to seamlessly use Amazon EC2 Spot Instances with their service, significantly driving down the cost of data analytics in the cloud. All Things Distributed.

NoSQL Data Modeling Techniques

Highly Scalable

At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

Data Mining Problems in Retail

Highly Scalable

Retail is one of the most important business domains for data science and data mining applications because of its prolific data and numerous optimization problems such as optimal prices, discounts, recommendations, and stock levels that can be solved using data analysis methods.

Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads

Uber Engineering

Cluster management, a common software infrastructure among technology companies, aggregates compute resources from a collection of physical hosts into a shared resource pool, amplifying compute power and allowing for the flexible use of data center hardware.

MapReduce Patterns, Algorithms, and Use Cases

Highly Scalable

Applications: Log Analysis, Data Querying. Applications: Log Analysis, Data Querying, ETL, Data Validation. Solution: Problem description is split into a set of specifications, and the specifications are stored as input data for Mappers. Applications: ETL, Data Analysis.

Fast Intersection of Sorted Lists Using SSE Instructions

Highly Scalable

Intersection of sorted lists is a cornerstone operation in many applications including search engines and databases because indexes are often implemented using different types of sorted structures.
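For orientation, the operation itself can be sketched as a plain two-pointer merge. This is a scalar baseline written in Python for illustration, not the SSE-based technique the article covers; the function name `intersect_sorted` is hypothetical, and the inputs are assumed to be sorted in ascending order without duplicates.

```python
def intersect_sorted(a, b):
    """Return the elements common to two ascending, duplicate-free lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] cannot appear in b at or after position j
        elif a[i] > b[j]:
            j += 1          # b[j] cannot appear in a at or after position i
        else:
            out.append(a[i])  # match: record it and advance both cursors
            i += 1
            j += 1
    return out


common = intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8])
print(common)
```

Each step advances at least one cursor, so the loop runs in time linear in the combined length of the two lists.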

5 data integration trends that will define the future of ETL in 2018

Abhishek Tiwari

ETL stands for extract, transform, load, and is generally used for data warehousing and data integration. There are several emerging data trends that will define the future of ETL in 2018. Unified data management architecture. Common in-memory data interfaces.

Giving data a heartbeat

Dynatrace

I love data. I have spent virtually my entire career looking at data. Synthetic data, network data, system data, and the list goes on. As much as I love data, data is cold, it lacks emotion.

When Performance Matters, Think NVMe

DZone

The demand for more IT resource-intensive applications has significantly increased today, whether it is to process quicker transactions, gain real-time insight, crunch big data sets, or to meet customer expectations.

Advancing Application Performance with NVMe Storage, Part 1

DZone

With big data on the rise and data algorithms advancing, the ways in which technology has been applied to real-world challenges have grown more automated and autonomous.

Python at Netflix

The Netflix TechBlog

Device interaction for the collection of health and other operational data is yet another Python application. We are heavy users of Jupyter Notebooks and nteract to analyze operational data and prototype visualization tools that help us detect capacity regressions.

Reimagining Experimentation Analysis at Netflix

The Netflix TechBlog

After recreating the dataset, you can plot the raw numbers and perform custom analyses to understand the distribution of the data across test cells. However, the default reports only provide a summary view of the data with some powerful but limited filtering options.

Tackling the Pipeline Problem in the Architecture Research Community

ACM Sigarch

Computer architecture is an important and exciting field of computer science, which enables many other fields (e.g., big-data processing, machine learning, quantum computing, and so on). For those of us who pursued computer architecture as a career, this is well understood.

Fast key-value stores: an idea whose time has come and gone

The Morning Paper

Coupled with stateless application servers to execute business logic and a database-like system to provide persistent storage, they form a core component of popular data center service architectures. (We’ve seen similar high marshalling overheads in big data systems too.)

World’s Top Web Performance Leaders To Watch

Rigor

His most recent book, Hacking Web Performance, walks the reader through improving performance from initial load and data transfer to resource loading and the overall user experience. Would you like to be a part of Big Data and this incredible project?

A case for ELT

Abhishek Tiwari

Cheap storage and on-demand compute in the cloud coupled with the emergence of new big data frameworks and tools are forcing us to rethink the whole ETL and data warehousing architecture. Then we perform frequent batch ETL from application databases to a data warehouse.

Even more amazing papers at VLDB 2019 (that I didn’t have space to cover yet)

The Morning Paper

Their dataset has about 7B edges… Meanwhile, AnalyticDB is Alibaba’s real-time OLAP RDBMS handling 10PB of data (in excess of 100 trillion rows!). Hyper Dimension Shuffle describes how Microsoft improved the cost of data shuffling, one of the most costly operations, in their petabyte-scale internal big data analytics platform, SCOPE. We’ve been covering papers from VLDB 2019 for the last three weeks, and next week it will be time to mix things up again.

London Calling! An AWS Region is coming to the UK!

All Things Distributed

This region will provide even lower latency and strong data sovereignty to local users. Yesterday, AWS evangelist Jeff Barr wrote that AWS will be opening a region in South Korea in early 2016 that will be our 5th region in Asia Pacific.

Expanding the AWS Cloud: Introducing the AWS Europe (London) Region

All Things Distributed

The council has deployed IoT Weather Stations in Schools across the City and is using the sensor information collated in a Data Lake to gain insights on whether the weather or pollution plays a part in learning outcomes.

I Used The Web For A Day On A 50 MB Budget

Smashing Magazine

Many of us are lucky enough to be on mobile plans which allow several gigabytes of data transfer per month. Failing that, we are usually able to connect to home or public WiFi networks that are on fast broadband connections and have effectively unlimited data. The Cost Of Mobile Data.

DynamoDB for Location Data: Geospatial querying on DynamoDB datasets

All Things Distributed

Over the past few years, two important trends that have been disrupting the database industry are mobile applications and big data.

Expanding the Cloud: Introducing the AWS Asia Pacific (Mumbai) Region

All Things Distributed

Seamless ingestion of large volumes of sensed data. AdiMap uses Amazon Kinesis to process real-time streaming online ad data and job feeds, and processes them for storage in petabyte-scale Amazon Redshift. Advanced problem solving that connects big data with machine learning.

USENIX LISA 2018: CFP Now Open

Brendan Gregg

Join us for 3 days in Nashville at LISA'18. Post by Brendan Gregg and Rikki Endsley. USENIX’s LISA conference is the premier event for topics in production system engineering.

40+ Best Web Development Blogs of 2018

KeyCDN

It’s awesome for discovering how grid systems, CSS animation, Big Data, etc. all play roles in real-world web design. It includes tutorials, links to data-visualization tools, design resources and articles that cite real-world business experiments.

Expanding the Cloud: Introducing Amazon QuickSight

All Things Distributed

We live in a world where massive volumes of data are generated from websites, connected devices and mobile apps. However, the data infrastructure to collect, store and process data is geared toward developers (e.g., Big data challenges.

Expanding the AWS Cloud – Introducing the AWS Europe (Stockholm) Region

All Things Distributed

WirelessCar is also using a host of AWS security services to ensure that the data it collects for its customers is secure, and also that the services the company offers are reliable as well.

Expanding the Cloud: Introducing the AWS Asia Pacific (Seoul) Region

All Things Distributed

We believe that with the launch of the Seoul Region, AWS will enable many more enterprise customers in Korea to reduce the cost of their IT operations and innovate faster in critical new areas such as big data analysis, Internet of Things, and more.

Spice up your Analytics: Amazon QuickSight Now Generally Available in N. Virginia, Oregon, and Ireland.

All Things Distributed

Previously, I wrote about Amazon QuickSight, a new service targeted at business users that aims to simplify the process of deriving insights from a wide variety of data sources quickly, easily, and at a low cost.

Expanding the Cloud – An AWS Region is coming to Hong Kong

All Things Distributed

The new region will give Hong Kong-based businesses, government organizations, non-profits, and global companies with customers in Hong Kong, the ability to leverage AWS technologies from data centers in Hong Kong.