In-Stream Big Data Processing

Highly Scalable

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. The engine can also involve relatively static data (admixtures) loaded from the stores of Aggregated Data.

Performance Monitoring Dashboards in the Age of Big Data Pollution


Big data is like the pollution of the information age. The Big Data Struggle and Performance Reporting. Alternatively, a number of organizations have created their own internal home-grown systems for managing and distilling web performance and monitoring data.

Kubernetes for Big Data Workloads

Abhishek Tiwari

Kubernetes has emerged as the go-to container orchestration platform for data engineering teams. In 2018, widespread adoption of Kubernetes for big data processing is anticipated. Obviously, containerized data workloads have their own challenges.

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

Uber Engineering

To accomplish this, Uber relies heavily on making data-driven decisions at every level, from forecasting rider demand during high traffic events to identifying and addressing bottlenecks … Uber is committed to delivering safer and more reliable transportation across our global markets.

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices

The Morning Paper

Seer: leveraging big data to navigate the complexity of performance debugging in cloud microservices, Gan et al. For the services under study, Seer has a sweet spot when trained with around 100GB of data and a 100ms sampling interval for measuring queue depths.

Databook: Turning Big Data into Knowledge with Metadata at Uber

Uber Engineering

From driver and rider locations and destinations, to restaurant orders and payment transactions, every interaction on Uber’s transportation platform is driven by data. Data powers Uber’s global marketplace, enabling more reliable and seamless user experiences across our products for riders, …

How to Optimize Elasticsearch for Better Search Performance


In today's world, data is generated in high volumes, and to make something out of it, the extracted data needs to be transformed, stored, maintained, governed, and analyzed. One of the top trending open-source data stores that covers most of these use cases is Elasticsearch.

Advancing Application Performance with NVMe Storage, Part 3


NVMe storage's strong performance, combined with the capacity and data availability benefits of shared NVMe storage over local SSD, makes it a strong solution for AI/ML infrastructures of any size. NVMe Storage Use Cases.

Advancing Application Performance With NVMe Storage, Part 2


Using local SSDs inside of the GPU node delivers fast access to data during training, but introduces challenges that impact the overall solution in terms of scalability, data access, and data protection.

Less is More: Engineering Data Warehouse Efficiency with Minimalist Design

Uber Engineering

Maintaining Uber’s large-scale data warehouse comes with an operational cost in terms of ETL functions and storage. Once identified, …

Probabilistic Data Structures for Web Analytics and Data Mining

Highly Scalable

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. The picture above depicts the fact that this data set basically occupies 40MB of memory (10 million 4-byte elements).

Scaling Uber’s Apache Hadoop Distributed File System for Growth

Uber Engineering

Three years ago, Uber Engineering adopted Hadoop as the storage (HDFS) and compute (YARN) infrastructure for our organization’s big data analysis.

Driving down the cost of Big-Data analytics - All Things Distributed

All Things Distributed

The Amazon Elastic MapReduce (EMR) team announced today the ability to seamlessly use Amazon EC2 Spot Instances with their service, significantly driving down the cost of data analytics in the cloud.

Data Mining Problems in Retail

Highly Scalable

Retail is one of the most important business domains for data science and data mining applications because of its prolific data and numerous optimization problems such as optimal prices, discounts, recommendations, and stock levels that can be solved using data analysis methods.

NoSQL Data Modeling Techniques

Highly Scalable

At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads

Uber Engineering

Cluster management, a common software infrastructure among technology companies, aggregates compute resources from a collection of physical hosts into a shared resource pool, amplifying compute power and allowing for the flexible use of data center hardware.

MapReduce Patterns, Algorithms, and Use Cases

Highly Scalable

Applications: Log Analysis, Data Querying. Applications: Log Analysis, Data Querying, ETL, Data Validation. Solution: The problem description is split into a set of specifications, and the specifications are stored as input data for Mappers. Applications: ETL, Data Analysis.

Fast Intersection of Sorted Lists Using SSE Instructions

Highly Scalable

Intersection of sorted lists is a cornerstone operation in many applications including search engines and databases because indexes are often implemented using different types of sorted structures.

5 data integration trends that will define the future of ETL in 2018

Abhishek Tiwari

ETL stands for extract, transform, load, and is generally used for data warehousing and data integration. There are several emerging data trends that will define the future of ETL in 2018. Unified data management architecture. Common in-memory data interfaces.

When Performance Matters, Think NVMe


The demand for more resource-intensive IT applications has increased significantly, whether to process transactions faster, gain real-time insight, crunch big data sets, or meet customer expectations.

Advancing Application Performance with NVMe Storage, Part 1


With big data on the rise and data algorithms advancing, the ways in which technology has been applied to real-world challenges have grown more automated and autonomous.

Python at Netflix

The Netflix TechBlog

Device interaction for the collection of health and other operational data is yet another Python application. We are heavy users of Jupyter Notebooks and nteract to analyze operational data and prototype visualization tools that help us detect capacity regressions.

Tackling the Pipeline Problem in the Architecture Research Community

ACM Sigarch

Computer architecture is an important and exciting field of computer science, which enables many other fields (e.g., big-data processing, machine learning, quantum computing, and so on). For those of us who pursued computer architecture as a career, this is well understood.

Fast key-value stores: an idea whose time has come and gone

The Morning Paper

Coupled with stateless application servers to execute business logic and a database-like system to provide persistent storage, they form a core component of popular data center service architectures. (We’ve seen similar high marshalling overheads in big data systems too.)

A case for ELT

Abhishek Tiwari

Cheap storage and on-demand compute in the cloud coupled with the emergence of new big data frameworks and tools are forcing us to rethink the whole ETL and data warehousing architecture. Then we perform frequent batch ETL from application databases to a data warehouse.

London Calling! An AWS Region is coming to the UK!

All Things Distributed

This region will provide even lower latency and strong data sovereignty to local users. Yesterday, AWS evangelist Jeff Barr wrote that AWS will be opening a region in South Korea in early 2016 that will be our 5th region in Asia Pacific.

USENIX LISA 2018: CFP Now Open

Brendan Gregg

Join us for 3 days in Nashville at LISA'18. Post by Brendan Gregg and Rikki Endsley. USENIX’s LISA conference is the premier event for topics in production system engineering.

I Used The Web For A Day On A 50 MB Budget

Smashing Magazine

Many of us are lucky enough to be on mobile plans which allow several gigabytes of data transfer per month. Failing that, we are usually able to connect to home or public WiFi networks that are on fast broadband connections and have effectively unlimited data. The Cost Of Mobile Data.

Expanding the AWS Cloud: Introducing the AWS Europe (London) Region

All Things Distributed

The council has deployed IoT Weather Stations in Schools across the City and is using the sensor information collated in a Data Lake to gain insights on whether the weather or pollution plays a part in learning outcomes.

DynamoDB for Location Data: Geospatial querying on DynamoDB datasets

All Things Distributed

Over the past few years, two important trends that have been disrupting the database industry are mobile applications and big data.

Expanding the Cloud: Introducing the AWS Asia Pacific (Mumbai) Region

All Things Distributed

Seamless ingestion of large volumes of sensed data. AdiMap uses Amazon Kinesis to process real-time streaming online ad data and job feeds, and processes them for storage in petabyte-scale Amazon Redshift. Advanced problem solving that connects big data with machine learning.

40+ Best Web Development Blogs of 2018


It’s awesome for discovering how grid systems, CSS animation, Big Data, etc. all play roles in real-world web design. It includes tutorials, links to data-visualization tools, design resources and articles that cite real-world business experiments.

Expanding the Cloud: Introducing Amazon QuickSight

All Things Distributed

We live in a world where massive volumes of data are generated from websites, connected devices and mobile apps. However, the data infrastructure to collect, store and process data is geared toward developers (e.g., Big data challenges.

Expanding the AWS Cloud – Introducing the AWS Europe (Stockholm) Region

All Things Distributed

WirelessCar is also using a host of AWS security services to ensure that the data it collects for its customers is secure, and also that the services the company offers are reliable as well.

Expanding the Cloud: Introducing the AWS Asia Pacific (Seoul) Region

All Things Distributed

We believe that with the launch of the Seoul Region, AWS will enable many more enterprise customers in Korea to reduce the cost of their IT operations and innovate faster in critical new areas such as big data analysis, Internet of Things, and more.

Spice up your Analytics: Amazon QuickSight Now Generally Available in N. Virginia, Oregon, and Ireland.

All Things Distributed

Previously, I wrote about Amazon QuickSight, a new service targeted at business users that aims to simplify the process of deriving insights from a wide variety of data sources quickly, easily, and at a low cost.

Expanding the Cloud – An AWS Region is coming to Hong Kong

All Things Distributed

The new region will give Hong Kong-based businesses, government organizations, non-profits, and global companies with customers in Hong Kong, the ability to leverage AWS technologies from data centers in Hong Kong.

Välkommen till Stockholm – An AWS Region is coming to the Nordics

All Things Distributed

The new region will give Nordic-based businesses, government organisations, non-profits, and global companies with customers in the Nordics, the ability to leverage the AWS technology infrastructure from data centers in Sweden.

Expanding the Cloud – introducing the Asia Pacific (Sydney) Region.

All Things Distributed

Werner Vogels' weblog on building scalable and robust distributed systems. Expanding the Cloud – introducing the Asia Pacific (Sydney) Region. By Werner Vogels on 12 November 2012 05:00 AM.

Should You Use ClickHouse as a Main Operational Database?


how many messages were sent in a given time period and how much they cost) and typical key/value queries like “return 1 message by the message ID”. Using a columnar analytical database can be a big challenge here. In my case, I’m using this data as a simulation of text messages, and will show how we can use ClickHouse as a backend for an API. Loading the JSON data to ClickHouse. Updating / deleting data in ClickHouse.

Allez, rendez-vous à Paris – An AWS Region is coming to France!

All Things Distributed

The new European region, coupled with the existing AWS Regions in Dublin and Frankfurt, and a future one in London, will provide customers with quick, low-latency access to websites, mobile applications, games, SaaS applications, Big Data analysis, Internet of Things applications, and more.

Rethinking the 'production' of data

All Things Distributed

How companies can use ideas from mass production to create business with data. In this way, designers are part of an ecosystem in which the functionalities of simulations, data and people come together, enabling them to develop better products faster. Value creation through data.

Expanding the AWS Cloud: Introducing the AWS Canada (Central) Region

All Things Distributed

AWS data centers in Canada will draw from a regional electricity grid that is 99 percent powered by hydropower. Having the ability to provide these services locally enables Box to better serve Canadian enterprises looking for cloud solutions while ensuring their data is stored inside Canada.
