Optimizing data warehouse storage

The Netflix TechBlog

At this scale, we can gain a significant amount of performance and cost benefits by optimizing the storage layout (records, objects, partitions) as the data lands into our warehouse. Increase in storage space.

Why Compute and Storage Should Be Decoupled for Log Management at Scale

DZone

On a small scale, this isn’t problematic, but when dealing with large-scale deployments, organizations end up using lots of computing, storage, and human resources just to manage their indexes. Most log management solutions store log data in a database and enable search by storing an index of the data. As the database grows in size, so does the index management cost.

Storage handling improvements increase retention of transaction data for Dynatrace Managed

Dynatrace

Using existing storage resources optimally is key to being able to capture the right data over time. Increased storage space availability. The compression of transaction data older than three days can free up to 50% more storage space in your Dynatrace Managed Cluster.

The APPEND_ONLY_STORAGE_INSERT_POINT Latch

SQL Performance

Continuing my series of articles on latches, this time I’m going to discuss the APPEND_ONLY_STORAGE_INSERT_POINT latch and show how it can be a major bottleneck for heavy update workloads where either form of snapshot isolation is being used. What Is the APPEND_ONLY_STORAGE_INSERT_POINT Latch?

Improving Spark Memory Resource With Off-Heap In-Memory Storage

DZone

Improve your Spark memory. In the previous tutorial, we demonstrated how to get started with Spark and Alluxio. To share more thoughts and experiments on how Alluxio enhances Spark workloads, this article focuses on how Alluxio helps to optimize the memory utilization of Spark applications.

Advancing Application Performance with NVMe Storage, Part 3

DZone

NVMe Storage Use Cases. NVMe storage's strong performance, combined with the capacity and data availability benefits of shared NVMe storage over local SSD, makes it a strong solution for AI/ML infrastructures of any size. There are several AI/ML focused use cases to highlight.

Checksums in Storage Systems and Why the Enterprise Should Care

DZone

It’s really scary knowing that such corruptions are happening in the memory of our computers and servers – that is before they even reach the network and storage portions of the stack. That data must then be safely transported over a network to the storage system where it is written to disk. Well, if you’re using one of the storage protocols that lack end-to-end checksums (e.g.

Building an elastic query engine on disaggregated storage

The Morning Paper

Building an elastic query engine on disaggregated storage, Vuppalapati, NSDI’20. Snowflake is a data warehouse designed to overcome these limitations, and the fundamental mechanism by which it achieves this is the decoupling (disaggregation) of compute and storage.

An Efficient Object Storage for JUnit Tests

DZone

To resolve the problem, it was suggested that we find a more suitable data store. One day I faced a problem with downloading a relatively large binary data file from PostgreSQL. There are several limitations on storing and fetching such data (all restrictions can be found in the official documentation). For internal reasons, a well-known Amazon S3 bucket was chosen for this purpose. The choice affected the project's unit test base.

A Generalized workload generator for storage IO

n0derunner

With help from the Nutanix X-Ray team I have created an IO “benchmark” which simulates a “General Server Virtualization” workload.

Azure Storage Queues learns some new tricks

Particular Software

Azure Storage Queues is a basic yet robust queueing service available on the Azure platform. For example, it lacks concepts like topics and subscriptions to build publisher and subscriber topologies—these are capabilities NServiceBus adds on top of Azure Storage Queues.

Using Diskspd to test SQL Server Storage Subsystems

SQL Shack

In this article, we will learn how to test our storage subsystems performance using Diskspd. The storage subsystem is one of the key performance factors for SQL Server because SQL Server storage engine stores database objects, tables, and indexes on the physical files.

Jellyfish: Cost-Effective Data Tiering for Uber’s Largest Storage System

Uber Engineering

Uber deploys a few storage technologies to store business data based on their application model. Problem.

Partitioned Hive Table Across Storage Systems Using Alluxio

DZone

However, Hive cannot access a single table directly using a single query with the data of this Hive table across different mediums of storage and different clusters. This becomes a need when the data volume grows too large to fit a single medium of storage or cluster, and also when the users need to take into account the following considerations: Storage cost, where some partitions are less important than others and can be stored on cheaper storage tiers.

Microsoft diskspd. Part 2, Accurately testing storage devices.

n0derunner

Often, when doing storage performance testing, we are interested in stressing the underlying storage devices, not the OS filesystem. Part 2, Accurately testing storage devices.

File systems unfit as distributed storage backends: lessons from ten years of Ceph evolution

The Morning Paper

File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution Aghayev et al., In this case, the assumption that a distributed storage backend should clearly be layered on top of a local file system. Breaking that assumption allowed Ceph to introduce a new storage backend called BlueStore with much better performance and predictability, and the ability to support the changing storage hardware landscape. Supporting new storage hardware.

Narrowing the gap between serverless and its state with storage functions

The Morning Paper

Narrowing the gap between serverless and its state with storage functions, Zhang et al., Shredder is "a low-latency multi-tenant cloud store that allows small units of computation to be performed directly within storage nodes."

The AWS Storage Gateway - All Things Distributed

All Things Distributed

Expanding the Cloud - The AWS Storage Gateway. Today Amazon Web Services has launched the AWS Storage Gateway, making the power of secure and reliable cloud storage accessible from customers’ on-premises applications. With the launch of the AWS Storage Gateway our customers can now integrate their on-premises IT environment with AWS’s storage infrastructure. The AWS Storage Gateway is a service connecting an on-premises software appliance with cloud-based storage.

View from Nutanix storage during Postgres DB benchmark

n0derunner

A quick look at how the workload is seen from the Nutanix CVM. In this example from prior post. The Linux VM running Postgres has two virtual disks – one taking transaction log writes. The other is doing reads and writes from the main datafiles.

Advancing Application Performance with NVMe Storage, Part 1

DZone

With big data on the rise and data algorithms advancing, the ways in which technology has been applied to real-world challenges have grown more automated and autonomous. This has given rise to a completely new set of computing workloads for Machine Learning which drives Artificial Intelligence applications. AI/ML can be applied across a broad spectrum of applications and industries.

Intro to Redis Cluster Sharding – Advantages, Limitations, Deploying & Client Connections

High Scalability

Redis Cluster is the native sharding implementation available within Redis that allows you to automatically distribute your data across multiple nodes without having to rely on external tools and utilities. At ScaleGrid, we recently added support for Redis Clusters on our platform through our fully managed Redis hosting plans.

Expanding the Cloud - The AWS Storage Gateway

All Things Distributed

Today Amazon Web Services has launched the AWS Storage Gateway, making the power of secure and reliable cloud storage accessible from customers’ on-premises applications.

Configure failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups

SQL Shack

This article explores the configuration of Windows failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups.

Expanding the Cloud - Amazon S3 Reduced Redundancy Storage.

All Things Distributed

Expanding the Cloud - Amazon S3 Reduced Redundancy Storage. Today a new storage option for Amazon S3 has been launched: Amazon S3 Reduced Redundancy Storage (RRS). This new storage option enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy. This has been an option that customers have been asking us about for some time so we are really pleased to be able to offer this alternative storage option now.

Byte Down: Making Netflix’s Data Infrastructure Cost-Effective

The Netflix TechBlog

By Torio Risianto, Bhargavi Reddy, Tanvi Sahni, Andrew Park

What is Greenplum Database? Intro to the Big Data Database

Scalegrid

High performance, query optimization, open source and polymorphic data storage are the major Greenplum advantages. Polymorphic Data Storage. Greenplum Database is a massively parallel processing (MPP) SQL database that is built and based on PostgreSQL.

Using JSONB in PostgreSQL: How to Effectively Store & Index JSON Data in PostgreSQL

Scalegrid

JSONB storage has some drawbacks vs. traditional columns: PostgreSQL does not store column statistics for JSONB columns. JSONB storage results in a larger storage footprint. JSONB storage does not deduplicate the key names in the JSON.

2019 PostgreSQL Trends Report: Private vs. Public Cloud, Migrations, Database Combinations & Top Reasons Used

High Scalability

PostgreSQL is an open source object-relational database system that has soared in popularity over the past 30 years from its active, loyal, and growing community. For the 2nd year in a row, PostgreSQL has kept the title of #1 fastest growing database in the world according to the DBMS of the Year report by the experts at DB-Engines. So what makes PostgreSQL so special, and how is it being used today?

Persistent Storage for Amazon EC2

All Things Distributed

I would like to introduce to you the newest feature of Amazon EC2: Persistent local storage.

Expanding the Cloud - Managing Cold Storage with Amazon Glacier

All Things Distributed

Managing Cold Storage with Amazon Glacier. With the introduction of Amazon Glacier, IT organizations now have a solution that removes the headaches of digital archiving and provides extremely low cost storage. Building and managing archive storage that needs to remain operational for decades if not centuries is a major challenge for most organizations. A Complete Storage Solution. storage that is directly accessible.

Back-to-Basics Weekend Reading - A Decomposition Storage Model

All Things Distributed

Not everybody agreed that the "N-ary Storage Model" (NSM) was the best approach for all workloads but it stayed dominant until hardware constraints, especially on caches, forced the community to revisit some of the alternatives. Combined with the rise of data warehouse workloads, where there is often significant redundancy in the values stored in columns, and database models based on column oriented storage took off. A Decomposition Storage Model, George P.

AWS Import/Export launches support for Legacy Storage Systems

All Things Distributed

Today Amazon Web Services takes another big step in making it easier to migrate legacy storage systems to the cloud through AWS Import/Export support for ingesting Punch Cards.

Netflix Drive

The Netflix TechBlog

Netflix Drive relies on a data store that will be the persistent storage layer for assets, and a metadata store which will provide a relevant mapping from the file system hierarchy to the data store entities.

Driving Storage Costs Down for AWS Customers - All Things.

All Things Distributed

Driving Storage Costs Down for AWS Customers. As we showed last week one of the services that is growing rapidly is the Amazon Simple Storage Service (S3). AWS today announced a substantial price drop per February 1, 2012 for Amazon S3 standard storage to help customers drive their storage cost down. Other storage tiers may see even greater cost savings. All Things Distributed. Werner Vogels weblog on building scalable and robust distributed systems.

MezzFS - Mounting object storage in Netflix’s media processing platform

The Netflix TechBlog

Mounting object storage in Netflix’s media processing platform By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team) MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. That file is stored in our object storage service, which splits and encrypts the file into separate chunks, storing the chunks in Amazon S3. Our object storage service splits objects into many parts and stores them in S3.
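The snippet above says objects are split into many parts and exposed as local files. The sketch below shows one way a read at an arbitrary file offset could be mapped onto such parts. It assumes fixed-size parts; the function name `parts_for_range` and the part size are hypothetical and are not taken from MezzFS.

```python
from typing import List, Tuple


def parts_for_range(offset: int, length: int, part_size: int) -> List[Tuple[int, int, int]]:
    """Map a byte range of the whole object onto fixed-size parts.

    Returns one (part_index, start_in_part, end_in_part) tuple per part
    touched, where end_in_part is exclusive.
    Assumption: every part holds exactly part_size bytes (hypothetical layout).
    """
    if length <= 0:
        return []
    first = offset // part_size
    last = (offset + length - 1) // part_size
    touched = []
    for index in range(first, last + 1):
        part_start = index * part_size
        # Clamp the requested range to the boundaries of this part.
        start = max(offset, part_start) - part_start
        end = min(offset + length, part_start + part_size) - part_start
        touched.append((index, start, end))
    return touched


# A 15-byte read at offset 10 with 8-byte parts touches parts 1, 2 and 3.
example = parts_for_range(10, 15, 8)
```

Only the parts listed in the result would need to be fetched to serve the read, which is why a layout like this suits random access over large files.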

Back-to-Basics Weekend Reading - RAID: High-Performance, Reliable Secondary Storage

All Things Distributed

RAID: High-Performance, Reliable Secondary Storage Peter Chen, Edward Lee, Garth Gibson, Randy Katz and David Patterson, ACM Computing Surveys, Vol 26, No. Disk arrays, which organize multiple, independent disks into a large, high-performance logical disk, were a natural solution to dealing with constraints on performance and reliability of single disk drives. The term "RAID" was invented by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987.

Azure Storage Persistence now faster in NServiceBus 6

Particular Software

If you're using Azure Storage Persistence and haven't upgraded to NServiceBus 6 yet, get ready for a tremendous performance boost for your application when you do, especially if you make use of sagas. In the previous version of Azure Storage Persistence, looking up a saga by a correlation property was not as fast as looking it up by SagaId. Azure Table Storage, where saga data is stored, is limited to indexing on two columns: the Partition Key and the Row Key.
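To illustrate why a lookup by a non-key property is slower when only two columns are indexed, here is a minimal in-memory sketch. `TinyTable` is a hypothetical stand-in and not the Azure SDK; using the saga ID for both keys and `OrderId` as the correlation property are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


class TinyTable:
    """Hypothetical table indexed only on (partition_key, row_key)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], dict] = {}

    def insert(self, partition_key: str, row_key: str, entity: dict) -> None:
        self._rows[(partition_key, row_key)] = entity

    def get(self, partition_key: str, row_key: str) -> Optional[dict]:
        # Key lookup: served directly by the index.
        return self._rows.get((partition_key, row_key))

    def scan(self, prop: str, value: str) -> Tuple[Optional[dict], int]:
        # Any other property has no index, so rows are checked one by one.
        examined = 0
        for entity in self._rows.values():
            examined += 1
            if entity.get(prop) == value:
                return entity, examined
        return None, examined


table = TinyTable()
for i in range(1000):
    saga_id = f"saga-{i}"
    table.insert(saga_id, saga_id, {"SagaId": saga_id, "OrderId": f"order-{i}"})

by_id = table.get("saga-999", "saga-999")                    # one indexed lookup
by_order, rows_examined = table.scan("OrderId", "order-999")  # walks the table
```

Both calls find the same saga, but the second has to examine rows until it reaches a match, and that cost grows with the number of stored sagas.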

Follower Clusters – 3 Major Use Cases for Syncing SQL & NoSQL Deployments

Scalegrid

Note: ScaleGrid implements follower clusters using storage snapshots. And since the entire import is performed using storage snapshots, rather than a logical dump, the process is nearly instantaneous. Follower clusters are a ScaleGrid feature that allows you to keep two independent database systems (of the same type) in sync. Unlike cloning or replication, this allows you to maintain an active, point-in-time copy of your production data.

Making it Easier to Manage a Production PostgreSQL Database

Scalegrid

Note: The community has already started work on the zheap storage engine that overcomes this limitation. The past several years have seen increasing adoption for PostgreSQL. PostgreSQL is an amazing relational database. Feature-wise, it is up there with the best, if not the best. There are many things I love about it – PL/pgSQL, smart defaults, replication (that actually works out of the box), and an active and vibrant open source community.

S3 - The Amazon Simple Storage Service

All Things Distributed

Go check out the Amazon Simple Storage Service (S3). This is Amazon.com’s Internet scale storage service available through a web service interface. Providing scalable, reliable, secure and fast storage is something that Amazon developers have already enjoyed for some time and now it is available to the developer community outside of Amazon. S3 is another example of the Amazon Web Services mission: to expose all of the atomic-level pieces of the Amazon.com platform.

MySQL High Availability Framework Explained – Part III: Failover Scenarios

High Scalability

In this three-part blog series, we introduced a High Availability (HA) Framework for MySQL hosting in Part I, and discussed the details of MySQL semisynchronous replication in Part II. Now in Part III, we review how the framework handles some of the important MySQL failure scenarios and recovers to ensure high availability. MySQL Failover Scenarios. Scenario 1 – Master MySQL Goes Down. The Corosync and Pacemaker framework detects that the master MySQL is no longer available.

Identifying Optane drives in Linux

n0derunner

The easiest way to identify NVME drives backed by either NAND flash or Optane is to use $ lspci -v. The output will look like this for NVME/NAND: 00:0d.0

Uber Infrastructure in 2019: Improving Reliability, Driving Customer Satisfaction

Uber Engineering

Architecture General Engineering CPU Infrastructure Observability Productivity Reliability Search Infrastructure Storage Uber Eats Velocity