It’s time to upgrade the PTC System Monitor (PSM)!

Dynatrace

As a PSM system administrator, you’ve relied on AppMon as a preconfigured APM tool for detecting, diagnosing, and repairing problems that impact the operational health of your Windchill application suite.

Handling Failure in Long-Running Processes

DZone

In the previous posts in this series, we've seen some examples of long-running processes, how to model them, and where to store the state. But building distributed systems is hard. And if we are aware of the fallacies of distributed systems, then we know that things fail all the time. So how can we ensure that our long-running process doesn't get into an inconsistent state if something fails along the way?

In-Stream Big Data Processing

Highly Scalable

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream processing are an immediate need in many practical applications. This system has been designed to supplement and succeed the existing Hadoop-based system, whose data processing latency and maintenance costs were too high.

How TripleLift Built an Adtech Data Pipeline Processing Billions of Events Per Day

High Scalability

What is the name of your system and where can we find out more about it? The system is the data pipeline at TripleLift. Over the last 3 years, the TripleLift data pipeline scaled from processing millions of events per day to processing billions. This processing can be summed up as the continuous aggregation and delivery of reporting data to users in a cost-efficient manner. Why did you decide to build this system?

Understanding, detecting and localizing partial failures in large system software

The Morning Paper

Understanding, detecting and localizing partial failures in large system software, Lou et al. Partial failures (gray failures) occur when some but not all of the functionalities of a system are broken. In contrast, a process suffering a total failure can be quickly identified, restarted, or repaired by existing mechanisms, thus limiting the failure impact. 71% of failures are triggered by a specific environment condition, input, or faults in other processes.

Extending relational query processing with ML inference

The Morning Paper

Extending relational query processing with ML inference, Karanasos et al., CIDR’20. Raven is the system that Microsoft built to explore this question, and answer it with a resounding yes. The vision is that data scientists use their favourite ML framework to construct a model, which together with any data pre-processing steps and library dependencies forms a model pipeline.

Deployment challenges with large enterprise systems

Dynatrace

For small deployments, this isn’t a problem. However, when scaling up to hundreds or even thousands of systems, things can become complicated. Even when all the systems are mapped correctly by Dynatrace, identifying these systems is a real challenge. Why Process groups are not perfect. Dynatrace automatically detects identical processes and puts them under the same process group. Dynatrace will automatically group both systems. S _ for the system.

Making Windows Slower Part 2: Process Creation

Random ASCII

Windows has long had a reputation for slow file operations and slow process creation. This week’s blog post covers a technique you can use to make process creation on Windows grow slower over time (with no limit), in a way that will be untraceable for most users! Unplanned serialization that led to full-system UI hangs: 24-core CPU and I can’t move my mouse. Process handle leak in one of Microsoft’s add-ons to Windows: Zombie Processes are Eating your Memory.

Orbital edge computing: nano satellite constellations as a new class of computer system

The Morning Paper

Orbital edge computing: nanosatellite constellations as a new class of computer system, Denby & Lucia, ASPLOS’20. Only space system architects don’t call it request-response, they call it a ‘bent-pipe architecture’. Nanosatellite systems have a GSD of around 3.0m/px.

Wireless attacks on aircraft instrument landing systems

The Morning Paper

Wireless attacks on aircraft instrument landing systems, Sathaye et al. Today’s paper is a good reminder of just how important it is becoming to consider cyber threat models in what are primarily physical systems, especially if you happen to be flying on an aeroplane – which I am right now as I write this! The first fully operational Instrument Landing System (ILS) for planes was deployed in 1932. USENIX Security Symposium 2019.

Zombie Processes are Eating your Memory

Random ASCII

Zombies probably won’t consume 32 GB of your memory like they did to me, but zombie processes do exist, and I can help you find them and make sure that developers fix them. GB of page/non-paged pool, few processes running, and no process using anywhere near enough to explain where the memory had gone: My machine has 96 GB of RAM – lucky me – and when I don’t have any programs running I think it’s reasonable to hope that I’d have at least half of it available.

Monitoring Processes with Percona Monitoring and Management

Percona

A few months ago I wrote a blog post on How to Capture Per Process Metrics in PMM. Since that time, Nick Cabatoff has made a lot of improvements to Process Exporter and I’ve improved the Grafana Dashboard to match. However, if you have a substantial part of a process swapped out because of memory pressure, you would not see it. Especially for processes written in Go, the difference can be extreme. Processes by Disk IO.

PostgreSQL Connection Pooling: Part 1 – Pros & Cons

Scalegrid

In that environment, the first PostgreSQL developers decided that forking a process for each connection to the database was the safest choice. It is difficult to fault their argument, as it’s absolutely true that each client having its own process prevents a poorly behaving client from crashing the entire database. On modern Linux systems, the difference in overhead between forking a process and creating a thread is much smaller than it used to be.

Unlocking Enterprise systems using voice

All Things Distributed

The challenge was that technology wasn't capable of processing real human conversation. The interfaces to our digital system have been dictated by the capabilities of our computer systems—keyboards, mice, graphical interfaces, remotes, and touch screens. As a result, they fail to deliver a truly seamless and customer-centric experience that integrates our digital systems into our analog lives. People interact with many different applications and systems at work.

Build automated self-healing systems with xMatters and Dynatrace (Part 1 of 3)

Dynatrace

In this three-part blog series, we’ll share the following three common problem scenarios that you can easily solve by building an automated self-healing system with Dynatrace and xMatters Flow Designer: Process crash. As a first use case, let’s explore how your DevOps teams can prevent a process crash from taking down services across an organization—in five easy steps. Use case #1: Prevent a process crash from taking down services across an organization.

Towards federated learning at scale: system design

The Morning Paper

Towards federated learning at scale: system design, Bonawitz et al. This is a high level paper describing Google’s production system for federated learning. The FL system contains a number of privacy-enhancing building blocks, but the privacy guarantees of any end-to-end system will always depend on how they are used. At the core of the system is a federated learning approach called Federated Averaging, with an optional extension for Secure Aggregation.

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems

The Morning Paper

An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems, Gan et al. Systems built with lots of microservices have different operational characteristics to those built from a small number of monoliths, and we’d like to study and better understand those differences. In this paper we explore the implications microservices have across the cloud system stack. Operating system and network implications.

The Digital Twin: A Foundational Concept for Stateful Stream Processing

ScaleOut Software

Traditional stream-processing and complex event processing systems, such as Apache Storm and Software AG’s Apama, have focused on extracting interesting patterns from incoming data with stateless applications. This model offers key insights into how state data can be organized within stream-processing applications for maximum effectiveness.

Teaching rigorous distributed systems with efficient model checking

The Morning Paper

Teaching rigorous distributed systems with efficient model checking, Michael et al. It describes the labs environment, DSLabs, developed at the University of Washington to accompany a course in distributed systems. During the ten-week course, students implement four different assignments: an exactly-once RPC protocol; a primary-backup system; Paxos; and a scalable, transactional key-value storage system. A visual debugger/system explorer.

Build automated self-healing systems with xMatters and Dynatrace (Part 2 of 3)

Dynatrace

In Part 1 we explored how DevOps teams can prevent a process crash from taking down services across an organization in five easy steps. In this alert, xMatters includes all the important incident information from Dynatrace, so there’s no need for you to visit additional system dashboards. Based on this contextual data, resources are prompted with their pre-configured response options, each of which kicks off a workflow across systems (based on the severity of the issue).

Delta: A Data Synchronization and Enrichment Platform

The Netflix TechBlog

Another thread or process constantly polls events from the log table and writes them to one or multiple datastores, optionally removing events from the log table after they have been acknowledged by all datastores. Another issue exists for the capture of schema changes, where some systems, like MySQL, don’t support transactional schema changes [1][2]. By their nature, they can only rely on the lowest common denominator of participating systems. Online Event Processing.
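The log-table polling described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the `event_log` table, the `drain_log` function, and the list-backed "datastores" are all hypothetical names chosen here, and SQLite stands in for the source database. An event is deleted only after every sink acknowledges it, so a failed sink leads to redelivery (at-least-once semantics).

```python
import sqlite3


def drain_log(conn, sinks):
    """Deliver pending events to every sink; delete each event once all sinks ack.

    `sinks` is a list of callables taking a payload and returning True on success.
    Stops at the first event that is not fully acknowledged, to preserve ordering.
    """
    rows = conn.execute("SELECT id, payload FROM event_log ORDER BY id").fetchall()
    delivered = 0
    for event_id, payload in rows:
        acks = [sink(payload) for sink in sinks]
        if not all(acks):
            break  # leave this event (and later ones) in the log for a retry
        conn.execute("DELETE FROM event_log WHERE id = ?", (event_id,))
        delivered += 1
    conn.commit()
    return delivered


def make_sink(store):
    """Build a hypothetical datastore writer backed by a plain list."""
    def sink(payload):
        store.append(payload)
        return True  # acknowledge the write
    return sink


# Hypothetical setup: the application writes events into the log table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO event_log (payload) VALUES (?)",
    [("movie-1 updated",), ("movie-2 updated",)],
)
conn.commit()

search_index, cache = [], []
delivered = drain_log(conn, [make_sink(search_index), make_sink(cache)])
remaining = conn.execute("SELECT COUNT(*) FROM event_log").fetchone()[0]
```

In a real deployment the poller would run in a loop or on a schedule, and sinks would need to tolerate duplicates, since an event acknowledged by only some sinks is sent again on the next pass.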

MezzFS - Mounting object storage in Netflix’s media processing platform

The Netflix TechBlog

Mounting object storage in Netflix’s media processing platform. By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team). MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. It’s used extensively in our media processing platform, which includes services like Archer and runs features like video encoding and title image generation on tens of thousands of Amazon EC2 instances.

Migrating a privacy-safe information extraction system to a Software 2.0 design

The Morning Paper

Migrating a privacy-safe information extraction system to a Software 2.0 design, Sheng, CIDR’20. This is a comparatively short (7 pages) but very interesting paper detailing the migration of a software system to a ‘Software 2.0’ design. Replacing these rules with a machine learned component dramatically simplified the code base (45 Kloc deleted) and set the system back onto a growth and improvement trajectory.

Three Other Models of Computer System Performance: Part 1

ACM Sigarch

Computer systems, from the Internet-of-Things devices to datacenters, are complex and optimizing them can enhance capability and save money. Existing systems can be studied with measurement, while prospective systems are most often studied by extrapolating from measurements of prior systems or via simulation software that mimics target system function and provides performance metrics. It provides a quick calculation for what a system may or can’t do.

File systems unfit as distributed storage backends: lessons from ten years of Ceph evolution

The Morning Paper

File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution, Aghayev et al. In this case, the assumption that a distributed storage backend should clearly be layered on top of a local file system. Ceph is a widely-used, open-source distributed file system that followed this convention [of building on top of a local file system] for a decade. Ten years of building on local file systems.

Integrate Atlassian Jira and Micro Focus ALM: Defect reporting and resolution is a business process

Tasktop

By integrating Atlassian Jira and Micro Focus ALM, you can automate the flow of defects between the two tools to eradicate manual overhead and accelerate the speed and accuracy of the defect reporting and resolution process. With software touching nearly every facet of the omnichannel customer experience, the process is considered a critical business issue. How long is the process taking?

Who monitors the monitoring systems?

Adrian Cockcroft

In reality, in any non-trivial installation, there are multiple tools collecting, storing and displaying overlapping sets of metrics from many types of systems and different levels of abstraction. This is true for CPU, network, memory and storage, and also for bare metal, virtual machines, containers, processes, functions and threads. What if your monitoring systems fail? How do you even know when a monitoring system has failed? “Quis custodiet ipsos custodes?” - Juvenal

Three Other Models of Computer System Performance: Part 2

ACM Sigarch

The M/M/1 queue also assumes that the arrival rate is not affected by the unbounded number of tasks in the queue (called an “open system”). How long can the computer take to process (service) each packet (not counting waiting), denoted S? With two blog posts, we argue for more use of simple models beyond Amdahl’s Law. Part 1 previously discussed Bottleneck Analysis and Little’s Law, while this post (Part 2) presents the M/M/1 Queue.
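As a rough illustration of the M/M/1 model mentioned in this excerpt, the standard textbook steady-state results can be computed directly. The function and parameter names below are chosen here for illustration and do not come from the post; the formulas are the usual M/M/1 ones (utilization = arrival rate × S, mean response time = S / (1 − utilization)), valid only while utilization stays below 1.

```python
def mm1_metrics(arrival_rate: float, service_time: float) -> dict:
    """Return standard M/M/1 steady-state metrics.

    arrival_rate: mean task arrivals per second (lambda).
    service_time: mean time to service one task, not counting waiting (S).
    """
    if arrival_rate < 0 or service_time <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_time must be > 0")
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    response_time = service_time / (1.0 - utilization)  # waiting + service
    waiting_time = response_time - service_time         # time spent queued
    tasks_in_system = arrival_rate * response_time      # Little's Law
    return {
        "utilization": utilization,
        "response_time": response_time,
        "waiting_time": waiting_time,
        "tasks_in_system": tasks_in_system,
    }


# Example: 50 packets/second arriving, 10 ms of service per packet.
metrics = mm1_metrics(arrival_rate=50.0, service_time=0.010)
```

At 50% utilization the mean response time is twice the service time; pushing the same system toward 90% utilization makes it ten times the service time, which is the kind of quick estimate such simple models are meant to provide.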

Follower Clusters – 3 Major Use Cases for Syncing SQL & NoSQL Deployments

Scalegrid

Follower clusters are a ScaleGrid feature that allows you to keep two independent database systems (of the same type) in sync. Here are a few critical ways in which it differs from replication: You can control how frequently the destination system syncs from source – once a week, once a day, or even less frequently. This helps reduce the load on the source system. Since they are two independent systems, you have much more flexibility over the data that is synced.

EuroBSDcon: System Performance Analysis Methodologies

Brendan Gregg

In the past I've shared similar methodologies applied to other operating systems, and finished porting them to BSD for this talk. For my first trip to Paris I gave the closing keynote at [EuroBSDcon 2017] on performance methodologies, using FreeBSD 11.1 as an analysis target. It was a few days of work, which is really not bad.

What is Greenplum Database? Intro to the Big Data Database

Scalegrid

Greenplum Database is a massively parallel processing (MPP) SQL database that is built and based on PostgreSQL. When handling large amounts of complex data, or big data, chances are that your main machine might start getting crushed by all of the data it has to process in order to produce your analytics results. To fill this need for faster processing and enable quicker results, many organizations consider adopting an MPP database.

SQL Server on Linux: Why Do I Have Two SQL Server Processes

SQL Server According to Bob

When starting SQL Server on Linux why are there two (2) sqlservr processes? The simple answer is the first entry (85829) is not what you are used to on a Windows system as sqlservr.exe and does not listen for TDS traffic or open database files. The parent process handles basic configuration activities and then forks the child process. The parent process (WATCHDOG) becomes a lightweight monitor and the child process runs the sqlservr.exe process.

The challenges of monitoring a distributed system

Particular Software

I remember the first time I deployed a system into production. Once the system was deployed, I wanted to see if everything was working properly, so I ran through a simple checklist: Is my database up? (Yes/No) If the answers to these questions were all yes, then the system was working correctly. If the answer to any of those questions was no, then the system wasn't working correctly and I needed to take action to correct it. How much CPU is this process consuming?

Scaling symbolic evaluation for automated verification of systems code with Serval

The Morning Paper

Scaling symbolic evaluation for automated verification of systems code with Serval, Nelson et al. Serval is a framework for developing automated verifiers of systems software. This is the goal of push-button verification: you still have to produce a rigorous specification of how the system is intended to behave, but instead of then laboriously proving that it does so, you just ‘push a button’ and let the machine spit out a proof (or contradiction) for you.

Migrating to Daimler/Mercedes-Benz AG’s New E/E System

Tasktop

By October 2020, Daimler/Mercedes-Benz AG will replace its supplier defect and test management tool for its internal E/E system. If you are a supplier working with the automotive giant and connected to its existing defect management and ALM systems, you will need to migrate or change projects to the new system as soon as possible. Three Significant Changes Between the Two Systems. Backend system is completely different. Processes for ticket exchange.

Third-order effects and software systems

Particular Software

At the height of the Cold War, the United States passed the Federal Aid Highway Act of 1956, giving birth to the Interstate Highway System. They can be observed in our software systems as well. Maybe we add some kind of software library to our system, with the expectation of being able to build features faster or have cleaner code. We might be pleasantly surprised when it also makes the system easier to monitor in production.

SLOG: serializable, low-latency, geo-replicated transactions

The Morning Paper

SLOG is another research system motivated by the needs of the application developer (aka, user!). Building correct applications is much easier when the system provides strict serializability guarantees. Strict serializability reduces application code complexity and bugs, since it behaves like a system that is running on a single machine processing transactions sequentially. Key to good intra-region performance is the deterministic nature of the processing.

Back-to-Basics Weekend Reading - Join Processing in Relational.

All Things Distributed

Back-to-Basics Weekend Reading - Join Processing in Relational Databases. In 1992 Priti Mishra and Margaret Eich conducted a survey of what had been achieved until then in join processing and described in detail the algorithms, the implementation complexity and the performance. Join Processing in Relational Databases, Priti Mishra and Margaret H.

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

The Netflix TechBlog

Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video.

Best Practice for Creating Indexes on your MySQL Tables

Scalegrid

During this time, you are also likely to experience degraded query performance, as your system resources are busy with index-creation work as well. In this blog post, we discuss an approach to optimize the MySQL index creation process in such a way that your regular workload is not impacted. By having appropriate indexes on your MySQL tables, you can greatly enhance the performance of SELECT queries.

Exploring MySQL Binlog Server – Ripple

Scalegrid

This way, the load on the master is tremendously reduced, and at the same time, the binlog server serves the binlogs more efficiently to slaves since it does not have to do any other database server processing. How to optimize the MySQL index creation process in such a way that your regular workload is not impacted. The availability of a system is the percentage of time its services are up during a period of time.
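The availability definition at the end of this excerpt reduces to simple arithmetic. The sketch below is only an illustration of that definition; the function name, parameter names, and the 30-day example period are assumptions made here and are not taken from the post.

```python
def availability_percent(downtime_seconds: float, period_seconds: float) -> float:
    """Percentage of the period during which the service was up."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime_seconds must lie within the period")
    uptime_seconds = period_seconds - downtime_seconds
    return 100.0 * uptime_seconds / period_seconds


# Example: 43.2 minutes of downtime over a 30-day period.
THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds
monthly = availability_percent(downtime_seconds=43.2 * 60, period_seconds=THIRTY_DAYS)
```

With these inputs the result is about 99.9 percent, which shows how small the downtime budget is for a "three nines" target over a month.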

Virtual consensus in Delos

The Morning Paper

While ultimately this new system should be able to take advantage of the latest advances in consensus for improved performance, that’s not realistic given a 6-9 month in-production target. That’s not just theoretical, Facebook actually did this in production while Delos was processing over 1.8