Proposal for a Realtime Carbon Footprint Standard

Apr 5, 2023

by Adrian Cockcroft

Three windows lined up to let the light in through a door in the Madrassa, Marrakech — Picture by Adrian

The existing approaches to carbon measurement provide delayed data for carbon accounting that is more akin to billing information in the way it is structured and queried. This proposal seeks to define a standard for real-time carbon and energy data as time-series data that would be accessed alongside and synchronized with the existing throughput, utilization and latency metrics that are provided for the components and applications in computing environments.

This would open up opportunities to leverage existing performance monitoring tools and dashboards, to extend optimization tools like autoscalers and schedulers, and to build new kinds of carbon analysis and optimization tools and services.

The challenge is that accurate data isn’t available immediately, and cloud providers currently only provide monthly carbon data, with several months’ lag. This isn’t useful for workload optimization, so proxy metrics like utilization and cost are being substituted. The open source Cloud Carbon Footprint Tool takes billing data (which is available monthly or hourly) as its input, along with estimated carbon intensity factors, and can produce workload carbon estimates, but not in real time.

Carbon measurements are by their nature imprecise, based on estimates and models that include variation. The usual technique for modeling fuzzy systems is Monte Carlo analysis, which computes with distributions rather than precise values. This standard proposes to represent an imprecise metric in a way that can be obtained from or fed into a Monte Carlo model, by reporting three values: a most likely value, and a 95% confidence interval above and below it. In effect, there is one chance in 20 that the true value lies outside the interval, and the two limits give separate guidance on how bad it could be and how good it could be. In cases where a lot is known and there isn’t much variation, the interval will be narrower than in cases where missing information is being estimated, or where highly variable power sources like wind and solar are a significant part of the mix.
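
As a rough illustration of how a consumer of such a feed could turn the three reported values back into a distribution for Monte Carlo analysis, here is a Python sketch. The two-half-normal shape and the treatment of the bounds as 2.5th and 97.5th percentiles are assumptions for illustration, not part of the proposal; a real standard would need to specify the distribution.

import numpy as np

def sample_metric(most_likely, lower, upper, n=100_000, rng=None):
    # Approximate each side of the distribution with a half-normal whose
    # 97.5th percentile lands on the reported bound (illustrative assumption).
    rng = rng or np.random.default_rng()
    sigma_lo = (most_likely - lower) / 1.96
    sigma_hi = (upper - most_likely) / 1.96
    low_side = rng.random(n) < 0.5
    return np.where(low_side,
                    most_likely - np.abs(rng.normal(0.0, sigma_lo, n)),
                    most_likely + np.abs(rng.normal(0.0, sigma_hi, n)))

# Combine two independently reported metrics by adding their samples, then
# read a combined most likely value and 95% interval back off the result.
scope2 = sample_metric(123.4, 45.6, 234.5)   # values from the sample datapoints below
other  = sample_metric(80.0, 50.0, 150.0)    # hypothetical second metric
print(np.percentile(scope2 + other, [2.5, 50, 97.5]))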

The quality of carbon intensity information improves over time: for energy consumption, a predicted grid mix for the next 15 minutes becomes an actual grid mix for the past hour, and then an audited grid mix in the energy provider’s billing records a month or so later. For market-based Scope 2 measurements the grid mix can change for up to a year as Renewable Energy Credits are traded. Cloud providers don’t disclose their own RECs and Power Purchase Agreements, but these are taken into account in their monthly reports, with a few months’ delay. Scope 3 calculations are revised as more comprehensive supply chain data and better Life Cycle Analysis models are obtained. Given this background, for a given amount of energy reported at a point in time, the carbon footprint of that energy can be recalculated as new information arrives, narrowing the confidence interval.
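
To make the recalculation step concrete, here is a small Python sketch with illustrative numbers only: the energy measurement stays fixed, while the carbon intensity triple is replaced by better data over time and the interval narrows.

def scope2_grams(energy_kwh, intensity_g_per_kwh):
    # intensity is reported as (most likely, lower bound, upper bound) in gCO2e/kWh
    most_likely, lower, upper = intensity_g_per_kwh
    return (energy_kwh * most_likely, energy_kwh * lower, energy_kwh * upper)

energy_kwh = 10.0                                       # fixed measurement for the interval
forecast = scope2_grams(energy_kwh, (400, 300, 550))    # 15-minute-ahead grid forecast
actual   = scope2_grams(energy_kwh, (420, 400, 440))    # actual grid mix for the past hour
audited  = scope2_grams(energy_kwh, (415, 412, 418))    # provider's billed figure, months later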

Energy data is also modeled and could be reported with confidence intervals, but it’s easier to measure locally in real-time, and less likely to be revised over time, so it seems plausible to report energy as a single fixed value.

Energy data can be obtained from power delivery systems in datacenters or from intelligent plugs or power strips. At the instance level, many CPU and system architectures provide direct access to power consumption metrics (though cloud providers often block this for security reasons, as discussed below). In a time-series reporting schema, energy should be measured as a difference over the time interval, which is more accurate than sampling an instantaneous power level at a point in time.
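
As an example of measuring an energy difference rather than instantaneous power, here is a minimal Python sketch for a bare-metal Linux machine that exposes Intel RAPL counters through the powercap sysfs interface. The single-package path is a simplification (real systems have several power domains), and as noted this interface is usually not accessible on cloud instances.

import time

PKG = "/sys/class/powercap/intel-rapl:0"   # package 0; other domains exist on real systems

def read_uj(name):
    with open(f"{PKG}/{name}") as f:
        return int(f.read())

def interval_energy_joules(seconds=60):
    # Energy over the interval is the difference between two counter readings,
    # corrected for wrap-around using the published counter range.
    max_range = read_uj("max_energy_range_uj")
    start = read_uj("energy_uj")
    time.sleep(seconds)
    delta = read_uj("energy_uj") - start
    if delta < 0:
        delta += max_range
    return delta / 1_000_000.0             # microjoules to joules

print(interval_energy_joules(60))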

Energy data at the whole machine or raw cloud instance level needs to be apportioned to the virtual machines or cloud instances that run an operating system, and to the pods, containers and processes running applications and background activities that consume CPU, memory and I/O resources. Additional attached devices such as backlit laptop or mobile device screens, fans and batteries should also be taken into account. The power consumed by CPUs is dynamic: it increases as they work harder and decreases if they need to recover from overheating. Clock rates also change, some ARM architectures mix high performance and low power cores in the same CPU, and Intel architecture CPUs have hyperthreading, so the performance of an execution thread will vary over time. I published a paper and presentation in 2006 that discusses the complexities of measuring CPU utilization.

The apportionment model should be identified in the collected data and be a pluggable module. The model needs to apportion the energy in the current interval (say 100 Watts for one minute, 6000 joules) to the activities and memory usage of interest on the machine in that interval: idle time, background processing by operating system daemons, installed agents’ activity for security, performance and management (including the apportionment agent itself if it’s running on the machine), and process, container or pod level activity that is running as application workloads. The information required for a simple CPU utilization based allocation algorithm should already be available in the system monitoring instrumentation, but a dedicated agent that tracks process level activity and memory usage would be more accurate.

I built a detailed process monitoring agent like this for Solaris about 25 years ago, and while it may sound complicated, the overhead of this kind of analysis isn’t likely to be an issue. The Kepler project has developed what looks like a good approach to this, and I’ve started a discussion there to explore what would be needed: https://github.com/sustainable-computing-io/kepler/discussions/600.
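
A Python sketch of the simple CPU utilization based allocation mentioned above; the category names and CPU-second figures are illustrative, and a real agent (such as Kepler) would also weight memory and I/O activity.

def apportion_energy(interval_joules, cpu_seconds_by_activity):
    # Split one interval's energy across activities in proportion to the
    # CPU time each one consumed during that interval.
    total_cpu = sum(cpu_seconds_by_activity.values())
    if total_cpu == 0:
        return {name: 0.0 for name in cpu_seconds_by_activity}
    return {name: interval_joules * cpu / total_cpu
            for name, cpu in cpu_seconds_by_activity.items()}

# 100 Watts for one minute = 6000 joules, split by CPU time in the interval
shares = apportion_energy(6000.0, {
    "idle": 20.0,               # seconds of idle CPU time in the interval
    "os_daemons": 4.0,
    "monitoring_agents": 3.0,
    "pod:web-frontend": 33.0,
})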

The Real Time Carbon Footprint calculation would update at the same interval rate as the resource utilization data, normally once per minute. The carbon intensity estimate would change less often, maybe every 15 minutes to an hour. The energy measurement data would ideally update every minute, but if it’s less often it would be averaged across the resource metric intervals. The output would be a time series in OpenTSDB format that consists of a timestamp, a value and a set of key/value pairs that categorize the data. This format could be coded as a Prometheus Exporter.

The above discussion focuses on Scope 2, energy consumption. However, carbon emissions from fuel used by backup generators, Scope 1, also need to be apportioned. They could be reported in real time, and during a power failure event that information could be used by a scheduler to send work to a different datacenter or zone, but it’s more likely to be provided as a historical average that changes every month. Scope 1 is also observed to be a small component of the carbon footprint, and is relatively easy to reduce to near zero by using zero carbon fuel along with larger battery backup capacity, so it’s probably not worth adding much complexity for. The real-time feed should contain the historical average, and this could be updated with more accurate data, along with other carbon updates, a month or so later by reprocessing the time series.

Security is always an issue when power is being measured. A number of cryptographic key extraction attacks and noisy neighbor measurements are enabled by power signature analysis techniques. For this reason, cloud vendors block tenants’ applications from accessing power measurement interfaces. The underlying data is available to the cloud vendors themselves, so they would need to supply a one minute, per instance, energy use value as an AWS CloudWatch metric or the equivalent for other cloud providers. At one minute granularity, security concerns are minimized. In the absence of measured data, a model calibrated on datacenter systems could be used, taking into account CPU types, power usage effectiveness (PUE), and overall utilization levels. The energy model in use should be reported as a key/value in the data feed.
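
In the absence of measured data, the kind of calibrated model described above often reduces to a linear interpolation between idle and maximum power, scaled by PUE. The Python sketch below uses that common simplification with made-up coefficients; a real model would be calibrated per CPU and instance type, and its name would be reported in the model tag of the feed.

def estimated_energy_joules(avg_utilization, interval_seconds,
                            idle_watts, max_watts, pue):
    # Linear power model: idle power plus a utilization-proportional share of
    # the idle-to-max range, scaled up by datacenter power usage effectiveness.
    watts = idle_watts + (max_watts - idle_watts) * avg_utilization
    return watts * pue * interval_seconds

# Illustrative coefficients for a hypothetical instance type, 40% utilized for one minute
energy = estimated_energy_joules(0.40, 60, idle_watts=45.0, max_watts=120.0, pue=1.2)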

There is additional overhead for a cloud instance that needs to be accounted for beyond the CPU and memory footprint of the code it runs. There’s a network-traffic-driven allocation of the energy used by the network switches that connect computers, and there are control plane overheads, shared storage volumes and supporting services to instrument. The cloud providers themselves need this data so they can measure, optimize and report the energy and carbon usage of their higher level services.

A sample set of OpenTSDB datapoints for a single Kubernetes pod could look something like this:

Metric: carbon.footprint.energy
Value: 123.4
Timestamp: Tue, 21 Mar 2023 17:47:26 -0700
Tags:
unit=joules
model=measured.v1 (or sampled, or estimated, or simulated)
pod=
node=
namespace=
instance=
project=
account=
region=us-east-1a

Metric: carbon.footprint.scope2
Value: 123.4
Timestamp: Tue, 21 Mar 2023 17:47:26 -0700
Tags:
unit=grams
model=local (or marketplace)
pod=
node=
namespace=
instance=
project=
account=
region=us-east-1a

Metric: carbon.footprint.scope2upper
Value: 234.5
Timestamp: Tue, 21 Mar 2023 17:47:26 -0700
Tags:
unit=grams
model=local (or marketplace)
pod=
node=
namespace=
instance=
project=
account=
region=us-east-1a

Metric: carbon.footprint.scope2lower
Value: 45.6
Timestamp: Tue, 21 Mar 2023 17:47:26 -0700
Tags:
unit=grams
model=local (or marketplace)
pod=
node=
namespace=
instance=
project=
account=
region=us-east-1a
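
The same datapoints could be exposed through the Prometheus Exporter route mentioned above; in Prometheus naming the dots become underscores and the tags become labels. Here is a minimal sketch using the Python prometheus_client library, with placeholder label values and hard-coded numbers where the measurement and apportionment steps described earlier would plug in.

import time
from prometheus_client import Gauge, start_http_server

LABELS = ["unit", "model", "pod", "node", "namespace",
          "instance", "project", "account", "region"]

energy       = Gauge("carbon_footprint_energy", "Energy used in the interval", LABELS)
scope2       = Gauge("carbon_footprint_scope2", "Most likely scope 2 carbon", LABELS)
scope2_upper = Gauge("carbon_footprint_scope2_upper", "95% upper bound", LABELS)
scope2_lower = Gauge("carbon_footprint_scope2_lower", "95% lower bound", LABELS)

# Placeholder label values for a single pod
carbon_tags = dict(unit="grams", model="local", pod="web-frontend", node="node-1",
                   namespace="default", instance="i-0abc123", project="demo",
                   account="123456789012", region="us-east-1a")
energy_tags = dict(carbon_tags, unit="joules", model="measured.v1")

start_http_server(9102)            # serves /metrics for Prometheus to scrape
while True:
    energy.labels(**energy_tags).set(123.4)
    scope2.labels(**carbon_tags).set(123.4)
    scope2_upper.labels(**carbon_tags).set(234.5)
    scope2_lower.labels(**carbon_tags).set(45.6)
    time.sleep(60)                 # update once per metric interval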

I launched this proposal via a discussion at the Green Software Foundation and shared it as part of my QCon London talk on Cloud Provider Sustainability. I’m on vacation in Europe at the moment, but will be devoting more time to pushing this forward from mid-April onwards.


Adrian Cockcroft is a technology strategy advisor and Partner at OrionX.net (ex Amazon Sustainability, AWS, Battery Ventures, Netflix, eBay, Sun Microsystems, CCL).