Relation between CPU Instructions and Power Consumption #1284

emoek · 2024-03-04T17:07:11Z

emoek
Mar 4, 2024

Looking at the Kepler documentation, I understand that it builds on the premise that resource utilization and power consumption are linearly proportional. Based on this premise, Kepler attributes power to containers, by default using CPU instructions as the splitting parameter. However, in the Grafana data shown in the attached figure, I cannot recognize any relation between the two metrics. While the CPU instructions seem to follow a specific pattern, the power consumption seems more or less random.
Now I am asking myself whether I have misunderstood Kepler or am missing something. I would appreciate it if someone could provide any insights. I have tried it both for disruptive scenarios (as visible in the image) and with more controlled load scenarios.

For the image:

Yellow: Power consumption of one pod/container with the "irate" and "sum" function applied
Green: CPU Instructions of one pod/container with the "irate" and "sum" function applied and scaled down for comparability

rootfs · 2024-03-04T20:51:08Z

rootfs
Mar 4, 2024
Maintainer

@emoek my eye sight reading is poor :D

Let's run a program to see whether the correlation is strong:

import datetime

import numpy as np
import requests

# Prometheus instance URL and the Kepler counters to retrieve
PROMETHEUS_URL = 'http://localhost:9090'
METRICS = ['kepler_container_package_joules_total', 'kepler_container_cpu_instructions_total']

# Container name used to filter the series
CONTAINER_NAME = 'kepler-exporter'

# PromQL template: per-second rate of a counter, summed over all matching series
QUERY = 'sum(rate({metric}{{container_name=~"{container_name}"}}[1m]))'

# Query the last hour. Prometheus expects timestamps in the following format:
# start=2015-07-01T20:10:30.781Z&end=2015-07-01T20:11:00.781Z&step=15s
# Use UTC so that the trailing 'Z' is actually correct.
now = datetime.datetime.now(datetime.timezone.utc)
end_time = now.strftime('%Y-%m-%dT%H:%M:%SZ')
start_time = (now - datetime.timedelta(hours=1)).strftime('%Y-%m-%dT%H:%M:%SZ')

# Run one range query per metric and collect the sample values
series = {}
for metric in METRICS:
    params = {
        'query': QUERY.format(metric=metric, container_name=CONTAINER_NAME),
        'start': start_time,
        'end': end_time,
        'step': '15s',
    }
    # Pass the query via params so that requests URL-encodes the PromQL expression
    response = requests.get(f'{PROMETHEUS_URL}/api/v1/query_range', params=params)
    response.raise_for_status()
    # Each sample is [timestamp, "value"]; keep only the value as a float
    series[metric] = [float(sample[1]) for sample in response.json()['data']['result'][0]['values']]

# Calculate the Pearson correlation between the two rate series
joules = series['kepler_container_package_joules_total']
instructions = series['kepler_container_cpu_instructions_total']
correlation = np.corrcoef(joules, instructions)[0, 1]

print(f'Correlation between kepler_container_package_joules_total and kepler_container_cpu_instructions_total: {correlation}')

My output is

Correlation between kepler_container_package_joules_total and kepler_container_cpu_instructions_total: 0.8485091316695434

1 reply

marceloamaral Mar 5, 2024
Maintainer

@rootfs can you also do the experiment with kepler_container_cpu_cycles_total?

ArneTR · 2024-03-05T06:39:06Z

ArneTR
Mar 5, 2024

What an interesting discussion here! Thanks @emoek for bringing it up.

@rootfs Is Kepler really assuming that CPU instructions and energy are necessarily related linearly, and even proportionally?

Given the following example:

One part of the code executes 100 instructions and takes 100 Joules. All of the instructions are SIMD.
Another part of the code executes 100 instructions and takes 50 Joules. All of the instructions are simple integer additions.

Would it not be expected that there is no proportionality or correlation between these two, since some instructions are just more costly than others?

To my understanding, instructions are chosen primarily because they are the causal factor of activity that allows for splitting by process, not because they are a proxy for energy.

A correlation with energy can only be derived if we know the type of instruction (load, mul, add, SIMD etc.).

So even a statistical argument proves nothing one way or the other if the base assumption is that another, currently hidden variable links the two.

So in summary: If Instructions is a good or bad value cannot be reasoned about by just comparing Instructions to energy. And it is also not what is the design intention. Is that understanding correct?
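The example above can be put into numbers. The following is a minimal sketch, assuming the hypothetical per-instruction costs implied by the example (1 J per SIMD instruction, 0.5 J per integer addition); the four windows and the helper names `window_energy` and `pearson` are made up for illustration. It shows that once the instruction mix changes between windows, total instruction count and energy are no longer proportional, so their correlation falls below 1 even though energy is fully determined by the per-type counts.

```python
import math

# Hypothetical costs derived from the example above:
# 100 SIMD instructions -> 100 J, 100 integer additions -> 50 J.
JOULES_PER_INSTRUCTION = {'simd': 1.0, 'int_add': 0.5}


def window_energy(counts):
    """Energy of one time window, given instruction counts per type."""
    return sum(JOULES_PER_INSTRUCTION[kind] * n for kind, n in counts.items())


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up windows with a changing instruction mix
windows = [
    {'simd': 100, 'int_add': 0},
    {'simd': 0, 'int_add': 100},
    {'simd': 50, 'int_add': 150},
    {'simd': 0, 'int_add': 300},
]

total_instructions = [sum(w.values()) for w in windows]
energies = [window_energy(w) for w in windows]

# The first two windows have the same instruction count but different energy,
# so the relationship cannot be a single straight line.
print('instructions:', total_instructions)
print('joules:      ', energies)
print('correlation: ', pearson(total_instructions, energies))
```

With a constant instruction mix the same code would report a correlation of 1, which is why a high measured correlation says something about the workload as well as about the metric.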

2 replies

rootfs Mar 5, 2024
Maintainer

@ArneTR Kepler uses the GHG protocol [1] to attribute both dynamic and idle power. CPU instructions are used for dynamic power attribution.

[1] https://middleware-conf.github.io/2021/pdf/middleware21keynotes-final3.pdf
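As an illustration of what ratio-based attribution of dynamic power means, here is a minimal sketch. It is not Kepler's actual implementation: the function name `attribute_dynamic_energy`, the container names and the numbers are hypothetical, and idle power is left out entirely.

```python
def attribute_dynamic_energy(node_dynamic_joules, instructions_by_container):
    """Split node-level dynamic energy in proportion to each container's instruction count."""
    total = sum(instructions_by_container.values())
    if total == 0:
        # No measured activity in this interval: nothing to attribute
        return {name: 0.0 for name in instructions_by_container}
    return {
        name: node_dynamic_joules * count / total
        for name, count in instructions_by_container.items()
    }


# Made-up interval: 90 J of dynamic energy, two containers
shares = attribute_dynamic_energy(90.0, {'frontend': 2_000_000, 'backend': 1_000_000})
print(shares)
```

Under this scheme a container's share depends only on how many instructions it retired relative to the others, not on which kind of instructions they were.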

rootfs Mar 5, 2024
Maintainer

"If Instructions is a good or bad value cannot be reasoned about by just comparing Instructions to energy. And it is also not what is the design intention."

There is evidence that CPU instructions can estimate power well [1,2,3]. Since we are not able to break down CPU instructions in more detail (nor would it be efficient), we just use the CPU instruction perf counter in Kepler. CPU time has issues: with hyperthreading or similar resource-partitioning mechanisms, CPU time cannot reliably tell how many logic gates are actually active.

marceloamaral · 2024-03-05T08:37:39Z

marceloamaral
Mar 5, 2024
Maintainer

@emoek, thank you for the discussion.

Dynamic power consumption is directly related to resource utilization. However, what counts as CPU utilization can vary with the type of instructions and cache operations. In practice, CPU cycles typically exhibit a better correlation than CPU instructions because they also include cache operations. Nonetheless, we expect a high correlation between the CPU package power and the container instructions or cycles, as we saw in our experiments and as reported in many research papers.

That is, both instructions and cycles should show a good correlation with the CPU power consumption.
Let's try to investigate the problem.

Regarding your experiment:

Just to confirm, are you running the experiment on a bare-metal node? Can you share the Kepler logs via https://pastebin.com?
You mentioned that you are using the kepler_container_joules_total metric, which includes power consumption data for CPU, DRAM, GPU, and other components. To assess the correlation with CPU cycles and/or instructions, we should look at the kepler_container_package_joules_total metric, as @rootfs did in his experiment. Also, let's look at the other metrics to double-check whether something is wrong with the kepler_container_joules_total metric.
Can you also verify the source label of the *_joules_total* metrics?
Can you also verify the dynamic and idle power using the mode label in the *_joules_total* metrics?
Just to confirm, you are using kepler_container_cpu_instructions_total, right?
Additionally, could you also verify the node metrics using kepler_node_package_joules_total? This will provide further insight into the power consumption.
Lastly, as pointed out by @ArneTR, power consumption is not linear and can be influenced by factors such as CPU frequency, machine temperature, and the type of instructions being executed, so some variation is expected.
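The label checks in the list above could be run as PromQL queries along these lines. This is a minimal sketch: the helper name `breakdown_query` is hypothetical, and it assumes the `container_name` label and the 1m rate window used in the script earlier in this thread, and that the `mode` and `source` labels are present on the metrics queried.

```python
def breakdown_query(metric, by_label, container_name=None, window='1m'):
    """Build a PromQL expression that sums a counter's per-second rate, grouped by one label."""
    # Node-level metrics are queried without a container filter
    selector = f'{{container_name=~"{container_name}"}}' if container_name else ''
    return f'sum by ({by_label}) (rate({metric}{selector}[{window}]))'


# Dynamic vs. idle share of the package energy for one container
print(breakdown_query('kepler_container_package_joules_total', 'mode', 'kepler-exporter'))
# Which power source feeds the node-level package metric
print(breakdown_query('kepler_node_package_joules_total', 'source'))
```

The resulting strings can be pasted into Grafana or the Prometheus expression browser.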

3 replies

emoek Mar 6, 2024
Author

Now I see my mistake, thanks!

I did not expect the container power consumption to be influenced that much by the "other" metric. The papers I looked into in this context always stated that the CPU is the main contributor to power consumption, which is why I totally underestimated the impact of "kepler_container_other_joules_total", which seems to be the most dominant source. If I understand that metric correctly, it takes system power meters such as ACPI into account, which cover the power consumption of the whole system. This would also fit my comparison of the metrics below, which shows that the sum of the containers' "other_joules" almost reaches the total node platform power consumption, and sometimes does reach it.

I am doing efficiency benchmarking in Kubernetes for my master's thesis and am therefore looking for a representative workload for the calculation of the energy efficiency. Looking at the correlations below, I can do that when focusing on the CPU and DRAM (I am not using a GPU), but do you know of a utilization parameter that could be used as representative workload when using the total power consumption, including other components? Even the aggregate of all CPU instructions does not come close to approximating the total power consumption of the containers.

In the benchmarks of my test microservice app, I now reach the following correlations over three repetitions. One thing to note is that these were measured during load generation:
CPU instructions + Package Consumption: 0.978 / 0.976 / 0.975
CPU Cycles + Package Consumption: 0.997 / 0.989 / 0.992

This surprises me tbh, as I expected the power attribution to be biased because it uses CPU instructions by default for the splitting.
I will run further experiments to see how the correlation changes when DRAM power consumption is included as well, etc.

emoek Mar 6, 2024
Author

The results above were obtained with the application deployed without resource constraints. With resource constraints set (requests=limits), the results are:
CPU instructions + Package Consumption: 0.714 / 0.663 / 0.752
CPU Cycles + Package Consumption: 0.660 / 0.486 / 0.676

emoek Mar 6, 2024
Author

But that is when using only the instructions/cycles and package consumption of a single container. I have now tested it with all services, i.e. aggregated over the containers, and the correlation then reaches better values again, from 0.85 up to 0.9.
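The aggregation step can be sketched as follows: sum the per-container series sample by sample, then correlate the summed series. The helper names `aggregate` and `pearson`, the container names and the sample values are all hypothetical and only show the shape of the computation, not the measurements reported above.

```python
import math


def aggregate(series_by_container):
    """Element-wise sum of equally long per-container series."""
    return [sum(samples) for samples in zip(*series_by_container.values())]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up rate samples for two containers over four scrape intervals
instructions = {'svc-a': [1.0, 4.0, 2.0, 5.0], 'svc-b': [3.0, 1.0, 4.0, 2.0]}
package_watts = {'svc-a': [0.9, 2.1, 1.4, 2.6], 'svc-b': [1.6, 0.8, 2.2, 1.1]}

all_instructions = aggregate(instructions)
all_watts = aggregate(package_watts)
print('aggregated correlation:', pearson(all_instructions, all_watts))
```

Note that `zip` silently truncates to the shortest series, so the per-container series should be queried with the same start, end and step.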
