Introduction

Joule Profiler is a tool for measuring program metrics from various sources, such as RAPL, perf_event or NVML, with a focus on energy consumption. It supports profiling CPU, GPU, and other system metrics at multiple scopes. Its modular and extensible architecture, written in Rust, allows new sources to be added easily, while minimizing overhead to provide reliable measurements.

It can be used through the CLI or via the exposed library, which offers more flexible configuration and the ability to add user-defined sources. Some traits are exposed through the crate API, enabling users to implement custom metric sources easily.

The supported hardware and systems depend on the sources you choose. Refer to each source’s documentation for details.

Joule Profiler is heavily inspired by JouleIt¹ but provides enhanced features and is written in Rust for better performance, safety, portability, and extensibility.

Phases

The main feature that distinguishes Joule Profiler from other profilers such as Alumet² or Scaphandre³ is its phases. It enables energy profiling on different parts of a program, called phases, allowing to identify which sections of execution contribute most to energy consumption.

Phases are detected via tokens printed to standard output and matched with a configurable regular expression. This approach may introduce overhead and noise depending on the system’s I/O performance.

In the future, we may implement an alternative using inter-process communication with language-specific wrappers to reduce the overhead introduced.

Phases

Quick Start

Getting started with Joule Profiler is easy and straightforward.

First, install the latest version:

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/install.sh | bash

Basic Measurement

The simplest way to measure energy consumption is to run your program with the profiler:

joule-profiler profile -- ./my-program arg1 arg2

This will execute your program and display energy consumption metrics in the terminal once it completes:

╔════════════════════════════════════════════════╗
║  Command                                       ║
╚════════════════════════════════════════════════╝
  python3 main.py
 ────────────────────────────────────────────────
  Duration            :       3013 ms
  Exit code           :          0

╔════════════════════════════════════════════════╗
║  Phase: START -> END                           ║
╚════════════════════════════════════════════════╝
  Duration            :       3013 ms
  Start token         :      START
  End token           :        END

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    7299194 µJ
  PACKAGE-0           :   15104492 µJ
  PSYS                :   35101074 µJ
  UNCORE-0            :     355224 µJ

Phases

Put some prints separating your program’s parts:

...

print("__SETUP_PHASE__", flush=True)

setup()

print("__WORKLOAD_PHASE__", flush=True)

workload()

print("__CLEANUP_PHASE__", flush=True)

cleanup()

...

Don’t forget to flush the standard output after each print, see troubleshooting.

Now, lauch Joule Profiler:

joule-profiler profile -- ./my-program arg1 arg2

You will get something like:

__SETUP__
__WORKLOAD__
__CLEANUP__
╔════════════════════════════════════════════════╗
║  Command                                       ║
╚════════════════════════════════════════════════╝
  python3 main.py
 ────────────────────────────────────────────────
  Duration            :       3013 ms
  Exit code           :          0

╔════════════════════════════════════════════════╗
║  Phase: START -> __SETUP__                     ║
╚════════════════════════════════════════════════╝
  Duration            :         10 ms
  Start token         :      START
  End token           : __SETUP__ (line 0)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :     184997 µJ
  PACKAGE-0           :     226806 µJ
  PSYS                :     387084 µJ
  UNCORE-0            :       4150 µJ

╔════════════════════════════════════════════════╗
║  Phase: __SETUP__ -> __WORKLOAD__              ║
╚════════════════════════════════════════════════╝
  Duration            :       1000 ms
  Start token         : __SETUP__ (line 0)
  End token           : __WORKLOAD__ (line 1)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    1856750 µJ
  PACKAGE-0           :    3949645 µJ
  PSYS                :    8571228 µJ
  UNCORE-0            :      85571 µJ

╔════════════════════════════════════════════════╗
║  Phase: __WORKLOAD__ -> __CLEANUP__            ║
╚════════════════════════════════════════════════╝
  Duration            :       1000 ms
  Start token         : __WORKLOAD__ (line 1)
  End token           : __CLEANUP__ (line 2)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    1398742 µJ
  PACKAGE-0           :    3311218 µJ
  PSYS                :    7377380 µJ
  UNCORE-0            :      58105 µJ

╔════════════════════════════════════════════════╗
║  Phase: __CLEANUP__ -> END                     ║
╚════════════════════════════════════════════════╝
  Duration            :       1003 ms
  Start token         : __CLEANUP__ (line 2)
  End token           :        END

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    1468383 µJ
  PACKAGE-0           :    3395080 µJ
  PSYS                :    7314697 µJ
  UNCORE-0            :     137817 µJ

Custom Output Format

By default, the output is shown in the terminal with a human-readable format. For further analysis or integration with other tools, you can export results in different formats:

# Export results as JSON for programmatic analysis
joule-profiler --json profile -- ./my-program

# Export results as CSV for spreadsheet analysis
joule-profiler --csv profile -- ./my-program

An output file will be generated at the end of the profiling. These formats make it easy to process results with scripts or import them into visualization tools.

Additional Options

Specify an Output File

You can choose where to export your results:

# Save results to a file (format determined by extension or --json/--csv flags)
joule-profiler -o results.json --json profile -- ./my-program

This is particularly useful when running multiple benchmarks.

GPU Profiling

If your system has an Nvidia GPU and you want to measure GPU energy consumption alongside CPU:

# Include GPU metrics in the measurement
joule-profiler --gpu profile -- ./my-program

Performance Counters

Joule Profiler supports perf_event performance counters on Linux systems, you can activate this feature with the --perf flag.

joule-profiler --perf profile -- ./my-program

GPU profiling requires the NVML library (part of NVIDIA driver installation).

Choosing RAPL Backend

By default, Joule Profiler measures RAPL counters using either perf_event (default) or Powercap backend depending on your system.

You can explicitly choose which backend to use:

# Use Powercap backend (requires root)
sudo joule-profiler --rapl-backend=powercap profile -- ./my-program

# Use perf_event backend (lower overhead, may require kernel configuration)
joule-profiler --rapl-backend=perf profile -- ./my-program

The choice of backend can affect measurement granularity and permission requirements. See the RAPL documentation for details on each backend.

For more advanced usage including phase-based profiling, iterations, and custom metric sources, see the examples folder.

Quick Install

Install the latest version with a single command:

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/install.sh | bash

This script automatically detects your operating system and architecture, downloads the correct pre-built binary, and installs it to /usr/local/bin/.

Note

You may be prompted for your sudo password during installation to move the binary to the system path.

Custom Installation

For more control over the installation process, you can pass arguments to the installation script.

Install to a custom directory, useful for non-root users or specific environment configurations (e.g., ~/.local/bin):

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/install.sh | bash -s -- --dir ~/.local/bin

Install a specific version:

If you need to install a specific version depending on the features you need:

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/install.sh | bash -s -- --version v0.1.0

List available versions:

You can see the list of available versions with the installer:

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/install.sh | bash -s -- --list

Non-interactive Mode, for use in automated scripts or CI pipelines:

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/install.sh | bash -s -- --yes

Build from Source

You can build the profiler from the sources by cloning the repository. It can be useful if you want to access the latest features not yet released, or if you intend to customize the source code.

# Clone the repository
git clone https://github.com/jwoirhaye/joule-profiler.git
cd joule-profiler

# Build and install system-wide
cargo build --release
sudo cp target/release/joule-profiler /usr/local/bin/

# Verify installation
sudo joule-profiler --version

Tip

System-wide installation (/usr/local/bin/) is recommended as the tool requires sudo to access RAPL counters.

Uninstall

To remove Joule Profiler from your system, you can use the automated uninstaller or remove the binary manually.

Using the uninstaller:

curl -fsSL https://raw.githubusercontent.com/jwoirhaye/joule-profiler/main/uninstall.sh | bash

Manually: If you installed the binary to the default location (/usr/local/bin/), you can simply delete it:

sudo rm /usr/local/bin/joule-profiler

Measurements

Overview

Joule Profiler minimizes the measurements overhead by performances only the required operations during program execution. Heavy computations like string formatting or allocation, data aggregation or transformation, heavy memory allocations and syscalls are deferred after the end of the profiling. During the program’ execution, it only reads sources sensors and store raw values in memory

Measurement Process

Initialization

To ensure accurate measurements from the first instruction:

The program is spawned as a child process
Immediately paused using SIGSTOP before execution begins
The process ID (PID) is shared with all metric sources via atomic storage and operations
Sources attach their instrumentation (e.g., perf_event counters scoped to the PID)
The program is resumed with SIGCONT

This allows pid filtering for sources implementing it, while minimizing the introduced overhead by stopping the program during the initialization of sources requiring per-process measurement.

Measurements

Each metric source runs in its own asynchronous task. When a measurement is triggered, every source reads their counters in parallel and stores raw values separately without data transformation. This concurrent approach allows to make the measurement of all sources at the same time and ensures precise timing control.

Measurements occur at specific points, the first triggered measurement is the initial baseline before resuming the program, then measurements are made for each phase transitions detected via regex pattern in stdout. The Final measurement is made after the completion of the program.

Post-Measurements Processing

We perform expensive operations only after the program completion. We convert raw sources values into structured metric objects and aggregate them across phases. These calculation do not have any impact on the results because the measurements are already made and the profiled program has already exited.

Phases

Overview

Phase-based profiling is the key feature that distinguishes Joule Profiler from other energy profilers. It enables energy measurement of specific sections of your program, allowing you to identify which parts contribute most to energy consumption.

Why Use Phases?

Identify Energy Hotspots

Instead of measuring total program energy, phases let you pinpoint expensive sections:

print("__INIT__", flush=True)
loading()

print("__WORK__", flush=True)
process()

print("__CLEANUP__", flush=True)

Each phase appears separately in the results, showing its individual energy consumption.

Exclude Initialization

Interpreted languages (Python, JavaScript, Ruby) have significant startup overhead that dominates short programs:

# Python interpreter initialization can be 50-100ms
# Your actual work might be only 10ms
sudo joule-profiler profile -- python my_script.py
# Result: 90% of energy is interpreter startup, not what you want to measure

If you want to exclude this initialization phase, then using phases might be a good answer:

# my_script.py
import heavy_libraries
model = load()  # Initialization overhead

print("__START__", flush=True)  # Mark the beginning of actual work
process()    # This is what you want to measure
print("__END__", flush=True)

sudo joule-profiler profile --token-pattern "__START__|__END__" -- python my_script.py

Now you measure the __START__ to __END__ phase, excluding interpreter startup and library loading.

Implementation Notes

Phases are detected by monitoring the program’s stdout for tokens matching a regular expression. The profiler spawns the program and capture the standard output. Then, it scans each line and triggers measurements when tokens matching the configured regex pattern are detected. Finally, it associates metrics to each phase.

Future versions may support lower-overhead mechanisms like inter-process communication with language-specific wrappers to minimize the overhead introduced by I/O operations.

Overview

Sources Overview

Joule Profiler is designed to be simple, portable, and focused on monitoring the components responsible for the majority of energy consumption, such as the CPU, GPU, and SoC. While it does not aim to support every possible energy source on every device, the profiler can be extended by implementing custom sources for specific devices or components.

Supported Architectures

CPU: The only target is Intel x86 architecture at the moment.
GPU: Current focus is on Nvidia GPUs.
OS: Only Linux-based systems are officially supported at the moment.

Available Sources

Intel RAPL: Measures CPU and DRAM energy domains.
- Implemented using either perf_event or Powercap on Linux systems.
- For details, see RAPL.
Nvidia GPUs (NVML): Provides energy and performance metrics for Nvidia GPUs, see NVML.
perf_event: Measures various performance counters like hardware or software on Linux systems, see perf_event.

Extending Joule Profiler

Users can implement new metric sources allowing the monitoring of additional devices or components beyond the default set. For guidance, see Adding a New Source.

RAPL

Overview

RAPL (Running Average Power Limit) is an Intel processor feature that allows real-time energy consumption measurements of CPU and memory subsystem.

This technology has been available on Intel processors since Sandy Bridge generation.

RAPL provides energy measurements at different scales (e.g.,CPU core, integrated GPU, package), enabling you to measure energy consumption per component and understand more precisely how each part of the system contributes to the overall power usage.

Architecture

RAPL interface exposes multiple power domains that allow measuring energy consumption of different parts of the processor and memory subsystem. Domains metrics are accessible through model-specific registers (MSRs) on the host system, enabling user to monitor power usage in real time.

Domains

Domain	Description
Package/PKG	Entire CPU socket. This includes cores and uncore components.
Core/PP0	Represents the CPU cores only. Useful for profiling per-core energy consumption.
Uncore/PP1	Covers the energy consumption of the integrated GPU if available.
DRAM	Dynamic random access memory attached to the integrated memory controller if supported.
PSYS	Entire SoC energy consumption, unique (only one PSYS for the entire SoC at most, available since Skylake generation)

Architecture of RAPL¹:

RAPL architecture

Note

Some of the domains may not appear depending on the processor architecture.

The PSYS domain can report the same consumption as an external wattmeter ¹, representing the entire computer consumption. These results, obtained on a laptop, should be interpreted with caution and could not reflect the real world.

G. Raffin and D. Trystram, “Dissecting the Software-Based Measurement of CPU Energy Consumption: A Comparative Analysis,” in IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 1, pp. 96-107, Jan. 2025, doi: 10.1109/TPDS.2024.3492336. ↩ ↩2

Powercap

Overview

Powercap¹ is a Linux kernel framework that provides a generic and standardized interface for power capping and energy monitoring accross different hardware power domains.

Instead of accessing low-level hardware registers (e.g MSRs), Powercap safely exposes energy metrics via sysfs.

The use of powercap instead of MSRs may seem disadvantageous and cause more overhead while measuring energy consumption, but there is actually no or an insignificant overhead introduced by the use of the powercap framework. Moreover, the abstraction provided by powercap increases the maintainibility, and powercap smoothens the energy results to avoid noise.

Sysfs File Structure

The Powercap framework exports energy data through the /sys/devices/virtual/powercap or /sys/class/powercap directories. Each physical CPU socket or hardware component is represented as a control type (usually intel-rapl).

Within intel-rapl, the hierarchy is organized by zones and subzones, which correspond to the RAPL domains (Package, Core, DRAM, etc.).

The structure typically looks like this:

/sys/devices/virtual/powercap
└── intel-rapl/
    ├── intel-rapl:0/                # Package 0 (CPU Socket 0)
    │   ├── name                     # Content: "package-0"
    │   ├── energy_uj                # Cumulative energy in microjoules
    │   ├── max_energy_range_uj      # Overflow value for the counter
    │   ├── intel-rapl:0:0/          # Subzone: Core (PP0)
    │   │   ├── name                 # Content: "core"
    │   │   └── energy_uj
    │   ├── intel-rapl:0:1/          # Subzone: DRAM
    │   │   ├── name                 # Content: "dram"
    │   │   └── energy_uj
    │   └── intel-rapl:0:2/          # Subzone: Uncore (PP1)
    │       ├── name                 # Content: "uncore"
    │       └── energy_uj
    └── intel-rapl:1/                # Package 1 (CPU Socket 1, if multi-socket)

To retrieve the domains measure energy consumption, the following files are accessed:

name: The name of the corresponding domain (Package, Core, Uncore, DRAM or Psys).
energy_uj: This is the core metric. It provides the current energy consumption in microjoules (µj).
max_energy_range_uj: This file gives the maximum value before the counter wraps back to zero.

Overflow Handling

Because the RAPL energy counters are stored in hardware with four bytes of information, they will eventually reach their maximum value and wrap around (overflow) back to zero.

The max_energy_range_uj file to indicate this threshold. To ensure accurate measurements, especially for long-running benchmarks, the monitoring tool must implement a robust overflow detection and correction logic.

To handle these overflows, the measurement worker thread performs frequent polling of the energy_uj files. By sampling the counters at a rate significantly higher than the theoretical minimum time it takes for a counter to wrap around, we can safely detect an overflow and correct it. The polling rate must be higher than the minimal period of an overflow, otherwise, an overflow could not be always detected. In the future, we might implement an adaptive overflow period to minimize polling and reduce the overhead it introduces, even so it is negligible under hundreds of hertz.

The period of an overflow will depend mostly on the CPU wattage, but also on the domain, the DRAM package will not overflow often, while the psys or package domains might overflow in several minutes during high power consuming workload.

Theoretically, for a processor with a consumption of 200 W and a max_energy_range_uj of 262143328850 (common domain max_energy_range_uj), the package domain which represents the entire CPU consumption will overflow in around 22 minutes with a constant consumption.

Here is a graphic of the theoretical overflow time depending on the CPU wattage:

Power (W)	Overflow time (min,sec)
100	43,41
200	21,50
300	14,33
400	10,55
500	8,44
600	7,16
700	6,14
800	5,27
900	4,51
1000	4,22

Note

These overflow times are indicative only and serve as a reference scale. Actual values vary across hardware, kernel behavior, and workloads and must not be considered fixed thresholds.

Powercap documentation ↩

perf_event

Overview

For a detailed overview of what perf_event is and how it works, see perf_event, here we will discuss only about the measurements of RAPL domains counters.

To measure the energy consumption of the RAPL domains through perf_event, we’re using the perf_event_open_sys rust crate, which is a wrapper around the perf_event_open Linux system call, used to open perf event counters. The crate also provides bindings to associated perf_event I/O controllers to manage the opened counters.

Overflow Handling ?

Unlike powercap, perf_event handles the MSRs overflows and store counters on eight bytes. Moreover, the counters start at zero, which make them easy to compute and also completely prevent overflows, for example, on a CPU consuming at an average of 200 W, it would take 2924 years to overflow the package domain.

$P = 200 W = 200 J/s = 2.0 \times 1 0^{8} μ J/s$

$t_{overflow} = \frac{E _{max}}{P} = \frac{2 ^{64} μ J}{2.0 \times 1 0 ^{8} μ J/s} ≃ 9.2 \times 1 0^{10} s ≃ 2924 years$

Thus, using perf_event, we do not need to implement polling strategies like with powercap, it is impossible for a domain counter to overflow.

Limitations

Unlike most perf_event counters, per-process energy consumption cannot be inferred for RAPL domains. This is because RAPL is not designed to expose per-process energy measurements, and perf_event does not perform aggregation of RAPL counters to estimate per-process domain-level energy consumption.

RAPL counters exposed through perf do not support event grouping, which prevents managing all domains via a single ioctl and necessitates independent management of each domain.

Backend Comparison

Should you use Powercap of perf_event ?

Powercap is a framework for controlling and limiting power while perf_event is a tool for measuring performance counters.
Thus, their design differs from each other, perf_event uses kernel mechanisms to minimize user–kernel transitions during data collection, reducing overhead for frequent measurements, while Powercap relies on a sysfs interface, where each read or write triggers a kernel entry, making it more suitable for infrequent operations or control tasks.

While they both use the same underlying technology (e.g., MSRs for Intel RAPL), they operate at different abstraction layers. perf_event provides a measurement-oriented interface optimized for profiling, whereas Powercap provides a control-oriented interface suitable for setting power limits or enforcing budgets.

Scenario	Recommended Interface	Reason
High-frequency, fine-grained energy measurement	perf_event	Minimal overhead introduced and less transition from user to kernel space
Moderate to low-frequency	perf_event or Powercap	Syscall overhead is acceptable, perf_event requires more configuration (perf_event_paranoid), while powercap is easy to use

Summary: You should always prefer to use perf_event if it is configured on your system, but powercap is turnkey and easy to use.

Why not use MSRs ?

We can access MSRs through the filesystem at /dev/cpu/{core}/msr, therefore, in principle we could read RAPL counters directly from the registers to minimize overhead.

In practice, direct MSR access from userspace does not necessarily provide better performance¹ than the powercap interface, which is already optimized for safe and efficient energy accounting. Reading MSRs from userspace requires a system call for each access, and repeated reads across multiple domains increase overhead. As a result, user-space MSR reads are generally slower than accessing the same counters through kernel-level drivers such as powercap, which can read and process the registers efficiently without repeated user to kernel context switches. In addition, raw MSR reads can exhibit greater variability at short time scales due to the absence of kernel-managed aggregation and coordinated sampling, whereas powercap provides more consistent and reproducible energy measurements by performing aggregation entirely within the kernel.

Moreover, using MSRs directly requires explicit management of overflow, unit scaling, and platform-specific behavior, reduces portability, and can introduce safety and consistency issues that higher-level kernel interfaces such as Linux’s powercap framework handle automatically.

G. Raffin and D. Trystram, “Dissecting the Software-Based Measurement of CPU Energy Consumption: A Comparative Analysis,” in IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 1, pp. 96-107, Jan. 2025, doi: 10.1109/TPDS.2024.3492336. ↩

Measurement Accuracy

Joule Profiler relies on the MSR (Model-Specific Register) counters provided by the RAPL interface to deliver CPU energy metrics at different scopes.

Measurement Unit

RAPL energy counters report measurements in microjoules ( $μ J$ ). This high-precision unit ( $1 0^{- 6}$ joules) theoretically allows for extremely detailed energy accounting. However, it is important to note that while the data format is precise, the effective accuracy is bound by the hardware’s update frequency and the specific implementation of the voltage regulators on the motherboard.

Limitations

While RAPL is a powerful tool, it has inherent constraints that users should be aware of when interpreting results.

Lack of Per-Process Attribution

Although the RAPL interface provides multiple domains for fine-grained energy profiling, measurements are performed at the hardware level. Thus, it does not natively support per-process energy attribution. This makes it difficult to accurately assess the isolated energy consumption of a single process.

Hardware Variability

The availability of specific power domains is strictly hardware-dependent. Domains such as DRAM or PSYS (Platform System) might not be available depending on the specific CPU generation or platform configuration. Energy measurements can also be noisier on some systems, such as laptops, due to more aggressive power management and variable clock speeds.

Temporal Resolution

Very short or highly variable workloads may not be measured accurately. The hardware counters update at a fixed rate, so rapid changes in energy consumption can be missed between samples. For this reason, longer-running workloads generally produce more reliable results than microbenchmarks.

NVIDIA Management Library (NVML)

Overview

The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing NVIDIA GPUs. It is the underlying library used by the standard nvidia-smi¹ tool and provides direct access to the GPU driver.

Joule Profiler interfaces with NVML using the nvml-wrapper² Rust crate to retrieve reliable hardware energy counters.

Architecture

NVML is distributed as libnvidia-ml.so on Linux and communicates directly with the NVIDIA GPU driver to query device state.

Setup & Requirements

To use the NVML source in Joule Profiler, your system must meet specific hardware and software requirements.

Driver Requirements

Joule Profiler accesses NVML via the shared library libnvidia-ml.so (usually in /usr/lib/ or /usr/lib64/ on Linux) installed alongside the NVIDIA drivers. You do not need to install the CUDA toolkit manually, but the base GPU drivers are required.

Note

These libraries are installed automatically with standard NVIDIA Display Drivers.

Hardware Support

Energy consumption metrics are available on NVIDIA GPUs based on the Volta architecture (e.g., Tesla V100, Titan V) and newer. This includes all modern consumer architectures starting from Turing (RTX 20 series). Older architectures (Pascal, Maxwell, Kepler) may report other metrics but do not expose the energy counters required by this profiler.

Verification

Before running Joule Profiler, you can verify that your drivers are correctly installed and that your GPU supports management queries using the standard nvidia-smi tool.

Run the following command in your terminal:

nvidia-smi -q -d POWER

The output should look similar to the following:

==============NVSMI LOG==============

Timestamp                                 : Sat Feb  7 11:17:58 2026
Driver Version                            : 550.163.01
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    GPU Power Readings
        Average Power Draw                : N/A
        Instantaneous Power Draw          : 31.64 W
        Current Power Limit               : 140.00 W
        ...

Warning

If your output does not show “Power Draw” or “Energy Consumption,” your hardware may not support the necessary features for this profiler.

Metrics

Joule Profiler utilizes NVML to retrieve energy consumption metrics, focusing on the power usage of the GPU components.

Collected Metrics

Metric	Unit	Description	Scope
Energy Consumption	Millijoules (mJ)	Total energy consumed by the GPU since the driver was loaded or the counter was reset.	Per GPU Device

While NVML is capable of reporting various other telemetry data (such as fan speeds, temperature, and clock rates), Joule Profiler focuses specifically on the energy counters to compute the power usage of your application phases.

Precision and Overflow

The energy counters provided by NVML are 64-bit unsigned integers representing energy in millijoules (mJ).

Overflow Analysis

This unit allows energy to be measured over extremely long periods without realistic concern for integer overflow. For example, considering a GPU consuming approximately 300 W continuously:

$P = 300 W = 300 J/s = 3.0 \times 1 0^{5} mJ/s$

$t_{overflow} = \frac{E _{max}}{P} = \frac{2 ^{64} mJ}{3.0 \times 1 0 ^{5} mJ/s} ≃ 6.1 \times 1 0^{13} s ≃ 1.9 \times 1 0^{6} years$

Consequently, overflow of the NVML energy counter is generally ignored in our design as the time required to trigger it exceeds any practical profiling duration. However, in the unlikely event that an overflow does occur, we utilize wrapping subtraction to ensure measurements remain consistent without panicking.

perf_event

Overview

perf_event is the Linux kernel’s performance monitoring API that provides access to hardware, software and other like PMU supported counters.

It allows fine-grained measurement of low-level events such as CPU cycles, instructions retired, cache misses, branch mispredictions, context switches, PMU supported counters and more. These events can be used to analyze performance bottlenecks, execution efficiency, and in combination with hardware counters it can be used to help estimating energy and power-related insights.

The perf_event subsystem has been part of the Linux kernel since version 2.6.31 and is supported on most modern architectures, including x86, ARM, and RISC-V.

Architecture

The perf_event interface exposes a unified abstraction over multiple types of performance events. Events are collected by the kernel and accessed through the perf_event_open system call, which enables user-space tools to configure, start, stop, and read counters.

Events can be attached to a specific context like process, CPU core or control group, but Joule Profiler only supports per-process counters at the moment.

Event Types

perf_event supports several categories of events:

Event Type	Description
Hardware	CPU-provided events such as cycles, instructions, cache references, and cache misses.
Software	Kernel-provided counters such as context switches, page faults, and CPU migrations.
Hardware Cache	Cache-specific events broken down by cache level, operation, and result.
Tracepoint	Kernel tracepoints for observing system-level behavior.
Raw	Architecture-specific events accessed via raw event codes.
PMU-specific	Events exposed by hardware-specific Performance Monitoring Unit (PMU) like RAPL counters

perf_event event types:

perf_event events

Note

The availability and accuracy of events depend on the underlying hardware PMU and kernel configuration.

Access to certain events may require elevated privileges or relaxed kernel settings, see perf_event_paranoid.

How It Works

Kernel Counter Management

The Linux kernel manages performance counters through the perf_event subsystem. When a perf_event counter is opened, the Linux kernel allocates a counter and periodically reads the hardware Performance Monitoring Unit (PMU) registers. It maintains counter values across context switches and timer interrupts.

Counter Multiplexing

Hardware PMUs have limited counters (typically 4-8 per CPU core). When more events are requested than available counters, the kernel multiplexes them. The kernel multiplexing organize counters into groups and rotates them around every millisecond, meaning that only one group of counters can run at a time. To solve this issue, perf_event keeps track of when the timer is enabled and when it’s actually running, allowing to scale them when they are multiplexed. However, counter scaling can introduce an error because it is an estimation of the global value out of different local ones, thus weakening the detection of small variabilities.

Memory-Mapped Access

The standard read() system call requires a context switch (100-300ns overhead). The kernel offers an mmap interface for lower overhead. It maps a kernel memory page into user space, thus allowing to read counter state directly from this share memory via CPU instructions (e.g., rdpmc) without any system call taking 10 to 50 nanoseconds.

This is useful for high-frequency monitoring where system call overhead would be significant.

For now, Joule Profiler is using the read syscall, but we may implement an mmap version to reduce the overhead of the syscalls.

References

perf_event_paranoid

Overview

perf_event_paranoid is a kernel tunable that controls access permissions to performance monitoring events. It determines what level of privilege is required to use various perf_event features.

This setting is crucial for security, as unrestricted access to performance counters can potentially leak sensitive information about system behavior and other processes.

Permission Levels

The /proc/sys/kernel/perf_event_paranoid file accepts integer values that define access restrictions:

Level	Description	Access restrictions
-1	No restrictions	All users can access all events, including kernel profiling and CPU-specific events. Not recommended.
0	Relaxed (default on some systems)	Unprivileged users can perform per-process profiling but cannot profile kernel space or other users’ processes.
1	Moderate	Unprivileged users can only access CPU events (cycles, instructions). No access to kernel profiling or tracepoints.
2	Restricted (common default)	Unprivileged users cannot use `perf_event_open` at all. Only `CAP_PERFMON` or `CAP_SYS_ADMIN` capabilities allowed.
3	Fully restricted	Denies all access to perf events, even for privileged processes (rarely used).
4	Maximum restriction	Complete lockdown of perf subsystem.

Note

The default value varies by distribution but is set to 2 by default on most distro.

Checking Current Setting

# View current paranoid level
cat /proc/sys/kernel/perf_event_paranoid

Configuring perf_event_paranoid

# Set to level 1 (moderate restrictions)
sudo sysctl kernel.perf_event_paranoid=1

# Or directly write to proc
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Security Implications

Performance counters can expose sensitive information about other processes. An attacker could do a side-channel attack and measure execution time to infer cryptographic keys or other secrets, he could also observe cache behavior to extract data.

Safe Practices

Use capabilities instead of paranoid level:

# Grant CAP_PERFMON to specific binaries
sudo setcap cap_perfmon=ep /path/to/joule-profiler

Limit access to specific users:

# Add user to perf_users group (if your distro supports it)
sudo usermod -aG perf_users $USER

Run with sudo when needed:
```
sudo perf stat -e cycles ./my_program
```

Troubleshooting

“Permission denied” errors

Error: perf_event_paranoid level is 1, try setting it to 0 or launch Joule Profiler with root rights

Solution: Either lower perf_event_paranoid level, grant Joule Profiler CAP_PERFMON capability or launch it with root privileges (sudo).

Note

To access RAPL counters using perf_event, you need to set perf_event_paranoid level to 0, or launch the profiler with root privileges.

References

Accuracy and Limitations

Measurement Scopes

Events can be measured at different execution scopes but Joule Profiler only support per-process counters.

Scope	Description
Per-thread	Counts events generated by a single thread.
Per-process	Aggregates events across all threads of a process.
Per-CPU	Counts events occurring on a specific CPU core.
System-wide	Aggregates events across all CPUs.

Important

Some events like RAPL power counters are always system-wide and cannot be scoped to individual processes or threads.

Multiplexing and Accuracy

Hardware has limited performance counters (typically 4-8 per CPU core). When measuring more events than available counters, the kernel multiplexes them by rotating access, which can introduce overhead.

Multiplexed counters produce scaled estimates: scaled_value = value * (time_enabled / time_running), introducing measurement error. The more the ratio is close to one, the more the counters are accurate.

Examples

This section provides practical examples to help you get started with Joule Profiler.

Examples in Documentation

Throughout this documentation, you’ll find examples showing common use cases and best practices.

Examples in Repository

The Joule Profiler repository contains complete working examples:

Example Programs (examples/programs/): Sample programs in various languages.
Example Outputs (examples/data/): Real profiling results in all supported formats (terminal, JSON, CSV).

Clone the repository and explore the examples/ directory to see real-world usage and expected output formats, else see the differents examples:

Minimal Example
Phases Detection
Listing Sensors
Outputs
Use the Library
Source Implementation

Minimal Example

This guide shows the minimal example to use Joule Profiler.

Minimal Measurement

Measure any program’s energy consumption:

# Measure a simple command
sudo joule-profiler profile -- sleep 1

# Measure a Python script
sudo joule-profiler profile -- python my_script.py

# Measure a compiled program with arguments
sudo joule-profiler profile -- ./my-program arg1 arg2

Basic Phase Detection

Add phase markers to your program:

Python example:

# my_script.py
import time

print("__START__", flush=True)
time.sleep(1)
print("__END__", flush=True)

The profiler will measure energy separately for the __START__ to __END__ phase.

Phase Detection

This guide shows how to add phase markers to your programs for energy profiling.

Basic Concept

Print tokens to stdout and detect different sections of your code. The profiler detects these tokens and measures each section separately.

Add Phase Markers

Here are several examples in different languages:

Python:

print("__INIT__", flush=True)
load()

print("__PROCESSING__", flush=True)
process()

printf("__INIT__\n");
fflush(stdout);
load();

printf("__PROCESSING__\n");
fflush(stdout);
process();

Rust:

#![allow(unused)]
fn main() {
println!("__INIT__");
std::io::stdout().flush().unwrap();
load();

println!("__PROCESSING__");
std::io::stdout().flush().unwrap();
process();
}

Important

Always flush stdout immediately after printing tokens, buffered output may not be detected in time.

Choose a Token Pattern

Create a regex pattern that matches your tokens:

# Simple pattern matching __WORD__
--token-pattern "__[A-Z_]+__"

# Custom pattern matching [WORD]
--token-pattern "\[A-Z_]+\]"

# Specific tokens only
--token-pattern "INIT|WORK|END"

Common patterns:

__[A-Z_]+__ - Matches __INIT__, __WORK__, __END__
<<<.*>>> - Matches <<<phase1>>>, <<<phase2>>>
\[PHASE-[0-9]+\] - Matches [PHASE-1], [PHASE-2]

Run the Profiler

sudo joule-profiler profile --token-pattern "__[A-Z_]+__" -- python my_script.py

Joule Profiler will profile the program by detecting tokens matching the configured pattern in the standard output. It will make a measurement between each token to report energy and various metrics across phases.

Complete Example

script.py:

import time

print("__LOAD__", flush=True)
data = [i for i in range(1000000)]
time.sleep(0.5)

print("__COMPUTE__", flush=True)
result = sum(data)
time.sleep(0.5)

print("__DONE__", flush=True)

Command:

sudo joule-profiler profile --token-pattern "__[A-Z_]+__" -- python script.py

Output:

Phase 0: __LOAD__ → __COMPUTE__
  Duration: 502 ms
  package-0: 1.2 J

Phase 1: __COMPUTE__ → __DONE__
  Duration: 501 ms
  package-0: 1.5 J

If you encounter some issues with phases, see troubleshooting.

Sensors Listing Example

This example shows how to list all available sensors using Joule Profiler.

Minimal Example

./target/debug/joule-profiler list-sensors

Note

You should be able to list sensors without root privileges.

And it shows:

╔════════════════════════════════════════════════╗
║  Available Sensors                             ║
╚════════════════════════════════════════════════╝

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  Name                 | Unit 
 ────────────────────────────────────────────────
  PACKAGE-0            | µJ
  CORE-0               | µJ
  UNCORE-0             | µJ
  PSYS                 | µJ

With GPU support

If you want to list also your GPU devices, use the --gpu CLI flag:

./target/debug/joule-profiler --gpu list-sensors

╔════════════════════════════════════════════════╗
║  Available Sensors                             ║
╚════════════════════════════════════════════════╝

┌────────────────────────────────────────────────┐
│ NVML                                           │
└────────────────────────────────────────────────┘
  Name                 | Unit 
 ────────────────────────────────────────────────
  GPU-0                | mJ

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  Name                 | Unit 
 ────────────────────────────────────────────────
  PACKAGE-0            | µJ
  CORE-0               | µJ
  UNCORE-0             | µJ
  PSYS                 | µJ

With perf_event support

If you want to list also your GPU devices, use the --perf CLI flag:

./target/debug/joule-profiler --perf list-sensors

╔════════════════════════════════════════════════╗
║  Available Sensors                             ║
╚════════════════════════════════════════════════╝

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  Name                 | Unit 
 ────────────────────────────────────────────────
  PACKAGE-0            | µJ
  CORE-0               | µJ
  UNCORE-0             | µJ
  PSYS                 | µJ

┌────────────────────────────────────────────────┐
│ perf_event                                     │
└────────────────────────────────────────────────┘
  Name                 | Unit 
 ────────────────────────────────────────────────
  CPU_CYCLES           | count
  INSTRUCTIONS         | count
  CACHE_MISSES         | count
  BRANCH_MISSES        | count

Output Formats

Joule Profiler supports several output formats to accommodate different usage scenarios and workflows. The terminal format is the default and is intended for quick inspection and human-readable feedback. It displays structured results directly in the console, including the command executed, phase durations, and per-source metrics. This format is ideal when you want to observe the results interactively or debug a profiling run. For details, see Terminal Output Format. For now, the output formats are only available using the CLI.

For automated processing, further analysis, or integration with pipelines, Joule Profiler can export results in JSON format using the --json CLI flag. JSON provides a structured representation of the profiling session, including command information, phase metadata, metric values, and iteration details when applicable. This format is well-suited for programmatic consumption, logging, or exporting to other tools. See JSON Output Format for more information.

When metrics need to be analyzed in spreadsheet software, notebooks or imported into other tabular tools, the profiler can generate a CSV output via the --csv flag. Each metric is represented as a row with associated phase, iteration, and source information. This flattened format facilitates aggregation, plotting, or other analyses where a row-per-metric structure is advantageous. More details can be found in CSV Output Format.

Overall, the choice of output format depends on your workflow: the terminal is optimized for immediate human inspection, JSON is designed for programmatic workflows, and CSV is suited for tabular analyses and external data processing.

In the future, the output formats might be moved from the CLI module to another module, enabling users to use them outside of the CLI when they’re using the library.

For detailed example outputs, check the different output formats examples:

Terminal examples
JSON examples
CSV examples

Terminal Output Format

When using Joule Profiler in the terminal (default output), the results are displayed in a human-readable, structured format.
This section explains the different sections and the information shown.

The header includes theses fields by default:

Phase name - Interval covered by the phase (can include custom tokens if the program outputs them).
Duration - Time spent in this phase.
Start / End token - Detected markers in the program output.

Phases mark sections of the program you want to measure. By default, the profiler includes a single phase: START -> END.

╔════════════════════════════════════════════════╗
║  Phase: START -> END                           ║
╚════════════════════════════════════════════════╝
  Duration            :       1878 ms
  Start token         :      START
  End token           :        END

After the phase information, metrics from each source are displayed:

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :   46068364 µJ
  DRAM-0              :    1287350 µJ
  PACKAGE-0           :   66901990 µJ

Metrics are grouped by metric source (e.g., powercap, nvidia-nvml, perf, etc.)

Note

All metrics are reported for each phase.

Profile Example

The command:

joule-profiler profile -- python3 nbody.py 500000

Shows:

╔════════════════════════════════════════════════╗
║  Command                                       ║
╚════════════════════════════════════════════════╝
  python3 nbody.py 500000

╔════════════════════════════════════════════════╗
║  Phase: START -> END                           ║
╚════════════════════════════════════════════════╝
  Duration            :       1878 ms
  Start token         :      START
  End token           :        END

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :   46068364 µJ
  DRAM-0              :    1287350 µJ
  PACKAGE-0           :   66901990 µJ

Listing Sensors

Here is an example of sensors listing in terminal format:

joule-profiler list-sensors

Outputs:

╔════════════════════════════════════════════════╗
║  Available Sensors                             ║
╚════════════════════════════════════════════════╝
  Name                 | Unit  | Source         
 ────────────────────────────────────────────────
  PSYS-1               | µJ    | RAPL (perf_event)       
  PACKAGE-0            | µJ    | RAPL (perf_event)
  CORE-0               | µJ    | RAPL (perf_event)
  UNCORE-0             | µJ    | RAPL (perf_event)

JSON Output Format

When using Joule Profiler with --json, results are exported as structured JSON.
This format is suitable for programmatic processing, analysis, or exporting to other tools.

Fields

A JSON output of Joule Profiler contains these fields:

Field	Description
`command`	Program and arguments executed
`token_pattern`	Regex used to detect phase tokens
`exit_code`	Return code of the program
`phases`	Array of measured phases
`index`	Phase index
`start_token` / `end_token`	Phase markers
`timestamp`	Start timestamp
`duration_ms`	Phase duration in milliseconds
`metrics`	List of metrics collected by sources
`metrics[].name`	Sensor name
`metrics[].value`	Measured value
`metrics[].unit`	Unit of measurement (e.g., µJ)
`metrics[].source`	Metric source name (e.g., `RAPL`, `NVML`)

Profile Example

The command:

joule-profiler --json profile -- python3 nbody.py 500000

Displays:

{
  "command": "python3 nbody.py 500000",
  "token_pattern": "__[A-Z0-9_]+__",
  "exit_code": 0,
  "phases": [
    {
      "index": 0,
      "start_token": "START",
      "end_token": "END",
      "timestamp": 1769987285805,
      "duration_ms": 1895,
      "metrics": [
        {
          "name": "CORE-0",
          "value": 45010261,
          "unit": "µJ",
          "source": "powercap"
        },
        {
          "name": "DRAM-0",
          "value": 1302425,
          "unit": "µJ",
          "source": "powercap"
        },
        {
          "name": "PACKAGE-0",
          "value": 51537160,
          "unit": "µJ",
          "source": "powercap"
        }
      ]
    }
  ]
}

Listing Sensors

Here is an example of sensors listing in json format:

joule-profiler --json list-sensors

[
  {
    "name": "PSYS-1",
    "unit": "µJ",
    "source": "powercap"
  },
  {
    "name": "PACKAGE-0",
    "unit": "µJ",
    "source": "powercap"
  },
  {
    "name": "CORE-0",
    "unit": "µJ",
    "source": "powercap"
  },
  {
    "name": "UNCORE-0",
    "unit": "µJ",
    "source": "powercap"
  }
]

CSV Output Format

Joule Profiler can also export results as CSV using the --csv CLI flag.
This format is suitable for spreadsheets, scripting, or import into analysis tools.

CSV Header

The CSV file contains the following columns:

Column	Description
`phase_id`	Phase index
`phase_name`	Phase name, e.g., `"START -> END"`
`phase_duration_ms`	Duration of the phase in milliseconds
`metric_name`	Name of the metric (sensor)
`metric_value`	Measured value
`metric_unit`	Unit of the measurement (e.g., µJ)
`metric_source`	Source of the metric (e.g., `powercap`)
`start_token`	Phase start token
`end_token`	Phase end token
`start_line`	Line number in program output where phase started (optional)
`end_line`	Line number in program output where phase ended (optional)
`timestamp`	Start timestamp of the phase (ms since epoch)
`command`	Command executed
`exit_code`	Program exit code
`token_pattern`	Regex used to detect phase tokens

Profile Example

phase_id;phase_name;phase_duration_ms;metric_name;metric_value;metric_unit;metric_source;start_token;end_token;start_line;end_line;timestamp;command;exit_code;token_pattern
0;"START -> END";1859;CORE-0;45935552;µJ;powercap;START;END;;;1769987854341;"python3 nbody.py 500000";0;"__[A-Z0-9_]+__"
0;"START -> END";1859;DRAM-0;1283811;µJ;powercap;START;END;;;1769987854341;"python3 nbody.py 500000";0;"__[A-Z0-9_]+__"
0;"START -> END";1859;PACKAGE-0;66560987;µJ;powercap;START;END;;;1769987854341;"python3 nbody.py 500000";0;"__[A-Z0-9_]+__"

Which gives:

phase_name	phase_duration_ms	metric_name	metric_value	metric_unit	metric_source	start_token	end_token	timestamp	command	token_pattern
START -> END	1859	CORE-0	45935552	µJ	powercap	START	END	1769987854341	python3 nbody.py 500000	[A-Z0-9]+_
START -> END	1859	DRAM-0	1283811	µJ	powercap	START	END	1769987854341	python3 nbody.py 500000	[A-Z0-9]+_
START -> END	1859	PACKAGE-0	66560987	µJ	powercap	START	END	1769987854341	python3 nbody.py 500000	[A-Z0-9]+_

Listing Sensors

Here is an example of sensors listing in CSV format:

sensor;unit;source
PSYS-1;µJ;powercap
PACKAGE-0;µJ;powercap
CORE-0;µJ;powercap
UNCORE-0;µJ;powercap

which gives the following table:

sensor	unit	source
PSYS-1	µJ	powercap
PACKAGE-0	µJ	powercap
CORE-0	µJ	powercap
UNCORE-0	µJ	powercap

Note

CSV is semicolon-separated for compatibility with most spreadsheet software.

Each row represents one metric for a single phase.

Multiple iterations are flattened in sequence, making it easy to filter by iteration_id.

Programmatic tools can aggregate phases, iterations, and sources easily using the CSV columns.

Source Implementation

Implementing a new metric source in Joule Profiler is straightforward. By implementing the MetricReader trait, you only need to define the core measurement logic (measure, retrieve, get_sensors, to_metrics, get_name) and optionally override init or join if your source requires setup or polling logic. This design makes it easy to add new sources without boilerplate.

use joule_profiler_core::{
    sensor::{Sensor, Sensors},
    source::{MetricReader, MetricSourceError},
    types::{Metric, Metrics},
    unit::{MetricUnit, Unit, UnitPrefix},
};

const MY_SOURCE_UNIT: MetricUnit = MetricUnit {
    prefix: UnitPrefix::None,
    unit: Unit::Count,
};

#[derive(Default)]
struct MySource {
    count: u64,
}

impl MySource {
    pub fn new() -> Self {
        Self::default()
    }
}

impl MetricReader for MySource {
    type Type = u64;

    type Error = MetricSourceError;

    async fn measure(&mut self) -> Result<(), Self::Error> {
        self.count += 1;
        Ok(())
    }

    async fn retrieve(&mut self) -> Result<Self::Type, Self::Error> {
        let count = self.count;
        self.count = 0;
        Ok(count)
    }

    fn get_sensors(&self) -> Result<Sensors, Self::Error> {
        let sensor = Sensor::new("value", MY_SOURCE_UNIT, "my_source")
        Ok(vec![sensor])
    }

    fn to_metrics(&self, count: u64) -> Result<Metrics, Self::Error> {
        let metric = Metric::new("value", count, MY_SOURCE_UNIT, "my_source");
        Ok(vec![metric])
    }

    fn get_name() -> &'static str {
        "my_source"
    }
}

Overview

Joule Profiler is designed to minimize measurement overhead while maintaining high performance, modularity, extensibility, and a strong separation of concerns.

This architecture enables an efficient asynchronous scheduling and minimizes the overhead introduced into the results. It makes easy the integration of new metric sources and allows user to define their own metric sources extensions without modifying the core.

flowchart LR
%%{init: {'flowchart': {'nodeSpacing': 20, 'rankSpacing': 30}}}%%

subgraph Sources
    RAPL
    NVML(Nvidia-Nvml)
    PERF(Perf events)
end

JouleProfiler((JouleProfiler))
Orchestrator

subgraph OUTPUTS[Output formats]
    Terminal
    JSON
    CSV
end

JouleProfiler e1@-->|Measure phases| Orchestrator
e1@{ animate: true }

JouleProfiler -->|Retrieve event| Orchestrator
Orchestrator -->|Join + retrieve results| JouleProfiler

Orchestrator e2@-->|Schedule| Sources
e2@{ animate: fast }

Orchestrator -->|Join tasks| Sources
Orchestrator e3@-->|Measure events| Sources
e3@{ animate: true }

JouleProfiler --> Terminal
JouleProfiler --> JSON
JouleProfiler --> CSV

linkStyle 0 stroke:#e67e22, stroke-width:2px
linkStyle 1 stroke:#e74c3c, stroke-width:2px
linkStyle 2 stroke:#e74c3c, stroke-width:2px
linkStyle 3 stroke:#9b59b6, stroke-width:2px
linkStyle 4 stroke:#e74c3c, stroke-width:2px
linkStyle 5 stroke:#e67e22, stroke-width:2px
linkStyle 6 stroke:#27ae60, stroke-width:2px
linkStyle 7 stroke:#27ae60, stroke-width:2px
linkStyle 8 stroke:#27ae60, stroke-width:2px

At a high level, Joule Profiler is composed of three main layers:

Core Module - Contains all domain logic: orchestration, aggregation, and result modeling.
CLI Module - Responsible for user input, command line arguments parsing, and startup wiring.
Sources Workspace - Implementations of the different metric sources using the API traits.

Core Module

The Core Module is the engine of Joule Profiler, it defines the profiling model, coordinates measurements, and produces structured results.

The Core Module is responsible for the management and coordination of metric sources. It collects and aggregates results and exposes an easy to use API. It contains no user interface logic and no hardware-specific code.

The core coordinates all metric sources through a central orchestrator. It sends events to manage the metric sources and make the measurements. At the end of the profiling, it collects all the data from sources and produces a unified result set. Metric sources never interact with each other directly. All coordination flows through the core and is hidden from the sources.

This design ensure that the implementation of new metric sources, output formats, or CLI features should not require changes to the core domain logic. This ensures that existing workflows remain stable while the ecosystem grows.

During a profiling session, the core controls when measurements are made. Measurements are associated with phase boundaries and accumulated across iterations. All raw data is collected first, then processed once the measurements complete. This separation ensures consistent results and reduce the measurements overhead.

CLI Module

The CLI Module is the straightforward entry point for users.

It parses command-line input and configure the profiler to be usable quickly and easily while providing several configuration options. The CLI acts as an adapter between the user, the core domain, and the sources. Because of this separation, the CLI can evolve independently from the core logic and sources.

Here’s a summary of all the CLI arguments:

Argument	Short	Long	Default	Description
`verbose`	`-v`	`--verbose`	`0`	Verbosity level. Stack for more detail: `-v`, `-vv`, `-vvv`.
`rapl_path`		`--rapl-path`	Auto-detected	Override the base path for Intel RAPL counters. Falls back to `$JOULE_PROFILER_RAPL_PATH`, then `/sys/devices/virtual/powercap/intel-rapl`.
`sockets`	`-s`	`--sockets`	All	CPU sockets to measure (e.g., `0` or `0,1`).
`json`		`--json`	`false`	Export results as JSON. Conflicts with `--csv`.
`csv`		`--csv`	`false`	Export results as semicolon-separated CSV. Conflicts with `--json`.
`output_file`	`-o`	`--output-file`	`data<TIMESTAMP>.csv/json`	Custom output file path for CSV/JSON export.
`gpu`		`--gpu`	`false`	Enable GPU measurement support.
`perf`		`--perf`	`false`	Enable `perf_event` hardware counters.
`rapl_backend`		`--rapl-backend`	`Perf`	Choose RAPL backend: `powercap` or `perf`.
`command`		(subcommand)	(required)	The subcommand/program to execute and profile.

The subcommand profile has also some arguments:

Argument	Short	Long	Default	Description
`token_pattern`		`--token-pattern`	`__[A-Z0-9_]+__`	Regex to detect phase tokens in the program’s stdout. If the pattern contains a capture group, the captured text becomes the token name. Phases are computed between consecutive tokens (and from START/END).
`stdout_file`	`-o`	`--stdout-file`	None	Redirect the profiled program’s stdout to a file.
`rapl_polling`		`--rapl-polling`	None	RAPL counter polling frequency, in seconds.

The profiled command is provided after -- at the end of the arguments.

Sources Workspace

The Sources Workspace contains all metric source implementations (e.g., RAPL, perf_event, NVML, etc.). Each source is responsible for collecting a specific type of measurement data.

Metric sources are responsible for collecting raw measurement data and maintaining any internal state required for measurement. Sources never perform aggregation logic but provide functions to transform raw data to aggregated metrics at the end of the profiling.

Each source is isolated and independent from other sources and can be configured independently. This allows users to enable only the sources they need and makes it easy to add new ones.

The workspace is designed to support various sources, including built-in sources maintained by the project or user-defined sources. New sources can be added without modifying the core or the CLI, as long as they follow the expected interface.

Development Setup

This section explains how to set up a local development environment for Joule Profiler, build the project, run tests, and contribute effectively.

1. Prerequisites

Before starting, ensure you have the following installed:

System Requirements

Linux OS (Ubuntu, Debian, Fedora, Arch, etc.)
Intel CPU with RAPL support (most CPUs since Sandy Bridge, 2011)
Root access (required to read RAPL counters with Powercap backend)

Software Dependencies

rustc 1.70+ and cargo (official Rust installer recommended)

2. Build the Project

Build in release mode:

cargo build --release

This will produce the binary at:

target/release/joule-profiler

You can run it directly using:

./target/release/joule-profiler --version

For debug mode (faster building during development):

cargo build
./target/debug/joule-profiler --version

3. Run Tests

Run the full test suite with:

cargo test

For faster iteration during development:

cargo test --lib        # only library tests
cargo test --bins       # only binary/tests for CLI
cargo test --doc       # only doctests
cargo test -- --nocapture  # show test output

4. Formatting & Linting

Ensure code follows project style using rustfmt and clippy:

cargo fmt       # format code
cargo clippy    # lint code for warnings and suggestions

Note

Recommended to run these before committing changes and required for submitting pull requests.

5. Development Tips

Use the examples/ folder to experiment with different scripts and phase token patterns.
Use logging flags (-v, -vv, -vvv) for debugging purposes.

6. Notes

Note

Energy measurements require Intel CPUs and Linux with the powercap framework or perf_event kernel module.

Running programs under a virtual machine may give inaccurate readings due to limited access to RAPL counters.

Root access is required for RAPL counters, but you can test some library functionality without root.

By following these steps, you’ll have a fully functional development environment for Joule Profiler and be ready to contribute or test new features.

Adding a Custom Source

Joule Profiler is built on a modular architecture designed for extensibility. This allows users to integrate new metric sources such as custom hardware sensors, software counters, or external APIs without modifying the core profiling engine.

Trait Implementation

To create a valid source for Joule Profiler, your struct must implement the MetricReader trait.
Some provided methods are optional.

See the source implementation example to understand with a minimal example how to implement a source.

pub trait MetricReader: Send + 'static {
    /// Type of metrics returned by the reader.
    type Type: MetricReaderTypeBound;

    /// Error type produced by the reader.
    type Error: MetricReaderErrorBound;

    //
    // Optional methods
    //

    /// Initialize the source before measurements.
    fn init(&mut self) -> impl Future<Output = Result<(), Self::Error>> + Send;

    /// Cleanup or join logic after measurements.
    fn join(&mut self) -> impl Future<Output = Result<(), Self::Error>> + Send;

    //
    // Mandatory methods
    //

    /// Perform a measurement and update internal state.
    fn measure(&mut self) -> impl Future<Output = Result<(), Self::Error>> + Send;

    /// Retrieve the current metrics snapshot.
    fn retrieve(&mut self) -> impl Future<Output = Result<Self::Type, Self::Error>> + Send;

    /// Return all sensors provided by this source.
    fn get_sensors(&self) -> Result<Sensors, Self::Error>;

    /// Convert a snapshot into Joule Profiler metrics.
    fn to_metrics(&self, result: Self::Type) -> Result<Metrics, Self::Error>;

    /// Return the static name of the metric source.
    fn get_name() -> &'static str;
}

Best Practices

When implementing a metric source, keep the measure method lightweight and fast; any heavy computation or data processing should be done in to_metrics to ensure measurements do not introduce overhead or slow down profiling.

How to Register a Source

Once a source is implemented, adding it to the profiler is straightforward. You simply instantiate your source and register it with the JouleProfiler instance before starting to measure.

use joule_profiler::JouleProfiler;
use my_custom_source::MySource;

#[tokio::main]
async fn main() {
    // 1. Create the profiler
    let mut profiler = JouleProfiler::new();

    // 2. Instantiate your custom source
    let my_source = MySource::new();

    // 3. Register the source
    profiler.add_source(my_source);

    // 4. Start profiling
    let results = profiler.profile().await.unwrap();

    // 5. Use the results as you need
}

Contributing to Joule Profiler

First off, thank you for considering contributing to Joule Profiler ! We welcome contributions of all kinds: bug reports, documentation improvements, examples, or new features.

By contributing, you agree that your contributions will be licensed under the MIT License.

How to Contribute

1. Reporting Issues

If you find a bug, unexpected behavior, or have a feature request:

Check the GitHub repositoy issues to see if it has already been reported.
Open a new issue with a clear description of:
- Steps to reproduce (if applicable)
- Expected behavior
- Observed behavior
- Environment (OS, CPU, GPU, Joule Profiler version, etc.)
- Provide examples

2. Suggesting Improvements

We love feedback on improving the project. You can suggest improvements via:

Opening a GitHub issue labeled as enhancement.
Sharing examples, documentation updates, or performance suggestions.

3. Code Contributions (Pull Requests)

If you want to contribute code:

Make your changes in a branch in your fork.
Ensure existing tests pass and add new tests if needed.
Update documentation if your changes affect usage or examples.
Open a pull request to the main repository.
Reference or label related issues clearly in your PR.

Troubleshooting

Overview

Energy profiling operates at a low level, directly interfacing with hardware counters and kernel subsystems. This introduces several challenges:

Hardware Dependencies: Energy counters (RAPL, perf_event) require specific CPU support and may be disabled in BIOS
Privilege Requirements: Reading hardware counters typically requires elevated permissions or specific kernel configurations
Measurement Sensitivity: Energy readings are affected by system state, background processes, and thermal conditions
Platform Variations: Different CPUs, kernel versions, and system configurations expose metrics differently

Most issues stem from these fundamental constraints rather than bugs in the profiler itself.

See [common-issues](common issues.md) if you encounter an issue with the profiler and do not find it in the common issues, then please submit it on the GitHub repository and we will work on a solution.

Common Issues

Permission Denied

Problem: Cannot read RAPL counters or perf event.

Solutions: Run with root privileges when using powercap backend or configure perf_event_paranoid configuration for perf_event backend.

No RAPL Domains Found

Problem: No RAPL domains found.

This issue happens with powercap backend.

Check available sensors:

sudo joule-profiler list-sensors

Solutions:

Update Linux kernel to 3.13+ (required for RAPL support)
Check BIOS settings (some systems disable RAPL in firmware)
Verify CPU supports RAPL (most Intel CPUs since Sandy Bridge/2011, AMD since Zen)

Try specifying custom RAPL path:

sudo joule-profiler --rapl-path /sys/class/powercap/intel-rapl phases -- ./my-program

High Variance in Measurements

Problem: Energy measurements vary significantly between iterations.

Why This Happens

Short Program Duration: For programs running under 100ms, measurement overhead and system noise become proportionally significant:

Profiler initialization (spawning, pausing, counter attachment)
OS scheduling decisions and cache state variations
Hardware counter granularity relative to execution time
It also depends on the hardware energy management, on laptops the variance is significantly higher than on desktop computers, due to the energy saving policies.

System Activity: Background processes, thermal throttling, and frequency scaling cause variations.

Solutions

# 1. Use multiple iterations to reduce variance
sudo joule-profiler sphasese -n 20 -- ./my-program

# 2. Enable logging to see warnings and diagnostics
sudo joule-profiler -vv phases -- ./my-program

# 3. Minimize background processes
# Close browsers, IDEs, file syncing, etc.

# 4. Disable CPU frequency scaling (optional, for more stable results)
sudo cpupower frequency-set --governor performance

# 5. Profile longer-running programs
# Extend your workload or loop it internally:
./my-program --repeat 1000

For Short Programs: If your program must run quickly, increase internal iteration count rather than relying on profiler iterations.

Tokens Not Detected (Phases Mode)

Problem: No phases computed, warning “No tokens matching pattern”.

Check with logging:

sudo joule-profiler -v phases -- ./my-program

Solutions:

Verify tokens are printed to stdout (not stderr):

# Correct (stdout)
print("__INIT__")

# Wrong (stderr)
import sys
print("__INIT__", file=sys.stderr)

Incorrect phases measures

Solution: Flush the stdout on each print, otherwise the prints will be buffered and the phases will be merged.

Example output without stdout flushing:

╔════════════════════════════════════════════════╗
║  Phase: START -> __WORK_START__                ║
╚════════════════════════════════════════════════╝
  Duration            :        109 ms
  Start token         :      START
  End token           : __WORK_START__ (line 1)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    9317199 µJ
  DRAM-0              :     126281 µJ
  PACKAGE-0           :   10607910 µJ

╔════════════════════════════════════════════════╗
║  Phase: __WORK_START__ -> __WORK_END__         ║
╚════════════════════════════════════════════════╝
  Duration            :          0 ms
  Start token         : __WORK_START__ (line 1)
  End token           : __WORK_END__ (line 2)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :       5187 µJ
  DRAM-0              :          0 µJ
  PACKAGE-0           :          0 µJ

We can see that the WORK_START -> WORK_END phase is during 0ms, which is the longest running phase. To fix this issue, we need to flush the stdout on each print to avoid buffering:

# Python
print("__WORK_START__", flush=True)
# ... work ...
print("__WORK_END__", flush=True)

#![allow(unused)]
fn main() {
// Rust
println!("__WORK_START__");
use std::io::{self, Write};
io::stdout().flush().unwrap();
}

// C
printf("__WORK_START__\n");
fflush(stdout);

Which gives the following results:

════════════════════════════════════════════════╗
║  Phase: START -> __WORK_START__                ║
╚════════════════════════════════════════════════╝
  Duration            :         25 ms
  Start token         :      START
  End token           : __WORK_START__ (line 1)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    2244689 µJ
  DRAM-0              :      43273 µJ
  PACKAGE-0           :    2518737 µJ

╔════════════════════════════════════════════════╗
║  Phase: __WORK_START__ -> __WORK_END__         ║
╚════════════════════════════════════════════════╝
  Duration            :         77 ms
  Start token         : __WORK_START__ (line 1)
  End token           : __WORK_END__ (line 2)

┌────────────────────────────────────────────────┐
│ RAPL (perf_event)                              │
└────────────────────────────────────────────────┘
  CORE-0              :    6755859 µJ
  DRAM-0              :      80505 µJ
  PACKAGE-0           :    7568481 µJ

Here, we can see that the WORK_START -> WORK_END has more accurate results due to stdout flushing.

Invalid Regex Pattern

Problem: Error “Invalid regex pattern”

Common mistakes:

# Wrong: unescaped special characters
--token-pattern "[INIT]"

# Correct: escape brackets
--token-pattern "\[INIT\]"

# Wrong: unclosed group
--token-pattern "(INIT"

# Correct: balanced parentheses
--token-pattern "(INIT)"

Solution: Test your regex pattern before profiling:

Use online tools: https://regex101.com/
Test with grep: echo "your output" | grep -E "your-pattern"

Counter Overflow Warning

Problem: Warning about energy counter at 90%+ of max range on RAPL powercap backend.

Explanation: This is informational, not an error. Hardware energy counters have a maximum value before they wrap around. The profiler detects when counters approach this limit and handles overflow automatically.

Solution: This is normal for long-running systems or high power consumption. The measurement remains accurate - the warning is just informing you that internal counter wrapping was handled.

Missing Metrics

Problem: Expected metrics don’t appear in results.

Solutions:

Verify your hardware supports the requested counters (e.g., RAPL, perf_event support)
Check that metric sources are properly initialized (use -vv for detailed logging)
Ensure you have necessary permissions for the metric source
List available sensors: sudo joule-profiler list-sensors

High Overhead on Very Frequent Measurements

Problem: Profiler seems to slow down the program significantly

Cause: Too many phase transitions (measuring every few milliseconds)

Solution: Reduce measurement frequency:

Use phases only for major sections (not every small operation)
Batch work between phase tokens
Aim for phases lasting at least 100ms each

Keyboard shortcuts

Joule Profiler