Heap vs zero-heap: per-RTOS comparison
For each RTOS, how do per-call hot-path latencies differ between
heap mode (_create() / _destroy() API; kernel objects come from
the kernel's allocator) and zero-heap mode
(CONFIG_OVE_ZERO_HEAP=y; _init() / _deinit() API; caller-supplied
static storage; the heap is locked at ove_run() and any post-boot
allocation aborts the build or panics)?
What "heap-vs-zh" can and can't measure
The C-binding hot-path call site is the same FFI symbol in both
modes — ove_mutex_lock, ove_sem_take, … resolve to identical
code regardless of how the kernel object behind the handle was
allocated. So any C-column delta between heap and zh is not a
binding-side cost; it's the kernel itself behaving differently
because the underlying object lives at a different address with
different surrounding state.
Binding-specific deltas (C++/Rust/Zig vs C) come from a different mechanism (the wrapper code path) and are documented in the per-binding analysis. They are not repeated on this page, which compares the C column only.
Configuration
All numbers below are taken with
CONFIG_OVE_BENCHMARK_WORST_CASE_TIMING=y (Cortex-M7 I-cache, D-cache,
branch predictor, ART accelerator, and flash prefetch all disabled —
see benchmarks overview). The per-call iteration budget
was settled by a one-shot calibration pass (see "Iteration-count
calibration" on the overview page) and stays at 1 000 iterations on
the typical latency case.
Bucketing rule for verdicts: ±5% relative is treated as noise (scheduler jitter on a flash-fetch-bound execution path; absolute delta is typically <500 ns on the µs-range ops below). 5–10% is "marginal — worth a glance but not a structural signal". >10% is an outlier worth understanding.
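The bucketing rule above can be written as a small classifier. This is a minimal sketch, not the report generator's actual code: the function name `verdict` is hypothetical, and treating exactly 5% as noise and exactly 10% as marginal is an assumption, since the rule does not say which side the boundaries fall on.

```python
def verdict(delta_pct: float) -> str:
    """Bucket a heap-to-zero-heap delta (in percent) into a verdict label.

    Thresholds follow the bucketing rule: up to 5% is noise, 5-10% is
    marginal, above 10% is an outlier. The sign is ignored because the
    rule is symmetric. Boundary handling (<=) is an assumption.
    """
    magnitude = abs(delta_pct)
    if magnitude <= 5.0:
        return "noise"
    if magnitude <= 10.0:
        return "marginal"
    return "outlier"


if __name__ == "__main__":
    # Deltas taken from the tables on this page.
    for case, delta in [("sync/mutex_lock_unlock", -4.9),
                        ("timer/start_stop", 5.2),
                        ("stream/send_recv_64B", 59.7)]:
        print(f"{case}: {delta:+.1f}% -> {verdict(delta)}")
```

The tables also use labels such as "RTOS tick" and "see notes" for rows that need context beyond the percentage; the sketch does not try to reproduce those.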
FreeRTOS — C binding
| Case | Heap | Zero-heap | Δ% | Verdict |
|---|---|---|---|---|
| time/time_get_us_overhead | 883 ns | 1.1 µs | +24.6% | sub-µs, see notes |
| time/delay_1ms | 984.4 µs | 984.0 µs | 0% | RTOS tick |
| thread/yield | 4.5 µs | 4.4 µs | −2.2% | noise |
| thread/get_self | 2.6 µs | 2.5 µs | −3.8% | noise |
| thread/sleep_1ms | 984.4 µs | 984.1 µs | 0% | RTOS tick |
| thread/context_switch | 51.9 µs | 51.1 µs | −1.5% | noise |
| sync/mutex_lock_unlock | 8.1 µs | 7.7 µs | −4.9% | noise |
| sync/mutex_contention_2t | 8.2 µs | 8.8 µs | +7.3% | marginal |
| sync/sem_take_give | 6.5 µs | 6.1 µs | −6.2% | marginal |
| sync/event_signal_wait | 50.2 µs | 49.8 µs | −0.8% | noise |
| sync/condvar_signal_wait | 34.2 µs | 33.6 µs | −1.8% | noise |
| sync/recursive_mutex_lock_unlock | 10.2 µs | 9.4 µs | −7.8% | marginal |
| queue/send_receive | 8.2 µs | 8.6 µs | +4.9% | noise |
| queue/throughput_2t | 4.0 µs | 4.3 µs | +7.5% | marginal |
| timer/start_stop | 67.7 µs | 71.2 µs | +5.2% | marginal |
| eventgroup/set_get_bits | 7.3 µs | 6.8 µs | −6.8% | marginal |
| workqueue/submit_execute | 57.7 µs | 56.7 µs | −1.7% | noise |
| stream/send_recv_64B | 20.6 µs | 32.9 µs | +59.7% | real, see notes |
| stream/throughput | 31.2 µs | 41.9 µs | +34.3% | real, see notes |
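The Heap and Zero-heap columns mix ns, µs and ms, so recomputing a Δ% value by hand needs a unit-normalisation step first. The following is a minimal sketch of that step; the helper names `parse_ns` and `delta_pct` are hypothetical and are not part of the benchmark tooling, and the cell format is assumed to be a decimal number followed by a unit as printed in the tables here.

```python
import re

# Nanoseconds per unit. "us" is accepted as an ASCII spelling of "µs".
_SCALE_NS = {"ns": 1, "µs": 1_000, "us": 1_000, "ms": 1_000_000}
_CELL = re.compile(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(ns|µs|us|ms)\s*")


def parse_ns(cell: str) -> int:
    """Convert a table cell such as '1.1 µs' into integer nanoseconds."""
    # Normalise Greek mu (U+03BC) to the micro sign (U+00B5).
    match = _CELL.fullmatch(cell.replace("\u03bc", "\u00b5"))
    if match is None:
        raise ValueError(f"unrecognised latency cell: {cell!r}")
    value, unit = match.groups()
    return round(float(value) * _SCALE_NS[unit])


def delta_pct(heap_cell: str, zh_cell: str) -> float:
    """Relative change from heap to zero-heap, in percent of the heap value."""
    heap = parse_ns(heap_cell)
    zh = parse_ns(zh_cell)
    return (zh - heap) / heap * 100.0


if __name__ == "__main__":
    print(f"{delta_pct('883 ns', '1.1 µs'):+.1f}%")    # time/time_get_us_overhead
    print(f"{delta_pct('20.6 µs', '32.9 µs'):+.1f}%")  # stream/send_recv_64B
```

Because the published cells are already rounded, a recomputed Δ% can differ from the table in the last digit.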
Honest read on FreeRTOS.
Per-call sync (mutex, sem, condvar, event, queue, eventgroup) is within ±10% across modes, with most rows under ±5%. The few marginal deltas in either direction are static-vs-pool address differences showing up on a flash-fetch-bound path — neither systematically favours one mode. The two non-noise outliers are real:
- stream/send_recv_64B +60% (20.6 → 32.9 µs) and stream/throughput +34% (31.2 → 41.9 µs). This is a FreeRTOS-intrinsic regression, not a wrapper cost. The per-RTOS reports' wrapper-vs-native section shows the raw xStreamBuffer* baseline at 19.4 µs (heap) → 31.4 µs (zh): a ~12 µs gap in the native FreeRTOS API itself, not in the oveRTOS wrapper. oveRTOS adds ~1 µs of wrapper overhead in both modes. If xStreamBuffer* is on a hot path, FreeRTOS heap mode is the better choice.
- time_get_us_overhead +25% (883 → 1 100 ns). A ~200 ns regression on a sub-µs op: bench-harness loop overhead dominates, and the static-storage layout under zh shifts which flash bytes the loop reads first. No workload times anything at this granularity; flagged for completeness rather than concern.
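The wrapper-vs-native argument in the stream note is plain subtraction, and can be checked with the figures quoted there. The sketch below assumes that the quoted raw xStreamBuffer* baseline (19.4 → 31.4 µs) is the native counterpart of the stream/send_recv_64B row; the text does not state that pairing explicitly. All names are hypothetical.

```python
# Latencies in nanoseconds, copied from the FreeRTOS notes.
NATIVE_NS = {"heap": 19_400, "zh": 31_400}   # raw xStreamBuffer* baseline
WRAPPED_NS = {"heap": 20_600, "zh": 32_900}  # stream/send_recv_64B (assumed pairing)


def wrapper_overhead_ns(mode: str) -> int:
    """Cost added by the oveRTOS wrapper on top of the native call."""
    return WRAPPED_NS[mode] - NATIVE_NS[mode]


def mode_gap_ns(samples: dict) -> int:
    """How much slower zero-heap is than heap for one set of samples."""
    return samples["zh"] - samples["heap"]


def native_share_of_regression() -> float:
    """Fraction of the wrapped-call regression already present natively."""
    return mode_gap_ns(NATIVE_NS) / mode_gap_ns(WRAPPED_NS)


if __name__ == "__main__":
    print("wrapper overhead, heap:", wrapper_overhead_ns("heap"), "ns")
    print("wrapper overhead, zh:  ", wrapper_overhead_ns("zh"), "ns")
    print("native heap-to-zh gap: ", mode_gap_ns(NATIVE_NS), "ns")
    print(f"native share of regression: {native_share_of_regression():.1%}")
```

Under that pairing the wrapper costs 1.2 µs in heap mode and 1.5 µs in zero-heap mode, while 12.0 µs of the 12.3 µs regression is already present in the native call.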
NuttX — C binding
| Case | Heap | Zero-heap | Δ% | Verdict |
|---|---|---|---|---|
| time/time_get_us_overhead | 3.0 µs | 3.1 µs | +3.3% | noise |
| time/delay_1ms | 1.98 ms | 1.99 ms | 0% | RTOS tick |
| thread/yield | 3.6 µs | 3.7 µs | +2.8% | noise |
| thread/get_self | 2.6 µs | 2.6 µs | 0% | noise |
| thread/sleep_1ms | 1.98 ms | 1.99 ms | 0% | RTOS tick |
| thread/context_switch | 45.8 µs | 43.7 µs | −4.6% | noise |
| sync/mutex_lock_unlock | 5.4 µs | 5.0 µs | −7.4% | marginal |
| sync/mutex_contention_2t | 4.7 µs | 4.6 µs | −2.1% | noise |
| sync/sem_take_give | 4.3 µs | 3.9 µs | −9.3% | marginal |
| sync/event_signal_wait | 45.3 µs | 44.0 µs | −2.9% | noise |
| sync/condvar_signal_wait | 62.8 µs | 61.7 µs | −1.8% | noise |
| sync/recursive_mutex_lock_unlock | 8.5 µs | 8.0 µs | −5.9% | marginal |
| queue/send_receive | 11.8 µs | 11.6 µs | −1.7% | noise |
| queue/throughput_2t | 8.2 µs | 7.9 µs | −3.7% | noise |
| timer/start_stop | 27.7 µs | 29.5 µs | +6.5% | marginal |
| eventgroup/set_get_bits | 1.7 µs | 1.7 µs | 0% | noise |
| workqueue/submit_execute | 67.0 µs | 69.2 µs | +3.3% | noise |
| stream/send_recv_64B | 54.0 µs | 52.8 µs | −2.2% | noise |
| stream/throughput | 69.3 µs | 67.8 µs | −2.2% | noise |
Honest read on NuttX.
Heap and zero-heap are interchangeable for per-call latency. Every
hot path sits within ±10% across modes, most under ±5%. The few
marginal deltas (mutex_lock_unlock −7%, sem_take_give −9%,
recursive_mutex_lock_unlock −6%, all favouring zh; timer/start_stop
+6.5% favouring heap) are 400–500 ns absolute on the sync ops and
1.8 µs on the ~28 µs timer op, which points to flash fetch ordering
rather than a structural difference. No marginal delta exceeds 2 µs
absolute.
Zephyr — C binding
| Case | Heap | Zero-heap | Δ% | Verdict |
|---|---|---|---|---|
| time/time_get_us_overhead | 2.5 µs | 2.4 µs | −4.0% | noise |
| time/delay_1ms | 1.09 ms | 1.09 ms | 0% | RTOS tick |
| thread/yield | 10.6 µs | 10.5 µs | −0.9% | noise |
| thread/get_self | 971 ns | 855 ns | −11.9% | real, see notes |
| thread/sleep_1ms | 1.09 ms | 1.09 ms | 0% | RTOS tick |
| thread/context_switch | 56.5 µs | 56.6 µs | +0.2% | noise |
| sync/mutex_lock_unlock | 3.0 µs | 2.9 µs | −3.3% | noise |
| sync/mutex_contention_2t | 3.1 µs | 3.0 µs | −3.2% | noise |
| sync/sem_take_give | 2.0 µs | 2.3 µs | +15.0% | real, see notes |
| sync/event_signal_wait | 56.6 µs | 56.9 µs | +0.5% | noise |
| sync/condvar_signal_wait | 65.1 µs | 65.1 µs | 0% | noise |
| sync/recursive_mutex_lock_unlock | 3.1 µs | 3.0 µs | −3.2% | noise |
| queue/send_receive | 4.7 µs | 4.5 µs | −4.3% | noise |
| queue/throughput_2t | 3.7 µs | 3.3 µs | −10.8% | real, see notes |
| timer/start_stop | 9.0 µs | 8.8 µs | −2.2% | noise |
| eventgroup/set_get_bits | 9.0 µs | 9.1 µs | +1.1% | noise |
| workqueue/submit_execute | 58.4 µs | 58.7 µs | +0.5% | noise |
| stream/send_recv_64B | 14.3 µs | 14.3 µs | 0% | noise |
| stream/throughput | 24.3 µs | 24.3 µs | 0% | noise |
Honest read on Zephyr.
Most hot paths read parity (within ±5%). Three rows show real signal:
- thread/get_self −12% (971 → 855 ns). Zephyr resolves k_current_get differently when the per-thread state lives in caller-supplied static storage vs a k_object_alloc()-managed pool: the static path skips one level of indirection. ~120 ns absolute.
- sync/sem_take_give +15% (2.0 → 2.3 µs). Opposite sign: the zh-mode semaphore object lives at a different address that triggers an extra flash fetch on the take/give pair in Zephyr's k_sem state machine. ~300 ns absolute, smaller than the per-call wrapper overhead documented in the per-binding analysis.
- queue/throughput_2t −11% (3.7 → 3.3 µs). Two-thread producer/consumer with a caller-supplied ring buffer. Static storage for the queue produces a more compact instruction stream than the heap-allocated path on Zephyr's k_queue_* family. ~400 ns absolute, sign-stable across runs.
The longer ops (event_signal_wait 57 µs, condvar_signal_wait
65 µs, context_switch 57 µs, workqueue/submit_execute 58 µs)
all stay within ±2% — kernel work dominates and the
static-vs-pool address difference is below the noise floor.
Cross-RTOS summary
| RTOS | Per-call median \|Δ\| | Real signals | Verdict |
|---|---|---|---|
| FreeRTOS | ~3% | stream/send_recv_64B +60% and stream/throughput +34%, both intrinsic to the native xStreamBuffer* API behaving differently with caller-supplied static storage; not a wrapper cost | Per-call sync interchangeable between modes; ZH stream throughput pays a real intrinsic cost, so choose heap if streams are hot |
| NuttX | ~3% | None on per-call hot paths. Bidirectional placement deltas, typically <500 ns absolute on the µs-range ops. | Heap and ZH functionally interchangeable for per-call latency |
| Zephyr | ~3% | thread/get_self −12%, queue/throughput_2t −11%, sem_take_give +15%, all sub-µs absolute | At parity for any practical workload; mode-specific deltas <500 ns absolute |
The take-aways:
- No binding-level overhead is introduced by zero-heap on any RTOS.
  The wrapper hot-path is the same FFI symbol either way; the audit at
  tests/audit/hotpath_expected.yaml enforces this.
- Per-call sync (mutex, sem, condvar, event, queue, eventgroup) is at
  parity within a few hundred nanoseconds on every RTOS in either mode.
  Pick zero-heap and take the compile-time guarantees against post-boot
  allocation as a free win.
- Two structural exceptions, both on FreeRTOS only:
  stream/send_recv_64B and stream/throughput regress ~12 µs and ~11 µs
  respectively under zero-heap. This regression is in the native
  FreeRTOS xStreamBuffer* API, not in the oveRTOS wrapper, as verified
  by the wrapper-vs-native table in the per-RTOS report. If your
  FreeRTOS workload is bound on stream throughput, prefer heap mode.
Reproducing
The numbers above came from running the bench on STM32F746G-DISCO
with worst-case timing on
(CONFIG_OVE_BENCHMARK_WORST_CASE_TIMING=y — see
benchmarks overview):
make benchmarks-stm32f746g-discovery # FreeRTOS heap
make benchmarks-stm32f746g-discovery ZEROHEAP=1 # FreeRTOS zh
make benchmarks-stm32f746g-discovery-nuttx
make benchmarks-stm32f746g-discovery-nuttx ZEROHEAP=1
make benchmarks-stm32f746g-discovery-zephyr
make benchmarks-stm32f746g-discovery-zephyr ZEROHEAP=1
Each invocation builds 4 bindings (C, C++, Rust, Zig), flashes via
openocd, captures the picocom-recorded serial log, and writes both
output/<board>/<rtos>/_benchmarks{,_zeroheap}/report.md and
docs-site/docs/benchmarks/<rtos>-{heap,zeroheap}.md directly.
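The six invocations follow one pattern: a base target, an optional RTOS suffix, and an optional ZEROHEAP=1. A small driver script can enumerate them for a full sweep. This is a hedged sketch only: the helper name `benchmark_commands` is hypothetical, the script prints the commands rather than running them, and it assumes nothing about the Makefile beyond the target names listed above.

```python
import shlex

BASE_TARGET = "benchmarks-stm32f746g-discovery"
# Target suffix per RTOS, as used in the make invocations above.
RTOS_SUFFIXES = {"FreeRTOS": "", "NuttX": "-nuttx", "Zephyr": "-zephyr"}


def benchmark_commands():
    """Return (rtos, mode, argv) for every heap and zero-heap run."""
    commands = []
    for rtos, suffix in RTOS_SUFFIXES.items():
        target = BASE_TARGET + suffix
        commands.append((rtos, "heap", ["make", target]))
        commands.append((rtos, "zh", ["make", target, "ZEROHEAP=1"]))
    return commands


if __name__ == "__main__":
    # Dry run: print each command instead of executing it, since every
    # real invocation builds, flashes, and captures from attached hardware.
    for rtos, mode, argv in benchmark_commands():
        print(f"{rtos:8} {mode:4} {shlex.join(argv)}")
```

To execute the sweep, each `argv` could be passed to `subprocess.run(argv, check=True)` one at a time, since the runs share a single board.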