Per-language binding analysis
Companion to Heap vs zero-heap: for each
language binding (C, C++, Rust, Zig), what does the wrapper layer
actually cost vs calling the same ove_* FFI from C?
All numbers below are taken from the just-published per-RTOS reports
under CONFIG_OVE_BENCHMARK_WORST_CASE_TIMING=y — Cortex-M7 I-cache,
D-cache, branch predictor, ART accelerator, and flash prefetch all
disabled (see benchmarks overview). Two consequences for
how to read this page:
- Every per-iteration flash fetch costs the full flash latency. There are no hidden cache hits. Wrapper code that compiles to more bytes of flash takes proportionally longer than wrapper code that compiles to fewer.
- No cache-line locality argument applies to anything below. When a non-C binding is faster or slower than C by a few hundred nanoseconds on the same op, that's compiler codegen plus wrapper-code-size, not L1 D-cache placement.
Coverage matrix
| Binding | FreeRTOS heap | FreeRTOS ZH | NuttX heap | NuttX ZH | Zephyr heap | Zephyr ZH |
|---|---|---|---|---|---|---|
| C | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| C++ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Rust | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Zig | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
All bench bindings call ove_thread_start_scheduler() directly
instead of ove_run(): the bench's create/destroy cases require
post-init kernel allocation, which ove_run()'s zero-heap auto-lock
would block on NuttX. The C and C++ benches have always done this; Rust exposes
ove::start_scheduler() and Zig exposes ove.startScheduler() for the same
purpose.
Headline: per-call median |Δ| vs C
Excluding three categories of bench-design noise — mutex_contention_2t
(Zephyr timeslicing, see wrapper-vs-native notes),
the *_throughput* cases (2-thread producer/consumer where scheduling
alignment dominates), and ctx_switch (the ping-pong shape doubles
the wrapper cost) — the remaining per-call hot paths give:
| Config | C++ median / max | Rust median / max | Zig median / max |
|---|---|---|---|
| FreeRTOS heap | 2.8% / 32% | 2.2% / 74% | 0.7% / 62% |
| FreeRTOS ZH | 0.8% / 8% | 3.6% / 26% | 4.0% / 39% |
| NuttX heap | 0.9% / 6% | 2.0% / 12% | 1.5% / 6% |
| NuttX ZH | 1.4% / 11% | 3.6% / 21% | 1.1% / 14% |
| Zephyr heap | 2.5% / 9% | 5.4% / 27% | 2.2% / 31% |
| Zephyr ZH | 1.5% / 74% | 2.8% / 25% | 1.2% / 12% |
Median |Δ| sits at 0.7–5.4% for every binding × RTOS × mode. No binding-level overhead exceeds 6% on the typical path. The "max" column is dominated by stream-buffer outliers (more on these below); the median is what most workloads will see.
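As a rough illustration of how a "median / max" cell is derived, the sketch below computes |Δ| per case against the C baseline and then takes the median and the maximum. The function names are hypothetical, and the input is only the five C vs C++ rows from the NuttX heap table further down this page, so it is not expected to reproduce the headline median, which covers more cases.

```rust
/// Absolute percentage delta of a binding's time against the C baseline.
fn abs_delta_pct(c_ns: f64, binding_ns: f64) -> f64 {
    ((binding_ns - c_ns) / c_ns).abs() * 100.0
}

/// Median of a non-empty slice (mean of the two middle values for even lengths).
fn median(values: &[f64]) -> f64 {
    assert!(!values.is_empty(), "median of an empty slice is undefined");
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN in input"));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Returns (median, max) of |delta| in percent over (c_ns, binding_ns) pairs.
fn summarize(cases: &[(f64, f64)]) -> (f64, f64) {
    let deltas: Vec<f64> = cases.iter().map(|&(c, b)| abs_delta_pct(c, b)).collect();
    let max = deltas.iter().copied().fold(0.0, f64::max);
    (median(&deltas), max)
}

/// (C ns, C++ ns) for the five NuttX heap rows listed on this page:
/// time_get_us_overhead, yield, mutex_lock_unlock, sem_take_give, send_receive.
const NUTTX_HEAP_CPP: [(f64, f64); 5] = [
    (3000.0, 3100.0),
    (3600.0, 3800.0),
    (5400.0, 5400.0),
    (4300.0, 4200.0),
    (11800.0, 11600.0),
];
```

Calling summarize(&NUTTX_HEAP_CPP) gives a median near 2.3% and a max near 5.6% for this five-row subset; the max lands close to the 6% in the NuttX heap row above, while the median differs because the published figure covers more cases.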
Per-call wrapper cost in absolute nanoseconds
The same comparison in absolute ns rather than %, for five per-call hot paths across three RTOSes:
FreeRTOS heap (cache-off baseline)
| Case | C | Δ C++ | Δ Rust | Δ Zig |
|---|---|---|---|---|
| time/time_get_us_overhead | 883 ns | +317 ns | +217 ns | +117 ns |
| thread/yield | 4 500 ns | −200 ns | −100 ns | 0 ns |
| sync/mutex_lock_unlock | 8 100 ns | +100 ns | +400 ns | −100 ns |
| sync/sem_take_give | 6 500 ns | −100 ns | +100 ns | −100 ns |
| queue/send_receive | 8 200 ns | 0 ns | +1 100 ns | +300 ns |
NuttX heap
| Case | C | Δ C++ | Δ Rust | Δ Zig |
|---|---|---|---|---|
| time/time_get_us_overhead | 3 000 ns | +100 ns | +100 ns | 0 ns |
| thread/yield | 3 600 ns | +200 ns | +100 ns | −100 ns |
| sync/mutex_lock_unlock | 5 400 ns | 0 ns | −100 ns | +100 ns |
| sync/sem_take_give | 4 300 ns | −100 ns | +500 ns | −100 ns |
| queue/send_receive | 11 800 ns | −200 ns | +100 ns | +700 ns |
Zephyr heap
| Case | C | Δ C++ | Δ Rust | Δ Zig |
|---|---|---|---|---|
| time/time_get_us_overhead | 2 500 ns | −100 ns | +100 ns | 0 ns |
| thread/yield | 10 600 ns | 0 ns | 0 ns | −200 ns |
| sync/mutex_lock_unlock | 3 000 ns | +100 ns | +400 ns | −200 ns |
| sync/sem_take_give | 2 000 ns | +200 ns | +100 ns | +100 ns |
| queue/send_receive | 4 700 ns | −400 ns | +1 300 ns | −300 ns |
Three patterns from these tables:
- C++ and Zig are at parity with C on per-call sync — within ±300 ns, which is at or below the worst-case-timing scheduler noise floor. Negative deltas (C++/Zig "faster than C") show up wherever g++/zig emits one fewer flash fetch than gcc on the same FFI call shape — they're real but small.
- Rust adds a fixed adapter cost that's typically +100–500 ns per call on simple ops (mutex_lock, sem_take_give, time_get_us, thread/yield) and grows to +1 000–1 300 ns on FreeRTOS and Zephyr for ops where the wrapper does more work (queue/send_receive reads + writes a value through Result<T>, so the adapter cost compounds with the value transfer). This matches the Result<T, ove::Error> wrap + Error::from_code(rc) shape in the ove::* Rust API.
- NuttX has the smallest cross-binding spread. Median |Δ| across the three non-C bindings × all NuttX hot paths is well under 200 ns. NuttX baseline ops are larger (3–12 µs vs Zephyr's 2–5 µs outside thread/yield), so the same fixed adders are a smaller fraction of total time and produce less variance because more code is shared between wrapper and native paths.
C++
The C++ wrappers are header-only RAII classes (ove::Mutex,
ove::Queue, …) using std::optional for setup/teardown lifecycle.
Methods inline cleanly under -O2.
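Headline: median |Δ| 0.8–2.8% per call. By median it is the tightest non-C binding on FreeRTOS ZH and NuttX heap, and second to Zig on NuttX ZH and both Zephyr configs; FreeRTOS heap (2.8%) is its weakest config.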
Outliers worth knowing:
- FreeRTOS heap thread/create_destroy −68% / workqueue/create_destroy −67%: the C++ Thread (ctor+dtor) and WorkQueue (ctor+dtor) use static stack memory rather than the kernel-pool allocation the C wrapper does. Different ownership semantics, not a "C++ faster than C" win in any philosophical sense. Both gated out under zero-heap (where *_create_destroy cases don't compile).
- Zephyr ZH stream/send_recv_64B +74% (14.3 → 24.8 µs): a +10 µs absolute regression vs C on a 14 µs base. The C++ ove::Stream::send_recv wrapper compiles to more bytes than the C ove_stream_send + ove_stream_receive pair on Zephyr ZH, and on a flash-fetch-bound execution path that extra code size shows up directly. Persistent across runs but Zephyr-zh-specific: Zephyr heap shows parity (14.3 vs 14.3) and FreeRTOS zh shows parity (32.9 vs 32.9). Worth a closer look at the Zephyr ZH release-build assembly for Stream::send_recv if that path is on your hot path.
For per-call sync primitives (lock/unlock, take/give, signal/wait, yield) C++ stays inside ±300 ns of C on every RTOS+mode combination. The ergonomic wins (RAII, type-safe queue) come at zero measurable per-op cost on those paths.
Rust
Rust wrappers add a fixed adapter layer per FFI call:
- Result<T, ove::Error> wrapping (constructor + return-value layout)
- Error::from_code(rc) decoder converting C int to ove::Error
- Option<T> decoding via LvCell::try_get() for shared state
- Bounds/null checks the compiler can't elide through the FFI boundary
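The adapter shape listed above can be sketched roughly as follows. This is a host-runnable illustration, not the real ove crate: the Error variants, the return-code values, the from_code signature, and the fake_ove_mutex_lock stand-in for the extern "C" call are all assumptions made for this example.

```rust
/// Hypothetical error type; the real `ove::Error` variants are not shown on this page.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Error {
    Timeout,
    InvalidArg,
    Other(i32),
}

impl Error {
    /// Decode a C-style return code. Assumes 0 means success and negative
    /// values are errors; the actual code mapping is an assumption.
    pub fn from_code(rc: i32) -> Result<(), Error> {
        match rc {
            0 => Ok(()),
            -1 => Err(Error::Timeout),
            -2 => Err(Error::InvalidArg),
            other => Err(Error::Other(other)),
        }
    }
}

/// Stand-in for an `extern "C"` ove_* call so the sketch runs on a host.
/// Fails on a null-like handle or a zero timeout purely for demonstration.
fn fake_ove_mutex_lock(handle: usize, timeout_ms: u32) -> i32 {
    if handle == 0 {
        -2
    } else if timeout_ms == 0 {
        -1
    } else {
        0
    }
}

/// Hypothetical wrapper type holding an opaque kernel handle.
pub struct Mutex {
    handle: usize,
}

impl Mutex {
    /// One FFI hop plus one return-code decode: the fixed per-call adder.
    pub fn lock(&self, timeout_ms: u32) -> Result<(), Error> {
        Error::from_code(fake_ove_mutex_lock(self.handle, timeout_ms))
    }
}
```

Every call pays for the match in from_code and for building the Result, even on the success path, which is the kind of fixed cost the list describes.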
The DWT-direct measurement floor (~50 ns) makes the absolute adder
visible: typically +100–500 ns per simple FFI call, scaling to
+1 000–1 300 ns on ops where the wrapper itself does multiple FFI
hops or marshals values through Result.
Headline: median |Δ| 2.0–5.4% per call. Highest of the three non-C bindings on four of the six configs (C++ is higher on FreeRTOS heap, Zig on FreeRTOS ZH), consistently visible, never above 6% on the typical path.
The fixed cost is what it is — Result<T, E> is the safety
contract of the Rust API, and disabling it isn't a binding-side
optimisation. Three mitigations exist for tight loops:
- _unchecked variants (used by the Rust bench's time::get_us_unchecked()) skip the Result wrap when the caller is willing to ignore errors; used by 2 hot paths in the bench.
- Inline batching: do many ops behind a single FFI call (e.g. Queue::send_many(&items)); not yet exposed in the public API.
- Acceptance: 100–500 ns per call is below most application noise floors; the per-op overhead disappears in any workload that does I/O or any kernel work between calls.
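The first mitigation might look roughly like the pair below: a checked call that decodes the return code into a Result, and an _unchecked call that drops it. Only the name get_us_unchecked comes from this page; the TimeError type, the fake_ove_time_get_us stand-in, and its out-parameter signature are assumptions chosen so the sketch runs on a host.

```rust
/// Hypothetical error type standing in for `ove::Error`.
#[derive(Debug, PartialEq, Eq)]
pub struct TimeError(pub i32);

/// Host-side stand-in for the C clock call; the real FFI signature is not
/// shown on this page. Always succeeds and reports a fixed timestamp.
fn fake_ove_time_get_us(out: &mut u64) -> i32 {
    *out = 1_234;
    0
}

/// Checked variant: decodes the return code into a `Result`.
pub fn get_us() -> Result<u64, TimeError> {
    let mut us = 0u64;
    match fake_ove_time_get_us(&mut us) {
        0 => Ok(us),
        rc => Err(TimeError(rc)),
    }
}

/// Unchecked variant: ignores the return code and hands back the raw value.
/// The caller accepts a meaningless value if the underlying call failed.
pub fn get_us_unchecked() -> u64 {
    let mut us = 0u64;
    let _ = fake_ove_time_get_us(&mut us);
    us
}
```

The unchecked form trades the error path for a shorter call sequence, so it only fits call sites where a failed read is either impossible or harmless.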
Outliers worth knowing:
- FreeRTOS heap thread/create_destroy −74% and workqueue/create_destroy −73%: same static-stack-spawn pattern as C++. Thread::spawn + join uses caller-supplied static memory and skips the kernel-pool round-trip.
- FreeRTOS heap stream/send_recv_64B +21% (20.6 → 25.0 µs) and Zephyr heap queue/send_receive +27% (4.7 → 6.0 µs): the Rust wrapper's &mut [u8] length-check + bounds-check chain adds ~1 300 ns on the queue op and ~4 400 ns on the stream op on top of the underlying FFI. Visible because both ops have a value-transfer body where the adapter cost compounds.
- FreeRTOS zh stream/send_recv_64B −26% (32.9 → 24.2 µs): a negative delta, with Rust faster than C by 8.7 µs. This is one of the cases where the C wrapper's compiled flash footprint exceeds the Rust wrapper's on FreeRTOS zh; under cache-off the larger code size pays its full fetch cost. Persistent across runs, and the opposite sign of the same case on FreeRTOS heap.
The "Rust occasionally faster than C" rows (FreeRTOS heap
*_create_destroy, FreeRTOS zh stream/send_recv_64B,
Zephyr heap mutex_create_destroy −16%) all fall into one of two
buckets: different ownership semantics on construction paths
(static-stack-spawn skips kernel allocation), or compiled-code-size
asymmetry (Rust wrapper happens to be smaller than C on that op).
Neither is a wrapper "win"; just an honest accounting of the gap.
Zig
Zig wrappers go through comptime trampolines that resolve to direct
FFI calls under -OReleaseSafe. The pin tracker (debug-only)
compiles out, and method dispatch is monomorphised.
Headline: median |Δ| 0.7–4.0% per call. Often the tightest binding after C, with a small set of stream-path outliers.
Outliers worth knowing:
- FreeRTOS heap stream/send_recv_64B +62% (20.6 → 33.4 µs): the largest absolute regression on the page, +12.8 µs on a 20.6 µs base. ove.Stream.send_recv does slice-bounds checks on the caller's buffer in addition to the FFI call; combined with Zig's compile-time-resolved alignment checks, the resulting code is bigger in flash than the C ove_stream_send + receive pair. FreeRTOS zh shows the opposite sign for the same wrapper (Zig 20.2 µs vs C 32.9 µs, −39%), which points to compiled-code-size + flash-fetch-ordering rather than a binding-level fixed cost.
- Zephyr heap mutex_create_destroy −31% (11.6 → 7.9 µs): Zig's Mutex.create + destroy uses a comptime-resolved literal for the mutex storage, which k_mutex_init accepts without going through k_object_alloc(). Same different-ownership-semantics pattern as Rust/C++ on the create/destroy paths.
For per-call sync primitives (mutex_lock, sem_take,
event_signal, condvar_signal, recursive_mutex) Zig stays inside
±300 ns of C on every RTOS+mode pair. The wrapper invocation cost
is not reliably visible above the worst-case-timing noise floor on
the per-call sync paths.
Cross-binding summary
For an op that's ~5–8 µs in C under worst-case timing (typical
mutex_lock, sem_take_give, queue/send_receive):
| Binding | Per-call adder | Driver |
|---|---|---|
| C | 0 (baseline) | — |
| C++ | ±0–300 ns | inlined RAII / optional body, mostly parity with C |
| Rust | +100–500 ns typical, +1 000+ ns on value-marshalling ops | Result<T,E> wrap, Error::from_code, Option decode |
| Zig | ±0–300 ns | comptime trampoline; pin tracker elided in release |
For an op that's ~50–60 µs in C (context switch, condvar, workqueue submit) all four bindings sit within ±3% — the per-op fixed costs are below the scheduler noise floor.
Why Zephyr shows the biggest cross-binding spread
Zephyr's per-call sync primitives have the smallest absolute base of
the three RTOSes — k_mutex_lock+unlock is ~3 µs cache-off, vs
~5 µs on NuttX and ~8 µs on FreeRTOS. The same ~200–500 ns
binding-side adder is a larger fraction of 3 µs than of 8 µs, so the
percentage |Δ| values look bigger on Zephyr without the absolute
overhead actually being larger. The "max" 74% in Zephyr ZH C++ is a
+10 µs absolute outlier on a 14 µs base — the only genuinely large
binding-side gap on Zephyr.
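The fraction argument can be checked with a few lines of arithmetic. The sketch below takes the Rust sync/mutex_lock_unlock rows from the heap tables on this page (the same +400 ns adder on a 3 000 ns Zephyr base and an 8 100 ns FreeRTOS base); the function and constant names are illustrative only.

```rust
/// Express a fixed wrapper adder as a percentage of the C baseline time.
fn adder_pct(base_ns: f64, adder_ns: f64) -> f64 {
    adder_ns / base_ns * 100.0
}

/// (RTOS, C base in ns, Rust delta in ns) for sync/mutex_lock_unlock,
/// copied from the heap tables on this page.
const MUTEX_ROWS: [(&str, f64, f64); 2] = [
    ("Zephyr", 3000.0, 400.0),
    ("FreeRTOS", 8100.0, 400.0),
];

/// Percentage overhead per RTOS for the rows above.
fn mutex_pct_by_rtos() -> Vec<(&'static str, f64)> {
    MUTEX_ROWS
        .iter()
        .map(|&(rtos, base, adder)| (rtos, adder_pct(base, adder)))
        .collect()
}
```

With these inputs the same 400 ns comes out at roughly 13% on Zephyr and roughly 5% on FreeRTOS, which is the effect described above: identical absolute cost, different percentage.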
Why NuttX shows the smallest spread
NuttX's POSIX-style primitives carry a large kernel-side base
(pthread_mutex_lock+unlock ~5 µs, mq_* ~12 µs, the largest queue
cost of the three RTOSes). More of the
total time is kernel work that's identical regardless of which
binding called it; the wrapper-side adder is proportionally tiny.
NuttX heap and zh both show median |Δ| under 4% across every
binding.
Heap-vs-zero-heap effect per binding
| Binding | Effect of heap → ZH | Why |
|---|---|---|
| C | Modest mode-dependent deltas (see heap-vs-zeroheap) | static-vs-pool allocation produces different addresses for kernel objects; flash-fetch order changes |
| C++ | Median stays 0.8–2.8%; one mode-specific +74% outlier on Zephyr ZH stream | optional<Stream> wrapper compiles to a different code-size shape under ZH static storage |
| Rust | Stable median across modes; max grows on Zephyr heap (queue/send_receive +27%) | wrapper layer is mode-agnostic; the wrapper code is identical heap and ZH |
| Zig | *_create_destroy outliers gated out under ZH (per design); stream-path outliers flip sign between modes | comptime-resolved storage produces different compile-time-fixed addresses, and on cache-off paths the resulting flash-fetch ordering can flip the sign vs C |
When to choose which binding
- C: the baseline, with no wrapper layer and no language-level safety. Choose when 100 ns per op matters and you're confident in your code.
- C++: clean RAII + type-safe queue at parity with C on per-call paths. One mode-specific rough edge on Zephyr ZH stream.
- Rust: best safety guarantees; +100–500 ns fixed cost per FFI call (typical) under worst-case timing. Use when correctness matters more than nanoseconds — most workloads.
- Zig: comptime safety at parity with C on per-call paths. Stream-path outliers flip sign by mode but are never large in workload-relevant absolute terms.
How to read these percentages on your hardware
The published numbers are the upper bound of per-op cost on a Cortex-M7 platform with caches and accelerators turned off. On your target with caches enabled, the C-binding base op will be ~2.5–3× smaller in wall-clock time. The fixed wrapper adders (in absolute ns) do not change with caches enabled, but the percentage ratio increases because the denominator shrinks. For a quick estimate of the cache-on numbers, divide the C-binding base by ~2.7 and keep the binding adder unchanged: base / 2.7 + adder approximates the cache-on binding time, and adder / (base / 2.7) approximates the cache-on per-call ratio.
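The rule of thumb above can be written down as a small helper. This is only a sketch of the page's own estimate (a ~2.7× smaller base with an unchanged adder), not a measurement; the struct and function names are hypothetical, and the example input is the FreeRTOS heap sync/mutex_lock_unlock row for Rust (8 100 ns base, +400 ns).

```rust
/// Rule-of-thumb speedup from enabling caches, as quoted on this page.
const CACHE_ON_SPEEDUP: f64 = 2.7;

/// Estimated cache-on figures for one op.
#[derive(Debug)]
struct CacheOnEstimate {
    /// Estimated C baseline with caches enabled, in ns.
    base_ns: f64,
    /// Estimated binding time with caches enabled, in ns.
    binding_ns: f64,
    /// Adder as a percentage of the estimated cache-on base.
    delta_pct: f64,
}

/// Scale the cache-off C base down and keep the absolute adder unchanged.
fn estimate_cache_on(c_base_cache_off_ns: f64, adder_ns: f64) -> CacheOnEstimate {
    let base_ns = c_base_cache_off_ns / CACHE_ON_SPEEDUP;
    CacheOnEstimate {
        base_ns,
        binding_ns: base_ns + adder_ns,
        delta_pct: adder_ns / base_ns * 100.0,
    }
}
```

For estimate_cache_on(8100.0, 400.0) the estimated base is about 3 000 ns and the ratio rises from roughly 5% cache-off to roughly 13% cache-on, assuming the adder really does stay fixed on your target.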
The deltas in this analysis are stable across runs; the hot-path
audit at tests/audit/hotpath_expected.yaml ensures no hidden
allocator/vtable/panic-handler ever sneaks into the measured path.