appdimens-dynamic

Technical Performance Report: AppDimens Dynamic

This report documents performance results after applying the four optimization phases (F1 to F4) to the current library.

[!NOTE] Build variants, R8, and how to read the numbers

With code shrinking and R8 enabled on release builds (minifyEnabled = true), measured latencies can drop sharply compared with debug builds without minify. Example ranges from the project harness:

| Harness | Approx. range (release + minify + R8) |
| --- | --- |
| Calculation Test (avg) | ~82 ns – ~150 ns |
| Microbenchmark | ~125 ns – ~155 ns |
| Macrobenchmark (estimated per-item cost in that harness under R8, not the scroll-duration / µs-per-row figures in the baseline table below) | ~367 ns – ~380 ns |

Unless a paragraph explicitly says otherwise, the benchmarks and tables in this document use debug builds without minify (e.g. connectedDebugAndroidTest, debug APK for BenchmarkActivity). See R8-PROGUARD.md if you enable R8 full mode (android.enableR8.fullMode=true in gradle.properties).
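
For orientation, a release build type with shrinking enabled usually looks like the fragment below in a Kotlin DSL build script. This is a generic Android Gradle Plugin sketch, not this project's actual build file; the rules file name is the common default and is an assumption here.

```kotlin
// build.gradle.kts (module level): generic sketch, not the project's actual file
android {
    buildTypes {
        release {
            // Kotlin DSL spelling of `minifyEnabled = true`
            isMinifyEnabled = true
            proguardFiles(
                getDefaultProguardFile("proguard-android-optimize.txt"),
                "proguard-rules.pro" // assumed file name
            )
        }
    }
}
```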

[Figure: Benchmark dashboard, AppDimens Dynamic]
[Figure: Benchmark dashboard, additional capture]


1. Applied Optimizations

| Phase | Component | Description |
| --- | --- | --- |
| F1 | DimenCache.getBatch() | Made the API public for batching N dimensions by the caller |
| F2 | ShardWrapper | 128-byte padding per shard to eliminate false sharing between cores |
| F3 | ScreenFactors | All @Volatile fields grouped in an object with 128-byte padding |
| F4 | clearAll() | lazySet() + manual 4× unrolling for mass clearing without redundant barriers |

2. Benchmarks — Local JVM (Ubuntu Linux · JVM 17)

Executed via ./gradlew :library:testDebugUnitTest · 1,000,000 iterations per case · 5 trials, minimum reported.

| Operation | Result | Status |
| --- | --- | --- |
| Raw Math (no AR) per item | < 1 ns | Extreme 🚀 |
| Raw Math (with AR) per item | 2 ns | Optimal |
| Cache Hit (no AR) per item | 1 ns | Fast |
| Cache Hit (with AR) per item | 1 ns | Zero-Math 🚀 |
| Batch (100 items, math) | 34 ns/batch | Extreme 🏎️ |
| Batch Cache (100 items, AR) | 242 ns/batch | Stable |
| Persistence Load | ~0.06 ms | Fast |

raw_batch_cache_ar at 242 ns/batch remains dominated by the 100× AR lookup loop.
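
The "N trials, minimum reported" methodology used for these tables can be sketched as follows. This is a minimal illustration, not the project's DimenPerformanceTest; the function name `minNsPerOp` and its parameters are hypothetical.

```kotlin
// Hypothetical helper: returns the lowest average cost per call (in ns) seen across all trials.
fun minNsPerOp(trials: Int, iterations: Int, op: (Int) -> Float): Double {
    require(trials > 0 && iterations > 0)
    var sink = 0f                        // accumulated so the loop body has an observable effect
    var best = Double.MAX_VALUE
    repeat(trials) {
        val start = System.nanoTime()
        for (i in 0 until iterations) sink += op(i)
        val elapsed = System.nanoTime() - start
        best = minOf(best, elapsed.toDouble() / iterations)
    }
    if (sink.isNaN()) println("unexpected NaN") // keeps `sink` in use
    return best
}
```

A call such as `minNsPerOp(trials = 5, iterations = 1_000_000) { i -> i * 1.5f }` mirrors the JVM settings above. Reporting the minimum filters out scheduler and GC noise, so the result is a best-case figure rather than a typical one.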


3. Benchmarks — Physical Hardware (Xiaomi 2107113SG · Snapdragon 888 · SM8350)

[!NOTE] Hardware: Captures below use ./gradlew :library:connectedDebugAndroidTest · 100,000 iterations · 3 trials, minimum reported.

| Operation | Result | Status |
| --- | --- | --- |
| Raw Math (no AR) per item | 2 ns | Optimal |
| Raw Math (with AR) per item | 45 ns | Standard |
| Cache Hit (no AR) per item | 5 ns | Fast |
| Cache Hit (with AR) per item | 35 ns | Zero-Math 🚀 |
| Batch (100 items, math) | 169 ns/batch | Near-Zero 🚀 |
| Batch (100 items, math+AR) | 194 ns/batch | Stable |
| Batch Cache (100 items, no AR) | 431 ns/batch | Constant |
| Batch Cache (100 items, with AR) | 3,773 ns/batch | Stable |
| Batch Mixed (50% AR / 50% without) | 2,036 ns/batch | Stable |
| Persistence Load | 0.76 ms | Fast |

Regression Fix (F1.1): Inlining of getOrPutInternal and ShardWrapper visibility (internal @PublishedApi) keeps batch AR paths in the ~3.7–3.8 µs range for 100 cached AR lookups; the non-AR hot path (most cases) remains extremely stable at ~5 ns.


4. Optimization Analysis

F1 — Public getBatch()

JVM:     34 ns / 100 items = 0.34 ns per item
Android: 169 ns / 100 items = 1.69 ns per item

The batch API exposes a continuous loop that the JVM can optimize aggressively on desktop. On Android, the gain is still largely about amortizing context and init work across 100 resolutions.

Recommended usage:

val keys = LongArray(views.size) { i ->
    DimenCache.buildKey(values[i].toFloat(), isLandscape,
        false, DimenCache.CalcType.SCALED, DpQualifier.SMALL_WIDTH,
        Inverter.DEFAULT, false, DimenCache.ValueType.PX)
}
val results = DimenCache.getBatch(keys, context) { i -> computeDimension(i) }

F2 — ShardWrapper (Anti-False-Sharing Padding)

The 128-byte padding between shards ensures that each AtomicLongArray and AtomicIntegerArray is in distinct heap regions, without sharing a cache line (64 bytes on ARM64).

Memory Overhead:

Before: 4 × SHARD_SIZE × (8 + 4) bytes = 4 × 512 × 12 = 24,576 bytes (~24 KB)
After:  4 × ShardWrapper ≈ 4 × (16 header + 8+8 refs + 14×8 pad) = 4 × ~144 = ~576 bytes overhead
        + 4 × 512 × 12 bytes of data = ~24 KB (unchanged)
Total:  ~24.6 KB (increase of ~576 bytes, about 2.3%, due to padding; negligible)

Benefit: Eliminates cross-core cache line invalidation between threads. Particularly relevant on octa-core devices (4+4) like the SM8350.
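
A padded shard along the lines described above might look like the sketch below. It is an illustration under stated assumptions, not the library's ShardWrapper: the class name `PaddedShard` and the helper functions are hypothetical, and the runtime decides the final field layout, so padding with extra fields is a heuristic rather than a guarantee.

```kotlin
import java.util.concurrent.atomic.AtomicIntegerArray
import java.util.concurrent.atomic.AtomicLongArray

const val SHARD_SIZE = 512 // slot count taken from the overhead calculation above

// Hypothetical padded shard; the real ShardWrapper layout may differ.
@Suppress("unused")
class PaddedShard {
    val keys = AtomicLongArray(SHARD_SIZE)       // 8 bytes per slot
    val values = AtomicIntegerArray(SHARD_SIZE)  // 4 bytes per slot (float bits)

    // 14 longs = 112 bytes of padding after the two references.
    private var p0 = 0L; private var p1 = 0L; private var p2 = 0L; private var p3 = 0L
    private var p4 = 0L; private var p5 = 0L; private var p6 = 0L; private var p7 = 0L
    private var p8 = 0L; private var p9 = 0L; private var p10 = 0L; private var p11 = 0L
    private var p12 = 0L; private var p13 = 0L
}

// Payload bytes: shards × slots × (long key + int value).
fun dataBytes(shards: Int = 4): Int = shards * SHARD_SIZE * (8 + 4)

// Wrapper bytes: shards × (assumed 16-byte header + two 8-byte refs + 14 × 8 padding).
fun paddingOverheadBytes(shards: Int = 4): Int = shards * (16 + 8 + 8 + 14 * 8)
```

With the default of four shards, `dataBytes()` gives 24,576 and `paddingOverheadBytes()` gives 576, matching the figures in the calculation.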

F3 — ScreenFactors (@Volatile Padding)

The 6 @Volatile fields (scale, arMultiplier, normalizedAr, logNormalizedAr, density, smallestWidthDp) occupied ~24 bytes, under half of a 64-byte ARM64 cache line, so all six shared one line. A write to scale during updateFactors() could invalidate arMultiplier on another core reading simultaneously.

With ScreenFactors, the 6 fields + 112-byte padding are isolated in their own line. updateFactors() occurs rarely (configuration changes), so the benefit is preventing sporadic jank rather than steady-state latency.
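
As a rough sketch of the grouping described above, the six fields can live in one holder object followed by padding. The class name `PaddedScreenFactors`, the Float types, and the default values are assumptions for illustration; the library's ScreenFactors may be laid out differently.

```kotlin
// Hypothetical holder: field names follow the list above; types and defaults are assumed.
@Suppress("unused")
class PaddedScreenFactors {
    @Volatile var scale = 1f
    @Volatile var arMultiplier = 1f
    @Volatile var normalizedAr = 1f
    @Volatile var logNormalizedAr = 0f
    @Volatile var density = 1f
    @Volatile var smallestWidthDp = 0f

    // 14 longs = 112 bytes of trailing padding, intended to keep unrelated data off this line.
    private var p0 = 0L; private var p1 = 0L; private var p2 = 0L; private var p3 = 0L
    private var p4 = 0L; private var p5 = 0L; private var p6 = 0L; private var p7 = 0L
    private var p8 = 0L; private var p9 = 0L; private var p10 = 0L; private var p11 = 0L
    private var p12 = 0L; private var p13 = 0L
}
```

Because writes happen only on configuration changes, readers pay for a volatile load on each access but rarely see the line invalidated.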

F4 — clearAll() with lazySet() + 4× Unrolling

lazySet() emits an ordered store (without a full StoreLoad barrier), making mass zeroing ~2-3× faster than set(). The next getOrPut() will emit the necessary acquire barrier.

Theory: 512 elements × 4 shards = 2,048 lazySet() calls per clearAll(). With 4× unrolling: ~512 loop iterations instead of 2,048 → 4× reduction in branch+increment overhead.
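
The unrolled clearing loop can be sketched as below. This is a minimal illustration of the technique on a single AtomicLongArray, not the library's clearAll(); the function name `clearUnrolled` is hypothetical.

```kotlin
import java.util.concurrent.atomic.AtomicLongArray

// Hypothetical helper: zeroes every slot with lazySet(), four slots per loop iteration.
fun clearUnrolled(arr: AtomicLongArray) {
    val n = arr.length()
    val limit = n - (n % 4)    // largest multiple of 4 not exceeding n
    var i = 0
    while (i < limit) {
        arr.lazySet(i, 0L)
        arr.lazySet(i + 1, 0L)
        arr.lazySet(i + 2, 0L)
        arr.lazySet(i + 3, 0L)
        i += 4
    }
    while (i < n) {            // remainder when the length is not a multiple of 4
        arr.lazySet(i, 0L)
        i++
    }
}
```

For a 512-slot shard the first loop runs 128 times and the remainder loop never executes, which gives the 512 iterations across four shards mentioned above.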


5. Test Integrity

✅ DimenCacheTest         — 5/5 tests passed (keysArray backward compat via aliases)
✅ DimenPerformanceTest   — executed successfully (local JVM)
✅ ExampleUnitTest        — passes
✅ DimenAndroidPerformanceTest — 2/2 tests on physical device (SM8350)
✅ ExampleInstrumentedTest    — passes
✅ BenchmarkActivity      — executed successfully on physical device (SM8350)

5a. BenchmarkActivity — Production-Grade Micro + Macro Test

BenchmarkActivity has been redesigned into a professional dual-benchmark system:

  1. Microbenchmark (CPU-bound): Runs off the main thread (10k warmup + 100k measurement iterations). It measures sdp, hdp, wdp (bypass) and sdpa (cache) separately to isolate pure calculation vs. lookup costs.
  2. Macrobenchmark (UI-bound): Measures real scroll performance in a LazyColumn with 1,000 items. Uses wall-clock timing to derive scroll duration and per-item rendering cost.

Note (Macro): Wall-clock scroll duration and the estimated cost per item include the full cost of each list row — not only sdp / dimension resolution, but also Compose composition (or View inflation/layout) and drawing for that item. The macro numbers therefore reflect realistic UI work, not isolated math/cache timing.
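
The per-item estimate is a plain division of wall-clock scroll time by item count. A minimal sketch, with a hypothetical function name:

```kotlin
// Hypothetical helper: average wall-clock cost per list row, in microseconds.
fun estimatedCostPerItemMicros(scrollDurationMs: Double, itemCount: Int): Double {
    require(itemCount > 0)
    return scrollDurationMs * 1_000.0 / itemCount
}
```

For example, a 996 ms scroll over 1,000 items yields 996 µs per item. Since the numerator is total scroll time, every cost incurred per row is folded into that figure.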

Baseline Metrics (Snapdragon 888 · SM8350 · Android 14):

| Runner | Metric | Result | JIT State |
| --- | --- | --- | --- |
| Micro | Combined Avg | ~619 ns | Cold |
| Micro | Combined Avg | ~303 ns | Warm (await) |
| Micro | Combined Avg | ~260 ns | Hot (steady-state) |
| Micro | sdp/hdp/wdp (bypass) | ~2 ns | Hot |
| Micro | sdpa (cache lookup) | ~35 ns | Hot |
| Macro | Scroll Duration (1k items) | ~996 ms | Fluid |
| Macro | Est. Cost per item | ~996 µs | Fluid |

Release + minify + R8 (same harness family): micro combined average ~125 ns – ~155 ns per cycle (contrast ~260 ns hot steady-state above on debug without minify). Macro per-item estimate under R8 ~367 ns – ~380 ns (see methodology note at the top of this document—distinct from ~996 µs per row in the debug table, which includes full row composition/layout/draw).

Steady-state performance: ~260 ns combined average per 4-call cycle (hot JIT, dashboard capture · 2026-04-03).

Warm-up Chart Interpretation

ns/resolution (Combined Avg)
619 │ ●  Cold Start
    │
303 │    ●  JIT warming (await)
    │
260 │       ●  JIT hot (steady-state)
    └─────────────────────────────────
       run 1    run 2    run 3

The decay from ~619 → ~260 ns is the expected behavior of the ART JIT as hot paths are compiled and inlined.

Note: With Profile Guided Optimization (PGO), cold-run penalties are reduced — steady-state remains near ~260 ns for this workload on SM8350-class hardware.

Context with BenchmarkActivity (Compose + View + 1000 items)

The stress test measures:

totalNs / (repeatCount * 4)   // = totalNs / 40,000 resolutions

Each “resolution” is one of the 4 calls in a cycle: sdp, hdp, wdp (bypass) and sdpa (cache).

Mixed bypass (sdp / hdp / wdp) and AR cache (sdpa) paths contribute to the ~260 ns hot steady-state average per 4-call cycle (hash, sharded lookup, and bypass math, inclusive of dispatch overhead in the activity harness).
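
The averaging formula above can be sketched as a small timing loop. This is an illustration, not the BenchmarkActivity code: `averageNsPerResolution` is a hypothetical name, and the lambdas stand in for the sdp, hdp, wdp, and sdpa calls.

```kotlin
// Hypothetical harness: each lambda in `calls` stands in for one dimension resolution.
fun averageNsPerResolution(repeatCount: Int, calls: List<() -> Float>): Double {
    require(repeatCount > 0 && calls.isNotEmpty())
    var sink = 0f                        // accumulated so the calls have an observable effect
    val start = System.nanoTime()
    repeat(repeatCount) {
        for (call in calls) sink += call()
    }
    val totalNs = System.nanoTime() - start
    if (sink.isNaN()) println("unexpected NaN") // keeps `sink` in use
    // totalNs / (repeatCount * number of calls per cycle)
    return totalNs.toDouble() / (repeatCount.toLong() * calls.size)
}
```

With `repeatCount = 10_000` and four calls per cycle, the divisor is the 40,000 resolutions quoted above. Lambda dispatch and loop overhead are included in the result, just as harness overhead is included in the activity's figure.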


graph TD
    A[UI / Code Call] --> B{Cache Enabled?}
    B -- No --> Z[Compute Directly]
    B -- Yes --> C{Bypass-eligible & No-AR?}
    C -- Yes --> D["Fast Math Direct Return (~2ns)"]
    C -- No --> E[getOrPutInternal]
    E --> F["Hash Key → ShardWrapper[i]<br/>(Isolated 128-byte padding)"]
    F --> G{Key Match?}
    G -- Hit --> H["Return Float.fromBits (~5-35ns)"]
    G -- Miss --> I[Compute Once & Write Back]
    I --> H
    D --> H
    J["ScreenFactors<br/>(Padded @Volatile)"] -.reads.-> E

6. Simple Calculations Faster Than Cache

For CalcType values of AUTO, FLUID, PERCENT, and SCALED without Aspect Ratio (applyAspectRatio = false), DimenCache.getOrPut() immediately returns compute() without touching the sharded arrays.

The scaling formula for these types reduces to: baseValue × scale (a single float multiply). Measured cost on Snapdragon 888: ~2 ns (multiply) vs ~5 ns (hash + atomic lookup). The cache adds overhead for these paths — bypassing it is ~2.5× faster.

This is an intentional hot-path optimization, not a missing feature. The cache is most valuable when the computation is expensive: on the Aspect Ratio path, a cached hit (~35 ns on hardware in recent captures) undercuts recomputing (~45 ns).

| Path | Cost | Cache? |
| --- | --- | --- |
| SCALED without AR (most calls) | ~2 ns | ❌ Bypass |
| SCALED with AR | ~45 ns → ~35 ns cached | ✅ Cache |
| Other CalcTypes with AR | ~45 ns → ~35 ns cached | ✅ Cache |
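
The bypass decision can be illustrated with a deliberately simplified cache. This is a single-threaded sketch using plain arrays, whereas the library described in this report uses sharded atomic arrays; the class name `TinyDimenCache`, the `bypass` flag, and the hash are all assumptions.

```kotlin
// Single-threaded sketch; not the library's DimenCache.
class TinyDimenCache(private val slots: Int = 512) {
    private val keys = LongArray(slots)  // 0 marks an empty slot
    private val bits = IntArray(slots)   // cached floats stored as raw bits

    fun getOrPut(key: Long, bypass: Boolean, compute: () -> Float): Float {
        if (bypass) return compute()     // cheap math: skip hashing and the table entirely
        val slot = ((key xor (key ushr 32)).toInt() and Int.MAX_VALUE) % slots
        if (key != 0L && keys[slot] == key) {
            return Float.fromBits(bits[slot])   // hit: no recomputation
        }
        val value = compute()            // miss: compute once, then write back
        keys[slot] = key
        bits[slot] = value.toRawBits()
        return value
    }
}
```

The caller would set `bypass` for paths where the computation is a single multiply, and leave it unset for the Aspect Ratio paths where a lookup is cheaper than recomputing.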

Consequence for benchmarks: calls via DimenSdp.sdp(), .hdp(), .wdp() (i.e., without AR) measure raw math latency, not cache lookup latency. Always use the *a variants (.sdpa(), .hdpa(), etc.) to specifically measure cache throughput.

The BenchmarkActivity micro harness reports a per-cycle combined average (~260 ns hot) over four calls (3 bypass + 1 cache path), including framework and dispatch overhead — not a per-call pure math figure.


7. Benchmark Variability

All numbers in this document were captured on a Xiaomi 2107113SG (Snapdragon 888 SM8350, Android 14) and an Ubuntu Linux JVM 17 host. Real-world results will differ on other hardware and workloads.

Benchmarks vary with real usage. Use these figures as best-case reference points for architecture decisions (the tables in sections 2 and 3 report the minimum across trials), not as absolute production guarantees. Profile on your target device with your target workload.


Report generated on: 2026-04-03 · AppDimens Dynamic Performance Lab · Snapdragon 888 (SM8350) · Physical Hardware
Compiled with: Kotlin 2.x · JVM 17 · ART (Android 14) · Gradle 9.3.1