MPSC Determinism Under the Lookahead Scheduler (Consumer-Tick-Driven Arbitration)

Status: Implemented and in production. This document is retrospective: it describes the only MPSC arbitration path used by the lookahead scheduler. A previous BulkBarrier backend was removed once this path became the default (see commit 65a8fed and the removal change).

Background

The lookahead scheduler advances worker threads independently, spinning on predecessor-cluster completed_cycle atomics (thread_progress_array_) instead of rejoining the main thread at every cycle. The default epoch-free driver runs one window for the whole call; the deprecated per-epoch fallback (executeEpochProgressBased / executeRunProgressBased) keeps the same consumer-driven arbitration invariant but adds epoch boundaries. There is no per-cycle sync point where an auxiliary main-thread arbiter can run.

MPSC admission uses a two-stage path (Connection::transfer, InPort::arbitrateMPSC):

Producer writes into Connection::staging_ — one SPSC ring per connection, owned by the producer thread.
Arbiter drains every registered connection's staging into the shared MultiProducerQueueAdapter in conn_id-ascending order. The arbiter is the sole writer to the shared per-thread rings, making ring-level admission single-writer and deterministic.
Consumer tryReceive pops from the shared MPSC adapter via k-way merge on (arrive_cycle, conn_id).

Without a global sync point, arbitration rides on the consumer's own execution.

Consumer-tick-driven arbitration

Arbitration shifts from "main thread at global sync" to "consumer thread at the start of every tick that owns the InPort". The lookahead scheduler already gates the consumer's tick on predecessor progress atomics, so the arbiter can safely drain each connection up to its own producer's published cycle.

Per-connection drain bound (heterogeneous-delay correctness)

The arbiter drains each connection up to its OWN producer cluster's completed_cycle, not a single S = min over all producers. Each MPSC connection has its own per-connection ring (keyed on conn_id, so it carries a single fixed delay and stays monotonic in arrive_cycle), and the consumer pops by (arrive_cycle, conn_id). A single min-over-producers bound is only correct when every edge into the InPort shares one delay: with mixed delays it drags the bound down to the slowest (highest-delay) producer, so a low-delay producer's entry needed at the consumer's current cycle is left un-arbitrated and arrives a cycle late — diverging from barrier/sequential and breaking num_workers invariance. Draining per connection (bound = that producer's completed_cycle) admits exactly the entries whose producer cycle is finished; the lookahead gate (completed >= K+1-delay for each edge) guarantees every arrive_cycle <= K entry is admitted before the consumer reads cycle K. See InPort::arbitrateMPSCConsumerDriven and the test_mpsc_mixed_delay_determinism regression test.

Sequencing

Consumer thread C is about to tick unit U at localCycle K. U owns MPSC InPort P. Sequence:

Pseudocode equivalent:

At start of U's tick for cycle K (before user tick() runs):
  for conn in P.mpsc_connections_ sorted by conn_id:
      p := producer_completed_cycle(conn)      // conn's OWN producer cluster
      d := conn.delay
      drain conn's staging entries with enqueue_cycle <= min(p - 1, K - d)

  user tick() runs → tryReceive pops via k-way merge on (arrive_cycle, conn_id)

Hook: TickableUnit::executeTick calls arbitrateOwnedMPSCPorts_() before the user's tick(). The helper walks the owner's ports and calls arbitrateMPSCConsumerDriven on each.

Each connection is drained up to a per-connection bound, min(p - 1, K - d):

p - 1 (its own producer's completed_cycle): a single min over producers would be too conservative under heterogeneous edge delays — it caps every connection at the slowest producer, so a low-delay producer's entry due at the consumer's current cycle is left un-arbitrated and arrives late.
K - d (consumer's current cycle minus this edge's delay): admit only entries whose arrive_cycle = enqueue_cycle + d is <= K, i.e. visible to the receiver this cycle. A producer running ahead under lookahead must not have its future-cycle entries admitted early — under LegacyFastPath selective cancellation an early-admitted entry can be stamped and then wrongly canceled by a later cancelYoungerThan flush, which sequential/barrier (never admitting early) would not do.

Because each connection has its own conn_id-keyed ring (one fixed delay ⇒ monotonic arrive_cycle), per-connection draining is safe, and the lookahead gate (producer.completed >= K+1-delay for each edge) guarantees all arrive_cycle <= K entries are admitted before the consumer reads cycle K.

Under Sequential or Barrier execution the per-InPort producer-progress pointer set is empty, so arbitrateMPSCConsumerDriven degrades to the legacy unbounded drain. Those modes still use the main-thread arbitrateAllMPSCPorts_() at their own sync points.

Determinism proof sketch

Claim: for a fixed topology, every run produces the same shared-queue commit order — the same (send_cycle, conn_id) sequence — regardless of worker-thread interleaving.

Let E(c, i) denote the set of staging entries produced by Connection i during producer cycle c. Given a fixed topology and fixed initial state:

Producer output determinism (by induction on cycle). tick() is deterministic and its inputs at cycle c are a deterministic function of arbitrated entries from cycles < c (see step 3).
Arbitration drains every entry exactly once. last_arbitrated_cycle_ advances monotonically (single writer, per-port). Staging is SPSC and popped only by the arbiter.
Arbitration order is (send_cycle, conn_id). Outer loop iterates c strictly ascending. Inner loop iterates P.mpsc_connections_ in conn_id-sorted order (enforced at registerMPSCConnection). conn_id is assigned at topology construction time.
Shared MPSC k-way merge keyed on (arrive_cycle, conn_id). Given arbitration writes entries in (send_cycle, conn_id) order and arrive_cycle = send_cycle + delay is monotonic in send_cycle for a fixed delay per connection, pop order equals arbitration order.
S is monotone and eventually sufficient. Each thread_progress_array_[p].completed_cycle is monotonically increasing. At epoch end, every cluster reaches end_cycle, so every producer eventually publishes every cycle's completion and every entry gets drained before the next call into user code that would observe it.

The proof does NOT rely on wall-clock fairness between producer threads.

Tail drain (R8)

At epoch end every thread reaches end_cycle but publishes end_cycle - 1 as completed just before exiting the per-thread loop. Staging entries with enqueue_cycle == end_cycle - 1 are still pending. After the epoch sync_wait, executeEpochProgressBased calls arbitrateAllMPSCPorts_() once to drain the tail. The default epoch-free path performs the same tail drain once after the run-spanning worker launch joins.

Termination

A unit calling requestTermination mid-epoch sets stop_token. Spin-waits exit. Staging may contain entries never arbitrated; the epoch-end flush covers them. If a consumer's tryReceive runs before termination is observed, it sees whatever arrived via the last consumer-tick arbitration.

Performance

Every consumer tick loads up to |producer_clusters(P)| atomics plus the drain inner loop — one extra load pair per MPSC InPort per tick. Measured ~2.9× wall-clock speedup over the prior BulkBarrier backend on Dhrystone at num_workers=8, driven largely by eliminating per-iteration sync_wait round-trips.

Audit of concurrent shared state

For producer on thread A pushing conn_7 vs. consumer on thread B arbitrating conn_3:

Connection::staging_ — per-connection SPSC ring. conn_3 and conn_7 have distinct rings. No share.
Connection::cancel_epoch_ — per-connection atomic. No share.
Connection::pushes_this_cycle_ — producer-only, per-connection. No share.
MultiProducerQueueAdapter (shared across connections feeding the same InPort) — only the port's consumer thread ever arbitrates this port, so the adapter has a single writer per port.
thread_queues_ vector — populated at initialize(), read-only during run.
thread_progress_array_[p].completed_cycle — producer cluster writes its own slot (release), arbiter reads predecessor slots (acquire). Atomic, safe.
InPort::last_arbitrated_cycle_ — single writer (port's consumer thread).

No cross-connection shared mutable state between producer-A-pushing-conn_7 and arbiter-on-B-draining-conn_3. The only consumer/producer shared state is within a single Connection (its staging ring), handled by SPSC ordering.

Background​

Consumer-tick-driven arbitration​

Sequencing​

Determinism proof sketch​

Tail drain (R8)​

Termination​

Performance​

Audit of concurrent shared state​