Module lockfree_buffer_pool


Lock-free buffer pool for zero-allocation high-frequency data processing

This module implements a lock-free buffer pool using the Treiber stack algorithm, optimized for minimal latency in HFT environments where even small synchronization overheads can blow a microsecond-level latency budget.

§HFT Performance Rationale

§Lock Contention Elimination

Traditional mutex-based pools create critical bottlenecks in HFT systems:

  • Lock acquisition: 10-50ns overhead per operation
  • Thread blocking: Unpredictable delays when multiple threads compete
  • Priority inversion: Low-priority threads can block high-priority trading logic
  • Cache coherency: Mutex state changes invalidate CPU caches across cores

§Memory Allocation Costs

Dynamic allocation is prohibitive in latency-sensitive paths:

  • malloc/free overhead: 50-200ns per allocation
  • Fragmentation: Degrades cache performance over time
  • TLB pressure: Memory mapping overhead affects page table lookups
  • NUMA effects: Cross-node allocation can add 100+ ns latency

§Lock-Free Architecture

§Treiber Stack Algorithm

Uses atomic compare-and-swap (CAS) operations for thread-safe access (a minimal sketch follows this list):

  • ABA problem protection: Pointer-based nodes are never recycled while the pool lives, so stale pointer values cannot reappear
  • Memory ordering: Acquire/Release semantics ensure proper synchronization
  • Cache-line alignment: 64-byte alignment prevents false sharing
  • Lock-free progress: At least one thread completes its CAS on every contended round, so no thread can block others indefinitely
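
For reference, a minimal Treiber push/pop sketch in Rust. The type and method names here are illustrative, not this module's API, and the real pool layers its ABA protection and buffer typing on top:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    buffer: T,
    next: AtomicPtr<Node<T>>,
}

struct TreiberStack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> TreiberStack<T> {
    /// Pop the top node, or null if the stack is empty.
    fn pop(&self) -> *mut Node<T> {
        loop {
            // Acquire pairs with the Release in push, making the
            // popped node's contents visible to this thread.
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return ptr::null_mut();
            }
            // Sound only because pool nodes are never deallocated
            // while the pool is alive.
            let next = unsafe { (*head).next.load(Ordering::Relaxed) };
            if self
                .head
                .compare_exchange(head, next, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                return head; // we won the race for this node
            }
            // CAS failed: another thread changed head; retry.
        }
    }

    /// Push a node back onto the stack.
    fn push(&self, node: *mut Node<T>) {
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next.store(head, Ordering::Relaxed) };
            // Release publishes the node's contents before it
            // becomes the new head.
            if self
                .head
                .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }
}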

§Memory Layout Optimization

Cache Line 0: [ Stack Head Pointer (8B) | Padding (56B) ]
Cache Line 1: [ Node->Buffer Ptr (8B) | Node->Next (8B) | Padding (48B) ]
Cache Line 2: [ Actual Buffer Data ... ]
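
A sketch of how this layout might be expressed in Rust; repr(align(64)) pads each struct out to a full cache line (struct and field names are illustrative):

use std::sync::atomic::AtomicPtr;

#[repr(align(64))]
struct Head<T> {
    // Cache line 0: 8B pointer; repr(align(64)) supplies the 56B of
    // trailing padding, so CAS traffic on the head cannot false-share
    // with neighboring data.
    head: AtomicPtr<PoolNode<T>>,
}

#[repr(align(64))]
struct PoolNode<T> {
    // Cache line 1: 8B buffer pointer + 8B intrusive link + 48B padding.
    buffer: *mut T,
    next: AtomicPtr<PoolNode<T>>,
}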

§Buffer Type Specialization

Separate lock-free stacks are maintained per buffer type (sketched after this list):

  • Serialization buffers: JSON/Binary encoding (128KB default)
  • Compression buffers: Data compression operations (64KB default)
  • SIMD buffers: Aligned vectorized calculations (VecSimd<f64x4>)
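
As a sketch, the pool might hold one typed stack per class, using the TreiberStack sketched above (field names are illustrative; sizes are the stated defaults):

// One independent lock-free stack per buffer class, so a hot
// serialization path never contends with compression or SIMD users.
struct TypedPools {
    serialization: TreiberStack<Vec<u8>>,  // 128 KB encoding buffers
    compression: TreiberStack<Vec<u8>>,    // 64 KB compression buffers
    simd: TreiberStack<VecSimd<f64x4>>,    // aligned SIMD buffers
}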

§Performance Characteristics

§Latency Metrics

  • Buffer acquisition: <10ns typical, <50ns worst-case
  • Buffer return: <5ns (a single CAS in the uncontended case)
  • Cache miss penalty: ~100ns when buffer not in L3 cache
  • Allocation fallback: 200-500ns when pool exhausted

§Throughput Optimization

  • High concurrency: Scales with CPU cores, approaching linear when thread-local caching absorbs most contention
  • Memory bandwidth: Minimizes allocation traffic
  • CPU cache efficiency: Hot buffers stay in L1/L2 cache

§Safety & Correctness

§Memory Safety

  • No data races: CAS operations provide synchronization
  • No double-free: Buffers tracked through intrusive linked list
  • Bounded memory: Size limits prevent unbounded growth (see the sketch after this list)
  • Leak prevention: Remaining buffers are freed when the pool is dropped
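
A sketch of the bounded return path, reusing the TreiberStack and Node types sketched above (the function name and counter scheme are illustrative):

use std::sync::atomic::{AtomicUsize, Ordering};

fn return_bounded(
    stack: &TreiberStack<Vec<u8>>,
    count: &AtomicUsize,
    max_buffers: usize,
    node: *mut Node<Vec<u8>>,
) {
    // fetch_add returns the previous count: if we were under the cap,
    // the node goes back on the stack; otherwise undo the increment
    // and free the node so resident memory stays bounded.
    if count.fetch_add(1, Ordering::Relaxed) < max_buffers {
        stack.push(node);
    } else {
        count.fetch_sub(1, Ordering::Relaxed);
        drop(unsafe { Box::from_raw(node) }); // deallocate instead of pooling
    }
}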

§ABA Prevention

Uses pointer-based nodes rather than array indices:

use std::sync::atomic::AtomicPtr;

// Safe: node addresses are unique and nodes are never deallocated
// while the pool is alive, so a stale pointer cannot alias a
// recycled node.
struct BufferNode<T> {
    buffer: T,                       // the pooled buffer itself
    next: AtomicPtr<BufferNode<T>>,  // intrusive free-list link
}

§Integration Patterns

§RAII Buffer Guards

Automatic buffer return prevents leaks:

let mut buffer_guard = pool.get_serialization_buffer_guard();
let buf = buffer_guard.data_mut(); // mutable access to the pooled buffer
// ... fill `buf` ...
// Buffer is returned to the pool automatically when the guard drops.
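
Under the hood, the guard pattern reduces to a Drop impl that pushes the buffer back. A sketch over the TreiberStack/Node types above (the real BufferGuard is listed under Structs below):

struct Guard<'a, T> {
    stack: &'a TreiberStack<T>,
    node: *mut Node<T>,
}

impl<T> Guard<'_, T> {
    /// Mutable access to the pooled buffer while the guard is alive.
    fn data_mut(&mut self) -> &mut T {
        unsafe { &mut (*self.node).buffer }
    }
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        // Return path: a single CAS in the uncontended case, so scope
        // exit stays cheap even on hot paths.
        self.stack.push(self.node);
    }
}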

§Thread-Local Caching

Can be combined with thread-local storage to cut atomic traffic even further (sketched after this list):

  • Each thread maintains a small local cache
  • Fall back to global lock-free pool when cache misses
  • Reduces atomic operations to near-zero in steady state
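
A self-contained sketch of the pattern; the cache size and the global pool stand-ins are illustrative:

use std::cell::RefCell;

const LOCAL_CAP: usize = 4; // illustrative per-thread cache size

thread_local! {
    static LOCAL_CACHE: RefCell<Vec<Vec<u8>>> =
        RefCell::new(Vec::with_capacity(LOCAL_CAP));
}

// Stand-ins for the global lock-free pool's pop/push so the sketch
// compiles on its own; the real calls would hit the Treiber stacks.
fn global_acquire() -> Vec<u8> { Vec::with_capacity(128 * 1024) }
fn global_release(_buf: Vec<u8>) {}

/// Fast path: serve from the thread-local cache with zero atomics;
/// fall back to the shared pool only on a local miss.
fn acquire() -> Vec<u8> {
    LOCAL_CACHE.with(|c| c.borrow_mut().pop()).unwrap_or_else(global_acquire)
}

fn release(buf: Vec<u8>) {
    LOCAL_CACHE.with(|c| {
        let mut cache = c.borrow_mut();
        if cache.len() < LOCAL_CAP {
            cache.push(buf); // steady state: no atomic operations
        } else {
            global_release(buf); // overflow back to the shared pool
        }
    });
}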

§Monitoring & Tuning

Built-in statistics support performance analysis (a counter sketch follows this list):

  • Hit/miss ratios: Pool efficiency metrics
  • Allocation counts: Dynamic allocation frequency
  • Contention metrics: CAS retry counts (future enhancement)
  • Memory usage: Current and peak buffer counts
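
A sketch of how such counters are typically kept without perturbing the hot path (field names are illustrative, not the exact BufferPoolStats layout):

use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct PoolStats {
    hits: AtomicU64,        // buffer served from the pool
    misses: AtomicU64,      // pool empty, fell back to allocation
    allocations: AtomicU64, // dynamic allocation count
    peak_buffers: AtomicU64,
}

impl PoolStats {
    fn record_hit(&self) {
        // Relaxed is enough: counters are monotonic diagnostics and
        // impose no synchronization on the hot path.
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}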

Structs§

BufferCounts
Available buffer counts
BufferGuard
RAII guard for automatic buffer return
BufferPoolStats
Buffer pool statistics
LockFreeBufferPool
Lock-free buffer pool for high-performance data processing
LockFreeBufferPoolConfig
Configuration for the lock-free buffer pool