Module lockfree_buffer_pool


Lock-free buffer pool for zero-allocation high-frequency data processing

This module implements a lock-free buffer pool using the Treiber stack algorithm, optimized for minimal latency in HFT environments where even small synchronization overheads can blow a microsecond-level latency budget.

§HFT Performance Rationale

§Lock Contention Elimination

Traditional mutex-based pools create critical bottlenecks in HFT systems:

  • Lock acquisition: 10-50ns overhead per operation
  • Thread blocking: Unpredictable delays when multiple threads compete
  • Priority inversion: Low-priority threads can block high-priority trading logic
  • Cache coherency: Mutex state changes invalidate CPU caches across cores

§Memory Allocation Costs

Dynamic allocation is prohibitive in latency-sensitive paths:

  • malloc/free overhead: 50-200ns per allocation
  • Fragmentation: Degrades cache performance over time
  • TLB pressure: Memory mapping overhead affects page table lookups
  • NUMA effects: Cross-node allocation can add 100+ ns latency

§Lock-Free Architecture

§Treiber Stack Algorithm

Uses atomic compare-and-swap (CAS) operations for thread-safe access (a minimal sketch follows this list):

  • ABA problem protection: Pointer-based nodes are never recycled while the pool lives, so stale pointer values cannot reappear
  • Memory ordering: Acquire/Release semantics ensure proper synchronization
  • Cache-line alignment: 64-byte alignment prevents false sharing
  • Lock-free progress: At least one thread completes its CAS on every contended round, so no thread can block others indefinitely
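
For reference, a minimal Treiber push/pop sketch in Rust. The type and method names here are illustrative, not this module's API, and the real pool layers its ABA protection and buffer typing on top:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    buffer: T,
    next: AtomicPtr<Node<T>>,
}

struct TreiberStack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> TreiberStack<T> {
    /// Pop the top node, or null if the stack is empty.
    fn pop(&self) -> *mut Node<T> {
        loop {
            // Acquire pairs with the Release in push, making the
            // popped node's contents visible to this thread.
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return ptr::null_mut();
            }
            // Sound only because pool nodes are never deallocated
            // while the pool is alive.
            let next = unsafe { (*head).next.load(Ordering::Relaxed) };
            if self
                .head
                .compare_exchange(head, next, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                return head; // we won the race for this node
            }
            // CAS failed: another thread changed head; retry.
        }
    }

    /// Push a node back onto the stack.
    fn push(&self, node: *mut Node<T>) {
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next.store(head, Ordering::Relaxed) };
            // Release publishes the node's contents before it
            // becomes the new head.
            if self
                .head
                .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }
}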

§Memory Layout Optimization

Cache Line 0: [ Stack Head Pointer (8B) | Padding (56B) ]
Cache Line 1: [ Node->Buffer Ptr (8B) | Node->Next (8B) | Padding (48B) ]
Cache Line 2: [ Actual Buffer Data ... ]
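
A sketch of how this layout might be expressed in Rust; repr(align(64)) pads each struct out to a full cache line (struct and field names are illustrative):

use std::sync::atomic::AtomicPtr;

#[repr(align(64))]
struct Head<T> {
    // Cache line 0: 8B pointer; repr(align(64)) supplies the 56B of
    // trailing padding, so CAS traffic on the head cannot false-share
    // with neighboring data.
    head: AtomicPtr<PoolNode<T>>,
}

#[repr(align(64))]
struct PoolNode<T> {
    // Cache line 1: 8B buffer pointer + 8B intrusive link + 48B padding.
    buffer: *mut T,
    next: AtomicPtr<PoolNode<T>>,
}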

§Buffer Type Specialization

Separate lock-free stacks are maintained per buffer type (sketched after this list):

  • Serialization buffers: JSON/Binary encoding (128KB default)
  • Compression buffers: Data compression operations (64KB default)
  • SIMD buffers: Aligned vectorized calculations (VecSimd<f64x4>)
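
As a sketch, the pool might hold one typed stack per class, using the TreiberStack sketched above (field names are illustrative; sizes are the stated defaults):

// One independent lock-free stack per buffer class, so a hot
// serialization path never contends with compression or SIMD users.
struct TypedPools {
    serialization: TreiberStack<Vec<u8>>,  // 128 KB encoding buffers
    compression: TreiberStack<Vec<u8>>,    // 64 KB compression buffers
    simd: TreiberStack<VecSimd<f64x4>>,    // aligned SIMD buffers
}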

§Performance Characteristics

§Latency Metrics

  • Buffer acquisition: <10ns typical, <50ns worst-case
  • Buffer return: <5ns (a single CAS in the uncontended case)
  • Cache miss penalty: ~100ns when buffer not in L3 cache
  • Allocation fallback: 200-500ns when pool exhausted

§Throughput Optimization

  • High concurrency: Scales with CPU cores, approaching linear when thread-local caching absorbs most contention
  • Memory bandwidth: Minimizes allocation traffic
  • CPU cache efficiency: Hot buffers stay in L1/L2 cache

§Safety & Correctness

§Memory Safety

  • No data races: CAS operations provide synchronization
  • No double-free: Buffers tracked through intrusive linked list
  • Bounded memory: Size limits prevent unbounded growth (see the sketch after this list)
  • Leak prevention: Remaining buffers are freed when the pool is dropped
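
A sketch of the bounded return path, reusing the TreiberStack and Node types sketched above (the function name and counter scheme are illustrative):

use std::sync::atomic::{AtomicUsize, Ordering};

fn return_bounded(
    stack: &TreiberStack<Vec<u8>>,
    count: &AtomicUsize,
    max_buffers: usize,
    node: *mut Node<Vec<u8>>,
) {
    // fetch_add returns the previous count: if we were under the cap,
    // the node goes back on the stack; otherwise undo the increment
    // and free the node so resident memory stays bounded.
    if count.fetch_add(1, Ordering::Relaxed) < max_buffers {
        stack.push(node);
    } else {
        count.fetch_sub(1, Ordering::Relaxed);
        drop(unsafe { Box::from_raw(node) }); // deallocate instead of pooling
    }
}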

§ABA Prevention

Uses pointer-based nodes rather than array indices:

use std::sync::atomic::AtomicPtr;

// Safe: node addresses are unique and nodes are never deallocated
// while the pool is alive, so a stale pointer cannot alias a
// recycled node.
struct BufferNode<T> {
    buffer: T,                       // the pooled buffer itself
    next: AtomicPtr<BufferNode<T>>,  // intrusive free-list link
}

§Integration Patterns

§RAII Buffer Guards

Automatic buffer return prevents leaks:

let mut buffer_guard = pool.get_serialization_buffer_guard();
let buf = buffer_guard.data_mut(); // mutable access to the pooled buffer
// ... fill `buf` ...
// Buffer is returned to the pool automatically when the guard drops.
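
Under the hood, the guard pattern reduces to a Drop impl that pushes the buffer back. A sketch over the TreiberStack/Node types above (the real BufferGuard is listed under Structs below):

struct Guard<'a, T> {
    stack: &'a TreiberStack<T>,
    node: *mut Node<T>,
}

impl<T> Guard<'_, T> {
    /// Mutable access to the pooled buffer while the guard is alive.
    fn data_mut(&mut self) -> &mut T {
        unsafe { &mut (*self.node).buffer }
    }
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        // Return path: a single CAS in the uncontended case, so scope
        // exit stays cheap even on hot paths.
        self.stack.push(self.node);
    }
}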

§Thread-Local Caching

Can be combined with thread-local storage to cut atomic traffic even further (sketched after this list):

  • Each thread maintains a small local cache
  • Fall back to global lock-free pool when cache misses
  • Reduces atomic operations to near-zero in steady state
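
A self-contained sketch of the pattern; the cache size and the global pool stand-ins are illustrative:

use std::cell::RefCell;

const LOCAL_CAP: usize = 4; // illustrative per-thread cache size

thread_local! {
    static LOCAL_CACHE: RefCell<Vec<Vec<u8>>> =
        RefCell::new(Vec::with_capacity(LOCAL_CAP));
}

// Stand-ins for the global lock-free pool's pop/push so the sketch
// compiles on its own; the real calls would hit the Treiber stacks.
fn global_acquire() -> Vec<u8> { Vec::with_capacity(128 * 1024) }
fn global_release(_buf: Vec<u8>) {}

/// Fast path: serve from the thread-local cache with zero atomics;
/// fall back to the shared pool only on a local miss.
fn acquire() -> Vec<u8> {
    LOCAL_CACHE.with(|c| c.borrow_mut().pop()).unwrap_or_else(global_acquire)
}

fn release(buf: Vec<u8>) {
    LOCAL_CACHE.with(|c| {
        let mut cache = c.borrow_mut();
        if cache.len() < LOCAL_CAP {
            cache.push(buf); // steady state: no atomic operations
        } else {
            global_release(buf); // overflow back to the shared pool
        }
    });
}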

§Monitoring & Tuning

Built-in statistics support performance analysis (a counter sketch follows this list):

  • Hit/miss ratios: Pool efficiency metrics
  • Allocation counts: Dynamic allocation frequency
  • Contention metrics: CAS retry counts (future enhancement)
  • Memory usage: Current and peak buffer counts
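
A sketch of how such counters are typically kept without perturbing the hot path (field names are illustrative, not the exact BufferPoolStats layout):

use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct PoolStats {
    hits: AtomicU64,        // buffer served from the pool
    misses: AtomicU64,      // pool empty, fell back to allocation
    allocations: AtomicU64, // dynamic allocation count
    peak_buffers: AtomicU64,
}

impl PoolStats {
    fn record_hit(&self) {
        // Relaxed is enough: counters are monotonic diagnostics and
        // impose no synchronization on the hot path.
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}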

Structs§

BufferCounts
Available buffer counts
BufferGuard
RAII guard for automatic buffer return
BufferPoolStats
Buffer pool statistics
LockFreeBufferPool
Lock-free buffer pool for high-performance data processing
LockFreeBufferPoolConfig
Configuration for the lock-free buffer pool