Lock-free buffer pool for zero-allocation high-frequency data processing
This module implements a lock-free buffer pool using the Treiber stack algorithm, optimized for minimal latency in HFT environments, where even nanoseconds of synchronization overhead can break microsecond-level latency budgets.
§HFT Performance Rationale
§Lock Contention Elimination
Traditional mutex-based pools create critical bottlenecks in HFT systems:
- Lock acquisition: 10-50ns overhead per operation
- Thread blocking: Unpredictable delays when multiple threads compete
- Priority inversion: Low-priority threads can block high-priority trading logic
- Cache coherency: Mutex state changes invalidate CPU caches across cores
§Memory Allocation Costs
Dynamic allocation is prohibitive in latency-sensitive paths:
- malloc/free overhead: 50-200ns per allocation
- Fragmentation: Degrades cache performance over time
- TLB pressure: Memory mapping overhead affects page table lookups
- NUMA effects: Cross-node allocation can add 100+ ns latency
§Lock-Free Architecture
§Treiber Stack Algorithm
Uses atomic compare-and-swap (CAS) operations for thread-safe access (a sketch follows the list below):
- ABA problem protection: Pointer-based nodes prevent recycling issues
- Memory ordering: Acquire/Release semantics ensure proper synchronization
- Cache-line alignment: 64-byte alignment prevents false sharing
- Lock-free progress: No thread can block others indefinitely, though an individual CAS may retry under contention
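A minimal sketch of those push/pop paths under the stated memory orderings. Node, TreiberStack, and the field layout here are illustrative, not this module's internal types:

use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    buffer: T,
    next: *mut Node<T>,
}

// Head pointer alone on its cache line so CAS traffic does not false-share.
#[repr(align(64))]
struct TreiberStack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> TreiberStack<T> {
    fn push(&self, node: *mut Node<T>) {
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next = head };
            // Release: the node's contents are published before it becomes visible.
            if self
                .head
                .compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    fn pop(&self) -> Option<*mut Node<T>> {
        loop {
            // Acquire: pairs with the Release in push, making the node's contents visible.
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            // The pool never frees nodes while live; that pointer stability is what
            // the module relies on for the ABA safety claimed above.
            let next = unsafe { (*head).next };
            if self
                .head
                .compare_exchange_weak(head, next, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return Some(head);
            }
        }
    }
}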
§Memory Layout Optimization
Cache Line 0: [ Stack Head Pointer (8B) | Padding (56B) ]
Cache Line 1: [ Node->Buffer Ptr (8B) | Node->Next (8B) | Padding (48B) ]
Cache Line 2: [ Actual Buffer Data ... ]
§Buffer Type Specialization
Separate lock-free stacks for different buffer types (sketched after the list):
- Serialization buffers: JSON/Binary encoding (128KB default)
- Compression buffers: Data compression operations (64KB default)
- SIMD buffers: Aligned vectorized calculations (VecSimd<f64x4>)
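A rough picture of that specialization, reusing the TreiberStack sketch from above. VecSimd<f64x4> is this module's SIMD buffer type; the surrounding names are illustrative (the real struct is LockFreeBufferPool, listed below):

// Illustrative layout only: one typed lock-free stack per buffer class.
struct Pool {
    serialization: TreiberStack<Vec<u8>>,   // 128 KB buffers for JSON/Binary encoding
    compression: TreiberStack<Vec<u8>>,     // 64 KB buffers for compression scratch space
    simd: TreiberStack<VecSimd<f64x4>>,     // aligned buffers for vectorized math
}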
§Performance Characteristics
§Latency Metrics
- Buffer acquisition: <10ns typical, <50ns worst-case
- Buffer return: <5ns (simple atomic store)
- Cache miss penalty: ~100ns when buffer not in L3 cache
- Allocation fallback: 200-500ns when pool exhausted
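These figures can be sanity-checked with a crude wall-clock loop. A sketch, assuming the guard API shown under §RAII Buffer Guards; trustworthy single-digit-nanosecond numbers would need core pinning and hardware counters instead:

use std::hint::black_box;
use std::time::Instant;

// Mean get/return round-trip latency, in nanoseconds, over `iters` iterations.
fn mean_roundtrip_ns(pool: &LockFreeBufferPool, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        let guard = pool.get_serialization_buffer_guard();
        black_box(&guard); // keep the round trip from being optimized away
        // guard dropped here: buffer returns to the pool
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}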
§Throughput Optimization
- High concurrency: Scales linearly with CPU cores
- Memory bandwidth: Minimizes allocation traffic
- CPU cache efficiency: Hot buffers stay in L1/L2 cache
§Safety & Correctness
§Memory Safety
- No data races: CAS operations provide synchronization
- No double-free: Buffers tracked through intrusive linked list
- Bounded memory: Size limits prevent unbounded growth
- Leak prevention: Pool cleanup in destructor
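The destructor cleanup can be as simple as draining the intrusive list. A sketch against the TreiberStack type from earlier, assuming nodes were allocated with Box::into_raw:

impl<T> Drop for TreiberStack<T> {
    fn drop(&mut self) {
        // &mut self guarantees exclusive access, so plain reads are fine here.
        let mut head = *self.head.get_mut();
        while !head.is_null() {
            let node = unsafe { Box::from_raw(head) };
            head = node.next; // node (and its buffer) freed at end of iteration
        }
    }
}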
§ABA Prevention
Uses pointer-based nodes rather than array indices:
// Safe: Pointer values are unique
use std::sync::atomic::AtomicPtr;

struct BufferNode<T> {
    buffer: T,
    next: AtomicPtr<BufferNode<T>>,
}
§Integration Patterns
§RAII Buffer Guards
Automatic buffer return prevents leaks:
let mut buffer_guard = pool.get_serialization_buffer_guard();
// Use buffer_guard.data_mut() to write into the pooled buffer
// Automatically returned on scope exit
§Thread-Local Caching
Can be combined with thread-local storage for maximum performance (see the sketch after this list):
- Each thread maintains a small local cache
- Fall back to global lock-free pool when cache misses
- Reduces atomic operations to near-zero in steady state
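A hypothetical front cache along those lines. LOCAL_CACHE, get_buffer, return_buffer, and the pool methods are illustrative names, not this module's API:

use std::cell::RefCell;

thread_local! {
    // Small per-thread stash of serialization buffers.
    static LOCAL_CACHE: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::with_capacity(4));
}

fn get_buffer(pool: &LockFreeBufferPool) -> Vec<u8> {
    LOCAL_CACHE.with(|cache| {
        cache
            .borrow_mut()
            .pop() // hit: zero atomic operations
            .unwrap_or_else(|| pool.get_serialization_buffer()) // miss: global pool
    })
}

fn return_buffer(pool: &LockFreeBufferPool, buf: Vec<u8>) {
    LOCAL_CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        if cache.len() < 4 {
            cache.push(buf); // keep a small hot set per thread
        } else {
            pool.return_serialization_buffer(buf); // overflow back to the global pool
        }
    })
}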
§Monitoring & Tuning
Built-in statistics for performance analysis (see the sketch after this list):
- Hit/miss ratios: Pool efficiency metrics
- Allocation counts: Dynamic allocation frequency
- Contention metrics: CAS retry counts (future enhancement)
- Memory usage: Current and peak buffer counts
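A plausible shape for such counters, kept off the hot path with relaxed atomics. The module's real fields live in BufferPoolStats below, so treat this as a sketch:

use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct PoolStats {
    hits: AtomicU64,        // buffer served from the pool
    misses: AtomicU64,      // pool empty, fell back to dynamic allocation
    allocations: AtomicU64, // total dynamic allocations performed
    peak_buffers: AtomicU64,
}

impl PoolStats {
    fn record_hit(&self) {
        // Relaxed is enough: these are statistics, not synchronization.
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}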
Structs§
- BufferCounts - Available buffer counts
- BufferGuard - RAII guard for automatic buffer return
- BufferPoolStats - Buffer pool statistics
- LockFreeBufferPool - Lock-free buffer pool for high-performance data processing
- LockFreeBufferPoolConfig - Configuration for the lock-free buffer pool