Vectorization-friendly utilities for HFT calculations
This module provides high-performance implementations of common HFT calculations specifically designed to maximize compiler auto-vectorization and CPU efficiency.
§HFT Performance Rationale
§Auto-Vectorization Benefits
Modern CPUs can execute multiple operations simultaneously through SIMD instructions:
- SSE/AVX instructions: Process 2-8 floating-point operations per cycle
- Pipeline utilization: Keeps CPU execution units busy
- Memory bandwidth: Vectorized loads/stores maximize memory throughput
- Branch reduction: Eliminates costly conditional jumps in inner loops
§Critical Path Calculations
HFT systems require these calculations in sub-microsecond timeframes:
- VWAP calculations: Volume-weighted average price across order book levels
- Volatility estimates: Realized volatility from tick data
- Order flow analysis: Buy/sell pressure and market imbalance
- Moving averages: Exponential and simple moving averages for signals
§Memory Alignment Strategy
§Cache Line Optimization
32-byte alignment (AVX):   ┌──── 4 x f64 ────┐
64-byte alignment (Cache): ┌─────── 8 x f64 ───────┐
§AlignedBuffer Types
- AlignedBuffer32: Optimized for AVX/AVX2 SIMD operations
- AlignedBuffer64: Cache-line aligned to prevent false sharing
- Generic implementation: Works with any Copy + Default type
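As an illustration of the alignment strategy, a 32-byte aligned wrapper can be expressed with repr(align). This is only a minimal sketch; the type name, capacity, and API below are assumptions and do not show the crate's actual AlignedBuffer32 definition.

// Hypothetical sketch: a 32-byte aligned, fixed-capacity f64 buffer.
// The real AlignedBuffer32 / AlignedBuffer64 APIs may differ.
#[repr(align(32))]
#[derive(Clone, Copy)]
struct Aligned32<const N: usize> {
    data: [f64; N],
}

impl<const N: usize> Aligned32<N> {
    fn new() -> Self {
        Self { data: [0.0; N] }
    }
}

fn main() {
    let buf = Aligned32::<8>::new();
    // The address is a multiple of 32, so AVX loads/stores can be aligned.
    assert_eq!(&buf as *const _ as usize % 32, 0);
    println!("buffer aligned at {:p}", &buf);
}

A 64-byte variant (repr(align(64))) follows the same pattern and additionally keeps each buffer on its own cache line, which is what prevents false sharing between threads.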
§Compiler Vectorization Patterns
§Loop Structure Optimization
Designed to trigger auto-vectorization:
// Vectorization-friendly pattern
for i in 0..values.len() {
    result[i] = values[i].mul_add(weights[i], accumulator);
}
§Dependency Chain Breaking
Uses multiple accumulators to improve instruction-level parallelism:
let mut sum1 = 0.0;
let mut sum2 = 0.0;
let mut sum3 = 0.0;
let mut sum4 = 0.0;
// Process values and weights in 4-element chunks to break dependencies
for (v, w) in values.chunks_exact(4).zip(weights.chunks_exact(4)) {
    sum1 = v[0].mul_add(w[0], sum1);
    sum2 = v[1].mul_add(w[1], sum2);
    sum3 = v[2].mul_add(w[2], sum3);
    sum4 = v[3].mul_add(w[3], sum4);
}
§Financial Calculation Implementations
§Volume-Weighted Average Price (VWAP)
- Multiple accumulators: Breaks dependency chains for better ILP
- FMA operations: Uses mul_add for better precision and performance
- Chunk processing: 4-element chunks optimize for SIMD width
- Remainder handling: Processes leftover elements efficiently
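A self-contained sketch of this pattern is shown below: VWAP as sum(price × volume) / sum(volume) with four independent accumulators, mul_add, 4-element chunks, and explicit remainder handling. The function name and signature are assumptions for illustration, not the crate's calculate_vwap.

// Hypothetical sketch (not the crate's calculate_vwap).
fn vwap_sketch(prices: &[f64], volumes: &[f64]) -> f64 {
    assert_eq!(prices.len(), volumes.len());
    let (mut pv0, mut pv1, mut pv2, mut pv3) = (0.0f64, 0.0, 0.0, 0.0);
    let mut vol_sum = 0.0;

    for (p, v) in prices.chunks_exact(4).zip(volumes.chunks_exact(4)) {
        // Independent accumulators break the dependency chain (better ILP).
        pv0 = p[0].mul_add(v[0], pv0);
        pv1 = p[1].mul_add(v[1], pv1);
        pv2 = p[2].mul_add(v[2], pv2);
        pv3 = p[3].mul_add(v[3], pv3);
        vol_sum += v[0] + v[1] + v[2] + v[3];
    }

    // Remainder handling for lengths that are not a multiple of 4.
    let mut pv_rem = 0.0;
    for (p, v) in prices
        .chunks_exact(4)
        .remainder()
        .iter()
        .zip(volumes.chunks_exact(4).remainder())
    {
        pv_rem = p.mul_add(*v, pv_rem);
        vol_sum += v;
    }

    ((pv0 + pv1) + (pv2 + pv3) + pv_rem) / vol_sum
}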
§Exponential Moving Average (EMA)
- Vectorization-friendly: Linear dependency structure
- Precision optimization: Uses mul_add for numerical stability
- Cache efficiency: Sequential memory access pattern
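The EMA recurrence ema = alpha * price + (1 - alpha) * ema_prev can be rewritten as a single fused multiply-add per element, which is the precision trick the bullets refer to. The sketch below is an illustration only; the actual calculate_ema_vectorized signature is assumed here.

// Hypothetical EMA sketch: one mul_add per element, seeded with the first price.
fn ema_sketch(prices: &[f64], alpha: f64) -> Vec<f64> {
    let mut out = Vec::with_capacity(prices.len());
    let mut ema = match prices.first() {
        Some(&p) => p,
        None => return out,
    };
    for &p in prices {
        // ema + alpha * (p - ema) == alpha * p + (1 - alpha) * ema
        ema = (p - ema).mul_add(alpha, ema);
        out.push(ema);
    }
    out
}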
§Order Flow Imbalance (OFI)
- Parallel processing: Independent calculations per element
- Sign operations: Vectorizable using the signum() function
- Memory layout: Structure-of-arrays for optimal access
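The exact OFI formula used by calculate_ofi is not documented on this page, so the kernel below is only an assumed definition chosen to illustrate the element-wise, branch-free, structure-of-arrays style the bullets describe.

// Hypothetical element-wise kernel; the OFI definition here is an assumption.
fn ofi_sketch(
    bid_volumes: &[f64],
    ask_volumes: &[f64],
    bid_changes: &[f64],
    ask_changes: &[f64],
) -> f64 {
    bid_volumes
        .iter()
        .zip(ask_volumes)
        .zip(bid_changes)
        .zip(ask_changes)
        .map(|(((bv, av), bc), ac)| {
            // Branch-free sign handling: signum() vectorizes far more readily
            // than an if/else per element.
            bc.signum() * bv - ac.signum() * av
        })
        .sum()
}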
§Realized Volatility
- Squared returns: Vectorizable multiplication operations
- Multiple accumulators: Reduces dependency chain length
- Chunk processing: Maximizes SIMD utilization
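As a compact illustration of these three points, a squared-returns kernel with two accumulators might look like the sketch below; whether the real calculate_realized_volatility annualizes or scales the result is not stated here, so this just takes the square root of the sum of squared returns.

// Hypothetical realized-volatility sketch: sqrt(sum of squared returns).
fn realized_vol_sketch(returns: &[f64]) -> f64 {
    let mut s0 = 0.0;
    let mut s1 = 0.0;
    for pair in returns.chunks_exact(2) {
        s0 = pair[0].mul_add(pair[0], s0); // r^2 accumulated independently
        s1 = pair[1].mul_add(pair[1], s1);
    }
    let mut rem = 0.0;
    for r in returns.chunks_exact(2).remainder() {
        rem = r.mul_add(*r, rem);
    }
    (s0 + s1 + rem).sqrt()
}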
§Performance Characteristics
§Latency Improvements
- 2-4x speedup: From auto-vectorization on modern CPUs
- Cache efficiency: Aligned access patterns reduce cache misses
- Memory bandwidth: Vectorized loads maximize throughput
- Branch elimination: Reduces pipeline stalls
§Throughput Optimization
- High concurrency: Functions are thread-safe and reentrant
- Memory efficiency: Minimal allocation through in-place operations
- CPU utilization: Keeps multiple execution units busy
§Integration Patterns
§Real-Time Trading
// Market making: Calculate fair value
let fair_value = calculate_weighted_mid_price(
    &bid_prices, &bid_sizes,
    &ask_prices, &ask_sizes,
);

// Risk management: Estimate volatility
let vol = calculate_realized_volatility(&recent_returns);
§Strategy Development
// Alpha generation: Order flow analysis
let ofi = calculate_ofi(
    &bid_volumes, &ask_volumes,
    &bid_changes, &ask_changes,
);

// Signal processing: Moving averages
let ema = calculate_ema_vectorized(&prices, alpha);
§Compiler Optimization Requirements
§Build Flags
For optimal performance, use:
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
Note that target-cpu is a rustc codegen flag, not a Cargo profile key; it is enabled separately (see the sketch after the feature list below).
§Target Features
- AVX2: 256-bit SIMD operations
- FMA: Fused multiply-add instructions
- BMI2: Bit manipulation for efficient indexing
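To actually compile with these features, target-cpu has to reach rustc. A typical project-level setup (an assumption about your build environment, not part of this crate) is a .cargo/config.toml alongside the release profile shown above:

# .cargo/config.toml — hypothetical project-level setup
[build]
rustflags = ["-C", "target-cpu=native"]  # enables AVX2/FMA/BMI2 where the host CPU supports them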
Structs§
- AlignedBuffer32 - 32-byte aligned buffer optimized for AVX/AVX2 SIMD operations
- AlignedBuffer64 - 64-byte aligned buffer for cache line optimization and false sharing prevention
Functions§
- calculate_ema_vectorized - Calculate exponential moving average with vectorization
- calculate_log_returns - Vectorized log returns calculation
- calculate_ofi - Calculate order flow imbalance (OFI) with vectorization
- calculate_order_book_imbalance - Calculate order book imbalance with vectorization
- calculate_realized_volatility - Calculate realized volatility with vectorization
- calculate_returns - Vectorized returns calculation
- calculate_vwap - Calculate VWAP in a vectorization-friendly way
- calculate_weighted_mid_price - Calculate weighted mid price with vectorization
- moving_sum - Fast moving sum calculation
Type Aliases§
- LegacyAlignedBuffer - Legacy type alias for backward compatibility with existing f64 buffers