Module vectorized

Vectorization-friendly utilities for HFT calculations

This module provides high-performance implementations of common HFT calculations specifically designed to maximize compiler auto-vectorization and CPU efficiency.

§HFT Performance Rationale

§Auto-Vectorization Benefits

Modern CPUs can execute multiple operations simultaneously through SIMD instructions:

SSE/AVX instructions: Process 2 (SSE2) to 4 (AVX) f64 values, or up to 8 f32 values, per instruction
Pipeline utilization: Keeps CPU execution units busy
Memory bandwidth: Vectorized loads/stores maximize memory throughput
Branch reduction: Eliminates costly conditional jumps in inner loops

§Critical Path Calculations

HFT systems require these calculations in sub-microsecond timeframes:

VWAP calculations: Volume-weighted average price across order book levels
Volatility estimates: Realized volatility from tick data
Order flow analysis: Buy/sell pressure and market imbalance
Moving averages: Exponential and simple moving averages for signals

§Memory Alignment Strategy

§Cache Line Optimization

32-byte alignment (AVX):  ┌──── 4 x f64 ────┐
64-byte alignment (Cache): ┌─────── 8 x f64 ───────┐

§AlignedBuffer Types

AlignedBuffer32: Optimized for AVX/AVX2 SIMD operations
AlignedBuffer64: Cache-line aligned to prevent false sharing
Generic implementation: Works with any Copy + Default type
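As an illustration of the alignment strategy above, a buffer type can request its alignment through `#[repr(align(N))]`. The sketch below is a minimal, hypothetical stand-in: the name `Aligned32`, the const-generic length, and the accessor methods are assumptions made for this example and are not taken from the actual `AlignedBuffer32` definition.

```rust
/// Hypothetical 32-byte aligned, fixed-size buffer (not the crate's real type).
/// `repr(C)` keeps `data` at offset 0, so the first element inherits the
/// 32-byte alignment of the struct itself.
#[repr(C, align(32))]
#[derive(Clone, Copy)]
pub struct Aligned32<T: Copy + Default, const N: usize> {
    data: [T; N],
}

impl<T: Copy + Default, const N: usize> Aligned32<T, N> {
    /// Creates a buffer with every slot set to `T::default()`.
    pub fn new() -> Self {
        Self { data: [T::default(); N] }
    }

    /// Borrows the contents as a slice starting on a 32-byte boundary.
    pub fn as_slice(&self) -> &[T] {
        &self.data
    }

    /// Mutable view of the same storage.
    pub fn as_mut_slice(&mut self) -> &mut [T] {
        &mut self.data
    }
}
```

A 64-byte variant for cache-line alignment would differ only in the attribute, `#[repr(C, align(64))]`.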

§Compiler Vectorization Patterns

§Loop Structure Optimization

Designed to trigger auto-vectorization:

// Vectorization-friendly pattern
for i in 0..values.len() {
    result[i] = values[i].mul_add(weights[i], accumulator);
}
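Indexing three slices by `i` leaves a bounds check on each access unless the compiler can prove the lengths match, and those checks can block vectorization. One common way to avoid this, sketched here with a hypothetical function name `scaled_add`, is to truncate all slices to a shared length once and then iterate with `zip`:

```rust
/// Writes `values[i] * weights[i] + offset` into `result[i]` for every index
/// that all three slices have in common.
pub fn scaled_add(values: &[f64], weights: &[f64], offset: f64, result: &mut [f64]) {
    // Truncate to a common length up front so the loop body needs no bounds checks.
    let n = values.len().min(weights.len()).min(result.len());
    let (values, weights, result) = (&values[..n], &weights[..n], &mut result[..n]);

    // Each iteration is independent of the others, which is the shape
    // auto-vectorizers handle best.
    for ((r, &v), &w) in result.iter_mut().zip(values).zip(weights) {
        *r = v.mul_add(w, offset);
    }
}
```

Whether the loop is actually vectorized still depends on the optimization level and the enabled target features, so inspecting the generated assembly is the only reliable confirmation.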

§Dependency Chain Breaking

Uses multiple accumulators to improve instruction-level parallelism:

let mut sum1 = 0.0;
let mut sum2 = 0.0;
let mut sum3 = 0.0;
let mut sum4 = 0.0;

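// Process values and weights in matching 4-element chunks to break dependencies
for (v, w) in values.chunks_exact(4).zip(weights.chunks_exact(4)) {
    sum1 = v[0].mul_add(w[0], sum1);
    sum2 = v[1].mul_add(w[1], sum2);
    sum3 = v[2].mul_add(w[2], sum3);
    sum4 = v[3].mul_add(w[3], sum4);
}
// Combine the partial sums; leftover elements (len % 4) are handled separately
let total = (sum1 + sum2) + (sum3 + sum4);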

§Financial Calculation Implementations

§Volume-Weighted Average Price (VWAP)

Multiple accumulators: Breaks dependency chains for better ILP
FMA operations: Uses mul_add, which rounds once per multiply-add; it is fast only when the target supports hardware FMA
Chunk processing: 4-element chunks match the four f64 lanes of a 256-bit AVX register
Remainder handling: Processes leftover elements efficiently
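The four points above can be combined into a single routine. The following is a minimal sketch under stated assumptions: the name `vwap_sketch`, the `Option<f64>` return type, and the choice to reject mismatched lengths are illustrative and may not match the signature of `calculate_vwap`.

```rust
/// Volume-weighted average price: sum(price * volume) / sum(volume).
/// Returns `None` when the slices differ in length or total volume is not positive.
pub fn vwap_sketch(prices: &[f64], volumes: &[f64]) -> Option<f64> {
    if prices.len() != volumes.len() {
        return None;
    }

    // Four independent lanes for the notional (price * volume) and the volume.
    let mut pv = [0.0_f64; 4];
    let mut vol = [0.0_f64; 4];

    let price_chunks = prices.chunks_exact(4);
    let volume_chunks = volumes.chunks_exact(4);
    // The remainders borrow from the original slices, not from the iterators.
    let price_tail = price_chunks.remainder();
    let volume_tail = volume_chunks.remainder();

    for (p, v) in price_chunks.zip(volume_chunks) {
        for lane in 0..4 {
            pv[lane] = p[lane].mul_add(v[lane], pv[lane]);
            vol[lane] += v[lane];
        }
    }

    // Combine the lanes, then fold in the 0 to 3 leftover elements.
    let mut pv_total = (pv[0] + pv[1]) + (pv[2] + pv[3]);
    let mut vol_total = (vol[0] + vol[1]) + (vol[2] + vol[3]);
    for (&p, &v) in price_tail.iter().zip(volume_tail) {
        pv_total = p.mul_add(v, pv_total);
        vol_total += v;
    }

    if vol_total > 0.0 { Some(pv_total / vol_total) } else { None }
}
```

Because the lanes are summed in a different order than a plain left-to-right loop, the result can differ from a scalar implementation in the last few bits.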

§Exponential Moving Average (EMA)

Linear recurrence: Each value depends only on the previous one, so the loop body reduces to a single multiply-add per element
Precision optimization: Uses mul_add so each update is rounded once
Cache efficiency: Sequential memory access pattern
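For reference, the EMA recurrence can be written as ema[t] = ema[t-1] + alpha * (price[t] - ema[t-1]), which maps onto one `mul_add` per element. The sketch below assumes the series is seeded with the first price and returns a freshly allocated `Vec<f64>`; both choices, and the name `ema_sketch`, are assumptions rather than the documented behavior of `calculate_ema_vectorized`.

```rust
/// EMA seeded with the first price:
///   ema[0] = prices[0]
///   ema[t] = ema[t-1] + alpha * (prices[t] - ema[t-1])
pub fn ema_sketch(prices: &[f64], alpha: f64) -> Vec<f64> {
    let mut out = Vec::with_capacity(prices.len());
    let Some((&first, rest)) = prices.split_first() else {
        return out; // empty input yields an empty series
    };

    let mut ema = first;
    out.push(ema);
    for &price in rest {
        // One fused multiply-add per step: (price - ema) * alpha + ema.
        ema = (price - ema).mul_add(alpha, ema);
        out.push(ema);
    }
    out
}
```

Each step needs the previous result, so this loop runs sequentially along the time axis; the gains here come from the sequential memory access and the short loop body rather than from SIMD lanes.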

§Order Flow Imbalance (OFI)

Parallel processing: Independent calculations per element
Sign operations: Vectorizable using signum() function
Memory layout: Structure-of-arrays for optimal access
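To make the `signum()` point concrete, here is a deliberately simplified signed-flow imbalance over two parallel arrays. It is not claimed to be the formula behind `calculate_ofi`; the function name, the inputs (per-event price changes and volumes), and the normalization to the range [-1, 1] are all assumptions chosen for illustration.

```rust
/// Simplified signed-flow imbalance:
///   sum(sign(price_change) * volume) / sum(volume)
/// Returns 0.0 when the total volume is not positive.
pub fn signed_flow_imbalance(price_changes: &[f64], volumes: &[f64]) -> f64 {
    // Structure-of-arrays input: truncate both arrays to their common length.
    let n = price_changes.len().min(volumes.len());
    let (mut signed, mut total) = (0.0_f64, 0.0_f64);

    for (&d, &v) in price_changes[..n].iter().zip(&volumes[..n]) {
        // f64::signum maps +0.0 to 1.0, so unchanged prices are masked out explicitly.
        let sign = if d == 0.0 { 0.0 } else { d.signum() };
        signed = sign.mul_add(v, signed);
        total += v;
    }

    if total > 0.0 { signed / total } else { 0.0 }
}
```

The zero check is a simple select rather than a data-dependent jump in practice, but that is up to the optimizer and worth verifying in the generated code.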

§Realized Volatility

Squared returns: Vectorizable multiplication operations
Multiple accumulators: Reduces dependency chain length
Chunk processing: Maximizes SIMD utilization
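A small sketch of the calculation described above, assuming realized volatility is defined as the square root of the sum of squared log returns with no annualization or scaling. The function names are hypothetical, and the module's own `calculate_log_returns` and `calculate_realized_volatility` may use different conventions.

```rust
/// Log returns r[i] = ln(p[i+1] / p[i]); empty when fewer than two prices are given.
pub fn log_returns_sketch(prices: &[f64]) -> Vec<f64> {
    prices.windows(2).map(|w| (w[1] / w[0]).ln()).collect()
}

/// Realized volatility as sqrt(sum of squared returns), without annualization.
pub fn realized_vol_sketch(returns: &[f64]) -> f64 {
    // Four partial sums shorten the floating-point dependency chain.
    let mut acc = [0.0_f64; 4];
    let chunks = returns.chunks_exact(4);
    let tail = chunks.remainder();

    for c in chunks {
        for lane in 0..4 {
            acc[lane] = c[lane].mul_add(c[lane], acc[lane]);
        }
    }

    // Combine the lanes, then add the 0 to 3 leftover squared returns.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for &r in tail {
        sum = r.mul_add(r, sum);
    }
    sum.sqrt()
}
```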

§Performance Characteristics

§Latency Improvements

Up to 2-4x speedup: From auto-vectorization on modern CPUs, depending on data size and enabled target features
Cache efficiency: Aligned access patterns reduce cache misses
Memory bandwidth: Vectorized loads maximize throughput
Branch elimination: Reduces pipeline stalls

§Throughput Optimization

High concurrency: Functions are thread-safe and reentrant
Memory efficiency: Minimal allocation through in-place operations
CPU utilization: Keeps multiple execution units busy

§Integration Patterns

§Real-Time Trading

// Market making: Calculate fair value
let fair_value = calculate_weighted_mid_price(
    &bid_prices, &bid_sizes,
    &ask_prices, &ask_sizes
);

// Risk management: Estimate volatility
let vol = calculate_realized_volatility(&recent_returns);

§Strategy Development

// Alpha generation: Order flow analysis
let ofi = calculate_ofi(
    &bid_volumes, &ask_volumes,
    &bid_changes, &ask_changes
);

// Signal processing: Moving averages
let ema = calculate_ema_vectorized(&prices, alpha);

§Compiler Optimization Requirements

§Build Flags

For optimal performance, use:

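[profile.release]
opt-level = 3
lto = true
codegen-units = 1

# target-cpu is a rustc codegen option, not a Cargo profile key.
# Set it in .cargo/config.toml (or through the RUSTFLAGS environment variable):
[build]
rustflags = ["-C", "target-cpu=native"]  # Enables AVX2/FMA if available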

§Target Features

AVX2: 256-bit SIMD operations
FMA: Fused multiply-add instructions
BMI2: Bit manipulation for efficient indexing
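Since these features vary across machines, it can help to check them before relying on the fast paths. The sketch below uses the standard library's `is_x86_feature_detected!` macro for a runtime check and `cfg!(target_feature = ...)` for a compile-time check; the two function names are hypothetical helpers and are not part of this module.

```rust
/// Runtime check: does the CPU this process is running on support AVX2, FMA and BMI2?
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn has_avx2_fma_bmi2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("fma")
        && std::arch::is_x86_feature_detected!("bmi2")
}

/// On non-x86 targets this sketch simply reports `false`.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn has_avx2_fma_bmi2() -> bool {
    false
}

/// Compile-time check: was this binary built with the FMA target feature enabled?
/// Without it, `f64::mul_add` may fall back to a much slower software routine.
pub fn compiled_with_fma() -> bool {
    cfg!(target_feature = "fma")
}
```

Auto-vectorized code only uses these instructions when they are enabled at compile time, so the runtime check is mainly useful for logging or for refusing to start on an unexpected host.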

Structs§

AlignedBuffer32: 32-byte aligned buffer optimized for AVX/AVX2 SIMD operations
AlignedBuffer64: 64-byte aligned buffer for cache line optimization and false sharing prevention

Functions§

calculate_ema_vectorized: Calculate exponential moving average with vectorization
calculate_log_returns: Vectorized log returns calculation
calculate_ofi: Calculate order flow imbalance (OFI) with vectorization
calculate_order_book_imbalance: Calculate order book imbalance with vectorization
calculate_realized_volatility: Calculate realized volatility with vectorization
calculate_returns: Vectorized returns calculation
calculate_vwap: Calculate VWAP in a vectorization-friendly way
calculate_weighted_mid_price: Calculate weighted mid price with vectorization
moving_sum: Fast moving sum calculation

Type Aliases§

LegacyAlignedBuffer: Legacy type alias for backward compatibility with existing f64 buffers
