Module vectorized

Vectorization-friendly utilities for HFT calculations

This module provides high-performance implementations of common HFT calculations specifically designed to maximize compiler auto-vectorization and CPU efficiency.

§HFT Performance Rationale

§Auto-Vectorization Benefits

Modern CPUs can execute multiple operations simultaneously through SIMD instructions:

  • SSE/AVX instructions: Process 2 (SSE, 128-bit) to 4 (AVX, 256-bit) f64 values per instruction, or 4 to 8 f32 values
  • Pipeline utilization: Keeps CPU execution units busy
  • Memory bandwidth: Vectorized loads/stores maximize memory throughput
  • Branch reduction: Eliminates costly conditional jumps in inner loops

§Critical Path Calculations

HFT systems require these calculations in sub-microsecond timeframes:

  • VWAP calculations: Volume-weighted average price across order book levels
  • Volatility estimates: Realized volatility from tick data
  • Order flow analysis: Buy/sell pressure and market imbalance
  • Moving averages: Exponential and simple moving averages for signals

§Memory Alignment Strategy

§Cache Line Optimization

32-byte alignment (AVX):  ┌──── 4 x f64 ────┐
64-byte alignment (Cache): ┌─────── 8 x f64 ───────┐

§AlignedBuffer Types

  • AlignedBuffer32: Optimized for AVX/AVX2 SIMD operations
  • AlignedBuffer64: Cache-line aligned to prevent false sharing
  • Generic implementation: Works with any Copy + Default type
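
As a rough illustration of how such buffers can be built, the sketch below uses `#[repr(align(N))]` on a const-generic wrapper. The names `Aligned32` and `Aligned64` and their methods are hypothetical stand-ins; the actual `AlignedBuffer32` and `AlignedBuffer64` types may use a different layout and API.

```rust
/// Hypothetical 32-byte aligned buffer (not the crate's `AlignedBuffer32`).
/// `repr(C)` keeps `data` at offset 0, so the slice itself starts on a
/// 32-byte boundary, matching one 256-bit AVX register.
#[repr(C, align(32))]
#[derive(Debug, Clone, Copy)]
pub struct Aligned32<T: Copy + Default, const N: usize> {
    data: [T; N],
}

/// Hypothetical 64-byte aligned buffer (not the crate's `AlignedBuffer64`).
/// The struct size is rounded up to a multiple of 64 bytes, so two adjacent
/// values never share a 64-byte cache line.
#[repr(C, align(64))]
#[derive(Debug, Clone, Copy)]
pub struct Aligned64<T: Copy + Default, const N: usize> {
    data: [T; N],
}

impl<T: Copy + Default, const N: usize> Aligned32<T, N> {
    /// Creates a buffer filled with `T::default()`.
    pub fn new() -> Self {
        Self { data: [T::default(); N] }
    }

    pub fn as_slice(&self) -> &[T] {
        &self.data
    }

    pub fn as_mut_slice(&mut self) -> &mut [T] {
        &mut self.data
    }
}

impl<T: Copy + Default, const N: usize> Aligned64<T, N> {
    /// Creates a buffer filled with `T::default()`.
    pub fn new() -> Self {
        Self { data: [T::default(); N] }
    }

    pub fn as_slice(&self) -> &[T] {
        &self.data
    }

    pub fn as_mut_slice(&mut self) -> &mut [T] {
        &mut self.data
    }
}
```

A fixed-size array keeps the sketch free of `unsafe`; a heap-backed variant would need an allocation with an explicit `std::alloc::Layout`.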

§Compiler Vectorization Patterns

§Loop Structure Optimization

Designed to trigger auto-vectorization:

// Vectorization-friendly pattern
for i in 0..values.len() {
    result[i] = values[i].mul_add(weights[i], accumulator);
}
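
Indexing three slices inside a loop can leave bounds checks, and their panic branches, in the loop body, which may block vectorization. One common way around this is to iterate with `zip`, as in the sketch below. The function name `fused_scale_add` and its signature are assumptions made for illustration, not part of this module.

```rust
/// Computes `values[i] * weights[i] + offset` for each element and writes
/// it into `result`. Hypothetical helper, shown only to illustrate the
/// loop shape.
pub fn fused_scale_add(values: &[f64], weights: &[f64], offset: f64, result: &mut [f64]) {
    // `zip` stops at the shortest input, so the loop body contains no
    // index arithmetic and no bounds checks that could panic.
    for ((r, &v), &w) in result.iter_mut().zip(values).zip(weights) {
        // Each output depends only on its own inputs, which is the
        // independence the auto-vectorizer looks for.
        *r = v.mul_add(w, offset);
    }
}
```

Whether the compiler actually emits SIMD code for this loop depends on the target features enabled at build time, so it is worth confirming in the generated assembly.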

§Dependency Chain Breaking

Uses multiple accumulators to improve instruction-level parallelism:

let mut sum1 = 0.0;
let mut sum2 = 0.0;
let mut sum3 = 0.0;
let mut sum4 = 0.0;

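// Process values and weights in matching 4-element chunks to break dependencies
for (v, w) in values.chunks_exact(4).zip(weights.chunks_exact(4)) {
    sum1 = v[0].mul_add(w[0], sum1);
    sum2 = v[1].mul_add(w[1], sum2);
    sum3 = v[2].mul_add(w[2], sum3);
    sum4 = v[3].mul_add(w[3], sum4);
}
// Combine the partial sums; leftover elements (len % 4) are handled separately
let total = (sum1 + sum2) + (sum3 + sum4);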

§Financial Calculation Implementations

§Volume-Weighted Average Price (VWAP)

  • Multiple accumulators: Breaks dependency chains for better ILP
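  • FMA operations: Uses mul_add, which rounds once instead of twice and maps to a single instruction when the target supports FMA (without hardware FMA it can be slower than a separate multiply and add)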
  • Chunk processing: 4-element chunks optimize for SIMD width
  • Remainder handling: Processes leftover elements efficiently
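
The four points above can be combined into a short sketch. This is a minimal illustration under stated assumptions, not the module's `calculate_vwap`: the name `vwap_sketch`, the `Option<f64>` return type, and the handling of mismatched lengths and zero volume are all choices made here for the example.

```rust
/// Volume-weighted average price: sum(price * volume) / sum(volume).
/// Returns `None` if the slices differ in length or total volume is zero.
pub fn vwap_sketch(prices: &[f64], volumes: &[f64]) -> Option<f64> {
    if prices.len() != volumes.len() {
        return None;
    }

    // Four independent accumulator lanes for each running sum.
    let mut pv = [0.0_f64; 4];
    let mut vol = [0.0_f64; 4];

    let p_chunks = prices.chunks_exact(4);
    let v_chunks = volumes.chunks_exact(4);
    // Grab the leftover (len % 4) elements before the iterators are consumed.
    let (p_rem, v_rem) = (p_chunks.remainder(), v_chunks.remainder());

    for (p, v) in p_chunks.zip(v_chunks) {
        for lane in 0..4 {
            // Fused multiply-add: pv[lane] += p[lane] * v[lane].
            pv[lane] = p[lane].mul_add(v[lane], pv[lane]);
            vol[lane] += v[lane];
        }
    }

    // Pairwise reduction of the lanes.
    let mut pv_sum = (pv[0] + pv[1]) + (pv[2] + pv[3]);
    let mut vol_sum = (vol[0] + vol[1]) + (vol[2] + vol[3]);

    // Scalar tail for the remaining 0 to 3 elements.
    for (&p, &v) in p_rem.iter().zip(v_rem) {
        pv_sum = p.mul_add(v, pv_sum);
        vol_sum += v;
    }

    if vol_sum > 0.0 {
        Some(pv_sum / vol_sum)
    } else {
        None
    }
}
```

Because floating-point addition is not associative, the lane-wise sum can differ in the last bits from a strictly sequential sum; this is also why the compiler will not perform the reordering on its own.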

§Exponential Moving Average (EMA)

  • Vectorization-friendly: Linear recurrence with a short, branch-free loop body (each value depends only on the previous one)
  • Precision optimization: Uses mul_add for numerical stability
  • Cache efficiency: Sequential memory access pattern
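
For reference, the EMA recurrence is `ema[t] = alpha * x[t] + (1 - alpha) * ema[t-1]`, which can be rearranged into one subtraction and one fused multiply-add per step. The sketch below assumes the series is seeded with the first price and returns every intermediate value; `ema_sketch` is a hypothetical name, and `calculate_ema_vectorized` may seed or return its result differently.

```rust
/// Exponential moving average over `prices`, seeded with the first price.
/// Returns one EMA value per input element (empty input gives empty output).
pub fn ema_sketch(prices: &[f64], alpha: f64) -> Vec<f64> {
    let mut out = Vec::with_capacity(prices.len());

    if let Some((&first, rest)) = prices.split_first() {
        let mut ema = first;
        out.push(ema);

        for &p in rest {
            // ema + alpha * (p - ema) is algebraically the same as
            // alpha * p + (1 - alpha) * ema, computed with a single rounding
            // in the multiply-add step.
            ema = (p - ema).mul_add(alpha, ema);
            out.push(ema);
        }
    }

    out
}
```

Each step reads the previous result, so the time dimension itself is sequential; SIMD gains for EMA usually come from processing several independent series side by side.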

§Order Flow Imbalance (OFI)

  • Parallel processing: Independent calculations per element
  • Sign operations: Vectorizable using signum() function
  • Memory layout: Structure-of-arrays for optimal access
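
To show the per-element, sign-based style of calculation, the sketch below computes a tick-rule signed volume from two parallel arrays (structure-of-arrays layout). This is a deliberately simplified stand-in and is not the formula used by `calculate_ofi`; the function name and inputs are assumptions for illustration only.

```rust
/// For each element, returns `+volume` if the price moved up, `-volume` if
/// it moved down, and `0.0` if it was unchanged. Extra elements in the
/// longer slice are ignored.
pub fn signed_volume_sketch(price_changes: &[f64], volumes: &[f64]) -> Vec<f64> {
    price_changes
        .iter()
        .zip(volumes)
        .map(|(&dp, &v)| {
            // `f64::signum` returns 1.0 for +0.0 and -1.0 for -0.0, so an
            // unchanged price has to be mapped to zero explicitly.
            let sign = if dp == 0.0 { 0.0 } else { dp.signum() };
            sign * v
        })
        .collect()
}
```

Every output depends only on its own pair of inputs, so there is no loop-carried dependency. Summing the returned values gives a net buy/sell pressure figure for the window.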

§Realized Volatility

  • Squared returns: Vectorizable multiplication operations
  • Multiple accumulators: Reduces dependency chain length
  • Chunk processing: Maximizes SIMD utilization
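
A minimal sketch of the calculation, assuming the common definition of realized volatility as the square root of the sum of squared log returns, with no annualization or scaling. The names `log_returns_sketch` and `realized_vol_sketch` are hypothetical; `calculate_log_returns` and `calculate_realized_volatility` may normalize their results differently.

```rust
/// Log returns ln(p[t] / p[t-1]) for consecutive prices.
/// Returns one fewer element than the input.
pub fn log_returns_sketch(prices: &[f64]) -> Vec<f64> {
    prices.windows(2).map(|w| (w[1] / w[0]).ln()).collect()
}

/// Square root of the sum of squared returns (unscaled).
pub fn realized_vol_sketch(returns: &[f64]) -> f64 {
    // Four independent lanes shorten the dependency chain of the sum.
    let mut acc = [0.0_f64; 4];

    let chunks = returns.chunks_exact(4);
    let rem = chunks.remainder();

    for c in chunks {
        for lane in 0..4 {
            // acc[lane] += c[lane]^2 as a fused multiply-add.
            acc[lane] = c[lane].mul_add(c[lane], acc[lane]);
        }
    }

    let mut total = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Scalar tail for the remaining 0 to 3 returns.
    for &r in rem {
        total = r.mul_add(r, total);
    }

    total.sqrt()
}
```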

§Performance Characteristics

§Latency Improvements

  • 2-4x speedup: Expected from auto-vectorization on modern CPUs; actual gains depend on data size, alignment, and enabled target features
  • Cache efficiency: Aligned access patterns reduce cache misses
  • Memory bandwidth: Vectorized loads maximize throughput
  • Branch elimination: Reduces pipeline stalls

§Throughput Optimization

  • High concurrency: Functions are thread-safe and reentrant
  • Memory efficiency: Minimal allocation through in-place operations
  • CPU utilization: Keeps multiple execution units busy

§Integration Patterns

§Real-Time Trading

// Market making: Calculate fair value
let fair_value = calculate_weighted_mid_price(
    &bid_prices, &bid_sizes,
    &ask_prices, &ask_sizes
);

// Risk management: Estimate volatility
let vol = calculate_realized_volatility(&recent_returns);

§Strategy Development

// Alpha generation: Order flow analysis
let ofi = calculate_ofi(
    &bid_volumes, &ask_volumes,
    &bid_changes, &ask_changes
);

// Signal processing: Moving averages
let ema = calculate_ema_vectorized(&prices, alpha);

§Compiler Optimization Requirements

§Build Flags

For optimal performance, use:

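# Cargo.toml
[profile.release]
opt-level = 3
lto = true
codegen-units = 1

# .cargo/config.toml (target-cpu is a rustc codegen flag, not a profile key)
[build]
rustflags = ["-C", "target-cpu=native"]  # Enables AVX2/FMA if available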

§Target Features

  • AVX2: 256-bit SIMD operations
  • FMA: Fused multiply-add instructions
  • BMI2: Bit manipulation for efficient indexing
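
Since these features are only used when both the build and the host CPU support them, it can help to check them explicitly. The sketch below uses the standard `cfg!(target_feature = ...)` and `is_x86_feature_detected!` macros; the function names are hypothetical and not part of this module.

```rust
/// True if this binary was compiled with AVX2 enabled (for example via
/// `-C target-cpu=native` on an AVX2 machine). Evaluated at compile time.
pub fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// Runtime check of what the host CPU supports (x86 / x86_64 only).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn detect_simd_features() -> [(&'static str, bool); 3] {
    [
        ("avx2", is_x86_feature_detected!("avx2")),
        ("fma", is_x86_feature_detected!("fma")),
        ("bmi2", is_x86_feature_detected!("bmi2")),
    ]
}

/// Fallback for other architectures, where these x86 features do not exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn detect_simd_features() -> [(&'static str, bool); 3] {
    [("avx2", false), ("fma", false), ("bmi2", false)]
}
```

A CPU that reports a feature at runtime does not make the compiler use it; the auto-vectorizer only emits AVX2 or FMA instructions when the feature is enabled at compile time.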

Structs§

AlignedBuffer32
32-byte aligned buffer optimized for AVX/AVX2 SIMD operations
AlignedBuffer64
64-byte aligned buffer for cache line optimization and false sharing prevention

Functions§

calculate_ema_vectorized
Calculate exponential moving average with vectorization
calculate_log_returns
Vectorized log returns calculation
calculate_ofi
Calculate order flow imbalance (OFI) with vectorization
calculate_order_book_imbalance
Calculate order book imbalance with vectorization
calculate_realized_volatility
Calculate realized volatility with vectorization
calculate_returns
Vectorized returns calculation
calculate_vwap
Calculate VWAP in a vectorization-friendly way
calculate_weighted_mid_price
Calculate weighted mid price with vectorization
moving_sum
Fast moving sum calculation

Type Aliases§

LegacyAlignedBuffer
Legacy type alias for backward compatibility with existing f64 buffers