Single Instruction Multiple Data technique - SIMD

SIMD on ARM Cortex-M: Accelerating Embedded Applications

Modern embedded applications demand ever-increasing computational performance while maintaining strict power and cost constraints. From real-time audio processing to machine learning inference at the edge, developers need techniques that can deliver maximum performance from limited hardware resources.

Single Instruction Multiple Data (SIMD) represents one of the most effective ways to achieve this goal. By processing multiple data elements simultaneously with a single instruction, SIMD can deliver 2-4x performance improvements for many computational tasks common in embedded systems.

Introduction to SIMD Computing

The Performance Challenge

Consider a typical embedded application processing sensor data:

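// Traditional approach: process 1000 sensor readings one at a time
float sensor_data[1000];
float filtered_data[1000];
float previous_values[1000];          // Filter outputs from the previous pass

// Assumed to be defined elsewhere in the application
extern float calibration_factor;
extern float offset;
float read_adc_channel(int channel);  // Assumed ADC driver function

void process_sensors_traditional(void) {
    for (int i = 0; i < 1000; i++) {
        // Read sensor value
        sensor_data[i] = read_adc_channel(i);

        // Apply calibration: scale, then add offset
        float calibrated = sensor_data[i] * calibration_factor + offset;

        // Apply a simple low-pass filter: average with the previous output
        filtered_data[i] = (calibrated + previous_values[i]) * 0.5f;
    }
}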

This traditional scalar approach processes one value at a time, so it needs 1000 loop iterations, each with its own loads, multiply, adds, and store. Every one of those instructions consumes CPU cycles, memory bandwidth, and energy.

The SIMD Solution

SIMD (Single Instruction Multiple Data) transforms this by processing multiple values simultaneously:

// SIMD approach (Helium/MVE intrinsics): process 4 values at once
#include <arm_mve.h>

void process_sensors_simd(void) {
    // 1000 is a multiple of 4, so no tail loop is needed here
    for (int i = 0; i < 1000; i += 4) {
        // Process 4 sensor readings simultaneously
        float32x4_t sensors = vld1q_f32(&sensor_data[i]);
        float32x4_t calibrated = vmulq_n_f32(sensors, calibration_factor);
        calibrated = vaddq_n_f32(calibrated, offset);

        // Apply filtering to all 4 values at once
        // (previous_values holds the filter outputs from the previous pass)
        float32x4_t previous = vld1q_f32(&previous_values[i]);
        float32x4_t filtered = vmulq_n_f32(vaddq_f32(calibrated, previous), 0.5f);

        vst1q_f32(&filtered_data[i], filtered);
    }
}

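Result: roughly 4x fewer loop iterations and arithmetic instructions. The measured speedup is usually lower than 4x, because it depends on the width of the core's vector datapath and on memory bandwidth, but fewer instructions generally means less time and less energy per sample.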

How SIMD Works:

CPU Architecture Fundamentals

Traditional scalar processors execute one operation per instruction:

Scalar Processing (Traditional):
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Data A    │───▶│     ALU     │───▶│   Result    │
└─────────────┘    └─────────────┘    └─────────────┘
    1 cycle            1 cycle            1 cycle

Total: 3 cycles for 1 operation

SIMD processors contain wider execution units that can handle multiple data elements:

SIMD Processing (Parallel):
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Data A1   │───▶│             │───▶│  Result 1   │
│   Data A2   │───▶│   SIMD ALU  │───▶│  Result 2   │
│   Data A3   │───▶│             │───▶│  Result 3   │
│   Data A4   │───▶│             │───▶│  Result 4   │
└─────────────┘    └─────────────┘    └─────────────┘
    1 cycle            1 cycle            1 cycle

Total: 3 cycles for 4 operations

SIMD Architecture Types

Modern processors implement SIMD through various architectural approaches, each optimized for different use cases and performance requirements.

ARM NEON (Advanced SIMD)

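The established SIMD extension for the Cortex-A series and some Cortex-R processors. NEON is not available on Cortex-M:

Key Features:

  • 128-bit vector width
  • Floating-point and integer operations
  • No per-lane predication (conditional results use compare and bitwise select)
  • Available on Cortex-A and some Cortex-R cores, not on Cortex-M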

ARM Helium (M-Profile Vector Extension - MVE)

Vector extension for Armv8.1-M Cortex-M cores, designed for DSP and machine learning workloads:

Key Features:

  • 128-bit vectors optimized for Cortex-M
  • Advanced predication for loop tail handling
  • 8-bit and 16-bit integer multiply-accumulate instructions suited to neural network kernels
  • Low-overhead context switching
  • Optimized for power efficiency

ARM Scalable Vector Extension (SVE)

Variable-length SIMD for high-performance computing:

Key Features:

  • Variable vector length (128 to 2048 bits, in 128-bit increments)
  • Vector-length agnostic programming
  • Advanced predication and gather/scatter
  • Primarily for Cortex-A and Neoverse cores
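
To make vector-length agnostic programming concrete, here is a minimal sketch of an element-wise add written with the SVE ACLE intrinsics. The function name `vla_add` is hypothetical, and the scalar fallback is there only so the file still builds on targets without SVE. The loop never assumes a vector width: `svcntw()` reports how many 32-bit lanes the hardware provides, and `svwhilelt_b32` builds a predicate that switches off lanes past the end of the array.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n)
void vla_add(const float *a, const float *b, float *out, int64_t n) {
#if defined(__ARM_FEATURE_SVE)
    // Step by however many 32-bit lanes this implementation provides
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        // Lanes with index >= n are inactive, so no tail loop is needed
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t va = svld1_f32(pg, &a[i]);
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &out[i], svadd_f32_x(pg, va, vb));
    }
#else
    // Portable fallback for targets without SVE
    for (int64_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

The same binary should run unchanged on a 128-bit and a 512-bit SVE implementation; only the number of loop iterations differs.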

ARM Scalable Matrix Extension (SME)

Specialized for matrix operations and AI acceleration:

Key Features:

  • Matrix-oriented instructions
  • Streaming SVE mode for memory efficiency
  • AI/ML acceleration focus
  • Integration with SVE

Register Organization

SIMD operations use wider registers that can hold multiple values:

// Scalar register (32-bit)
┌─────────────────────────────────┐
│       Single 32-bit float       │
└─────────────────────────────────┘

// SIMD register (128-bit) - ARM Helium
┌───────────┬───────────┬───────────┬───────────┐
│  float 1  │  float 2  │  float 3  │  float 4  │
│  32 bits  │  32 bits  │  32 bits  │  32 bits  │
└───────────┴───────────┴───────────┴───────────┘

// Alternative packing for different data types
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │   // 8x 16-bit
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │   // 16x 8-bit
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
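
The packing options in the diagram can be modelled in portable C with a union. This is only an illustrative sketch: the type `vec128_view_t` and the helper `lanes_per_vector` are hypothetical names, and real code would use the intrinsic vector types rather than a union. It shows that the same 16 bytes can be read as 4, 8, or 16 lanes depending on the element size.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical view of one 128-bit vector register as plain C arrays
typedef union {
    float   f32[4];   // 4 x 32-bit floats
    int32_t i32[4];   // 4 x 32-bit integers
    int16_t i16[8];   // 8 x 16-bit integers
    int8_t  i8[16];   // 16 x 8-bit integers
} vec128_view_t;

// Number of lanes in a 128-bit vector for a given element size in bytes
size_t lanes_per_vector(size_t element_size) {
    return sizeof(vec128_view_t) / element_size;
}
```

Halving the element width doubles the number of values processed per instruction, which is why fixed-point 8-bit and 16-bit data formats are popular in SIMD code.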

Execution Pipeline

SIMD instructions flow through the processor pipeline just like scalar instructions, but operate on multiple data lanes:

Pipeline Stage 1: Fetch Instruction
┌─────────────────────────────────────────┐
│  SIMD_ADD vector_a, vector_b, vector_c  │
└─────────────────────────────────────────┘

Pipeline Stage 2: Decode & Register Read
┌───────────┬───────────┬───────────┬───────────┐
│ Lane 0    │ Lane 1    │ Lane 2    │ Lane 3    │
│ A[0] B[0] │ A[1] B[1] │ A[2] B[2] │ A[3] B[3] │
└───────────┴───────────┴───────────┴───────────┘

Pipeline Stage 3: Execute (Parallel ALUs)
┌───────────┬───────────┬───────────┬───────────┐
│ ALU Lane0 │ ALU Lane1 │ ALU Lane2 │ ALU Lane3 │
│ A[0]+B[0] │ A[1]+B[1] │ A[2]+B[2] │ A[3]+B[3] │
└───────────┴───────────┴───────────┴───────────┘

Pipeline Stage 4: Write Back
┌───────────┬───────────┬───────────┬───────────┐
│ Result[0] │ Result[1] │ Result[2] │ Result[3] │
└───────────┴───────────┴───────────┴───────────┘

ARM Cortex-M SIMD Evolution

Cortex-M0/M0+: No SIMD Support

  • Basic 32-bit ARM architecture
  • No parallel data processing capabilities
  • Suitable for simple control applications

Cortex-M3: No SIMD Instructions

  • Armv7-M architecture without the DSP extension
  • No packed 8-bit or 16-bit arithmetic
  • Saturation instructions (SSAT, USAT) and hardware divide, but one data element per operation

Cortex-M4/M7: DSP Extensions

ARM Cortex-M4 and M7 include the DSP extension, whose SIMD instructions pack two 16-bit or four 8-bit values into an ordinary 32-bit register:

// Packed 16-bit addition (2 operations in 1 instruction)
uint32_t packed_add_16(uint32_t a, uint32_t b) {
    return __SADD16(a, b);  // Add two 16-bit values in parallel
}

// Packed 8-bit addition (4 operations in 1 instruction)  
uint32_t packed_add_8(uint32_t a, uint32_t b) {
    return __SADD8(a, b);   // Add four 8-bit values in parallel
}
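
What makes a packed add different from an ordinary 32-bit add is that carries do not cross lane boundaries. The following portable model is a sketch of that arithmetic only; `sadd16_model` is a hypothetical name, and the real `SADD16` instruction also updates the APSR.GE flags, which this model ignores.

```c
#include <stdint.h>

// Hypothetical portable model of the result value of SADD16:
// two independent 16-bit additions inside one 32-bit word
uint32_t sadd16_model(uint32_t a, uint32_t b) {
    // Add the low halves and discard any carry out of bit 15
    uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    // Add the high halves and discard any carry out of bit 31
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

For example, `sadd16_model(0x0001FFFF, 0x00000001)` gives `0x00010000`: the low lane wraps to zero and the high lane is untouched, whereas a plain 32-bit addition would give `0x00020000`.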

Cortex-M55: ARM Helium (M-Profile Vector Extension)

Helium is the most capable SIMD option in the Cortex-M family. It first appeared in Cortex-M55 and is also available on Cortex-M85 and Cortex-M52:

#include "arm_mve.h"

// Process 4 float values simultaneously
void vector_add_float(float *a, float *b, float *result, int count) {
    int i;
    for(i = 0; i <= count - 4; i += 4) {
        float32x4_t vec_a = vld1q_f32(&a[i]);
        float32x4_t vec_b = vld1q_f32(&b[i]);
        float32x4_t vec_result = vaddq_f32(vec_a, vec_b);
        vst1q_f32(&result[i], vec_result);
    }
    
    // Handle remaining elements
    for(; i < count; i++) {
        result[i] = a[i] + b[i];
    }
}
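
Helium's predication can remove the separate loop for the remaining elements. The sketch below is one way to do that, assuming a compiler that provides `arm_mve.h` with floating-point MVE enabled; the function name `vector_add_float_tail` is hypothetical, and the scalar branch exists only so the code builds on other targets. `vctp32q` creates a predicate that enables as many 32-bit lanes as there are elements left, and the `_z` and `_p` intrinsics load and store only those lanes.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
#include <arm_mve.h>
#define HAVE_MVE_FLOAT 1
#endif

// Hypothetical variant of vector_add_float with no scalar tail loop on MVE
void vector_add_float_tail(const float *a, const float *b,
                           float *result, int count) {
#if defined(HAVE_MVE_FLOAT)
    for (int i = 0; i < count; i += 4) {
        // Enable min(4, count - i) lanes; the rest are inactive
        mve_pred16_t p = vctp32q((uint32_t)(count - i));
        float32x4_t vec_a = vld1q_z_f32(&a[i], p);   // inactive lanes read as 0
        float32x4_t vec_b = vld1q_z_f32(&b[i], p);
        float32x4_t sum = vaddq_x_f32(vec_a, vec_b, p);
        vst1q_p_f32(&result[i], sum, p);             // inactive lanes not written
    }
#else
    // Portable fallback for targets without floating-point MVE
    for (int i = 0; i < count; i++) {
        result[i] = a[i] + b[i];
    }
#endif
}
```

Because inactive lanes are neither loaded nor stored, the function does not touch memory past `count` elements even when `count` is not a multiple of 4.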

Basic SIMD Operations on Cortex-M4/M7

Packed Arithmetic Instructions

16-bit Operations:

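#include "cmsis_compiler.h"  // Usually pulled in through the device header

// Parallel 16-bit addition
uint32_t a = 0x00020001;  // Two 16-bit values: 2, 1
uint32_t b = 0x00040003;  // Two 16-bit values: 4, 3
uint32_t result = __SADD16(a, b);  // Result: 0x00060004 (6, 4)

// Parallel 16-bit subtraction
uint32_t diff = __SSUB16(a, b);  // Result: 0xFFFEFFFE (-2, -2)

// Dual 16-bit multiply with add: (2 * 4) + (1 * 3) = 11
// (the DSP extension has no packed multiply that returns two 16-bit products)
uint32_t dot = __SMUAD(a, b);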

8-bit Operations:

// Parallel 8-bit addition (4 values at once)
uint32_t a = 0x04030201;  // Four 8-bit values: 4, 3, 2, 1
uint32_t b = 0x08070605;  // Four 8-bit values: 8, 7, 6, 5
uint32_t result = __SADD8(a, b);  // Result: 0x0C0A0806 (12, 10, 8, 6)

// Parallel 8-bit saturating addition: each signed lane clamps to [-128, 127]
uint32_t sat_result = __QADD8(a, b);  // Same as __SADD8 here, since no lane overflows
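
The difference between wrapping and saturating addition only shows up when a lane overflows. The portable model below is a sketch of the `QADD8` arithmetic for experimenting on a host machine; `qadd8_model` is a hypothetical name, not a CMSIS function.

```c
#include <stdint.h>

// Hypothetical portable model of QADD8: four signed 8-bit saturating adds
uint32_t qadd8_model(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        // Extract each lane and sign-extend it from 8 bits
        int x = (int)((a >> shift) & 0xFFu);
        int y = (int)((b >> shift) & 0xFFu);
        if (x > 127) x -= 256;
        if (y > 127) y -= 256;
        // Add, then clamp to the signed 8-bit range instead of wrapping
        int sum = x + y;
        if (sum > 127) sum = 127;
        if (sum < -128) sum = -128;
        result |= ((uint32_t)sum & 0xFFu) << shift;
    }
    return result;
}
```

With this model, `0x7F + 0x01` in a lane stays at `0x7F` (127) rather than wrapping to `0x80` (-128), which is usually the behaviour wanted for audio and image samples.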

ARM Helium (M-Profile Vector Extension)

ARM Helium provides 128-bit vector processing for Cortex-M55 and later Armv8.1-M cores that implement MVE, enabling advanced SIMD operations.

Vector Data Types

#include "arm_mve.h"

// Integer vector types
int8x16_t   vec_i8;    // 16 x 8-bit integers
int16x8_t   vec_i16;   // 8 x 16-bit integers  
int32x4_t   vec_i32;   // 4 x 32-bit integers

// Floating-point vector types
float32x4_t vec_f32;   // 4 x 32-bit floats
float16x8_t vec_f16;   // 8 x 16-bit floats (half precision)

// Loading and storing vectors
int32x4_t data = vld1q_s32(input_array);    // Load 4 integers
vst1q_s32(output_array, result);            // Store 4 integers

Advanced Vector Operations

Arithmetic Operations:

// Vector addition, subtraction, multiplication
int32x4_t vec_sum = vaddq_s32(vec_a, vec_b);
int32x4_t vec_diff = vsubq_s32(vec_a, vec_b);
int32x4_t vec_prod = vmulq_s32(vec_a, vec_b);

// Fused multiply-add
float32x4_t vec_fma = vfmaq_f32(vec_c, vec_a, vec_b);  // c + (a * b)
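
As a sketch of how the fused multiply-add fits into a complete loop, the function below computes `acc[i] += scale * x[i]` over an array. The name `scale_accumulate` is hypothetical, and the code assumes a toolchain with `arm_mve.h` and floating-point MVE; on any other target only the scalar loop is compiled, which also serves as the tail loop on MVE.

```c
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
#include <arm_mve.h>
#define HAVE_MVE_FLOAT 1
#endif

// Hypothetical helper: acc[i] = acc[i] + scale * x[i]
void scale_accumulate(const float *x, float scale, float *acc, int count) {
    int i = 0;
#if defined(HAVE_MVE_FLOAT)
    // Broadcast the scalar into all four lanes once, outside the loop
    float32x4_t vec_scale = vdupq_n_f32(scale);
    for (; i + 4 <= count; i += 4) {
        float32x4_t vec_x = vld1q_f32(&x[i]);
        float32x4_t vec_acc = vld1q_f32(&acc[i]);
        // vfmaq_f32(c, a, b) returns c + (a * b) with a single rounding
        vst1q_f32(&acc[i], vfmaq_f32(vec_acc, vec_x, vec_scale));
    }
#endif
    // Remaining elements (or all elements when MVE is unavailable)
    for (; i < count; i++) {
        acc[i] += scale * x[i];
    }
}
```

The fused form rounds once instead of twice, so the vector and scalar paths can differ in the last bit for values that are not exactly representable.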

Reduction Operations:

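// Find maximum value in vector (the MVE form takes a starting value)
int32_t max_val = vmaxvq_s32(INT32_MIN, vec_data);

// Sum all elements in vector
int32_t sum = vaddvq_s32(vec_data);

// Accumulating reduction: running_sum plus the sum of all lanes
// (Helium has no pairwise add; vpaddq_s32 is a NEON intrinsic)
int32_t total = vaddvaq_s32(running_sum, vec_data);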
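
Reductions are typically applied across a whole array rather than a single vector. The sketch below, with the hypothetical names `reduce_result_t` and `reduce_sum_max`, keeps a running sum and maximum using the accumulating MVE reductions `vaddvaq_s32` and `vmaxvq_s32`, both of which take the previous scalar value as their first argument. It assumes the sum fits in 32 bits, and falls back to plain C where integer MVE is not available.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#define HAVE_MVE_INT 1
#endif

typedef struct {
    int32_t sum;  // Sum of all elements (assumed not to overflow)
    int32_t max;  // Largest element, or INT32_MIN for an empty array
} reduce_result_t;

// Hypothetical helper: sum and maximum of an int32_t array
reduce_result_t reduce_sum_max(const int32_t *data, int count) {
    reduce_result_t r = { 0, INT32_MIN };
    int i = 0;
#if defined(HAVE_MVE_INT)
    for (; i + 4 <= count; i += 4) {
        int32x4_t v = vld1q_s32(&data[i]);
        r.sum = vaddvaq_s32(r.sum, v);  // r.sum + sum of the 4 lanes
        r.max = vmaxvq_s32(r.max, v);   // max of r.max and the 4 lanes
    }
#endif
    // Remaining elements (or all elements when MVE is unavailable)
    for (; i < count; i++) {
        r.sum += data[i];
        if (data[i] > r.max) r.max = data[i];
    }
    return r;
}
```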

Conditional Operations:

// Conditional execution with predicates
mve_pred16_t pred = vcmpgtq_s32(vec_a, vec_threshold);
// Lanes where pred is true get vec_a + vec_b; other lanes keep vec_base
int32x4_t result = vaddq_m_s32(vec_base, vec_a, vec_b, pred);
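
Predication replaces a per-element `if` inside the loop. As a sketch, the hypothetical function `add_gain_above` below adds `gain` only to samples that exceed `threshold` and passes the others through unchanged. It assumes a toolchain with `arm_mve.h` and integer MVE, and compiles to a plain scalar loop elsewhere.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#define HAVE_MVE_INT 1
#endif

// Hypothetical helper: out[i] = in[i] > threshold ? in[i] + gain : in[i]
void add_gain_above(const int32_t *in, int32_t threshold, int32_t gain,
                    int32_t *out, int count) {
    int i = 0;
#if defined(HAVE_MVE_INT)
    int32x4_t vec_threshold = vdupq_n_s32(threshold);
    int32x4_t vec_gain = vdupq_n_s32(gain);
    for (; i + 4 <= count; i += 4) {
        int32x4_t v = vld1q_s32(&in[i]);
        // One predicate bit group per lane: true where v > threshold
        mve_pred16_t above = vcmpgtq_s32(v, vec_threshold);
        // Active lanes get v + gain; inactive lanes keep v (the first argument)
        vst1q_s32(&out[i], vaddq_m_s32(v, v, vec_gain, above));
    }
#endif
    // Remaining elements (or all elements when MVE is unavailable)
    for (; i < count; i++) {
        out[i] = (in[i] > threshold) ? in[i] + gain : in[i];
    }
}
```

The vector loop contains no branches on the data, so its execution time does not depend on how many samples cross the threshold.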

As embedded applications become more computationally demanding, SIMD techniques will become increasingly important for achieving real-time performance within power and thermal constraints. The evolution from basic packed arithmetic to sophisticated vector processing units demonstrates ARM’s commitment to bringing high-performance computing capabilities to the embedded world.

This post is licensed under CC BY 4.0 by the author.