Single Instruction Multiple Data technique - SIMD

Posted Jul 24, 2025

By Khoi Nguyen Van

SIMD on ARM Cortex-M: Accelerating Embedded Applications

Modern embedded applications demand ever-increasing computational performance while maintaining strict power and cost constraints. From real-time audio processing to machine learning inference at the edge, developers need techniques that can deliver maximum performance from limited hardware resources.

Single Instruction Multiple Data (SIMD) is one of the most effective ways to achieve this goal. By processing multiple data elements with a single instruction, SIMD can often deliver speedups in the 2-4x range for the data-parallel tasks common in embedded systems, such as filtering, scaling, and dot products. The actual gain depends on the core, the data type, and how much of the workload can be vectorized.

Introduction to SIMD Computing

The Performance Challenge

Consider a typical embedded application processing sensor data:

  
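// Traditional approach: process 1000 sensor channels one value at a time
#define NUM_CHANNELS 1000

float sensor_data[NUM_CHANNELS];
float filtered_data[NUM_CHANNELS];
float previous_values[NUM_CHANNELS];   // last filtered output of each channel
float calibration_factor = 1.0f;       // placeholder calibration constants
float offset = 0.0f;

float read_adc_channel(int channel);   // provided by the ADC driver

void process_sensors_traditional(void) {
    for (int i = 0; i < NUM_CHANNELS; i++) {
        // Read sensor value
        sensor_data[i] = read_adc_channel(i);

        // Apply calibration
        float calibrated = sensor_data[i] * calibration_factor + offset;

        // Low-pass filter: average with this channel's previous output
        filtered_data[i] = (calibrated + previous_values[i]) * 0.5f;
        previous_values[i] = filtered_data[i];
    }
}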

This traditional scalar approach processes one value at a time, so the loop body runs 1000 times, and every iteration issues its own loads, multiply, adds, and stores. Each of those instructions consumes CPU cycles, memory bandwidth, and energy.

The SIMD Solution

SIMD (Single Instruction Multiple Data) transforms this by processing multiple values simultaneously:

  
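// SIMD approach: process 4 values at once (Helium/MVE intrinsics)
#include <arm_mve.h>

// Assumes sensor_data[] has already been filled from the ADC and that the
// element count (1000) is a multiple of 4, so no tail loop is needed.
void process_sensors_simd(void) {
    for (int i = 0; i < 1000; i += 4) {
        // Calibrate 4 sensor readings simultaneously
        float32x4_t sensors = vld1q_f32(&sensor_data[i]);
        float32x4_t calibrated = vmulq_n_f32(sensors, calibration_factor);
        calibrated = vaddq_n_f32(calibrated, offset);

        // Apply filtering to all 4 values at once
        float32x4_t previous = vld1q_f32(&previous_values[i]);
        float32x4_t filtered = vmulq_n_f32(vaddq_f32(calibrated, previous), 0.5f);

        vst1q_f32(&filtered_data[i], filtered);
        vst1q_f32(&previous_values[i], filtered);  // remember outputs for the next pass
    }
}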

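Result: roughly one quarter as many load, arithmetic, and store instructions for the calibration and filtering steps. The actual speedup and energy savings are usually smaller than 4x, because they depend on the core, on memory bandwidth, and on how much of the work (such as reading the ADC) remains scalar.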
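
Four lanes per instruction does not translate directly into a four-fold speedup of the whole routine. A rough way to estimate the overall gain is Amdahl's law, sketched here under the assumption that a fraction p of the original run time can be vectorized and that each vector instruction handles N lanes. The value p = 0.9 is purely illustrative, not a measurement:

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
p = 0.9,\ N = 4
\;\Rightarrow\;
S = \frac{1}{0.1 + 0.225} \approx 3.1
```

Under these assumed numbers, the 10% of the time that stays scalar is enough to pull the overall speedup from 4x down to about 3x.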

How SIMD Works:

CPU Architecture Fundamentals

Traditional scalar processors execute one operation per instruction:

Scalar Processing (Traditional):
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Data A    │───▶│     ALU     │───▶│   Result    │
└─────────────┘    └─────────────┘    └─────────────┘
    1 cycle            1 cycle            1 cycle

Total: 3 cycles for 1 operation

SIMD processors contain wider execution units that can handle multiple data elements:

SIMD Processing (Parallel):
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Data A1   │───▶│             │───▶│  Result 1   │
│   Data A2   │───▶│   SIMD ALU  │───▶│  Result 2   │
│   Data A3   │───▶│             │───▶│  Result 3   │
│   Data A4   │───▶│             │───▶│  Result 4   │
└─────────────┘    └─────────────┘    └─────────────┘
    1 cycle            1 cycle            1 cycle

Total: 3 cycles for 4 operations

SIMD Architecture Types

Modern processors implement SIMD through various architectural approaches, each optimized for different use cases and performance requirements.

ARM NEON (Advanced SIMD)

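The classic ARM Advanced SIMD extension, found in Cortex-A series processors. Cortex-M cores do not implement NEON; they rely on the DSP extension or Helium instead:

Key Features:

128-bit vector width
Floating-point and integer operations
No per-lane predication (loop tails need scalar code)
Available on Cortex-A, not on Cortex-M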

ARM Helium (M-Profile Vector Extension - MVE)

Next-generation SIMD for the Cortex-M series, designed for DSP and machine learning workloads:

Key Features:

128-bit vectors optimized for Cortex-M
Advanced predication for loop tail handling
Integer multiply-accumulate instructions well suited to neural network kernels
Low-overhead context switching
Optimized for power efficiency

ARM Scalable Vector Extension (SVE)

Variable-length SIMD for high-performance computing:

Key Features:

Variable vector length (128-2048 bits)
Vector-length agnostic programming
Advanced predication and gather/scatter
Primarily for Cortex-A and Neoverse cores

ARM Scalable Matrix Extension (SME)

Specialized for matrix operations and AI acceleration:

Key Features:

Matrix-oriented instructions
Streaming SVE mode for memory efficiency
AI/ML acceleration focus
Integration with SVE

Register Organization

SIMD operations use wider registers that can hold multiple values:

  
// Scalar register (32-bit)
┌─────────────────────────────────┐
│         Single 32-bit float     │
└─────────────────────────────────┘

// SIMD register (128-bit) - ARM Helium
┌───────────┬───────────┬───────────┬───────────┐
│  float 1  │  float 2  │  float 3  │  float 4  │
│  32 bits  │  32 bits  │  32 bits  │  32 bits  │
└───────────┴───────────┴───────────┴───────────┘

// Alternative packing for different data types
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │  // 8x 16-bit
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │  // 16x 8-bit
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
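
The packing diagrams above can be mirrored in plain C. The sketch below is a host-side model only: the union `vreg128_model_t` and the helper `lanes_for` are hypothetical names invented for illustration, not part of any ARM header. It shows that the same 16 bytes can be viewed as 4, 8, or 16 lanes.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register.
 * All members overlay the same 16 bytes of storage. */
typedef union {
    float   f32[4];    /* 4 x 32-bit floats   */
    int16_t i16[8];    /* 8 x 16-bit integers */
    int8_t  i8[16];    /* 16 x 8-bit integers */
    uint8_t bytes[16]; /* raw byte view       */
} vreg128_model_t;

/* Number of lanes available for a given element size in bytes. */
static inline unsigned lanes_for(unsigned element_bytes) {
    return (unsigned)sizeof(vreg128_model_t) / element_bytes;
}
```

Halving the element width doubles the lane count, which is why 8-bit and 16-bit fixed-point data tends to benefit most from SIMD.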

Execution Pipeline

SIMD instructions flow through the processor pipeline just like scalar instructions, but operate on multiple data lanes:

Pipeline Stage 1: Fetch Instruction
┌─────────────────────────────────────────┐
│  SIMD_ADD vector_a, vector_b, vector_c  │
└─────────────────────────────────────────┘

Pipeline Stage 2: Decode & Register Read
┌───────────┬───────────┬───────────┬───────────┐
│ Lane 0    │ Lane 1    │ Lane 2    │ Lane 3    │
│ A[0] B[0] │ A[1] B[1] │ A[2] B[2] │ A[3] B[3] │
└───────────┴───────────┴───────────┴───────────┘

Pipeline Stage 3: Execute (Parallel ALUs)
┌───────────┬───────────┬───────────┬───────────┐
│ ALU Lane0 │ ALU Lane1 │ ALU Lane2 │ ALU Lane3 │
│ A[0]+B[0] │ A[1]+B[1] │ A[2]+B[2] │ A[3]+B[3] │
└───────────┴───────────┴───────────┴───────────┘

Pipeline Stage 4: Write Back
┌───────────┬───────────┬───────────┬───────────┐
│ Result[0] │ Result[1] │ Result[2] │ Result[3] │
└───────────┴───────────┴───────────┴───────────┘
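
The execute stage above can be described in scalar C as a loop over lanes. This is a reference model that runs on any host, not an implementation of a real instruction; the function name `simd_add_model` is an assumption made for this sketch. On hardware, all four additions are issued by one instruction rather than by a loop.

```c
#include <stdint.h>

#define LANES 4

/* Scalar model of a 4-lane integer vector add.
 * Each lane is independent: lane n only reads a[n] and b[n].
 * The unsigned casts give wrap-around on overflow, like the hardware. */
void simd_add_model(const int32_t a[LANES], const int32_t b[LANES],
                    int32_t result[LANES]) {
    for (int lane = 0; lane < LANES; lane++) {
        result[lane] = (int32_t)((uint32_t)a[lane] + (uint32_t)b[lane]);
    }
}
```

Because no lane depends on another, the hardware is free to compute them in parallel, which is the property a loop must have before it can be vectorized.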

ARM Cortex-M SIMD Evolution

Cortex-M0/M0+: No SIMD Support

Basic 32-bit ARM architecture
No parallel data processing capabilities
Suitable for simple control applications

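Cortex-M3: No Packed SIMD Instructions

Armv7-M architecture without the DSP extension
Saturation instructions (SSAT/USAT) and hardware multiply/divide
No packed 8-bit or 16-bit arithmetic; intrinsics such as __SADD16 are not available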

Cortex-M4/M7: DSP Extensions

ARM Cortex-M4 and M7 include DSP extensions with SIMD capabilities:

  
// Packed 16-bit addition (2 operations in 1 instruction)
uint32_t packed_add_16(uint32_t a, uint32_t b) {
    return __SADD16(a, b);  // Add two 16-bit values in parallel
}

// Packed 8-bit addition (4 operations in 1 instruction)  
uint32_t packed_add_8(uint32_t a, uint32_t b) {
    return __SADD8(a, b);   // Add four 8-bit values in parallel
}

Cortex-M55: ARM Helium (M-Profile Vector Extension)

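Helium was first implemented in the Cortex-M55 and is also available in the later Cortex-M85 and Cortex-M52. It is the most capable SIMD extension for Cortex-M microcontrollers: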

  
#include "arm_mve.h"

// Process 4 float values simultaneously
void vector_add_float(float *a, float *b, float *result, int count) {
    int i;
    for(i = 0; i <= count - 4; i += 4) {
        float32x4_t vec_a = vld1q_f32(&a[i]);
        float32x4_t vec_b = vld1q_f32(&b[i]);
        float32x4_t vec_result = vaddq_f32(vec_a, vec_b);
        vst1q_f32(&result[i], vec_result);
    }
    
    // Handle remaining elements
    for(; i < count; i++) {
        result[i] = a[i] + b[i];
    }
}
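
Helium's tail predication can fold the scalar remainder loop into the vector loop. On real hardware this is usually done with a predicate produced by an intrinsic such as vctp32q; the sketch below is only a portable model of that idea, with the hypothetical name `vector_add_tail_model`, so that the control flow can be run and checked on any host.

```c
#include <stddef.h>

#define LANES 4

/* Portable model of a tail-predicated vector loop.
 * Every iteration covers up to LANES elements; on the final iteration
 * only the lanes that still map to valid elements are active. */
void vector_add_tail_model(const float *a, const float *b,
                           float *result, size_t count) {
    for (size_t i = 0; i < count; i += LANES) {
        size_t remaining = count - i;
        size_t active = remaining < LANES ? remaining : LANES;

        /* Inactive lanes are skipped, so nothing past count is touched. */
        for (size_t lane = 0; lane < active; lane++) {
            result[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

With count = 6, the first iteration uses all four lanes and the second uses only two, so no separate scalar clean-up loop is needed.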

Basic SIMD Operations on Cortex-M4/M7

Packed Arithmetic Instructions

16-bit Operations:

  
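#include "cmsis_compiler.h"  // compiler-independent; usually pulled in by the device header

// Parallel 16-bit addition
uint32_t a = 0x00020001;  // Two 16-bit values: 2, 1
uint32_t b = 0x00040003;  // Two 16-bit values: 4, 3
uint32_t result = __SADD16(a, b);  // Result: 0x00060004 (6, 4)

// Parallel 16-bit subtraction
uint32_t diff = __SSUB16(a, b);    // Result: 0xFFFEFFFE (-2, -2)

// Dual 16-bit multiply with add: (1 * 3) + (2 * 4).
// CMSIS has no __SMUL16; two 32-bit products would not fit in one register.
uint32_t dot = __SMUAD(a, b);      // Result: 11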
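
To make the lane behaviour of the packed 16-bit addition concrete, here is a portable reference model in plain C. The name `sadd16_model` is hypothetical, and the model covers only the arithmetic result (the real instruction also updates the GE flags, which are ignored here). The key point is that a carry out of the low halfword never reaches the high halfword.

```c
#include <stdint.h>

/* Portable model of a packed 16-bit add.
 * Each halfword is summed modulo 2^16, independently of the other. */
uint32_t sadd16_model(uint32_t a, uint32_t b) {
    /* Low lane: the low 16 bits of a + b depend only on the low lanes. */
    uint32_t lo = (a + b) & 0x0000FFFFu;

    /* High lane: add the upper halfwords, then shift back into place.
     * Any carry out of the lane is discarded by the 32-bit shift. */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;

    return hi | lo;
}
```

For example, `sadd16_model(0x00020001, 0x00040003)` yields `0x00060004`, while `sadd16_model(0x0001FFFF, 0x00010001)` yields `0x00020000`: the low lane wraps to zero and the high lane is unaffected.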

8-bit Operations:

  
// Parallel 8-bit addition (4 values at once)
uint32_t a = 0x04030201;  // Four 8-bit values: 4, 3, 2, 1
uint32_t b = 0x08070605;  // Four 8-bit values: 8, 7, 6, 5
uint32_t result = __SADD8(a, b);  // Result: 0x0C0A0806 (12, 10, 8, 6)

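// Parallel 8-bit saturating addition: each lane clamps to [-128, 127]
// instead of wrapping around on overflow
uint32_t sat_result = __QADD8(a, b);  // Result: 0x0C0A0806 (no lane overflows here)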
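
The difference between wrapping and saturating 8-bit addition only shows up when a lane overflows. The following host-side sketch models both behaviours; the names `sadd8_model`, `qadd8_model`, and `saturate_s8` are assumptions made for illustration and are not CMSIS functions.

```c
#include <stdint.h>

/* Clamp a wide sum into the signed 8-bit range. */
static int8_t saturate_s8(int sum) {
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* Model of a wrapping packed add: each byte lane is summed modulo 256. */
uint32_t sadd8_model(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        uint32_t sum = ((a >> shift) + (b >> shift)) & 0xFFu;
        result |= sum << shift;
    }
    return result;
}

/* Model of a saturating packed add: each signed byte lane is clamped. */
uint32_t qadd8_model(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        int8_t x = (int8_t)(uint8_t)(a >> shift);
        int8_t y = (int8_t)(uint8_t)(b >> shift);
        uint8_t r = (uint8_t)saturate_s8(x + y);
        result |= (uint32_t)r << shift;
    }
    return result;
}
```

Adding 127 + 1 in the top lane gives 0x80 (that is, -128) in the wrapping model but stays at 0x7F in the saturating model, which is usually the preferred behaviour for audio and image samples.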

ARM Helium (M-Profile Vector Extension)

ARM Helium provides 128-bit vector processing on the Cortex-M55 and later Helium-capable cores, enabling advanced SIMD operations.

Vector Data Types

  
#include "arm_mve.h"

// Integer vector types
int8x16_t   vec_i8;    // 16 x 8-bit integers
int16x8_t   vec_i16;   // 8 x 16-bit integers  
int32x4_t   vec_i32;   // 4 x 32-bit integers

// Floating-point vector types
float32x4_t vec_f32;   // 4 x 32-bit floats
float16x8_t vec_f16;   // 8 x 16-bit floats (half precision)

// Loading and storing vectors
int32x4_t data = vld1q_s32(input_array);    // Load 4 integers
vst1q_s32(output_array, result);            // Store 4 integers

Advanced Vector Operations

Arithmetic Operations:

  
// Vector addition, subtraction, multiplication
int32x4_t vec_sum = vaddq_s32(vec_a, vec_b);
int32x4_t vec_diff = vsubq_s32(vec_a, vec_b);
int32x4_t vec_prod = vmulq_s32(vec_a, vec_b);

// Fused multiply-add
float32x4_t vec_fma = vfmaq_f32(vec_c, vec_a, vec_b);  // c + (a * b)

Reduction Operations:

  
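// Find maximum value in vector; the MVE intrinsic takes a starting
// scalar as its first argument (INT32_MIN comes from <stdint.h>)
int32_t max_val = vmaxvq_s32(INT32_MIN, vec_data);

// Sum all elements in vector  
int32_t sum = vaddvq_s32(vec_data);

// Multiply-accumulate across lanes (dot product). MVE has no pairwise
// add, so NEON's vpaddq_s32 is not available on Helium.
int32_t dot = vmladavq_s32(vec_a, vec_b);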
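
Reductions collapse a vector into a single scalar. The sketch below models the two reductions shown above in portable C; `add_across_model` and `max_across_model` are hypothetical names. The maximum model takes a starting value, on the assumption that it should mirror the way a running maximum is carried from one vector to the next when scanning a long array.

```c
#include <stdint.h>

#define LANES 4

/* Model of an add-across-vector reduction: sum of all lanes. */
int32_t add_across_model(const int32_t v[LANES]) {
    int32_t sum = 0;
    for (int lane = 0; lane < LANES; lane++) {
        sum += v[lane];
    }
    return sum;
}

/* Model of a max-across-vector reduction with a running maximum.
 * Pass INT32_MIN as start for the first vector of an array. */
int32_t max_across_model(int32_t start, const int32_t v[LANES]) {
    int32_t best = start;
    for (int lane = 0; lane < LANES; lane++) {
        if (v[lane] > best) {
            best = v[lane];
        }
    }
    return best;
}
```

For the vector {3, -7, 42, 5}, the sum is 43 and the maximum is 42; if the running maximum from earlier vectors were already 100, the result would stay at 100.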

Conditional Operations:

  
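// Conditional execution with predicates: lanes where vec_a > vec_threshold
// receive vec_a + vec_b, all other lanes keep the value from vec_base
mve_pred16_t pred = vcmpgtq_s32(vec_a, vec_threshold);
int32x4_t result = vaddq_m_s32(vec_base, vec_a, vec_b, pred);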
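
A predicated operation can be read as "compute in the lanes where the condition holds, keep the old value elsewhere". The portable model below illustrates that merging behaviour. It is a simplification: the hardware predicate has one bit per byte of the vector, whereas this sketch uses one bit per 32-bit lane, and the names `cmpgt_model` and `add_merging_model` are invented for this example.

```c
#include <stdint.h>

#define LANES 4

/* Model of a lane-wise signed "greater than" compare.
 * Bit n of the returned mask is set when a[n] > b[n]. */
uint8_t cmpgt_model(const int32_t a[LANES], const int32_t b[LANES]) {
    uint8_t mask = 0;
    for (int lane = 0; lane < LANES; lane++) {
        if (a[lane] > b[lane]) {
            mask |= (uint8_t)(1u << lane);
        }
    }
    return mask;
}

/* Model of a merging predicated add.
 * Active lanes get a + b; inactive lanes copy the "inactive" input. */
void add_merging_model(const int32_t inactive[LANES],
                       const int32_t a[LANES], const int32_t b[LANES],
                       uint8_t mask, int32_t result[LANES]) {
    for (int lane = 0; lane < LANES; lane++) {
        if (mask & (1u << lane)) {
            result[lane] = a[lane] + b[lane];
        } else {
            result[lane] = inactive[lane];
        }
    }
}
```

This replaces a per-element `if` inside the loop with a mask, so the vector loop contains no branches on the data.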

As embedded applications become more computationally demanding, SIMD techniques will become increasingly important for achieving real-time performance within power and thermal constraints. The evolution from basic packed arithmetic to sophisticated vector processing units demonstrates ARM’s commitment to bringing high-performance computing capabilities to the embedded world.

Embedded Systems, AI

ARM CortexM

This post is licensed under CC BY 4.0 by the author.
