SIMD on ARM Cortex-M: Accelerating Embedded Applications
Modern embedded applications demand ever-increasing computational performance while maintaining strict power and cost constraints. From real-time audio processing to machine learning inference at the edge, developers need techniques that can deliver maximum performance from limited hardware resources.
Single Instruction Multiple Data (SIMD) represents one of the most effective ways to achieve this goal. By processing multiple data elements simultaneously with a single instruction, SIMD can deliver 2-4x performance improvements for many computational tasks common in embedded systems.
Introduction to SIMD Computing
The Performance Challenge
Consider a typical embedded application processing sensor data:
// Traditional approach: process 1000 sensor readings one at a time
float sensor_data[1000];
float filtered_data[1000];
float previous_values[1000];
float calibration_factor;
float offset;

void process_sensors_traditional(void) {
    for (int i = 0; i < 1000; i++) {
        // Read sensor value
        sensor_data[i] = read_adc_channel(i);
        // Apply calibration
        filtered_data[i] = sensor_data[i] * calibration_factor + offset;
        // Apply a simple low-pass filter against the previous sample
        filtered_data[i] = (filtered_data[i] + previous_values[i]) * 0.5f;
    }
}
This traditional scalar approach handles one value per loop iteration: 1000 iterations, each with its own load, multiply-add, filter, and store. Every one of those operations consumes CPU cycles, memory bandwidth, and energy.
The SIMD Solution
SIMD transforms this by processing multiple values simultaneously:
// SIMD approach: process 4 values at once (Helium/MVE intrinsics, arm_mve.h)
void process_sensors_simd(void) {
    for (int i = 0; i < 1000; i += 4) {
        // Load and process 4 sensor readings simultaneously
        float32x4_t sensors = vld1q_f32(&sensor_data[i]);
        float32x4_t calibrated = vmulq_n_f32(sensors, calibration_factor);
        calibrated = vaddq_n_f32(calibrated, offset);
        // Apply the low-pass filter to all 4 values at once
        float32x4_t previous = vld1q_f32(&previous_values[i]);
        float32x4_t filtered = vmulq_n_f32(vaddq_f32(calibrated, previous), 0.5f);
        vst1q_f32(&filtered_data[i], filtered);
    }
}
Result: 4x fewer instructions, significantly improved performance, and reduced power consumption.
How SIMD Works
CPU Architecture Fundamentals
Traditional scalar processors execute one operation per instruction:
Scalar Processing (Traditional):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Data A │───▶│ ALU │───▶│ Result │
└─────────────┘ └─────────────┘ └─────────────┘
1 cycle 1 cycle 1 cycle
Total: 3 cycles for 1 operation
SIMD processors contain wider execution units that can handle multiple data elements:
SIMD Processing (Parallel):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Data A1 │───▶│ │───▶│ Result 1 │
│ Data A2 │───▶│ SIMD ALU │───▶│ Result 2 │
│ Data A3 │───▶│ │───▶│ Result 3 │
│ Data A4 │───▶│ │───▶│ Result 4 │
└─────────────┘ └─────────────┘ └─────────────┘
1 cycle 1 cycle 1 cycle
Total: 3 cycles for 4 operations
SIMD Architecture Types
Modern processors implement SIMD through various architectural approaches, each optimized for different use cases and performance requirements.
ARM NEON (Advanced SIMD)
Traditional ARM NEON, found in Cortex-A series application processors:
Key Features:
- 128-bit vector width
- Floating-point and integer operations
- Limited predication support
- Available on Cortex-A (and some Cortex-R) cores, but not on Cortex-M
ARM Helium (M-Profile Vector Extension - MVE)
Next-generation SIMD for the Cortex-M series, designed for DSP and ML workloads:
Key Features:
- 128-bit vectors optimized for Cortex-M
- Advanced predication for loop tail handling
- Built-in support for neural network operations
- Low-overhead context switching
- Optimized for power efficiency
ARM Scalable Vector Extension (SVE)
Variable-length SIMD for high-performance computing:
Key Features:
- Variable vector length (128-2048 bits)
- Vector-length agnostic programming
- Advanced predication and gather/scatter
- Primarily for Cortex-A and Neoverse cores
ARM Scalable Matrix Extension (SME)
Specialized for matrix operations and AI acceleration:
Key Features:
- Matrix-oriented instructions
- Streaming SVE mode for memory efficiency
- AI/ML acceleration focus
- Integration with SVE
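When one code base must target several of these extensions, the ACLE feature-test macros allow compile-time dispatch. Below is a minimal sketch (the function name and fallback strategy are ours) that takes the Helium path only when the compiler reports floating-point MVE support:

// Scale a float buffer, dispatching on ACLE feature-test macros.
// __ARM_FEATURE_MVE is a bitfield: bit 0 = integer MVE, bit 1 = float MVE.
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
#include "arm_mve.h"
#endif

void scale_f32(float *data, int count, float factor) {
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
    int i;
    for (i = 0; i <= count - 4; i += 4) {
        // Vector path: 4 floats per iteration
        vst1q_f32(&data[i], vmulq_n_f32(vld1q_f32(&data[i]), factor));
    }
    for (; i < count; i++) {
        data[i] *= factor; // scalar tail
    }
#else
    for (int i = 0; i < count; i++) {
        data[i] *= factor; // scalar fallback for cores without MVE
    }
#endif
}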
Register Organization
SIMD operations use wider registers that can hold multiple values:
// Scalar register (32-bit)
┌─────────────────────────────────┐
│ Single 32-bit float │
└─────────────────────────────────┘
// SIMD register (128-bit) - ARM Helium
┌───────────┬───────────┬───────────┬───────────┐
│ float 1 │ float 2 │ float 3 │ float 4 │
│ 32 bits │ 32 bits │ 32 bits │ 32 bits │
└───────────┴───────────┴───────────┴───────────┘
// Alternative packing for different data types
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ i16 │ // 8x 16-bit
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │i8 │ // 16x 8-bit
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
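To make the packing concrete, here is a minimal sketch (assuming a Helium toolchain and arm_mve.h; the names are ours) that views the same 128-bit register contents at three lane widths. vreinterpretq changes only the type, never the underlying bits:

#include "arm_mve.h"

void lane_views(void) {
    int32x4_t words  = vdupq_n_s32(0x01020304);      // 4 x 32-bit lanes
    int16x8_t halves = vreinterpretq_s16_s32(words); // same bits as 8 x 16-bit
    int8x16_t bytes  = vreinterpretq_s8_s32(words);  // same bits as 16 x 8-bit
    (void)halves;
    (void)bytes;
}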
Execution Pipeline
SIMD instructions flow through the processor pipeline just like scalar instructions, but operate on multiple data lanes:
Pipeline Stage 1: Fetch Instruction
┌─────────────────────────────────────────┐
│ SIMD_ADD vector_a, vector_b, vector_c │
└─────────────────────────────────────────┘
Pipeline Stage 2: Decode & Register Read
┌───────────┬───────────┬───────────┬───────────┐
│ Lane 0 │ Lane 1 │ Lane 2 │ Lane 3 │
│ A[0] B[0] │ A[1] B[1] │ A[2] B[2] │ A[3] B[3] │
└───────────┴───────────┴───────────┴───────────┘
Pipeline Stage 3: Execute (Parallel ALUs)
┌───────────┬───────────┬───────────┬───────────┐
│ ALU Lane0 │ ALU Lane1 │ ALU Lane2 │ ALU Lane3 │
│ A[0]+B[0] │ A[1]+B[1] │ A[2]+B[2] │ A[3]+B[3] │
└───────────┴───────────┴───────────┴───────────┘
Pipeline Stage 4: Write Back
┌───────────┬───────────┬───────────┬───────────┐
│ Result[0] │ Result[1] │ Result[2] │ Result[3] │
└───────────┴───────────┴───────────┴───────────┘
ARM Cortex-M SIMD Evolution
Cortex-M0/M0+: No SIMD Support
- Basic 32-bit ARM architecture
- No parallel data processing capabilities
- Suitable for simple control applications
Cortex-M3: No Packed SIMD
- Hardware multiply and saturating arithmetic (SSAT/USAT)
- No packed parallel data operations
- True packed SIMD arrives with the M4's DSP extension
Cortex-M4/M7: DSP Extensions
ARM Cortex-M4 and M7 include DSP extensions with SIMD capabilities:
// Packed 16-bit addition (2 operations in 1 instruction)
uint32_t packed_add_16(uint32_t a, uint32_t b) {
    return __SADD16(a, b); // Add two 16-bit values in parallel
}

// Packed 8-bit addition (4 operations in 1 instruction)
uint32_t packed_add_8(uint32_t a, uint32_t b) {
    return __SADD8(a, b); // Add four 8-bit values in parallel
}
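A classic use of the DSP extension is a Q15 dot product: __SMLAD multiplies both 16-bit halves of each operand and adds the two products to an accumulator in a single instruction. The sketch below is illustrative only; the buffer names, even-length assumption, and alignment requirement are ours:

#include <stdint.h>
#include "cmsis_gcc.h" // CMSIS DSP-extension intrinsics (Cortex-M4/M7)

// Dot product of two Q15 buffers, two samples per iteration.
// Assumes len is even and both buffers are 4-byte aligned.
int32_t dot_q15(const int16_t *x, const int16_t *y, int len) {
    const uint32_t *px = (const uint32_t *)x;
    const uint32_t *py = (const uint32_t *)y;
    int32_t acc = 0;
    for (int i = 0; i < len / 2; i++) {
        // acc += x_lo*y_lo + x_hi*y_hi
        acc = __SMLAD(px[i], py[i], acc);
    }
    return acc;
}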
Cortex-M55: ARM Helium (M-Profile Vector Extension)
The most advanced SIMD implementation for microcontrollers:
#include "arm_mve.h"

// Process 4 float values simultaneously
void vector_add_float(float *a, float *b, float *result, int count) {
    int i;
    for (i = 0; i <= count - 4; i += 4) {
        float32x4_t vec_a = vld1q_f32(&a[i]);
        float32x4_t vec_b = vld1q_f32(&b[i]);
        float32x4_t vec_result = vaddq_f32(vec_a, vec_b);
        vst1q_f32(&result[i], vec_result);
    }
    // Handle remaining elements
    for (; i < count; i++) {
        result[i] = a[i] + b[i];
    }
}
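Helium's tail predication can absorb that scalar cleanup loop entirely: vctp32q builds a predicate enabling only the lanes still in range, so the final partial iteration loads, adds, and stores under a mask. A minimal sketch of the same function, tail-predicated:

#include "arm_mve.h"

// Same vector add, but the tail is handled with lane predication
// instead of a scalar cleanup loop.
void vector_add_float_predicated(float *a, float *b, float *result, int count) {
    for (int i = 0; i < count; i += 4) {
        // Enable only the lanes where i + lane < count
        mve_pred16_t p = vctp32q((uint32_t)(count - i));
        float32x4_t vec_a = vldrwq_z_f32(&a[i], p);       // masked load
        float32x4_t vec_b = vldrwq_z_f32(&b[i], p);       // masked load
        float32x4_t vec_r = vaddq_x_f32(vec_a, vec_b, p); // masked add
        vstrwq_p_f32(&result[i], vec_r, p);               // masked store
    }
}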
Basic SIMD Operations on Cortex-M4/M7
Packed Arithmetic Instructions
16-bit Operations:
#include "cmsis_gcc.h"

// Parallel 16-bit addition
uint32_t a = 0x00020001; // Two 16-bit values: 2, 1
uint32_t b = 0x00040003; // Two 16-bit values: 4, 3
uint32_t result = __SADD16(a, b); // Result: 0x00060004 (6, 4)

// Parallel 16-bit subtraction
uint32_t diff = __SSUB16(a, b); // Result: 0xFFFEFFFE (-2, -2)

// Dual 16-bit multiply with add: (2*4) + (1*3) = 11
uint32_t product = __SMUAD(a, b);
8-bit Operations:
// Parallel 8-bit addition (4 values at once)
uint32_t a = 0x04030201; // Four 8-bit values: 4, 3, 2, 1
uint32_t b = 0x08070605; // Four 8-bit values: 8, 7, 6, 5
uint32_t result = __SADD8(a, b); // Result: 0x0C0A0806 (12, 10, 8, 6)

// Parallel 8-bit saturating addition (clamps instead of wrapping)
uint32_t sat_result = __QADD8(a, b);
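Saturating packed arithmetic is a natural fit for pixel data, where wrap-around would turn a bright pixel dark. The following sketch uses the unsigned variant __UQADD8 (the function, its names, and the length/alignment assumptions are ours) to brighten four 8-bit pixels per instruction, clamping at 255:

#include <stdint.h>
#include "cmsis_gcc.h"

// Brighten an 8-bit grayscale buffer, 4 pixels per iteration.
// Assumes len is a multiple of 4 and the buffer is 4-byte aligned.
void brighten(uint8_t *pixels, int len, uint8_t amount) {
    uint32_t *p = (uint32_t *)pixels;
    uint32_t add4 = amount * 0x01010101u; // replicate into all 4 bytes
    for (int i = 0; i < len / 4; i++) {
        p[i] = __UQADD8(p[i], add4); // each byte saturates at 255
    }
}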
ARM Helium (M-Profile Vector Extension)
ARM Helium provides 128-bit vector processing for Cortex-M55, enabling advanced SIMD operations.
Vector Data Types
#include "arm_mve.h"
// Integer vector types
int8x16_t vec_i8; // 16 x 8-bit integers
int16x8_t vec_i16; // 8 x 16-bit integers
int32x4_t vec_i32; // 4 x 32-bit integers
// Floating-point vector types
float32x4_t vec_f32; // 4 x 32-bit floats
float16x8_t vec_f16; // 8 x 16-bit floats (half precision)
// Loading and storing vectors
int32x4_t data = vld1q_s32(input_array); // Load 4 integers
vst1q_s32(output_array, result); // Store 4 integers
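Vectors can also be built from scalars and inspected lane by lane, as in this minimal sketch (names are ours):

#include "arm_mve.h"

void build_and_inspect(void) {
    // Broadcast a scalar into all 4 lanes
    float32x4_t all_pi = vdupq_n_f32(3.14159f);
    // Read a single lane back out (lane index must be a constant)
    float first = vgetq_lane_f32(all_pi, 0);
    // Overwrite one lane, keeping the others
    float32x4_t mixed = vsetq_lane_f32(2.71828f, all_pi, 3);
    (void)first;
    (void)mixed;
}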
Advanced Vector Operations
Arithmetic Operations:
// Vector addition, subtraction, multiplication
int32x4_t vec_sum = vaddq_s32(vec_a, vec_b);
int32x4_t vec_diff = vsubq_s32(vec_a, vec_b);
int32x4_t vec_prod = vmulq_s32(vec_a, vec_b);
// Fused multiply-add
float32x4_t vec_fma = vfmaq_f32(vec_c, vec_a, vec_b); // c + (a * b)
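Fused multiply-add is the workhorse of filters and dot products. As an example, here is a vectorized y = a*x + y kernel (names are ours; count is assumed to be a multiple of 4):

#include "arm_mve.h"

// y[i] += a * x[i], 4 elements per iteration (count assumed multiple of 4)
void axpy_f32(float a, const float *x, float *y, int count) {
    for (int i = 0; i < count; i += 4) {
        float32x4_t vx = vld1q_f32(&x[i]);
        float32x4_t vy = vld1q_f32(&y[i]);
        vy = vfmaq_n_f32(vy, vx, a); // vy + (vx * a), single rounding
        vst1q_f32(&y[i], vy);
    }
}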
Reduction Operations:
// Find maximum value in vector (the scalar argument seeds the running max)
int32_t max_val = vmaxvq_s32(INT32_MIN, vec_data);

// Sum all elements in vector
int32_t sum = vaddvq_s32(vec_data);

// Sum all elements and fold into a running scalar accumulator
int32_t total = vaddvaq_s32(acc, vec_data);
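Accumulating reductions chain naturally across loop iterations, as in this minimal array-sum sketch (names are ours; count assumed a multiple of 4):

#include "arm_mve.h"

int32_t array_sum(const int32_t *data, int count) {
    int32_t acc = 0;
    for (int i = 0; i < count; i += 4) {
        // Reduce 4 lanes and fold into the running scalar accumulator
        acc = vaddvaq_s32(acc, vld1q_s32(&data[i]));
    }
    return acc;
}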
Conditional Operations:
// Conditional execution with predicates: lanes where vec_a > vec_threshold
// compute vec_a + vec_b; all other lanes keep the value from vec_base
mve_pred16_t pred = vcmpgtq_s32(vec_a, vec_threshold);
int32x4_t result = vaddq_m_s32(vec_base, vec_a, vec_b, pred);
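Predicates also drive lane-wise selection: vpselq picks each lane from one of two source vectors, which yields, for example, a branch-free clamp (names are ours):

#include "arm_mve.h"

// Clamp every lane to an upper limit, with no branches
int32x4_t clamp_upper(int32x4_t v, int32_t limit) {
    int32x4_t vlimit = vdupq_n_s32(limit);
    mve_pred16_t over = vcmpgtq_s32(v, vlimit); // lanes where v > limit
    return vpselq_s32(vlimit, v, over);         // pick limit there, v elsewhere
}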
As embedded applications become more computationally demanding, SIMD techniques will become increasingly important for achieving real-time performance within power and thermal constraints. The evolution from basic packed arithmetic to sophisticated vector processing units demonstrates ARM’s commitment to bringing high-performance computing capabilities to the embedded world.