Quantization
Running neural networks on microcontrollers seems impossible at first - a typical CNN model might need tens or even hundreds of megabytes, but your MCU only has 32-256KB of RAM. That's where quantization comes in. It's basically the art of making big floating-point models small enough to actually run on real hardware.
Why We Need Quantization
Let’s start with the problem. A simple image classification network might have:
Original Model (Float32):
- 1 million parameters × 4 bytes = 4MB storage
- Intermediate activations: 512KB during inference
- Total: Way more than your typical MCU’s memory
After 8-bit Quantization:
- 1 million parameters × 1 byte = 1MB storage
- Intermediate activations: 128KB during inference
- Total: Still big, but much more manageable
After Aggressive Quantization (1-bit + 8-bit mixed):
- Core weights: 125KB (1-bit binary)
- Activations: 32KB (8-bit)
- Total: Fits in many modern MCUs!
The magic is that you can often keep 90%+ of the accuracy while shrinking the model by 32x or more.
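To make that arithmetic concrete, here's a tiny helper (just a sketch, not tied to any framework; weight_bytes is an illustrative name) that computes weight storage for different bit widths:
#include <stdio.h>
#include <stdint.h>

// Bytes needed to store num_params weights at a given bit width,
// rounded up to whole bytes
static uint32_t weight_bytes(uint32_t num_params, uint32_t bits_per_weight) {
    return (num_params * bits_per_weight + 7) / 8;
}

int main(void) {
    const uint32_t params = 1000000;  // 1 million parameters
    printf("float32: %lu KB\n", (unsigned long)(weight_bytes(params, 32) / 1000)); // 4000 KB = 4MB
    printf("int8   : %lu KB\n", (unsigned long)(weight_bytes(params, 8)  / 1000)); // 1000 KB = 1MB
    printf("binary : %lu KB\n", (unsigned long)(weight_bytes(params, 1)  / 1000)); // 125 KB
    return 0;
}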
How Quantization Actually Works
Instead of storing weights as 32-bit floats (e.g., -3.14159…), quantization maps them onto a small integer range.
8-bit Signed Quantization
The most common approach maps float values to signed 8-bit integers (-128 to +127):
// Quantization formula (real implementations also clamp to -128..+127)
int8_t quantized = (int8_t)(roundf(float_value / scale) + zero_point);
// Dequantization (when needed)
float dequantized = (quantized - zero_point) * scale;
Example:
// Original weights: [0.1, -0.05, 0.3, -0.2, 0.15]
// Scale = 0.0025, Zero_point = 0
// Quantized: [40, -20, 120, -80, 60] (8-bit signed, all within -128..+127)
// Storage: 5 bytes instead of 20 bytes
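Here is a minimal round-trip sketch in C (symmetric quantization with zero_point = 0, matching the example above; quantize_s8 and dequantize_s8 are illustrative names, not a framework API):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Symmetric 8-bit quantization (zero_point = 0)
static int8_t quantize_s8(float x, float scale) {
    float q = roundf(x / scale);
    // Clamp to the int8 range to avoid overflow
    if (q > 127.0f)  q = 127.0f;
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}

static float dequantize_s8(int8_t q, float scale) {
    return (float)q * scale;
}

int main(void) {
    const float weights[5] = {0.1f, -0.05f, 0.3f, -0.2f, 0.15f};
    const float scale = 0.0025f;

    for (int i = 0; i < 5; i++) {
        int8_t q = quantize_s8(weights[i], scale);
        printf("%+.3f -> %4d -> %+.4f\n", weights[i], q, dequantize_s8(q, scale));
    }
    return 0;
}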
Common Data Types
Most MCU frameworks support:
- f32: 32-bit floating point (original)
- s8: 8-bit signed integer (-128 to +127)
- u8: 8-bit unsigned integer (0 to 255)
Performance Comparison
Here’s what you get with different quantization levels on a typical high-end MCU (400-500MHz Cortex-M7):
// Example model: Simple CNN for MNIST
// Original model: 60,000 parameters
typedef struct {
    uint32_t flash_kb;
    uint32_t ram_kb;
    uint32_t inference_ms;
    float    accuracy;
} model_stats_t;

model_stats_t float32_model = {
    .flash_kb     = 240,   // 60k params × 4 bytes
    .ram_kb       = 86,    // Activation memory
    .inference_ms = 45,    // Baseline timing
    .accuracy     = 0.992
};

model_stats_t int8_model = {
    .flash_kb     = 60,    // 60k params × 1 byte
    .ram_kb       = 22,    // ~4x smaller activations
    .inference_ms = 12,    // ~3.75x faster
    .accuracy     = 0.989  // Minimal loss
};
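Since both variants share the model_stats_t layout above, the savings are easy to report directly; print_savings below is just an illustrative helper, not part of any framework:
#include <stdio.h>

// Report how much the quantized model saves relative to the float32 baseline.
// Relies on model_stats_t, float32_model and int8_model defined above.
static void print_savings(const model_stats_t *base, const model_stats_t *quant) {
    printf("Flash: %.1fx smaller\n", (float)base->flash_kb / (float)quant->flash_kb);
    printf("RAM:   %.1fx smaller\n", (float)base->ram_kb   / (float)quant->ram_kb);
    printf("Speed: %.1fx faster\n",  (float)base->inference_ms / (float)quant->inference_ms);
    printf("Accuracy drop: %.1f%%\n", (base->accuracy - quant->accuracy) * 100.0f);
}

// Usage: print_savings(&float32_model, &int8_model);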
Different Quantization Approaches
Post-Training Quantization (PTQ)
Take an already-trained float model and convert it:
Pros: Easy to apply to existing models
Cons: Can lose significant accuracy
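At its core, PTQ just picks a scale and zero point from the value range of each already-trained weight tensor and rounds. A minimal per-tensor sketch (names are illustrative, not from any specific framework):
#include <math.h>
#include <stdint.h>
#include <stddef.h>

// Derive a per-tensor scale/zero_point from the min/max of trained float
// weights, then quantize them to int8 (asymmetric affine quantization).
static void quantize_tensor_ptq(const float *w, size_t n, int8_t *q_out,
                                float *scale_out, int32_t *zp_out) {
    float w_min = w[0], w_max = w[0];
    for (size_t i = 1; i < n; i++) {
        if (w[i] < w_min) w_min = w[i];
        if (w[i] > w_max) w_max = w[i];
    }
    // Make sure zero is exactly representable
    if (w_min > 0.0f) w_min = 0.0f;
    if (w_max < 0.0f) w_max = 0.0f;

    float scale = (w_max - w_min) / 255.0f;   // int8 covers 256 levels
    if (scale == 0.0f) scale = 1.0f;          // all-zero tensor edge case
    int32_t zp = (int32_t)roundf(-128.0f - w_min / scale);

    for (size_t i = 0; i < n; i++) {
        int32_t q = (int32_t)roundf(w[i] / scale) + zp;
        if (q > 127)  q = 127;
        if (q < -128) q = -128;
        q_out[i] = (int8_t)q;
    }
    *scale_out = scale;
    *zp_out = zp;
}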
Quantization-Aware Training (QAT)
Train the model with quantization in mind from the start:
Pros: Better accuracy retention
Cons: Requires retraining the model, which takes more time and compute
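QAT works by inserting "fake quantization" into the forward pass during training, so the network learns weights that survive rounding and clamping. The forward-pass half of that idea looks roughly like this in C (a sketch only; real QAT also needs the straight-through gradient on the training side):
#include <math.h>

// Fake quantization: quantize to int8 and immediately dequantize, so the
// value carries the same rounding/clamping error it will have on the MCU.
static float fake_quantize(float x, float scale, int zero_point) {
    float q = roundf(x / scale) + (float)zero_point;
    if (q > 127.0f)  q = 127.0f;
    if (q < -128.0f) q = -128.0f;
    return (q - (float)zero_point) * scale;  // back to float, error baked in
}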
MCU-Specific Optimizations
Optimized Kernel Operations
Modern MCU AI frameworks generate highly optimized C kernels for specific data type combinations:
// Binary convolution kernel (1-bit × 1-bit inputs, binarized output)
// Replaces multiply-accumulate with XNOR + popcount on packed words
void conv2d_binary_kernel(
    const uint32_t *input,    // Packed binary input
    const uint32_t *weights,  // Packed binary weights
    uint32_t *output,         // Packed binary output
    const conv_params_t *params
) {
    // Highly optimized implementation (often hand-tuned for Cortex-M):
    // each 32-bit XNOR + popcount covers 32 binary multiply-accumulates,
    // so one word operation replaces 32 individual MACs
}
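The key trick is that 32 binary multiply-accumulates collapse into one XNOR plus a popcount on a 32-bit word. A portable sketch of the inner dot product, using GCC/Clang's __builtin_popcount instead of hand-written assembly:
#include <stdint.h>

// Dot product of two binary vectors packed 32 values per word.
// Bits encode {+1, -1}: bit = 1 means +1, bit = 0 means -1.
// Matching bits contribute +1, differing bits contribute -1, so
// dot = (#matching) - (#differing) = 2 * popcount(XNOR) - total_bits.
static int32_t binary_dot(const uint32_t *a, const uint32_t *b, int n_words) {
    int32_t match = 0;
    for (int i = 0; i < n_words; i++) {
        match += __builtin_popcount(~(a[i] ^ b[i]));  // XNOR + popcount
    }
    return 2 * match - 32 * n_words;
}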
Memory Layout Optimizations
Channel Alignment: Most 32-bit MCUs work best when channels are multiples of 32:
// Recommended: 32, 64, 96, 128 channels
// Each 32 channels = 1 word for binary data
#define CHANNELS 64 // Good
// #define CHANNELS 30 // Wasteful (needs padding)
// Binary tensor storage
uint32_t activations[HEIGHT * WIDTH * CHANNELS/32];
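Packing is what makes the alignment matter: each group of 32 channels becomes one word, so channel counts that aren't multiples of 32 waste bits in the last word. A sketch of the sign-based packing step (pack_binary_32 is an illustrative helper, not a framework function):
#include <stdint.h>

// Pack 32 float activations into one word: bit i = 1 if the value is
// non-negative (+1), bit i = 0 if it is negative (-1).
static uint32_t pack_binary_32(const float *vals) {
    uint32_t word = 0;
    for (int i = 0; i < 32; i++) {
        if (vals[i] >= 0.0f) {
            word |= (1u << i);
        }
    }
    return word;
}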
Practical Example: Quantized Image Classifier
Let’s build a complete quantized model for MNIST digit recognition:
Generated MCU Code Usage
Example using a generic AI framework API:
#include "mcu_ai_framework.h"
#include "mnist_model.h"
// Initialize the AI model
model_handle_t network = NULL;
static uint8_t activations[MODEL_ACTIVATION_SIZE];
static uint8_t input_buffer[MODEL_INPUT_SIZE];
static uint8_t output_buffer[MODEL_OUTPUT_SIZE];
void run_inference(uint8_t* image_data) {
// Copy input data
memcpy(input_buffer, image_data, MODEL_INPUT_SIZE);
// Run inference using the framework API
int result = model_predict(network, input_buffer, output_buffer);
if (result == 0) {
// Process results
int8_t* predictions = (int8_t*)output_buffer;
int best_class = 0;
int8_t best_score = predictions[0];
for (int i = 1; i < 10; i++) {
if (predictions[i] > best_score) {
best_score = predictions[i];
best_class = i;
}
}
printf("Predicted digit: %d (confidence: %d)\n", best_class, best_score);
}
}
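Putting it together, a typical superloop initializes the network once and then calls run_inference per frame. Here model_init and camera_capture_frame are placeholders for whatever your framework and board actually provide:
int main(void) {
    // One-time model setup (framework-specific; assumed API)
    network = model_init(activations, sizeof(activations));

    while (1) {
        uint8_t frame[MODEL_INPUT_SIZE];
        camera_capture_frame(frame);  // placeholder for your image source
        run_inference(frame);
    }
}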
Performance Analysis Tools
Model Analysis Tools
Most MCU AI frameworks provide analysis tools to understand your model’s performance:
STM32 X-CUBE-AI Example:
$ stm32ai analyze mnist_quantized.h5
Model complexity analysis:
-------------------------
params # : 45,386 items (177.29 KiB)
macc : 1,234,567
weights (ro) : 8,456 B (8.26 KiB) / -168,988(-95.2%) vs float model
activations (rw) : 12,288 B (12.00 KiB) (1 segment)
ram (total) : 15,424 B (15.06 KiB) = 12,288 + 3,136 + 40
TensorFlow Lite Analysis:
$ tflite_convert --model_metrics_file=metrics.txt \
--keras_model_file=mnist_quantized.h5
Model size: 8.5 KB
Inference time (estimated): 15.2 ms
Memory usage: 12.8 KB