Quantization

Running neural networks on microcontrollers seems impossible at first: a typical CNN model might need hundreds of megabytes, but your MCU only has 32-256 KB of RAM. That’s where quantization comes in. It converts large floating-point models into compact integer ones that are small enough to run on real hardware.

Why We Need Quantization

Let’s start with the problem. A simple image classification network might have:

Original Model (Float32):

  • 1 million parameters × 4 bytes = 4MB storage
  • Intermediate activations: 512KB during inference
  • Total: Way more than your typical MCU’s memory

After 8-bit Quantization:

  • 1 million parameters × 1 byte = 1MB storage
  • Intermediate activations: 128KB during inference
  • Total: Still big, but much more manageable

After Aggressive Quantization (1-bit + 8-bit mixed):

  • Core weights: 125KB (1-bit binary)
  • Activations: 32KB (8-bit)
  • Total: Fits in many modern MCUs!

The usual first step is 8-bit integer quantization, which gives about a 4x reduction for weights and activations compared with float32. More aggressive schemes, such as binary weights, can shrink selected tensors further, but the accuracy tradeoff is model-dependent and the MCU toolchain must support that quantization scheme.
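
The storage figures above follow from simple arithmetic: parameter count times bits per parameter, divided by eight. A minimal sketch of that calculation is shown here; the function name `tensor_bytes` is an assumption made for illustration, and it assumes sub-byte values are packed tightly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store `count` values of `bits` bits each, rounded up
 * to a whole byte. Sub-byte values are assumed to be tightly packed. */
static size_t tensor_bytes(size_t count, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return (count * bits + 7u) / 8u;
}
```

For one million parameters this gives 4,000,000 bytes at 32 bits, 1,000,000 bytes at 8 bits, and 125,000 bytes at 1 bit, matching the lists above.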

How Quantization Actually Works

Instead of storing weights as 32-bit floats (for example, -3.14159…), quantization maps them to a smaller integer range using a scale factor and a zero point.

8-bit Signed Quantization

The most common approach maps float values to signed 8-bit integers (-128 to +127):

// Quantization formula
int32_t q = (int32_t)roundf(float_value / scale + zero_point);
q = q < -128 ? -128 : (q > 127 ? 127 : q);
int8_t quantized = (int8_t)q;

// Dequantization (when needed)
float dequantized = (quantized - zero_point) * scale;


Example:

// Original weights: [0.1, -0.05, 0.24, -0.2, 0.15]
// Scale = 0.002, zero_point = 0

// Quantized: [50, -25, 120, -100, 75] (8-bit signed)
// Storage: 5 bytes instead of 20 bytes
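
The formula and the example can be combined into a small self-contained sketch. The helper names (`quantize_s8`, `dequantize_s8`, `quantize_array_s8`) are assumptions for illustration, not part of any particular framework; rounding is done by hand so the code does not depend on the math library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map one float to a signed 8-bit value: q = round(value / scale + zero_point),
 * clamped to [-128, 127]. */
static int8_t quantize_s8(float value, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    float q = value / scale + (float)zero_point;
    if (q < -128.0f) q = -128.0f;
    if (q > 127.0f) q = 127.0f;
    /* Round half away from zero, like roundf(), without needing libm. */
    return (int8_t)(q >= 0.0f ? q + 0.5f : q - 0.5f);
}

/* Recover an approximate float from the quantized value. */
static float dequantize_s8(int8_t q, float scale, int32_t zero_point)
{
    return (float)((int32_t)q - zero_point) * scale;
}

/* Quantize a whole array with one shared scale and zero point. */
static void quantize_array_s8(const float *src, int8_t *dst, size_t n,
                              float scale, int32_t zero_point)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = quantize_s8(src[i], scale, zero_point);
    }
}
```

Calling `quantize_array_s8` on the five example weights with a scale of 0.002 and a zero point of 0 yields 50, -25, 120, -100, 75. A value outside the representable range, such as 1.0, saturates at 127.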

Common Data Types

Most MCU frameworks support:

  • f32: 32-bit floating point (original)
  • s8: 8-bit signed integer (-128 to +127)
  • u8: 8-bit unsigned integer (0 to 255)

For small embedded targets, the most useful deployment path is usually integer-only inference: both weights and activations are stored and processed as integers. Some frameworks also support mixed models, but every float fallback costs extra memory and latency.
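
To make "integer-only inference" concrete, here is a rough sketch of the core arithmetic: an int8 dot product accumulated in 32 bits, then rescaled back to int8 with a fixed-point multiplier. The function names and the Q15 multiplier format are assumptions chosen for readability; real frameworks use their own (often higher-precision) fixed-point schemes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate an int8 dot product in 32 bits. The input zero point is
 * subtracted so asymmetric activations are handled; weights are assumed
 * symmetric (zero point 0). */
static int32_t dot_s8(const int8_t *x, const int8_t *w, size_t n,
                      int32_t x_zero_point)
{
    assert(x != NULL && w != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += ((int32_t)x[i] - x_zero_point) * (int32_t)w[i];
    }
    return acc;
}

/* Rescale an int32 accumulator to int8 without floating point.
 * mult_q15 approximates (input_scale * weight_scale / output_scale) * 2^15. */
static int8_t requantize_s8(int32_t acc, int32_t mult_q15,
                            int32_t out_zero_point)
{
    assert(mult_q15 >= 0);
    int64_t prod = (int64_t)acc * mult_q15;
    int64_t half = (int64_t)1 << 14;
    /* Divide by 2^15, rounding half away from zero. */
    int64_t scaled = prod >= 0 ? (prod + half) / 32768
                               : -((-prod + half) / 32768);
    int64_t q = scaled + out_zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}
```

With inputs {10, -20, 30}, weights {1, 2, 3}, and an input zero point of 0, the accumulator is 60. A multiplier of 16384 (0.5 in Q15) and an output zero point of 3 then produce 33.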

Performance Comparison

Here’s what you get with different quantization levels on a typical high-end MCU (400-500MHz Cortex-M7):

// Example model: Simple CNN for MNIST
// Original model: 60,000 parameters

typedef struct {
    uint32_t flash_kb;
    uint32_t ram_kb;  
    uint32_t inference_ms;
    float accuracy;
} model_stats_t;

model_stats_t float32_model = {
    .flash_kb = 240,    // 60k params × 4 bytes
    .ram_kb = 86,       // Activation memory
    .inference_ms = 45, // Baseline timing
    .accuracy = 0.992
};

model_stats_t int8_model = {
    .flash_kb = 60,     // 60k params × 1 byte  
    .ram_kb = 22,       // 4x smaller activations
    .inference_ms = 12, // 3.75x faster
    .accuracy = 0.989   // Minimal loss
};

Different Quantization Approaches

Post-Training Quantization (PTQ)

Take an already-trained float model and convert it:

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
quant_model = converter.convert()

Pros: Easy to apply to existing models
Cons: Can lose significant accuracy

Quantization-Aware Training (QAT)

Train the model with quantization in mind from the start:

# TensorFlow Model Optimization Toolkit example shape:
# annotate/apply quantization during training, then export to TFLite int8.

Pros: Better accuracy retention
Cons: Requires retraining the model, which takes more time

For MCU deployment, PTQ normally needs a representative calibration dataset so the converter can estimate activation ranges. Without calibration, some tools only quantize weights, which reduces file size but may still require floating-point computation during inference.
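
Calibration can be pictured as tracking the minimum and maximum activation seen on the representative data, then deriving a scale and zero point from that range. The sketch below uses a plain min/max rule for asymmetric int8. The type `quant_params_t` and the function `calibrate_s8` are hypothetical names, and real converters may use more refined range estimators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    float scale;
    int32_t zero_point;
} quant_params_t;

/* Derive asymmetric int8 parameters from observed activation samples. */
static quant_params_t calibrate_s8(const float *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    /* Start both bounds at 0 so the range always contains real 0.0,
     * which must be exactly representable (e.g. for zero padding). */
    float lo = 0.0f, hi = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }

    quant_params_t p = { 1.0f, 0 };
    if (hi == lo) {
        return p;  /* all samples were 0: any scale works */
    }
    p.scale = (hi - lo) / 255.0f;            /* 256 levels span the range */
    float zp = -128.0f - lo / p.scale;       /* level that represents 0.0 */
    p.zero_point = (int32_t)(zp >= 0.0f ? zp + 0.5f : zp - 0.5f);
    return p;
}
```

For samples spanning -1.0 to 3.0, this gives a scale of about 0.0157 and a zero point of -64.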

Per-Tensor vs Per-Channel Quantization

Per-tensor quantization uses one scale and zero_point for the whole tensor. Per-channel quantization uses separate parameters per output channel or filter, which usually improves convolution weight accuracy with little runtime overhead.

In many int8 deployment flows, weights are symmetric and often per-channel, while activations are per-tensor and may be asymmetric. This matters because optimized MCU kernels are usually written for a small set of weight/activation schemes.
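
The difference is easiest to see when channels have very different weight magnitudes. The following sketch computes one symmetric scale per output channel as max|w| / 127; the function name and the flat row-major weight layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute one symmetric int8 scale per output channel.
 * `weights` holds `channels` rows of `per_channel` values each. */
static void per_channel_scales(const float *weights, size_t channels,
                               size_t per_channel, float *scales)
{
    assert(weights != NULL && scales != NULL);
    for (size_t c = 0; c < channels; c++) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < per_channel; i++) {
            float v = weights[c * per_channel + i];
            if (v < 0.0f) v = -v;
            if (v > max_abs) max_abs = v;
        }
        /* Fall back to 1.0 for an all-zero channel to avoid a zero scale. */
        scales[c] = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    }
}
```

With one channel holding {0.5, -1.27} and another holding {0.0127, 0.01}, the per-channel scales are about 0.01 and 0.0001. A single per-tensor scale of 0.01 would map both weights in the second channel to the integer 1, losing the difference between them.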

MCU-Specific Optimizations

Optimized Kernel Operations

Modern MCU AI frameworks generate highly optimized C kernels for specific data type combinations:

// Binary convolution kernel (1-bit × 1-bit -> 1-bit)
// Uses ARM SXTAB16, USAD8 instructions for efficiency on Cortex-M
void conv2d_binary_kernel(
    const uint32_t *input,     // Packed binary input
    const uint32_t *weights,   // Packed binary weights  
    uint32_t *output,          // Packed binary output
    const conv_params_t *params
) {
    // Highly optimized implementation
    // Uses packed bit operations/popcount-style accumulation
    // instead of ordinary multiply-accumulate operations.
}
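
The kernel body above is elided, so the following is only a portable sketch of the general XNOR-and-popcount idea behind binary dot products, not the framework's implementation and not the instruction sequence named in the comment. It assumes a bit value of 1 encodes +1 and 0 encodes -1; the names `popcount32` and `binary_dot` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned popcount32(uint32_t v)
{
    unsigned count = 0;
    while (v != 0u) {
        v &= v - 1u;
        count++;
    }
    return count;
}

/* Dot product of two packed {-1, +1} vectors (bit 1 = +1, bit 0 = -1).
 * XNOR marks positions where the signs agree; each agreement adds +1
 * and each disagreement adds -1. */
static int32_t binary_dot(const uint32_t *a, const uint32_t *b, size_t words)
{
    assert(a != NULL && b != NULL);
    int32_t matches = 0;
    for (size_t i = 0; i < words; i++) {
        matches += (int32_t)popcount32(~(a[i] ^ b[i]));
    }
    return 2 * matches - (int32_t)(32u * words);
}
```

Each 32-bit word replaces 32 multiply-accumulate steps with one XOR, one NOT, and one bit count, which is where the speed and memory savings of binary layers come from.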

Memory Layout Optimizations

Channel Alignment: Most 32-bit MCUs work best when channels are multiples of 32:

// Recommended: 32, 64, 96, 128 channels
// Each 32 channels = 1 word for binary data

#define CHANNELS 64  // Good
// #define CHANNELS 30  // Wasteful (needs padding)

// Binary tensor storage
uint32_t activations[HEIGHT * WIDTH * CHANNELS/32];
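
To illustrate why a channel count such as 30 wastes space, here is a small sketch that computes the number of 32-bit words needed per pixel and packs the sign bits of one pixel's channels. The helper names and the rule "bit is 1 when the value is non-negative" are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Words needed to hold `channels` one-bit values, rounded up. */
static size_t words_per_pixel(size_t channels)
{
    return (channels + 31u) / 32u;
}

/* Pack the signs of one pixel's channel values into 32-bit words.
 * Bit c%32 of word c/32 is 1 when values[c] >= 0; padding bits stay 0. */
static void pack_signs(const int8_t *values, size_t channels, uint32_t *packed)
{
    assert(values != NULL && packed != NULL);
    size_t words = words_per_pixel(channels);
    for (size_t w = 0; w < words; w++) {
        packed[w] = 0u;
    }
    for (size_t c = 0; c < channels; c++) {
        if (values[c] >= 0) {
            packed[c / 32u] |= (uint32_t)1 << (c % 32u);
        }
    }
}
```

With 30 channels, one full word is still allocated per pixel and two bits are padding; with 33 channels, a second word is needed for a single bit. Kernels also have to mask or otherwise account for padding bits so they do not contribute to the result.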

Practical Example: Quantized Image Classifier

Let’s look at how a quantized model for MNIST digit recognition is used on the device.

Generated MCU Code Usage

Example using a generic AI framework API:

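#include <stdint.h>
#include <stdio.h>   // printf
#include <string.h>  // memcpy

#include "mcu_ai_framework.h"
#include "mnist_model.h"

// Model handle; it must be set up by the framework's initialization call
// before run_inference() is used
model_handle_t network = NULL;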
static uint8_t activations[MODEL_ACTIVATION_SIZE];
static uint8_t input_buffer[MODEL_INPUT_SIZE];
static uint8_t output_buffer[MODEL_OUTPUT_SIZE];

void run_inference(uint8_t* image_data) {
    // Copy input data
    memcpy(input_buffer, image_data, MODEL_INPUT_SIZE);
    
    // Run inference using the framework API
    int result = model_predict(network, input_buffer, output_buffer);
    
    if (result == 0) {
        // Process results
        int8_t* predictions = (int8_t*)output_buffer;
        int best_class = 0;
        int8_t best_score = predictions[0];
        
        for (int i = 1; i < 10; i++) {
            if (predictions[i] > best_score) {
                best_score = predictions[i];
                best_class = i;
            }
        }
        
        printf("Predicted digit: %d (confidence: %d)\n", best_class, best_score);
    }
}
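
The "confidence" printed above is the raw int8 score. To report it as a probability, the score has to be dequantized with the output tensor's scale and zero point. The sketch below assumes an int8 softmax output with a scale of 1/256 and a zero point of -128, which is a common convention; the actual values should be read from your model. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the largest score; ties resolve to the first occurrence. */
static size_t argmax_s8(const int8_t *scores, size_t n)
{
    assert(scores != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}

/* Convert a quantized output score to a real value using the output
 * tensor's quantization parameters. */
static float score_to_probability(int8_t score, float scale, int32_t zero_point)
{
    return (float)((int32_t)score - zero_point) * scale;
}
```

Under the assumed parameters, a raw score of 127 corresponds to 255/256, or about 0.996, and a raw score of -128 corresponds to 0.0.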

Performance Analysis Tools

Model Analysis Tools

Most MCU AI frameworks provide analysis tools to understand your model’s performance:

Generic Analysis Example:

$ mcu-ai-tool analyze mnist_quantized.tflite

Model complexity analysis:
-------------------------
params #             : 45,386 items (177.29 KiB)
macc                 : 1,234,567
weights (ro)         : 8,456 B (8.26 KiB) / -168,988(-95.2%) vs float model
activations (rw)     : 12,288 B (12.00 KiB) (1 segment)  
ram (total)          : 15,424 B (15.06 KiB) = 12,288 + 3,136 + 40

TensorFlow Lite Analysis (illustrative; confirm the available flags against your installed tflite_convert version):

$ tflite_convert --model_metrics_file=metrics.txt \
    --keras_model_file=mnist_quantized.h5

Model size: 8.5 KB
Inference time (estimated): 15.2 ms
Memory usage: 12.8 KB
This post is licensed under CC BY 4.0 by the author.