Post

ACUITY Quantization Accuracy Optimization

ACUITY Quantization Accuracy Optimization

ACUITY Quantitative Accuracy Optimization

When quantizing with ACUITY, the model accuracy is slightly lost, but if the accuracy loss is too large and does not meet the needs of the current project, you can use Kullback-Leibler Divergence (KLD) quantization or mixed quantization.

If the model accuracy is not satisfied, you can use the KLD quantization algorithm to quantify the model first, and then check whether the accuracy is satisfied.

Quantization Using KLD

To use KLD quantization, you can add the following options in pegasus_quantize.sh:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
cmd="$PEGASUS quantize \
    --model         ${NAME}.json \
    --model-data    ${NAME}.data \
    --iterations    ${Q_ITER} \
    --device        CPU \
    --with-input-meta ${NAME}_inputmeta.yml \
    --rebuild  \
    --model-quantize  ${NAME}_${POSTFIX}.quantize \
    --quantizer ${QUANTIZER} \
    --qtype  ${QUANTIZED}  \
    --algorithm kl_divergence \
    --batch-size 100 \
    --divergence-first-quantize-bits 12
    --MLE"
  • --algorithm kl_divergence — set up KLD quantization
  • --divergence-first-quantize-bits — set up a $2^{12}$ KLD histogram box
  • --batch-size — set the number of quantitative inputs for the model
  • --MLE — optional; if the quantization accuracy using KLD is still not compliant, use this option to achieve higher accuracy, but it will increase the quantization time

Mixed Quantification

The purpose of hybrid quantization is to use data types with higher precision for specific layers in the specified model and lower data types for other layers to ensure the accuracy of the final result. If the accuracy of a quantized model cannot meet the requirements and cannot be improved by co-quantization (such as KLD quantization), then hybrid quantization can be used to avoid loss of accuracy.

Mixed Quantification Examples

Taking the uint8 quantization of MobileNetV2_ImageNet in the previous section as an example, this model directly uses ACUITY to quantize uint8 and has a significant loss of accuracy compared to the floating-point model.

MobileNetV2_ImageNet — compare the results of the floating-point model after direct quantization by uint8:

Float model inference results

1
2
3
4
5
6
I 07:01:06 Iter(0), top(5), tensor(@attach_Logits/Softmax/out0_0:out0) :
I 07:01:06 812: 0.9990391731262207
I 07:01:06 814: 0.0001562383840791881
I 07:01:06 627: 8.89502334757708e-05
I 07:01:06 864: 6.59249781165272e-05
I 07:01:06 536: 2.808812860166654e-05

uint8 model inference results

1
2
3
4
5
6
I 07:02:20 Iter(0), top(5), tensor(@attach_Logits/Softmax/out0_0:out0) :
I 07:02:20 904: 0.8729746341705322
I 07:02:20 530: 0.012925799004733562
I 07:02:20 905: 0.01022859662771225
I 07:02:20 468: 0.006405209191143513
I 07:02:20 466: 0.005068646278232336

Statistics: Layers Requiring Mixed Quantization

Add the --compute-entropy parameter to the quantization script pegasus_quantize.sh:

1
2
3
4
5
6
7
8
9
10
11
cmd="$PEGASUS quantize \
    --model         ${NAME}.json \
    --model-data    ${NAME}.data \
    --iterations    ${Q_ITER} \
    --device        CPU \
    --with-input-meta ${NAME}_inputmeta.yml \
    --rebuild  \
    --model-quantize  ${NAME}_${POSTFIX}.quantize \
    --quantizer ${QUANTIZER} \
    --qtype  ${QUANTIZED}  \
    --compute-entropy"

X86 Linux PC

1
2
# pegasus_quantize.sh MODEL_DIR QUANTIZED ITERATION
pegasus_quantize.sh MobileNetV2_Imagenet uint8 10

When the quantization script pegasus_quantize.sh is executed, the following files are generated:

  • MODEL_DIR_QUANTIZE.quantize — contains data for the quantization model, where the layers in customized_quantize_layers are automatically counted for layers that need to be used for mixed quantization.
  • entropy.txt — records the entropy value of each layer in this quantization. The higher the entropy value, the lower the quantification accuracy. The range is [0, 1].

Users can refer to the values in entropy.txt to add, delete, or modify the layers in customized_quantize_layers within MODEL_DIR_QUANTIZE.quantize.

Execute the Hybrid Quantization Command

Tips: When pegasus_quantize.sh is executed, the full quantization command will be printed at the top. Replace --rebuild with --hybrid to run hybrid quantization.

X86 Linux PC

1
2
3
4
5
6
7
8
9
10
11
python3 ~/acuity-toolkit-whl-6.30.22/bin/pegasus.py quantize \
    --model MobileNetV2_Imagenet.json \
    --model-data MobileNetV2_Imagenet.data \
    --iterations 1 \
    --device CPU \
    --with-input-meta MobileNetV2_Imagenet_inputmeta.yml \
    --hybrid \
    --model-quantize MobileNetV2_Imagenet_uint8.quantize \
    --quantizer asymmetric_affine \
    --qtype uint8 \
    --compute-entropy

After the command is executed, the ACUITY Model Weights file (.data) is updated, and a new ACUITY Model Structure file (quantize.json) is generated.

Mixed Quantification Results

After performing the hybrid quantization, use pegasus_inference.sh for inference on the uint8 mixed quantization model.

Because the model structure changes after mixed quantization, please first modify the --model parameter in pegasus_inference.sh to point to the new quantize.json file.

X86 Linux PC

1
2
# pegasus_inference.sh MODEL_DIR QUANTIZED ITERATION
pegasus_inference.sh MobileNetV2_Imagenet/ uint8

The inference output result is:

1
2
3
4
5
6
I 04:00:10 Iter(0), top(5), tensor(@attach_Logits/Softmax/out0_0:out0) :
I 04:00:10 812: 0.9987972974777222
I 04:00:10 404: 0.0001131662429543212
I 04:00:10 814: 6.176823808345944e-05
I 04:00:10 627: 4.6059416490606964e-05
I 04:00:10 833: 4.153002373641357e-05

The hybrid quantization model now has the highest confidence level of 812 with a confidence of 0.998 in the Top 5 results, which is consistent with the floating-point model results — proving that the mixed quantization uint8 model has successfully reduced the loss of model accuracy.

Next, you can continue with NPU deployment to compile and export the model. Because the model structure changes after mixed quantization, use the quantize.json file generated by mixed quantization as the --model parameter when performing model conversion.

This post is licensed under CC BY 4.0 by the author.