CMSIS NN

CMSIS NN software library is a collection of efficient neural network kernels developed to maximize the performance and minimize the memory footprint of neural networks on Cortex-M processors.

About

This page gives a quick overview of the functions available and the key differences between them.

Note: The GitHub documentation does not follow the develop branch but rather the last official release in the master branch. Consequently, the group documentation linked to in the table might not have the listed API. Please refer to the description in the header file instead.

Support / Contact

For any questions or to reach the CMSIS-NN team, please create a new issue in https://github.com/ARM-software/CMSIS_5/issues

Supported Framework

TensorFlow Lite for Microcontrollers

Legacy vs TFL micro compliant APIs

There are two kinds of APIs available in the CMSIS-NN repository: one that supports a legacy symmetric quantization scheme[1] and one that supports TFL micro's symmetric quantization scheme. One of the main differences is how the quantization is performed. The legacy APIs use a fixed-point format with power-of-2 scaling, which reduces re-quantization to a cycle-efficient shift operation. No new development is done on the legacy functions; all new development is on the functions that support TFL micro. The table below highlights some of the differences between the two formats for convolution-related functions. The TFL micro compliant APIs have a _s8 suffix in most cases, and the API header file always states which scheme a function supports.

| Operation | Legacy APIs | TFL micro compliant APIs |
|---|---|---|
| Core loop | No input or filter offset | Input and/or filter offset |
| Re-quantization | Shift and saturate in one instruction. ~ 5 cycles | Greater than 200 cycles for one output element |
| Quantization | Per layer quantization | Per-channel quantization |
| Output offset | No | Per-layer output offset |
| Fused Activation | No | Yes |
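
To make the re-quantization, output offset and fused activation rows concrete, here is a minimal C sketch of the two styles. It is not the CMSIS-NN implementation: the function names (requantize_legacy, requantize_tflm, requantize_per_channel) are hypothetical, the TFL micro style variant uses a simplified single-rounding fixed-point multiply that is not guaranteed to be bit exact with TFL micro, and right shifts of negative values are assumed to be arithmetic (as on GCC and Clang).

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value into [lo, hi]. */
static int64_t clamp_i64(int64_t v, int64_t lo, int64_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Legacy style: power-of-2 scaling, so re-quantization is a right shift
 * followed by saturation to the int8 (q7) range. */
static int8_t requantize_legacy(int32_t acc, unsigned out_shift)
{
    assert(out_shift < 32);
    return (int8_t)clamp_i64(acc >> out_shift, -128, 127);
}

/* TFL micro style (simplified, assumed form): the real-valued scale is
 * expressed as multiplier / 2^31 * 2^shift. The result gets a per-layer
 * output offset and is clamped to the fused activation range. */
static int8_t requantize_tflm(int32_t acc, int32_t multiplier, int32_t shift,
                              int32_t out_offset, int32_t act_min, int32_t act_max)
{
    const int32_t total_right = 31 - shift;
    assert(total_right >= 1 && total_right <= 62);
    assert(act_min >= -128 && act_min <= act_max && act_max <= 127);

    const int64_t prod = (int64_t)acc * (int64_t)multiplier;
    const int64_t rounding = (int64_t)1 << (total_right - 1);
    const int64_t scaled = (prod + rounding) >> total_right; /* round half up */
    return (int8_t)clamp_i64(scaled + out_offset, act_min, act_max);
}

/* Per-channel quantization: every output channel has its own multiplier
 * and shift, while the output offset is shared by the whole layer. */
static void requantize_per_channel(const int32_t *acc, const int32_t *multiplier,
                                   const int32_t *shift, int32_t out_ch,
                                   int32_t out_offset, int32_t act_min,
                                   int32_t act_max, int8_t *out)
{
    for (int32_t c = 0; c < out_ch; ++c) {
        out[c] = requantize_tflm(acc[c], multiplier[c], shift[c],
                                 out_offset, act_min, act_max);
    }
}
```

With a multiplier of 1 << 30 (0.5 in this representation) and a shift of 0, requantize_tflm(100, 1 << 30, 0, 3, -128, 127) scales 100 to 50, adds the output offset 3 and returns 53. The extra 64-bit multiply, rounding, offset and clamp per output element are the reason this path costs more than the single shift-and-saturate of the legacy scheme.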

TFL micro compliant APIs

| Group | API | Base Operator | Input Constraints | Additional memory required for optimizations (bytes) | DSP Optimized | MVE Optimized | Other comments |
|---|---|---|---|---|---|---|---|
| Conv | arm_convolve_wrapper_s8() | CONV | None | n.a. | Yes | Yes | The additional memory required depends on the optimal convolution function called. |
| | arm_convolve_s8() | CONV | None | 4 * (ker_x * ker_y * input_ch + delta) | Yes | Yes | delta - MVE only |
| | arm_convolve_1x1_s8_fast() | CONV | dilation = 1; ker_x = 1, ker_y = 1; pad = 0; stride = 1; input_ch % 4 = 0 | No | Yes | Yes | |
| | arm_convolve_1_x_n_s8() | CONV | dilation = 1; output_y % 4 = 0 | 4 * ker_x * ker_y * input_ch | Yes | Yes | |
| | arm_depthwise_conv_wrapper_s8() | DEPTHWISE_CONV | None | n.a. | Yes | Yes | The additional memory required depends on the optimal convolution function called. |
| | arm_depthwise_conv_3x3_s8() | DEPTHWISE_CONV | dilation = 1; depth_multiplier = 1; pad_x <= 1 | No | No | No | Preferred function for 3x3 kernel size for DSP extension. For MVE, use arm_depthwise_conv_s8_opt(). |
| | arm_depthwise_conv_s8() | DEPTHWISE_CONV | None | No | No | No | |
| | arm_depthwise_conv_s8_opt() | DEPTHWISE_CONV | dilation = 1; depth_multiplier = 1 | DSP: 2 * ker_x * ker_y * input_ch; MVE: 2 * DSP + 4 | Yes | Yes | Best case is when channels are a multiple of 4 or at least >= 4 |
| | arm_convolve_wrapper_s16() | CONV | None | n.a. | Yes | No | The additional memory required depends on the optimal convolution function called. |
| | arm_convolve_s16() | CONV | None | No | No | No | |
| | arm_convolve_fast_s16() | CONV | dilation = 1; ker_x * ker_y * input_ch < 512 | 4 * ker_x * ker_y * input_ch | Yes | No | |
| | arm_depthwise_conv_s16() | DEPTHWISE_CONV | None | No | No | No | |
| Fully Connected | arm_fully_connected_s8() | FULLY CONNECTED & MAT MUL | None | No | Yes | Yes | |
| | arm_fully_connected_s16() | FULLY CONNECTED & MAT MUL | None | No | Yes | No | |
| Pooling | arm_avgpool_s8() | AVERAGE POOL | None | input_ch * 2 (DSP only) | Yes | Yes | Best case is when channels are a multiple of 4 or at least >= 4 |
| | arm_avgpool_s16() | AVERAGE POOL | None | None | No | No | Best case is when channels are a multiple of 4 or at least >= 4 |
| | arm_maxpool_s8() | MAX POOL | None | None | Yes | Yes | |
| | arm_maxpool_s16() | MAX POOL | None | None | No | No | |
| Softmax | arm_softmax_q7() | SOFTMAX | None | None | Yes | No | Not bit exact to TFLu but can be up to 70x faster |
| | arm_softmax_s8() | SOFTMAX | None | None | No | Yes | Bit exact to TFLu |
| | arm_softmax_s8_s16() | SOFTMAX | None | None | No | No | Bit exact to TFLu |
| | arm_softmax_s16() | SOFTMAX | None | None | No | No | Bit exact to TFLu |
| | arm_softmax_u8() | SOFTMAX | None | None | No | No | Bit exact to TFLu |
| SVDF | arm_svdf_s8() | SVDF | None | None | Yes | Yes | Bit exact to TFLu |
| | arm_svdf_state_s16_s8() | SVDF | None | None | Yes | Yes | Bit exact to TFLu |
| Misc | arm_reshape_s8() | RESHAPE | None | None | No | No | |
| | arm_elementwise_add_s8() | ELEMENTWISE ADD | None | None | Yes | Yes | Reshape is not done in this function. Only minor improvements are expected. |
| | arm_elementwise_add_s16() | ELEMENTWISE ADD | None | None | No | No | Reshape is not done in this function. Only minor improvements are expected. |
| | arm_elementwise_mul_s8() | ELEMENTWISE MUL | None | None | Yes | Yes | Reshape is not done in this function. Only minor improvements are expected. |
| | arm_elementwise_mul_s16() | ELEMENTWISE MUL | None | None | No | No | Reshape is not done in this function. Only minor improvements are expected. |
| | arm_relu_q7() | RELU | None | None | Yes | No | |
| | arm_relu6_s8() | RELU | None | None | Yes | No | |
| Concat | arm_concatenation_s8_w() | CONCAT | None | None | No | No | |
| | arm_concatenation_s8_x() | CONCAT | None | None | No | No | |
| | arm_concatenation_s8_y() | CONCAT | None | None | No | No | |
| | arm_concatenation_s8_z() | CONCAT | None | None | No | No | |
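
The input constraints and scratch memory formulas in the table can be turned into small helpers when planning buffers by hand. The sketch below is illustrative only: conv_shape_t and the three functions are hypothetical names that are not part of CMSIS-NN, "pad = 0" and "stride = 1" are read as applying to both x and y, and the table does not give a value for delta, so the caller has to supply it. In real code, prefer whatever buffer size query functions the API header declares over re-deriving the numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter bundle for this sketch (not a cmsis_nn_* struct). */
typedef struct {
    int32_t ker_x, ker_y;
    int32_t input_ch;
    int32_t stride_x, stride_y;
    int32_t pad_x, pad_y;
    int32_t dilation_x, dilation_y;
} conv_shape_t;

/* Input constraints listed for arm_convolve_1x1_s8_fast(). */
static bool fits_conv_1x1_fast(const conv_shape_t *s)
{
    return s->dilation_x == 1 && s->dilation_y == 1 &&
           s->ker_x == 1 && s->ker_y == 1 &&
           s->pad_x == 0 && s->pad_y == 0 &&
           s->stride_x == 1 && s->stride_y == 1 &&
           (s->input_ch % 4) == 0;
}

/* arm_convolve_s8() row: 4 * (ker_x * ker_y * input_ch + delta).
 * delta applies to MVE only; pass 0 for other targets. */
static int32_t conv_s8_scratch_bytes(const conv_shape_t *s, int32_t delta)
{
    return 4 * (s->ker_x * s->ker_y * s->input_ch + delta);
}

/* arm_depthwise_conv_s8_opt() row:
 * DSP: 2 * ker_x * ker_y * input_ch, MVE: 2 * DSP + 4. */
static int32_t dw_conv_s8_opt_scratch_bytes(const conv_shape_t *s, bool mve)
{
    const int32_t dsp = 2 * s->ker_x * s->ker_y * s->input_ch;
    return mve ? 2 * dsp + 4 : dsp;
}
```

For a 3x3 kernel with 8 input channels, conv_s8_scratch_bytes() with delta = 0 gives 288 bytes, and dw_conv_s8_opt_scratch_bytes() gives 144 bytes for DSP and 292 bytes for MVE.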

Building CMSIS-NN as a library

It is recommended to use the toolchain files from the Arm Ethos-U Core Platform project. These support TARGET_CPU, which is a required argument; if TARGET_CPU is not specified, the toolchain files fall back to a default. The format must be TARGET_CPU=cortex-mXX, see the examples below. Clone the Arm Ethos-U Core Platform project and build, for example:

cd </path/to/CMSIS_5>/CMSIS/NN
mkdir build
cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=</path/to/ethos-u-core-platform>/cmake/toolchain/arm-none-eabi-gcc.cmake -DTARGET_CPU=cortex-m55
make

Some more examples, assuming ethos-u-core-platform is cloned into your home directory:

cmake .. -DCMAKE_TOOLCHAIN_FILE=~/ethos-u-core-platform/cmake/toolchain/arm-none-eabi-gcc.cmake -DTARGET_CPU=cortex-m55
cmake .. -DCMAKE_TOOLCHAIN_FILE=~/ethos-u-core-platform/cmake/toolchain/arm-none-eabi-gcc.cmake -DTARGET_CPU=cortex-m7
cmake .. -DCMAKE_TOOLCHAIN_FILE=~/ethos-u-core-platform/cmake/toolchain/armclang.cmake -DTARGET_CPU=cortex-m3

Compiler options

The default optimization level is -Ofast. Change it according to project needs, but bear in mind that a lower level will reduce performance.

The compiler option '-fomit-frame-pointer' is enabled by default at -O and higher. With no optimization level you may need to specify '-fomit-frame-pointer' as a minimum.

The compiler option '-fno-builtin' prevents the compiler from using optimized implementations of functions such as memcpy and memset, which are heavily used by CMSIS-NN. This can significantly degrade performance, so the option should be avoided. The compiler option '-ffreestanding' should also be avoided, as it enables '-fno-builtin' implicitly.
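
As a small illustration of why builtins matter, consider the kind of fixed-size copy and clear that appears in inner loops. The helper names below (copy_pixel_4ch, clear_scratch_16) are hypothetical and not taken from CMSIS-NN. Because the sizes are compile-time constants, a compiler with builtins enabled is free to replace these calls with a few plain load and store instructions; with '-fno-builtin' or '-ffreestanding' it typically has to emit real calls to the library memcpy and memset instead.

```c
#include <stdint.h>
#include <string.h>

/* Copy one 4-channel int8 pixel. With builtins enabled this can compile
 * down to a single 32-bit load and store. */
static inline void copy_pixel_4ch(int8_t *dst, const int8_t *src)
{
    memcpy(dst, src, 4);
}

/* Zero a 16-byte scratch area. With builtins enabled this can compile
 * down to a handful of stores rather than a function call. */
static inline void clear_scratch_16(int8_t *buf)
{
    memset(buf, 0, 16);
}
```

Comparing the generated assembly for these two functions with and without '-fno-builtin' is a quick way to see the effect on a given toolchain.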

Reference

[1] Legacy CMSIS-NN and how to use it: https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/converting-a-neural-network-for-arm-cortex-m-with-cmsis-nn/single-page