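This post walks through NPU-accelerated inference on the Khadas VIM3 (emulation on x86_64 is also supported). For background on Tengine, see the Tengine getting-started tutorial; the model file and test image used in this post are available at the Baidu Netdisk link at the end of the post. NPUs are becoming the main way to run inference on mobile hardware, and they now show up regularly in phones and edge devices; after all, the SoC is what really matters.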

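Khadas VIM3 is a mobile-class development board with an NPU.

Processor: quad-core 2.2 GHz Cortex-A73 for high performance, plus dual-core 1.8 GHz Cortex-A53 for low power consumption.

Compute: a next-generation deep neural network unit with performance of up to 5.0 TOPS.

Purchase link: https://www.khadas.cn/product-page/vim3

Development requires TIM-VX, which deploys neural network models on ML accelerators that support OpenVX.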

Deployment on Khadas VIM3

1. Preparation

Download the TIM-VX and Tengine-Lite source code:

git clone https://github.com/VeriSilicon/TIM-VX.git
git clone https://github.com/OAID/Tengine.git tengine-lite

Download the prebuilt SDK (if the download is slow, it is also available at the Netdisk link at the end of the post):

wget -c https://github.com/VeriSilicon/TIM-VX/releases/download/v1.1.28/aarch64_A311D_D312513_A294074_R311680_T312233_O312045.tgz
tar zxvf aarch64_A311D_D312513_A294074_R311680_T312233_O312045.tgz
mv aarch64_A311D_D312513_A294074_R311680_T312233_O312045 prebuild-sdk-a311d

2. Build

To set up the dependencies, first enter the Tengine-Lite directory:

cd tengine-lite

Copy the required dependencies from TIM-VX into the Tengine-Lite project; see TIM-VX's 3rdparty folder for details:

cd <tengine-lite-root-dir>
mkdir -p ./3rdparty/tim-vx/lib/aarch64
mkdir -p ./3rdparty/tim-vx/include
cp -rf ../TIM-VX/include/* ./3rdparty/tim-vx/include/
cp -rf ../TIM-VX/src ./src/dev/tim-vx/
cp -rf ../prebuild-sdk-a311d/include/* ./3rdparty/tim-vx/include/
cp -rf ../prebuild-sdk-a311d/lib/* ./3rdparty/tim-vx/lib/aarch64/
rm ./src/dev/tim-vx/src/tim/vx/*_test.cc

The build commands are as follows; for other build options, see the CMakeLists.txt file in the demo:

mkdir build && cd build
cmake -DTENGINE_ENABLE_TIM_VX=ON -DTENGINE_ENABLE_TIM_VX_INTEGRATION=ON ..
make -j4
make install

3. Test

The libraries are located as follows:

3rdparty/tim-vx/lib/
├── libArchModelSw.so
├── libCLC.so
├── libGAL.so
├── libNNArchPerf.so
├── libOpenVX.so
├── libOpenVXU.so
└── libVSC.so

build/install/lib/
└── libtengine-lite.so
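
Before running anything, it can help to confirm that every runtime library in the listing above actually made it into the library directory. The following is a minimal sketch, not part of Tengine or TIM-VX: `check_timvx_libs` is a hypothetical helper name, and the directory you pass in is whichever one holds the libraries on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tengine or TIM-VX): report which of the
# seven runtime libraries from the listing are missing in a given directory.
check_timvx_libs() {
    lib_dir="$1"
    missing=0
    for lib in libArchModelSw.so libCLC.so libGAL.so libNNArchPerf.so \
               libOpenVX.so libOpenVXU.so libVSC.so; do
        if [ ! -e "$lib_dir/$lib" ]; then
            echo "missing: $lib_dir/$lib"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}

# Example: check_timvx_libs ./3rdparty/tim-vx/lib/aarch64 && echo "all libraries present"
```

The function prints one line per missing file and returns a non-zero status if anything is absent, so it can gate a later step in a script.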

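On the Khadas VIM3, the libraries already on the board need to be replaced with these.

If necessary, the kernel module also has to be reloaded, since the one that ships with the board may be an older version: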

rmmod galcore
insmod galcore.ko
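
To check whether the galcore module is loaded before or after reloading it, one option is to look for its name in /proc/modules, where the module name is the first field of each line. This is a minimal sketch; `module_loaded` is a hypothetical helper, and the optional second argument exists only so the function can be tried against a saved listing.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel module name appears as the first
# field of a /proc/modules-style listing (default: /proc/modules itself).
module_loaded() {
    name="$1"
    listing="${2:-/proc/modules}"
    # Treat an unreadable listing as "not loaded".
    [ -r "$listing" ] || return 1
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

# Example:
#   if module_loaded galcore; then echo "galcore is loaded"; else echo "galcore is not loaded"; fi
```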

Quantized model: the demo runs a uint8-quantized model, so the runtime precision is set to uint8:

/* set runtime options */
struct options opt;
opt.num_thread = num_thread;
opt.cluster = TENGINE_CLUSTER_ALL;
opt.precision = TENGINE_MODE_UINT8;
opt.affinity = 0;
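
For readers unfamiliar with uint8 models, the sketch below shows the usual affine quantization arithmetic, q = clamp(round(x / scale) + zero_point, 0, 255). It is only an illustration: `quantize_uint8` is a hypothetical helper, the scale 0.5 and zero point 128 are made-up values, and the real parameters come from the quantized model itself.

```shell
#!/bin/sh
# Sketch of affine uint8 quantization. The scale and zero point passed in are
# illustrative assumptions, not values taken from squeezenet_uint8.tmfile.
quantize_uint8() {
    awk -v x="$1" -v s="$2" -v zp="$3" 'BEGIN {
        v = x / s
        # Round half away from zero; awk int() truncates toward zero.
        r = (v >= 0) ? int(v + 0.5) : -int(-v + 0.5)
        q = r + zp
        # Clamp to the uint8 range.
        if (q < 0) q = 0
        if (q > 255) q = 255
        print q
    }'
}

quantize_uint8 1.0 0.5 128    # prints 130
quantize_uint8 -100 0.5 128   # prints 0 (clamped)
```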

The model file and test image are available at the Baidu Netdisk link; run the demo as follows:

./tm_classification_timvx -m squeezenet_uint8.tmfile -i cat.jpg -r 1 -s 0.017,0.017,0.017 -r 10

The terminal output is as follows:

Tengine plugin allocator TIMVX is registered.
Image height not specified, use default 227
Image width not specified, use default  227
Mean value not specified, use default   104.0, 116.7, 122.7
tengine-lite library version: 1.2-dev
TIM-VX prerun.
model file : squeezenet_uint8.tmfile
image file : cat.jpg
img_h, img_w, scale[3], mean[3] : 227 227 , 0.017 0.017 0.017, 104.0 116.7 122.7
Repeat 10 times, thread 1, avg time 2.95 ms, max_time 3.42 ms, min_time 2.76 ms
--------------------------------------
34.786182, 278
33.942883, 287
33.732056, 280
32.045452, 277
30.780502, 282
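
The log reports the scale (0.017 per channel, from -s) and the default mean values used to preprocess the image. Assuming the demo applies the common per-channel normalization (pixel - mean) * scale, which this post does not show explicitly, the arithmetic looks like this; `normalize_pixel` is a hypothetical helper for illustration only.

```shell
#!/bin/sh
# Sketch of per-channel input normalization, assuming (pixel - mean) * scale.
normalize_pixel() {
    awk -v p="$1" -v mean="$2" -v scale="$3" \
        'BEGIN { printf "%.3f\n", (p - mean) * scale }'
}

# First channel with the values from the log: mean 104.0, scale 0.017.
normalize_pixel 255 104.0 0.017   # prints 2.567
normalize_pixel 0 104.0 0.017     # prints -1.768
```

With these settings the 0..255 pixel range of the first channel maps to roughly -1.8..2.6 before the data is handed to the network.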

Emulation on x86

x86_64 is supported.

1. Preparation

Download the TIM-VX and Tengine-Lite source code:

git clone https://github.com/VeriSilicon/TIM-VX.git
git clone https://github.com/OAID/Tengine.git tengine-lite

2. Build

To set up the dependencies, first enter the Tengine-Lite directory:

cd tengine-lite

Copy the required dependencies from TIM-VX into the Tengine-Lite project; see TIM-VX's 3rdparty folder for details:

mkdir -p ./3rdparty/tim-vx/lib/x86_64
mkdir -p ./3rdparty/tim-vx/include
cp -rf ../TIM-VX/include/* ./3rdparty/tim-vx/include/
cp -rf ../TIM-VX/src ./src/dev/tim-vx/
cp -rf ../TIM-VX/prebuilt-sdk/x86_64_linux/include/* ./3rdparty/tim-vx/include/
cp -rf ../TIM-VX/prebuilt-sdk/x86_64_linux/lib/* ./3rdparty/tim-vx/lib/x86_64/
rm ./src/dev/tim-vx/src/tim/vx/*_test.cc

The build commands are as follows:

mkdir build && cd build
cmake -DTENGINE_ENABLE_TIM_VX=ON -DTENGINE_ENABLE_TIM_VX_INTEGRATION=ON ..
make -j4
make install

3. Test

The libraries are located as follows:

3rdparty/tim-vx/lib/x86_64
├── libArchModelSw.so
├── libCLC.so
├── libEmulator.so
├── libGAL.so
├── libNNArchPerf.so
├── libOpenVXC.so
├── libOpenVX.so -> libOpenVX.so.1.3.0
├── libOpenVX.so.1 -> libOpenVX.so.1.3.0
├── libOpenVX.so.1.3.0
├── libOpenVXU.so
├── libvdtproxy.so
└── libVSC.so

build/install/lib/
└── libtengine-lite.so

Quantized model: as on the board, the runtime precision is set to uint8:

/* set runtime options */
struct options opt;
opt.num_thread = num_thread;
opt.cluster = TENGINE_CLUSTER_ALL;
opt.precision = TENGINE_MODE_UINT8;
opt.affinity = 0;

The model file and test image are available at the Baidu Netdisk link; run the demo as follows:

./tm_classification_timvx -m squeezenet_uint8.tmfile -i cat.jpg -r 1 -s 0.017,0.017,0.017 -r 10

The terminal output is as follows:

Tengine plugin allocator TIMVX is registered.
Image height not specified, use default 227
Image width not specified, use default  227
Mean value not specified, use default   104.0, 116.7, 122.7
tengine-lite library version: 1.2-dev
TIM-VX prerun.
model file : /root/Desktop/NPU/tengine-lite/models/squeezenet_uint8.tmfile
image file : /root/Desktop/NPU/tengine-lite/image/cat.jpg
img_h, img_w, scale[3], mean[3] : 227 227 , 0.017 0.017 0.017, 104.0 116.7 122.7
Repeat 10 times, thread 1, avg time 14431.41 ms, max_time 14699.82 ms, min_time 14213.30 ms
--------------------------------------
34.786182, 278
33.942883, 287
33.732056, 280
31.834629, 277
30.780502, 282
--------------------------------------

Emulation on x86 is still very slow, so expect to wait a while, but it is a decent way to test if you do not have a board.
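
To put a number on the gap, the two average times reported in the logs (14431.41 ms under emulation, 2.95 ms on the VIM3 NPU) can be divided directly. A small sketch, with `slowdown` as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: ratio of two average latencies, rounded to an integer.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.0f\n", slow / fast }'
}

# Emulator average time vs. VIM3 NPU average time, both taken from the logs.
slowdown 14431.41 2.95   # prints 4892
```

In this run the emulator is therefore roughly 4,900 times slower than the NPU, which fits its role as a functional check rather than a performance measurement.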