Linux 36.3 + JetPack v6.0@jetson-inference: Object Detection
1. Background
From an application standpoint, object detection is the second major building block of computer vision. The previous recognition examples output class probabilities describing the entire input image. Here the focus shifts to object detection: finding where various objects appear in a frame by extracting their bounding boxes. Unlike image classification, an object detection network can detect many different objects in each frame.
2. detectNet
The detectNet object accepts an image as input and outputs a list of detected bounding-box coordinates together with their classes and confidence values. detectNet can be used from both Python and C++. See below for the various pre-trained detection models available for download. The default model is a 91-class SSD-Mobilenet-v2 network trained on the MS COCO dataset, which achieves real-time inference performance on Jetson with TensorRT.
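To make that flow concrete, below is a minimal Python sketch of the idea (assuming the jetson-inference Python bindings are installed; the image path is one of the samples shipped with the repo):

from jetson_inference import detectNet
from jetson_utils import loadImage

# load the default 91-class SSD-Mobilenet-v2 model (auto-downloaded or placed under data/networks/)
net = detectNet("ssd-mobilenet-v2", threshold=0.5)

# run detection on a sample image (path is just an example)
img = loadImage("images/peds_0.jpg")
detections = net.Detect(img)

# each detection carries a class ID, a confidence value, and bounding-box coordinates
for det in detections:
    print(net.GetClassDesc(det.ClassID), det.Confidence, det.Left, det.Top, det.Right, det.Bottom)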
2.1 Command-Line Options
$ detectnet --help
usage: detectnet [--help] [--network=network] [--threshold=threshold] ...
input [output]
locate objects in a video/image stream using an object detection dnn.
see below for additional arguments that may not be shown above.
positional arguments:
input resource uri of input stream (see videosource below)
output resource uri of output stream (see videooutput below)
detectnet arguments:
--network=network pre-trained model to load, one of the following:
* ssd-mobilenet-v1
* ssd-mobilenet-v2 (default)
* ssd-inception-v2
* peoplenet
* peoplenet-pruned
* dashcamnet
* trafficcamnet
* facedetect
--model=model path to custom model to load (caffemodel, uff, or onnx)
--prototxt=prototxt path to custom prototxt to load (for .caffemodel only)
--labels=labels path to text file containing the labels for each class
--input-blob=input name of the input layer (default is 'data')
--output-cvg=coverage name of the coverage/confidence output layer (default is 'coverage')
--output-bbox=boxes name of the bounding output layer (default is 'bboxes')
--mean-pixel=pixel mean pixel value to subtract from input (default is 0.0)
--confidence=conf minimum confidence threshold for detection (default is 0.5)
--clustering=cluster minimum overlapping area threshold for clustering (default is 0.75)
--alpha=alpha overlay alpha blending value, range 0-255 (default: 120)
--overlay=overlay detection overlay flags (e.g. --overlay=box,labels,conf)
valid combinations are: 'box', 'lines', 'labels', 'conf', 'none'
--profile enable layer profiling in tensorrt
objecttracker arguments:
--tracking flag to enable default tracker (iou)
--tracker=tracker enable tracking with 'iou' or 'klt'
--tracker-min-frames=n the number of re-identified frames for a track to be considered valid (default: 3)
--tracker-drop-frames=n number of consecutive lost frames before a track is dropped (default: 15)
--tracker-overlap=n how much iou overlap is required for a bounding box to be matched (default: 0.5)
videosource arguments:
input resource uri of the input stream, for example:
* /dev/video0 (v4l2 camera #0)
* csi://0 (mipi csi camera #0)
* rtp://@:1234 (rtp stream)
* rtsp://user:pass@ip:1234 (rtsp stream)
* webrtc://@:1234/my_stream (webrtc stream)
* file://my_image.jpg (image file)
* file://my_video.mp4 (video file)
* file://my_directory/ (directory of images)
--input-width=width explicitly request a width of the stream (optional)
--input-height=height explicitly request a height of the stream (optional)
--input-rate=rate explicitly request a framerate of the stream (optional)
--input-save=file path to video file for saving the input stream to disk
--input-codec=codec rtp requires the codec to be set, one of these:
* h264, h265
* vp8, vp9
* mpeg2, mpeg4
* mjpeg
--input-decoder=type the decoder engine to use, one of these:
* cpu
* omx (aarch64/jetpack4 only)
* v4l2 (aarch64/jetpack5 only)
--input-flip=flip flip method to apply to input:
* none (default)
* counterclockwise
* rotate-180
* clockwise
* horizontal
* vertical
* upper-right-diagonal
* upper-left-diagonal
--input-loop=loop for file-based inputs, the number of loops to run:
* -1 = loop forever
* 0 = don't loop (default)
* >0 = set number of loops
videooutput arguments:
output resource uri of the output stream, for example:
* file://my_image.jpg (image file)
* file://my_video.mp4 (video file)
* file://my_directory/ (directory of images)
* rtp://<remote-ip>:1234 (rtp stream)
* rtsp://@:8554/my_stream (rtsp stream)
* webrtc://@:1234/my_stream (webrtc stream)
* display://0 (opengl window)
--output-codec=codec desired codec for compressed output streams:
* h264 (default), h265
* vp8, vp9
* mpeg2, mpeg4
* mjpeg
--output-encoder=type the encoder engine to use, one of these:
* cpu
* omx (aarch64/jetpack4 only)
* v4l2 (aarch64/jetpack5 only)
--output-save=file path to a video file for saving the compressed stream
to disk, in addition to the primary output above
--bitrate=bitrate desired target vbr bitrate for compressed streams,
in bits per second. the default is 4000000 (4 mbps)
--headless don't create a default opengl gui window
logging arguments:
--log-file=file output destination file (default is stdout)
--log-level=level message output threshold, one of the following:
* silent
* error
* warning
* success
* info
* verbose (default)
* debug
--verbose enable verbose logging (same as --log-level=verbose)
--debug enable debug logging (same as --log-level=debug)
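The objectTracker arguments listed above can also be switched on from code rather than the command line. A hedged sketch, assuming the Python bindings expose SetTrackingEnabled/SetTrackingParams as described in the upstream tracking documentation:

from jetson_inference import detectNet

# load the detector, then enable the default IOU tracker
# (SetTrackingEnabled/SetTrackingParams are assumed from the upstream tracking docs)
net = detectNet("ssd-mobilenet-v2", threshold=0.5)
net.SetTrackingEnabled(True)
net.SetTrackingParams(minFrames=3, dropFrames=15, overlapThreshold=0.5)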
Note: for basic operations on images and video, see "Linux 36.3 + JetPack v6.0@jetson-inference: Video Operations".
2.2 Downloading Models
There are two ways:
- when the detectNet object is created, initialization downloads the model automatically
- manually place the model files in the data/networks/ directory
Inside mainland China, the Great Firewall is a real obstacle for beginners like us who are just getting off the ground. Readers with the means can configure their network access accordingly.
Fortunately, NVIDIA kindly provides a workaround: all of the models are also staged at a location reachable from mainland China: GitHub - model-mirror-190618.
--network=network pre-trained model to load, one of the following:
* ssd-mobilenet-v1
* ssd-mobilenet-v2 (default)
* ssd-inception-v2
* peoplenet
* peoplenet-pruned
* dashcamnet
* trafficcamnet
* facedetect
--model=model path to custom model to load (caffemodel, uff, or onnx)
According to the model information above, the command supports:
- ssd-mobilenet-v1
- ssd-mobilenet-v2 (default)
- ssd-inception-v2
- peoplenet
- peoplenet-pruned
- dashcamnet
- trafficcamnet
- facedetect
- custom models (provided as a common model file: caffemodel, UFF, or ONNX; see the sketch after this list)
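For the custom-model case, the detectNet constructor can be pointed at the model and label files directly. A minimal sketch, with placeholder paths and the layer names used by the commented-out example in detectnet.py:

from jetson_inference import detectNet

# load a custom ONNX detection model; paths and layer names are placeholders
net = detectNet(model="model/ssd-mobilenet.onnx", labels="model/labels.txt",
                input_blob="input_0", output_cvg="scores", output_bbox="boxes",
                threshold=0.5)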
As an example, download the SSD-Mobilenet-v2 (default) model:
$ mkdir model-mirror-190618
$ cd model-mirror-190618
$ wget https://github.com/dusty-nv/jetson-inference/releases/download/model-mirror-190618/ssd-mobilenet-v2.tar.gz
$ tar -zxvf ssd-mobilenet-v2.tar.gz -C ../data/networks
$ cd ..
Note: when extracting this model, make sure the extracted files end up under the SSD-Mobilenet-v2 directory.
2.3 Usage Examples
$ cd build/aarch64/bin/
2.3.1 Single Image
# C++
$ ./detectnet --network=ssd-mobilenet-v2 images/peds_0.jpg images/test/output_detectnet_cpp.jpg
# Python
$ ./detectnet.py --network=ssd-mobilenet-v2 images/peds_0.jpg images/test/output_detectnet_python.jpg
This time the C++ and Python runs produce identical confidence values, unlike the imagenet example, where they differed slightly.
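The same single-image run can also be reproduced from Python; a minimal sketch (the input/output paths mirror the command above and are only examples):

from jetson_inference import detectNet
from jetson_utils import loadImage, saveImage

net = detectNet("ssd-mobilenet-v2", threshold=0.5)

img = loadImage("images/peds_0.jpg")                        # input image from the example above
detections = net.Detect(img)                                # draws the box/labels/conf overlay in place
saveImage("images/test/output_detectnet_sketch.jpg", img)   # write out the overlaid image

print("detected {:d} objects".format(len(detections)))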
2.3.2 Multiple Images
# C++
$ ./detectnet "images/peds_*.jpg" images/test/peds_output_detectnet_cpp_%i.jpg
# Python
$ ./detectnet.py "images/peds_*.jpg" images/test/peds_output_detectnet_python_%i.jpg
Note: the multi-image results are not shown here; interested readers can download the code and run it locally.
2.3.3 Video
# download test video
$ wget https://nvidia.box.com/shared/static/veuuimq6pwvd62p9fresqhrrmfqz0e2f.mp4 -O pedestrians.mp4
# C++
$ ./detectnet ../../../pedestrians.mp4 images/test/pedestrians_ssd_detectnet_cpp.mp4
# Python
$ ./detectnet.py ../../../pedestrians.mp4 images/test/pedestrians_ssd_detectnet_python.mp4
(video: pedestrian detection output)
3. Code
3.1 Python
import statements
├── import sys
├── import argparse
├── from jetson_inference import detectNet
└── from jetson_utils import videoSource, videoOutput, Log
command-line argument parsing
├── create ArgumentParser
│ ├── description: "Locate objects in a live camera stream using an object detection DNN."
│ ├── formatter_class: argparse.RawTextHelpFormatter
│ └── epilog: detectNet.Usage() + videoSource.Usage() + videoOutput.Usage() + Log.Usage()
├── add arguments
│ ├── input: "URI of the input stream"
│ ├── output: "URI of the output stream"
│ ├── --network: "pre-trained model to load (default: 'ssd-mobilenet-v2')"
│ ├── --overlay: "detection overlay flags (default: 'box,labels,conf')"
│ └── --threshold: "minimum detection threshold to use (default: 0.5)"
└── parse arguments
    ├── args = parser.parse_known_args()[0]
    └── exception handling
        ├── print("")
        ├── parser.print_help()
        └── sys.exit(0)
create video sources and outputs
├── input = videoSource(args.input, argv=sys.argv)
└── output = videoOutput(args.output, argv=sys.argv)
load object detection network
└── net = detectNet(args.network, sys.argv, args.threshold)
    # note: hard-coded paths to load a model (commented out in the source)
    ├── net = detectNet(model="model/ssd-mobilenet.onnx", labels="model/labels.txt",
    ├──                 input_blob="input_0", output_cvg="scores", output_bbox="boxes",
    └──                 threshold=args.threshold)
process frames until EOS or the user exits
└── while True:
    ├── capture the next image
    │ └── img = input.Capture()
    │     └── if img is None:  # timeout
    │         └── continue
    ├── detect objects in the image
    │ └── detections = net.Detect(img, overlay=args.overlay)
    ├── print the detections
    │ ├── print("detected {:d} objects in image".format(len(detections)))
    │ └── for detection in detections:
    │     └── print(detection)
    ├── render the image
    │ └── output.Render(img)
    ├── update the title bar
    │ └── output.SetStatus("{:s} | Network {:.0f} FPS".format(args.network, net.GetNetworkFPS()))
    ├── print performance info
    │ └── net.PrintProfilerTimes()
    └── exit on input/output EOS
        └── if not input.IsStreaming() or not output.IsStreaming():
            └── break
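Putting the outline together, a condensed runnable version of the same loop might look like this (a sketch that hard-codes the defaults instead of parsing arguments; the stream URIs are examples):

import sys
from jetson_inference import detectNet
from jetson_utils import videoSource, videoOutput

# example stream URIs; any videoSource/videoOutput URI from the help above works here
input  = videoSource("../../../pedestrians.mp4", argv=sys.argv)
output = videoOutput("images/test/pedestrians_sketch.mp4", argv=sys.argv)

net = detectNet("ssd-mobilenet-v2", sys.argv, 0.5)

while True:
    img = input.Capture()
    if img is None:          # timeout waiting for a frame
        continue

    detections = net.Detect(img, overlay="box,labels,conf")
    print("detected {:d} objects in image".format(len(detections)))

    output.Render(img)
    output.SetStatus("ssd-mobilenet-v2 | Network {:.0f} FPS".format(net.GetNetworkFPS()))

    if not input.IsStreaming() or not output.IsStreaming():
        break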
3.2 C++
#include statements
├── "videoSource.h"
├── "videoOutput.h"
├── "detectNet.h"
├── "objectTracker.h"
└── <signal.h>
global variables
└── bool signal_recieved = false;
function definitions
├── void sig_handler(int signo)
│ └── if (signo == SIGINT)
│     ├── LogVerbose("received SIGINT\n");
│     └── signal_recieved = true;
└── int usage()
    ├── printf("usage: detectnet [--help] [--network=NETWORK] [--threshold=THRESHOLD] ...\n");
    ├── printf("                 input [output]\n\n");
    ├── printf("Locate objects in a video/image stream using an object detection DNN.\n");
    ├── printf("See below for additional arguments that may not be shown above.\n\n");
    ├── printf("positional arguments:\n");
    ├── printf("    input           resource URI of input stream  (see videoSource below)\n");
    ├── printf("    output          resource URI of output stream (see videoOutput below)\n\n");
    ├── printf("%s", detectNet::Usage());
    ├── printf("%s", objectTracker::Usage());
    ├── printf("%s", videoSource::Usage());
    ├── printf("%s", videoOutput::Usage());
    └── printf("%s", Log::Usage());
main function
├── parse command line
│ ├── commandLine cmdLine(argc, argv);
│ └── if (cmdLine.GetFlag("help"))
│     └── return usage();
├── attach signal handler
│ └── if (signal(SIGINT, sig_handler) == SIG_ERR)
│     └── LogError("can't catch SIGINT\n");
├── create input stream
│ ├── videoSource* input = videoSource::Create(cmdLine, ARG_POSITION(0));
│ └── if (!input)
│     ├── LogError("detectnet: failed to create input stream\n");
│     └── return 1;
├── create output stream
│ ├── videoOutput* output = videoOutput::Create(cmdLine, ARG_POSITION(1));
│ └── if (!output)
│     ├── LogError("detectnet: failed to create output stream\n");
│     └── return 1;
├── create detection network
│ ├── detectNet* net = detectNet::Create(cmdLine);
│ ├── if (!net)
│ │   ├── LogError("detectnet: failed to load detectNet model\n");
│ │   └── return 1;
│ └── const uint32_t overlayFlags = detectNet::OverlayFlagsFromStr(cmdLine.GetString("overlay", "box,labels,conf"));
├── processing loop
│ └── while (!signal_recieved)
│     ├── capture next image
│     │ ├── uchar3* image = NULL;
│     │ ├── int status = 0;
│     │ └── if (!input->Capture(&image, &status))
│     │     ├── if (status == videoSource::TIMEOUT)
│     │     │   └── continue;
│     │     └── break;  // EOS
│     ├── detect objects in the frame
│     │ ├── detectNet::Detection* detections = NULL;
│     │ ├── const int numDetections = net->Detect(image, input->GetWidth(), input->GetHeight(), &detections, overlayFlags);
│     │ └── if (numDetections > 0)
│     │     ├── LogVerbose("%i objects detected\n", numDetections);
│     │     └── for (int n=0; n < numDetections; n++)
│     │         ├── LogVerbose("\ndetected obj %i class #%u (%s) confidence=%f\n", n, detections[n].ClassID, net->GetClassDesc(detections[n].ClassID), detections[n].Confidence);
│     │         ├── LogVerbose("bounding box %i (%.2f, %.2f) (%.2f, %.2f) w=%.2f h=%.2f\n", n, detections[n].Left, detections[n].Top, detections[n].Right, detections[n].Bottom, detections[n].Width(), detections[n].Height());
│     │         └── if (detections[n].TrackID >= 0)
│     │             └── LogVerbose("tracking ID %i status=%i frames=%i lost=%i\n", detections[n].TrackID, detections[n].TrackStatus, detections[n].TrackFrames, detections[n].TrackLost);
│     ├── render outputs
│     │ └── if (output != NULL)
│     │     ├── output->Render(image, input->GetWidth(), input->GetHeight());
│     │     ├── char str[256];
│     │     ├── sprintf(str, "TensorRT %i.%i.%i | %s | Network %.0f FPS", NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH, precisionTypeToStr(net->GetPrecision()), net->GetNetworkFPS());
│     │     ├── output->SetStatus(str);
│     │     └── if (!output->IsStreaming())
│     │         └── break;
│     └── print out timing info
│         └── net->PrintProfilerTimes();
├── destroy resources
│ ├── LogVerbose("detectnet: shutting down...\n");
│ ├── SAFE_DELETE(input);
│ ├── SAFE_DELETE(output);
│ ├── SAFE_DELETE(net);
│ └── LogVerbose("detectnet: shutdown complete.\n");
└── return 0;