Object Detection Techniques in Computer Vision

Object Detection Tools and Frameworks

Chinmay Wyawahare
The Startup



Object detection is a computer vision technology that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or vehicles) in digital images and videos. It has proved to be a prominent module for numerous important applications like video surveillance, autonomous driving, and face detection. Feature detectors such as the Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) yield high-quality features but are too computationally intensive for use in real-time applications of any complexity. Corner-based approaches instead extract normalized corner information, on which classifiers such as support vector machines or back-propagation neural networks are trained for efficient object recognition. In this paper, we discuss the popular and widely used techniques, along with the libraries and frameworks used for implementing them, providing a reference for end users to select an appropriate technique and a suitable framework for its implementation.

With the recent advances of the 21st century, there has been a great deal of innovation, with creative methodologies that let users apply object detection in a modular fashion. Recent libraries like TensorFlow Lite enable object detection on mobile platforms such as Android and iOS. These mobile libraries are highly efficient, letting users deploy machine learning and object detection models directly on handheld devices and make use of their computational power.
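As a concrete illustration, here is a minimal sketch of running a converted TensorFlow Lite model from Python using the tf.lite.Interpreter API (the same API that the Android and iOS bindings wrap). The model file name detect.tflite is a placeholder, and the exact input/output layout depends on the converted model:

```python
import numpy as np
import tensorflow as tf

# Load a converted TensorFlow Lite model; "detect.tflite" is a placeholder
# for any object detection model exported with the TFLite converter.
interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy frame shaped and typed like the model's expected input;
# in a real app this would be a camera frame.
frame = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

# The output layout (boxes, classes, scores, count) varies by model.
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```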

Object Detection Techniques

Scale-Invariant Feature Transform (SIFT):

The SIFT method can robustly identify objects even among clutter and under partial occlusion because the SIFT feature descriptor is invariant to scale, orientation, and affine distortion.

Steps for feature information generation in SIFT algorithms:

Scale-space extrema detection

Keypoint localization

Orientation assignment

Keypoint descriptor

Scale-Invariant Feature Transform (SIFT)

In scale-space extrema detection, candidate interest points (keypoints) are detected at distinctive locations in the image by searching for extrema of a Difference of Gaussian (DoG) filter across scales. In keypoint localization, distinctive keypoints are selected from among the candidates by comparing each candidate to its neighbouring pixels and rejecting unstable, low-contrast points.

In Orientation assignment, dominant orientations are assigned to localized keypoints based on local image gradient directions.

In keypoint descriptor generation, SIFT descriptors that are robust to local affine distortion are computed. Because descriptors are generated at many different orientations and scales, they can be used to find objects in images under a wide range of viewing conditions. However, the SIFT method does not provide real-time object recognition, because feature detection and keypoint descriptor generation are computationally expensive.
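As a minimal sketch of these four steps in practice, the snippet below detects SIFT keypoints and computes their descriptors with OpenCV (cv2.SIFT_create is available in opencv-python 4.4+ now that the patent has expired; the image path is a placeholder):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
# detectAndCompute runs the whole pipeline: scale-space extrema detection,
# keypoint localization, orientation assignment, and descriptor generation.
keypoints, descriptors = sift.detectAndCompute(img, None)

print(f"{len(keypoints)} keypoints, descriptors: {descriptors.shape}")  # 128-dim each
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.jpg", vis)
```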

Speeded Up Robust Feature (SURF):

SURF algorithms use detection techniques similar to SIFT algorithms. The difference is that SURF simplifies scale-space extrema detection: to approximate the Laplacian of Gaussian, it uses a box filter representation, so the scale space can be constructed by changing the filter size rather than repeatedly smoothing and resizing the image with a Difference of Gaussian (DoG) filter.

Speeded Up Robust Feature (SURF)

This simplified scale-space extrema detection accelerates feature extraction, so SURF algorithms are several times faster than SIFT algorithms. SURF also relies on integral images for image convolutions, which further reduces computation time.

SURF algorithms identify a reproducible orientation for the interest points by calculating the Haar-wavelet responses.

The descriptor describes a distribution of Haar-wavelet responses within the interest point neighborhood.

Steps for feature information generation in SURF algorithms:

Feature extraction

Orientation and size assignment

Descriptor generation

The image descriptor is generated by measuring image gradients. SURF algorithms that rely on this descriptor are robust against various image transformations and against disturbances caused by occlusion. Despite the reduced time for feature computation and matching, SURF still has difficulty providing real-time object recognition in resource-constrained embedded system environments.
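For comparison, a hedged sketch of SURF in OpenCV follows. SURF is patented, so it lives in the opencv-contrib package and requires a build with the non-free modules enabled; the image path and Hessian threshold below are illustrative choices:

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# hessianThreshold controls how strong the box-filter (approximated LoG)
# response must be before a point is kept as an interest point.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(img, None)

# Each descriptor is a 64-dim distribution of Haar-wavelet responses
# around the keypoint (128-dim if extended mode is enabled).
print(len(keypoints), descriptors.shape)
```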

Features from Accelerated Segment Test (FAST) corner detector:

Corners in an input image have distinctive features that clearly distinguish them from surrounding pixels. Reliable detection and tracking of corners is possible even when the images undergo geometric deformations. Thus, most object recognition algorithms utilize corner information to extract features. The classic Harris corner detector performs well, but it is not effective for real-time object recognition due to its long computation time.

FAST Corner Detector

The FAST corner detector is 10 times faster than the Harris corner detector without degrading performance. It finds corners by examining a circle of sixteen pixels around a corner candidate: the candidate is detected as a corner if the intensities of a certain number of contiguous pixels on the circle are all above or all below the intensity of the center pixel by some threshold. The extracted interest points lie on distinctive, high-contrast regions of the image.
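A minimal sketch of FAST with OpenCV, where the threshold is the intensity difference that a contiguous arc of the sixteen-pixel circle must exceed relative to the center pixel (the values here are illustrative):

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

# Non-maximum suppression keeps only the strongest corner in each cluster.
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(img, None)  # FAST is detection-only: no descriptors

print(f"{len(keypoints)} corners detected")
```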

Fast R-CNN:

A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional and max-pooling layers to produce a convolutional feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all background class, and another that outputs four real-valued numbers for each of the K object classes. Each set of four values encodes refined bounding-box positions for one of the K classes.

Fast R-CNN

Fast R-CNN has several advantages:

Higher detection quality (mean Average Precision) than R-CNN and SPPnet (Spatial Pyramid Pooling network)

Training is single-stage, using a multi-task loss

Training can update all network layers

No disk storage is required for feature caching
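Fast R-CNN itself is rarely shipped in modern libraries, but its successor Faster R-CNN, which keeps the RoI pooling head and the two sibling output layers while learning the object proposals, is available in torchvision. A minimal inference sketch, assuming a pretrained model and a dummy image in place of a real one:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained COCO weights; torchvision >= 0.13 accepts the "DEFAULT" alias.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy 3-channel image tensor with values in [0, 1]; replace with a real image.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])[0]

# Each detection carries a refined bounding box, a class label, and the
# softmax score for that class, mirroring the two sibling output layers.
print(predictions["boxes"].shape, predictions["labels"][:5], predictions["scores"][:5])
```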

You Only Look Once (YOLO):

YOLO is a novel approach to object detection. Prior work on object detection repurposes classifiers to perform detection; YOLO instead frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

YOLO

Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects because it cannot see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.
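A hedged sketch of running a pretrained Darknet YOLO model through OpenCV's DNN module; yolov3.cfg and yolov3.weights are placeholder paths to the files published by the Darknet project, and the 0.5 score threshold is an illustrative choice:

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

img = cv2.imread("scene.jpg")
# The whole image passes through the network once: a single forward pass
# yields bounding boxes and class probabilities together.
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for out in outputs:
    for det in out:            # det = [cx, cy, w, h, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            print(class_id, float(scores[class_id]))
```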

