Disclaimer: This dissertation has been written by a student and is not an example of our professional work.

Any opinions, findings, conclusions, or recommendations expressed in this dissertation are those of the authors and do not necessarily reflect the views of UKDiss.com.

Object Detection and Tracking in Image Processing and Computer Vision

Info: 3998 words (16 pages) Dissertation
Published: 11th Dec 2019



This report aims to research existing methods for feature selection, feature matching and keypoint selection in images, with the overall aim of exploring the area of object detection and tracking. Well-established descriptors such as the Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) will be explored using the open source computer vision library OpenCV.

1 Introduction

Object detection has an important role in the image processing and computer vision field. It is the process of identifying a known object observed in an image or a video sequence, from a set of known descriptors, with the help of an object recognition technique. Object detection has applications in many industries: the technology can be used for surveillance, industrial inspection, robotics, medical analysis, human-computer interaction, intelligent vehicle systems and so on. Object recognition becomes challenging due to the complex nature of positioning, scaling, alignment and occlusion of objects. With these points in mind, the project aims to review existing methods and to compare the systems.

The object detection techniques covered in this report are the scale invariant feature transform (SIFT), speeded up robust features (SURF), colour based detection and, finally, background subtraction. Template matching involves matching a small portion of an image against a template image; the method can be used on either greyscale or colour images. Colour based detection, whilst very simple, could also be used: it detects an object by matching a histogram of the colour values of the image. SIFT builds scale invariant features, and will be explored further in this report. Speeded up robust features (SURF) is a modified version of the SIFT algorithm that uses a different method to detect feature points. Finally, background subtraction does exactly what its name suggests: using a threshold, the background is removed and an object is identified.
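Template matching, as described above, can be illustrated with a short sketch. The following Python/NumPy fragment is illustrative only (the function name and the toy image are this report's own, not any library's); it performs an exhaustive sum-of-squared-differences search of a small template over a greyscale image:

```python
import numpy as np

def match_template_ssd(image, template):
    """Slide the template over the image and return the top-left
    position with the smallest sum of squared differences (SSD)."""
    ih, iw = image.shape
    th, tw = template.shape
    best_pos, best_ssd = (0, 0), np.inf
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y+th, x:x+tw]
            ssd = np.sum((patch - template) ** 2)
            if ssd < best_ssd:
                best_ssd, best_pos = ssd, (y, x)
    return best_pos

# A toy 8x8 greyscale "image" containing a bright 2x2 patch at (3, 4)
img = np.zeros((8, 8))
img[3:5, 4:6] = 1.0
tmpl = np.ones((2, 2))
print(match_template_ssd(img, tmpl))  # (3, 4)
```

In practice a normalised correlation score is usually preferred over raw SSD, since it is less sensitive to lighting changes.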

1.1 Related Projects

There are many works cited and referenced in this report. Distinctive Image Features from Scale-Invariant Keypoints by David Lowe (2004) and Speeded Up Robust Features by Herbert Bay (2005) are the main papers cited. OpenCV and Matlab also have a large catalogue of supporting material, which is used throughout this project.

1.2 Research Questions


This report aims to explore the methods currently used for the detection and tracking of objects. The main questions to be answered are: what methods of object detection and feature matching are being used? What are the ethical, environmental, legal and societal issues relating to the field? What are the applications of the methods described? The advantages and disadvantages of the various methods will also be discussed.

1.3 Ethical, Legal, Environmental and Societal Issues

Ethical issues related to computer vision in general include the privacy of the data being used, the sensitivity of the image stream being processed, and the impact on society of implementing such technology. This concern has been expressed in a number of ways, from Isaac Asimov (1970) with his "Laws of Robotics" [1] to books such as the Philosophy of Computing and Information by Steven M. Cahn [2], which set out the concerns and "rules" that should be observed when developing software that combines vision and computers.

Another issue is that detection and tracking are not limited to objects: faces and people can also be tracked, infringing on the privacy of a human being. The applications of such technology are almost limitless, from innocent automated storage and retrieval of items to military use, for example identifying targets during an operation in a target country. The issues are complex and require careful consideration when creating an application or system that applies the technology. This also raises privacy concerns around people and the stored image data associated with image processing.

Legally, the development of such programs must comply with the Data Protection Act 1998 [3]. The act was created to protect the rights and privacy of personal information, such that organisations, businesses and the government have tight controls on how stored data is used [4]. Data must be kept safe and secure, handled according to people's rights, used for specifically stated purposes, and used fairly and lawfully [4].

Environmentally, this project does not pose a major impact; the largest environmental impact of the methods explored is the energy consumption of the computers used to undertake the task. If a method requires a large amount of processing power, the energy consumed by the computer would have an effect on its surrounding environment.

The societal issues associated with the approaches described in this report lie not so much in the methods themselves as in their applications. For example, using facial recognition, or using CCTV footage to identify a potential criminal, raises issues with the way the data is used and kept. There are also societal issues when an algorithm identifies an object or person: if a person is misidentified, the consequences could be linked to racial discrimination, harassment or similar.

1.4 Definitions

This section defines some important ideas and concepts that will be referred to throughout this report. The definitions apply to all of the methods explored and compared in this report.


1.4.1 Affine Invariance

An affine transformation is constructed from a sequence of translations, scales, flips, rotations and shears. The essence of an affine transformation, also known as an affinity, is that of a linear mapping that preserves collinearity and the ratio of distances between points on a shape. Invariance to affine transformation would enable object detection to be more useful in applications where motion is inevitable, or in dynamic situations.
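The two preserved properties can be checked numerically. The following Python/NumPy sketch (matrix and points chosen arbitrarily for illustration) applies an affine map p -> Ap + t to three collinear points and verifies that collinearity and the ratio of distances survive:

```python
import numpy as np

# An affine map p -> A @ p + t: linear part A (scale, rotation, shear)
# plus a translation t. Values here are arbitrary.
A = np.array([[1.5, 0.4],
              [0.2, 0.9]])
t = np.array([3.0, -1.0])

# Three collinear points, with p1 dividing p0 -> p2 in the ratio 1:2
p0, p1, p2 = np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([3.0, 3.0])
q0, q1, q2 = (A @ p + t for p in (p0, p1, p2))

# Collinearity: the 2D cross product of the image vectors is zero
u, v = q1 - q0, q2 - q0
cross = u[0] * v[1] - u[1] * v[0]

# Ratio of distances along the line is unchanged (still 1:2)
ratio = np.linalg.norm(q1 - q0) / np.linalg.norm(q2 - q1)
print(abs(cross) < 1e-9, abs(ratio - 0.5) < 1e-9)  # True True
```

Angles and absolute lengths, by contrast, are generally not preserved, which is why affine-invariant descriptors are needed.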

In object recognition and detection, objects are often detected with ambiguity; this is governed predominantly by the method used to detect the object. A transformation of the first order is defined as affine, and approximating the differences between two images with an affine transformation is common in computer vision [4].

Knowing this is essential, as an algorithm or method used for vision should ideally handle a known object up to an affine transformation. Dealing with objects outside this transform will only create uncertainty and ambiguity when processing and matching.

In the past there have been two approaches to affine invariance in computer vision: invariants and normalisation. The invariants approach computes functions of a set of points, called invariants, that are unchanged under a group of transformations [5]. Normalisation takes a different approach: it first brings the object into a normalised position to eliminate the effect of the affine transformation. The normalised position is independent of the original pose of the object. Because the normalised position differs from the original object only by an affine transform, weighted averages of the normalised object are in fact affine invariants of the original object [6].
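The normalisation idea can be sketched as follows. One common normalisation (a stand-in for the general scheme described above, not the specific method of [6]) translates a point set to its centroid and whitens it with the inverse square root of its covariance; any affinely transformed copy of the shape then reaches the same normalised frame, up to a rotation:

```python
import numpy as np

def affine_normalise(points):
    """Translate a point set to its centroid and whiten it so that its
    covariance becomes the identity. Affinely transformed copies of a
    shape all reach this normalised frame (up to a rotation)."""
    centred = points - points.mean(axis=0)
    cov = centred.T @ centred / len(points)
    # inverse square root of the 2x2 covariance via eigendecomposition
    w, V = np.linalg.eigh(cov)
    whiten = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return centred @ whiten

pts = np.array([[0., 0.], [2., 0.], [2., 1.], [0.5, 3.]])
A = np.array([[1.2, 0.7], [0.3, 0.8]])          # arbitrary affine map
sheared = pts @ A.T + np.array([5.0, -2.0])

for p in (pts, sheared):
    n = affine_normalise(p)
    print(np.allclose(n.mean(axis=0), 0),           # zero mean
          np.allclose(n.T @ n / len(n), np.eye(2))) # unit covariance
```

Both the original and the sheared point set end up with zero mean and identity covariance, which is what makes quantities computed in this frame affine invariant.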

In summary, affine invariance is the idea that detection of a target object is immune to affine deformation of the original object, making the target object reliably and repeatably detectable.

1.4.2 Scale Invariance


Scale invariance is the ability of a keypoint descriptor to be reproduced accurately at a multiplied factor of the detected scale. Methods that use scale invariance usually assume that the scale of the object has changed evenly in all directions, although a small amount of robustness against affine deformations has been shown. Scale invariance is generally approached by forming a three dimensional scale-space pyramid of an image, parameterised by image position and scale.
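A pyramid of this kind can be sketched in Python/NumPy (all parameter values are illustrative; real implementations such as SIFT use carefully tuned sigmas and more scales per octave):

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at three standard deviations."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur in pure NumPy ('same' borders)."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, 'same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, 'same'), 0, tmp)

def scale_space(img, octaves=3, scales=3, sigma0=1.6):
    """Build a simple scale-space pyramid: within each octave the blur
    grows geometrically; each new octave halves the resolution."""
    pyramid = []
    for _ in range(octaves):
        octave = [blur(img, sigma0 * 2**(s / scales)) for s in range(scales)]
        pyramid.append(octave)
        img = img[::2, ::2]          # downsample for the next octave
    return pyramid

img = np.random.rand(64, 64)
pyr = scale_space(img)
print([octave[0].shape for octave in pyr])  # [(64, 64), (32, 32), (16, 16)]
```

Keypoints are then sought as extrema across this stack, which is what the extrema-search code below implements.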





for i=2:r-1
    for j=2:c-1

        % (the neighbourhood maxima and minima max_y, max_z, min_y and
        %  min_z are computed here from the adjacent scale images)

        if (i_two(i,j)>max_y && i_two(i,j)>max_z) ...
                || (i_two(i,j)<min_y && i_two(i,j)<min_z)
            key=[key i_two(i,j)];       % store the extremum value
            key_loc=[key_loc i j];      % store its (i, j) location
        end
    end
end




The next step in the SIFT section of the software was to display the maxima and minima on the original image, lena.jpg. This was done by setting the pixel value at each point of extrema to one, which created a white pixel on the original image wherever an extremum was found. In the for loop, each maximum position is found by setting k_one to a point and j_one to the point immediately next to k_one; i_two then sets both points to one so that they are displayed on the original image.
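The scale-space extrema search shown earlier can also be sketched in Python/NumPy (illustrative only; the function name and the random test images are this report's own). A pixel is kept as a keypoint candidate when it is a strict maximum or minimum over its 3x3x3 neighbourhood across three adjacent scale images, the role played above by i_two, max_y/min_y and max_z/min_z:

```python
import numpy as np

def find_extrema(below, current, above):
    """Return (row, col) positions where `current` is a strict maximum
    or minimum over its 3x3x3 neighbourhood across three scale images."""
    keys = []
    r, c = current.shape
    for i in range(1, r - 1):
        for j in range(1, c - 1):
            hood = np.stack([s[i-1:i+2, j-1:j+2]
                             for s in (below, current, above)])
            v = current[i, j]
            others = np.delete(hood.ravel(), 13)  # drop the centre pixel
            if v > others.max() or v < others.min():
                keys.append((i, j))
    return keys

rng = np.random.default_rng(0)
below, above = rng.random((5, 5)), rng.random((5, 5))
current = rng.random((5, 5))
current[2, 2] = 5.0                    # plant an obvious maximum
print((2, 2) in find_extrema(below, current, above))  # True
```

Real SIFT additionally discards low-contrast and edge-like candidates, steps omitted here for brevity.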

for i=1:2:length(key_loc)
    % (the pixel at each stored extremum location is set to one here)
end





The result of the above code can be seen in figure 6.5.




Figure 6.5: The points of maxima and minima that were found on lena.jpg.

6.2.2 SURF Experiments

The SURF algorithm was implemented using the OpenCV-based functions included in the computer vision toolbox of the Matlab version used for this report. Following the theory outlined in the earlier SURF chapter, an overview of the method implemented is shown in figure 6.6.

Figure 6.6: An overview of the SURF method implemented.

The first step was to read in the object and the scene for processing. The object and the scene image were read in individually, converted from the RGB colour space to greyscale and then each displayed to the user. This was done by the following code.

canImage = imread('can.jpg');       % read the object
imshow(canImage)                    % display the object
title('Image of Monster Can');

canGrey = rgb2gray(canImage);       % convert the object to grey
imshow(canGrey)                     % display the grey object
title('Image of Monster Can Greyscale');

canScene = imread('canScene.jpg');  % read in the scene
imshow(canScene);                   % display the scene
title('Image of Can in Cluttered Scene');

canSceneGray = rgb2gray(canScene);  % convert the scene to grey
imshow(canSceneGray);               % display the grey scene
title('Image of Can in Cluttered Scene Greyscale');

The result of the code to read the images is shown in figure 6.7.


Figure 6.7: The colour and greyscale images of the object and the scene

Once the object and the scene were read in, both underwent the steps to extract feature points from each image. As many feature points were detected, the fifty strongest were displayed. This was done by the following code.

canPoints = detectSURFFeatures(canGrey);
canScenePoints = detectSURFFeatures(canSceneGray);

title('50 Strongest Feature Points from Can Image');
hold on;
plot(selectStrongest(canPoints, 50));

title('50 Strongest Feature Points from Scene Image');
hold on;
plot(selectStrongest(canScenePoints, 50));

The above code results in the features detected and shown in figure 6.8.


Figure 6.8: The fifty strongest feature points detected on both the can and the scene.

Once the feature points were detected, the two images were compared and the feature points extracted in order to find any matches. The first step was to extract the features from both the object and the scene and then to match them. Next, the matched points were displayed on top of the original image as a visual representation. This was done by the following code.

[canFeatures, canPoints] = extractFeatures(canGrey, canPoints);
[sceneFeatures, canScenePoints] = extractFeatures(canSceneGray, canScenePoints);

canPairs = matchFeatures(canFeatures, sceneFeatures);
matchedCanPoints = canPoints(canPairs(:, 1), :);
matchedScenePoints = canScenePoints(canPairs(:, 2), :);

showMatchedFeatures(canGrey, canScene, matchedCanPoints, ...
    matchedScenePoints, 'montage');
title('Matched Points (Including Outliers)');

The points shown on the image include any outlier points detected (points not within the boundary of the detected object). Figure 6.9 shows the result of this operation.


Figure 6.9: The matched points including outliers on the left and the matched points excluding the outliers on the right.

The final step was to convert the matched inlier points into a bounding box by estimating a geometric transform from the matched points. Once the transform had been calculated, the next step was to apply a polygon so that the detected object is bound for the user to see the result. Below is the code used for the operation.

[tform, inlierCanPoints, inlierScenePoints] = ...
    estimateGeometricTransform(matchedCanPoints, matchedScenePoints, 'affine');

showMatchedFeatures(canGrey, canSceneGray, inlierCanPoints, ...
    inlierScenePoints, 'montage');
title('Matched Points (Inliers Only)');

canPolygon = [1, 1;...                          % top-left
    size(canImage, 2), 1;...                    % top-right
    size(canImage, 2), size(canImage, 1);...    % bottom-right
    1, size(canImage, 1);...                    % bottom-left
    1, 1];                                      % top-left again to close the polygon

newCanPolygon = transformPointsForward(tform, canPolygon);

hold on;
line(newCanPolygon(:, 1), newCanPolygon(:, 2), 'Color', 'r');
title('Detected Can');

Once this operation was completed, Figure 6.10 shows the resulting image.


Figure 6.10: The object has been detected in the scene.
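The corner-transformation step performed by transformPointsForward can be sketched in Python/NumPy. The 2x2 linear part A, the translation t and the image size below are hypothetical stand-ins for the values the geometric-transform estimation would actually recover:

```python
import numpy as np

def transform_points_forward(A, t, points):
    """Apply an affine transform p -> A @ p + t to each corner point,
    mirroring transformPointsForward(tform, canPolygon) above."""
    return points @ A.T + t

# Hypothetical affine transform (in practice estimated from the inliers)
A = np.array([[0.9, -0.1],
              [0.05, 1.1]])
t = np.array([120.0, 40.0])

h, w = 240, 160                     # assumed object-image size
polygon = np.array([[1, 1], [w, 1], [w, h], [1, h], [1, 1]], float)
box = transform_points_forward(A, t, polygon)
print(box.shape)  # (5, 2): the closed bounding polygon in scene coordinates
```

Because the first and last corners coincide before the transform, they still coincide afterwards, so the polygon remains closed when drawn.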

6.2.3 Colour Based Object Tracking

The colour based object tracking method was implemented following the theory outlined in the relevant earlier chapter, where an overview of the method is described.

The first step of this approach was to capture the first frame of the video; two hundred frames were used to test this method, and each frame is processed in order to track the object in real time. Once a frame was captured, the next step was to split it into red, green and blue channels in order to isolate the red regions from the rest of the image. This was done by the following code.

while(FaceTimeHD.FramesAcquired<200)  % 200 frame demo
    frame = getsnapshot(FaceTimeHD);  % get a snapshot of the current frame
    R = frame(:,:,1);  % red channel
    G = frame(:,:,2);  % green channel
    B = frame(:,:,3);  % blue channel
    Y = ((0.299.*R) + (0.587.*G) + (0.114.*B));  % luminance (Y) conversion
    difference = imsubtract(R, Y);  % subtract to leave the red component only
    % (the loop body continues in the listings that follow)

Once the red channel has been isolated, the image shown in figure 6.11 is observed.


Figure 6.11: Frame with the red component secluded from the captured frame.

The next step was to apply a median filter to eliminate any unwanted noise from the newly subtracted image. A three by three filter was chosen, as anything larger or smaller produced erratic results when testing, with too much or too little of the image removed. Once the image had been filtered of noise, it was converted into a binary format: any red regions become white blobs and the background becomes a black region. Finally, any detected regions smaller than 400 pixels in size (small unwanted objects) were filtered out using the inbuilt Matlab command bwareaopen, which removes small objects from a binary image [23]. Below is the code which enabled this step.

difference = medfilt2(difference, [3 3]);  % filter noise using a 3x3 median filter
difference = im2bw(difference, 0.18);      % convert to binary
difference = bwareaopen(difference, 400);  % remove regions smaller than 400 pixels
bw = bwlabel(difference, 8);               % label the blobs using 8-connectivity
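The red-isolation and clean-up steps can be sketched in Python/NumPy (an illustrative equivalent, not the project's Matlab code; the toy frame, function names and majority-vote median for a binary mask are this report's own simplifications):

```python
import numpy as np

def isolate_red(frame, thresh=0.18):
    """Subtract the luminance Y from the red channel, then threshold
    to a binary mask, mirroring the R - Y step above."""
    R, G, B = frame[..., 0], frame[..., 1], frame[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    diff = np.clip(R - Y, 0, 1)     # imsubtract saturates at zero
    return diff > thresh

def median3x3(mask):
    """3x3 median of a binary mask == majority vote of the neighbourhood."""
    padded = np.pad(mask.astype(int), 1)
    votes = sum(padded[i:i+mask.shape[0], j:j+mask.shape[1]]
                for i in range(3) for j in range(3))
    return votes >= 5

# A red square on a grey background (values in [0, 1])
frame = np.full((10, 10, 3), 0.5)
frame[3:7, 3:7] = [1.0, 0.0, 0.0]
mask = median3x3(isolate_red(frame))
print(mask[4, 4], mask[0, 0])  # True False
```

The grey background cancels exactly (R equals Y there), so only strongly red pixels survive the threshold.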

The result is a newly formed binary image that visibly shows the red regions detected in the current frame. This is shown in figure 6.12.


Figure 6.12: The binary image after filtering and binary conversion.

Finally, the detected red objects were bound in a green box. This was done by using the region properties of the white sections to find the bounding box and centre point of each detected region. Using these statistics, a box was drawn over each region and overlaid on the original frame, resulting in the detection of red objects. This was done by the code below.

stats = regionprops(bw, 'BoundingBox', 'Centroid');  % region properties (labelled)
imshow(frame)  % original frame with bound object
title('Red objects detected bound in red box');
hold on
for object = 1:length(stats)  % for each object detected repeat the bounding procedure
    bound_box = stats(object).BoundingBox;
    bound_centroid = stats(object).Centroid;

    % (the call drawing bound_box is elided in the original listing)
    plot(bound_centroid(1), bound_centroid(2), '-m+')
end


Lastly, figure 6.13 shows the object to be detected and figure 6.14 shows the object detected in the current video frame.


Figure 6.13: Beer bottle with red sleeve to be detected in a live video feed.


Figure 6.14: Detected red object from live video feed.

6.2.4 Background Subtraction

The first step of this method was to read in the background image and the frame in which an object was to be detected and potentially tracked. As with the colour based object tracking, the required images were read in first. For this method the images needed to be converted from RGB to HSV, as stated in the Background Subtraction chapter. The following code shows this.

back_img = imread('background.jpg');   % read the background image
frame_img = imread('frame.jpg');       % read the frame
back_img_hsv = rgb2hsv(back_img);      % convert from RGB to HSV
frame_img_hsv = rgb2hsv(frame_img);    % convert the frame from RGB to HSV

Figure 6.15 shows the images that were processed.


Figure 6.15: The left image shows the background; the right image shows the frame image.

The next step was to subtract the two images in order to identify the differences between them. Once completed, the image was converted to binary to separate the detected regions from the background. This was done by the following code.

difference = frame_img_hsv - back_img_hsv;
subimage = difference;
difference = im2bw(difference, 0);          % convert to binary
difference = bwareaopen(difference, 1);     % remove noise
difference = medfilt2(difference, [2 2]);   % 2x2 median filter
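The core subtract-and-threshold idea can be sketched in Python/NumPy. This is a simplified stand-in for the HSV subtraction above (RGB-like values are used directly and the threshold is illustrative):

```python
import numpy as np

def subtract_background(frame, background, thresh=0.1):
    """Absolute per-pixel difference between frame and background,
    thresholded to a binary foreground mask. A pixel is foreground
    when any channel changed by more than the threshold."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff.max(axis=-1) > thresh

background = np.full((8, 8, 3), 0.2)
frame = background.copy()
frame[2:5, 2:5] = [0.9, 0.4, 0.1]   # an object enters the scene
fg = subtract_background(frame, background)
print(int(fg.sum()))  # 9 foreground pixels (the 3x3 object)
```

Real scenes need a more robust background model (lighting drift, shadows), which is why the report's method converts to HSV before subtracting.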

Once this operation is completed, the result is a black and white image showing where the two images differ. This can be seen in figure 6.16.


Figure 6.16: This was the subtracted image. The white shows the region of interest and the black represents the background.

Next, the locations and properties of the white regions are found, the pixels are connected via the connected component algorithm, and the objects are highlighted with a red border. Below is the code.

[L num] = bwlabel(difference, 8);  % label the blobs using 8-connectivity
stats = regionprops(L, 'BoundingBox', 'Centroid');
[length_two num2] = bwlabel(L);
[B, L, N, A] = bwboundaries(length_two);
hold on
for k = 1:length(B)
    bounding_box = B{k};
    plot(bounding_box(:,2), bounding_box(:,1), 'r', 'LineWidth', 2);
end



Figure 6.17 shows the final result: the objects detected by subtracting the background from the frame, bound with a red border that traces the perimeter of each white pixel shape.


Figure 6.17: The final result showing the detected objects bound with a red perimeter.
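The connected component labelling performed by bwlabel can be sketched in Python (an illustrative flood-fill implementation; the function name and toy mask are this report's own):

```python
import numpy as np
from collections import deque

def label_components(mask):
    """8-connected component labelling of a binary mask by flood fill
    (the role bwlabel plays above); returns a label image and count."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                       # already part of a blob
        current += 1
        queue = deque([seed])
        labels[seed] = current
        while queue:
            y, x = queue.popleft()
            for dy in (-1, 0, 1):          # visit all 8 neighbours
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                            and mask[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = current
                        queue.append((ny, nx))
    return labels, current

mask = np.zeros((6, 6), bool)
mask[0:2, 0:2] = True                      # blob 1
mask[4:6, 3:6] = True                      # blob 2
labels, n = label_components(mask)
print(n)  # 2
```

Each labelled blob can then be measured (area, bounding box, centroid), which is what regionprops provides in the Matlab implementation.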

7 Comparison

Running the SIFT and SURF algorithms on the same image made it possible to compare the time taken by each algorithm to find feature points.
