Augmented Reality — Lab — Marker Tracking (OpenCV)
Requirements:
pip install opencv-python
pip install numpy
Introduction
The goal of this lab session is to introduce you to several fundamental concepts in computer vision through a concrete application: detecting and localizing a planar marker in a video stream.
You will work on:
- detecting feature points (interest points),
- describing these points (local descriptors),
- matching points detected in two images,
- computing a homography (projective transformation),
- augmenting a video stream based on marker localization.
By the end of the lab, you will have a program that, given:
- a reference image (the marker),
- a video stream (webcam or video file),
automatically detects the presence of the marker in the scene and masks the marker area by replacing it with a white polygon (to visually validate detection).
Application Principle
The overall idea of the system is the following:
Load the marker image (reference image).
Open a video stream (webcam or video file).
Detect a set of feature points in the marker image (FP_M).
Compute descriptors for these points (FD_M).
For each frame I_t of the video stream:
- detect a set of feature points (FP_t),
- compute descriptors (FD_t),
- match FD_M and FD_t to obtain correspondences,
- filter these correspondences to retain only “good matches”.
Estimate the geometric transformation relating the marker plane to the frame: a homography H_{M->t} robustly computed using RANSAC.
Use H_{M->t} to:
- project the 4 corners of the marker into image I_t,
- mask the detected region by filling it in white.
Important note: this lab performs localization via “re-detection + matching” at each frame. It is not a temporal tracking approach (e.g., KLT/optical flow), but rather a method that is robust and simple to implement.
Feature Detection
The objective of feature detection (or more generally “interest regions”) is to automatically select image elements that exhibit distinctive properties: corners, textured regions, high-contrast areas, etc.
A feature detector takes an image as input and outputs a set of pixel coordinates corresponding to points considered “interesting” by the algorithm.
Important: a detector provides positions (and sometimes scale/orientation) but not a descriptive signature.
Examples of detectors: Harris, FAST, Shi-Tomasi, ORB, AKAZE, SIFT, etc.
Resources (to read after the lab):
- https://fr.wikipedia.org/wiki/D%C3%A9tection_de_zones_d%27int%C3%A9r%C3%AAt
- https://en.wikipedia.org/wiki/Feature_detection_(computer_vision)
Feature Description
A feature descriptor aims to numerically characterize the local appearance around an interest point.
It takes as input:
- an image,
- a list of feature points,
and outputs a vector (or matrix) of descriptors, one per point.
These descriptors represent a local “fingerprint” that enables comparison of points detected in different images.
Desirable properties of a descriptor:
- invariance (or robustness) to rotation,
- invariance (or robustness) to scale changes,
- robustness to photometric variations (lighting),
- robustness to moderate geometric transformations.
Examples: ORB, AKAZE, BRISK, SIFT (detector + descriptor), etc.
Resources (to read after the lab):
- https://fr.wikipedia.org/wiki/Extraction_de_caract%C3%A9ristique_en_vision_par_ordinateur
- https://en.wikipedia.org/wiki/Visual_descriptor
- https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
Matching
Matching addresses the following question:
Given a feature in the marker image, how can we find the most similar feature in the current video frame?
The general idea is:
- Define a distance or similarity measure between descriptors.
- For each descriptor of the marker, find the closest descriptor in the frame.
Possible distance measures: L1, L2, Hamming (for binary descriptors such as ORB). On top of the distance, a filtering rule such as Lowe's ratio test can be applied to discard ambiguous matches.
Matching methods:
- Brute force: compares all descriptors pairwise (simple, potentially costly).
- FLANN: approximate nearest neighbors (faster for large datasets).
After matching, a filtering rule must be defined to retain only relevant correspondences. A classical approach (used in the provided base code) consists of using a threshold based on the minimum observed distance:
threshold = alpha * minDist
where alpha is a coefficient (typically between 3 and 20 depending on the method and expected quality).
Resources (to read after the lab):
- https://docs.google.com/presentation/d/1_HFh3SdmdyZ_j-sFS4Tw17DmhKfjmYZqvSp7TmfdD_M/edit?slide=id.p#slide=id.p
- https://www.cs.toronto.edu/~urtasun/courses/CV/lecture04.pdf
Homography and Projection
Once correspondences between the marker (planar object) and the frame (image) have been established, we estimate the geometric transformation relating these two sets of points.
For a planar object, this transformation is a homography (3x3 matrix). It models perspective effects and relates two views of the same plane.
We estimate a matrix H from the good matches (after filtering). The estimation must be robust to outliers: we use RANSAC.
Resources (to read after the lab):
- https://en.wikipedia.org/wiki/Homography_(computer_vision)
- https://fr.wikipedia.org/wiki/Application_projective
Implementation in OpenCV
The objective of the lab is to implement the previous steps in OpenCV starting from a provided code skeleton.
Provided Base Code
You are given the following files:
- main.py
- MyFeatureDetector.py
- MyDescriptorExtractor.py
- MyDescriptorMatcher.py
These files contain TODO sections guiding your implementation.
Important: the current base code loads two static images. You must adapt it to handle:
- a reference image marker.jpg,
- a video stream (webcam or file).
You will remove the final warped image generation part and replace it with an augmentation step (masking the marker) based on projected corners.
Questions
Question 1: Reading and Understanding the Provided Code
Read the provided code and identify:
- where images are loaded,
- where the detector, extractor, and matcher are instantiated,
- where detected points and descriptors are stored,
- how the list of best matches is built,
- how the homography is computed.
Explain in a few lines the role of each file.
Question 2: Create a Feature Detector (ORB)
In MyFeatureDetector.py, complete the changeFeatureDetector function to create an ORB detector.
Objective: fill the myFeatureDetector attribute by calling the appropriate OpenCV constructor (e.g., cv.ORB_create(...)).
Question 3: Display Detected Points
In main.py, display detected feature points:
- on the image marker.jpg,
- on a video frame.
Hint: use the displayFeatures method and the existing TODO sections in the skeleton.
To do: add a screenshot showing results on the marker and on a frame.
Optional: add parameters to cv.ORB_create(...) (number of points, threshold, etc.) and comment on their impact.
Question 4: Instantiate a Descriptor Extractor (ORB)
In MyDescriptorExtractor.py, complete changeDescriptorExtractor to create an ORB descriptor extractor.
Question 5: Compute Descriptors
Complete computeDescriptors() in MyDescriptorExtractor.py and complete the corresponding calls in main.py to:
- compute descriptors for the marker,
- compute descriptors for a frame.
Verify that descriptors are properly computed (consistent dimensions, non-empty).
Question 6: Perform Matching
In MyDescriptorMatcher.py, complete the match method to retain only correspondences whose distance is below a threshold:
threshold = alpha * minDist
Selected matches must be stored in bestMatches.
Question 7: Display Matching Results
In main.py, call drawMatchingResults() and display the resulting image.
To do: illustrate this result with a screenshot in your report.
Question 8: Understand and Compute the Homography
From the best matches, build two lists of 2D points:
- points in the marker image,
- points in the video frame.
Then compute the homography using cv.findHomography(..., cv.RANSAC, epsilon).
Explain briefly what this computation does and why RANSAC is necessary.
Question 9: Augmentation — Mask the Marker
Once homography H is estimated, project the 4 marker corners into the frame.
Expected steps:
- Define the 4 marker corners in its image coordinate system.
- Project them using cv.perspectiveTransform.
- Draw the contour (recommended for debugging).
- Fill the projected polygon in white using cv.fillConvexPoly.
Condition: perform augmentation only if H is valid and if the number of inliers is sufficient (threshold to define and justify).
To do: provide a screenshot where the marker is effectively masked.
Question 10: Test on Other Sequences
Test your program with:
- another video (or another webcam scene),
- another marker (different reference image).
To do: illustrate at least one additional test and comment on robustness.
Question 11: Change Detector/Descriptor (Optional)
Try another detector/descriptor combination (e.g., AKAZE) and compare:
- number of detected points,
- matching stability,
- number of inliers,
- robustness to scale/lighting variations,
- computational cost.
To do: illustrate and comment.
Deliverables
You must submit:
The completed code.
A short report (PDF or Markdown) containing:
- the requested screenshots,
- your parameter choices (alpha, inlier threshold, etc.),
- a robustness analysis (tests, failure cases).