Tracking in Mixed Reality
From Cameras to Robust Visual Tracking
Étienne Peillard – IMT Atlantique
🎯 Course Objectives
By the end of this lecture, you should be able to:
- Explain what tracking means in Mixed Reality
- Understand how cameras form digital images
- Connect image processing to feature detection
- Understand how features enable 6DoF tracking
- See the link with the practical lab (feature detection for tracking)
PART A — Why Tracking in Mixed Reality?
What is Tracking in Mixed Reality?
Tracking = continuous estimation of pose:
- Position (x, y, z)
- Orientation (roll, pitch, yaw)
This applies to:
- The camera/headset
- The user’s body
- Objects in the environment
- Sometimes the entire scene (SLAM)
Why is Tracking Critical in MR?
Because MR requires spatial alignment between:
- The real world (captured by sensors)
- The virtual world (rendered by the engine)
If tracking is wrong:
- Virtual objects drift
- Registration breaks
- The experience feels “fake” or unstable
The Registration Problem
Core question:
Given an image from a moving camera, how do we estimate its 6DoF pose in the real world?
Conceptual pipeline:
Three Families of Tracking in XR
Tracking and Latency
Good tracking must be:
- Accurate
- Low-latency
- Robust to occlusion
- Stable over time
Why? Because MR rendering depends on it in real time.
PART B — From Camera to Digital Image
Cameras as Sensors in XR
A camera is a physical sensor that:
- Captures light from the environment
- Converts it into a digital image
- Introduces limitations that affect tracking
Key issues:
- Discretization
- Aliasing
- Noise
- Limited dynamic range
Discretization of Images
The real world is continuous.
A digital image is sampled on a grid of pixels.
This creates:
- Loss of detail
- Potential aliasing
- Dependence on resolution
Aliasing and Moiré
If sampling is too coarse:
- High-frequency patterns create artifacts (moiré)
- Edges look jagged (aliasing)
This matters for tracking:
- Artifacts can confuse feature detectors
Nyquist–Shannon Theorem
To properly sample a signal:
Sampling frequency > 2 × highest frequency in the signal
If violated → aliasing appears.
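A minimal sketch of the effect in plain Python, under assumed values: a 9 Hz sine sampled at only 10 Hz (far below twice its frequency) produces exactly the same samples as an inverted 1 Hz sine. The frequencies and the helper name `sample_sine` are illustrative choices, not part of the lecture.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Return n_samples of sin(2*pi*freq*t) taken at the given sample rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

fast = sample_sine(9.0, 10.0, 20)  # 9 Hz signal, undersampled at 10 Hz
slow = sample_sine(1.0, 10.0, 20)  # 1 Hz signal, sampled well above its Nyquist rate

# After sampling, the 9 Hz signal cannot be told apart from an inverted 1 Hz one.
aliased = all(abs(f + s) < 1e-9 for f, s in zip(fast, slow))
print(aliased)
```

The same mechanism applies spatially: a texture finer than the pixel grid can appear in the image as a coarser, false pattern.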
Color Quantization
Real light is continuous.
Digital images use a finite number of values:
- Typically 8 bits per channel (values 0–255)
- RGB representation
This limits precision in low-light conditions.
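As an illustrative sketch (the function name `quantize` and the sample intensities are assumptions), the snippet below maps a continuous intensity in [0, 1] to an integer code and shows two distinct dark values collapsing onto the same 8-bit code, while a 12-bit code still separates them.

```python
def quantize(intensity, bits=8):
    """Map a continuous intensity in [0, 1] to an integer code with 2**bits levels."""
    max_code = (1 << bits) - 1
    clamped = min(max(intensity, 0.0), 1.0)
    return round(clamped * max_code)

# Two slightly different dark intensities
dark_a, dark_b = 0.0100, 0.0110

same_in_8_bits = quantize(dark_a) == quantize(dark_b)
same_in_12_bits = quantize(dark_a, bits=12) == quantize(dark_b, bits=12)
print(same_in_8_bits, same_in_12_bits)
```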
From RGB to Perception
Different color spaces exist:
- RGB (device-oriented)
- HSV (more aligned with human perception):
  - Hue
  - Saturation
  - Value (brightness)
Useful in some tracking pipelines.
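A small sketch using Python's standard `colorsys` module suggests why HSV can help: a bright red and a dark red differ in all their RGB values' magnitude, but share the same hue and differ only in value. The two sample colors are arbitrary.

```python
import colorsys

# Channels are floats in [0, 1]; colorsys returns (hue, saturation, value)
bright_red = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
dark_red = colorsys.rgb_to_hsv(0.5, 0.0, 0.0)

same_hue = bright_red[0] == dark_red[0]
print(bright_red, dark_red, same_hue)
```

A color-based tracker that thresholds on hue is therefore less sensitive to brightness changes than one that thresholds on raw RGB values.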
PART C — Projective Geometry (Core for Tracking)
Why Geometry Matters for Tracking
Tracking is fundamentally projective geometry.
We need to model:
- How 3D points project onto a 2D image
- How camera motion affects this projection
Intrinsic Parameters (K)
Describe the internal properties of the camera:
- Focal length
- Principal point
- Pixel aspect ratio
Fixed for a calibrated camera.
Extrinsic Parameters (R, t)
Describe the pose of the camera in the world:
- Rotation matrix R
- Translation vector t
👉 This is what tracking estimates.
The Projection Model (Simplified)
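The core equation:
$s\,u = K\,[R \mid t]\,X$
Where:
- X = 3D point in the world (homogeneous coordinates)
- u = 2D pixel in the image (homogeneous coordinates)
- s = projective scale factor (depth of the point along the optical axis)
- K = intrinsics
- [R|t] = camera pose (tracking result)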
Intuition of the Projection
- Transform 3D point from world to camera space (R, t)
- Project onto image plane (K)
- Obtain pixel coordinates (u)
This is the mathematical backbone of visual tracking.
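The three steps can be sketched in plain Python as follows. The intrinsics (focal length 800 px, principal point at (320, 240)), the identity pose and the function name `project` are assumed values chosen for illustration; lens distortion is ignored.

```python
def project(point_world, K, R, t):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    # 1. World -> camera coordinates: X_c = R * X + t
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # 2. Camera -> homogeneous pixel coordinates: K * X_c
    uvw = [sum(K[i][j] * xc[j] for j in range(3)) for i in range(3)]
    # 3. Perspective division gives the pixel (u, v)
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]   # camera aligned with the world axes
t = [0.0, 0.0, 0.0]     # camera at the world origin

u, v = project([0.1, -0.05, 2.0], K, R, t)
print(u, v)
```

Moving the point farther away (larger z) pulls its projection toward the principal point, which is the perspective effect the division in step 3 encodes.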
PART D — Image Processing as a Precursor to Tracking
Why Image Processing for Tracking?
Three main goals:
- Reduce noise → more stable features
- Enhance structures → clearer edges/corners
- Normalize images → robustness across lighting changes
Grayscale Conversion
Most feature detectors work on grayscale images.
Classic formula:
Y = 0.299R + 0.587G + 0.114B
Reduces complexity while preserving structure.
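A direct transcription of the formula, as a sketch on a tiny nested-list image (the helper names and the 2×2 test image are illustrative):

```python
def to_gray(r, g, b):
    """Luma from the classic weights (those of ITU-R BT.601)."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def image_to_gray(rgb_image):
    """Convert a nested list of (R, G, B) tuples to a nested list of gray values."""
    return [[to_gray(*pixel) for pixel in row] for row in rgb_image]

rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
gray = image_to_gray(rgb)
print(gray)
```

The weights reflect that green contributes most to perceived brightness and blue least, so pure green maps to a lighter gray than pure red or pure blue.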
Histogram Equalization
Improves contrast:
- Makes details more visible
- Helps feature detection in low-contrast images
Useful in challenging lighting conditions.
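One common way to implement histogram equalization is to remap gray levels through the cumulative histogram. The sketch below assumes integer gray levels in a flat list; the function name `equalize` and the sample values are illustrative.

```python
def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray levels in [0, levels - 1]."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative histogram: number of pixels with a level <= each value
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to stretch
        return list(pixels)
    # Remap levels so the occupied range stretches over [0, levels - 1]
    lut = [round(max(c - cdf_min, 0) * (levels - 1) / (n - cdf_min)) for c in cdf]
    return [lut[p] for p in pixels]

low_contrast = [100, 100, 101, 102, 103]
stretched = equalize(low_contrast)
print(stretched)
```

Gray levels that were only one unit apart end up far apart after remapping, which makes weak edges easier for a detector to pick up (and also amplifies noise).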
Contrast and Brightness Adjustment
Convolution Filters
A sliding kernel (a small matrix of weights) is applied over the image.
Each pixel becomes a weighted sum of its neighbors.
This enables:
- Blurring
- Edge detection
- Feature enhancement
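The sliding-kernel operation can be sketched as a "valid" 2D convolution on nested lists (no padding, so the output is smaller than the input). The function name `convolve2d` and the test image are illustrative; real pipelines would use an optimized library routine.

```python
def convolve2d(image, kernel):
    """'Valid' 2D convolution of a nested-list image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    # Flipping the kernel is what distinguishes convolution from correlation
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            # Each output pixel is a weighted sum of the neighborhood
            row.append(sum(flipped[j][i] * image[y + j][x + i]
                           for j in range(kh) for i in range(kw)))
        out.append(row)
    return out

box_blur = [[1 / 9] * 3 for _ in range(3)]  # averaging kernel
image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
blurred = convolve2d(image, box_blur)
print(blurred)
```

Changing only the kernel weights turns the same loop into a blur, an edge detector or a sharpening filter.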
Gaussian Blur
Used to:
- Reduce high-frequency noise
- Smooth the image before detecting features
Important for robust tracking.
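A sketch of how a Gaussian kernel can be built and applied in one dimension. Because the 2D Gaussian is separable, the same 1D kernel can be applied along rows and then along columns. The 3-sigma truncation radius, the border replication and the function names are assumptions made for this example.

```python
import math

def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian weights, truncated at about 3 sigma by default."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(signal, kernel):
    """Blur a 1D signal, replicating the border samples."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(-r, r + 1):
            j = min(max(i + k, 0), n - 1)  # clamp index at the borders
            acc += kernel[k + r] * signal[j]
        out.append(acc)
    return out

kernel = gaussian_kernel_1d(1.0)
smoothed = blur_1d([0, 0, 10, 0, 0], kernel)  # an isolated noise spike
print(smoothed)
```

The spike is spread over its neighbors and its peak is strongly reduced, which is why blurring before detection suppresses spurious features caused by pixel noise.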
Gradients and Edges
Edges correspond to strong intensity changes.
We compute gradients using derivative filters such as the Sobel filter.
Edges help locate meaningful structures.
Sobel Filter
$f'(i) = \frac{f_{i+1} - f_{i-1}}{2}$ is equivalent to convolving with the kernel $\left[\tfrac{1}{2},\ 0,\ -\tfrac{1}{2}\right]$
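The usual 3×3 Sobel kernels combine this central difference (without the 1/2 factor) with a [1, 2, 1] smoothing in the perpendicular direction. The sketch below applies them as a correlation at a single pixel of a synthetic vertical edge; the helper names and the test image are illustrative.

```python
import math

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]   # responds to horizontal intensity changes
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]    # responds to vertical intensity changes

def correlate_at(image, kernel, y, x):
    """Weighted sum of the 3x3 neighborhood centered on (y, x)."""
    return sum(kernel[j][i] * image[y + j - 1][x + i - 1]
               for j in range(3) for i in range(3))

def sobel_gradient(image, y, x):
    """Return (gx, gy, magnitude) at an interior pixel."""
    gx = correlate_at(image, SOBEL_X, y, x)
    gy = correlate_at(image, SOBEL_Y, y, x)
    return gx, gy, math.hypot(gx, gy)

# Vertical edge: dark on the left, bright on the right
image = [[0, 0, 10, 10] for _ in range(4)]
gx, gy, magnitude = sobel_gradient(image, 1, 1)
print(gx, gy, magnitude)
```

The gradient is purely horizontal here, as expected for a vertical edge: the gradient direction is perpendicular to the edge.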
Laplacian (Second-Order Derivative)
Detects regions of rapid change:
- Useful for feature detection
- Highlights corners and fine details
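The standard discrete Laplacian is written:
$\nabla^2 f(i,j) = f_{i+1,j} + f_{i-1,j} + f_{i,j+1} + f_{i,j-1} - 4 f_{i,j}$
This corresponds to the kernel:
$\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$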
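A short sketch of the discrete Laplacian evaluated at one pixel, using the four-neighbor formula directly. The function name and the three tiny test images are illustrative.

```python
def laplacian_at(image, i, j):
    """Four-neighbor discrete Laplacian at interior pixel (i, j)."""
    return (image[i + 1][j] + image[i - 1][j]
            + image[i][j + 1] + image[i][j - 1]
            - 4 * image[i][j])

flat = [[5] * 3 for _ in range(3)]                        # constant intensity
ramp = [[2 * col for col in range(3)] for _ in range(3)]  # linear gradient
spike = [[0, 0, 0],
         [0, 7, 0],
         [0, 0, 0]]                                       # isolated bright pixel

print(laplacian_at(flat, 1, 1), laplacian_at(ramp, 1, 1), laplacian_at(spike, 1, 1))
```

The response is zero on constant and linearly varying regions and large on isolated details, which is also why the Laplacian is sensitive to noise and is usually applied after smoothing.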
From Image Processing to Features
At this stage, we have a cleaned and enhanced image.
Next step:
👉 Detect meaningful points for tracking.
PART E — Feature Detection for Visual Tracking
What is a Feature?
A feature is a distinctive image pattern that can be reliably detected and matched across frames.
Examples:
What Makes a Good Feature?
A good feature should be:
- Local (robust to occlusion)
- Invariant (to translation, rotation, scale)
- Robust (to noise, lighting changes)
- Distinctive (easy to match)
- Repeatable (found again in next frame)
Why Corners are Ideal
Corners are:
- Stable across viewpoints
- Well-localized in both x and y
- Highly informative for motion estimation
Moravec Corner Detector (1977)
Idea:
- Measure how much intensity changes when shifting a small window in different directions.
Average intensity variation for a small displacement (u, v):
$E(u,v) = \sum_{x,y} w(x,y)\,\bigl(I(x+u,\,y+v) - I(x,y)\bigr)^2$
- w specifies the considered neighborhood (value 1 inside the window and 0 outside);
- I(x, y) is the intensity at pixel (x, y).
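A sketch of this measure on a synthetic image. It assumes a 3×3 window, eight one-pixel shifts, and the usual Moravec choice of taking the minimum of E over the shifts as the corner score; the function names and the test image are illustrative.

```python
def ssd_shift(image, y, x, u, v, half=1):
    """E(u, v): squared difference between the window centered on (x, y)
    and the same window shifted by (u, v)."""
    return sum((image[y + dy + v][x + dx + u] - image[y + dy][x + dx]) ** 2
               for dy in range(-half, half + 1)
               for dx in range(-half, half + 1))

# Eight one-pixel displacements (u, v)
SHIFTS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1)]

def moravec_response(image, y, x):
    """Corner score: the smallest change over all tested shifts."""
    return min(ssd_shift(image, y, x, u, v) for u, v in SHIFTS)

# 10x10 image with a bright square in the lower-right quadrant
image = [[10 if (y >= 5 and x >= 5) else 0 for x in range(10)] for y in range(10)]

corner = moravec_response(image, 5, 5)  # corner of the square
edge = moravec_response(image, 7, 5)    # on the square's left edge
flat = moravec_response(image, 2, 2)    # uniform background
print(corner, edge, flat)
```

On an edge, shifting along the edge changes nothing, so the minimum is zero; only at a corner does every shift change the window content.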
Harris Corner Detector (1988)
Improvement over Moravec:
- Uses image gradients
- More robust
- Rotation invariant
Widely used in vision-based tracking.
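Average intensity variation for a small displacement (u, v):
$E(u,v) = \sum_{x,y} w(x,y)\,\bigl| I(x+u,\,y+v) - I(x,y) \bigr|^2$
- w weights the considered neighborhood;
- I(x, y) is the intensity at pixel (x, y).
Taylor expansion
$I(x+u,\,y+v) = I(x,y) + u\,\frac{\partial I}{\partial x}(x,y) + v\,\frac{\partial I}{\partial y}(x,y) + O(u^2, v^2)$
Neglecting the $O(u^2, v^2)$ term:
$E(u,v) \approx \sum_{x,y} w(x,y)\,\left( u\,\frac{\partial I}{\partial x}(x,y) + v\,\frac{\partial I}{\partial y}(x,y) \right)^2$
For small displacements (u, v), this is a quadratic form:
$E(u,v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}$
M: symmetric, positive semi-definite ⇒ eigenvalue decomposition possible, with eigenvalues $\lambda_1, \lambda_2 \ge 0$.
Structure of the second-moment matrix
$M = \begin{pmatrix} A & C \\ C & B \end{pmatrix}$
with:
$A = \left(\frac{\partial I}{\partial x}\right)^2 \otimes w; \quad B = \left(\frac{\partial I}{\partial y}\right)^2 \otimes w; \quad C = \left(\frac{\partial I}{\partial x}\,\frac{\partial I}{\partial y}\right) \otimes w$
w: Gaussian window (isotropic), replacing Moravec's binary window.
A corner is characterized by a large variation of E in all directions of the displacement (u, v), i.e. both eigenvalues of M are large.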
⇒ Compute the eigenvalues of M.
Corner response
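Instead of computing the eigenvalues, we can compute:
$\det(M) = AB - C^2 = \lambda_1 \lambda_2$
$\operatorname{trace}(M) = A + B = \lambda_1 + \lambda_2$
and define the response (k is an empirical constant, typically k = 0.04):
$R = \det(M) - k\,\operatorname{trace}(M)^2$
Values of R:
- positive near corners,
- negative near edges,
- small in magnitude in flat regions.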
⇒ corners / interest points = local maxima of R.
Invariant to rotation: rotating the image rotates the eigenvectors of M but leaves its eigenvalues unchanged, so the response R is unchanged.
Not invariant to scale: seen at a finer scale, a corner can look like an edge within the same window.
Potential solution: compute Harris response at multiple scales (Harris–Laplace).
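The Harris response can be sketched in plain Python on a synthetic image. This sketch makes simplifying assumptions that differ from the formulation above: gradients come from central differences, and the window is a uniform 3×3 box rather than a Gaussian. The function name `harris_response` and the test image are illustrative.

```python
def harris_response(image, y, x, half=1, k=0.04):
    """Harris response R at (y, x), summing gradient products over a box window."""
    a = b = c = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            py, px = y + dy, x + dx
            # Central-difference estimates of dI/dx and dI/dy
            ix = (image[py][px + 1] - image[py][px - 1]) / 2.0
            iy = (image[py + 1][px] - image[py - 1][px]) / 2.0
            a += ix * ix   # A: sum of (dI/dx)^2
            b += iy * iy   # B: sum of (dI/dy)^2
            c += ix * iy   # C: sum of dI/dx * dI/dy
    det = a * b - c * c
    trace = a + b
    return det - k * trace * trace

# 10x10 image with a bright square in the lower-right quadrant
image = [[10 if (y >= 5 and x >= 5) else 0 for x in range(10)] for y in range(10)]

r_corner = harris_response(image, 5, 5)  # corner of the square
r_edge = harris_response(image, 7, 5)    # on the square's left edge
r_flat = harris_response(image, 2, 2)    # uniform background
print(r_corner, r_edge, r_flat)
```

On the edge, only the horizontal gradient is non-zero, so det(M) vanishes and R is negative; at the corner both gradient directions contribute and R is positive.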
PART F — From Features to Tracking
Feature-Based Tracking Pipeline
Frame t:
- Detect features
- Describe features
- Match with previous frame
- Estimate motion (R, t)
Repeat continuously.
Feature Matching
Features are compared using descriptors.
Goal:
- Find correspondences between frames
This enables motion estimation.
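A minimal sketch of descriptor matching by nearest neighbor in Euclidean distance. The ratio test used to reject ambiguous matches (keep a match only if the best distance is clearly smaller than the second best) is a common heuristic introduced here as an assumption, as are the 0.75 threshold, the function name and the toy 3-dimensional descriptors.

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Return (index_a, index_b) pairs whose nearest neighbor passes the ratio test."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distances from this descriptor to every candidate, closest first
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if not dists:
            continue
        best_dist, best_j = dists[0]
        # Accept only if the best candidate is clearly better than the runner-up
        if len(dists) == 1 or best_dist < ratio * dists[1][0]:
            matches.append((i, best_j))
    return matches

frame_prev = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
frame_curr = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = match_descriptors(frame_prev, frame_curr)
print(matches)
```

The third descriptor is almost equally close to two candidates, so it is discarded rather than risk a wrong correspondence that would corrupt the motion estimate.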
SIFT (Scale-Invariant Feature Transform)
SIFT provides:
- Scale invariance
- Rotation invariance
- Robust descriptors
Widely used in classical computer vision.
SIFT Pipeline
- Build Gaussian pyramid (multi-scale representation)
- Compute Difference of Gaussians (DoG)
- Detect keypoints
- Assign orientation
- Compute descriptor from local gradients
Build Gaussian pyramid, compute DoG and detect keypoints
To detect keypoints independently of scale, the image is repeatedly blurred and downsampled (building a “pyramid”).
The Difference of Gaussians (DoG) approximates the Laplacian of Gaussian and is cheaper to compute at each level.
Assign orientation and compute descriptor
The descriptor is a histogram of local gradients around the keypoint, weighted by a Gaussian window.
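The idea of a gradient-orientation histogram can be sketched as follows. This is a simplified illustration, not the full SIFT descriptor: it builds a single 8-bin histogram weighted by gradient magnitude, and omits the Gaussian weighting, the 4×4 spatial grid and the normalization. The function name and sample gradients are illustrative.

```python
import math

def orientation_histogram(gradients, bins=8):
    """Accumulate gradient magnitudes into orientation bins covering [0, 2*pi)."""
    hist = [0.0] * bins
    for gx, gy in gradients:
        magnitude = math.hypot(gx, gy)
        angle = math.atan2(gy, gx) % (2 * math.pi)
        index = int(angle / (2 * math.pi) * bins) % bins
        hist[index] += magnitude
    return hist

# (gx, gy) samples around a hypothetical keypoint
gradients = [(2.0, 0.5), (1.0, 0.2), (-0.3, 1.0)]
hist = orientation_histogram(gradients)
dominant_bin = hist.index(max(hist))
print(hist, dominant_bin)
```

The dominant bin gives the keypoint's reference orientation; expressing the descriptor relative to it is what makes the description rotation invariant.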
PART G — From 2D Features to 6DoF Tracking
The PnP Problem (Perspective-n-Point)
If we know:
- 3D world points
- Their 2D projections in the image
We can solve for camera pose (R, t).
This is central to many AR systems.
Visual Odometry
If no 3D map is available:
- Track motion between consecutive frames
- Estimate relative movement of the camera
Used in SLAM systems.
SLAM in XR
Modern headsets use SLAM:
- Simultaneous Localization and Mapping
- Build a map of the environment
- Track the headset within that map
Examples:
- ARKit
- ARCore
- Meta Quest inside-out tracking
PART H — Tracking in Modern XR Systems
Inside-Out Tracking (e.g., Meta Quest)
Uses multiple cameras + IMU to:
- Track headset position
- Track controllers
- Map environment
Visual-inertial and self-contained: no external sensors are required.
Optical Tracking (Vicon/OptiTrack)
Uses external cameras to track reflective markers.
Very accurate but requires a dedicated setup.
Hybrid Tracking Systems
Combine several sensing modalities to improve robustness and reduce drift.