Tracking in Mixed Reality

From Cameras to Robust Visual Tracking

Étienne Peillard – IMT Atlantique

🎯 Course Objectives

By the end of this lecture, you should be able to:

  • Explain what tracking means in Mixed Reality
  • Understand how cameras form digital images
  • Connect image processing to feature detection
  • Understand how features enable 6DoF tracking
  • See the link with the practical lab (feature detection for tracking)

PART A — Why Tracking in Mixed Reality?

What is Tracking in Mixed Reality?

Tracking = continuous estimation of pose:

  • Position (x, y, z)
  • Orientation (roll, pitch, yaw)

This applies to:

  • The camera/headset
  • The user’s body
  • Objects in the environment
  • Sometimes the entire scene (SLAM)
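
A pose can be stored as a translation vector plus a rotation matrix. The sketch below builds the rotation from roll, pitch and yaw. The Z-Y-X convention (yaw, then pitch, then roll), the function names and the example headset pose are assumptions for illustration, since conventions differ between engines.

```python
import math

def rotation_from_rpy(roll, pitch, yaw):
    """3x3 rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]

def apply_pose(R, t, p):
    """Map point p from the object's local frame to the world frame: R p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

# Hypothetical headset pose: 1.6 m above the floor, turned 90 degrees about the vertical (z) axis.
R = rotation_from_rpy(0.0, 0.0, math.pi / 2)
t = [0.0, 0.0, 1.6]
point_world = apply_pose(R, t, [1.0, 0.0, 0.0])
```

Here a point 1 m in front of the headset along its local x axis ends up on the world y axis, at headset height. Six numbers (three for position, three for orientation) fully determine the result, hence "6DoF".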

Why is Tracking Critical in MR?

Because MR requires spatial alignment between:

  • The real world (captured by sensors)
  • The virtual world (rendered by the engine)

If tracking is wrong:

  • Virtual objects drift
  • Registration breaks
  • The experience feels “fake” or unstable

The Registration Problem

Core question:

Given an image from a moving camera, how do we estimate its 6DoF pose in the real world?

Conceptual pipeline:

[Figure: conceptual tracking pipeline]

Three Families of Tracking in XR

[Figure: overview of the three families of tracking in XR]

Tracking and Latency

Good tracking must be:

  • Accurate
  • Low-latency
  • Robust to occlusion
  • Stable over time

Why? Because MR rendering depends on it in real time.

PART B — From Camera to Digital Image

Cameras as Sensors in XR

A camera is a physical sensor that:

  • Captures light from the environment
  • Converts it into a digital image
  • Introduces limitations that affect tracking

Key issues:

  • Discretization
  • Aliasing
  • Noise
  • Limited dynamic range

Discretization of Images

The real world is continuous.
A digital image is sampled on a grid of pixels.

This creates:

  • Loss of detail
  • Potential aliasing
  • Dependence on resolution

Interpolation

[Figure: interpolation between pixel samples]

Aliasing and Moiré

If sampling is too coarse:

  • High-frequency patterns create artifacts (moiré)
  • Edges look jagged (aliasing)

This matters for tracking:

  • Artifacts can confuse feature detectors

Aliasing

[Figure: aliasing example]

Moiré

[Figure: moiré example]

Demo: effect of downsampling on camera

Nyquist–Shannon Theorem

To sample a signal without aliasing: sampling frequency > 2 × highest frequency in the signal

If violated → aliasing appears.
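
A one-dimensional sketch of what happens when the sampling condition is violated. The frequencies and sample rate are illustrative values chosen for this example, not taken from the lecture demo.

```python
import math

def sample(freq_hz, sample_rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at regular intervals of 1/sample_rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]

sample_rate = 10.0                      # Hz, so only frequencies below 5 Hz are safe
slow = sample(1.0, sample_rate, 20)     # 1 Hz: properly sampled
fast = sample(9.0, sample_rate, 20)     # 9 Hz: violates the sampling condition

# The 9 Hz signal produces the same samples as the 1 Hz one: it is aliased.
max_difference = max(abs(a - b) for a, b in zip(slow, fast))
```

After sampling, the two signals cannot be told apart (`max_difference` is zero up to rounding). In images, the same effect turns fine textures into low-frequency patterns that were never in the scene.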

Color Quantization

Real light is continuous.
Digital images use a finite number of values:

  • Typically 8-bit per channel (0–255)
  • RGB representation

This limits precision, especially in low-light conditions, where the signal spans only a few of the 256 available levels.
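
A small sketch of the effect in a dark scene. The 2% intensity range is an arbitrary illustrative value, and `quantize` is a hypothetical helper that ignores gamma encoding and sensor noise.

```python
def quantize(value, bits=8):
    """Map a continuous intensity in [0, 1] to an integer level in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * levels)

# A dark scene: 1000 distinct intensities between 0 and 2% of full scale.
dark = [0.02 * i / 999 for i in range(1000)]
distinct_levels = sorted({quantize(v) for v in dark})
```

The 1000 input intensities collapse onto the six levels 0 to 5, so gradients computed in such a region are coarse and sensitive to noise.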

From RGB to Perception

Different color spaces exist:

  • RGB (device-oriented)
  • HSV (more aligned with human perception)
    • Hue
    • Saturation
    • Value (brightness)

Useful in some tracking pipelines.
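
One reason HSV can help: a change in illumination mostly affects Value and leaves Hue untouched. A minimal sketch using Python's standard `colorsys` module; the two RGB triplets are made-up values for the same surface under bright and dim light.

```python
import colorsys

# The same orange surface under bright and dim light (illustrative RGB values in [0, 1]).
bright = (1.0, 0.5, 0.0)
dim = (0.5, 0.25, 0.0)

h_bright, s_bright, v_bright = colorsys.rgb_to_hsv(*bright)
h_dim, s_dim, v_dim = colorsys.rgb_to_hsv(*dim)
```

Both pixels share the same hue and saturation and differ only in value, so a color-based tracker that thresholds on hue is less affected by the lighting change.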

PART C — Projective Geometry (Core for Tracking)

Why Geometry Matters for Tracking

Tracking is fundamentally projective geometry.

We need to model:

  • How 3D points project onto a 2D image
  • How camera motion affects this projection

Intrinsic Parameters (K)

Describe the internal properties of the camera:

  • Focal length
  • Principal point
  • Pixel aspect ratio

Constant for a given camera as long as focus and zoom do not change; estimated once by calibration.

Extrinsic Parameters (R, t)

Describe the pose of the camera in the world:

  • Rotation matrix R
  • Translation vector t

👉 This is what tracking estimates.

The Projection Model (Simplified)

The core equation:


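s u = K [R | t] X

Where:

  • X = 3D point in the world, in homogeneous coordinates (x, y, z, 1)
  • u = 2D pixel in the image, in homogeneous coordinates
  • s = scale factor (the depth of the point in camera space); dividing by s gives the pixel coordinates
  • K = intrinsics
  • [R|t] = camera pose (tracking result)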

Intuition of the Projection

  1. Transform 3D point from world to camera space (R, t)
  2. Project onto image plane (K)
  3. Obtain pixel coordinates (u)

This is the mathematical backbone of visual tracking.
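
The three steps above can be sketched in a few lines. The intrinsics (focal length 800 px, principal point at (320, 240)) and the identity pose are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def project(K, R, t, X):
    """Project world point X (3 values) to pixel coordinates using K [R | t] X."""
    # 1. World -> camera space.
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # 2. Apply intrinsics.
    uvw = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # 3. Divide by depth to get pixel coordinates.
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # camera aligned with world axes
t = [0.0, 0.0, 0.0]

center = project(K, R, t, [0.0, 0.0, 5.0])    # on the optical axis
offset = project(K, R, t, [1.0, 0.0, 5.0])    # 1 m to the right, 5 m away
```

A point on the optical axis lands on the principal point, (320, 240). Moving it 1 m sideways at 5 m depth shifts it by 800 × 1 / 5 = 160 px, to (480, 240).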

PART D — Image Processing as a Precursor to Tracking

Why Image Processing for Tracking?

Three main goals:

  1. Reduce noise → more stable features
  2. Enhance structures → clearer edges/corners
  3. Normalize images → robustness across lighting changes

Grayscale Conversion

Most feature detectors work on grayscale images.

Classic formula:

Y = 0.299R + 0.587G + 0.114B

Reduces complexity while preserving structure.
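
A direct transcription of the formula, applied to a tiny hand-made image. The function names are illustrative; libraries provide optimized equivalents.

```python
def to_gray(r, g, b):
    """Luma of an 8-bit RGB pixel: Y = 0.299 R + 0.587 G + 0.114 B."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def image_to_gray(rgb_image):
    """Convert a 2D list of (r, g, b) tuples to a 2D list of gray levels."""
    return [[to_gray(*pixel) for pixel in row] for row in rgb_image]

tiny = [[(255, 255, 255), (0, 0, 0)],
        [(255, 0, 0), (0, 255, 0)]]
gray = image_to_gray(tiny)
```

Pure green (150) comes out about twice as bright as pure red (76), reflecting the larger weight given to the green channel.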

Histogram Equalization

Improves contrast:

  • Makes details more visible
  • Helps feature detection in low-contrast images

Useful in challenging lighting conditions.
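
A minimal sketch of global histogram equalization using the usual cumulative-distribution remapping. The input is a made-up low-contrast signal, stored as a flat list for brevity.

```python
def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray levels in [0, levels - 1]."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: number of pixels with a value <= v.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:          # single gray level: nothing to stretch
        return list(pixels)

    # Remap so that the cumulative distribution becomes roughly linear.
    return [round((cdf[p] - cdf_min) * (levels - 1) / (total - cdf_min)) for p in pixels]

low_contrast = [100, 100, 110, 110, 120, 120, 130, 130]
equalized = equalize(low_contrast)
```

The input only covers gray levels 100 to 130. After equalization the same pixels span the full range, [0, 0, 85, 85, 170, 170, 255, 255], which makes intensity differences easier to detect.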

Contrast and Brightness Adjustment

[Figure: contrast and brightness adjustment]

Convolution — The Key Tool

A sliding kernel is applied over the image.

Each pixel becomes a weighted sum of its neighbors.

This enables:

  • Blurring
  • Edge detection
  • Feature enhancement
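
A plain-Python sketch of 2D convolution, restricted to the region where the kernel fits entirely inside the image ("valid" output). It is written for readability, not speed; the 3×3 box kernel and the test image are illustrative.

```python
def convolve2d(image, kernel):
    """'Valid' 2D convolution of a 2D list with an odd-sized square kernel."""
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = []
    for y in range(r, h - r):
        row = []
        for x in range(r, w - r):
            acc = 0.0
            for j in range(k):
                for i in range(k):
                    # Convolution flips the kernel relative to the image window.
                    acc += kernel[j][i] * image[y + r - j][x + r - i]
            row.append(acc)
        out.append(row)
    return out

box = [[1 / 9] * 3 for _ in range(3)]          # 3x3 mean (box blur) kernel
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 9, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
blurred = convolve2d(image, box)
```

The single bright pixel is spread evenly over its neighborhood: every value of the 3×3 output is 1. Changing the kernel changes the operation; the loop stays the same.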

Simple Blur

[Figure: simple (box) blur]

Gaussian Blur

Used to:

  • Reduce high-frequency noise
  • Smooth the image before detecting features

Important for robust tracking.
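
A sketch of a Gaussian kernel and its effect on a noisy signal. It works on a single row because a 2D Gaussian blur is separable: it can be applied along rows, then along columns. The 3-sigma radius, the border handling and the sample values are assumptions made for this example.

```python
import math

def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel; the radius defaults to about 3 sigma."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_row(row, kernel):
    """Blur one image row, repeating the edge pixels at the borders."""
    r = len(kernel) // 2
    n = len(row)
    return [
        sum(kernel[k] * row[min(max(x + k - r, 0), n - 1)] for k in range(len(kernel)))
        for x in range(n)
    ]

kernel = gaussian_kernel_1d(sigma=1.0)
noisy_row = [10, 12, 9, 11, 10, 50, 10, 9, 11, 10]   # one outlier at index 5
smooth_row = blur_row(noisy_row, kernel)
```

The outlier is attenuated and spread over its neighbors. A larger sigma removes more noise but also blurs the corners that the detector needs, so the choice is a trade-off.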

Gradients and Edges

Edges correspond to strong intensity changes.

We compute gradients using filters such as:

  • Sobel
  • Prewitt

Edges help locate meaningful structures.

Sobel Filter

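Central difference: $f'(i)=\frac{f_{i+1}-f_{i-1}}{2}$, which is equivalent to convolving with the kernel $[\tfrac12,\ 0,\ -\tfrac12]$.

The Sobel filter combines this derivative along one axis with a $[1,\ 2,\ 1]$ smoothing along the other.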
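
A minimal sketch of the 3×3 Sobel kernels applied at one pixel. The kernels are written in correlation form (no flip) and left unnormalized, which is a common presentation; the step-edge image is a made-up test case.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]

def filter3x3(image, kernel, x, y):
    """Weighted sum of the 3x3 neighborhood of (x, y) (correlation, no kernel flip)."""
    return sum(kernel[j][i] * image[y + j - 1][x + i - 1]
               for j in range(3) for i in range(3))

def sobel_gradient(image, x, y):
    """Return (gx, gy) at an interior pixel."""
    return filter3x3(image, SOBEL_X, x, y), filter3x3(image, SOBEL_Y, x, y)

# Vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 10, 10] for _ in range(4)]
gx, gy = sobel_gradient(image, 1, 1)
```

On this vertical edge the horizontal gradient is strong (gx = 40) and the vertical one is zero (gy = 0): the gradient points across the edge, not along it.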

Laplacian (Second-Order Derivative)

Detects regions of rapid change:

  • Useful for feature detection
  • Highlights corners and fine details

The standard discrete Laplacian is written:

$\nabla^2 f(i,j)=f_{i+1,j}+f_{i-1,j}+f_{i,j+1}+f_{i,j-1}-4f_{i,j}$

This corresponds to the kernel:

$\begin{pmatrix}0 & 1 & 0\\ 1 & -4 & 1\\ 0 & 1 & 0\end{pmatrix}$

From Image Processing to Features

At this stage, we have a cleaned and enhanced image.

Next step: 👉 Detect meaningful points for tracking.

PART E — Feature Detection for Visual Tracking

What is a Feature?

A feature is a distinctive image pattern that can be reliably detected and matched across frames.

Examples:

  • Corners
  • Blobs
  • Edges

What Makes a Good Feature?

A good feature should be:

  • Local (robust to occlusion)
  • Invariant (to translation, rotation, scale)
  • Robust (to noise, lighting changes)
  • Distinctive (easy to match)
  • Repeatable (found again in next frame)

Why Corners are Ideal

Corners are:

  • Stable across viewpoints
  • Well-localized in both x and y
  • Highly informative for motion estimation

Moravec Corner Detector (1977)

Idea:

  • Measure how much intensity changes when shifting a small window in different directions.

Average intensity variation for a small displacement $(u,v)$:

$E(u,v)=\sum_{x,y} w(x,y)\,(I(x+u,y+v)-I(x,y))^2$

  • $w$ specifies the considered neighborhood (value 1 inside the window and 0 outside);
  • $I(x,y)$ is the intensity at pixel $(x,y)$.
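
A sketch of this measure on three synthetic patches. Taking the minimum of $E$ over a set of eight one-pixel shifts as the corner score is the usual formulation of Moravec's detector; it is stated here as an assumption, since the slide only defines $E$. The 3×3 window and the 7×7 test images are illustrative.

```python
SHIFTS = [(1, 0), (0, 1), (1, 1), (-1, 1), (-1, 0), (0, -1), (-1, -1), (1, -1)]

def intensity_variation(image, cx, cy, u, v, radius=1):
    """E(u, v) over a (2*radius+1)^2 window centered on (cx, cy), with w = 1 inside."""
    total = 0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            diff = image[y + v][x + u] - image[y][x]
            total += diff * diff
    return total

def moravec_score(image, cx, cy):
    """Corner score: the smallest variation over all tested shifts."""
    return min(intensity_variation(image, cx, cy, u, v) for u, v in SHIFTS)

def make_image(is_bright, size=7):
    return [[1 if is_bright(x, y) else 0 for x in range(size)] for y in range(size)]

corner = make_image(lambda x, y: x >= 3 and y >= 3)   # bright quadrant with its corner at (3, 3)
edge = make_image(lambda x, y: x >= 3)                # vertical edge through x = 3
flat = make_image(lambda x, y: False)

scores = {name: moravec_score(img, 3, 3)
          for name, img in [("corner", corner), ("edge", edge), ("flat", flat)]}
```

Only the corner gets a non-zero score. On the edge, shifting the window along the edge direction changes nothing, so the minimum is zero; in the flat patch every shift gives zero.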

Results

[Figure: Moravec detector results]

Limitation:

  • Not rotation invariant.

Harris Corner Detector (1988)

Improvement over Moravec:

  • Uses image gradients
  • More robust
  • Rotation invariant

Widely used in vision-based tracking.

Average intensity variation for a small displacement $(u,v)$

$E(x,y)=\sum_{u,v} w(u,v)\,|I(x+u,y+v)-I(x,y)|^2$

  • $w$ specifies the considered neighborhood (value 1 inside the window, 0 outside);
  • $I(x,y)$ is the intensity at pixel $(x,y)$.

Taylor expansion

$I(x+u,y+v)=I(x,y)+u\frac{\partial I}{\partial x}(x,y)+v\frac{\partial I}{\partial y}(x,y)+o(u^2,v^2)$

Neglecting $o(u^2,v^2)$:

$E(x,y)=\sum_{u,v} w(u,v)\left|u\frac{\partial I}{\partial x}(x,y)+v\frac{\partial I}{\partial y}(x,y)\right|^2$

Quadratic form

For small displacements $(u,v)$, the average intensity variation can be written as a quadratic form:

$E(x,y)=[u\ v]\,M\,[u\ v]^T$

$M$: symmetric, positive semi-definite $\Rightarrow$ eigenvalue decomposition possible, with real eigenvalues $\lambda_1,\lambda_2\ge 0$.

Structure of the second-moment matrix

$M=\begin{pmatrix}A & C\\ C & B\end{pmatrix}$

with:

$A=\left(\frac{\partial I}{\partial x}\right)^2 \otimes w,\quad B=\left(\frac{\partial I}{\partial y}\right)^2 \otimes w,\quad C=\left(\frac{\partial I}{\partial x}\frac{\partial I}{\partial y}\right)\otimes w$

$w$: Gaussian window (isotropic).

A corner is characterized by a large variation of $E$ for displacements $(u,v)$ in every direction, i.e. both eigenvalues of $M$ are large.

$\Rightarrow$ Compute the eigenvalues of $M$.

Corner response

Instead of computing the eigenvalues, we can compute:

$\det(M)=AB-C^2=\lambda_1\lambda_2$

$\text{trace}(M)=A+B=\lambda_1+\lambda_2$

and define the response:

$R=\det(M)-k\,\text{trace}(M)^2$

Values of $R$ (with $k$ an empirical constant, typically $k=0.04$):

  • positive near corners,
  • negative near edges,
  • small in magnitude in flat regions.

$\Rightarrow$ corners / interest points = local maxima of $R$.
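
A sketch of the response computed at a single pixel of three synthetic patches. Two simplifications are assumed: gradients use central differences, and the window $w$ is uniform rather than Gaussian. The 5×5 window and 9×9 test images are illustrative.

```python
def harris_response(image, cx, cy, radius=2, k=0.04):
    """Harris response R = det(M) - k * trace(M)^2 at pixel (cx, cy)."""
    A = B = C = 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Central-difference gradients.
            ix = (image[y][x + 1] - image[y][x - 1]) / 2
            iy = (image[y + 1][x] - image[y - 1][x]) / 2
            # Uniform window (w = 1); the lecture uses a Gaussian window instead.
            A += ix * ix
            B += iy * iy
            C += ix * iy
    det = A * B - C * C
    trace = A + B
    return det - k * trace * trace

def make_image(is_bright, size=9):
    return [[1.0 if is_bright(x, y) else 0.0 for x in range(size)] for y in range(size)]

corner = make_image(lambda x, y: x >= 4 and y >= 4)
edge = make_image(lambda x, y: x >= 4)
flat = make_image(lambda x, y: False)

r_corner = harris_response(corner, 4, 4)
r_edge = harris_response(edge, 4, 4)
r_flat = harris_response(flat, 4, 4)
```

The signs match the list above: the response is positive on the corner, negative on the edge (one eigenvalue is zero, so the determinant vanishes and only the trace term remains) and zero on the flat patch.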

Rotation invariant: rotating the image rotates the eigenvectors of $M$ but leaves its eigenvalues unchanged, so the response $R$ is unchanged.

Not scale invariant: seen at a larger scale through a fixed-size window, a corner can look like a set of edges.

Potential solution: compute Harris response at multiple scales (Harris–Laplace).

PART F — From Features to Tracking

Feature-Based Tracking Pipeline

Frame t:

  1. Detect features
  2. Describe features
  3. Match with previous frame
  4. Estimate motion (R, t)

Repeat continuously.

Feature Matching

Features are compared using descriptors.

Goal:

  • Find correspondences between frames

This enables motion estimation.
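
A minimal sketch of brute-force matching: each descriptor from the previous frame is paired with its nearest neighbor in the current frame, and pairs that are too far apart are rejected. The squared Euclidean distance, the threshold value and the toy 3-dimensional descriptors are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def match_descriptors(prev_desc, curr_desc, max_distance):
    """Pair each previous-frame descriptor with its nearest current-frame descriptor.

    Returns (index_prev, index_curr) pairs whose squared distance is at most max_distance.
    """
    if not curr_desc:
        return []
    matches = []
    for i, d_prev in enumerate(prev_desc):
        distances = [squared_distance(d_prev, d_curr) for d_curr in curr_desc]
        j = min(range(len(distances)), key=distances.__getitem__)
        if distances[j] <= max_distance:
            matches.append((i, j))
    return matches

# Toy 3-dimensional descriptors (real ones have tens to hundreds of dimensions).
previous = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
current = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [9.0, 9.0, 9.0]]
matches = match_descriptors(previous, current, max_distance=0.1)
```

The first two features are matched, to current-frame indices 1 and 0 respectively, while the third has no close counterpart and is dropped. Rejecting doubtful matches matters because a single wrong correspondence can corrupt the motion estimate.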

Example: SIFT (Scale-Invariant Feature Transform)

SIFT provides:

  • Scale invariance
  • Rotation invariance
  • Robust descriptors

Widely used in classical computer vision.

SIFT Pipeline

  1. Build Gaussian pyramid (multi-scale representation)
  2. Compute Difference of Gaussians (DoG)
  3. Detect keypoints
  4. Assign orientation
  5. Compute descriptor from local gradients

Build Gaussian pyramid, compute DoG and detect keypoints

To obtain scale-independent keypoints, the image is repeatedly blurred and downsampled (building a “pyramid”).

The Difference of Gaussians (DoG) approximates the Laplacian of Gaussian and is cheaper to compute at each level, since it only requires subtracting two adjacent blurred images.
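
A one-dimensional sketch of the DoG as a kernel. By linearity, subtracting two blurred images is the same as filtering once with the difference of the two Gaussian kernels. The values sigma = 1.6 and k = √2 are illustrative choices, and the kernel radius is an assumption made large enough to hold both Gaussians.

```python
import math

def gaussian(sigma, radius):
    """Normalized 1D Gaussian sampled on [-radius, radius]."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def difference_of_gaussians(sigma, k=2 ** 0.5, radius=12):
    """DoG kernel: G(k * sigma) - G(sigma)."""
    wide = gaussian(k * sigma, radius)
    narrow = gaussian(sigma, radius)
    return [a - b for a, b in zip(wide, narrow)]

dog = difference_of_gaussians(sigma=1.6)

# Response to a constant signal: the kernel weights sum to zero, so flat regions give no response.
flat_response = sum(w * 100.0 for w in dog)
```

Like the Laplacian, the DoG kernel has a central lobe of one sign surrounded by a ring of the opposite sign and ignores constant regions, which is why its extrema across position and scale serve as keypoint candidates.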

Assign orientation and compute descriptor

The descriptor is a histogram of local gradients around the keypoint, weighted by a Gaussian window.

PART G — From 2D Features to 6DoF Tracking

The PnP Problem (Perspective-n-Point)

If we know:

  • 3D world points
  • Their 2D projections in the image

We can solve for camera pose (R, t).

This is central to many AR systems.
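
PnP solvers look for the pose that minimizes the reprojection error: the distance between where the 3D points are observed and where the candidate pose predicts them. The sketch below only evaluates that cost for two candidate poses; it does not solve PnP. The intrinsics, the points and the poses are hypothetical, the rotation is fixed to identity, and pixel skew is assumed to be zero.

```python
def project(K, R, t, X):
    """Pinhole projection of world point X to pixel coordinates (zero skew assumed)."""
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    return (K[0][0] * Xc[0] / Xc[2] + K[0][2],
            K[1][1] * Xc[1] / Xc[2] + K[1][2])

def reprojection_error(K, R, t, points_3d, pixels):
    """Mean squared pixel distance between observed and predicted projections."""
    total = 0.0
    for X, (u_obs, v_obs) in zip(points_3d, pixels):
        u, v = project(K, R, t, X)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(points_3d)

# Hypothetical setup: known intrinsics, four known 3D points, observations made from true_t.
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points_3d = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
true_t = [0.2, -0.1, 5.0]
pixels = [project(K, I3, true_t, X) for X in points_3d]

error_at_truth = reprojection_error(K, I3, true_t, points_3d, pixels)
error_at_guess = reprojection_error(K, I3, [0.0, 0.0, 5.0], points_3d, pixels)
```

The error is zero at the true pose and grows as the candidate moves away from it. With real, noisy matches the minimum is no longer zero, and outliers are typically handled with a robust scheme such as RANSAC.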

Visual Odometry

If no 3D map is available:

  • Track motion between consecutive frames
  • Estimate relative movement of the camera

Used in SLAM systems.
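
Visual odometry obtains the current pose by chaining relative motions. The sketch below shows that chaining in a planar simplification (x, y, heading) rather than full 6DoF; the relative motions are made-up values standing in for frame-to-frame estimates.

```python
import math

def compose(pose, motion):
    """Apply a relative motion (dx, dy, dtheta), expressed in the camera frame, to a pose."""
    x, y, theta = pose
    dx, dy, dtheta = motion
    return (x + dx * math.cos(theta) - dy * math.sin(theta),
            y + dx * math.sin(theta) + dy * math.cos(theta),
            theta + dtheta)

# Hypothetical frame-to-frame estimates: move 1 m forward, then turn left by 90 degrees, four times.
relative_motions = [(1.0, 0.0, math.pi / 2)] * 4

pose = (0.0, 0.0, 0.0)
trajectory = [pose]
for motion in relative_motions:
    pose = compose(pose, motion)
    trajectory.append(pose)
```

With perfect estimates the camera traces a square and returns to its starting point. In practice each relative estimate carries a small error, and because the errors are chained they accumulate as drift, which is what a map and loop closure in SLAM help correct.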

SLAM in XR

Modern headsets use SLAM:

  • Simultaneous Localization and Mapping
  • Build a map of the environment
  • Track the headset within that map

Examples:

  • ARKit
  • ARCore
  • Meta Quest inside-out tracking

PART H — Tracking in Modern XR Systems

Inside-Out Tracking (Meta Quest)

Uses multiple cameras + IMU to:

  • Track headset position
  • Track controllers
  • Map environment

No external sensors or base stations are required: all sensing is on the headset.

Optical Tracking (Vicon/OptiTrack)

Uses external cameras to track reflective markers.

Very accurate but requires a dedicated setup.

Hybrid Tracking Systems

Combine:

  • Vision
  • IMU
  • Depth sensors

To improve robustness and reduce drift.