Algebraic Vision

Taken from a talk on 11/15/2024 by Jessie Loucks-Tavitas from Sac State.

1: What is Computer Vision?

Specifically from mathematicians' perspective. Two motivating questions:

Given cameras (positions, angles) + images (color data), recover the object: Triangulation
Given objects + images, recover the camera: Resectioning

You have two givens, can you find the third?

2: What is a pinhole camera?

This is the idea of a hole in a box projecting the world's image onto the back wall of the box. Mathematically, the camera is a map $A : R^{3} \to R^{2}$, where $(x, y, z) \mapsto (\frac{x}{z}, \frac{y}{z})$.

But notice that this map is not invertible, and if $z = 0$ then we're toast! It's also non-linear, so we can't use a matrix for it.
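A minimal Python sketch of the map above, using exact fractions. The function name `pinhole` and the sample points are made up for illustration and are not from the talk.

```python
from fractions import Fraction

def pinhole(point):
    """Pinhole map (x, y, z) -> (x/z, y/z); undefined when z == 0."""
    x, y, z = point
    if z == 0:
        raise ZeroDivisionError("point lies in the plane z = 0, so it has no image")
    return (Fraction(x) / z, Fraction(y) / z)

# Not invertible: every point on a ray through the origin has the same image.
same_image = pinhole((1, 2, 4)) == pinhole((2, 4, 8))

# Not linear: the image of a sum is not the sum of the images.
p, q = (1, 0, 1), (0, 1, 2)
image_of_sum = pinhole(tuple(a + b for a, b in zip(p, q)))
sum_of_images = tuple(a + b for a, b in zip(pinhole(p), pinhole(q)))
```

Here `same_image` is true, while `image_of_sum` and `sum_of_images` differ, which is exactly why no matrix on $R^{3}$ can represent the map.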

The other thing is that in 3D space two parallel lines don't intersect, but their images under $A$ may appear to (think of train tracks meeting at the horizon).
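The train-track picture can be checked numerically. This is a small sketch under made-up assumptions: two rails at $x = \pm 1$, one unit below the camera, both running in the $z$-direction. The helper names are hypothetical.

```python
from fractions import Fraction

def pinhole(point):
    """Pinhole map (x, y, z) -> (x/z, y/z), assuming z != 0."""
    x, y, z = point
    return (Fraction(x) / z, Fraction(y) / z)

def left_rail(z):
    return (-1, -1, z)

def right_rail(z):
    return (1, -1, z)

def image_gap(z):
    """Horizontal distance between the images of the two rails at depth z."""
    return pinhole(right_rail(z))[0] - pinhole(left_rail(z))[0]

# The rails stay 2 units apart in space, but the gap in the image is 2/z.
gaps = [image_gap(z) for z in (1, 10, 100, 1000)]
```

The gap shrinks toward 0 as $z$ grows, and both images approach $(0, 0)$, the point on the horizon where the rails seem to meet.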

The fix...

3: Perspective & Projective Geometry

The idea is that we look at this via perspective geometry, where we actually have these converging lines. This is in contrast to orthographic geometry, where lines that are parallel in $R^{3}$ stay parallel in the 2D image.

Projective space $P^{n}$, where $n = 2, 3$.

$P^{3} = \{(x : y : z : 1)\} \cup \{(x : y : z : 0)\}$

You can think of the left set as $R^{3}$ and the right set as the limit points (points at infinity) of $R^{3}$. The rules are:

$(\alpha x : \alpha y : \alpha z : \alpha w) = (x : y : z : w)$ where $\alpha \neq 0$.

At least one coordinate is non-zero, so $(0 : 0 : 0 : 0) \notin P^{3}$.
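The two rules can be turned into an equality test. This is a sketch only; `same_point` is a hypothetical helper. Two non-zero coordinate vectors name the same projective point exactly when they are proportional, which can be checked by cross-multiplying without any division.

```python
def same_point(p, q):
    """True if p and q are homogeneous coordinates of the same projective point."""
    if len(p) != len(q):
        raise ValueError("coordinate vectors must have the same length")
    if not any(p) or not any(q):
        raise ValueError("the zero vector is not a point of projective space")
    n = len(p)
    # p and q are proportional iff p[i] * q[j] == p[j] * q[i] for all i < j.
    return all(
        p[i] * q[j] == p[j] * q[i]
        for i in range(n)
        for j in range(i + 1, n)
    )
```

For example, `same_point((1, 2, 3, 1), (2, 4, 6, 2))` holds, while `(1, 2, 3, 1)` and `(1, 2, 3, 0)` are different points: one is in $R^{3}$ and the other is at infinity.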

For example, if a line through $\vec{0}$ in $R^{3}$ is:

$$l(t) = (x t : y t : z t : 1) = (x : y : z : \frac{1}{t}) \Rightarrow \lim_{t \to \infty} l(t) = (x : y : z : 0)$$

So the end point of this line is just this limit point $(x : y : z : 0)$ .
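As a concrete instance (the direction $(1, 2, 3)$ is picked only for illustration), rescaling every coordinate by $\frac{1}{t}$ shows the last coordinate shrinking to zero:

$$l(10) = (10 : 20 : 30 : 1) = (1 : 2 : 3 : \tfrac{1}{10}), \quad l(1000) = (1 : 2 : 3 : \tfrac{1}{1000}), \quad \lim_{t \to \infty} l(t) = (1 : 2 : 3 : 0)$$

Letting $t \to -\infty$ gives the same limit, because $\frac{1}{t} \to 0$ from either side. So both ends of the line meet at a single point at infinity.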

For $P^{2}$ it's the same idea, except with three coordinates instead of 4.

In projective space we get linearity back. The map $A : P^{3} \to P^{2}$ is linear in homogeneous coordinates and is given by the matrix:

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
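A short sketch of this matrix acting on homogeneous coordinates, to see that it reproduces $(\frac{x}{z}, \frac{y}{z})$. The helpers `apply` and `dehomogenize` and the sample points are assumptions made for illustration.

```python
from fractions import Fraction

# The camera matrix from the notes: drop the last homogeneous coordinate.
A = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]

def apply(matrix, vector):
    """Plain matrix-vector product on a coordinate representative."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dehomogenize(p):
    """(u : v : w) -> (u/w, v/w); returns None for points at infinity (w == 0)."""
    *coords, w = p
    if w == 0:
        return None
    return tuple(Fraction(c) / w for c in coords)

world = (2, 3, 4, 1)                 # the point (2, 3, 4) of R^3
image = apply(A, world)              # (2 : 3 : 4) in P^2
pixel = dehomogenize(image)          # (1/2, 3/4), i.e. (x/z, y/z)

# A point at infinity in the z-direction still has an image: the vanishing point.
vanishing = dehomogenize(apply(A, (0, 0, 1, 0)))

# The one point with no image is (0 : 0 : 0 : 1), which maps to the zero vector.
center_image = apply(A, (0, 0, 0, 1))
```

Rescaling the input, say to `(4, 6, 8, 2)`, rescales the output but leaves `pixel` unchanged, so the matrix is well defined on projective points away from $(0 : 0 : 0 : 1)$.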

4: Triangulation and Resectioning

Triangulation

Say we have $m$ cameras. Then we have a map $\bar{A} = (\bar{A_{1}}, \dots, \bar{A_{m}})$. A multiview configuration is a tuple of cameras capturing multiple scene points. The multiview variety of $\bar{A}$ is:

$$\Gamma_{\bar{A}, P}^{m, n} := \overline{\operatorname{im}(\bar{A})}$$

Here $n$ is the number of world-points that we get images of. The idea is $Γ$ is useful for reconstructing scene points.
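To make triangulation concrete, here is a sketch of a common linear approach, which is not necessarily the method from the talk. Each camera $P$ with image $(u, v)$ gives two linear equations on the homogeneous world point, and with exact data the point spans the null space of the stacked system. The two cameras, the world point, and all function names are assumptions for illustration.

```python
from fractions import Fraction

def project(P, X):
    """Apply a 3x4 camera matrix to the world point X = (x, y, z)."""
    Xh = list(X) + [1]
    u, v, w = (sum(p * x for p, x in zip(row, Xh)) for row in P)
    return (Fraction(u) / w, Fraction(v) / w)

def null_vector(rows):
    """Exact row reduction; returns the spanning vector of a 1-dim null space."""
    M = [[Fraction(a) for a in row] for row in rows]
    ncols, pivots, r = len(M[0]), [], 0
    for c in range(ncols):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        lead = M[r][c]
        M[r] = [a / lead for a in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == len(M):
            break
    free = [c for c in range(ncols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("null space is not one-dimensional")
    vec = [Fraction(0)] * ncols
    vec[free[0]] = Fraction(1)
    for row_index, c in enumerate(pivots):
        vec[c] = -M[row_index][free[0]]
    return vec

def triangulate(cameras, images):
    """Recover the world point seen at images[i] by cameras[i] (exact data)."""
    rows = []
    for P, (u, v) in zip(cameras, images):
        rows.append([a - u * c for a, c in zip(P[0], P[2])])  # (p1 - u*p3) . X = 0
        rows.append([b - v * c for b, c in zip(P[1], P[2])])  # (p2 - v*p3) . X = 0
    x, y, z, w = null_vector(rows)
    if w == 0:
        raise ValueError("reconstructed point is at infinity")
    return (x / w, y / w, z / w)

# Two hypothetical cameras: the matrix A from the notes, and a copy centered at (1, 0, 0).
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
P2 = [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]]
X_true = (2, 3, 4)
images = [project(P1, X_true), project(P2, X_true)]
X_recovered = triangulate([P1, P2], images)
```

With noisy images the system has no exact null vector, so in practice a least-squares or SVD-based solve would replace the exact row reduction used here.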

Resectioning

This is the meat and potatoes here. A hypercamera configuration is a tuple of world points $q_{1}, \dots, q_{n}$ captured by multiple cameras, $m$ of them. The goal here is to recover the camera structure.
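Resectioning for a single camera can be sketched the same linear way. This is a standard textbook-style setup and an assumption about how one might do it, not the talk's construction. Each world point with image $(u, v)$ gives two linear equations on the 12 entries of the unknown $3 \times 4$ camera matrix, which is determined only up to scale. The camera `P_true`, the cube-corner world points, and the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def project(P, X):
    """Apply a 3x4 camera matrix to the world point X = (x, y, z)."""
    Xh = list(X) + [1]
    u, v, w = (sum(p * x for p, x in zip(row, Xh)) for row in P)
    return (Fraction(u) / w, Fraction(v) / w)

def null_vector(rows):
    """Exact row reduction; returns the spanning vector of a 1-dim null space."""
    M = [[Fraction(a) for a in row] for row in rows]
    ncols, pivots, r = len(M[0]), [], 0
    for c in range(ncols):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        lead = M[r][c]
        M[r] = [a / lead for a in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == len(M):
            break
    free = [c for c in range(ncols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("degenerate configuration: camera not determined")
    vec = [Fraction(0)] * ncols
    vec[free[0]] = Fraction(1)
    for row_index, c in enumerate(pivots):
        vec[c] = -M[row_index][free[0]]
    return vec

def resection(points, images):
    """Recover a 3x4 camera matrix (up to scale) from exact point/image pairs."""
    rows = []
    zero = [0, 0, 0, 0]
    for X, (u, v) in zip(points, images):
        Xh = list(X) + [1]
        rows.append(Xh + zero + [-u * x for x in Xh])  # p1.X - u * p3.X = 0
        rows.append(zero + Xh + [-v * x for x in Xh])  # p2.X - v * p3.X = 0
    p = null_vector(rows)
    return [p[0:4], p[4:8], p[8:12]]

# Hypothetical ground truth: one camera and the 8 corners of the unit cube.
P_true = [[2, 0, 1, 3], [0, 2, 1, -1], [0, 0, 1, 4]]
points = list(product((0, 1), repeat=3))
images = [project(P_true, X) for X in points]

P_rec = resection(points, images)
# The camera is only defined up to scale, so rescale to match the last entry.
scale = Fraction(P_true[2][3]) / P_rec[2][3]
P_rescaled = [[scale * a for a in row] for row in P_rec]
```

After rescaling, `P_rescaled` matches `P_true`. The scale ambiguity is the same projective rule as before: a matrix and any non-zero multiple of it define the same camera.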

5: Duality

As with duality for vector spaces (see Chapter 3 (cont.) - Products and Quotients of Vector Spaces#3.F Duality), a lot of things have a notion of duality, graphs included!

This gets into Carlsson-Weinshall Duality.