Perception
Perception is the process that turns raw sensor data into a useful understanding of the environment.
Perception takes raw data such as:
- image pixels
- depth values
- point clouds
And extracts:
- objects
- positions
- shapes
- lines
Prerequisite: Image Geometry
Project a 3D point onto the image plane

Convert a point in the camera coordinate system, \((X, Y, Z)\), to image-plane coordinates \((u, v)\) using the intrinsic matrix \(K\):
\[
K =
\begin{bmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix}
\]
3D point in camera frame
\[
\mathbf{P} =
\begin{bmatrix}
X \\
Y \\
Z
\end{bmatrix}
\]
Matrix multiplication
\[
K \mathbf{P} =
\begin{bmatrix}
f_x & 0 & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
X \\
Y \\
Z
\end{bmatrix}
=\begin{bmatrix}
f_x X + c_x Z \\
f_y Y + c_y Z \\
Z
\end{bmatrix}
\]
Homogeneous image coordinate
\[
\begin{bmatrix}
\tilde{u} \\
\tilde{v} \\
\tilde{w}
\end{bmatrix}
=\begin{bmatrix}
f_x X + c_x Z \\
f_y Y + c_y Z \\
Z
\end{bmatrix}
\]
Convert homogeneous coordinate to pixel coordinate
The matrix multiplication gives a scaled form of the pixel location; the scale is removed by dividing by the third component.
\[
u = \frac{\tilde{u}}{\tilde{w}}, \qquad v = \frac{\tilde{v}}{\tilde{w}}
\]
Since \(\tilde{w} = Z\):
\[
u = \frac{f_x X + c_x Z}{Z} = \frac{f_x X}{Z} + c_x
\]
\[
v = \frac{f_y Y + c_y Z}{Z} = \frac{f_y Y}{Z} + c_y
\]
Homogeneous image coordinate
Instead of representing a pixel as \((u, v)\), we represent it as a 3-vector:
\[
\begin{bmatrix}
u \\
v \\
1
\end{bmatrix}
\]
Demo: Python code to compute \((u, v)\)
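A minimal sketch of the projection above in plain Python; the intrinsics (a typical 640x480 camera) and the 3D point are assumed example values:

```python
def project_point(X, Y, Z, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (u, v)."""
    if Z <= 0:
        raise ValueError("Point must be in front of the camera (Z > 0)")
    # u = fx * X / Z + cx,  v = fy * Y / Z + cy
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v

# Assumed example intrinsics and point
fx, fy, cx, cy = 525.0, 525.0, 320.0, 240.0
u, v = project_point(0.5, 0.25, 2.0, fx, fy, cx, cy)
print(u, v)  # 451.25 305.625
```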
Demo: from pixel -> camera -> world
Pixel -> Camera coordinate
Pixel \((u, v)\) in homogeneous form:
\[
\mathbf{p} =
\begin{bmatrix}
u \\
v \\
1
\end{bmatrix}
\]
Then
\[
\mathbf{P}_{cam} = Z \cdot K^{-1} \mathbf{p}
\]
Inverse of the \(K\) matrix:
\[
K^{-1} =
\begin{bmatrix}
1/f_x & 0 & -c_x/f_x \\
0 & 1/f_y & -c_y/f_y \\
0 & 0 & 1
\end{bmatrix}\]
Multiply the inverse by the homogeneous pixel vector \(\mathbf{p}\):
\[
K^{-1} p =
\begin{bmatrix}
(u-c_x)/f_x \\
(v-c_y)/f_y \\
1
\end{bmatrix}
\]
Point in camera coordinate system
\[
\mathbf{P}_{cam} = Z \, K^{-1} \mathbf{p}
\]
\[
P_{cam} =
\begin{bmatrix}
X_c \\
Y_c \\
Z_c
\end{bmatrix} =
\begin{bmatrix}
(u-c_x)Z/f_x \\
(v-c_y)Z/f_y \\
Z
\end{bmatrix}
\]
Camera to World coordinate system

The camera pose in the world is a homogeneous transform built from rotation \(R\) and translation \(t\):
\[
T =
\begin{bmatrix}
R & t \\
0 & 1
\end{bmatrix}
\]
Then
\[
\mathbf{P}_{world} = R \cdot \mathbf{P}_{cam} + t
\]
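The pixel -> camera -> world chain above can be sketched in plain Python; the intrinsics, \(R\), and \(t\) are assumed example values (identity rotation, camera 1 m above the world origin):

```python
def pixel_to_camera(u, v, Z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth Z into the camera frame."""
    Xc = (u - cx) * Z / fx
    Yc = (v - cy) * Z / fy
    return [Xc, Yc, Z]

def camera_to_world(P_cam, R, t):
    """Apply P_world = R * P_cam + t."""
    return [
        sum(R[i][j] * P_cam[j] for j in range(3)) + t[i]
        for i in range(3)
    ]

fx, fy, cx, cy = 525.0, 525.0, 320.0, 240.0   # assumed example intrinsics
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]         # assumed: no rotation
t = [0.0, 0.0, 1.0]                           # assumed: camera 1 m above origin

P_cam = pixel_to_camera(451.25, 305.625, 2.0, fx, fy, cx, cy)
P_world = camera_to_world(P_cam, R, t)
print(P_cam)    # [0.5, 0.25, 2.0]
print(P_world)  # [0.5, 0.25, 3.0]
```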
Demo: Using ROS
To get a 3D world point from an image pixel, you need:
- u, v
- camera intrinsics: fx, fy, cx, cy
- depth Z
- TF from camera_optical_frame to world
ROS Node using camera_info and TF
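The full node listing is not preserved here; as a substitute, the sketch below shows only the computation such a node performs, with the camera_info intrinsics and the TF transform hardcoded as assumed example values. In a real node, fx, fy, cx, cy come from the `sensor_msgs/CameraInfo` K matrix (row-major: `K[0]`, `K[4]`, `K[2]`, `K[5]`) and \(T\) from a tf2 lookup of camera_optical_frame -> world.

```python
def pixel_to_world(u, v, Z, fx, fy, cx, cy, T):
    """Back-project pixel (u, v) at depth Z, then apply the 4x4 transform T."""
    # Pixel -> camera frame: P_cam = Z * K^-1 * p, in homogeneous form
    p_cam = [(u - cx) * Z / fx, (v - cy) * Z / fy, Z, 1.0]
    # Camera -> world frame via T = [[R, t], [0, 1]]
    return [sum(T[i][j] * p_cam[j] for j in range(4)) for i in range(3)]

# Assumed example values standing in for camera_info and TF
fx, fy, cx, cy = 525.0, 525.0, 320.0, 240.0
T = [  # identity rotation, camera 1 m above the world origin (assumed)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(pixel_to_world(451.25, 305.625, 2.0, fx, fy, cx, cy, T))
# [0.5, 0.25, 3.0]
```

The ROS wiring (subscribing to camera_info, a tf2_ros buffer and listener) is omitted so the geometry stays self-contained.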