Project 4: [Auto]Stitching Photo Mosaics
Tej Bade, tbade12@berkeley.edu
Part A: Image Warping and Mosaicing
Taking the Pictures
The first step is taking pictures of a scene from different perspectives while keeping the same center of projection.
We then identify feature points, or correspondences, between the two images; these will be used to warp one image into the coordinate space of the other. The images with labeled correspondences are shown below.
In the following sections, im1 refers to the left image, im2 refers to the right image, and im1_corr and im2_corr refer to the correspondence points of im1 and im2 respectively.
Recovering Homographies
The homography matrix is a 3×3 matrix of the form

$$H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix}$$

that will transform the points im1_corr to the points im2_corr. Specifically, if $(x, y)$ and $(x', y')$ is a correspondence pair, then we want

$$H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = w \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}$$

for some scalar $w$. Manipulating the equations that result from this setup gives us a new matrix equation:

$$\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix}$$

Thus, every correspondence pair gives us 2 rows in the leftmost matrix (yielding 2 equations in 8 variables). Having more than 4 correspondence points will yield an overdetermined system with 8 variables and more than 8 equations. In this case, we can use the method of linear least squares to find the optimal values of $a, b, \ldots, h$ that best satisfy the matrix equation and construct the homography matrix $H$.
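A minimal sketch of this least-squares setup (the helper name compute_homography and the (N, 2) point layout are illustrative assumptions, not necessarily the code used here):

```python
import numpy as np

def compute_homography(im1_corr, im2_corr):
    """Estimate the 3x3 homography mapping im1_corr onto im2_corr
    via linear least squares. Points are (N, 2) arrays of (x, y)."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(im1_corr, im2_corr):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    # Solve the (possibly overdetermined) system A h = b for the 8 unknowns.
    h, *_ = np.linalg.lstsq(np.array(A, dtype=float),
                            np.array(b, dtype=float), rcond=None)
    return np.append(h, 1).reshape(3, 3)  # bottom-right entry fixed to 1
```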
Warping the Images
We are given two images im1 and im2 representing different perspectives of the same scene and want to produce a “stitched image.” We use the correspondences to compute the homography matrix H from im1 to im2. We then warp the corners of im1 using H to find the polygon that defines warped_im1, the warped version of im1 within the coordinate space of im2. Since the coordinates of the pixels in the polygon can be negative, we shift the entire image in the x and y directions so that all coordinates are positive. For each pixel in this polygon, we shift the coordinates back and apply the inverse of H to get the corresponding pixel in im1 to interpolate from (I used scipy.interpolate.RegularGridInterpolator with method = ‘linear’ for the interpolation). Finally, I use im2, warped_im1, and the x and y shifts to stitch the images together. In this process, im2 is essentially placed on top of warped_im1 without any weighted averaging. The result of this method using the first pair of images from the “Taking the Pictures” section is shown below.
Notice that we can easily tell that im2 is placed on top of warped_im1 from the edge artifacts. This is resolved in the “Blending the Images into a Mosaic” section.
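A sketch of the inverse-warping step described above (warp_image, out_shape, and shift are hypothetical names, and filling out-of-bounds pixels with zeros is an assumption):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def warp_image(im1, H, out_shape, shift):
    """Inverse-warp im1 through homography H into an output canvas.
    shift = (sx, sy) is the translation that made all warped coordinates
    positive; we shift back before applying H^-1."""
    h_out, w_out = out_shape
    sx, sy = shift
    # Output pixel grid, shifted back into im2's coordinate space.
    xs, ys = np.meshgrid(np.arange(w_out) - sx, np.arange(h_out) - sy)
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    # Map each output pixel back into im1 and divide out the scale w.
    src = np.linalg.inv(H) @ pts
    src = src[:2] / src[2]
    # Interpolate im1 at the (row, col) source locations.
    interp = RegularGridInterpolator(
        (np.arange(im1.shape[0]), np.arange(im1.shape[1])), im1,
        method="linear", bounds_error=False, fill_value=0)
    return interp(np.stack([src[1], src[0]], axis=-1)).reshape(h_out, w_out, -1)
```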
Image Rectification
Image rectification refers to taking a picture that contains an object known to be rectangular and warping the image so that the object is a rectangle that aligns with the x- and y-axes of the image. I used the two images below.
To do this, I picked the 4 corners of the rectangular object as correspondences, found the dimensions of the object by taking the distances between correspondence points, and warped the image with the method in the previous section. The homography matrix is computed by taking the object corners as "im1_corr" and the points $(0, 0), (w, 0), (0, h), (w, h)$ as "im2_corr", where $w$ and $h$ are the dimensions of the object. The original images are shown below on the left (along with the corners of the object) and the rectified images are shown below on the right.
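For illustration, a sketch of this setup (the corner coordinates below are made-up example values; compute_homography is the helper sketched earlier):

```python
import numpy as np

# Hypothetical corner clicks in (x, y) order:
# top-left, top-right, bottom-left, bottom-right.
corners = np.array([[120, 80], [530, 95], [110, 600], [545, 615]])
w = int(np.linalg.norm(corners[1] - corners[0]))  # top edge length
h = int(np.linalg.norm(corners[2] - corners[0]))  # left edge length
target = np.array([[0, 0], [w, 0], [0, h], [w, h]])
H = compute_homography(corners, target)  # then warp as in the previous section
```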
Blending the Images into a Mosaic
The warping technique used before works well, but there are noticeable edge artifacts. In other words, it is obvious that we are just placing im2 on top of warped_im1. To blend the images better, I use cv2.distanceTransform to create an "alpha mask", where each pixel inside the boundary of an image is set to its distance from the nearest image edge. The images below show the masks for warped_im1 and im2 for the first pair of images in the "Taking the Pictures" section.
Then, I normalize the masks and use them to compute a weighted average of warped_im1 and im2. This produces the final blended image below.
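A minimal sketch of this blending step (cv2.distanceTransform is the real OpenCV call; the function names and the assumption that im2 has already been placed on a canvas the size of warped_im1 are mine):

```python
import cv2
import numpy as np

def distance_alpha(inside_mask):
    """Alpha mask: each pixel inside the image footprint gets its L2
    distance to the nearest edge. inside_mask is uint8, 1 inside, 0 outside."""
    return cv2.distanceTransform(inside_mask, cv2.DIST_L2, 5)

def blend(warped_im1, im2_canvas, alpha1, alpha2):
    """Per-pixel weighted average using the normalized distance masks."""
    total = alpha1 + alpha2
    total[total == 0] = 1  # avoid division by zero outside both footprints
    w1 = (alpha1 / total)[..., None]
    w2 = (alpha2 / total)[..., None]
    return w1 * warped_im1 + w2 * im2_canvas
```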
Shown below are the results for the two other pairs of images.
Part B: Feature Matching for Autostitching
The goal of this part is to complete all of the steps in part A without manually choosing correspondence points.
Detecting Features
I first use skimage.feature.corner_harris and skimage.feature.peak_local_max to compute the Harris corner measure responses and get the Harris corners for the image. I set threshold_rel to 0.01 in the peak_local_max function (since otherwise the entire image would be filled with corners) and discard corners near the image edges. Shown below are the resulting Harris corners on the two Soda Hall images from part A.
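A sketch of this detection step (corner_harris and peak_local_max are the real skimage functions; the 20-pixel edge margin is an assumed value):

```python
import numpy as np
from skimage.feature import corner_harris, peak_local_max

def get_harris_corners(im_gray, edge_discard=20):
    """Return the Harris response map and (row, col) corner coordinates,
    dropping corners within edge_discard pixels of the border."""
    h_map = corner_harris(im_gray)
    coords = peak_local_max(h_map, threshold_rel=0.01)
    r, c = coords[:, 0], coords[:, 1]
    keep = ((r > edge_discard) & (r < im_gray.shape[0] - edge_discard) &
            (c > edge_discard) & (c < im_gray.shape[1] - edge_discard))
    return h_map, coords[keep]
```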
Notice that there are a lot of clumps containing feature points that are very close to each other. Using Adaptive Non-Maximal Suppression (ANMS), we can reduce the total number of features while keeping them spread apart. ANMS works by first calculating the suppression radius $r_i$ for each point $\mathbf{x}_i$ as follows:

$$r_i = \min_j \lVert \mathbf{x}_i - \mathbf{x}_j \rVert \quad \text{s.t.} \quad f(\mathbf{x}_i) < c_{\text{robust}} \, f(\mathbf{x}_j),$$

where $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert$ is the Euclidean distance between the points $\mathbf{x}_i$ and $\mathbf{x}_j$, $f(\mathbf{x})$ is the Harris corner response of point $\mathbf{x}$, and $c_{\text{robust}}$ controls how much stronger a neighbor must be to suppress a point. Then, we choose the top 100 points with the largest suppression radii. Shown below are the 100 points for both of the Soda Hall images.
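A naive O(n²) sketch of ANMS under these definitions (the helper name is mine, and c_robust = 0.9 is the value suggested in the original MOPS paper, not necessarily the one used here):

```python
import numpy as np

def anms(coords, h_map, n_keep=100, c_robust=0.9):
    """Keep the n_keep corners with the largest suppression radii.
    coords are (row, col) corner locations; h_map is the Harris response."""
    f = h_map[coords[:, 0], coords[:, 1]]
    radii = np.full(len(coords), np.inf)
    for i in range(len(coords)):
        stronger = f[i] < c_robust * f  # points that dominate point i
        if stronger.any():
            dists = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = dists.min()
    return coords[np.argsort(-radii)[:n_keep]]
```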
Extracting Feature Descriptors
We then assign a descriptor for each feature, which contains information about the pixels around that feature point. For each point, we take the 40 by 40 patch centered at that point, downsample it to an 8 by 8 patch, and flatten it into a 64-dimensional vector. This gives us 100 64-dimensional vectors for each image that we can use to match features.
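A sketch of this extraction, assuming the patch is subsampled by taking every 5th pixel (the write-up does not say whether the patch is smoothed before downsampling, so this version subsamples directly):

```python
import numpy as np

def extract_descriptors(im_gray, coords):
    """One 64-vector per (row, col) corner: a 40x40 patch centered on the
    corner, downsampled to 8x8 and flattened."""
    descriptors = []
    for r, c in coords:
        patch = im_gray[r - 20:r + 20, c - 20:c + 20]
        descriptors.append(patch[::5, ::5].ravel())  # 8x8 -> 64 values
    return np.array(descriptors)
```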
Feature Matching
For each point $p$ in the first image, we find the points $q_1$ and $q_2$ in the second image with the closest and second-closest feature vectors (using Euclidean distance as the metric). We can then employ Lowe's trick for rejecting possible outliers. If $d(p, q)$ is the squared Euclidean distance between the feature vectors of points $p$ and $q$, then we only keep the match if $d(p, q_1) / d(p, q_2)$ is below a threshold. I set the threshold to 0.7. Doing this gives us the following feature matches for the Soda Hall images:
Notice that there are still some outliers.
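A brute-force sketch of this ratio test (match_features is a hypothetical name):

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.7):
    """Keep a match only when the nearest neighbor in feature space is
    sufficiently closer than the second nearest (Lowe's trick)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.sum((desc2 - d) ** 2, axis=1)  # squared distances
        nn1, nn2 = np.argsort(dists)[:2]
        if dists[nn1] / dists[nn2] < ratio:
            matches.append((i, nn1))
    return matches
```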
Robust Homographies
To further reject the outliers, we can implement the RANSAC method. For each iteration, we randomly sample four feature points, compute the homography from those points, and find the number of inliers among the features, where an inlier is a point in one image that, when projected using the homography, is close to its corresponding point in the other image. We set a threshold for how close the projected point has to be to the corresponding point and keep track of the largest set of inliers across the iterations. The resulting inliers from the method are shown below on the Soda Hall images.
Finally, to autostitch the images, we use the inlier set from the RANSAC method as our correspondences for im1 and im2 in the stitching/blending method from part A.
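A sketch of this RANSAC loop (the iteration count and pixel threshold are assumed values; compute_homography is the least-squares helper sketched in part A):

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=1000, eps=2.0):
    """4-point RANSAC: fit H to random samples and keep the largest
    inlier set. pts1/pts2 are matched (N, 2) arrays of (x, y)."""
    best_inliers = np.zeros(len(pts1), dtype=bool)
    for _ in range(n_iters):
        idx = np.random.choice(len(pts1), 4, replace=False)
        H = compute_homography(pts1[idx], pts2[idx])
        # Project pts1 through H and measure distance to pts2.
        proj = H @ np.vstack([pts1.T, np.ones(len(pts1))])
        proj = (proj[:2] / proj[2]).T
        inliers = np.linalg.norm(proj - pts2, axis=1) < eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```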
Final Results
The following pairs of images consist of the mosaic made with manually chosen correspondences on the left (these are the images computed in part A) alongside the autostitched mosaics on the right.
Reflection
I think the coolest things I learned were the different techniques to reject feature outliers, including Lowe's thresholding and the RANSAC method. It is very interesting that we can automatically find feature matches between two images even when the information around multiple corners in an image is similar.