Detectors and descriptors
Lecture 10: Detectors and descriptors
This lecture is about detectors and descriptors, which are the basic
building blocks for many tasks in 3D vision and recognition. We’ll discuss
some of the properties of detectors and descriptors and walk through
examples.
• Properties of detectors
  • Edge detectors
  • Harris
  • DoG
• Properties of descriptors
  • SIFT
  • HOG
  • Shape context

Silvio Savarese, Lecture 10, 16-Feb-15
From the 3D to 2D & vice versa
Previous lectures have been dedicated to characterizing the mapping
between 2D images and the 3D world. Now, we’re going to put more focus
on inferring the visual content in images.
A point P = [x, y, z] in the 3D world maps to a point p = [x, y] in the image. Let's now focus on the 2D image.
How to represent images?
The question we will be asking in this lecture is: how do we represent images? There are many basic ways to do this. We can just characterize
them as a collection of pixels with their intensity values. Another, more
practical, option is to describe them as a collection of components or
features which correspond to “interesting” regions in the image such as
corners, edges, and blobs. Each of these regions is characterized by a
descriptor which captures the local distribution of certain photometric
properties such as intensities, gradients, etc.
The big picture…
Feature extraction and description is the first step in many recent vision
algorithms. They can be considered as building blocks in many scenarios
where it is critical to:
1) Fit or estimate a model that describes an image or a portion of it
2) Match or index images
3) Detect objects or actions from images
Pipeline: feature detection (e.g., DoG) → feature description (e.g., SIFT) → estimation, matching, indexing, detection.
Estimation
Here are some examples we’ve seen in the class before. In earlier
lectures, we took for granted that we could extract out keypoints to fit a
line to. This lecture will discuss how some of those keypoints are found
and utilized.
Courtesy of TKK Automation Technology Laboratory
Estimation
This is an example where detectors/descriptors are used for estimating a
homographic transformation.
Estimation
Here’s an example using detectors/descriptors for matching images in
panorama stitching.
Matching
Here's another example from earlier in the class. When we were looking at epipolar geometry and the fundamental matrix F relating an image pair (image 1 and image 2), we just assumed that we had matching point correspondences across images. Keypoint detectors and descriptors allow us to find these keypoints and related matches in practice.
Object modeling and detection
We can also use detectors/descriptors for object detection. For example, later in this lecture, we discuss the shape context descriptor and describe how it can be used for solving a shape matching problem.
Notice that in this case, descriptors and their locations typically capture local information and don't take into account the spatial or temporal organization of semantic components in the image. This is usually achieved in a subsequent modeling step, when an object or action needs to be represented.
Edge detection
Identifies sudden changes in an image. What causes an edge?
• Depth discontinuity
• Surface orientation discontinuity
• Reflectance discontinuity (i.e., change in surface material properties)
• Illumination discontinuity (e.g., highlights, shadows)
The first type of feature detectors that we will examine today are edge
detectors.
What is an edge? An edge is defined as a region in the image where there
is a “significant” change in the pixel intensity values (or high contrast)
along one direction in the image, and almost no changes in the pixel
intensity values (or low contrast) along its orthogonal direction.
Why do we see edges in images? Edges occur when there are
discontinuities in illumination, reflectance, surface orientation, or depth in
an image. Although edge detection is an easy task for humans, it is often
very difficult and ambiguous for computer vision algorithms. In particular,
an edge can be induced by:
• Depth discontinuity
• Surface orientation discontinuity
• Reflectance discontinuity (i.e., change in surface material properties)
• Illumination discontinuity (e.g., highlights; shadows)
We will now examine ways for detecting edges in images regardless of where they come from in the 3D physical world.
Edge Detection
• Criteria for optimal edge detection (Canny 86):
An ideal edge detector would have good detection accuracy. This means
that we want to minimize false positives (detecting an edge when we don’t
actually have one) and false negatives (missing real edges). We would
also like good localization (our detected edges must be close to the real
edges) and single response (we only detect one edge per real edge in the
image).
– Good detection accuracy: minimize the probability of false positives (detecting spurious edges caused by noise) and of false negatives (missing real edges).
– Good localization: edges must be detected as close as possible to the true edges.
– Single response constraint: minimize the number of local maxima around the true edge (i.e., the detector must return a single point for each true edge point).
• Examples (the red line is the true edge):
Here are some examples of the edge detector properties mentioned in the
previous slide. In each of these cases, the red line represents a real edge.
The blue, green and cyan edges represent edges detected by non-ideal
edge detectors. The situation in blue shows an edge detector that does
not perform well in the presence of noise (low accuracy). The next edge
detector (green) does not locate the real edge very well. The last situation (cyan) shows a detector without the single response property: we detected 3 possible edges for one real edge.
Edge Detection
[Figure: three failure cases: poor robustness to noise, poor localization, too many responses.]
Designing an edge detector
• Two ingredients:
  • Use derivatives (in the x and y directions) to define a location with high gradient.
  • Use smoothing to reduce noise prior to taking the derivative.
There are two important parts to designing an edge detector. First, we will use derivatives in the x and y directions to define an area with high gradient (high contrast in the image). This is where edges are likely to lie. The second component of an edge detector is that we need some smoothing to reduce noise prior to taking the derivative. We don't want to detect an edge any time there are small amounts of noise (derivatives are very sensitive to noise).
This image shows the two components of an edge detector mentioned in
the last slide (smoothing to make it more robust to noise and finding the
gradient to detect an edge) in action. Here, we are looking at a 1-d slice of
an edge.
Designing an edge detector
At the very top, in plot f, we see a plot of the intensities of this edge. The
very left contains dark pixels. The very right has bright pixels. In the
middle, we can see a transition from black to white that occurs around
index 1000. In addition, we can see that the whole row is fairly noisy:
there are a lot of small spikes within the dark and bright regions.
In the next plot, we see a 1-d Gaussian kernel (labeled g) to low pass filter
the original image with. We want a low pass filter to smooth out the noise
that we have in the original image.
The next plot shows the convolution between the original edge and the
Gaussian kernel. Because the Gaussian kernel is a low pass filter, we see
that the new edge has been smoothed out considerably. If you are
unfamiliar with filtering or convolution, please refer to the CS 131 notes.
$\frac{d}{dx}(f * g)$   [Eq. 1]   $= \frac{dg}{dx} * f$   [Eq. 2]   = "derivative of Gaussian" filter
Source: S. Seitz
The last plot shows the derivative of the filtered edge. Because we are working with a 1-d slice, the gradient is simply the derivative in the x direction. We can use this as a tool to locate where the edge is (at the maximum of this derivative). As a side note, because of the linearity of gradients and convolution, we can rearrange the final result as d/dx (f ∗ g) = (d/dx g) ∗ f (Equations 1 and 2 are equivalent), where "∗" represents convolution.
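To make this concrete, here is a minimal 1-D sketch of the derivative-of-Gaussian idea in Python (assuming NumPy and SciPy); the noisy step signal and the value of σ are made up for illustration.

```python
# 1-D sketch of Eq. 1 / Eq. 2: smooth a noisy step edge with a Gaussian and
# locate the edge at the maximum of the derivative of the smoothed signal.
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
f = np.concatenate([np.zeros(500), np.ones(500)]) + 0.1 * rng.standard_normal(1000)

sigma = 10.0
# Two equivalent views of the same operation (linearity of convolution);
# they should agree up to small numerical differences.
smoothed_then_diff = np.gradient(gaussian_filter1d(f, sigma))      # d/dx (f * g)
deriv_of_gaussian = gaussian_filter1d(f, sigma, order=1)            # f * dg/dx

edge_location = np.argmax(np.abs(deriv_of_gaussian))
print(edge_location)  # close to index 500, where the step occurs
```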
In 2D, the process is very similar. We smooth by first convolving the image with a 2D Gaussian filter. We denote this by Eq. 3 (convolving the image with a Gaussian filter to get a smoothed-out image). The 2D Gaussian is defined for your convenience in Eq. 4. This still has the property of smoothing out noise in the image. Then, we take the gradient of the smoothed image. This is shown in Eq. 5, where Eq. 6 is the definition of the 2D gradient for your convenience. Finally, we check for high responses, which indicate edges.
Edge detector in 2D
• Smoothing:
  $I' = g(x, y) * I$   [Eq. 3],   where   $g(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$   [Eq. 4]
• Derivative:
  $S = \nabla(g * I) = (\nabla g) * I = \begin{bmatrix} g_x * I \\ g_y * I \end{bmatrix} = \begin{bmatrix} S_x \\ S_y \end{bmatrix}$ = gradient vector   [Eq. 5]
  where   $\nabla g = \begin{bmatrix} \partial g / \partial x \\ \partial g / \partial y \end{bmatrix} = \begin{bmatrix} g_x \\ g_y \end{bmatrix}$   [Eq. 6]
Canny Edge Detection (Canny 86)
See CS131A for details
[Figure: the original image and Canny results at two different σ values.]
• The choice of σ depends on the desired behavior:
  – large σ detects large-scale edges
  – small σ detects fine features
A popular edge detector that is built on this basic premise (smooth out
noise, then find gradient) is the Canny Edge Detector. Here, we vary the
standard deviation of the Gaussian σ to define how granular we want our
edge detector to be. For more details on the Canny Edge Detector, please
see the CS 131A notes.
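As a rough illustration, the snippet below runs OpenCV's Canny detector after Gaussian smoothing at two different σ values; the input file name and the hysteresis thresholds are placeholders.

```python
# Sketch: Canny edge detection at two smoothing levels with OpenCV.
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Small sigma keeps fine detail; large sigma keeps only large-scale edges.
fine = cv2.Canny(cv2.GaussianBlur(img, (0, 0), sigmaX=1.0), 50, 150)
coarse = cv2.Canny(cv2.GaussianBlur(img, (0, 0), sigmaX=3.0), 50, 150)

cv2.imwrite("canny_fine.png", fine)
cv2.imwrite("canny_coarse.png", coarse)
```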
Other edge detectors:
Some other edge detectors include the Sobel, Canny-Deriche, and Differential edge detectors. We won't cover these in class, but they all have different trade-offs in terms of accuracy, granularity, and computational complexity.
- Sobel
- Canny-Deriche
- Differential
Corner/blob detectors
• Repeatability: the same feature can be found in several images despite geometric and photometric transformations.
• Saliency: each feature is found at an "interesting" region of the image.
• Locality: a feature occupies a "relatively small" area of the image.
Edges are useful as local features, but corners and small areas (blobs)
are generally more helpful in computer vision tasks. Blob detectors can be
built by extending the basic edge detector idea that we just discussed.
We judge the efficacy of corner or blob detectors on three metrics:
repeatability (can we find the same corners under different geometric and
photometric transformations), saliency (how “interesting” this corner is),
and locality (how small of a region this blob is).
Repeatability: illumination invariance, scale invariance, pose invariance (rotation, affine)
Ideally, we want our corner detector to consistently find the same corners
in different images. In this example, we have detected the corner of this
cow’s ear. We would like our corner detector to be able to detect the same
corner regardless of lighting conditions, scale, perspective shifts, and
pose variations.
Saliency and locality
Our corner/blob detector should also pick up on interesting (salient)
keypoints. In the top image, an example of an area with poor saliency is
indicated by the dashed bounding box. There is not much texture or
interesting substance in this area.
We also would like our interesting features to have good locality. That is, a feature should take up a small portion of the whole image. If our blob detector returned the whole image (or nearly the whole image), e.g., the region indicated by the dashed bounding box, that would defeat the purpose of keypoint detection.
Here is an example of some corners that are detected in an image.
Corner detectors
Harris corner detector
One of the more popular corner detectors is the Harris Corner Detector.
We will discuss this briefly in the next few slides. Please see CS 131
notes for a detailed treatment of the subject.
C. Harris and M. Stephens, "A Combined Corner and Edge Detector," Proceedings of the 4th Alvey Vision Conference, pp. 147-151.
See CS131A for details
Harris Detector: Basic Idea
Explore intensity changes within a window as the window changes location.
• "Flat" region: no change in any direction.
• "Edge": no change along the edge direction.
• "Corner": significant change in all directions.
The basic idea of the Harris Corner Detector is to slide a small window
across an image and observe changes of intensity values of the pixels
within that window. If we slide our window in a flat region (as shown on the
left) we shouldn’t get much variation of intensity values as we slid the
window. If we moved along an edge, we would see a large variation as we
slid perpendicular to the edge, but not parallel to it. At a corner, we would
see large variation in any direction.
Results
Here is an example result from the Harris Corner detector. It detected
small corners throughout the cow figurine and was even able to pick up
many of the same corners in an image with much different illumination.
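A minimal OpenCV sketch of the Harris detector is shown below; the file name, the block size, the Sobel aperture, the Harris parameter k, and the response threshold are all placeholder choices.

```python
# Sketch: Harris corner detection with OpenCV.
import cv2
import numpy as np

img = cv2.imread("cow.png", cv2.IMREAD_GRAYSCALE)
response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

# Keep locations whose corner response is a sizable fraction of the maximum.
corners = np.argwhere(response > 0.01 * response.max())
print(len(corners), "corner candidates")
```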
Now we will talk a bit about general blob detectors.
Blob detectors
Blob detectors build on a basic concept from edge detection. This is the
series of pictures from our earlier discussion about edge detection. From
top to bottom, the plots are our edge, the Gaussian kernel, the edge
convolved with the kernel, and the first derivative of the smoothed out
edge.
Edge detection (recap): the signal f, the Gaussian kernel g, the smoothed signal f ∗ g, and its derivative d/dx (f ∗ g). Source: S. Seitz.
Edge detection
$\frac{d^2}{dx^2}(f * g)$   [Eq. 7]
$= f * \frac{d^2 g}{dx^2}$ = "second derivative of Gaussian" filter = Laplacian of the Gaussian   [Eq. 8]
We can extend the basic idea of an edge detector. Here we show the
second derivative of the smoothed edge in the last plot. This is equivalent
to the Laplacian of the Gaussian filter (second derivative in this one
dimensional example) convolved with the original edge, shown on the
next slide.
Edge detection as zero crossing
[Figure: the edge f, the second derivative of the Gaussian $\frac{d^2 g}{dx^2}$, and their convolution $f * \frac{d^2 g}{dx^2}$.]
Here, we can see that the result from convolving the second derivative of
the Gaussian filter (second plot) with the edge (first plot) is the same as
convolving the original Gaussian filter and taking the second derivative.
Similar to the previous case where we were examining the first derivative,
the second derivative and convolution are both linear operations and
therefore interchangeable. Thus, Eq. 7 from the previous slide and Eq. 8
are equivalent statements.
We also see that the zero crossing of the final plot (Laplacian of Gaussian
convolved with edge) corresponds well with the edge in the original
image.
Laplacian: $\frac{d^2 g}{dx^2}$
Edge = zero crossing of the second derivative   [Eq. 8]
Edge detection as zero crossing
Now, we know that convolving the Laplacian of Gaussian with an edge
gives us a zero crossing where an edge occurs. Here is an example
where the original image has two edges; if we convolve it with the
Laplacian of Gaussian, we get two zero crossings as expected.
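The following small Python sketch (assuming NumPy and SciPy) reproduces this behavior numerically: a signal with two edges is convolved with the second derivative of a Gaussian, and the zero crossings of the response land near the two edges. The signal and σ are illustrative.

```python
# Sketch: edge detection as zero crossings of the Laplacian-of-Gaussian response.
import numpy as np
from scipy.ndimage import gaussian_filter1d

f = np.concatenate([np.zeros(300), np.ones(400), np.zeros(300)])  # two edges

response = gaussian_filter1d(f, sigma=8.0, order=2)  # f * d^2g/dx^2

# Zero crossings: sign changes where the response actually varies
# (the magnitude check ignores numerically flat regions far from the edges).
sign_change = np.sign(response[:-1]) * np.sign(response[1:]) < 0
significant = np.abs(np.diff(response)) > 0.05 * np.abs(response).max()
crossings = np.where(sign_change & significant)[0]
print(crossings)  # near indices 300 and 700
```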
From edges to blobs
• Can we use the Laplacian to find a blob (a RECT function)?
[Figure: convolving the Laplacian kernel with RECT functions of decreasing width; as the width shrinks, the response develops a single maximum at the center.]
The magnitude of the Laplacian response achieves a maximum at the center of the blob, provided the scale of the Laplacian is "matched" to the scale of the blob.
Now, let’s generalize from the edge detector to the blob detector. The
question is: can we use the Laplacian to find a blob or, more formally
speaking, a RECT function? Let’s see what happens if we change the
scale (or width) of the blob. The central panel shows the result of
convolving the Laplacian kernel with a blob with smaller scale. The right
panel shows the result of convolving the Laplacian kernel with a blob with
an even smaller scale. As we decrease the scale, the zero crossings of
the filtered blob come closer together until they superimpose to produce a
peak in the response curve: the magnitude of the Laplacian response
achieves a maximum at the center of the blob, provided the scale of the
Laplacian is “matched” to the scale of the blob.
So what if the blob is slightly thicker or slimmer? What if we don’t know
the exact size of the blob (which is what happens in practice)? We discuss
a solution to this problem on the next couple of slides.
What if the blob is slightly thicker or slimmer?
Scale selection
Convolve the signal with Laplacians at several sizes and look for the maximum response.
To be able to identify blobs of all different sizes, we can convolve our
candidate blob with Laplacians at difference scales and only keep the
scale where we achieve the maximum response. How do we obtain
Laplacians with different scales? By varying the variance σ of the
Gaussian kernel. The figure shows that the scale of the Laplacian kernel
increases as we increase σ.
Scale normalization
• To keep the energy of the response the same, we must multiply the Gaussian derivative by σ.
• The Laplacian is the second Gaussian derivative, so it must be multiplied by σ².
Normalized Laplacian:   $\sigma^2 \frac{d^2 g}{dx^2}$,   where   $g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$
To be able to compare responses at different scale, we must normalize
the energy of the responses to each Gaussian kernel. This means that
each Gaussian derivative must be multiplied by sigma and each Laplacian
must be multiplied by sigma^2 so that the responses from different kernels
are comparable (calibrated).
We define the characteristic scale as the scale that produces maximum
response at the location of the blob. In this image, we show our original
candidate blob signal on the left. On the right, we show its Laplacian
response after scale normalization for a series of different sigmas. We see
that the maximum response comes at sigma = 8, so we say the
characteristic scale is at sigma = 8.
Characteristic scale
We define the characteristic scale as the scale that produces the peak of the Laplacian response.
The concept of characteristic scale was introduced by Lindeberg in 1998.
[Figure: the original signal and its scale-normalized Laplacian response for σ = 1, 2, 4, 8, 16; the maximum occurs at σ = 8.]
T. Lindeberg (1998). "Feature detection with automatic scale selection." International Journal of Computer Vision 30(2): pp. 77-116.
Here we see the same results when the Laplacian kernels we convolve
the blob with are not normalized. Notice that, because the energy of the
kernel decreases as sigma increases, the response tapers off and the
response for sigma=8 is significantly lower than before.
Characteristic scale
Here is what happens if we don't normalize the Laplacian:
[Figure: the same original signal and its unnormalized Laplacian response for σ = 1, 2, 4, 8, 16; σ = 8 should give the maximum response, but no longer does.]
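Here is a short 1-D sketch of characteristic-scale selection (assuming NumPy and SciPy): a box-shaped blob is convolved with scale-normalized Laplacians of Gaussians at several σ values, and the σ giving the strongest response at the blob center is kept. The blob width and the σ sweep are made up.

```python
# Sketch: pick the characteristic scale of a 1-D blob via scale-normalized
# Laplacian-of-Gaussian responses (sigma^2 * d^2g/dx^2).
import numpy as np
from scipy.ndimage import gaussian_filter1d

signal = np.zeros(400)
signal[180:220] = 1.0          # a blob of half-width 20
center = 200

sigmas = [1, 2, 4, 8, 16, 32]
responses = [abs(s**2 * gaussian_filter1d(signal, s, order=2)[center]) for s in sigmas]

best = sigmas[int(np.argmax(responses))]
print("characteristic scale:", best)  # should be comparable to the blob half-width
```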
In 2D, the same idea applies. Here, we use a 2D Gaussian distribution to
design the normalized 2D Laplacian kernel (Eq. 9).
Blob detection in 2D
• Laplacian of Gaussian: a circularly symmetric operator for blob detection in 2D.
Scale-normalized:   $\nabla^2_{\text{norm}} g = \sigma^2 \left( \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \right)$   [Eq. 9]
Just like in the 1-D case, we define the scale with the maximum response as the characteristic scale. It is possible to prove that for a binary circle of radius r, the Laplacian achieves a maximum at $\sigma = r/\sqrt{2}$.
Scale selection
• For a binary circle of radius r, the Laplacian achieves a maximum at $\sigma = r/\sqrt{2}$.
[Figure: the image of a binary circle and its Laplacian response as a function of scale σ, peaking at $r/\sqrt{2}$.]
Scale-space blob detector
This slide summarizes our process of finding blobs at different scales.
First, we convolve the image with scale-normalized Laplacian of Gaussian
filters. Next, we find the maximum response to the scale normalized
Laplacian. If the max response is above a threshold, the location in the
image where the max response appears returns the location of the blob,
and the corresponding scale returns the scale of the blob.
1. Convolve the image with scale-normalized Laplacians at several scales.
2. Find the maxima of the squared Laplacian response in scale-space.
The maxima indicate that a blob has been detected and what its intrinsic scale is.
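For a concrete 2D example, the sketch below uses scikit-image's Laplacian-of-Gaussian blob detector, which performs this search over scales; the file name and parameter values are placeholders.

```python
# Sketch: scale-space blob detection with scikit-image's LoG detector.
import numpy as np
from skimage import io, color
from skimage.feature import blob_log

image = color.rgb2gray(io.imread("sunflowers.jpg"))

# Each row is (row, col, sigma); the blob radius is roughly sqrt(2) * sigma.
blobs = blob_log(image, min_sigma=2, max_sigma=30, num_sigma=10, threshold=0.1)
radii = blobs[:, 2] * np.sqrt(2)
print(len(blobs), "blobs detected")
```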
Here’s an example of a blob detector run on an image.
Scale-space blob detector: Example
Difference of Gaussians (DoG)
David G. Lowe. "Distinctive image features from scale-invariant keypoints." IJCV 60(2), 2004.
• Approximating the Laplacian with a difference of Gaussians:
$L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)$   (Laplacian)   [Eq. 10]
$\text{DoG} = G(x, y, 2\sigma) - G(x, y, \sigma)$   (difference of Gaussians with scales 2σ and σ)   [Eq. 11]
In general:
$\text{DoG} = G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 G = (k - 1)\,L$   [Eq. 12]
This is the response after convolving the image with a single Laplacian of
Gaussian kernel (with σ = 11.9912). Different responses can be found if σ
is smaller or bigger.
After convolving with many different sizes of Laplacian of Gaussian
kernels, we can threshold the image and see where blobs are detected.
This image shows all of the blobs detected over several scales.
The scale normalized Laplacian of Gaussian (Eq. 10) works well if one
wants to detect blobs, but computing a Laplacian can be very
computationally expensive. In practice, we often use the difference of two
Gaussian distributions at different variances (e.g. 2×σ and σ) to
approximate a Laplacian (Eq. 11). Eq. 12 illustrates this approximation for
a generic scale k. In the plot, the blue curve is a Laplacian and the red
curve is the difference of Gaussian approximation which illustrates that
the approximation given in Eq. 12 is fairly close. The difference of
Gaussians (DoG) scheme is used in the SIFT feature detector proposed by D. Lowe in 2004.
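A quick numerical check of this approximation is sketched below (assuming OpenCV and NumPy; the file name, σ, and k are placeholders): the DoG response is compared against σ² times the Laplacian of the blurred image.

```python
# Sketch of Eq. 11 / Eq. 12: the difference of two Gaussian-blurred copies of an
# image approximates the scale-normalized Laplacian-of-Gaussian response,
# up to the (k - 1) factor.
import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma, k = 2.0, 2.0
g1 = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)
g2 = cv2.GaussianBlur(img, (0, 0), sigmaX=k * sigma)

dog = g2 - g1                                   # DoG response, Eq. 11
log = sigma**2 * cv2.Laplacian(g1, cv2.CV_32F)  # sigma^2 * Laplacian of the blurred image

# The two responses should be strongly correlated.
print(np.corrcoef(dog.ravel(), log.ravel())[0, 1])
```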
There are many extensions to this basic blob detection scheme. A popular one, by Mikolajczyk and Schmid in 2004, is to make the detection invariant to affine transformations in addition to scale and to introduce the similar concept of characteristic shape.
Affine invariant detectors
K. Mikolajczyk and C. Schmid, Scale and Affine invariant interest point detectors,
IJCV 60(1):63-86, 2004.
Similarly to characteristic scale, we can define the
characteristic shape of a blob
Properties of detectors

Detector          Illumination  Rotation  Scale  Viewpoint
Lowe '99 (DoG)    Yes           Yes       Yes    No
We can summarize the feature detectors we have seen so far along with their properties (i.e., robustness to illumination changes, rotation changes, scale changes, and viewpoint variations) in a table. For instance, Lowe's Difference of Gaussian blob detector is invariant to shifts in illumination, rotation, and scale (but not viewpoint).
Here we see a more extensive list of feature detectors along with relevant
properties.
Properties of detectors

Detector                        Illumination  Rotation  Scale         Viewpoint
Lowe '99 (DoG)                  Yes           Yes       Yes           No
Harris corner                   Yes           Yes       No            No
Mikolajczyk & Schmid '01, '02   Yes           Yes       Yes           Yes
Tuytelaars '00                  Yes           Yes       No (Yes '04)  Yes
Kadir & Brady '01               Yes           Yes       Yes           No
Matas '02                       Yes           Yes       Yes           No
Until this point, we have been concerned with finding interesting keypoints
in an image. Now, we will describe some of the properties of keypoints
and how to describe them with descriptors.
Let’s take a step back and look at the big picture. After detecting keypoints
in images, we need ways to describe them so that we can compare
keypoints across images or use them for object detection or matching.
Pipeline (recap): feature detection (e.g., DoG) → feature description (e.g., SIFT) → estimation, matching, indexing, detection.
Properties
Depending on the application, a descriptor must incorporate information that is:
• Invariant w.r.t.:
  • Illumination
  • Pose
  • Scale
  • Intra-class variability
• Highly distinctive (allows a single feature to find its correct match with good probability in a large database of features)
As we did for feature detectors, let’s analyze some of the properties we
want a feature descriptor to have. These include the ability of the
descriptor to be invariant with respect to illumination, pose, scale, and
intraclass variability. We also would like our descriptors to be highly
distinctive, which allows a single feature to find its correct match with good
probability in a large database of features. In the next few slides, we are
going to run through some descriptors and discuss how well each
descriptor meets these standards.
The simplest, naïve descriptor we could make is just a 1 × NM vector w
of pixel intensities that are obtained by considering a small window (say N
× M pixels) around the point of interest.
The simplest descriptor
w = [ … ]   (a 1 × NM vector of pixel intensities from the N × M window)
We could also normalize this vector to have zero mean and norm 1 to
make it invariant to affine illumination transformations (see Eq. 13).
Normalized vector of intensities
$w_n = \frac{w - \bar{w}}{\| w - \bar{w} \|}$   [Eq. 13]
This makes the descriptor invariant with respect to affine transformations of the illumination condition.
Illumination normalization
• Affine intensity change:   $w \rightarrow a\,w + b$   [Eq. 14]
$w_n = \frac{w - \bar{w}}{\| w - \bar{w} \|}$
• Making each patch zero mean removes b.
• Making it unit variance removes a.
What does “affine illumination change” mean? It means that the w before
and after the illumination change are related by Eq. 14. So the effect of
imposing the constraint that the mean of w is zero corresponds to
removing b and the effect of imposing the constraint that the variance of w
is 1, corresponds to removing a.
This method is simple, but it has many drawbacks. It is very sensitive to
location, pose, scale, and intra-class variability. In addition, it is poorly
distinctive. That is, keypoints that are described by this normalized
illumination vector may be easily (mis)matched even if they are not
actually related.
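A minimal Python sketch of this normalized-patch descriptor is given below; the window size and keypoint location are arbitrary, and a random image stands in for real data.

```python
# Sketch of Eq. 13 / Eq. 14: flatten a window around a keypoint, subtract the
# mean, and divide by the norm; this cancels an affine intensity change
# w -> a*w + b.
import numpy as np

def patch_descriptor(image, row, col, half=8):
    w = image[row - half:row + half, col - half:col + half].astype(np.float64).ravel()
    w = w - w.mean()                  # removes b
    return w / np.linalg.norm(w)      # removes a

img = np.random.default_rng(0).random((100, 100))
d1 = patch_descriptor(img, 50, 50)
d2 = patch_descriptor(3.0 * img + 10.0, 50, 50)  # affine illumination change
print(np.allclose(d1, d2))  # True: the descriptor is unchanged
```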
Why can't we just use this?
• Sensitive to small variations in:
  • location
  • pose
  • scale
  • intra-class variability
• Poorly distinctive
For instance, this slide illustrates an example where a similarity metric
based on cross-correlation is sensitive to small pose variations. As
already analyzed in lecture 6, the normalized cross-correlation (NCC)
(blue plot in the figure) may be used to solve the correspondence problem
in rectified stereo pairs – that is, the problem of measuring the similarity
between a patch on the left and corresponding patches in the right image
for different values of u along the scanline (dashed orange line). The u
value that corresponds to the max value of NCC is shown by the red
vertical line and indicates the location of the matched corresponding patch
(which is correct in this case). As we apply a very minor rotation to the
patch on the left, the corresponding NCC as function of u is shown in
green (dashed line). Notice that the max value of NCC is now found at a different value of u (which is not correct in this case) and, overall, the green
dashed line no longer provides a meaningful measurement of similarity.
Sensitive to pose variations
Normalized cross-correlation (NCC):
$w_n \cdot w'_n = \frac{(w - \bar{w}) \cdot (w' - \bar{w}')}{\| w - \bar{w} \| \, \| w' - \bar{w}' \|}$
[Figure: NCC as a function of u along the scanline, before and after a small rotation of the left patch.]
Properties of descriptors

Descriptor  Illumination  Pose  Intra-class variab.
PATCH       Good          Poor  Poor
Similar to what we introduced for feature detectors, we can summarize
descriptors along with their properties (i.e., robustness to illumination
changes, pose variations and intra-class variability) in a table. A descriptor
based on normalized pixel intensity values, indicated as patch, is robust
to illumination variations (when normalized), but not robust against pose
and intra-class variation.
An alternative approach is to use a filter bank to generate a descriptor around the detected keypoint in the image. The idea is to record the responses to the different filters of the filter bank as our descriptor. In the example in the figure, the filter bank comprises 4 "Gabor-like" filters (two horizontal and two vertical). The results of convolving the image (which depicts a horizontal texture in this example) with each of these filters are shown on the right and are denoted as the 4 filter responses. The descriptor associated with, say, the keypoint shown in yellow in the original image is obtained by concatenating the responses at the same pixel location (also shown in yellow) across the 4 filter responses. In this example the descriptor is 1 × 4 dimensional because we have 4 filters in the filter bank. In general, the dimension of the descriptor is equal to the number of filters in the filter bank. A descriptor based on filter banks can be designed to be computationally efficient and less sensitive to viewpoint transformations than the "patch" descriptor is. This concept led to the GIST descriptor proposed by Oliva and Torralba in 2001.
Bank of filters
[Figure: image ∗ filter bank = filter responses; the descriptor concatenates the responses at the keypoint location.]
More robust, but still quite sensitive to pose variations.
http://people.csail.mit.edu/billf/papers/steerpaper91FreemanAdelson.pdf
A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 2001.
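Below is a rough Python sketch of the filter-bank idea (assuming NumPy and SciPy); the tiny derivative filters merely stand in for the Gabor-like filters of the example, and the keypoint location is arbitrary.

```python
# Sketch: a filter-bank descriptor recording, at the keypoint location,
# one response per filter in the bank.
import numpy as np
from scipy.ndimage import convolve

horizontal = np.array([[-1, -1], [1, 1]], dtype=float)   # responds to horizontal structure
vertical = horizontal.T                                   # responds to vertical structure
bank = [horizontal, -horizontal, vertical, -vertical]     # 4 filters, as in the example

def filter_bank_descriptor(image, row, col):
    # One number per filter: the filter response at the keypoint pixel.
    return np.array([convolve(image, f)[row, col] for f in bank])

img = np.random.default_rng(1).random((64, 64))
print(filter_bank_descriptor(img, 32, 32))  # a 1 x 4 descriptor
```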
We summarize these results in the table.
Properties of descriptors

Descriptor  Illumination  Pose    Intra-class variab.
PATCH       Good          Poor    Poor
FILTERS     Good          Medium  Medium
SIFT descriptor
David G. Lowe. "Distinctive image features from scale-invariant keypoints." IJCV 60(2), 2004.
• Alternative representation for image regions
• Location and characteristic scale s given by the DoG detector
A very popular descriptor that is widely used in many computer vision applications, from matching to object recognition, is the SIFT descriptor, which was proposed by David Lowe in 1999. SIFT is often used to describe the image around a keypoint that is detected using the Difference of Gaussian detector. Because DoG returns the characteristic scale s of the keypoint, SIFT is typically computed within a window W of size s.
The SIFT descriptor is computed by following these steps:
1. Compute the gradient at each pixel within the window W.
2. Divide the window into N × N rectangular areas (bins); a bin is indicated by a black square in the image. In each bin i, we compute a histogram h_i of the orientations of the gradients of the pixels in the bin. That is, within each bin, we keep track of how many gradient orientations fall between 0 and θ_1, between θ_1 and θ_2, and so on. This count of orientations within each area forms the basis for the SIFT descriptor vector. Typically M = 8, which means we are discretizing the orientations from 0 to 45 degrees, from 45 to 90 degrees, etc.
3. Concatenate h_i for i = 1 to N² to form a 1 × MN² vector H.

SIFT descriptor (summary):
• Location and characteristic scale s given by the DoG detector
• Compute the gradient at each pixel
• N × N spatial bins
• Compute a histogram h_i of M orientations for each bin i
• Concatenate h_i for i = 1 to N² to form a 1 × MN² vector H
• Normalize to unit norm
• Gaussian center-weighting
Typically M = 8 and N = 4, so H is a 1 × 128 descriptor.
4. Normalize H to unit norm, i.e., divide H by its norm.
5. Reweight each histogram h_i by a Gaussian centered at the center of the window W and with a σ proportional to the size of W.
The SIFT descriptor is very popular because it is robust to small variations
in all of the categories we have mentioned. It is robust to changes in
illumination because we are calculating the direction of gradients in
normalized blobs. It is invariant to small changes in pose and small intraclass variations because we are binning the direction of gradients (rather
than keeping the exact values) in our orientation histograms. It is invariant
to scale because we are using scale-normalized DOGs.
SIFT descriptor
• Robust w.r.t. small variations in:
  • Illumination (thanks to gradients and normalization)
  • Pose (small affine variations, thanks to the orientation histograms)
  • Scale (scale is fixed by DoG)
  • Intra-class variability (small variations, thanks to the histograms)
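In practice one rarely implements SIFT by hand; the sketch below uses OpenCV (which includes SIFT in recent releases) to detect DoG keypoints, compute 128-dimensional SIFT descriptors, and match them between two images with Lowe's ratio test. The file names and the ratio threshold are placeholders.

```python
# Sketch: DoG keypoints + SIFT descriptors + descriptor matching with OpenCV.
import cv2

img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("image2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # 128-dimensional descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Ratio test: keep matches whose best distance is much smaller than the second best.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "good matches")
```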
Rotational invariance
• Find the dominant orientation by building an orientation histogram (over gradient orientations from 0 to 2π).
• Rotate all orientations by the dominant orientation.
Finally, it is possible to make SIFT rotationally invariant. This is done by
finding the dominant gradient orientation (e.g., black arrow in the figure)
by computing a histogram of gradient orientations within the entire window
– the peak of such histogram (shown by the red arrow) returns the
dominant gradient orientation. Once such dominant gradient orientation is
computed, we can rotate all the gradients in the window such that the
dominant gradient orientation is aligned along a direction that is fixed
beforehand (e.g. the vertical direction).
This makes the SIFT descriptor rotationally invariant.
Properties of descriptors

Descriptor  Illumination  Pose    Intra-class variab.
PATCH       Good          Poor    Poor
FILTERS     Good          Medium  Medium
SIFT        Good          Good    Medium
SIFT's properties are summarized and compared to the other descriptors in this table.
HoG = Histogram of Oriented Gradients
Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
An extension of the SIFT is the histogram of oriented gradients (HOG)
which has become commonly used for describing objects in object
detection tasks. The gradients of the HOG descriptor are sampled on a
dense, regular grid around the object of interest and gradients are
contrast normalized in overlapping blocks. For details, please refer to the
original paper Histograms of Oriented Gradients for Human Detection by
Dalal and Triggs.
• Like SIFT, but:
  – sampled on a dense, regular grid around the object
  – gradients are contrast-normalized in overlapping blocks
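A short sketch of computing a HOG descriptor with scikit-image follows; the file name and the cell, block, and orientation-bin sizes are placeholder choices.

```python
# Sketch: HOG descriptor over a dense grid with scikit-image.
from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("pedestrian.png"))

features = hog(image,
               orientations=9,           # orientation bins per cell
               pixels_per_cell=(8, 8),   # dense, regular grid of cells
               cells_per_block=(2, 2),   # contrast normalization in overlapping blocks
               block_norm="L2-Hys")
print(features.shape)
```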
Another popular descriptor is the shape context descriptor. It is very
popular for optical character recognition (among other things). To use the
shape context descriptor, we first use our favorite key-point detector to
locate all of the keypoints in an image. Then, for each keypoint, we build
circular bins of different radius with the keypoint at the center of the bins.
We then divide the circular bins by angle as well, as shown in the image.
We then count how many keypoints fall within each bin and use this as
our descriptor for the keypoint at the center.
Shape context descriptor
Belongie et al. 2002
The figure shows an example of such a histogram computed for a keypoint on the letter "A". Notice that the 13th bin of the histogram contains 3 keypoints.
[Figure: the histogram of occurrences within each bin, plotted against the bin number; the 13th bin has a count of 3.]
Shape context descriptor
Courtesy of S. Belongie and J. Malik
[Figure: two letters "A" with shape context descriptors (descriptor 1, descriptor 2, descriptor 3) computed at three keypoints.]
Here is an example of the shape context descriptor and the corresponding representations for a few of the keypoints. The image shows 2 letters A (which are similar up to some degree of intra-class variability). Notice that shape context descriptors associated with keypoints located in similar regions of the letter A look very similar (descriptors 1 and 2), whereas shape context descriptors associated with keypoints located in different regions of the letter A look different (descriptors 1 and 3).
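The following is a rough Python sketch of a shape context histogram for a single keypoint, using log-polar (radius, angle) bins; the 5 × 12 binning follows the convention of Belongie et al., while the sample points are random placeholders.

```python
# Sketch: shape context histogram for one keypoint, counting how many of the
# other sampled points fall into each log-polar (radius, angle) bin around it.
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    center = points[index]
    others = np.delete(points, index, axis=0) - center
    r = np.linalg.norm(others, axis=1)
    theta = np.arctan2(others[:, 1], others[:, 0]) % (2 * np.pi)

    # Log-spaced radial bin edges between the smallest and largest distance.
    r_edges = np.logspace(np.log10(r.min() + 1e-9), np.log10(r.max()), n_r + 1)
    r_bin = np.clip(np.digitize(r, r_edges) - 1, 0, n_r - 1)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)

    hist = np.zeros((n_r, n_theta), dtype=int)
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel()  # a 1 x 60 descriptor for 5 x 12 bins

pts = np.random.default_rng(2).random((50, 2))
print(shape_context(pts, 0).sum())  # 49: every other point lands in some bin
```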
Other detectors/descriptors
• HOG: Histogram of Oriented Gradients. Dalal & Triggs, 2005.
• SURF: Speeded Up Robust Features. Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, "SURF: Speeded Up Robust Features", Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359, 2008.
• FAST (corner detector). Rosten, Machine Learning for High-speed Corner Detection, 2006.
• ORB: an efficient alternative to SIFT or SURF. Ethan Rublee, Vincent Rabaud, Kurt Konolige, Gary R. Bradski: ORB: An efficient alternative to SIFT or SURF. ICCV 2011.
• FREAK: Fast Retina Keypoint. A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast Retina Keypoint. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. CVPR 2012 Open Source Award Winner.
Next lecture:
Image Classification by Deep Networks by Ronan Collobert (Facebook)
Here is a list of some other popular detector/descriptors. They all have
trade-offs in computational complexity, accuracy, and robustness to
different kinds of noise. Which one is the best to use is dependent on your
application and what is important to your performance.