COMPUTER VISION, NEURAL NETWORKS

Advantages of FCNN Over CNN - Part 1

April 12, 2022

The Fourier domain is used in computer vision and machine learning because image-processing tasks can be carried out there just as in the spatial domain, only with different operations. Convolutional Neural Networks (CNNs) are a type of neural network that achieves state-of-the-art results in a variety of computer vision applications. One of CNNs' key limitations is the computational cost of updating a large number of convolution parameters.

Implementing convolutional neural networks in the frequency domain rather than the spatial domain is a great way to reduce the time complexity and to obtain a global representation of the image at every layer. Before we dive into the Fourier Convolutional Neural Network (FCNN), let's briefly discuss the classical CNN and its drawbacks.

Regular CNNs

In regular CNNs, the convolutional operations are far more computationally expensive than the non-linearities, pooling, and fully connected operations. The convolution is normally conducted with a traditional sliding-window approach across the whole input image, applying a kernel function at each position. Convolving the input image with a kernel filter gives us a feature representation whose receptive field covers only neighboring pixels; that is, each layer captures only local information.
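To make this concrete, here is a minimal NumPy sketch of the sliding-window approach (the 28 x 28 image and 3 x 3 kernel are illustrative assumptions); note how each output value depends only on a small neighborhood of the input:

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel across the image; each output pixel depends only
    # on a local neighborhood (its receptive field). As in CNNs, this is
    # cross-correlation: the kernel is not flipped.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)   # toy input image
kernel = np.random.rand(3, 3)    # 3 x 3 filter
features = conv2d(image, kernel)
print(features.shape)            # (26, 26): the output shrinks without padding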

So, how does CNN integrate local information across the whole image?

The answer is: by stacking multiple layers. The more layers we have, the more spatial context is integrated, and more spatial context generally means greater accuracy; but it also increases the computational cost.

Since the kernel is applied at every pixel, the cost grows quickly as we add layers, which means CNNs are often not viable for computer vision tasks on large images. Another major drawback is that, without padding, the output shrinks after each convolution, so we lose edge information. The Fourier transformation offers a way to address these issues.

Fourier transformation

The Fourier transform states that any signal, in this case any image, can be expressed as a sum of sinusoids and cosinusoids; in other words, it is a series expansion over orthogonal sine and cosine basis functions.

Let's say we have a square N x N image f(u, v) in the spatial domain. Taking its 2D discrete Fourier transform (DFT) gives

$$F(k,l) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} f(u,v)\, e^{-i 2\pi \left(\frac{k u}{N} + \frac{l v}{N}\right)}$$

and the inverse transform is

$$f(u,v) = \frac{1}{N^2} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} F(k,l)\, e^{i 2\pi \left(\frac{k u}{N} + \frac{l v}{N}\right)}$$

where the exponential term is the basis function. The spatial-domain image is thus migrated onto the Fourier domain, and each point F(k, l) in Fourier space combines contributions from every point of the spatial-domain image.
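As a quick sanity check of this transform pair, NumPy's FFT routines implement exactly these formulas (unnormalized forward transform, 1/N² on the inverse); the 8 x 8 random image below is an arbitrary illustrative choice:

import numpy as np

N = 8
f = np.random.rand(N, N)            # spatial-domain image f(u, v)
F = np.fft.fft2(f)                  # Fourier-domain image F(k, l)
f_back = np.fft.ifft2(F)            # inverse transform

# The inverse recovers the original image up to floating-point error,
# and the imaginary part of the reconstruction is negligible.
print(np.allclose(f, f_back.real))  # True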

So each point of the image is decomposed into sinusoidal components, which makes it very easy to examine all the frequencies, and thus we can access the geometric structure of the spatial-domain image.

The number of frequencies generally corresponds to the number of pixels in the spatial-domain image. The Fourier transform produces a complex-valued output image that can be displayed as two images, either the real and imaginary parts or the magnitude and phase. If we want to re-transform the Fourier image back into the correct spatial domain after some frequency-domain processing, we must ensure that both the magnitude and the phase of the Fourier image are preserved. The Fourier transform thus separates our image into its frequencies, allowing us to distinguish between the many sinusoidal signals and select a specific one by designating its frequency and filtering it out.
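The magnitude/phase point can be seen directly: splitting the spectrum into magnitude and phase and recombining them reproduces the image exactly. A small sketch (the 16 x 16 random image is an illustrative assumption):

import numpy as np

f = np.random.rand(16, 16)
F = np.fft.fft2(f)

magnitude = np.abs(F)    # |F(k, l)|
phase = np.angle(F)      # arg F(k, l)

# Recombine magnitude and phase into the complex spectrum, then
# transform back to the spatial domain.
F_rebuilt = magnitude * np.exp(1j * phase)
f_rebuilt = np.fft.ifft2(F_rebuilt).real
print(np.allclose(f, f_rebuilt))  # True: nothing was lost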

Fourier convolutional neural networks

In an FCNN, images are represented and processed in the Fourier domain, with a convolution method analogous to that employed in classic CNN algorithms. The underlying intuition is given by the Convolution Theorem, which states that for two functions κ and u, we have

$$\mathcal{F}(\kappa * u) = \mathcal{F}(\kappa) \odot \mathcal{F}(u)$$

where $\mathcal{F}$ denotes the Fourier transform, $*$ denotes convolution, and $\odot$ denotes the Hadamard (pointwise) product.

This enables the use of Fast Fourier Transforms (FFTs) to calculate convolution much more efficiently. Given the efficiency of the Fourier transform, and the fact that convolution corresponds to the Hadamard product in the Fourier domain, this approach requires far fewer operations than the sliding-kernel spatial method and is thus significantly faster. The data is transformed to the Fourier domain before the procedure begins and stays there throughout; no inverse FFTs are required at any time.
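The Convolution Theorem is easy to verify numerically. The sketch below (sizes are illustrative, and it assumes SciPy is available for the direct convolution) zero-pads both inputs to the full output size, so the pointwise product in the Fourier domain matches the linear sliding-window convolution:

import numpy as np
from scipy.signal import convolve2d

u = np.random.rand(32, 32)   # "image"
k = np.random.rand(5, 5)     # "kernel"

# Direct (sliding-window) convolution in the spatial domain.
direct = convolve2d(u, k, mode='full')

# Same result via the Convolution Theorem: FFT both inputs (zero-padded
# to the full output size), multiply pointwise, and transform back.
shape = (u.shape[0] + k.shape[0] - 1, u.shape[1] + k.shape[1] - 1)
fourier = np.fft.ifft2(np.fft.fft2(u, shape) * np.fft.fft2(k, shape)).real

print(np.allclose(direct, fourier))  # True

The inverse FFT here is only to compare against the spatial result; as noted above, the FCNN itself stays in the Fourier domain and never needs it.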

Pooling is also implemented in the Fourier domain, where it is not only more efficient than max pooling but also produces better results. The other layers implemented in the FCNN are dense layers and dropout, and these Fourier layers behave identically to their spatial counterparts. To prevent over-fitting, dropout drops nodes in the network with probability p; this holds in both the Fourier and the spatial domains. Similarly, dense layers, which learn abstract relationships within the convolved image data, work the same way on Fourier data as on spatial data.

The discrete Fourier transform for an n x n image u requires n² multiplications and n(n − 1) additions; however, this can be significantly reduced using an FFT method such as Cooley-Tukey, which computes the DFT with (n/2) log₂ n multiplications and n log₂ n additions. This improves the transform from O(n²) operations to O(n log n).
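Plugging in a concrete size shows how large the savings are (n = 1024 is an arbitrary illustrative choice):

import math

n = 1024
naive_mults = n ** 2                # direct DFT: n^2 multiplications
fft_mults = (n / 2) * math.log2(n)  # Cooley-Tukey: (n/2) log2 n
print(naive_mults, int(fft_mults))  # 1048576 vs 5120
print(naive_mults / fft_mults)      # ~204.8x fewer multiplications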

Fourier pooling 

Image data is distributed differently in the Fourier domain than in the spatial domain. This allows us to reduce the data size by the same factor as in the spatial domain while retaining more information. The center of a Fourier matrix contains high-frequency data, whereas the edges contain low-frequency data. The borders of the matrix are therefore truncated, since the high-frequency Fourier data at the center carries more of the spatial information we want to keep. This method provides a simple Fourier pooling layer for the FCNN. Because of this property of the Fourier domain, Fourier pooling retains more spatial information than max pooling at the same down-sampling factor.
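A minimal sketch of this pooling, following the description above: keep the central block of the Fourier matrix and truncate its borders (the pooling factor of 2 and the 16 x 16 input are illustrative assumptions):

import numpy as np

def fourier_pool(F, factor=2):
    # Keep the central (n/factor) x (n/factor) block of a square Fourier
    # matrix, truncating the border frequencies.
    n = F.shape[0]
    m = n // factor
    start = (n - m) // 2
    return F[start:start + m, start:start + m]

F = np.fft.fft2(np.random.rand(16, 16))  # Fourier-domain feature map
pooled = fourier_pool(F, factor=2)
print(F.shape, '->', pooled.shape)       # (16, 16) -> (8, 8)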

To learn how spatial-domain convolution is equivalent to Fourier-domain multiplication, read Advantages of FCNN Over CNN – Part 2.

Conclusion

When we convolve in the spatial domain, we convolve across a pixel's neighbors; when we convolve in the Fourier domain, we convolve across neighboring frequencies. As a result, each component of the Fourier spectrum carries information about the entire image. A Fourier convolution is therefore localized in frequency space but global in pixel space, which is pretty cool, since ordinary convolution is exactly the opposite: localized in pixel space. With the FCNN we get a large reduction in training time without sacrificing performance; larger images can be processed in a reasonable amount of time, and the network's efficiency is significantly improved.