Frequency principle/spectral bias

Phenomenon observed in the study of Artificial Neural Networks

The frequency principle/spectral bias is a phenomenon observed in the study of artificial neural networks (ANNs), specifically deep neural networks (DNNs). It describes the tendency of deep neural networks to fit target functions from low to high frequencies during the training process.

This phenomenon is referred to as the frequency principle (F-Principle) by Zhi-Qin John Xu et al. or spectral bias by Nasim Rahaman et al. The F-Principle can be robustly observed in DNNs, regardless of overparametrization. A key mechanism of the F-Principle is that the regularity of the activation function translates into the decay rate of the loss function in the frequency domain.

The discovery of the frequency principle has inspired the design of DNNs that can quickly learn high-frequency functions. This has applications in scientific computing, image classification, and point cloud fitting problems. Furthermore, it provides a means to comprehend phenomena in practical applications and has inspired numerous studies on deep learning from the frequency perspective.

Main results (informal)

Experimental results

Fig 1: Illustration of the frequency principle in one dimension. The abscissa is the frequency and the ordinate is the amplitude at that frequency. The red dashed line is the DFT of the one-dimensional target function; the blue solid line is the DFT of the DNN output.

In one-dimensional problems, the discrete Fourier transform (DFT) of the target function and of the DNN output can be computed directly, and Fig. 1 shows that the DNN output (blue line) fits the low-frequency components of the target faster than the high-frequency components.
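
This observation can be illustrated with a short NumPy sketch; the target function, grid, and frequency indices below are illustrative choices rather than the setup of the cited experiments.

```python
import numpy as np

# Illustrative 1D check of the frequency principle (not the exact setup of the cited papers).
x = np.linspace(0.0, 1.0, 256, endpoint=False)
target = np.sin(2 * np.pi * x) + 0.5 * np.sin(2 * np.pi * 10 * x)   # low + high frequency

def frequency_errors(model_output, target):
    """Relative DFT-amplitude error between the DNN output and the target, per frequency."""
    f_hat = np.fft.rfft(target)
    g_hat = np.fft.rfft(model_output)
    return np.abs(f_hat - g_hat) / (np.abs(f_hat) + 1e-12)

# Calling frequency_errors on the DNN output at successive training steps shows the error
# at the low frequency (k = 1) dropping long before the error at the high frequency (k = 10).
```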

Fig 2: Illustration of the frequency principle in two dimensions. The first subfigure is the ground-truth cameraman image; the following three subfigures are the DNN outputs at training steps 80, 2000, and 58000.

In two-dimensional problems, Fig. 2 shows a DNN fitting the cameraman image. The DNN starts from a coarse reconstruction and produces a progressively more detailed image as training proceeds, i.e., it learns from low to high frequencies, which has been compared to how the biological brain remembers an image. This two-dimensional frequency principle suggests using DNNs for image restoration tasks, such as inpainting, by exploiting their preference for low frequencies. However, the insufficient learning of high-frequency structure must be taken into account; algorithms developed to address this limitation are introduced in the Applications section.

In high-dimensional problems, one can use a projection method to visualize frequency convergence along a particular direction, or use a Gaussian filter to roughly separate the convergence of the low-frequency and high-frequency parts.
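
A minimal sketch of the filtering idea, assuming scattered samples X (points) and y (values): the low-frequency part is approximated by Gaussian-kernel smoothing over the samples and the high-frequency part is the residual. The bandwidth sigma and the dense kernel used here are illustrative choices.

```python
import numpy as np

def low_high_split(X, y, sigma=0.1):
    """Split values y sampled at points X (shape [n, d]) into a smoothed low-frequency
    part and a residual high-frequency part via a Gaussian kernel (sigma is illustrative)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                           # normalized kernel weights
    y_low = w @ y                                                # low-frequency part
    y_high = y - y_low                                           # high-frequency part
    return y_low, y_high

# Tracking the DNN error against y_low and y_high of the target during training shows
# the low-frequency part converging first.
```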

Theoretical results

Under the assumptions of i) a certain regularity of the target function, the sample distribution function, and the activation function, and ii) a bounded training trajectory with loss convergence, Luo et al. prove that the ratio of the change of the high-frequency part of the loss to the change of the total loss decays with the separating frequency at a power rate determined by the regularity assumption. A key step of the proof is that composite functions retain a certain regularity, which causes decay in the frequency domain; the result therefore applies to general network structures with multiple layers. While this characterization of the F-Principle is very general, it is too coarse-grained to differentiate the effects of network structure or of special properties of DNNs, and it provides only a qualitative rather than a quantitative characterization.

A continuous framework for studying machine learning suggests that the gradient flows of neural networks are well-behaved flows that obey the F-Principle, because they are governed by integral equations, which have higher regularity; this increased regularity leads to faster decay in the Fourier domain.

Applications

Algorithms designed to overcome the difficulty of learning high frequencies

Phase shift DNN: PhaseDNN shifts the high-frequency components of the data downward to a low-frequency spectrum for learning, and then shifts the learned representation back to the original high frequencies.
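
A one-dimensional sketch of the shift idea, assuming a band centred at frequency k0 (the band selection, the training loop, and the recombination of multiple bands are omitted here):

```python
import numpy as np

def shift_down(x, f_vals, k0):
    """Modulate f by exp(-2*pi*i*k0*x) so that the spectral band near k0 moves near 0."""
    return f_vals * np.exp(-2j * np.pi * k0 * x)

def shift_up(x, g_vals, k0):
    """Undo the modulation after the now low-frequency content has been learned."""
    return g_vals * np.exp(2j * np.pi * k0 * x)

# In a PhaseDNN-style scheme, a network is trained on the shifted (low-frequency) data,
# its prediction is mapped back with shift_up, and the contributions of all bands are summed.
```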

Adaptive activation functions: Adaptive activation functions replace the activation function $\sigma(x)$ by $\sigma(\mu a x)$, where $\mu$ is a fixed scale factor with $\mu \geq 1$ and $a$ is a trainable variable shared by all neurons.
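
As a sketch, in a framework such as PyTorch this amounts to one trainable scalar shared across neurons; the layer sizes, the value of $\mu$, and the initialization of $a$ below are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveActivationLayer(nn.Module):
    """Linear layer followed by sigma(mu * a * z), with a trainable scalar a shared by all neurons."""
    def __init__(self, in_dim, out_dim, mu=10.0, activation=torch.tanh):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.a = nn.Parameter(torch.tensor(0.1))   # trainable shared scale (initial value illustrative)
        self.mu = mu                                # fixed scale factor, mu >= 1
        self.activation = activation

    def forward(self, x):
        return self.activation(self.mu * self.a * self.linear(x))
```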

Fig 3: Illustration of two MscaleDNN structures.

Multi-scale DNN: To alleviate the high-frequency difficulty for high-dimensional problems, a Multi-scale DNN (MscaleDNN) method considers the frequency conversion only in the radial direction. The conversion in the frequency space can be done by scaling, which is equivalent to an inverse scaling in the spatial space.

The first kind of MscaleDNN, denoted MscaleDNN-1, takes the form
$$f(\boldsymbol{x};\boldsymbol{\theta}) = \boldsymbol{W}^{[L-1]}\sigma\circ\bigl(\cdots\bigl(\boldsymbol{W}^{[1]}\sigma\circ\bigl(\boldsymbol{K}\odot(\boldsymbol{W}^{[0]}\boldsymbol{x})+\boldsymbol{b}^{[0]}\bigr)+\boldsymbol{b}^{[1]}\bigr)\cdots\bigr)+\boldsymbol{b}^{[L-1]},$$
where $\boldsymbol{x}\in\mathbb{R}^{d}$, $\boldsymbol{W}^{[l]}\in\mathbb{R}^{m_{l+1}\times m_{l}}$ with $m_{l}$ the number of neurons in the $l$-th hidden layer and $m_{0}=d$, $\boldsymbol{b}^{[l]}\in\mathbb{R}^{m_{l+1}}$, $\sigma$ is a scalar activation function, $\circ$ denotes entry-wise application, $\odot$ is the Hadamard product, and
$$\boldsymbol{K}=(\underbrace{a_{1},a_{1},\cdots,a_{1}}_{\text{1st part}},a_{2},\cdots,a_{i-1},\underbrace{a_{i},a_{i},\cdots,a_{i}}_{i\text{th part}},\cdots,\underbrace{a_{N},a_{N},\cdots,a_{N}}_{N\text{th part}})^{T},$$
where $\boldsymbol{K}\in\mathbb{R}^{m_{1}}$ and $a_{i}=i$ or $a_{i}=2^{i-1}$. This structure is called Multi-scale DNN-1 (MscaleDNN-1).
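
A compact PyTorch sketch of MscaleDNN-1; the widths, depth, number of scales, the choice $a_i = 2^{i-1}$, and the tanh activation are illustrative choices.

```python
import torch
import torch.nn as nn

class MscaleDNN1(nn.Module):
    """The first hidden layer applies sigma(K ⊙ (W[0] x) + b[0]), where K repeats the
    scale a_i = 2**(i-1) over the i-th block of neurons."""
    def __init__(self, d, width=128, depth=3, n_scales=4, activation=torch.tanh):
        super().__init__()
        assert width % n_scales == 0
        self.W0 = nn.Linear(d, width, bias=False)
        self.b0 = nn.Parameter(torch.zeros(width))
        scales = torch.cat([torch.full((width // n_scales,), 2.0 ** i) for i in range(n_scales)])
        self.register_buffer("K", scales)            # fixed scale vector K in R^{m_1}
        self.hidden = nn.ModuleList([nn.Linear(width, width) for _ in range(depth - 1)])
        self.last = nn.Linear(width, 1)
        self.activation = activation

    def forward(self, x):
        h = self.activation(self.K * self.W0(x) + self.b0)   # sigma(K ⊙ (W[0] x) + b[0])
        for layer in self.hidden:                             # subsequent layers W[l], b[l]
            h = self.activation(layer(h))
        return self.last(h)                                   # W[L-1] h + b[L-1]
```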

The second kind of MscaleDNN, denoted MscaleDNN-2 in Fig. 3, is a sum of $N$ subnetworks, in which each scaled input goes through its own subnetwork. In MscaleDNN-2, the weight matrices from $\boldsymbol{W}^{[1]}$ to $\boldsymbol{W}^{[L-1]}$ are block diagonal. Again, the scale coefficients are $a_{i}=i$ or $a_{i}=2^{i-1}$.
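
Equivalently, the block-diagonal structure can be written as a sum of independent subnetworks, one per scale; the subnetwork architecture below is an illustrative choice.

```python
import torch
import torch.nn as nn

class MscaleDNN2(nn.Module):
    """Sum of N subnetworks; the i-th subnetwork receives the scaled input a_i * x."""
    def __init__(self, d, n_scales=4, width=64):
        super().__init__()
        self.scales = [2.0 ** i for i in range(n_scales)]      # a_i = 2^{i-1} (illustrative choice)
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(d, width), nn.Tanh(),
                          nn.Linear(width, width), nn.Tanh(),
                          nn.Linear(width, 1))
            for _ in range(n_scales)
        ])

    def forward(self, x):
        return sum(net(a * x) for a, net in zip(self.scales, self.subnets))
```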

Fourier feature network: A Fourier feature network maps the input $\boldsymbol{x}$ to $\gamma(\boldsymbol{x})=[a_{1}\cos(2\pi\boldsymbol{b}_{1}^{T}\boldsymbol{x}),a_{1}\sin(2\pi\boldsymbol{b}_{1}^{T}\boldsymbol{x}),\cdots,a_{m}\cos(2\pi\boldsymbol{b}_{m}^{T}\boldsymbol{x}),a_{m}\sin(2\pi\boldsymbol{b}_{m}^{T}\boldsymbol{x})]^{T}$, for example in image reconstruction tasks; $\gamma(\boldsymbol{x})$ is then used as the input to the neural network. An extended Fourier feature network has been proposed for PDE problems, where the frequencies $\boldsymbol{b}_{i}$ are selected from different ranges. Ben Mildenhall et al. successfully applied this multiscale Fourier feature input in neural radiance fields for view synthesis.
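
A minimal sketch of the feature mapping; the shapes, defaults, and the Gaussian sampling of the frequency matrix B in the usage example are illustrative choices.

```python
import numpy as np

def fourier_features(x, B, a=None):
    """Map inputs x (shape [n, d]) to [a_i cos(2π b_i·x), a_i sin(2π b_i·x)]; B has shape [m, d]."""
    if a is None:
        a = np.ones(B.shape[0])
    proj = 2.0 * np.pi * x @ B.T                    # shape [n, m]
    return np.concatenate([a * np.cos(proj), a * np.sin(proj)], axis=1)

# Example: gamma = fourier_features(x, np.random.normal(0.0, 10.0, size=(256, x.shape[1])))
# gamma is then fed to an ordinary fully connected network.
```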

Multi-stage neural network: Multi-stage neural networks (MSNN) use a superposition of DNNs, where each successive network is optimized to fit the residual left by the previous networks, boosting approximation accuracy. MSNNs have been applied to both regression problems and physics-informed neural networks, effectively addressing spectral bias and achieving accuracy close to double-precision floating-point machine precision.
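
A minimal sketch of the stage-wise residual fitting, assuming user-supplied make_net and train helpers (hypothetical names); the residual rescaling used in the cited work is omitted.

```python
import torch

def multi_stage_fit(x, y, make_net, train, n_stages=3):
    """Fit y(x) in stages: each new network is trained on the residual left by the
    previous stages; make_net() builds a network and train(net, x, r) fits it to r."""
    nets, residual = [], y.clone()
    for _ in range(n_stages):
        net = make_net()
        train(net, x, residual)                      # fit the current residual
        with torch.no_grad():
            residual = residual - net(x)             # what remains for the next stage
        nets.append(net)
    return lambda z: sum(net(z) for net in nets)     # final predictor: sum of all stages
```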

Frequency perspective for understanding experimental phenomena

Compression phase: The F-Principle explains the compression phase observed in the information plane. The entropy (information) quantifies the diversity of output values: more possible output values give a higher entropy. When learning a discretized function, the DNN first fits its continuous low-frequency components, a high-entropy state; as the network gradually captures the high-frequency components, the output becomes increasingly discretized and the entropy decreases. The compression phase in the information plane thus emerges.

Increasing complexity: The F-Principle also explains why the complexity of the DNN output increases during training.

Strength and limitation: The F-Principle indicates that deep neural networks are good at learning low-frequency functions but struggle to learn high-frequency functions.

Early-stopping trick: Since noise is often dominated by high frequencies, early stopping allows a neural network with spectral bias to avoid fitting high-frequency noise.

References

  1. Xu, Zhi-Qin John; Zhang, Yaoyu; Xiao, Yanyang (2019). "Training Behavior of Deep Neural Network in Frequency Domain". In Tom Gedeon; Kok Wai Wong; Minho Lee (eds.). Neural Information Processing. Vol. 11953. Cham: Springer International Publishing. pp. 264–274. arXiv:1807.01251. doi:10.1007/978-3-030-36708-4_22. ISBN 978-3-030-36707-7. S2CID 49562099.
  2. Xu, Zhi-Qin John; Zhang, Yaoyu; Luo, Tao; Xiao, Yanyang; Ma, Zheng (2020). "Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks". Communications in Computational Physics. 28 (5): 1746–1767. arXiv:1901.06523. Bibcode:2020CCoPh..28.1746X. doi:10.4208/cicp.OA-2020-0085. ISSN 1815-2406. S2CID 58981616.
  3. Rahaman, Nasim; Baratin, Aristide; Arpit, Devansh; Draxler, Felix; Lin, Min; Hamprecht, Fred; Bengio, Yoshua; Courville, Aaron (2019-05-24). "On the Spectral Bias of Neural Networks". Proceedings of the 36th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 5301–5310. Retrieved 2023-07-14.
  4. Xu, Zhi-Qin John; Zhang, Yaoyu; Luo, Tao (2022-10-18). "Overview frequency principle/spectral bias in deep learning". arXiv:2201.07395.
  5. Luo, Tao; Ma, Zheng; Xu, Zhi-Qin John; Zhang, Yaoyu (2021). "Theory of the Frequency Principle for General Deep Neural Networks". CSIAM Transactions on Applied Mathematics. 2 (3): 484–507. arXiv:1906.09235. doi:10.4208/csiam-am.SO-2020-0005. ISSN 2708-0560. S2CID 195317121. Retrieved 2023-07-14.
  6. E, Weinan; Ma, Chao; Wu, Lei (2020-11-01). "Machine learning from a continuous viewpoint, I". Science China Mathematics. 63 (11): 2233–2266. arXiv:1912.12777. doi:10.1007/s11425-020-1773-8. ISSN 1869-1862. S2CID 209515941. Retrieved 2023-07-14.
  7. Cai, Wei; Li, Xiaoguang; Liu, Lizuo (2020). "A Phase Shift Deep Neural Network for High Frequency Approximation and Wave Problems". SIAM Journal on Scientific Computing. 42 (5): A3285–A3312. arXiv:1909.11759. Bibcode:2020SJSC...42A3285C. doi:10.1137/19M1310050. ISSN 1064-8275. S2CID 209376162. Retrieved 2023-07-16.
  8. Jagtap, Ameya D.; Kawaguchi, Kenji; Karniadakis, George Em (2020-03-01). "Adaptive activation functions accelerate convergence in deep and physics-informed neural networks". Journal of Computational Physics. 404: 109136. arXiv:1906.01170. Bibcode:2020JCoPh.40409136J. doi:10.1016/j.jcp.2019.109136. ISSN 0021-9991. S2CID 174797885. Retrieved 2023-07-16.
  9. Liu, Ziqi; Cai, Wei; Xu, Zhi-Qin John (2020). "Multi-Scale Deep Neural Network (MscaleDNN) for Solving Poisson-Boltzmann Equation in Complex Domains". Communications in Computational Physics. 28 (5): 1970–2001. arXiv:2007.11207. Bibcode:2020CCoPh..28.1970L. doi:10.4208/cicp.OA-2020-0179. ISSN 1815-2406. S2CID 220686331. Retrieved 2023-07-16.
  10. Tancik, Matthew; Srinivasan, Pratul P.; Mildenhall, Ben; Fridovich-Keil, Sara; Raghavan, Nithin; Singhal, Utkarsh; Ramamoorthi, Ravi; Barron, Jonathan T.; Ng, Ren (2020-06-18). "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains". arXiv:2006.10739 .
  11. Wang, Sifan; Wang, Hanwen; Perdikaris, Paris (2021-10-01). "On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks". Computer Methods in Applied Mechanics and Engineering. 384: 113938. arXiv:2012.10047. Bibcode:2021CMAME.384k3938W. doi:10.1016/j.cma.2021.113938. ISSN 0045-7825. S2CID 229331851. Retrieved 2023-07-17.
  12. Mildenhall, Ben; Srinivasan, Pratul P.; Tancik, Matthew; Barron, Jonathan T.; Ramamoorthi, Ravi; Ng, Ren (2020-08-03). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". arXiv:2003.08934 .
  13. Wang, Yongji; Lai, Ching-Yao (1 May 2024). "Multi-stage neural networks: Function approximator of machine precision". Journal of Computational Physics. 504: 112865. arXiv:2307.08934. doi:10.1016/j.jcp.2024.112865. ISSN 0021-9991.
  14. Shwartz-Ziv, Ravid; Tishby, Naftali (2017-04-29). "Opening the Black Box of Deep Neural Networks via Information". arXiv:1703.00810 .