Gated recurrent unit

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.
Memory unit used in neural networks

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. The GRU is like a long short-term memory (LSTM) in that it uses gates to take in or forget certain features, but it lacks a context vector and an output gate, so it has fewer parameters than an LSTM. The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of the LSTM. These comparisons showed that gating is indeed helpful in general, but Bengio's team reached no concrete conclusion on which of the two gating units was better.

Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.

The operator {\displaystyle \odot } denotes the Hadamard product in the following.

Fully gated unit

Figure: Gated Recurrent Unit, fully gated version

Initially, for {\displaystyle t=0}, the output vector is {\displaystyle h_{0}=0}.

{\displaystyle {\begin{aligned}z_{t}&=\sigma (W_{z}x_{t}+U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma (W_{r}x_{t}+U_{r}h_{t-1}+b_{r})\\{\hat {h}}_{t}&=\phi (W_{h}x_{t}+U_{h}(r_{t}\odot h_{t-1})+b_{h})\\h_{t}&=(1-z_{t})\odot h_{t-1}+z_{t}\odot {\hat {h}}_{t}\end{aligned}}}

Variables ({\displaystyle d} denotes the number of input features and {\displaystyle e} the number of output features):

  • {\displaystyle x_{t}\in \mathbb {R} ^{d}} : input vector
  • {\displaystyle h_{t}\in \mathbb {R} ^{e}} : output vector
  • {\displaystyle {\hat {h}}_{t}\in \mathbb {R} ^{e}} : candidate activation vector
  • {\displaystyle z_{t}\in (0,1)^{e}} : update gate vector
  • {\displaystyle r_{t}\in (0,1)^{e}} : reset gate vector
  • {\displaystyle W\in \mathbb {R} ^{e\times d}} , {\displaystyle U\in \mathbb {R} ^{e\times e}} and {\displaystyle b\in \mathbb {R} ^{e}} : parameter matrices and vector which need to be learned during training
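As a concrete illustration, here is a minimal NumPy sketch of one forward step of the fully gated unit, following the equations and shapes above and using the logistic sigmoid and tanh of the original formulation. The parameter names mirror the notation; the helper and the toy usage are illustrative, not a reference implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, h_prev, p):
        """One step of the fully gated GRU.

        x_t    : input vector of shape (d,)
        h_prev : previous output h_{t-1} of shape (e,)
        p      : dict with W_z, W_r, W_h of shape (e, d); U_z, U_r, U_h of
                 shape (e, e); b_z, b_r, b_h of shape (e,)
        """
        z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])            # update gate
        r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])            # reset gate
        h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # candidate activation
        return (1.0 - z_t) * h_prev + z_t * h_hat                               # new output h_t

    # Toy usage: d = 4 input features, e = 3 output features, h_0 = 0.
    d, e = 4, 3
    rng = np.random.default_rng(0)
    p = {n: rng.standard_normal((e, d) if n.startswith("W") else (e, e) if n.startswith("U") else e)
         for n in ("W_z", "W_r", "W_h", "U_z", "U_r", "U_h", "b_z", "b_r", "b_h")}
    h = np.zeros(e)
    for x in rng.standard_normal((5, d)):   # a sequence of 5 random inputs
        h = gru_step(x, h, p)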

Activation functions

In the original formulation, {\displaystyle \sigma } is the logistic sigmoid and {\displaystyle \phi } is the hyperbolic tangent. Alternative activation functions are possible, provided that {\displaystyle \sigma (x)\in [0,1]}.
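For example, a piecewise-linear "hard" sigmoid also satisfies this constraint and could be used for the gates; a minimal sketch (the slope and offset below are one common choice, not something prescribed by the GRU itself):

    import numpy as np

    def hard_sigmoid(x):
        # Piecewise-linear approximation of the logistic function; the output is
        # clipped to [0, 1], so it satisfies the constraint on the gate activation.
        return np.clip(0.2 * x + 0.5, 0.0, 1.0)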

Figures: Type 1, Type 2 and Type 3 gate variants

Alternate forms can be created by changing how {\displaystyle z_{t}} and {\displaystyle r_{t}} are computed; a code sketch of the three variants follows the list.

  • Type 1, each gate depends only on the previous hidden state and the bias.
    {\displaystyle {\begin{aligned}z_{t}&=\sigma (U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma (U_{r}h_{t-1}+b_{r})\end{aligned}}}
  • Type 2, each gate depends only on the previous hidden state.
    {\displaystyle {\begin{aligned}z_{t}&=\sigma (U_{z}h_{t-1})\\r_{t}&=\sigma (U_{r}h_{t-1})\end{aligned}}}
  • Type 3, each gate is computed using only the bias.
    {\displaystyle {\begin{aligned}z_{t}&=\sigma (b_{z})\\r_{t}&=\sigma (b_{r})\end{aligned}}}
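An illustrative NumPy sketch of the three reduced gate computations, with parameter names matching the equations above; the candidate activation and output are then computed exactly as in the fully gated unit.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Type 1: gates use the previous hidden state and the bias.
    def gates_type1(h_prev, U_z, U_r, b_z, b_r):
        return sigmoid(U_z @ h_prev + b_z), sigmoid(U_r @ h_prev + b_r)

    # Type 2: gates use only the previous hidden state.
    def gates_type2(h_prev, U_z, U_r):
        return sigmoid(U_z @ h_prev), sigmoid(U_r @ h_prev)

    # Type 3: gates use only the bias (constant across time steps).
    def gates_type3(b_z, b_r):
        return sigmoid(b_z), sigmoid(b_r)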

Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:

{\displaystyle {\begin{aligned}f_{t}&=\sigma (W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\{\hat {h}}_{t}&=\phi (W_{h}x_{t}+U_{h}(f_{t}\odot h_{t-1})+b_{h})\\h_{t}&=(1-f_{t})\odot h_{t-1}+f_{t}\odot {\hat {h}}_{t}\end{aligned}}}

Variables

  • {\displaystyle x_{t}} : input vector
  • {\displaystyle h_{t}} : output vector
  • {\displaystyle {\hat {h}}_{t}} : candidate activation vector
  • {\displaystyle f_{t}} : forget vector
  • {\displaystyle W} , {\displaystyle U} and {\displaystyle b} : parameter matrices and vector
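A minimal NumPy sketch of one MGU step under the equations above; parameter names are illustrative and follow the notation of the full GRU sketch.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mgu_step(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
        """One step of the minimal gated unit: a single forget gate f_t takes
        the place of the GRU's update and reset gates."""
        f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)             # forget gate
        h_hat = np.tanh(W_h @ x_t + U_h @ (f_t * h_prev) + b_h)   # candidate activation
        return (1.0 - f_t) * h_prev + f_t * h_hat                 # new output h_t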

Light gated recurrent unit

The light gated recurrent unit (LiGRU) removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

{\displaystyle {\begin{aligned}z_{t}&=\sigma (\operatorname {BN} (W_{z}x_{t})+U_{z}h_{t-1})\\{\tilde {h}}_{t}&=\operatorname {ReLU} (\operatorname {BN} (W_{h}x_{t})+U_{h}h_{t-1})\\h_{t}&=z_{t}\odot h_{t-1}+(1-z_{t})\odot {\tilde {h}}_{t}\end{aligned}}}
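A rough NumPy sketch of one LiGRU step for a mini-batch, assuming training-mode batch normalization with separate scale and shift parameters per projection; all names and shapes here are illustrative assumptions, not taken from a particular implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def batch_norm(a, gamma, beta, eps=1e-5):
        # Training-mode batch normalization over the batch axis (axis 0).
        mean = a.mean(axis=0)
        var = a.var(axis=0)
        return gamma * (a - mean) / np.sqrt(var + eps) + beta

    def ligru_step(x_t, h_prev, W_z, U_z, W_h, U_h,
                   gamma_z, beta_z, gamma_h, beta_h):
        """One LiGRU step for a mini-batch.

        x_t    : inputs of shape (batch, d)
        h_prev : previous states of shape (batch, e)
        W_*    : input projections of shape (e, d); U_* : recurrent weights (e, e)
        """
        z_t = sigmoid(batch_norm(x_t @ W_z.T, gamma_z, beta_z) + h_prev @ U_z.T)              # update gate
        h_tilde = np.maximum(0.0, batch_norm(x_t @ W_h.T, gamma_h, beta_h) + h_prev @ U_h.T)  # ReLU candidate
        return z_t * h_prev + (1.0 - z_t) * h_tilde                                           # new state h_t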

LiGRU has been studied from a Bayesian perspective. This analysis yielded a variant called light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.

References

  1. Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv:1406.1078.
  2. Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
  3. "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived from the original on 2021-11-10. Retrieved May 18, 2016.
  4. Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
  5. Su, Yuanhang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
  6. Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555.
  7. Gruber, N.; Jockisch, A. (2020). "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?". Frontiers in Artificial Intelligence. 3: 40. doi:10.3389/frai.2020.00040. PMC 7861254. PMID 33733157. S2CID 220252321.
  8. Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555.
  9. Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923.
  10. Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452.
  11. Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. doi:10.1109/ICASSP39728.2021.9414259.