LSTM Question

Sigmoids make sense for the gates, since they control how much of the signal is let into or out of the cell. Think of each gate value as a percentage: what fraction of the input signal should be stored in the cell (or read out of it)? It doesn’t make sense to amplify a signal and write 110% of the current cell signal to the output; that’s not what the gates are for. Likewise, it doesn’t make sense for the input gate to say “the current input is 900% relevant for the memory cell, so please store it 9 times as strongly as usual”. If that were the case, the input/output weights would have made the signal 900% stronger to begin with.
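To make the “gates as percentages” view concrete, here is a minimal sketch of one LSTM step in NumPy. The weight layout (a single stacked matrix `W` and bias `b`) is an assumption for brevity, not a fixed convention; the point is that every gate passes through a sigmoid and therefore scales signals by a factor in (0, 1), never amplifying them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks all four transforms as a (4H, D+H)
    matrix and b is a (4H,) bias -- a hypothetical layout for brevity."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])        # input gate, in (0, 1)
    f = sigmoid(z[H:2*H])      # forget gate, in (0, 1)
    o = sigmoid(z[2*H:3*H])    # output gate, in (0, 1)
    g = np.tanh(z[3*H:4*H])    # candidate cell update
    c = f * c_prev + i * g     # gates scale signals; they never amplify
    h = o * np.tanh(c)         # output is at most 100% of tanh(c)
    return h, c
```

Because `o` is in (0, 1) and `tanh(c)` is in (-1, 1), the hidden state `h` is always bounded, which is exactly the behaviour the gates are meant to enforce.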

For the output activation, ReLU can of course be used. However, you might easily run into numerical problems, given that gradients already need to be clipped quite often in RNN training (and ReLU does not dampen them the way sigmoids do). If I recall correctly, Bengio’s lab has a paper somewhere where they use ReLUs in RNNs and report problems of this kind (I may be wrong, though, and I’m unable to find the paper right now).
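The “clipping” mentioned above is commonly done by rescaling gradients when their global norm exceeds a threshold. A minimal sketch (the function name and `max_norm` default are my own, not from any particular library):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm -- a common remedy for exploding RNN gradients."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```

With unbounded activations like ReLU, such clipping tends to be needed more, since nothing in the forward pass caps how large the signals (and hence the gradients) can grow.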

Also, one of the benefits of ReLUs is that they mitigate vanishing gradients. But LSTM was designed not to suffer from that in the first place. Given that you don’t have a vanishing gradient problem, it comes down to whether ReLU is better than the sigmoid in principle (because it can learn better functions) or whether its main advantage is simply easier training. Of course, this is a simplified view, especially since LSTM was not originally designed to be “deep”: if you stack many layers of LSTMs, you might still get vanishing gradients between layers if you use sigmoid-like activations.
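A quick numeric illustration of why stacking sigmoid-like activations can still cause vanishing gradients across layers: the sigmoid’s derivative peaks at 0.25, so chaining many such layers can shrink a backpropagated gradient geometrically, whereas ReLU’s derivative is 1 on the active side.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0
sig_grad = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid derivative, 0.25 at x = 0
relu_grad = 1.0 if x >= 0 else 0.0          # ReLU derivative on the active side

T = 10  # depth of the stack
print(sig_grad ** T)   # about 9.5e-7: the gradient nearly vanishes
print(relu_grad ** T)  # 1.0: ReLU preserves the gradient magnitude
```

This is only the best case for the sigmoid (its derivative is even smaller away from zero), which is why depth-wise vanishing remains a concern even when the LSTM cell itself protects the gradient through time.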


Copyright © 2013 - 2017 Universality All Rights Reserved.
