Welcome to my first micro post. Micro posts will be shorter posts showcasing a single idea or result (something I've been working on, reading, or thinking about that seems worth sharing). This time I show some visual results from training a neural network (built to mimic the "controlled hallucination" understanding of the brain) to predict the next image in a very simple (but noisy) video.
The network I'm working on gets a single frame as input and attempts to predict the next frame. Its prediction is compared against the actual next frame, and it is trained to reduce the error between its prediction and reality. This alone is already a good strategy, but we can do better. Whatever it gets wrong (its errors) is passed up to a second network. The second network is structured exactly like the first: when it gets an input (the first network's errors), it tries to predict what that input will be for the next frame. This whole process is repeated many times, with each higher network predicting the errors that the network below it is going to make.
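Here's a rough sketch of that hierarchy. This isn't the exact network I trained; it just uses tiny fully-connected predictors on flattened frames, and the names are illustrative. The key idea is that each level predicts the next element of its own input stream, and its errors become the input stream for the level above.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """One level of the stack: predicts the next element of its input stream."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)

def train_hierarchy(frames, n_levels=3, steps=500, lr=1e-3):
    """frames: tensor of shape (T, dim) -- a flattened video sequence."""
    dim = frames.shape[1]
    levels = [Predictor(dim) for _ in range(n_levels)]
    opts = [torch.optim.Adam(m.parameters(), lr=lr) for m in levels]

    for _ in range(steps):
        signal = frames                        # level 0's input stream is the frames themselves
        for level, opt in zip(levels, opts):
            pred = level(signal[:-1])          # predict the next element of this level's stream
            target = signal[1:]
            loss = (target - pred).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
            # Whatever this level gets wrong becomes the input stream for the level above.
            signal = (target - pred).detach()
    return levels
```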
The first network by itself makes quite a good prediction. The first network combined with the second (which attempts to correct the first one's errors) makes an even better prediction. Combining the first prediction with the second, and so on all the way to the top, gives the full prediction of the entire network. There is no single place where the full prediction exists (just like there's no single place in the brain where our experience exists).
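As a sketch of how the per-level predictions add up (again illustrative, matching the snippet above): level 0's output is a first guess at the next frame, and each higher level's output is a guess at the correction the levels below will need.

```python
def full_prediction(levels, frames):
    """Combine all levels into one guess at the frame after frames[-1].
    frames: recent history of flattened frames, shape (T, dim), with T > len(levels)."""
    with torch.no_grad():
        signal = frames
        prediction = torch.zeros_like(frames[-1])
        for level in levels:
            prediction = prediction + level(signal[-1:])[0]  # this level's newest prediction
            # Rebuild this level's error stream; it is the next level's input stream.
            signal = signal[1:] - level(signal[:-1])
        return prediction
```

The "full prediction" only ever exists as this running sum; no single level holds it.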
The following video shows a clip from the video that is used as input to the network. It's just some simple static images with some added noise.
The results below come from training the networks on just 500 input frames. On the left is the result from training the network to output its inputs. There is no prediction going on here, just an attempt to copy the input to the output. On the right is the predicting version: it attempts to predict the next input. The two experiments are run with exactly the same settings; the only difference is that one attempts to predict the next frame, while the other just attempts to copy the current one.
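In code, the difference between the two runs is just which target you hand the training loop (illustrative names, same idea as the snippets above):

```python
def make_pairs(frames, mode):
    """frames: (T, dim). Returns (inputs, targets) for one of the two experiments."""
    if mode == "copy":
        return frames, frames              # reproduce the current frame
    else:  # mode == "predict"
        return frames[:-1], frames[1:]     # predict the next frame
```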
The reason for this result is fairly simple. Training based on copying the input learns to recreate the noise, because the noise in the target is exactly the noise in the input. Training to predict means the noise in the target is a different, independent instance from the noise in the input, so it can't be predicted; on average those unpredictable errors cancel out, and the network learns to output only the clean, predictable part of the image. The really neat thing is that if you're predicting the inputs then you get this result for free. The noise is filtered out automatically.
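A toy way to see this, assuming each frame is a fixed image plus zero-mean noise that is independent from frame to frame:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.random(1000)                                # the underlying "clean" image, flattened
frames = clean + 0.3 * rng.normal(size=(500, 1000))     # 500 noisy observations of it

# Copying: the ideal output for a frame is that frame itself -- noise included.
copy_output = frames[0]

# Predicting: the noise in the next frame is unpredictable, so the best guess
# (in a mean-squared-error sense) is the average next frame, i.e. the clean image.
predict_output = frames[1:].mean(axis=0)

print(np.abs(copy_output - clean).mean())      # ~0.24: all the noise is kept
print(np.abs(predict_output - clean).mean())   # ~0.01: the noise averages away
```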
This is a neat visual example of just one of the reasons why predicting inputs is a good strategy for decoding information about the world. Using a prediction strategy automatically filters out (unbiased) noise.