Let’s assume you are initializing a neural network to generate realistic-looking names. The network is trained to predict the next character from the characters that came before it.

When we train the network on a dataset of names, it learns to predict character N+1 from the previous N characters.
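To make the training setup concrete, here is a minimal sketch of how names could be turned into (context, next character) pairs. The function name `build_examples`, the context size of 3, and the use of `.` as a padding and end-of-name marker are assumptions made for illustration; they are not specified above.

```python
def build_examples(names, context_size=3, pad="."):
    """Return (context, target) pairs for next-character prediction.

    Assumption: `pad` both fills the empty context at the start of a
    name and marks the end of a name.
    """
    examples = []
    for name in names:
        context = pad * context_size  # start with an all-padding context
        for ch in name + pad:  # the trailing pad marks the end of the name
            examples.append((context, ch))
            context = context[1:] + ch  # slide the window by one character
    return examples


pairs = build_examples(["emma"], context_size=3)
for context, target in pairs:
    print(f"{context} -> {target}")
```

Each pair is one training example: the network sees the context and is scored on how much probability it assigns to the target character.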

WHY SHOULD WE BE CAREFUL WHEN INITIALIZING NEURAL NETWORK PARAMETERS?

With 27 possible characters, a network that has learned nothing yet should still give each prediction a 1/27 chance of picking the correct next character. That is a chance of about 3.7%.

If we give every character the same probability, the average negative log likelihood loss should be around -log(1/27) = log(27) ≈ 3.30 (using the natural logarithm).
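The expected loss for a uniform guess can be checked with a few lines of arithmetic. This is only a sketch: it assumes the loss is the negative natural logarithm of the probability assigned to the correct character, and it takes the 27 characters from the text above.

```python
import math

vocab_size = 27  # number of possible characters, as stated in the text

# A network that knows nothing spreads its probability evenly.
uniform_prob = 1.0 / vocab_size

# Negative log likelihood of the correct character under that guess.
uniform_loss = -math.log(uniform_prob)

print(f"probability per character: {uniform_prob:.4f}")
print(f"expected initial loss:     {uniform_loss:.4f}")
```

If the loss measured right after initialization is much higher than this value, the network is starting out confidently wrong rather than uncertain.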

This is because a wrong prediction costs us little as long as we are not too confident in it.
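The effect of confidence can be illustrated by comparing two hypothetical sets of output scores (logits) for the same example. The helper `nll_from_logits`, the softmax output layer, and the specific logit values are assumptions chosen for illustration, not something the text above prescribes.

```python
import math


def nll_from_logits(logits, target_index):
    """Negative log likelihood of target_index under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[target_index]


vocab_size = 27
target = 0  # hypothetical index of the correct next character

# Unsure network: all logits equal, so every character gets 1/27.
unsure_logits = [0.0] * vocab_size

# Confidently wrong network: one incorrect character gets a large logit.
confident_wrong_logits = [0.0] * vocab_size
confident_wrong_logits[5] = 10.0

loss_unsure = nll_from_logits(unsure_logits, target)
loss_confident_wrong = nll_from_logits(confident_wrong_logits, target)

print(f"unsure:          {loss_unsure:.4f}")
print(f"confident wrong: {loss_confident_wrong:.4f}")
```

Both networks would be wrong if forced to pick a single character here, but the unsure one pays roughly a third of the loss. This is one reason to prefer an initialization that keeps the initial outputs close to uniform.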

The negative log likelihood goes to 0 as the probability we assign to the correct character approaches 1, and to infinity as that probability approaches 0.
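The shape of this function can be seen by evaluating it at a few probabilities. The helper name `nll` and the sample probabilities are chosen only for illustration.

```python
import math


def nll(prob_of_correct):
    """Negative log likelihood for the probability given to the correct character."""
    return -math.log(prob_of_correct)


# From nearly certain and right, through the uniform guess, to nearly certain and wrong.
for p in (0.999, 0.5, 1 / 27, 0.001):
    print(f"p = {p:.4f}  loss = {nll(p):.4f}")
```

The loss grows slowly while the correct character still receives a reasonable share of the probability, and then rises sharply as that share shrinks toward 0.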