In the previous blog I mentioned that the layers reduce at different rates when they are pruned using the MAAV method. One way I thought of to illustrate this was to train some models with 1,000 activations in each layer. All of the models I work with start with the biggest layer and then reduce in size. I had always assumed this was the obvious way to design a model: each layer extracts and compresses some information, so the next layer needs fewer activations to do its job. I wanted to see whether pruning the equal-width models would find a similar shape to the one I have been using. These plots show how each layer shrinks compared to the overall size of the model.
[Plot: MNIST — layer size vs. overall model size]
[Plot: FMNIST — layer size vs. overall model size]
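For reference, this is roughly what the equal-width starting point looks like. It is a minimal sketch assuming PyTorch, 28x28 MNIST/FMNIST inputs, and three hidden layers; the exact details of my models may differ.

```python
import torch.nn as nn

# Equal-width starting point: every hidden layer has 1,000 activations,
# instead of the usual tapered 1000-800-400 shape.
equal_width = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10),
)
```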
Both plots find a triangular shape, but there is a difference: with MNIST the layer sizes stay closer together, while in the FMNIST model the first layer has barely changed by the time the 3rd layer has lost almost half its activations.
I am going to use some cautious language to explain what is going on here. MAAV is one method of finding useful parameters, and it is finding that these useful parameters are spread more evenly across the activations in the first layer and are more concentrated in the 3rd layer. That suggests we are on the right track with a triangular model. It also suggests to me that the more complicated dataset needs bigger first and second layers, to extract more information before reducing the size. One thing to remember is that this removes activations by brute force: there is no mechanism for the model to say no, and there is no way to make a layer bigger.
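To make the brute-force point concrete, here is a minimal sketch of how I picture the process: score every hidden unit by its mean absolute activation value over some data, then remove the weakest unit across the whole model one at a time, recording each layer's size as the total shrinks. This assumes PyTorch and a Sequential model like the sketch above; the function names are mine, and the actual MAAV implementation from the previous blog may differ.

```python
import torch

@torch.no_grad()
def maav_scores(model, loader):
    # Mean absolute activation value per hidden unit, keyed by hidden-layer index.
    # (My reading of MAAV; the method from the previous blog may differ.)
    model.eval()
    sums, n = {}, 0
    for x, _y in loader:
        h, layer = x, 0
        for module in model:                  # works for the nn.Sequential sketch above
            h = module(h)
            if isinstance(module, torch.nn.ReLU):
                sums[layer] = sums.get(layer, 0) + h.abs().sum(dim=0)
                layer += 1
        n += x.shape[0]
    return {k: v / n for k, v in sums.items()}

def layer_sizes_during_pruning(scores):
    # Brute-force removal: pool every hidden unit's score, drop the weakest
    # first, and record how many units each layer keeps as the model shrinks.
    # There is no mechanism for a layer to refuse a removal or to grow back.
    units = sorted((float(v), layer) for layer, vec in scores.items() for v in vec)
    alive = {layer: len(vec) for layer, vec in scores.items()}
    history = []
    for _score, layer in units:
        alive[layer] -= 1
        history.append((sum(alive.values()), dict(alive)))
    return history
```

Plotting the per-layer counts in `history` against the running total is how I think of the curves above; the accuracy measurements from the real experiments are left out of this sketch.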
These runs do not end up with the same shape as the original 1,000-800-400-10 model I have been working with, because the first layer starts losing activations before the 3rd layer has lost more than half of its own. The shape you get from this method depends on the starting shape.
I don’t want to fall into circular logic here, saying this is finding the best shape because this is the shape it finds. I think it is finding a better shape; at this stage that is interesting, but I don’t know how useful it is. The idea of using the data to help decide the shape of the model is potentially very useful, and this method automatically finds some things I had assumed to be true about the shape of the models and the complexity of the dataset.
I did not experiment as much on these models as I did in the previous blog, so I cannot say much about any increase in test accuracy. Here I looked at models that were too wide; next I will look at models that are too deep.