I need to start by explaining why I have only pruned the middle 2 layers of this model.

For the initial layer it is quite obvious: we want to keep all of the input information. As a simple example, in MNIST a 7 is drawn with a horizontal line starting in the left part of the square, and most other digits don't use that part of the square as much. If we suppress the input from a few pixels there, 7s will start being identified as 1s. If you think that is useful, you can have that one for free, but if you want to know how to turn 3s into 2s it is going to cost you. I don't think it is very useful, and I can't think of how this knowledge would help any other type of model, so I am ruling the input layer out and looking instead at modifying internal representations within the inner layers.

As for the final layer, I could just manually set all of the weights for an output class to 0 and forget that class. If I included the final layer in this method, that would happen immediately and mask anything interesting happening in the preceding layers.
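To make that final-layer point concrete, here is a minimal sketch of what the manual zeroing would look like, assuming a PyTorch-style classifier with a hypothetical final linear layer (the layer name and sizes are placeholders, not the actual model from these experiments):

```python
import torch
import torch.nn as nn

# Hypothetical final layer of a small MNIST classifier:
# 64 hidden units feeding 10 output classes (sizes are just for illustration).
fc_out = nn.Linear(64, 10)

forget_class = 7

# "Forget" one class by zeroing every final-layer weight (and the bias) that
# feeds its output. The class is suppressed immediately, without touching the
# inner layers, which is why including the final layer in the pruning would
# mask anything interesting happening earlier in the network.
with torch.no_grad():
    fc_out.weight[forget_class].zero_()
    fc_out.bias[forget_class].zero_()
```

With the weights and bias at zero, the 7 logit no longer responds to anything the earlier layers do, so the class is effectively gone regardless of what is happening internally.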
Pruning both layers worked better than I was expecting. I thought there would be weeks of tweaking and testing to get something maybe half as good. The initial plan was to just prune the 2 middle layers, but after I saw the results I worried that it shouldn't be this simple. That is why I decided to look at pruning individual layers as part of the first experiment. I had to wonder if there were a small number of activations feeding into the final layer that represented a 7, and whether removing those would be pretty much the same as zeroing out all of the weights for that class in the final layer. That would just be suppressing the output. I am satisfied with what I showed in the last blog, that this is not happening, at least in the useful part of the plot. More on that later.
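Here is a rough sketch of the check I had in mind, again assuming a hypothetical PyTorch final layer: look at which penultimate-layer units carry the largest weights into the 7 output. If the pruning method were mostly just picking off those few units, it would be roughly equivalent to zeroing that row of the final layer, i.e. just suppressing the output:

```python
import torch
import torch.nn as nn

# Same hypothetical final layer as above: 64 hidden units -> 10 classes.
fc_out = nn.Linear(64, 10)
target_class = 7

# Which penultimate-layer units feed most strongly into the '7' output?
# If pruning were only ever removing these few units, it would be roughly the
# same as zeroing the class's row of the final layer: output suppression
# rather than a change to the internal representation.
weights_into_class = fc_out.weight[target_class].abs()
top_vals, top_units = torch.topk(weights_into_class, k=5)
print("Hidden units with the largest weights into class 7:", top_units.tolist())
```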
My imaginary use case is a large language model used by lawyers to help with research. LLMs that are initially trained on the whole internet are going to see all sorts of legal-sounding data. Some of it will be fictional, from movies and books, and a lot will come from other countries with different laws. If a model forgets this before fine-tuning on relevant data, it could be much more likely to return useful results. You can think of other examples like this; for me it helps to explain and visualise the concepts.
In this case I would have to wonder whether this technique would simply be suppressing the output of words like ‘jury’ and ‘case’, or whether it would be removing the associations between them internally. The difference between the two would be important here because, when retraining, if the words are just reintroduced and the unwanted associations are still there, the unwanted behaviour would come straight back out. Suppressing the outputs is not necessarily bad, because, in my limited experience, performance can be recovered with retraining. It would just mask other things going on.
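For what it is worth, output suppression on its own is mechanically very simple. This sketch is purely illustrative, with made-up token ids rather than anything from a real tokenizer, but it shows why suppression alone leaves the internal associations untouched:

```python
import torch

# Purely illustrative: what "just suppressing the output" looks like for an LLM.
# Assume we already have next-token logits from some model, and a hypothetical
# list of token ids for the words we want to forget (e.g. 'jury', 'case').
vocab_size = 50_000
logits = torch.randn(vocab_size)      # stand-in for real model output
banned_token_ids = [1234, 5678]       # hypothetical ids for 'jury', 'case'

# Mask the banned tokens so they can never be sampled. The model's internal
# associations are untouched, so retraining on legal data could bring the
# unwanted behaviour straight back.
logits[banned_token_ids] = float("-inf")
next_token = torch.argmax(logits).item()
```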
With MNIST it seems that there is some sort of internal representation that I can target with this pruning method. Even if it is suppressing outputs, the method is finding the parameters that are associated with one class. These ideas of internal representation vs. suppressed outputs are just something I came up with, and there is going to be some overlap between the two. I should defend why I am looking for the downside of what seems like a very successful experiment: it helps me think about what I am going to look for in the next experiment.
Next
The next blog looks at how this method of pruning affects output values.