Sparsity

I do experiments on sparse neural networks, and I believe that working with sparsity is one way to progress machine learning. Sparse neural networks are simply neural networks that have some of their weight values set to 0. I have a novel method for pruning neural networks. I have written a couple of blog posts about it before, but have not published any papers. I have since lost the diagrams that went with those posts, so they are no longer online. They were also too long and not very focused. I will cover everything they said here and in future posts, so no real loss.
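To make the definition concrete, here is a minimal sketch of what that looks like in NumPy (this is just an illustration of a sparse weight matrix, not my pruning method):

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense weight matrix paired with a binary mask that zeroes most entries.
weights = rng.standard_normal((4, 4))
mask = rng.random((4, 4)) < 0.25      # keep roughly 25% of the weights
sparse_weights = weights * mask       # pruned weights are exactly 0

print(sparse_weights)
print("sparsity:", 1 - np.count_nonzero(sparse_weights) / sparse_weights.size)
```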

Neural networks are often described as being modelled on the human brain. The analogy is far from perfect, but taking inspiration from what we know of the brain can still be useful. The tens of billions of neurons in the human brain are not all connected to each other (unlike a fully connected, dense, artificial neural network); their connectivity more closely resembles the structure of a sparse neural network. Convolutional neural networks capture a structured kind of sparsity to achieve better performance on image recognition tasks.
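One way to see that: a convolution layer is equivalent to a large, mostly-zero weight matrix with a repeating banded structure. A rough 1-D sketch (sizes and kernel values here are arbitrary, just for illustration):

```python
import numpy as np

# A 1-D convolution written as a matrix: each output connects to only a few
# inputs, and the same kernel weights repeat along the band.
kernel = np.array([0.2, 0.5, 0.3])
n_in = 8
n_out = n_in - len(kernel) + 1        # "valid" convolution, no padding

W = np.zeros((n_out, n_in))
for i in range(n_out):
    W[i, i:i + len(kernel)] = kernel  # banded, mostly-zero structure

x = np.arange(n_in, dtype=float)
print(W @ x)                                        # matrix form
print(np.convolve(x, kernel[::-1], mode="valid"))   # same result via convolve
```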

My software and experiments

The software I have made for this is built with NumPy, CuPy (NumPy on a GPU), SciPy (for sparse operations), and Numba (for custom GPU kernels). Dense matrix operations on a GPU are the fastest way to run things, because the kernels are built and optimised by the teams who make the chips, and they are very good at their jobs. In practice, this means the fastest way for me to run these experiments is to use dense matrices with pruned values set to 0, along with a couple of tricks that look less efficient on paper but are faster in practice.
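As a rough sketch of the dense-with-zeros trick (illustrative only; it uses NumPy so it runs anywhere, and swapping np for CuPy moves the same code to the GPU):

```python
import numpy as np

rng = np.random.default_rng(1)

# Dense weight matrix with a fixed binary mask of kept connections (~5% non-zero).
W = rng.standard_normal((256, 128)).astype(np.float32)
mask = (rng.random(W.shape) < 0.05).astype(np.float32)
W *= mask

def forward(x, W):
    # Ordinary dense matmul plus ReLU; the pruned weights are just zeros.
    return np.maximum(x @ W, 0.0)

def update(W, grad_W, lr=0.01):
    # Plain gradient step, then re-apply the mask so pruned weights stay at 0.
    W -= lr * grad_W
    W *= mask
    return W
```

The matrices stay dense, so the heavily optimised GPU kernels do all the work, while the mask keeps the model effectively sparse.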

These experiments work as a proof of concept: everything I have built, for both sparse matrices and full matrices with the tricks above, produces the same results; one is just faster. If someone were to optimise the sparse code, many parts would run faster at high levels of sparsity. That could be worthwhile for someone with more resources than me.
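The kind of check I mean looks like this (a simplified sketch using SciPy): the masked dense matrix and its sparse counterpart give identical results, one representation is just faster at the right sparsity level.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)

# A ~95% sparse weight matrix, stored both ways.
W_dense = rng.standard_normal((512, 256)) * (rng.random((512, 256)) < 0.05)
W_sparse = sparse.csr_matrix(W_dense)      # same values, compressed storage

x = rng.standard_normal((32, 512))

dense_out = x @ W_dense                    # dense matmul with zeros
sparse_out = x @ W_sparse                  # sparse matmul, skips the zeros

print(np.allclose(dense_out, sparse_out))  # True: identical results
```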

My experiments are also mostly done on small models with the MNIST dataset. If that was good enough for the decades of machine learning research that led to the field we know today, it is good enough for me. I do often write about the constraints of using such small models and data, but they work well as a proof of concept.

Performance and practical concerns

Sparse operations only pay off for matrices with a high degree of sparsity. A matrix with 80% zero values won’t see the speed and space improvements that a 95% sparse matrix will. I have read papers with headline claims of massive sparsity, but a better indicator is the total number of remaining parameters, or a breakdown of the shape and sparsity of each layer. If you are reading the paper to begin with, that is not difficult information to digest. Claiming a model is 99.5% sparse means little when the model was massively oversized to begin with.
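The kind of breakdown I would rather see looks something like this (the helper below is a hypothetical sketch, not from any library):

```python
import numpy as np

def sparsity_report(layers):
    # Print shape, non-zero count, and sparsity per layer, plus overall totals.
    total, total_nnz = 0, 0
    for name, W in layers.items():
        nnz = np.count_nonzero(W)
        total += W.size
        total_nnz += nnz
        print(f"{name}: shape={W.shape}, non-zero={nnz}, "
              f"sparsity={1 - nnz / W.size:.1%}")
    print(f"overall: {total_nnz} non-zero of {total} parameters "
          f"({1 - total_nnz / total:.1%} sparse)")

rng = np.random.default_rng(3)
# A 99.5% sparse but massively oversized layer can still hold more non-zero
# parameters than a small dense layer.
oversized = rng.standard_normal((3000, 3000)) * (rng.random((3000, 3000)) < 0.005)
small_dense = rng.standard_normal((784, 32))
sparsity_report({"oversized sparse": oversized, "small dense": small_dense})
```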

I am looking at ways of making training more efficient, but at this stage using a well established library with GPUs is the fastest option. There are still advantages to using a sparse model. Inference can be faster using sparse models and sparse operations on a CPU, for example when serving a model on a machine that doesn’t have a GPU. Models can easily be trained on big machines, then pruned and converted into sparse models for everyday inference. I have not tried converting a TensorFlow or PyTorch model to a sparse model, but I have trained models and converted them back and forth between dense and sparse, and between GPU and CPU data, without any issue.
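A simplified sketch of what that dense-to-sparse conversion looks like with SciPy (the layer sizes here are just illustrative, MNIST-shaped):

```python
import numpy as np
from scipy import sparse

def to_sparse_model(dense_weights):
    # Convert pruned dense weight matrices to CSR for CPU inference.
    return [sparse.csr_matrix(W) for W in dense_weights]

def predict(x, sparse_weights):
    # Forward pass using sparse matmuls; ReLU on all but the last layer.
    for i, W in enumerate(sparse_weights):
        x = x @ W
        if i < len(sparse_weights) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(4)
# e.g. a pruned 784 -> 128 -> 10 model, ~95% sparse per layer
dense = [rng.standard_normal((784, 128)) * (rng.random((784, 128)) < 0.05),
         rng.standard_normal((128, 10)) * (rng.random((128, 10)) < 0.05)]
model = to_sparse_model(dense)
print(predict(rng.standard_normal((1, 784)), model).shape)  # (1, 10)
```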

Randomly initialized sparse models

An experiment I tried on sparse neural nets shows that structure is important: randomly initialized sparse nets don’t train well at all. The experiment doesn’t warrant a big write-up. All I did was randomly initialize large sparse models, on the order of 5,000 activations, with about 5% non-zero values. There was some learning happening, but performance was far below anything reasonable. Much smaller fully connected models pruned to the same sparsity (~5% non-zero values) perform similarly to other neural networks. This suggests that there is some structure that is found during training. Unfortunately, it means we either need to know a useful structure in advance, as with CNNs, or start with full models and prune them, removing the weights that aren’t being used.
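For reference, “randomly initialized sparse” here means something like the following sketch (the sizes are roughly those of the experiment, not the exact configuration):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(5)

n_units = 5000      # roughly the scale mentioned above
density = 0.05      # ~5% non-zero connections

# Random connectivity pattern AND random values, chosen before any training.
W = sparse.random(n_units, n_units, density=density, format="csr",
                  random_state=rng, data_rvs=rng.standard_normal)

print(W.nnz, "non-zero weights out of", n_units * n_units)
```

The connectivity pattern is fixed at random, which seems to be exactly the structure that a pruned dense model finds during training and a random one lacks.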

What next for sparse models?

I have done lots of work on pruning models, but I have not pruned any model to the point where the sparse representation would have the advantage. This is partly down to hardware constraints, but also because the ideas I have followed up on have not warranted it. I would like to come back to it at some point, though. The hardware constraint is really that I would need to start with a much bigger model, then prune until the sparsity becomes an advantage, and I don’t have the resources to do that at this stage. I would also need to do more work to understand what structure to look for.

