I have built a Python library for training artificial neural networks (NNs) called Sparana. The name is short for sparse parameter analysis. I started with a library using TensorFlow that made it easier for me to look at parameters and prune them. I found that TensorFlow did not support optimization over sparse weights, which seemed like an interesting idea to pursue, so I started a new library doing things from scratch, sort of. Sparana is built using CuPy for GPU operations, NumPy for most basic maths, and SciPy for sparse operations; CuPy also supports sparse operations on the GPU. There are some custom GPU operations written using Numba.
This is not as fast as other libraries like PyTorch or TensorFlow; I am one person developing this part time. If you want something fast and stable, with lots of developers and users, use one of those. If you are interested in using sparse data structures, or in building your own system, read on and learn from my mistakes.
Here is a notebook showing how a model is initialized and trained. I am working on more detailed documentation about how to use all of the features. The model class contains the layers and has a few functions for a user to call, mostly things like getting the outputs or the accuracy of a model. The layer objects store the weights and biases, and contain all of the linear algebra to do forward passes and to calculate gradients for backward passes. Gradient information is passed to an optimizer class, which updates the parameters. It is quite simple, and not fast, but it has given me a good understanding of backpropagation and of how optimizers work. I am self-taught: I studied mathematics at university, but that did not involve very much coding, so this was partly a project in teaching myself how to code. I have done some experiments on adversarial attacks, catastrophic forgetting, and transfer learning, and learned some interesting things.
Adding dropout
This section is about writing code. It may be interesting to someone working on a similar project, or someone looking to modify this code for their own project. It is an example of how I go about expanding/modifying the functions of this library, and it should give you some idea of how information moves from one part to another. This section covers the model and layer classes.
I had some success implementing dropconnect in experiments using TensorFlow, so I implemented it in this library. I never got it working quite right; I may have been calculating the gradients incorrectly. Dropconnect is similar to dropout, but it removes weights instead of activations. This makes it more computationally intensive, since there are more weights than activations. I have not read about it being widely used, and dropout probably works well enough. My new focus is on efficiency, so I will not have much use for dropconnect, especially if it doesn't work properly. Because dropconnect was already implemented, some of the code is already there; I need to move some things around, and add and remove some things. This will not be a large undertaking, but it will show how a model performs forward and backward passes.
Forward pass
First of all, dropout is defined in the model object and passed to the layer objects when they are initialized. This code starts at line 10 of model.py. It is the init function, which is run when the class is initialized and defines all of the variables this class will use.
class model:
    def __init__(self, input_size, layers, dropout = None, comp_type = 'GPU'):
        self._input_size = input_size
        self._layers = layers
        # Can set the dropout value for the whole network.
        self._dropout = dropout
        for layer in self._layers:
            layer._dropout = self._dropout
        self._comp_type = comp_type
        self._depth = len(layers)
        # Infer layers type from first layer, important here for transposing the input/output
        # Not planning to use mixed(full, and sparse) layers.
        self._layer_type = layers[0]._layer_type
I don't need to change anything here; this just passes the dropout value to the layer objects. The dropout value is a number from 0 to 1 representing the ratio of activations to be dropped out. I am not always very good at commenting my code, but this section has descriptive variable names. The model class is not very complicated; most of the work is done elsewhere. Next I need to work on the layer objects. There are several different layer objects, and I am not going to show them all; this is the activate function of the full relu layer.
def activate(self, inputs):
    if self._comp_type == 'GPU':
        self._outputs = cp.dot(inputs, self._weights)
    if self._comp_type == 'CPU':
        self._outputs = inputs@self._weights
    self._outputs = self._outputs + self._biases
    self._relu = self._outputs>0
    self._outputs = self._outputs*self._relu
    # Apply dropout to the outputs of this layer
    if self._dropout:
        if self._comp_type == 'CPU':
            self._dropout_mask = np.random.binomial(1, 1-self._dropout, size = self._outputs.shape)
        if self._comp_type == 'GPU':
            self._dropout_mask = cp.random.binomial(1, 1-self._dropout, size = self._outputs.shape)
        self._outputs = self._outputs*self._dropout_mask
    return self._outputs
The last section, after if self._dropout:, generates a random array of 1s and 0s. The ratio of 0s is the dropout ratio: a dropout ratio of 0.4 gives 40% 0s. When this is multiplied by the output it effectively drops out 40% of the activations. The mask is stored as a self variable to be used when calculating gradients.
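Here is a standalone illustration of the mask using plain NumPy. This is not Sparana code, and the variable names are just for the example.

import numpy as np

# Standalone sketch of the dropout mask, using NumPy only.
dropout = 0.4                                    # ratio of activations to drop
outputs = np.random.randn(3, 5)                  # stand-in for a layer's outputs

# Each entry of the mask is 0 with probability 0.4 and 1 with probability 0.6.
dropout_mask = np.random.binomial(1, 1 - dropout, size = outputs.shape)

# Multiplying by the mask zeroes roughly 40% of the activations.
print(outputs * dropout_mask)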
For the linear layer, the following line is not included:
self._outputs = self._outputs*self._relu
For the softmax layer, the following lines are added:
self._outputs = np.exp(self._outputs)
self._outputs = self._outputs/(np.sum(self._outputs, axis = 1)).reshape(len(self._outputs), 1)
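Here is a quick standalone check (plain NumPy, not the layer code itself) that the reshape divides each row by its own sum, so every row of the output becomes a probability distribution:

import numpy as np

# Standalone check of the softmax lines above, using NumPy only.
outputs = np.random.randn(4, 10)                 # stand-in for pre-softmax outputs
outputs = np.exp(outputs)
outputs = outputs/(np.sum(outputs, axis = 1)).reshape(len(outputs), 1)

print(np.sum(outputs, axis = 1))                 # each row now sums to 1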
The model class loops through all of the layers: the inputs are passed to the first layer's activate function, the first layer's outputs are passed to the second layer, and so on.
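Conceptually, that loop looks something like this. This is a minimal sketch, not the actual Sparana model code.

# Minimal sketch of the forward-pass loop, not the actual model method.
def forward_pass(layers, inputs):
    outputs = inputs
    for layer in layers:
        # Each layer stores its own outputs, which become the next layer's
        # inputs and are reused as the stored inputs for the backward pass.
        outputs = layer.activate(outputs)
    return outputs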
Backward pass
This is where the gradients are calculated. The optimizer object calls the get_gradients function in the layer objects, passing in the inputs used in the forward pass (which are stored in the previous layer object) and the layer error, which is passed back from the error function through the layers. This bit of code starts at line 88 of layers.py.
def get_gradients(self, layer_inputs, layer_error):
    ''' Returns an array for weights, and biases, and one for the previous layer'''
    if self._dropout:
        layer_error = layer_error*self._dropout_mask
    if self._comp_type == 'CPU':
        layer_error = layer_error*self._relu
        bias_gradients = np.sum(layer_error, axis = 0)
        weight_gradients = layer_inputs.transpose()@(layer_error)
        previous_layer_error = layer_error@self._weights.transpose()
    if self._comp_type == 'GPU':
        layer_error = layer_error*self._relu
        bias_gradients = cp.sum(layer_error, axis = 0)
        weight_gradients = cp.dot(layer_inputs.transpose(), layer_error)
        previous_layer_error = cp.dot(layer_error, self._weights.transpose())
    return weight_gradients, bias_gradients, previous_layer_error
The lines that I have added here are

if self._dropout:
    layer_error = layer_error*self._dropout_mask
The rest of this calculates the gradients for this layer's parameters, and the error to be passed back to the previous layer. I also removed some lines that were used for dropconnect. This is for the relu layer; for a linear layer the line

layer_error = layer_error*self._relu

is not included. I have included more lines here than you may think are needed. Keeping all GPU and CPU operations separate, even when the code is written the same, makes it easier to debug. Lines are free.
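Here is a short standalone sketch (plain NumPy, with made-up numbers) of what that relu line does: the mask saved during the forward pass zeroes the error wherever a unit was inactive, so no gradient flows through those activations.

import numpy as np

# Standalone sketch of the ReLU mask in the backward pass, using NumPy only.
outputs = np.array([[1.5, -0.3, 0.0, 2.0]])      # stand-in pre-activation outputs
relu = outputs > 0                               # mask saved during the forward pass
layer_error = np.array([[0.2, 0.7, -0.1, 0.4]])  # stand-in error from the next layer

# Only the entries where the unit was active keep their error; the rest become zero.
print(layer_error * relu)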
The get_gradients function is called by a function in the optimizer file, also called get_gradients. The optimizer's get_gradients function loops backwards through the layers, passing previous_layer_error back to the previous layer. Here is where clear variable names come in handy. The weight gradients and bias gradients are stored, and used to update the weights and biases in different ways for different optimizers.
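Conceptually, the optimizer's loop looks something like this. This is a minimal sketch using the layer functions shown above, not the actual Sparana optimizer code, which differs in its details.

# Minimal sketch of the backward loop, not the actual optimizer code.
# layer_error starts as the derivative of the loss with respect to the
# final layer's outputs.
def get_all_gradients(layers, model_inputs, layer_error):
    gradients = []
    for i in reversed(range(len(layers))):
        # The inputs to layer i are the outputs stored by the previous
        # layer, or the model inputs for the first layer.
        layer_inputs = layers[i-1]._outputs if i > 0 else model_inputs
        weight_gradients, bias_gradients, layer_error = layers[i].get_gradients(layer_inputs, layer_error)
        gradients.append((weight_gradients, bias_gradients))
    # Reverse so the gradients line up with the layer order.
    return gradients[::-1]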
Sparse layers
Sparse layers look a bit different, but the output arrays are still dense NumPy or CuPy arrays, and dropout masks are applied in the same way. Gradients for sparse weight matrices are calculated differently, but dropout does not affect this. I will write another blog in the future about trying to increase the efficiency of sparse arrays, which will give a better understanding of how they work.
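Here is a standalone sketch of that point using SciPy and NumPy only (not Sparana code): multiplying a dense input batch by a sparse weight matrix still produces a dense output array, so the same dense dropout mask applies to it.

import numpy as np
from scipy import sparse

# Standalone sketch: sparse weights, dense outputs, same dropout mask.
dropout = 0.4
inputs = np.random.randn(8, 100)                           # dense batch of inputs
weights = sparse.random(100, 50, density = 0.1, format = 'csr')

outputs = inputs @ weights                                 # result is a dense array
dropout_mask = np.random.binomial(1, 1 - dropout, size = outputs.shape)
outputs = outputs * dropout_mask
print(type(outputs), outputs.shape)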
What’s next
This is the only blog I have planned about writing code for my software library. I will include code in future blogs and notebooks, but adding features like dropout is not all that interesting to read about, and it is not really the focus of my blogs. The code I include in other blogs will mostly be limited to novel algorithms.