Sorry for the long wait. I’m Ethan Block, and this is part 4 of an ongoing series about my artificial intelligence project. I’m joined by Clark Hubbard, who I’m sure loves artificial intelligence as much as I do.
As I mentioned in my second post about this artificial intelligence project, there are many different kinds of neural networks, and beyond that, many different approaches to artificial intelligence in general. Here, I hope to explain at least a few of the methods I am using to create Madison.
You know who else tried to create life? Dr. Frankenstein. That guy had a crappy life. You don’t want to do this.
Um, yes I do.
Back to neural networks: First, there’s the multilayer perceptron, where the input data is fed through multiple layers and transformed by an activation function in each hidden layer. If it’s a deep neural network, there are multiple hidden layers, allowing for more complex problem processing and learning. The weights in the network, which multiply the data as it passes forward from node to node, are usually adjusted during training by a process called backpropagation, which passes the output error backward through the layers to work out how each weight should change. Other approaches exist, but I will save those for a later time.
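Here’s a minimal sketch of that forward pass. The layer sizes (3 inputs, 4 hidden nodes, 2 outputs) and the sigmoid activation are my own choices for illustration, not anything specific to Madison:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 inputs -> 4 hidden nodes -> 2 outputs.
# In a real network these weights would be learned via backpropagation.
W1 = rng.normal(size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 2))
b2 = np.zeros(2)

def forward(x):
    hidden = sigmoid(x @ W1 + b1)    # hidden layer applies the activation
    return sigmoid(hidden @ W2 + b2) # output layer

out = forward(np.array([0.5, -1.0, 2.0]))
```

A deep network would simply stack more hidden layers between `W1` and `W2`, each with its own weights and activation.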
Is this basically what we described in the second post?
Another type is the recurrent neural network, which works much like the previous one, but with a significant difference: it feeds the output data of the previous timestep back into itself. By this, I mean that you feed some data into it, and it updates, propagating the data forward through the layers until it reaches the output layer. On the next iteration, it uses both the new input data you are feeding it and the activations of some of its neurons from the last update to figure out the new output. This allows for neural networks that can make correlations between old and new data.
So instead of solely relying on new information, the recurrent neural network will re-use old information to help it figure things out, as well as the new info?
Precisely, Clark. But the basic recurrent model is limited: if the gap between one update and another is too great (say, between update 1 and update 1000), the error signal fades as it is passed back through so many steps (the vanishing gradient problem), and the network is unable to learn correlations across that distance.
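The recurrent update described above can be sketched in a few lines. The sizes, the tanh activation, and the weight names are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_hidden = 3, 5  # arbitrary sizes for illustration
W_x = rng.normal(size=(n_in, n_hidden))      # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden: the recurrence
b = np.zeros(n_hidden)

def rnn_step(x, h_prev):
    # Combine the new input with the previous timestep's hidden state.
    return np.tanh(x @ W_x + h_prev @ W_h + b)

h = np.zeros(n_hidden)                 # initial state: nothing remembered yet
for x in rng.normal(size=(4, n_in)):   # a short sequence of 4 inputs
    h = rnn_step(x, h)
```

Because `h` is fed back in at every step, the network’s output depends on the whole sequence so far, not just the latest input.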
Long Short-Term Memory networks, introduced in 1997 by Hochreiter and Schmidhuber, do not have this problem. Their secret is a memory cell that stores information between updates. You can think of this cell as a conveyor belt that carries the data from one update to the next. The information stored within is carefully controlled by several neural network layers called “gates”. First, the forget gate, which decides how much of the stored information should be kept for the next update. Then, the input gate, which determines how much new information to store in the memory cell. Then, the block input, which produces the candidate values that, scaled by the input gate, are actually written into the cell. And lastly, the output gate, which decides what information to send from the memory cell to the output.
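The four gates just described can be sketched as one update step. This is a simplified illustration (small arbitrary sizes, no bias terms), not production LSTM code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n_in, n_cell = 3, 4  # arbitrary sizes for illustration
# One weight matrix per gate, each acting on [input, previous hidden state].
Wf, Wi, Wg, Wo = (rng.normal(size=(n_in + n_cell, n_cell)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(z @ Wf)     # forget gate: how much of the cell to keep
    i = sigmoid(z @ Wi)     # input gate: how much new info to admit
    g = np.tanh(z @ Wg)     # block input: candidate values to store
    o = sigmoid(z @ Wo)     # output gate: what to expose from the cell
    c = f * c_prev + i * g  # the "conveyor belt" cell state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_cell), np.zeros(n_cell)
h, c = lstm_step(rng.normal(size=n_in), h, c)
```

The key line is `c = f * c_prev + i * g`: the cell state is carried forward mostly unchanged, which is what lets information survive across many updates.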
Is this the typical one used in the modern day? What are the pros and cons to this sort of approach?
Yes, LSTMs are widely used today in many commercial products. These networks can make correlations in the data across arbitrarily long gaps between updates, but they can be computationally costly thanks to their complex architecture.
That was complicated. The next network I will explain is less so. It is the self-organizing map, which uses several interesting techniques to map one set of values onto another. It is a (usually) two-dimensional grid of nodes, each of which contains a vector of numbers referred to as “weights” (but don’t confuse them with the weights of a typical neural network). These weights are usually assigned at the beginning, representing a certain set of data. When you feed data into the self-organizing map, usually a collection of vectors, the map searches its own grid of nodes for the best matching unit: the node whose weights are closest to the current input vector. When it finds said unit, it adjusts the weights of all nodes in a radius around it to bring them a little bit closer to the input vector.
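One training step of that process can be sketched as follows. The grid size, learning rate, radius, and Gaussian neighborhood are illustrative choices; real SOMs usually shrink the radius and learning rate over time:

```python
import numpy as np

rng = np.random.default_rng(3)
grid_h, grid_w, dim = 8, 8, 3  # an 8x8 map of 3-dimensional weight vectors
weights = rng.random((grid_h, grid_w, dim))

def train_step(x, learning_rate=0.5, radius=2.0):
    # Find the best matching unit: the node whose weights are closest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Pull every node toward the input, with nodes near the BMU on the
    # grid moving the most (a Gaussian neighborhood around the BMU).
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    weights[:] = weights + learning_rate * influence[..., None] * (x - weights)

x = np.array([1.0, 0.0, 0.0])  # e.g. a pure-red color vector
for _ in range(20):
    train_step(x)
```

After a few steps, a patch of the grid around the best matching unit has drifted toward the input, which is how similar inputs end up clustered in the same region of the map.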
So what’s the use of this? Are these sort of like XY Axes from Algebra or am I way off?
Actually, Clark, you’re on the right track. Once the weights have been mapped closer to the input vectors, you have successfully combined two different datasets. For instance, here is a color-coded dataset of poverty levels in the world (the countries with higher average income are on the top left, while the countries suffering the worst poverty are towards the bottom right):
And after running this through a self-organizing map with an input consisting of a world map, this is what you get:
So, as you can see, self-organizing maps are great at sorting data into specific areas.
I think that’s enough for today. Next post, I’ll talk about more types of machine learning algorithms that are being used in my artificial intelligence project.
Be sure to follow Clark on Twitter at @Classic_Clark. Until next time!
Further reading:
Some more advanced explanations of neural network types by Christopher Olah
A good article on recurrent neural networks by Andrej Karpathy
An article on long short-term memory by Christopher Olah (his blog is fantastic)