Age of Extreme AI | Where did we go Wrong?
The year 1950. The world is healing from the deep wounds of World War 2, and the tale of Enigma had proven to all that investment in Intelligent Machines was as necessary as investment in arms and ammunition.
There again was Alan Turing, in 1950, with a program to play Chess. It was among the earliest attempts at intelligence in a Machine.
1957. Frank Rosenblatt arrives with his Perceptron, an electronic device that mimics the biological nervous system. It can tell a “Circle” from a “Square”, a “Man” from a “Woman”. This is widely regarded as the first Neural Network.
This is a time when Video Games do not yet have Mario, and no one has even dreamt of Star Wars.
By the time Mario is released, AI has the backpropagation algorithm.
The next big leap comes in 2009 with ImageNet, which brings 1.2M Human Labelled Images.
By 2015, Neural Networks are performing better than humans at image recognition.
“The Age of Extreme AI had arrived.”
But this comes at a cost. Neural Networks are now huge and deep. So deep that training a single Neural Network requires many millions of calculations.
On top of that, we are trapped by the Von Neumann Bottleneck: our computation speeds are often limited by the speed of data retrieval from memory.
Neural Networks have a huge number of Activation Values and Weights, which are stored in memory. Moving them from memory to the processor is not only slow but also energy intensive. More energy is spent retrieving data from memory than actually computing on it.
This leads to two problems: slower systems and wasted energy.
Training one modern Neural Network consumes as much energy as 3 households do in 1 year.
As if that were not enough, our ability to pack more and more transistors onto a chip is approaching a dead end. Physically, our transistors have almost reached the scale of atoms, and shrinking them further seems to be fundamentally impossible.
We are plagued by Slow Memory Retrieval, Energy Wastage and limited CPU power.
All this has raised a question: can Digital Devices really suit Neural Networks, which themselves behave like Analog Systems?
For example, Rosenblatt’s Perceptron was an Analog device that was used to create a neural network.
Looking closely, a Neural Network is a huge collection of Multiply and Accumulate (MAC) operations, which can be modeled more naturally with Analog devices.
The Weights are like the Resistance (more precisely its inverse, the Conductance) and the Activation values are the Voltage.
Suppose a node fires an activation value V, which appears as a voltage in the circuit and meets a resistance R.
The Current (I) is then given by Ohm's law: I = V/R = V·G, where G = 1/R is the Conductance.
Connect the wires from different nodes at one point to combine the Currents. The total Current gives the input to the next node.
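To make the multiply-and-accumulate picture concrete, here is a minimal sketch of one node's input computed exactly as described: each wire contributes a current V/R, and the currents add up where the wires meet. The function name and the voltage and resistance values are illustrative assumptions, not measurements from any real device.

```python
def node_input_current(voltages, resistances):
    """Sum the per-wire currents I = V / R = V * G arriving at one node."""
    conductances = [1.0 / r for r in resistances]  # G = 1 / R
    return sum(v * g for v, g in zip(voltages, conductances))

# Three upstream nodes fire 1.0 V, 0.5 V and 2.0 V (illustrative values).
voltages = [1.0, 0.5, 2.0]
# Each wire has a resistance, in ohms, that encodes a weight (illustrative values).
resistances = [2.0, 4.0, 8.0]

total_current = node_input_current(voltages, resistances)
print(total_current)  # 0.5 + 0.125 + 0.25 = 0.875
```

The multiplication is done by Ohm's law and the addition by joining the wires, so no separate arithmetic unit is involved.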
While the concept of Analog Systems for AI may seem very convincing, there are a lot of challenges. The world of Analog is plagued by manufacturing variations that add at least 1% error to the calculations.
However, in the world of Classifier Neural Networks, a 1% error rate is not going to be a deal breaker, and perhaps we can proceed with Analog Devices.
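A rough way to get a feel for this is to simulate it. The sketch below assumes, purely for illustration, that every conductance is off by a random amount of up to ±1% and that all voltages are non-negative; real device errors may follow a different distribution. The helper names and the array size of 64 are hypothetical.

```python
import random

def mac(voltages, conductances):
    """Multiply and accumulate: the summed current of all wires."""
    return sum(v * g for v, g in zip(voltages, conductances))

def perturb(conductances, tolerance, rng):
    """Return conductances each scaled by a random factor within +/- tolerance."""
    return [g * (1.0 + rng.uniform(-tolerance, tolerance)) for g in conductances]

rng = random.Random(0)  # fixed seed so the run is repeatable
voltages = [rng.uniform(0.0, 1.0) for _ in range(64)]
conductances = [rng.uniform(0.1, 1.0) for _ in range(64)]

ideal = mac(voltages, conductances)
noisy = mac(voltages, perturb(conductances, 0.01, rng))
relative_error = abs(noisy - ideal) / ideal
print(f"relative error of the summed current: {relative_error:.4%}")
```

Under these assumptions the summed current can never be off by more than 1%, and in practice the individual errors partly cancel. A classifier only needs the winning output to stay ahead of the runner-up, so a deviation of this size changes the answer only when two classes were nearly tied to begin with.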
The difficulty is that we cannot simply assemble a Perceptron-style circuit today. It would be too complicated at the modern depth of Neural Networks. Also, we cannot keep varying the weights (Resistances) manually. That needs to happen automatically and fast, to meet real-time requirements.
Do we have a workaround yet? Yes, we do: Phase Change Memory (PCM) Devices.
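PCM devices are a type of non-volatile memory, used in much the same role as Flash memory. The cell described here is the floating-gate Flash cell, which contains two Gates (Control Gate, Floating Gate). A PCM cell reaches the same end, a programmable resistance, by switching its material between a high-resistance amorphous phase and a low-resistance crystalline phase.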
When a large positive voltage is applied to the Control Gate, electrons flow into the Floating Gate, creating high resistance and blocking electron flow. This is called the 0 State. When the Floating Gate has no electrons, electrons can flow easily in the circuit. This is called the 1 State. This is their Digital Computer Application.
To apply them to Analog Computers, we control the number of electrons in the Floating Gate.
When the number of electrons in the Floating Gate is programmed and controlled, the resistance offered is also controlled.
This acts as a great variable resistor.
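The idea of a cell as a programmable resistor can be sketched with a toy model. Everything here is an assumption made for illustration: the class name, the evenly spaced charge levels, and the linear drop in conductance as the gate fills up. Real cells are neither linear nor this tidy.

```python
class ToyFloatingGateCell:
    """Toy model: more stored charge means lower conductance (assumed linear)."""

    def __init__(self, g_min, g_max, levels):
        self.g_min = g_min    # conductance with a full floating gate
        self.g_max = g_max    # conductance with an empty floating gate
        self.levels = levels  # number of distinct charge levels we can program
        self.step = (g_max - g_min) / (levels - 1)
        self.level = 0        # start with an empty floating gate

    @property
    def conductance(self):
        return self.g_max - self.level * self.step

    def program(self, target_conductance):
        """Store the charge level whose conductance is closest to the target."""
        clamped = min(max(target_conductance, self.g_min), self.g_max)
        self.level = round((self.g_max - clamped) / self.step)
        return self.conductance

cell = ToyFloatingGateCell(g_min=0.0, g_max=1.0, levels=5)
cell.program(0.7)
print(cell.level, cell.conductance)  # 1 0.75
```

With `levels=2` the same model collapses to the digital case: an empty gate is the 1 State and a full gate is the 0 State. Allowing more levels is what turns the memory cell into a weight.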
Now that the PCM modules are ready, they can be assembled as shown in the figure.
The Activation Value from each node will be sent as a Voltage.
The Current (I) will be calculated as: I = V/R = V·G
Multiple Currents will combine to get the Input value for the next node.
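Seen as a whole, the array computes one full layer at once. The sketch below is a software stand-in for that behaviour, assuming a grid in which the cell at row i and column j connects input node i to output node j. The function name, the grid layout and the values are illustrative assumptions.

```python
def crossbar_currents(voltages, conductance_grid):
    """Return the summed current on each output column of the grid.

    conductance_grid[i][j] is the conductance of the cell that connects
    input row i to output column j.
    """
    n_columns = len(conductance_grid[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductance_grid))
        for j in range(n_columns)
    ]

# Two input nodes feeding two output nodes (illustrative values).
voltages = [1.0, 2.0]
conductance_grid = [
    [0.5, 0.25],
    [0.25, 1.0],
]

print(crossbar_currents(voltages, conductance_grid))  # [1.0, 2.25]
```

In software this is an ordinary matrix-vector product that loops over stored numbers. In the analog array the same result appears as currents on the output wires, with the weights sitting still inside the cells.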
Because the weights stay inside the very devices that perform the multiplication, this arrangement eliminates the Fetch from Memory step, and with it the Von Neumann Bottleneck.
There are huge savings on energy as well, as the following quote from IBM shows:
“Moving 64 bits of data from DRAM to CPU consumes 1–2nJ, which is 10,000–2,000,000 times more energy than is dissipated in a PCM device performing a multiplication operation (1–100fJ)”
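The range in the quote can be checked with simple arithmetic. The sketch below uses only the figures quoted above, converts everything to femtojoules (1 nJ = 1,000,000 fJ) and compares the cheapest memory transfer with the most expensive multiplication, and vice versa. The variable names are my own.

```python
FJ_PER_NJ = 1_000_000  # 1 nanojoule = 1,000,000 femtojoules

# Moving 64 bits from DRAM to CPU: 1 to 2 nJ (from the quote).
dram_min_fj = 1 * FJ_PER_NJ
dram_max_fj = 2 * FJ_PER_NJ

# One multiplication in a PCM device: 1 to 100 fJ (from the quote).
pcm_min_fj = 1
pcm_max_fj = 100

lowest_ratio = dram_min_fj // pcm_max_fj   # cheapest transfer vs costliest multiply
highest_ratio = dram_max_fj // pcm_min_fj  # costliest transfer vs cheapest multiply
print(lowest_ratio, highest_ratio)  # 10000 2000000
```

The two results match the 10,000 to 2,000,000 range in the quote.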