Google Brain’s new super fast and highly accurate AI: the Mixture of Experts Layer.

Théo Szymkowiak
3 min read · Mar 12, 2017


Conditional training on unreasonably large networks.

One of the big problems in Artificial Intelligence is the gigantic number of GPUs (or machines) needed to train large networks.

The training time of a neural network grows roughly quadratically (think squared) with its size: a larger network costs more per example, and it also needs more training data to exploit its extra capacity. Part of the problem is how the network is trained: for every example, the entire network is updated, even though some parts might not even activate while processing that particular example.

At the same time, the memory of a network depends directly on its size: the larger the network, the more patterns it can learn and remember. So to absorb the enormous amounts of data that corporations like Google and Microsoft have, we have to build giant neural networks.

Well, that was the case until Google Brain released its paper on the Sparsely-Gated Mixture-of-Experts layer.

The Mixture-of-Experts layer, as shown in the original paper.

The rough concept is to keep multiple experts inside the network, each of which is itself a neural network. This looks similar to the PathNet paper; however, in this case there is only one layer of modules.

You can think of experts as multiple humans specialized in different tasks.

In front of those experts stands the Gating Network, which chooses which experts to consult for a given input (named x in the figure).

Training

The MoE (Mixture-of-Experts) layer is trained using back-propagation. The Gating Network outputs an (artificially) sparse vector that selects which experts to consult. More than one expert can be consulted at once (the paper keeps the top k experts per example but does not pin down an optimal value of k).
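For reference, the paper builds this sparse vector with what it calls noisy top-k gating. Roughly, paraphrasing the paper's notation (W_g and W_noise are learned matrices, k is the number of experts kept per example):

```latex
G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)

H(x)_i = (x \cdot W_g)_i + \epsilon_i \cdot \mathrm{Softplus}\big((x \cdot W_{\text{noise}})_i\big),
\qquad \epsilon_i \sim \mathcal{N}(0, 1)

\mathrm{KeepTopK}(v, k)_i =
\begin{cases}
  v_i & \text{if } v_i \text{ is among the top } k \text{ entries of } v \\
  -\infty & \text{otherwise}
\end{cases}
```

The entries pushed to −∞ become exactly zero after the softmax, which is what makes the vector sparse and lets the non-selected experts be skipped entirely.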

The Gating Network also assigns an output weight to each selected expert.

The output of the MoE is then the gate-weighted sum of the expert outputs, y = Σᵢ G(x)ᵢ · Eᵢ(x), where G(x)ᵢ is the gating weight given to expert i and Eᵢ(x) is that expert's output; experts whose gate value is zero are simply never evaluated.
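To make this concrete, here is a minimal NumPy sketch of that forward pass. All names and sizes (n_experts, k, W_gate, and so on) are illustrative, not taken from the paper's code, and the gating noise and load-balancing losses described in the paper are omitted:

```python
# Minimal NumPy sketch of a sparsely-gated Mixture-of-Experts forward pass.
# Names and sizes are illustrative; noise and load-balancing terms are left out.
import numpy as np

rng = np.random.default_rng(0)

n_experts, k = 8, 2                      # total experts, experts consulted per input
d_in, d_hidden, d_out = 16, 32, 16

# Each expert is itself a small neural network (here: two layers with a ReLU).
experts = [(rng.standard_normal((d_in, d_hidden)) * 0.1,
            rng.standard_normal((d_hidden, d_out)) * 0.1)
           for _ in range(n_experts)]
W_gate = rng.standard_normal((d_in, n_experts)) * 0.1   # gating network weights


def expert_forward(params, x):
    """Run one expert network on input x."""
    W1, W2 = params
    return np.maximum(x @ W1, 0.0) @ W2


def moe_forward(x):
    """Route input x through the top-k experts chosen by the gating network."""
    scores = x @ W_gate                   # one gating score per expert
    top = np.argsort(scores)[-k:]         # indices of the k highest-scoring experts
    gate = np.zeros(n_experts)
    gate[top] = np.exp(scores[top]) / np.exp(scores[top]).sum()   # sparse softmax

    # y = sum_i G(x)_i * E_i(x); experts with a zero gate are never evaluated.
    y = np.zeros(d_out)
    for i in top:
        y += gate[i] * expert_forward(experts[i], x)
    return y


print(moe_forward(rng.standard_normal(d_in)))   # a d_out-dimensional vector
```

The key point is the final loop: because the gate is sparse, only the k selected experts are ever evaluated, so the cost per input scales with k rather than with the total number of experts.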

Results

It works surprisingly well.

Take, for example, machine translation from English to French:

The MoE shows higher accuracy (i.e., lower perplexity) than the state of the art while using only 16% of the training time.

Conclusion

This technique lowers training time while achieving better-than-state-of-the-art accuracy. As with PathNet, we can imagine that a future AI will have experts for every kind of data, possibly forming a General Artificial Intelligence.

Maybe in the next iteration of this technology, we will see direct communication between experts to form an Artificial Social Intelligence.

The paper is available on arXiv.
