Architectures have generally been shown to improve by adding width and depth to the previous best model. There is a limit to this, though, since all compute budgets are finite, and so Inception looks for ways to increase model capability without naively increasing depth or width.
There are two problems with this naive approach: (1) a larger model is more prone to overfitting, and (2) compute cost increases dramatically, with much of that compute wasted on matmuls over weights that are near or at zero. Drawing on biological neural systems and other work on sparse networks, sparse connectivity appears to be the most efficient solution for complex networks; however, sparse matrices come with their own problems today.
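To make the wasted-compute point concrete, here is a minimal sketch (assuming a hypothetical weight matrix where most entries have been driven near zero, and counting multiply-adds rather than measuring real hardware): a dense matmul pays for every entry, while an idealized sparse product would only pay for the significant ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrix where only ~5% of entries carry signal;
# the rest are effectively zero (e.g. after training with regularization).
n = 512
mask = rng.random((n, n)) < 0.05
weights = np.where(mask, rng.standard_normal((n, n)), 1e-8)

# A dense matrix-vector product always costs n*n multiply-adds,
# no matter how many of the weights are effectively zero.
dense_flops = n * n

# An idealized sparse product would only touch the significant entries.
nnz = int(np.count_nonzero(np.abs(weights) > 1e-6))
sparse_flops = nnz

print(f"dense multiply-adds:  {dense_flops}")
print(f"sparse multiply-adds: {sparse_flops}")
print(f"fraction of dense work wasted: {1 - sparse_flops / dense_flops:.2%}")
```

The operation count is exactly the gap the next section is about: the arithmetic savings are real, but they only translate into wall-clock savings if the hardware can exploit them.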
The Problem of Sparse Matrices

Today's compute is extremely efficient at dense matmul and comparatively inefficient at sparse matrix operations. Even if sparsity reduces the number of arithmetic operations by 100x, the overhead of index lookups and cache misses on today's hardware makes sparsity much less efficient to use in practice. This is interesting to me because sparsity looks like it could be a big next step for efficiency if ML can generate the demand while hardware supplies the compute. Graphics are a great example: generally, supply follows demand, it does not precede it. The graphics industry enabled a massive amount of innovation on the silicon side, which made the modern advances possible. The market will continue to bootstrap itself higher and higher, and these problems will be solved in part simultaneously. One point I want to draw out is how the hardware affects the software: architectures trended back to full connections precisely to better exploit parallel computing. There is no 'right' answer in this case; it is simply important to recognize the fundamental benefits of sparsity, and not to mistake the current trend toward dense matrices as fundamentally necessary.
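The lookup-overhead problem can be seen even in plain Python (a toy sketch, not a hardware benchmark): a dense matvec sweeps contiguous rows in a regular pattern, while a sparse one does an indirect index lookup for every single multiply, which is exactly the access pattern hardware struggles to stream.

```python
import random

random.seed(0)
n = 200

# Dense representation: a full n x n grid, mostly zeros.
dense = [[0.0] * n for _ in range(n)]
# Sparse representation: {(row, col): value} for the ~2% nonzero entries.
sparse = {}
for _ in range(n * n // 50):
    i, j = random.randrange(n), random.randrange(n)
    v = random.uniform(-1.0, 1.0)
    dense[i][j] = v
    sparse[(i, j)] = v

x = [random.uniform(-1.0, 1.0) for _ in range(n)]

# Dense matvec: a regular, cache-friendly sweep over contiguous rows.
y_dense = [sum(row[j] * x[j] for j in range(n)) for row in dense]

# Sparse matvec: far fewer multiplies, but every one of them requires
# an indirect (i, j) lookup -- the irregular pattern hardware dislikes.
y_sparse = [0.0] * n
for (i, j), v in sparse.items():
    y_sparse[i] += v * x[j]

assert all(abs(a - b) < 1e-9 for a, b in zip(y_dense, y_sparse))
print(f"results match; sparse did {len(sparse)} multiplies vs {n * n} dense")
```

Both paths compute the same result; the arithmetic count favors the sparse one by roughly 50x, yet on real silicon the dense path often wins anyway because of exactly this memory-access asymmetry.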
Drawing on the aspiration towards sparsity, the Inception net tries to approximate the properties of a sparse network within a dense design. By approximating sparsity, the idea is to gain the compute-efficiency advantages without paying a performance price. The architecture started out as a case study based on the idea of sparsity and has shown good preliminary results, but at the time the paper was written, the authors point out that the performance had been shown mainly on vision tasks. They add a word of caution, as the network had not yet been tested on other unrelated tasks. If it is able to perform well on those as well, the case for the architecture as a general-purpose approach would be stronger.
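A rough sketch of how the module buys its efficiency, using the inception (3a) branch sizes reported in the paper (192 input channels; 64 1x1, 128 3x3, and 32 5x5 filters, with 96- and 16-channel 1x1 reductions before the larger convs). Biases and the pooling branch are omitted for clarity; only weight counts are compared.

```python
# Compare weight counts for one Inception-style module, with and without
# the 1x1 "reduction" convolutions that stand in for explicit sparsity.

def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

c_in = 192  # input channels to the module

# Naive dense branches: every filter sees all 192 input channels.
naive = (conv_params(c_in, 64, 1)       # 1x1 branch
         + conv_params(c_in, 128, 3)    # 3x3 branch
         + conv_params(c_in, 32, 5))    # 5x5 branch

# Inception branches: 1x1 reductions shrink the channel count first,
# concentrating the useful signal before the expensive 3x3/5x5 convs.
reduced = (conv_params(c_in, 64, 1)                            # 1x1 branch
           + conv_params(c_in, 96, 1) + conv_params(96, 128, 3)  # 3x3 branch
           + conv_params(c_in, 16, 1) + conv_params(16, 32, 5))  # 5x5 branch

print(f"naive:   {naive:>9,} weights")
print(f"reduced: {reduced:>9,} weights")
print(f"savings: {1 - reduced / naive:.0%}")
```

The dense 1x1/3x3/5x5 branches act like a hand-picked sparse connection pattern, but each branch is itself a dense matmul that the hardware executes efficiently; that is the whole trick.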