aPaperADay
9 Mixtape - Introductions

The Paper’s Outline

This is a very efficient paper: it deals with a concrete problem and offers a clean solution. The outline goes as follows. First, describe the problem, which is the softmax bottleneck. Then review their solution, which has three parts. Finally, look at the experimental results and conclude.

The softmax bottleneck: 30,000 ft view

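The problem with softmax is that, given a large output space (say a 10k-30k word vocabulary), it cannot represent the full output distribution, because the logits are produced from a hidden vector that is much smaller than the vocabulary. The matrix of log-probabilities over all contexts is therefore low-rank: its rank is bounded by roughly the hidden size, not the vocabulary size. This was described in an earlier paper, and the authors look at the prominent technique for addressing the bottleneck: Mixture of Softmaxes (MoS).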
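A small numerical sketch can make the rank argument concrete. It is a toy setup, not the paper's experiment: the sizes, the random hidden states, and the random mixture weights are all assumptions chosen for illustration (in a real model both would be computed from the context). It compares the rank of the log-probability matrix under a plain softmax with the rank under a mixture of softmaxes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes (assumed): contexts, vocabulary, hidden size, mixture components.
N, V, d, K = 64, 50, 4, 3

def log_softmax(z):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

# Plain softmax: one hidden state per context, one shared embedding matrix.
H = rng.normal(size=(N, d))
W = rng.normal(size=(V, d))
log_p_softmax = log_softmax(H @ W.T)          # shape (N, V)

# Mixture of Softmaxes: K hidden states per context, mixed in probability space.
H_k = rng.normal(size=(K, N, d))              # random stand-ins for the K projections
prior = rng.dirichlet(np.ones(K), size=N)     # (N, K) mixture weights, rows sum to 1
probs = np.stack(
    [np.exp(log_softmax(H_k[k] @ W.T)) for k in range(K)], axis=2
)                                             # shape (N, V, K)
log_p_mos = np.log((probs * prior[:, None, :]).sum(axis=2))

# The logits H @ W.T have rank at most d; the row-wise normaliser adds at most 1.
rank_softmax = np.linalg.matrix_rank(log_p_softmax)
rank_mos = np.linalg.matrix_rank(log_p_mos)
print("softmax log-prob rank:", rank_softmax)
print("MoS log-prob rank:    ", rank_mos)
```

The softmax rank stays at or below d + 1 however large the vocabulary is, while mixing in probability space is a nonlinear operation on the logits and is not subject to that bound.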

Three Novel additions

The paper presents three key ideas for breaking the bottleneck efficiently.

  1. Logit Space Vector Gating
  2. Sigmoid Tree Decomposition
  3. Gate Sharing
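
The list above only names the ideas. As one illustration, sigmoid tree decomposition can be read as replacing a softmax over K mixture weights with K - 1 sigmoids arranged as a binary tree. The sketch below assumes K = 4 and a depth-2 tree. The function name `sigmoid_tree_priors` and the exact tree layout are my assumptions for illustration and may differ in detail from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_tree_priors(gate_logits):
    """Map 3 gate logits (last axis) to 4 mixture weights via a depth-2 binary tree.

    Hypothetical helper: the root sigmoid splits the mass in two,
    and each child sigmoid splits its half again.
    """
    s1, s2, s3 = (sigmoid(gate_logits[..., i]) for i in range(3))
    return np.stack(
        [s1 * s2, s1 * (1 - s2), (1 - s1) * s3, (1 - s1) * (1 - s3)],
        axis=-1,
    )

rng = np.random.default_rng(0)
gate_logits = rng.normal(size=(5, 3))   # 5 example rows of gate logits (random, assumed)
priors = sigmoid_tree_priors(gate_logits)
print(priors.sum(axis=-1))              # each row of weights sums to 1
```

The weights are positive and sum to one by construction, so no normalising reduction over the K components is needed, which is where the efficiency argument comes from.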

Conclusions

From the beginning, we know that they claim a speedup over MoS of between 1.6x and 11.5x while being comparable to or better than MoS on four benchmarks. Both MoS and Mixtape show better benchmark performance than softmax, at a speed and memory penalty.