aPaperADay
18 CapsNets - Getting the method

There is a lot to take in from the introduction. They pay a small homage to the human vision system, and discuss the hope of building a parse-tree-like structure for the image. This has interesting use cases for understanding images and, in turn, projecting that understanding into language, as well as for other tasks that may benefit from a tree-structured parse.

They provide an interesting metaphor about the parse tree being carved out like a sculpture is carved out of a rock. This means the model has to be expansive enough to fit the entire tree, and the parse is built by stripping away unactivated capsules. It would be interesting if there were a way to generate the parse tree dynamically, but I believe one could make an argument based in human neuroscience for the lack of this structure. I cannot tell either way.

So this parse tree does human-understandable things at the top level, namely assigning parts to wholes.

The intro does a nice job of explaining in more detail the properties that the capsules themselves will take on: they encode various instantiation parameters such as “pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.” Here they point out that their particular way of determining the presence of an object is based on the lengths of the final output capsules, rather than on training separate logistic units whose sole job is to signal existence.
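
To make the length-as-presence idea concrete for myself, here is a minimal sketch. It assumes the squashing nonlinearity the paper defines later, v = (|s|^2 / (1 + |s|^2)) * (s / |s|), which shrinks a capsule's raw output so its length lands between 0 and 1 while its direction is preserved. The names `squash` and `length`, the `eps` guard, and the toy inputs are my own, not from the paper.

```python
import math

def squash(s, eps=1e-9):
    """Rescale vector s so its length is in [0, 1), keeping its direction."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    # eps (my addition) avoids division by zero for an all-zero input.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

def length(v):
    """Euclidean length, read here as 'how likely is the entity present'."""
    return math.sqrt(sum(x * x for x in v))

# Toy inputs: a short raw vector and a long one pointing the same way.
weak = squash([0.1, 0.0])
strong = squash([10.0, 0.0])

print(length(weak), length(strong))
```

The short vector gets squashed to nearly zero and the long one to just under one, so the length can play the role of an existence probability with no separate logistic unit, while the direction is left free to carry the other properties.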

The paper also mentions forcing the orientation to represent the properties of the entity. This I will have to understand more.

Finally, the dynamic routing mechanism is touted as a powerful system to ensure each capsule's output gets sent to the appropriate parent. It does seem to be a bit of an attention mechanism. There is a top-down feedback loop that increases the coupling between a capsule and a parent when their outputs agree (and it certainly slows the whole process down).
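
Here is a rough sketch of how I understand routing-by-agreement from the procedure given later in the paper: routing logits start at zero, coupling coefficients are a softmax over the parents, each parent squashes the coupling-weighted sum of the predictions it receives, and each logit grows by the dot product between a child's prediction and the parent's output. The function names, the pure-Python list layout, and the toy predictions are my own assumptions. In the paper the prediction vectors come from learned transformation matrices, which I skip here.

```python
import math

def squash(s, eps=1e-9):
    sq_norm = sum(x * x for x in s)
    return [sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps) * x for x in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """u_hat[i][j] is child i's prediction vector for parent j."""
    n_child, n_parent, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_child)]  # routing logits
    for _ in range(num_iters):
        c = [softmax(row) for row in b]  # each child splits its vote over parents
        v = []
        for j in range(n_parent):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_child))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Top-down feedback: agreement (dot product) raises the logit.
        for i in range(n_child):
            for j in range(n_parent):
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    # Couplings implied by the final logits, returned only for inspection.
    return v, [softmax(row) for row in b]

# Toy case: all three children agree about parent 0 and disagree about parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
print(c)
```

In this toy case the children's predictions for parent 0 line up, so its output grows long and every child shifts its coupling toward it, while the conflicting predictions for parent 1 largely cancel out.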

I’m not yet convinced that the top-down routing is necessary. Can you not just allow aligned vectors to explode in length?

One final word. They discuss rate-coding and place-coding, both of which are unfamiliar concepts to me. I’ll have to see if the paper discusses them in more detail.