19 Skip-Gram - Starting on the abstract

STEP1 Read the title and make an opinion of what’s in the paper (e.g., the area, the task)

‘Distributed Representations of Words and Phrases and their Compositionality’. This is quite a mouthful. Each word makes sense to me individually, but taken together they are a whole other animal. The two biggest terms are (1) Distributed Representations and (2) Compositionality. In truth, Distributed Representations took me the longest to understand. To quote Hinton at length:

“Distributed representation” means a many-to-many relationship between two types of representation (such as concepts and neurons).
– Each concept is represented by many neurons
– Each neuron participates in the representation of many concepts

This is key because it informs a wider understanding of Machine Learning. It is a blessing and a curse that Machine Learning algorithms optimize the representation space such that a single neuron does not hold a single idea. Models would be much easier to interpret if each neuron encoded only one concept, yet it is not hard to see how inefficient this would become: with one neuron per concept, the network has to grow with every new concept.
So Distributed Representations make room for word vectors whose many dimensions act as knobs, each one contributing to many different meanings.
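To make the contrast concrete for myself, here is a minimal sketch comparing a local (one-hot) code with a distributed one. The three words and every number in the vectors are invented for illustration; nothing here comes from a trained model or from the paper.

```python
# Toy illustration only: the words and all vector values are made up.

# Local (one-hot) code: one unit per concept, so 3 words need 3 units.
one_hot = {
    "king":  [1, 0, 0],
    "queen": [0, 1, 0],
    "apple": [0, 0, 1],
}

# Distributed code: every word uses every unit,
# and every unit takes part in representing every word.
distributed = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [-0.6, 0.0, 0.9],
}


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (dot(u, u) ** 0.5 * dot(v, v) ** 0.5)


# One-hot vectors of different words never overlap, so similarity carries no information.
print(cosine(one_hot["king"], one_hot["queen"]))

# Distributed vectors can place some words closer together than others.
print(cosine(distributed["king"], distributed["queen"]))
print(cosine(distributed["king"], distributed["apple"]))
```

The one-hot code is easy to read off but says nothing about how words relate; the distributed code gives up that readability in exchange for a notion of closeness.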

Make an opinion: Rereading the title points to vector representations of both words and phrases that can be composed together. My assumption is that this paper covers in depth a way to represent text that allows for recombination.

STEP2 Read the abstract well and form a hypothesis of

1. What's new in the paper?
2. Do you have a clear *overview* about what the paper is all about?

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships.

I’ve become much more disciplined about reading the abstract closely, as it is the portal to paper acceptance and summarizes well what to expect from the paper.

The Skip-gram model is trained to predict the words that appear in the context around a given word, and what it primarily cares about is the embedding generated through this process. It is an excellent first-order approximation to the more complex modeling that the BERT family of models performs.
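As a rough sketch of what "predicting contextual neighbors" means in terms of training data, the snippet below turns a sentence into (center, context) pairs. The function name `skipgram_pairs`, the fixed symmetric `window`, and the example sentence are my own assumptions for illustration, not the paper's code.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one prediction task: given the center word, score the context word highly. The embedding is whatever the model has to learn about the center word in order to do that well.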

In this paper we present several extensions that improve both the quality of the vectors and the training speed.

There are two things to look for: quality improvements and faster training.

By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling.

Two key ideas are pulled out: subsampling of frequent words and negative sampling.
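Peeking ahead to how the body of the paper defines these two ideas: a word with relative frequency f(w) is discarded with probability 1 - sqrt(t / f(w)) for a small threshold t, and negative samples are drawn from the unigram distribution raised to the 3/4 power. The sketch below is a toy version under those definitions; the function names, the made-up counts, and the large demo threshold `t=1e-2` are my own assumptions (the paper suggests a value around 1e-5 for large corpora).

```python
import math


def discard_prob(freq, t=1e-5):
    """Probability of dropping a word whose relative frequency is `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


# Made-up counts for a tiny corpus.
counts = {"the": 1000, "fox": 10, "aardvark": 1}
total = sum(counts.values())
freqs = {w: c / total for w, c in counts.items()}

# A large threshold so the toy corpus shows the effect; real corpora use a much smaller t.
for word, f in freqs.items():
    print(word, round(discard_prob(f, t=1e-2), 3))

noise = noise_distribution(counts)
print(noise)
```

Frequent words such as "the" are dropped most of the time, which shortens training and keeps them from dominating the context windows. Raising counts to the 3/4 power flattens the distribution, so rare words are picked as negatives more often than their raw frequency would suggest.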

An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases…

The back half of the abstract motivates the need to account for word order and idiomatic phrases. So I can expect the rest of the paper to make good on that promise.

Now that we know what to look for, we can move on to the next steps.