NLP3 More Word Vectors

Reference: cs224n, Lecture 3 “More Word Vectors”

Stochastic gradients with word vectors

  • But in each window, we only have at most $2m+1$ words, so $\nabla_\theta J_t(\theta)$ is very sparse!
  • We may as well only update the word vectors that actually appear!
  • Solution: either you need sparse matrix update operations that only update the columns of the full embedding matrices $U$ and $V$ that actually appear, or you need to keep around a hash for word vectors (see the sketch after this list)
  • If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
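A minimal sketch of such a sparse update in numpy, assuming the embedding matrices are stored with one row per vocabulary word (the transpose of the column layout above); all names here are illustrative, not the course's reference code:

```python
import numpy as np

# Hypothetical shapes: 10,000-word vocabulary, 300-dimensional vectors.
vocab_size, d = 10000, 300
U = 0.01 * np.random.randn(vocab_size, d)   # "outside" word vectors u_w
V = 0.01 * np.random.randn(vocab_size, d)   # "center" word vectors v_c

def sparse_sgd_step(center_id, outside_ids, grad_v_c, grad_u, lr=0.025):
    """Apply one window's gradient to only the vectors that actually appear.

    grad_v_c : gradient w.r.t. the center vector, shape (d,)
    grad_u   : dict {outside word id -> gradient w.r.t. u_o, shape (d,)}
    """
    V[center_id] -= lr * grad_v_c      # touch one row of the center matrix
    for o in outside_ids:
        U[o] -= lr * grad_u[o]         # touch at most 2m rows of the outside matrix
```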

Approximations

  • The softmax normalization factor is too computationally expensive, since it sums over the entire vocabulary (see the sketch after this list)
    $$ p(o|c) = \frac{\exp (u_o^T v_c)}{\sum_{w=1}^V \exp (u_w^T v_c)} $$
  • Implement the skip-gram model with negative sampling
  • Main idea: train binary logistic regressions for a true pair (center word and word in its context window) versus a couple of noise pairs (the center word paired with a random word)
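To make the expense concrete, here is a small numpy sketch of the naive softmax above: the denominator alone requires a dot product with every word vector in the vocabulary, i.e. $O(Vd)$ work for a single prediction (names are illustrative):

```python
import numpy as np

def softmax_prob(U, v_c, o):
    """Naive p(o|c). U holds one row u_w per vocabulary word; v_c is the center vector."""
    scores = U @ v_c              # touches the entire vocabulary: O(V * d)
    scores -= scores.max()        # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()
```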

The skip-gram model and negative sampling

  • From paper Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al. 2013)
  • Overall objective function: $J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta)$
    $$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} E_{j\sim P(w)} \left[ \log \sigma (-u_j^T v_c)\right] $$
  • Where $k$ is the number of negative samples, $P(w)$ is a noise distribution over words, and $\sigma$ is the sigmoid function, $\sigma(x) = \frac{1}{1+e^{-x}}$
  • So the first log term maximizes the probability of the true (center, outside) pair co-occurring, while the sampled terms push down the probability of random noise pairs (a minimal sketch follows this list)
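A minimal numpy sketch of $J_t(\theta)$ for one training window, with the $k$ expectations replaced by $k$ concretely sampled noise vectors (names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def J_t(u_o, v_c, U_neg):
    """u_o: outside-word vector, v_c: center vector, U_neg: (k, d) sampled noise vectors."""
    positive = np.log(sigmoid(u_o @ v_c))              # true pair: push its probability up
    negative = np.log(sigmoid(-(U_neg @ v_c))).sum()   # noise pairs: push their probability down
    return positive + negative                         # the quantity we maximize
```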

Negative sampling

word2vec is a huge neural network!
The authors of word2vec addressed this issue (the huge number of weights to train) in their second paper.
There are three innovations in this second paper:

  1. Treating common word pairs or phrases as single “words” in their model
  2. Sub-sampling frequent words to decrease the number of training examples
  3. Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights

Note: Sub-sampling frequent words and applying Negative Sampling not only reduced the compute burden of the training process, but also improved the quality of the resulting word vectors.

Word Pairs and “Phrases”

Example: a word pair like “Boston Globe” has a very different meaning than the individual words “Boston” and “Globe”, so it makes sense to treat “Boston Globe” as a single word.

Method of phrase detection: it is covered in the “Learning Phrases” section of the paper, and the code is available in word2phrase.c of their published implementation.
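The paper scores a candidate bigram roughly as $\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i)\,\text{count}(w_j)}$, where $\delta$ discounts very rare pairs, and merges bigrams whose score exceeds a threshold. A rough sketch of that idea (not the actual word2phrase.c code; the names, default $\delta$, and threshold value are illustrative):

```python
def phrase_score(w1, w2, unigram_count, bigram_count, delta=5.0):
    """Score a candidate bigram; delta discounts pairs made of very rare words."""
    return (bigram_count.get((w1, w2), 0) - delta) / (
        unigram_count[w1] * unigram_count[w2]
    )

def merge_phrases(tokens, unigram_count, bigram_count, threshold=1e-4):
    """Replace high-scoring bigrams with a single joined token, e.g. "Boston_Globe"."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and phrase_score(tokens[i], tokens[i + 1],
                                 unigram_count, bigram_count) > threshold):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Running more than one pass over the corpus lets longer phrases (e.g. “New_York_Times”) form out of already-merged bigrams.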

Sub-sampling Frequent Words


There are two problems with common words like the:

  1. When looking at word pairs, (fox, the) doesn’t tell us much about the meaning of fox, since the appears in the context of pretty much every word.
  2. We will have many more samples of (the, $\dots$) than we need to learn a good vector for the.

Word2Vec implements a “sub-sampling” scheme to address this. For each word we encounter in the training text, there is a chance that we will effectively delete it from the text, and the probability that we cut the word is related to the word’s frequency.

If we have a window size of 10, and we remove a specific instance of the from our text:

  1. As we train on the remaining words, the will not appear in any of their context windows.
  2. We’ll have 10 fewer training samples where the is the input word.

Sampling rate
For any word $w_i$, $z(w_i)$ is the fraction of the total words in the corpus that are that word.
$P(w_i)$ is the probability of keeping the word:
$$ P(w_i) = \left( \sqrt{\frac{z(w_i)}{0.001}}+1\right) \frac{0.001}{z(w_i)}$$
(Plot: the sampling rate, i.e. the keep probability $P(w_i)$, as a function of the word’s corpus fraction $z(w_i)$.)
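A minimal sketch of the sub-sampling step, where `word_fraction` maps each word to its corpus fraction $z(w_i)$ (the 0.001 is the constant from the formula above; the other names are illustrative):

```python
import math
import random

def keep_probability(z, sample=0.001):
    """P(w_i) from the formula above; z is the word's fraction of the corpus."""
    return (math.sqrt(z / sample) + 1.0) * (sample / z)

def subsample(tokens, word_fraction, sample=0.001):
    """Randomly drop frequent words; rare words have P(w_i) >= 1 and are always kept."""
    return [w for w in tokens
            if random.random() < keep_probability(word_fraction[w], sample)]
```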

Negative sampling

Negative sampling addresses this problem (the tremendous number of weights) by having each training sample modify only a small percentage of the weights, rather than all of them.

When training the network on the word pair (fox, quick), the output neuron corresponding to quick should output 1 (positive), and all of the other output neurons should output 0 (negative). With negative sampling, we instead randomly select just a small number of “negative” words to update the weights for. We still update the weights for our “positive” word as well.

Recall that the output layer of our model has a weight matrix that’s $300 \times 10000$. So we will only be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons and 1,800 weight values, which is only $0.06\%$ of the 3M weights in the output layer.

In the hidden layer, only the weights for the input word are updated.
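A minimal numpy sketch of one such update with the $300 \times 10000$ example above (stored row-major here, so each matrix is $10000 \times 300$): only the $1+k$ selected output rows and the single input-word row are modified. Names are illustrative, not the original C implementation:

```python
import numpy as np

def negative_sampling_step(W_in, W_out, center, positive, negatives, lr=0.025):
    """W_in, W_out: (vocab_size, 300) input/output matrices; negatives: list of k word ids."""
    h = W_in[center]                              # hidden layer = the input word's vector
    targets = [positive] + list(negatives)        # 1 + k output neurons (e.g. 6 of 10,000)
    labels = np.array([1.0] + [0.0] * len(negatives))
    scores = W_out[targets] @ h                   # (1+k,) binary logistic-regression scores
    preds = 1.0 / (1.0 + np.exp(-scores))
    errors = preds - labels                       # per-target prediction error
    grad_h = errors @ W_out[targets]              # accumulate before touching W_out
    W_out[targets] -= lr * np.outer(errors, h)    # update only 1+k rows of the output matrix
    W_in[center] -= lr * grad_h                   # update only the input word's row
```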

Selecting Negative Samples
The probability of selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected.
$$ P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^n \left( f(w_j)^{3/4}\right)}$$
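A minimal numpy sketch of drawing negatives from this unigram$^{3/4}$ distribution (names are illustrative; the original C code instead precomputes a large sampling table for speed):

```python
import numpy as np

def build_noise_distribution(counts):
    """counts: array of raw word frequencies f(w), indexed by word id."""
    probs = np.asarray(counts, dtype=np.float64) ** 0.75
    return probs / probs.sum()

def sample_negatives(noise_probs, k=5, positive=None):
    """Draw k negative word ids, resampling any accidental draw of the positive word."""
    negatives = []
    while len(negatives) < k:
        j = int(np.random.choice(len(noise_probs), p=noise_probs))
        if j != positive:
            negatives.append(j)
    return negatives
```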

Summary of word2vec

  • Go through each word of the whole corpus
  • Predict surrounding words of each word
  • This captures co-occurrence of words one at a time

Evaluating word vectors

  • Related to general evaluation in NLP: Intrinsic vs extrinsic
  • Intrinsic:
    • Evaluation on a specific/intermediate subtask
    • Fast to compute
    • Helps to understand the system
    • Not clear if really helpful unless correlation to real task is established
  • Extrinsic:
    • Evaluation on a real task
    • Can take a long time to compute accuracy
    • Unclear if the subsystem is the problem, or its interaction with other subsystems
    • If replacing exactly one subsystem with another improves accuracy, that is good evidence the new subsystem helps