Reference: CS224n, Lecture 2: Word Vectors
Word meaning
Definition: meaning
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing
The commonest linguistic way of thinking of meaning: signifier $\iff$ signified (idea or thing) = denotation
Meaning in a computer (WordNet, one-hot vectors)
Common answer: use a taxonomy like WordNet that has hypernym (is-a) relationships and synonym sets
Problems with this discrete representation
- Great as a resource but missing nuances, e.g. synonyms
- adept, expert, good, practiced, proficient, skillful
- Missing new words (impossible to keep up to date):
- wicked, badass, nifty, crack, ace, wizard, genius, ninja
- Subjective
- Requires human labor to create and adapt
- Hard to compute accurate word similarity
- The vast majority of rule-based and statistical NLP work regards words as atomic symbols
We usually use a localist ("one-hot") representation for discrete words, but any two different word vectors satisfy $a^T b = 0$, which means that e.g. our query and document vectors are orthogonal. There is no natural notion of similarity in a set of one-hot vectors.
We could deal with similarity separately; instead we explore a direct approach where the vectors themselves encode it.
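As a concrete illustration of the orthogonality problem, here is a minimal sketch (the toy vocabulary and the dense vectors are made up for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary; indices are arbitrary.
vocab = ["hotel", "motel", "cat", "dog", "walk"]
V = len(vocab)

def one_hot(word):
    """Localist representation: a V-dimensional vector with a single 1."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel @ motel)   # 0.0 -- "hotel" and "motel" look completely unrelated

# With dense vectors, the dot product itself can encode similarity.
dense = {"hotel": np.array([0.9, 0.1]), "motel": np.array([0.8, 0.2])}
print(dense["hotel"] @ dense["motel"])  # > 0: similarity is encoded directly
```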
Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors
You shall know a word by the company it keeps (J. R. Firth, 1957)
We will build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context
Basic idea of learning neural network word embeddings
Define a model that aims to predict between a center word $w_t$ and context words in terms of vectors
$$ p(context | w_t) = \dots $$
which has a loss function, e.g.
$$ J = 1 - p(w_{-t} | w_t ) $$
where $w_{-t}$ denotes the words in the context of $w_t$ (i.e. the words other than $w_t$)
We look at many positions $t$ in a big language corpus
We keep adjusting the vector representations of words to minimize this loss
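Concretely, "keep adjusting the vectors" means gradient-based minimization, $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$. A minimal sketch, where `grad_J` is a hypothetical placeholder for the gradient of whichever loss is used:

```python
def sgd(theta, grad_J, lr=0.05, steps=1000):
    """Plain gradient descent: theta stacks all word vectors into one
    parameter array; grad_J(theta) returns dJ/dtheta (placeholder here)."""
    for _ in range(steps):
        theta = theta - lr * grad_J(theta)
    return theta
```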
Directly learning low-dimensional word vectors
- Learning representations by back-propagating errors (Rumelhart et al., 1986)
- A neural probabilistic language model (Bengio et al., 2003)
- Natural Language Processing (Almost) from Scratch (Collobert et al., 2011)
- A recent, even simpler and faster model: word2vec (Mikolov et al., 2013) $\rightarrow$ introduced now
Main idea of word2vec
Predict between every word and its context words
Two algorithms
- Skip-gram (SG): predict context words given the target word (position independent)
- Continuous Bag of Words (CBOW): predict the target word from a bag-of-words context
Two (moderately efficient) training methods
- Hierarchical softmax
- Negative sampling
(The naive softmax is simpler but inefficient; the lecture uses it first to define the objective below.)
The skip-gram model
reference: Skip-gram tutorial
Word2vec uses a trick: we train a simple neural network with a single hidden layer to perform a certain task (a "fake task"), but then we are not actually going to use that network for the task we trained it on!
Instead, the goal is just to learn the weights of the hidden layer (similar to an autoencoder).
Fake Task
Task goal: given a specific word in the middle of a sentence (the center word), look at the words nearby and pick one at random. The network will tell us, for every word in our vocabulary, the probability of it being the "nearby word" we picked.
By "nearby" we mean within a "window size" parameter of the algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead.
Example: training samples extracted from a sentence with window size = 5
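A minimal sketch of extracting (center, context) training pairs for a given window size (the sentence and the function name are illustrative, not from the lecture; a window of 2 is used in the demo just to keep the output short):

```python
def training_pairs(tokens, window=5):
    """Yield (center_word, context_word) pairs, taking up to `window`
    words behind and `window` words ahead of each position."""
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in training_pairs(sentence, window=2):
    print(pair)  # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```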
Model detail
- Input: a one-hot vector (its dimension equals the vocabulary size)
- Hidden layer: the word vector for the selected word (a linear layer, with no activation function)
- Output layer: a softmax layer giving, for each vocabulary word, the probability that a randomly selected nearby word is that word
For example, suppose we are learning word vectors with 300 features for a vocabulary of 10,000 words. Then the hidden layer is represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron).
So the end goal of all of this is really just to learn this hidden-layer weight matrix.
one-hot vector $\times$ hidden-layer weight matrix $\iff$ lookup table: multiplying by a one-hot vector simply selects the corresponding row, so the weight matrix acts as a lookup table of word vectors
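A minimal NumPy sketch of this lookup-table equivalence, using the 10,000-word vocabulary and 300 features from the example above (the weight matrix is random, purely for illustration):

```python
import numpy as np

V, D = 10_000, 300                 # vocabulary size, embedding dimension
W = np.random.randn(V, D)          # hidden-layer weight matrix (V x D)

word_index = 1234                  # arbitrary word id for illustration
x = np.zeros(V)
x[word_index] = 1.0                # one-hot input vector

h = x @ W                          # matrix-vector product, shape (D,)
assert np.allclose(h, W[word_index])  # identical to just reading row 1234
```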
Objective function
For each word $t=1,\dots,T$, predict surrounding words in a window of “radius” $m$ of every word.
Maximize the probability of any context word given the current center word
$$ J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m, j \neq 0 } p \left(w_{t+j} | w_t; \theta \right) $$
Negative log-likelihood:
$$ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m, j \neq 0} \log p \left( w_{t+j} | w_{t} \right) $$
where $\theta$ represents all the variables we will optimize (i.e. the word vectors)
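A minimal sketch of evaluating $J(\theta)$ with the naive softmax, assuming the standard word2vec parameterization $p(o | c) = \exp(u_o^T v_c) / \sum_{w} \exp(u_w^T v_c)$ with separate "center" vectors $v$ and "outside" vectors $u$ (this parameterization is introduced later in the lecture; the toy corpus, the variable names, and the vector dimension below are illustrative):

```python
import numpy as np

def neg_log_likelihood(corpus, V_in, U_out, word2id, window=2):
    """J(theta) = -(1/T) * sum_t sum_{-m<=j<=m, j!=0} log p(w_{t+j} | w_t).
    V_in[i] is the center-word vector v_i; U_out[i] is the outside vector u_i."""
    T, total = len(corpus), 0.0
    for t, center in enumerate(corpus):
        v_c = V_in[word2id[center]]
        scores = U_out @ v_c                                # u_w . v_c for every w
        log_probs = scores - np.log(np.exp(scores).sum())   # naive log-softmax
        for j in range(max(0, t - window), min(T, t + window + 1)):
            if j != t:
                total += log_probs[word2id[corpus[j]]]
    return -total / T

# Illustrative usage with a tiny corpus and random initialization.
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
V_in = rng.normal(size=(len(vocab), 10))
U_out = rng.normal(size=(len(vocab), 10))
print(neg_log_likelihood(corpus, V_in, U_out, word2id))
```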