Word2Vec

Word2vec is not a single algorithm; rather, it is a family of model architectures and optimizations. The optimization methods are worth knowing because they are the foundation of contrastive learning. The two main architectures are CBOW (Continuous Bag of Words) and Skip-gram.

CBOW

CBOW is a model that tries to predict a word given its context window. This is distinct from language modeling, since CBOW is not sequential. Typically, CBOW is used to quickly train word embeddings, which are then used to initialize the embeddings of a more complicated model. This is usually referred to as pretraining embeddings.

Let’s say we have the sentence “The quick brown fox jumps over the lazy dog”. We can create a context window of size 2 on both sides, so the context window for the word “fox” is “quick”, “brown”, “jumps”, “over”. The CBOW model tries to predict the word “fox” given this context window.

The input is a binary vector of the size of the vocabulary of the corpus: it has 1 for the context words, i.e. “quick”, “brown”, “jumps”, “over”, and 0 for the rest. The target output vector is one-hot for the word “fox”. Multiplying the input vector with the context embedding matrix simply sums the embeddings of the context words. The summed embedding is then multiplied with the output word matrix to get a score for every word in the vocabulary, softmax turns the scores into probabilities, and the loss is computed with cross-entropy against the target. A minimal sketch of this forward pass is shown below.
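As a minimal illustration of this forward pass, here is a numpy sketch over a toy vocabulary built from the example sentence. The matrix names, sizes, and random initialization are assumptions for the example, not the reference word2vec implementation.

```python
import numpy as np

# Toy vocabulary built from the example sentence (assumption: lowercased, unique words).
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16  # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_context = rng.normal(scale=0.1, size=(V, D))  # context embedding matrix
W_output = rng.normal(scale=0.1, size=(D, V))   # output word matrix

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# CBOW: predict "fox" from its size-2 context window.
context = ["quick", "brown", "jumps", "over"]
target = "fox"

# Multi-hot input vector: 1 for each context word, 0 elsewhere.
x = np.zeros(V)
for w in context:
    x[word_to_id[w]] = 1.0

h = x @ W_context          # sums the context word embeddings
scores = h @ W_output      # one score per vocabulary word
probs = softmax(scores)

# Cross-entropy loss against the one-hot target "fox".
loss = -np.log(probs[word_to_id[target]])
print(f"P(fox | context) = {probs[word_to_id[target]]:.4f}, loss = {loss:.4f}")
```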

Skip-gram

Skip-gram is a model that tries to predict the context window given a word. This is distinct from language modeling.

Let’s say we have the sentence “The quick brown fox jumps over the lazy dog” with a context window of size 2 on both sides, so the context window for the word “fox” is “quick”, “brown”, “jumps”, “over”. The Skip-gram model tries to predict the context window given the word “fox”. We still apply cross-entropy loss, but instead of predicting one word we are predicting 4 words, so we sum the cross-entropy loss over all 4 context words, which is analogous to cross-entropy in a multi-label setting. The point to note here is that we are still applying softmax on the output of the model. A sketch of this is shown below.
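A matching numpy sketch for Skip-gram under the same toy setup (again an assumption-laden illustration, not the reference implementation): the center word is embedded once, a single softmax is computed over the vocabulary, and the cross-entropy losses for the four context words are summed.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # center word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output word matrix

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Skip-gram: predict each context word from the center word "fox".
center = "fox"
context = ["quick", "brown", "jumps", "over"]

h = W_in[word_to_id[center]]   # embedding of the center word
probs = softmax(h @ W_out)     # one softmax over the whole vocabulary

# Sum the cross-entropy loss over every context word.
loss = -sum(np.log(probs[word_to_id[w]]) for w in context)
print(f"skip-gram loss for 'fox': {loss:.4f}")
```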

Optimizations

The problem with both approaches is that the softmax is an expensive operation, because a single softmax normalizes over the entire vocabulary. There are several optimizations that make training efficient.

  1. Negative Sampling: The idea of negative sampling is to take pairs of words that do not appear together and use them as training samples for learning the embeddings. The modeling also changes from softmax to sigmoid. We have positive pairs, meaning the words appear in each other’s context window, and negative pairs, which do not, and the negative pairs far outnumber the positive pairs. To incorporate this, for every positive sample, k negative samples are added to the training set. The thing to remember is that the input to the sigmoid is the dot product of the word embeddings, which are the parameters being updated. For positive pairs the model should predict 1 and for negative pairs it should predict 0, and the loss is the binary cross-entropy loss (see the sketch after this list).

  2. Contrastive Estimation: Contrastive estimation takes the idea of negative sampling one step further and draws the negative samples from a noise distribution, i.e. the distribution of words in the corpus. The model should still be able to distinguish the positive pair from the noise/negative pairs, but instead of doing it the negative-sampling way, we form triplets of the form (center word, positive word, negative word), and the model must separate the positive word from the negative word by some margin. The loss is the hinge loss (see the sketch after this list).

\[L = \max(0, s_n - s_p + m)\]

Where $s_p$ is the similarity between the center word and the positive word, $s_n$ is the similarity between the center word and the negative word, and $m$ is the margin, so the loss is zero only when the positive word is more similar than the negative word by at least the margin. Refer to loss functions for more details.

  3. Hierarchical softmax: Hierarchical softmax is a way to approximate the softmax. The idea is to represent the vocabulary as a binary tree whose leaf nodes are the words. Each internal node has a learnable vector, and the probability of branching one way at that node is the sigmoid of the dot product between the input word embedding and the node’s vector. The probability of a word is then the product of these branch probabilities along the path from the root to its leaf, and the loss is the negative log of that probability (a cross-entropy over the path decisions). The advantage is that the softmax over the whole vocabulary is replaced by roughly log(V) sigmoids, which is much faster to compute, at the cost of adding new learnable parameters for the internal tree nodes (see the sketch after this list).
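A minimal negative-sampling sketch for one positive pair, assuming k negative words drawn uniformly from the toy vocabulary (the original word2vec draws them from a unigram distribution raised to the 3/4 power); all names here are illustrative.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D, k = len(vocab), 16, 3  # k negative samples per positive pair

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # center word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # context word embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

center, positive = word_to_id["fox"], word_to_id["quick"]
# Toy noise samples: any word that is not part of the positive pair.
candidates = [i for i in range(V) if i not in (center, positive)]
negatives = rng.choice(candidates, size=k, replace=False)

# Sigmoid on the dot product of the two word embeddings, not a softmax over V.
pos_score = sigmoid(W_in[center] @ W_out[positive])
neg_scores = sigmoid(W_in[center] @ W_out[negatives].T)

# Binary cross-entropy: label 1 for the positive pair, 0 for each negative pair.
loss = -np.log(pos_score) - np.sum(np.log(1.0 - neg_scores))
print(f"negative-sampling loss: {loss:.4f}")
```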
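A sketch of the contrastive/triplet margin loss above, using cosine similarity as the similarity function (an assumption; any similarity score works) and random embeddings standing in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
D, margin = 16, 0.5

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Triplet (center, positive, negative): toy embeddings standing in for learned ones.
center = rng.normal(size=D)
positive = center + 0.1 * rng.normal(size=D)   # nearby vector -> high similarity
negative = rng.normal(size=D)                  # unrelated vector -> low similarity

s_p = cosine(center, positive)
s_n = cosine(center, negative)

# Hinge loss: zero once the positive beats the negative by at least the margin.
loss = max(0.0, s_n - s_p + margin)
print(f"s_p={s_p:.3f}, s_n={s_n:.3f}, loss={loss:.3f}")
```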
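And a toy hierarchical-softmax sketch over an eight-word vocabulary arranged as a balanced binary tree; the real word2vec implementation uses a Huffman tree, so the tree layout and node vectors here are assumptions for illustration.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
V, D = len(vocab), 16

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))          # input word embeddings
node_vecs = rng.normal(scale=0.1, size=(V - 1, D)) # one vector per internal node

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def path_to_leaf(word_id, depth=3):
    """Internal nodes and branch directions for a balanced tree over 2**depth leaves.
    Returns (node_index, go_right) pairs from root to leaf."""
    path, node = [], 0
    for level in range(depth):
        go_right = (word_id >> (depth - 1 - level)) & 1
        path.append((node, go_right))
        node = 2 * node + 1 + go_right  # children of node i are 2i+1 and 2i+2
    return path

def word_probability(center_id, target_id):
    """P(target | center) = product of branch probabilities along the path."""
    h, p = W_in[center_id], 1.0
    for node, go_right in path_to_leaf(target_id):
        q = sigmoid(h @ node_vecs[node])   # probability of branching right
        p *= q if go_right else (1.0 - q)
    return p

center, target = vocab.index("fox"), vocab.index("quick")
p = word_probability(center, target)
loss = -np.log(p)  # only log(V) sigmoids instead of a V-way softmax
print(f"P(quick | fox) = {p:.4f}, loss = {loss:.4f}")
```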
