
Everything you know about word2vec is wrong
(Translation)
The classic explanation of word2vec, as a Skip-gram architecture with negative sampling, given in the original paper and in countless blog posts, looks like this:
while(1) {
    1. vf = vector of focus word
    2. vc = vector of context word
    3. train such that (vc . vf = 1)
    4. for(0 <= i < negative samples):
         vneg = vector of word *not* in context
         train such that (vf . vneg = 0)
}
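To make that pseudocode concrete, here is a minimal toy sketch in C of the "textbook" description: a single shared embedding table, the (focus, context) pair pushed toward a dot product of 1, and random negatives pushed toward 0. This is my own illustration, not code from the word2vec repository; the names emb, sgd_step, VOCAB, DIM, and NEG are made up.

    // Toy sketch of the "textbook" algorithm above: ONE shared vector per word,
    // (focus . context) pushed toward 1, (focus . random negative) pushed toward 0.
    // Illustrative only -- this is NOT how the real word2vec.c works.
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define VOCAB 1000   /* toy vocabulary size */
    #define DIM   50     /* toy embedding dimension */
    #define NEG   5      /* negative samples per positive pair */

    static float emb[VOCAB][DIM];   /* the single, shared embedding table */

    /* one SGD step pushing (emb[w1] . emb[w2]) toward `label` (1 or 0) */
    static void sgd_step(int w1, int w2, float label, float alpha) {
        float f = 0.0f;
        for (int c = 0; c < DIM; c++) f += emb[w1][c] * emb[w2][c];
        float g = (label - 1.0f / (1.0f + expf(-f))) * alpha;
        for (int c = 0; c < DIM; c++) {
            float tmp = emb[w1][c];
            emb[w1][c] += g * emb[w2][c];   /* both vectors come from the same table */
            emb[w2][c] += g * tmp;
        }
    }

    int main(void) {
        /* random init for every word -- note: only one table, unlike the real code */
        for (int w = 0; w < VOCAB; w++)
            for (int c = 0; c < DIM; c++)
                emb[w][c] = ((rand() / (float)RAND_MAX) - 0.5f) / DIM;

        int focus = 42, context = 7;                       /* a fake (focus, context) pair */
        sgd_step(focus, context, 1.0f, 0.025f);            /* positive sample */
        for (int i = 0; i < NEG; i++)
            sgd_step(focus, rand() % VOCAB, 0.0f, 0.025f); /* negative samples */

        printf("emb[focus][0] after one update: %f\n", emb[focus][0]);
        return 0;
    }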
Indeed, if you google [word2vec skipgram], here is what you find:
- The Wikipedia page, which describes the algorithm at a high level
- The TensorFlow page, with the same explanation
- The Towards Data Science blog post, describing the same algorithm, and the list goes on.
But all of these implementations are wrong.
The original C implementation of word2vec does not do what is described above and is fundamentally different from it. People who seriously build systems on word embeddings produced by word2vec do one of the following:
- They call the original C implementation directly.
- They use the gensim implementation, which is transliterated from the C source to the extent that even the variable names match.
Indeed, the gensim implementation is the only one I know of that is faithful to the C implementation.
The C implementation
The C implementation in fact maintains two vectors for each word: one for when the word appears as the focus word, and one for when it appears as a context word. (Sound familiar? Indeed, the GloVe developers borrowed this idea from word2vec without ever mentioning that fact!)
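To make the "two vectors per word" point concrete, here is a simplified sketch (my own, not the original source) of the layout the C code uses: two flat arrays of vocab_size * layer1_size floats, with word w's focus vector and context vector stored at the same offset w * layer1_size in the two different arrays. This is also where the l1 and l2 offsets in the training loop below come from.

    /* Simplified sketch of the data layout: two flat arrays, one offset per word.
       In the real code the tables are called syn0 and syn1neg; the real code then
       fills syn0 with small random values and leaves syn1neg at zero. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        long long vocab_size = 1000, layer1_size = 100;

        float *syn0    = calloc(vocab_size * layer1_size, sizeof(float)); /* focus vectors */
        float *syn1neg = calloc(vocab_size * layer1_size, sizeof(float)); /* context vectors */

        long long w = 42;                                /* some word id */
        float *focus_vec   = syn0    + w * layer1_size;  /* offset "l1" in the C code */
        float *context_vec = syn1neg + w * layer1_size;  /* offset "l2" in the C code */

        printf("word %lld: focus vector at offset %lld, context vector at offset %lld\n",
               w, (long long)(focus_vec - syn0), (long long)(context_vec - syn1neg));

        free(syn0); free(syn1neg);
        return 0;
    }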
The setup in the C code is exceptionally well done:
- An array syn0 holds the vector embedding of a word when it occurs as the focus word. It is randomly initialized:
  https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L369

    for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
      next_random = next_random * (unsigned long long)25214903917 + 11;
      syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
    }
- Another array, syn1neg, holds the vector of a word when it occurs as a context word. It is zero-initialized.
- During training (Skip-gram with negative sampling, though the other cases are much the same), we first pick a focus word. It is held constant throughout training on the positive and negative samples. The gradients for the focus vector are accumulated in a buffer and applied to the focus word only after it has been trained on both the positive and negative samples:
    if (negative > 0) for (d = 0; d < negative + 1; d++) {
      // if we are performing negative sampling, in the 1st iteration,
      // pick a word from the context and set the dot product target to 1
      if (d == 0) {
        target = word;
        label = 1;
      } else {
        // for all other iterations, pick a word randomly and set the dot
        // product target to 0
        next_random = next_random * (unsigned long long)25214903917 + 11;
        target = table[(next_random >> 16) % table_size];
        if (target == 0) target = next_random % (vocab_size - 1) + 1;
        if (target == word) continue;
        label = 0;
      }
      l2 = target * layer1_size;
      f = 0;
      // find dot product of original vector with negative sample vector
      // store in f
      for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
      // set g = sigmoid(f) (roughly, the actual formula is slightly more complex)
      if (f > MAX_EXP) g = (label - 1) * alpha;
      else if (f < -MAX_EXP) g = (label - 0) * alpha;
      else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
      // 1. update the vector syn1neg,
      // 2. DO NOT UPDATE syn0
      // 3. STORE THE syn0 gradient in a temporary buffer neu1e
      for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
      for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
    }
    // Finally, after all samples, update syn0 from neu1e
    // https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
    // Learn weights input -> hidden
    for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
Why random and zero initialization?
Once again, since none of this is explained in the original papers or anywhere else on the Internet, I can only speculate.
My hypothesis is that because negative samples are drawn from the whole text and are not really weighted by frequency, you can end up picking any word, and more often than not a word whose vector has barely been trained at all. If that vector actually had a value, it would randomly shift the genuinely important focus word.
The point of the solution is to initialize all negative-sample (context) vectors to zero, so that only the vectors of words that occur reasonably often can affect the representation of another vector.
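You can see the effect directly in the update rule from the C code above: if the sampled word's syn1neg vector is still all zeros, the dot product f is zero, and the gradient added to the focus buffer, g * syn1neg, is a zero vector; the focus word is not moved at all, while only the negative word's context vector gets nudged. Here is a small standalone illustration of that arithmetic (my own sketch with made-up names, not the original code):

    /* Illustration: one negative-sample update where the context vector is still
       zero-initialized. The focus vector receives no gradient; only the negative
       word's context vector moves. */
    #include <stdio.h>
    #include <math.h>

    #define DIM 4

    int main(void) {
        float focus[DIM]   = {0.12f, -0.05f, 0.30f, 0.08f};  /* randomly initialized */
        float context[DIM] = {0.0f, 0.0f, 0.0f, 0.0f};       /* zero-initialized negative */
        float buffer[DIM]  = {0};                             /* plays the role of neu1e */
        float alpha = 0.025f, label = 0.0f;                   /* negative sample: target 0 */

        float f = 0.0f;
        for (int c = 0; c < DIM; c++) f += focus[c] * context[c];    /* f = 0 */
        float g = (label - 1.0f / (1.0f + expf(-f))) * alpha;        /* g = -0.5 * alpha */

        for (int c = 0; c < DIM; c++) buffer[c]  += g * context[c];  /* adds 0: focus untouched */
        for (int c = 0; c < DIM; c++) context[c] += g * focus[c];    /* the negative word does move */

        printf("gradient buffered for focus word: %f %f %f %f\n",
               buffer[0], buffer[1], buffer[2], buffer[3]);          /* all zeros */
        printf("updated context vector:           %f %f %f %f\n",
               context[0], context[1], context[2], context[3]);
        return 0;
    }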
This is actually quite clever, and until now I had never thought about how important initialization strategies really are.
Why am I writing this
I spent two months of my life trying to reproduce word2vec as described in the original paper and in countless articles on the Internet, and failed. I could not reach the same scores that word2vec did, and it was not for lack of trying.
I could not have imagined that the authors of the paper had literally fabricated an algorithm that does not work, while the implementation does something completely different.
In the end, I decided to read the sources. For three days I was convinced I was misreading the code, since literally everyone on the Internet described a different implementation.
I have no idea why the original publication and articles on the Internet do not say anything about the real mechanism of word2vec, so I decided to publish this information myself.
This also explains GloVe's radical choice of having separate vectors for the negative context: they just did what word2vec does, but told people about it :).
Is this academic dishonesty? I don't know; it's a hard question. But to be honest, I'm incredibly angry. I will probably never again take an explanation of an algorithm in a machine learning paper at face value: next time I will go straight to the sources.