Everything you knew about word2vec is not true

Original author: Siddharth Bhat
The classic explanation of word2vec as a Skip-gram architecture with negative sampling, found in the original paper and in countless blog posts, looks like this:

while(1) {
   1. vf = vector of focus word
   2. vc = vector of context word
   3. train such that (vc . vf = 1)
   4. for(0 <= i < negative samples):
          vneg = vector of word *not* in context
          train such that (vf . vneg = 0)
}

Indeed, if you google [word2vec skipgram], this is what you see.

But all of these implementations are wrong.

The original C implementation of word2vec works differently and is fundamentally different from this. People who professionally build systems on word2vec embeddings do one of the following:

  1. Call the original C implementation directly.
  2. Use gensim's implementation, which is transliterated from the C source to the point that even the variable names match.

Indeed, gensim is the only implementation I know of that is faithful to the C one.

C implementation

The C implementation actually maintains two vectors for each word: one for when the word is in focus, and another for when it appears as a context word. (Sound familiar? Indeed, the GloVe developers borrowed this idea from word2vec without ever mentioning it!)

The implementation in the C code is remarkably subtle:

  • The array syn0 holds the embedding of a word when it occurs as the focus word. Here is its random initialization:

      for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
        next_random = next_random * (unsigned long long)25214903917 + 11;
        syn0[a * layer1_size + b] =
           (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
      }
  • Another array, syn1neg, holds the vector of a word when it occurs as a context word. It is initialized to zero.
  • During training (Skip-gram with negative sampling, though the other cases are much the same), we first pick a focus word. It stays fixed while training on the positive and negative examples. The gradients for the focus vector are accumulated in a buffer and applied to the focus word only after it has been trained on both the positive and the negative examples.

    if (negative > 0) for (d = 0; d < negative + 1; d++) {
      // if we are performing negative sampling, in the 1st iteration,
      // pick a word from the context and set the dot product target to 1
      if (d == 0) {
        target = word;
        label = 1;
      } else {
        // for all other iterations, pick a word randomly and set the dot
        //product target to 0
        next_random = next_random * (unsigned long long)25214903917 + 11;
        target = table[(next_random >> 16) % table_size];
        if (target == 0) target = next_random % (vocab_size - 1) + 1;
        if (target == word) continue;
        label = 0;
      }
      l2 = target * layer1_size;
      f = 0;
      // find dot product of original vector with negative sample vector
      // store in f
      for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
      // set g = sigmoid(f) (roughly, the actual formula is slightly more complex)
      if (f > MAX_EXP) g = (label - 1) * alpha;
      else if (f < -MAX_EXP) g = (label - 0) * alpha;
      else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
      // 1. update the vector syn1neg,
      // 2. DO NOT UPDATE syn0
      // 3. STORE THE syn0 gradient in a temporary buffer neu1e
      for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
      for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
    }
    // Finally, after all samples, update syn0 from neu1e
    // Learn weights input -> hidden
    for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];

Why random and zero initialization?

Once again, since this is not explained in the original papers or anywhere else on the Internet, I can only speculate.

The hypothesis is that when negative samples are drawn from the whole text and are not weighted by frequency, you can pick any word, and most often its vector is barely trained at all. If that vector carried a non-zero value, it would randomly shift the truly important word in focus.

The point is to initialize all negative examples to zero, so that only the vectors of words which occur reasonably often end up influencing the representation of another vector.

This is actually pretty tricky, and I had never thought about how important initialization strategies are.

Why am I writing this

I spent two months of my life trying to reproduce word2vec as described in the original scientific publication and countless articles on the Internet, but failed. I could not achieve the same results as word2vec, although I tried my best.

I could not imagine that the authors of the publication literally fabricated an algorithm that does not work, while the implementation does something completely different.

In the end, I decided to study the source. For three days I was confident that I misunderstood the code, since literally everyone on the Internet talked about a different implementation.

I have no idea why the original publication and articles on the Internet do not say anything about the real mechanism of word2vec, so I decided to publish this information myself.

This also explains GloVe's seemingly radical choice of separate vectors for the negative context: they just did what word2vec does, but told people about it :).

Is this a scientific trick? I don't know; it's a hard question. But honestly, I'm incredibly angry. I will probably never again be able to take an explanation of a machine-learning algorithm at face value: next time, I'll go straight to the source.
