The goal of this blog is to help readers understand what word2vec actually means, so that a rookie can become comfortable with how embeddings are trained. When I tried to find resources on how word2vec works, what surprised me most is that, although the model is structurally simple, most of the materials I found online do not explain it clearly: the default search results rarely lay out the essential chain of reasoning, and the blogs and YouTube videos I went through tend to skip details at exactly the most crucial steps. That is why I decided to compile a word2vec explanation myself.
Efficient Estimation of Word Representations in Vector Space1
As the title suggests, the paper is about estimating word representations in a vector space; it introduces the two word2vec architectures, CBOW and skip-gram.
Embeddings: vector representations of words.
Words that appear in similar contexts have similar vectors.
Continuous Bag of Words (CBOW)
Consider a sentence of words \(\texttt{W}_{\texttt{1}}, \texttt{W}_{\texttt{2}}, \ldots\); the dictionary \(D\) contains \(V\) words, and each word is mapped to an embedding vector of length \(N\).
Bag of Words: predict the target word \(\texttt{W}_{\texttt{i}}\) from the context words \(\texttt{W}_{i-k},\ldots,\texttt{W}_{i+k}\), using a sliding window of size \(2k\). \(\star\)It should be noted that the order of these context words does not affect the prediction. Embedding matrix \(\mathbf{W}_{V\times N}\), context (output) matrix \(\mathbf{W}^{\star}_{N\times V}\). CBOW predicts the target word from its context within the window; it suits smaller corpora and is faster to train.
skip-gram: predict the context words from the target word. It scales to larger corpora and higher dimensions, but is slower to train.
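As a toy illustration (my example, not from the paper): take the sentence \(\texttt{"the quick brown fox jumps"}\) with window \(k=2\) and center word \(\texttt{brown}\). The two architectures then form the following training pairs:
\(\begin{matrix} \mathrm{CBOW} & \{\texttt{the},\texttt{quick},\texttt{fox},\texttt{jumps}\}\rightarrow \texttt{brown}\\ \mathrm{Skip\text{-}gram} & \texttt{brown}\rightarrow\texttt{the},\quad \texttt{brown}\rightarrow\texttt{quick},\quad \texttt{brown}\rightarrow\texttt{fox},\quad \texttt{brown}\rightarrow\texttt{jumps} \end{matrix}\)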
word2vec原理(二) 基于Hierarchical Softmax的模型 (word2vec principles, part 2: the model based on Hierarchical Softmax)
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method2
word2vec parameter learning explained3
CBOW Neural Network Illustration
Skip-Gram Neural Network Illustration
wevi: word embedding visual inspector
Symbols
The following are the symbols used in this blog. I tried to find a set of symbols that is intuitive yet follows the mainstream definitions.
\begin{matrix} \underbrace{C()}_{\substack{\text{Context Func}}} & \underbrace{c}_{\substack{\text{Context Word}}}&\underbrace{w}_{\substack{\text{Target Word}}}&\underbrace{D}_{\substack{\text{Dictionary}}} &\underbrace{V}_{\substack{\text{Dict Size}}} & \underbrace{N}_{\substack{\text{Embedding Size}}}\\ \underbrace{T^{\star}}_{\substack{(w,C(w))}\in T} & \underbrace{T'}_{\substack{(w,C(w))}\notin T} & \underbrace{T}_{\substack{\text{Text/Corpus}}}& \underbrace{w^{\dagger}}_{\substack{\text{Input Layer}\\\text{Word Idx\ }i}}&\underbrace{w^{\ast}}_{\substack{\text{Output Layer}\\\text{Word Idx\ } i}}\\ \underbrace{\mathbf{h}}_{\substack{\text{Hidden Layer}}} & \\ \underbrace{\mathbf{W}}_{\substack{\text{Embedding Mat}}}& \underbrace{\overrightarrow{W_{c}}}_{\substack{\text{Embedding Vec}\\\text{Context}}}&\underbrace{\overrightarrow{W_{w}}}_{\substack{\text{Embedding Vec}\\\text{Target}}} &\underbrace{\mathbf{W^{*}}}_{\substack{\text{Output Mat}}}& \underbrace{\overrightarrow{W^{*}_{c}}}_{\substack{\text{Output Vec}\\\text{Context}}}&\underbrace{\overrightarrow{W^{*}_{w}}}_{\substack{\text{Output Vec}\\\text{Target}}} \end{matrix}
The following table summarizes how these symbols map to the variables in the word2vec.c code.
\(\begin{matrix} \mathrm{Symbols} & W_{V\times N} & V_{N} & g\cdot W^{\star}& W^{\star}_{N\times V} & \overline{W^{\star}}_{N\times V} & g & \eta\\ \mathrm{Code} & \texttt{syn0} & \texttt{neu1}& \texttt{neu1e}& \texttt{syn1} & \texttt{syn1neg} & \texttt{g} & \texttt{alpha}\\ \mathrm{Row} & \texttt{vocab_size} &\texttt{1} & \texttt{1}& \texttt{layer1_size} & \texttt{layer1_size} & \\ \mathrm{Col} & \texttt{layer1_size} & \texttt{layer1_size} & \texttt{layer1_size} & \texttt{vocab_size} & \texttt{vocab_size} \\ \end{matrix}\)
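One practical detail worth fixing in mind before reading the C code: \(\texttt{syn0}\) and \(\texttt{syn1}\) are flat 1-D arrays laid out row by row, which is why the training loops index them as \(\texttt{syn0[c + word * layer1_size]}\). Below is a minimal sketch of that layout; the helper names are mine and do not appear in word2vec.c, and \(\texttt{real}\) is assumed to be \(\texttt{float}\) as in the original source.
#include <stdlib.h>

typedef float real; // word2vec.c defines real as float

/* Allocate a vocab_size x layer1_size matrix as one contiguous block.
   Row w (the vector of the word with vocabulary index w) occupies
   indices [w * layer1_size, (w + 1) * layer1_size). */
real *alloc_matrix(long long vocab_size, long long layer1_size) {
  return (real *)calloc((size_t)(vocab_size * layer1_size), sizeof(real));
}

/* Pointer to the start of word w's vector: the same addressing as
   syn0[c + w * layer1_size] used in the training loops quoted later. */
real *word_vector(real *m, long long w, long long layer1_size) {
  return m + w * layer1_size;
}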
Vanilla
In this section, we discuss the vanilla word2vec design in theory, without any of the extra practical techniques. Two matrices \(\mathbf{W}_{V\times N}\) and \(\mathbf{W}^{*}_{N\times V}\) are connected by a hidden layer \(\mathbf{h}\). One intuitive way of thinking about these two matrices is that each holds one vector per word: \(\mathbf{W}_{V\times N}\) stores the vector of a word \(w\) when it acts as a target (input) word, while \(\mathbf{W}^{*}_{N\times V}\) stores its vector when it acts as a context (output) word. A single vector per word would not work well, because a word rarely appears in its own context, so the two roles need separate representations.
Consider the simplest case: one input word with index \(i\) and the probability of a specific output word with index \(j\). If the input word \(D_{i}\) (the \(i\)th word in the dictionary) is related to \(D_{j}\), in other words \((D_{i}, D_{j})\in T\), then there is at least one occurrence in the corpus \(T\) where \(D_{i}\) and \(D_{j}\) appear next to each other within a small context window.
During training, we want to update the vectors in both \(\mathbf{W}_{V\times N}\) and \(\mathbf{W}^{*}_{N\times V}\) such that pairs \((D_{i}, D_{j})\in T\) become closer in terms of their inner product4.
For forward propagation, \(\overrightarrow{W_{i}}=\mathbf{h}\), meaning the word \(D_{i}\) with index \(i\) is activated at the input layer; on the output layer, another word \(D_{j}\) with index \(j\) is chosen for the update, either from the dataset or by negative sampling.
D_{i}\xrightleftarrows[]{\mathbf{W}_{V\times N}}\underbrace{\mathbf{h}}_{\substack{\textcolor{red}{\overrightarrow{W_{i}}^{\intercal}}}}\xrightleftarrows[]{\mathbf{W}^{\ast}_{N\times V}} u_{j}=\mathbf{h}^{\intercal}\cdot \textcolor{red}{\overrightarrow{W^{\ast}_{j}}}\xrightleftarrows[]{\large\mathrm{softmax}()}y_{j}\Longleftarrow \textcolor{red}{e_{j}}= y_{j}-t_{j}
\begin{align*} E &= -\log p(w_{O}|w_{I})=-\log\frac{e^{u_{j^{\ast}}}}{\sum_{k=1}^{V} e^{u_k}}=-u_{j^{\ast}}+\log\sum_{k=1}^{V} e^{u_k} \\ \frac{\partial E}{\partial u_{j}} &= \frac{\partial}{\partial u_{j}} \left( - \sum_{k=1}^{V} t_k u_k + \log\left(\sum_{i=1}^{V} e^{u_i}\right) \right) \quad\text{where } t_k=1 \text{ if } k=j^{\ast} \text{ and } 0 \text{ otherwise}\\ &= \frac{\partial}{\partial u_j} \left( - \sum_{k=1}^{V} t_k u_k \right) + \frac{\partial}{\partial u_j} \left(\log\left(\sum_{i=1}^{V} e^{u_i}\right)\right) \\ &= -t_j + \frac{1}{\sum_{i=1}^{V} e^{u_i}} \cdot \frac{\partial}{\partial u_j} \left(\sum_{i=1}^{V} e^{u_i}\right) \\ &= -t_j + \frac{e^{u_j}}{\sum_{i=1}^{V} e^{u_i}} \\ &= -t_j + y_j = y_j - t_j \end{align*}
For backpropagation, we first need to update the values in \(\overrightarrow{W^{\ast}_{j}}\), a vector of size \(N\times 1\) whose entries are \(w^{\ast}_{k,j}\).
\(\displaystyle \frac{\partial E}{\partial w^{\ast}_{k,j}}=\frac{\partial E}{\partial u_{j}}\cdot \frac{\partial u_{j}}{\partial w^{\ast}_{k,j}}=e_{j}\cdot \mathbf{h}_{k}\)
To update the values \(w_{i,k}\) in row \(i\) of the embedding matrix (through the hidden layer \(\mathbf{h}\)), we need the gradient with respect to each hidden unit \(\mathbf{h}_{k}\):
\frac{\partial E}{\partial \mathbf{h}_{k}}=\sum_{j=1}^{V}\frac{\partial E}{\partial \mathbf{u}_{j}}\cdot \frac{\partial \mathbf{u}_{j}}{\partial \mathbf{h}_{k}}=\sum_{j=1}^{V} e_{j}\cdot w^{\ast}_{k,j}
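Putting the two gradients together, a plain stochastic gradient descent step with learning rate \(\eta\) updates both matrices as follows (this matches the update equations derived in Rong's parameter-learning note):
\(\displaystyle w^{\ast}_{k,j} := w^{\ast}_{k,j} - \eta\, e_{j}\, \mathbf{h}_{k} \quad (j=1,\ldots,V),\qquad\quad w_{i,k} := w_{i,k} - \eta \sum_{j=1}^{V} e_{j}\, w^{\ast}_{k,j}\)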
CBOW
Skip-Gram
Extra
Negative Sampling
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method5
Positive examples are the ground-truth \((w, c)\) pairs provided by the text corpus; negative samples are pairs that do not occur in the text. Negative words are drawn from the noise distribution \(P_{n}(w)\propto freq(\texttt{vocab[i]})^{\texttt{power}}\), and the objective is \(\displaystyle \arg \underset{\theta}{\max}\prod_{(w,c) \in D} p(D = 1|w, c; \theta)=\arg \underset{\theta}{\max}\prod_{(w,c) \in D}\frac{1}{1+e^{-v_{c}\cdot v_{w}}}\)
\(\displaystyle \mathcal{L} = \sigma(v_w \cdot v_c) \times \prod_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_n(w)} \left[ \sigma(-v_{w_i} \cdot v_{c}) \right]\Rightarrow \log \sigma(v_{w} \cdot v_{c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v_{w_i} \cdot v_c) \right]\)
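To connect this objective to the code below, write \(f = v_{w}\cdot v_{c}\) for the score of a pair and \(t\in\{0,1\}\) for its label (1 for the observed pair, 0 for a negative sample); these symbols are mine, chosen to match the \(\texttt{f}\) and \(\texttt{label}\) variables in the C code. Each term of the objective is then \(t\log\sigma(f) + (1-t)\log\sigma(-f)\), and
\(\displaystyle \frac{\partial}{\partial f}\Bigl[t\log\sigma(f) + (1-t)\log\sigma(-f)\Bigr] = t - \sigma(f)\)
which is exactly the \(\texttt{(label - sigmoid)}\) factor that the training code stores in \(\texttt{g}\), scaled by the learning rate \(\texttt{alpha}\).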
The unigram distribution is the non-contextual probability of finding a specific word form in a corpus6.
void InitUnigramTable() { // build the frequency-based sampling table for negative sampling
int a, i;
long long train_words_pow = 0; // sum of pow(count, power) over all words
real d1, power = 0.75;
table = (int *)malloc(table_size * sizeof(int));
for (a = 0; a < vocab_size; a++) // iterate over the vocab to accumulate the total "energy" train_words_pow; power is the exponent
train_words_pow += pow(vocab[a].cn, power);
i = 0;
d1 = pow(vocab[i].cn, power) / (real)train_words_pow; // cumulative share of the word at index 0
for (a = 0; a < table_size; a++) { // iterate over the table: a indexes the table, i indexes the vocab
table[a] = i; // table[a] stores vocabulary index i
// the table reflects the distribution of word "energy": words with higher energy
// occupy more slots in the table and are therefore more likely to be sampled
if (a / (real)table_size > d1) {
i++;
d1 += pow(vocab[i].cn, power) / (real)train_words_pow;
}
if (i >= vocab_size) i = vocab_size - 1;
}
}
Consider a corpus of \(\texttt{“We had meal”}\) where (hypothetically) \(\texttt{“We”}\) has the highest count and \(\texttt{“meal”}\) the lowest; in the table, the higher-energy words then occupy more slots, like \([\texttt{“We”,”We”,”We”,”We”,”We”,”We”,”had”,”had”,”had”,”had”,”meal”}]\).
\(\begin{matrix} \mathrm{Corpus} & \texttt{We} & \texttt{had} & \texttt{meal} & & \\ \mathrm{Vocab\ Idx} & 0 & 1 & 2 & & \\ \mathrm{Vocab} & \texttt{We} & \texttt{had} & \texttt{meal} & \\ \mathrm{Table\ Idx} & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3} & \texttt{4} & \texttt{5} & \texttt{6} & \texttt{7} &\texttt{8} &\texttt{9} &\texttt{10} &\\ \mathrm{Table\ Val}& \texttt{0} & \texttt{0} & \texttt{0} & \texttt{0} & \texttt{0} & \texttt{0} & \texttt{1} & \texttt{1} &\texttt{1} &\texttt{1} &\texttt{2} \\ \mathrm{Table} & \texttt{We} & \texttt{We} & \texttt{We} & \texttt{We} & \texttt{We} &\texttt{We} & \texttt{had} & \texttt{had} &\texttt{had} &\texttt{had} & \texttt{meal} & \\ \end{matrix}\)
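To see how this table is consumed during training, here is a minimal sketch of drawing one negative sample; the helper \(\texttt{sample_negative}\) is mine (the original code inlines the same steps inside \(\texttt{TrainModelThread}\) and simply skips a draw that collides with the positive word instead of redrawing).
/* Draw one negative sample from the pre-built unigram table, using the same
   linear-congruential random update and table lookup as word2vec.c. */
static unsigned long long next_random = 1;

long long sample_negative(const int *table, int table_size,
                          long long vocab_size, long long word) {
  long long target;
  do {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    /* index 0 is reserved in the vocabulary, so remap it to a random word */
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
  } while (target == word); /* redraw if we hit the positive word (the original just skips it) */
  return target;
}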
Hierarchical Softmax
Considering a full binary tree with \(V\) leaf nodes, the overall number of nodes is \(2\cdot V-1\): \(V\) leaves plus \(V-1\) internal nodes.
Huffman tree generated from the exact frequencies of the text “this is an example of a huffman tree”. Encoding the sentence with this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used (This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information). The frequencies and codes of each character are shown in the accompanying table.
Char | Freq | Code |
space ' ' | 7 | 111 |
a | 4 | 010 |
e | 4 | 000 |
f | 3 | 1101 |
h | 2 | 1010 |
i | 2 | 1000 |
m | 2 | 0111 |
n | 2 | 0010 |
s | 2 | 1011 |
t | 2 | 0110 |
l | 1 | 11001 |
o | 1 | 00110 |
p | 1 | 10011 |
r | 1 | 11000 |
u | 1 | 00111 |
x | 1 | 10010 |
An example binary tree for the hierarchical softmax model
Illustration of one path of the Huffman encoding from the \(\texttt{root}\) to the leaf node \(\texttt{Virginia}\). Based on the total node count of the binary tree, the first \(V\) entries correspond to the vocabulary words (the leaf nodes), and the remaining \(V-1\) entries are the non-leaf (internal) nodes.
\(\begin{matrix} \mathrm{depth} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \mathrm{word} & \boxtimes&\boxtimes & \boxtimes& \boxtimes& \boxtimes&\boxtimes &\boxtimes & \texttt{virginia}\\ \mathrm{count} & 800&700&600&500&400&300&200&100\\ \mathrm{point} & \left | C\right |+a & & & & & & &i\\ \mathrm{code} &\qquad\qquad & \quad\;\;0\;\;\quad & \quad\;01\;\quad & \;\;010\;\; & \;\;0101\; & \;\;01010 & \;010101 & 0101010\\ \end{matrix}\)
\(\displaystyle \underbrace{\textcolor{blue}{\bullet}}_{\substack{\text{root}}}\underbrace{\longmapsto}_{\substack{\text{explain}}}\circ \longmapsto \circ \longmapsto \circ \longmapsto\circ \longmapsto\circ\longmapsto\circ\longmapsto\underbrace{\textcolor{green}{\bullet}}_{\substack{\text{leaf}}}\)
\begin{matrix} \underbrace{\textcolor{blue}{\bullet}}_{\substack{\text{root}}} & \underbrace{\longmapsto}_{\substack{\text{left}}} & \circ & \underbrace{\longmapsto}_{\substack{\text{right}}} &\circ & \underbrace{\longmapsto}_{\substack{\text{left}}}& \cdots&\circ& \underbrace{\longmapsto}_{\substack{\text{right}}} & \underbrace{\textcolor{green}{\bullet}}_{\substack{\text{leaf}}} \\ &&\begin{bmatrix} w_{1,1}^{\star} \\ w_{1,2}^{\star} \\ \vdots\\ w_{1,n-1}^{\star}\\ w_{1,n}^{\star} \end{bmatrix}&& \begin{bmatrix} w_{2,1}^{\star} \\ w_{2,2}^{\star} \\ \vdots\\ w_{2,n-1}^{\star}\\ w_{2,n}^{\star} \end{bmatrix} &&& \begin{bmatrix} w_{J-1,1}^{\star} \\ w_{J-1,2}^{\star} \\ \vdots\\ w_{J-1,n-1}^{\star}\\ w_{J-1,n}^{\star} \end{bmatrix}&& \begin{bmatrix} w_{J,1}^{\star} \\ w_{J,2}^{\star} \\ \vdots\\ w_{J,n-1}^{\star}\\ w_{J,n}^{\star} \end{bmatrix} \end{matrix}
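For reference, the per-word path information illustrated above is stored in word2vec.c roughly as follows (the comments are mine; the original declares the fields on one line without comments):
struct vocab_word {
  long long cn;   // word count (frequency) in the corpus
  int *point;     // indices of the inner (non-leaf) nodes on the path from the root
  char *word;     // the word string itself
  char *code;     // the Huffman code: one 0/1 digit per level of the path
  char codelen;   // length of the Huffman code, i.e. the depth of the leaf
};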
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; // compute the inner product
f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]; // sigmoid of the inner product via the lookup table
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden (backpropagate the error from the Huffman tree to the hidden layer)
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output (update the vector of the current inner node)
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
f = 0;
l2 = vocab[word].point[d] * layer1_size;
// Propagate hidden -> output
// inner product of the word vector and the weights of the Huffman-tree non-leaf node
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
// same as the hierarchical softmax discussion for CBOW
f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )];
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha; // note: code[d] actually belongs to the next level; point and code are offset by one level
// Propagate errors output -> hidden
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
The explanation for the update of the Huffman Encoding Tree matrix weights is based on a blog written in Chinese7.
In CBOW, \(\displaystyle H_{w}=\sum_{c\in C(w)}\overrightarrow{W_{c}}\) is the sum of the embedding vectors of the context words \(c\in C(w)\), where the set \(C(w)\) depends on the context window size.
Since the Huffman Encoding Tree is a binary tree, every internal node has two children (a node is never left with only a right child; if only one child existed, it would be the left one). We therefore let the sigmoid function represent the probability of choosing the left child: when \(d_{j}^{w}=0\), i.e. the path goes left from level \(j-1\), \(p(d_{j}^{w}=0\mid\mathbf{h}_{w}, \theta_{j-1}^{w})=\sigma\left(\mathbf{h}_{w}^{\intercal}\theta_{j-1}^{w}\right)\); conversely, when \(d_{j}^{w}=1\) (the path goes right), \(p(d_{j}^{w}=1\mid\mathbf{h}_{w}, \theta_{j-1}^{w})=1-\sigma\left(\mathbf{h}_{w}^{\intercal}\theta_{j-1}^{w}\right)\).
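Combining the two cases into a single expression, and writing \(\mathbf{x}_{w}\) for the hidden vector (so \(\mathbf{x}_{w}=\mathbf{h}_{w}=H_{w}\) in CBOW), the per-node log-likelihood that is differentiated below is
\(\displaystyle \mathcal{L}(w, j) = (1 - d_j^w)\cdot\log \sigma\left( \mathbf{x}_{w}^{\top} \theta_{j-1}^{w} \right) + d_j^w\cdot\log\left[ 1 - \sigma\left( \mathbf{x}_w^{\top} \theta_{j-1}^w \right) \right],\qquad \log p(w\mid \mathbf{x}_w)=\sum_{j=2}^{l^{w}}\mathcal{L}(w,j)\)
where \(l^{w}\) is the number of nodes on the path from the root to the leaf of \(w\).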
\begin{align*} \frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^{w}} &= \frac{\partial}{\partial \theta_{j-1}^w} \left\{ \underbrace{(1 - d_j^w) \cdot \log \left[ \sigma \left( \mathbf{x}_{w}^\top \theta_{j-1}^{w} \right) \right]}_{\substack{\color{navy}\text{Prob of Left Child Node after } \log()}} + \underbrace{d_j^w \cdot \log \left[ 1 - \sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right) \right]}_{\substack{\color{navy}\text{Prob of Right Child Node after }\log()}} \right\}\\ &\underbrace{\looparrowleft}_{\substack{\text{rules}}} \underbrace{\sigma'(x)=\sigma(x)[1-\sigma(x)]}_{\substack{\color{navy}\text{sigmoid derivative}}};\quad \underbrace{(\ln x)'=1/x}_{\substack{\color{navy}\text{log derivative}}}\\ &= (1 - d_j^w)\cdot\underbrace{ \frac{1}{\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)}}_{\substack{\color{navy}\text{log derivative}}}\cdot \underbrace{\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)\cdot [1-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)]}_{\substack{\color{navy}\text{sigmoid derivative}}}\cdot \mathbf{x}_w\\ &\quad+ d_j^w\cdot \underbrace{\frac{1}{1-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)}}_{\substack{\color{navy}\text{log derivative}}}\cdot \underbrace{(-1)\cdot \sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)\cdot [1-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)]}_{\substack{\color{navy}\text{sigmoid derivative}}}\cdot \mathbf{x}_w\\ &= \left[1 - d_j^w-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)\right]\cdot \mathbf{x}_w \end{align*}
Based on this derivative of the likelihood, the parameter vector \(\theta_{j-1}^{w}\) at level \(j-1\) on the Huffman Encoding Tree path to the word \(w\) can be updated with the gradient step shown below.
\(\displaystyle \theta_{j-1}^w := \theta_{j-1}^w + \eta \left[ 1 - d_j^w - \sigma(\mathbf{x}_w^{\top}\theta_{j-1}^w) \right] \mathbf{x}_w,\qquad \sigma(\mathbf{x}_w^{\top}\theta_{j-1}^w)=\frac{1}{1+e^{-\mathbf{x}_w^{\top}\theta_{j-1}^w}}\)
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Learn weights hidden -> output (update the vector of the current inner node)
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
\underbrace{\texttt{g}}_{\substack{\text{gradient}\\\text{coefficient}}}\texttt{=}\texttt{(1 - }\underbrace{\texttt{vocab[word].code[d]}}_{\substack{\text{level-}j\text{ code of word }w\text{: 0 or 1}}}\texttt{ - }\underbrace{\texttt{f}}_{\substack{\text{sigmoid}}}\texttt{)}\texttt{ * }\underbrace{\texttt{alpha}}_{\substack{\text{step size}}}\texttt{;}
\underbrace{\texttt{syn1[c + l2]}}_{\substack{\theta_{j-1}^{w}}} \texttt{ += } \underbrace{\texttt{g}}_{\substack{\text{gradient}\\\text{coefficient}}} * \underbrace{\texttt{neu1[c]};}_{\substack{\text{Hidden Layer Vec}}}
Similarly, for the partial derivative with respect to \(\mathbf{x}_{w}\), we can swap the roles of \(\mathbf{x}_{w}\) and \(\theta_{j-1}^{w}\) in the gradient above. Thus, along the path from the root through the non-leaf nodes to the leaf of \(w\), each \(\theta_{j-1}^{w}\) is updated once, while \(\mathbf{x}_{w}\) accumulates a gradient contribution from every internal node on the path, which is applied as a single update at the end.
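Concretely, the accumulated update for \(\mathbf{x}_{w}\) (in the CBOW code this is the \(\texttt{neu1e}\) buffer, which is later added to the embedding vector of every context word) is
\(\displaystyle \mathbf{x}_w := \mathbf{x}_w + \eta \sum_{j=2}^{l^{w}} \left[ 1 - d_j^w - \sigma(\mathbf{x}_w^{\top}\theta_{j-1}^w) \right] \theta_{j-1}^w\)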
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden (backpropagate the error from the Huffman tree to the hidden layer)
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
Backpropagation
The following code is from \(\texttt{void *TrainModelThread(void *id)}\).
Continuous Bag of Words
\mathcal{L} =\prod_{w \in C}p(w \mid \text{Context}(w))\Longrightarrow \sum_{w \in C} \log p(w \mid \text{Context}(w))
where \(C\) denotes the whole corpus (the dataset).
for (a = b; a < window * 2 + 1 - b; a++) if (a != window){ // CBOW: b is the random window offset
// context range: [sentence_position - window + b, sentence_position + window - b]
c = sentence_position - window + a; // c: context position, sentence_position: location of the target word
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
// syn0 is allocated with (long long)vocab_size * layer1_size * sizeof(real)
// syn0[last_word * layer1_size .. last_word * layer1_size + layer1_size - 1] is last_word's embedding vector
for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; // accumulate the context embeddings
}
if (hs) for (d = 0; d < vocab[word].codelen; d++){ // Hierarchical Softmax, walking down from the root
// codelen: length of the Huffman code; point: indices of the inner nodes on the path to the leaf
f = 0;
l2 = vocab[word].point[d] * layer1_size; // point records the Huffman path; locate the current inner node's offset
// Propagate hidden -> output
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; // compute the inner product
if (f <= -MAX_EXP) continue; // discard if the inner product is out of range
else if (f >= MAX_EXP) continue;
else f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]; // sigmoid of the inner product via the lookup table
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden (backpropagate the error from the Huffman tree to the hidden layer)
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output (update the current inner node's vector)
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
}
if (negative > 0) for (d = 0; d < negative + 1; d++) { // 1 positive sample + 'negative' negative samples
if (d == 0) {
target = word; // the positive sample is the true target word
label = 1;
} else {
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size]; // draw a negative word from the unigram table
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue; // skip draws that collide with the target word
label = 0;
}
l2 = target * layer1_size;
f = 0;
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2]; // inner product with the sampled word's output vector
if (f > MAX_EXP) g = (label - 1) * alpha; // sigmoid saturates at 1
else if (f < -MAX_EXP) g = (label - 0) * alpha; // sigmoid saturates at 0
else g = (label - expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]) * alpha; // g = (label - sigmoid(f)) * alpha
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; // accumulate the error for the hidden layer
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c]; // update the output vector of the sampled word
}
// hidden -> in
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
c = sentence_position - window + a;
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c];//update embeddings
}
\(\blacktriangleright\Box\Box\Box\boxplus\Box\Box\Box\Box\boxtimes\Box\Box\Box\Box\boxplus\Box\Box\Box\blacktriangleleft\)
\([\texttt{sentence_position - window + b}, \texttt{sentence_position + window - b}]\)
Dynamic Window Size8
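In \(\texttt{word2vec.c}\), the dynamic window is implemented by shrinking the window with a random offset \(\texttt{b}\) drawn for every center word, so nearer context words are used more often. A minimal sketch of that draw (the helper function is mine; the original inlines it in \(\texttt{TrainModelThread}\)):
static unsigned long long next_random = 1;

/* Draw b uniformly in [0, window - 1]; the effective window is then
   (window - b) words on each side, which is exactly why the loops above
   run for (a = b; a < window * 2 + 1 - b; a++). */
int draw_window_offset(int window) {
  next_random = next_random * (unsigned long long)25214903917 + 11;
  return (int)(next_random % window);
}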
subsampling and rare-word pruning9
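Subsampling is applied while the sentence is being read: each occurrence of a very frequent word may be randomly discarded before training even sees it. The sketch below (a hypothetical \(\texttt{keep_word}\) helper; the original inlines this test) roughly follows the formula in word2vec.c, where \(\texttt{sample}\) is the \(\texttt{-sample}\) threshold, \(\texttt{cn}\) the word's count, and \(\texttt{train_words}\) the total corpus size:
#include <math.h>

typedef float real;

/* Returns 1 if the word should be kept, 0 if this occurrence is discarded.
   The keep probability shrinks as the word's relative frequency grows. */
int keep_word(long long cn, long long train_words, real sample,
              unsigned long long *next_random) {
  real ran = (sqrt(cn / (sample * train_words)) + 1) * (sample * train_words) / cn;
  *next_random = *next_random * (unsigned long long)25214903917 + 11;
  return ran >= (*next_random & 0xFFFF) / (real)65536;
}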
Skip-Gram
// 4.2.2 the skip-gram model:
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { // predict the non-center words (the words within the window)
c = sentence_position - window + a;
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
l1 = last_word * layer1_size;
// reset the accumulated error term
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
// HIERARCHICAL SOFTMAX
// 4.2.2 skip-gram model: HIERARCHICAL SOFTMAX
if (hs) for (d = 0; d < vocab[word].codelen; d++) {
f = 0;
l2 = vocab[word].point[d] * layer1_size;
// Propagate hidden -> output
// inner product of the word vector and the weights of the Huffman-tree non-leaf node
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
// same as the hierarchical softmax discussion for CBOW
if (f <= -MAX_EXP) continue;
else if (f >= MAX_EXP) continue;
// the rest mirrors the CBOW case above
else f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )];
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha; // note: code[d] actually belongs to the next level; point and code are offset by one level
// Propagate errors output -> hidden
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
}
// NEGATIVE SAMPLING
// 4.2.2 skip-gram model: NEGATIVE SAMPLING
if (negative > 0) for (d = 0; d < negative + 1; d++) {
if (d == 0) {
target = word;
label = 1;
} else {
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size];
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue;
label = 0;
}
l2 = target * layer1_size;
f = 0;
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
// the rest mirrors the CBOW case above
if (f > MAX_EXP) g = (label - 1) * alpha;
else if (f < -MAX_EXP) g = (label - 0) * alpha;
else g = (label - expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]) * alpha;
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
}
Attention in transformers, step-by-step | DL6
Code Repo
Original v 0.1c word2vec.c
Patched v 0.1b: https://github.com/William-Yeh/word2vec-mac. This modified version fixes issues that prevent the original repo from running on a macOS machine.
The following compares the original with the patched version of the v 0.1b code. This comparison is useful because it shows, at a minimum, how to handle platform compatibility issues.
Filename | Original | Patched |
compute-accuracy.c | #include <malloc.h> | #include <stdlib.h> |
demo-*.sh | wget | curl |
distance.c | #include <malloc.h> bestd[a] = 0; | #include <stdlib.h> bestd[a] = -1; |
word-analogy.c | #include <malloc.h> | #include <stdlib.h> |
The good thing is that for most of the C files, only malloc.h needs to be changed to stdlib.h, which is straightforward; similarly, in the demo-*.sh scripts, wget and gzip need to be replaced by curl and unzip. The more complex case is distance.c: one issue is the handling of the bestw character matrix, and another is the initialization of bestd[a].
I cannot believe that the researcher whose work I am learning from met with such a tragic fate.
Rest In Peace Xin Rong, you will be forever remembered.
- Mikolov, Tomas, Kai Chen, Greg S. Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv preprint arXiv:1301.3781 (2013).📁🗒[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]
- Rong, Xin. “word2vec parameter learning explained.” arXiv preprint arXiv:1411.2738 (2014).[↩]
- Rong, Xin. “word2vec parameter learning explained.” arXiv preprint arXiv:1411.2738 (2014).[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]
- Nikkarinen, Irene, Tiago Pimentel, Damián E. Blasi, and Ryan Cotterell. “Modeling the unigram distribution.” arXiv preprint arXiv:2106.02289 (2021).[↩]
- word2vec原理推导与代码分析 (word2vec derivation and code analysis, blog post in Chinese)[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]