The goal of this blog is to help readers understand what word2vec actually means, so that a rookie can become comfortable with how embeddings are trained. When I tried to find resources on how word2vec works, what surprised me most is that, although the model is structurally simple, most of the materials I found online do not explain it clearly: the default search results rarely lay out the essential chain of reasoning, and the blogs and YouTube videos I went through tend to skip details at exactly the most crucial steps. That is why I decided to compile a word2vec explanation myself.
Efficient Estimation of Word Representations in Vector Space1
As the title suggests, the paper is about estimating word representations in a vector space; it introduces the two word2vec architectures, CBOW and skip-gram.
Embeddings: vector representations of words.
Words that appear in similar contexts have similar vectors.
Continuous Bag of Words (CBOW)
Consider a sentence of words \(\texttt{W}_{\texttt{1}}, \texttt{W}_{\texttt{2}}, \ldots\); the dictionary \(D\) contains \(V\) words, and each word is mapped to an embedding vector of length \(N\).
Bag of Words: predict the target word \(\texttt{W}_{\texttt{i}}\) from the context words \(\texttt{W}_{i-k},\ldots,\texttt{W}_{i+k}\), using a sliding window of size \(2k\). \(\star\)It should be noted that the order of these context words does not affect the prediction. Embedding matrix \(\mathbf{W}_{V\times N}\), context (output) matrix \(\mathbf{W}^{\star}_{N\times V}\). CBOW predicts the target word from its context within the window; it suits smaller corpora and is faster to train.
skip-gram: predict the context words from the target word. It scales to larger corpora and higher dimensions, but is slower to train.
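As a toy illustration (my example, not from the paper): take the sentence \(\texttt{"the quick brown fox jumps"}\) with window \(k=2\) and center word \(\texttt{brown}\). The two architectures then form the following training pairs:
\(\begin{matrix} \mathrm{CBOW} & \{\texttt{the},\texttt{quick},\texttt{fox},\texttt{jumps}\}\rightarrow \texttt{brown}\\ \mathrm{Skip\text{-}gram} & \texttt{brown}\rightarrow\texttt{the},\quad \texttt{brown}\rightarrow\texttt{quick},\quad \texttt{brown}\rightarrow\texttt{fox},\quad \texttt{brown}\rightarrow\texttt{jumps} \end{matrix}\)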
word2vec原理(二) 基于Hierarchical Softmax的模型 (word2vec principles, part 2: the model based on Hierarchical Softmax)
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method2
word2vec parameter learning explained3
CBOW Neural Network Illustration
Skip-Gram Neural Network Illustration
wevi: word embedding visual inspector
Symbols
The following are the symbols used in this blog. I tried to find a set of symbols that is intuitive yet follows the mainstream definitions.
\begin{matrix} \underbrace{C()}_{\substack{\text{Context Func}}} & \underbrace{c}_{\substack{\text{Context Word}}}&\underbrace{w}_{\substack{\text{Target Word}}}&\underbrace{D}_{\substack{\text{Dictionary}}} &\underbrace{V}_{\substack{\text{Dict Size}}} & \underbrace{N}_{\substack{\text{Embedding Size}}}\\ \underbrace{T^{\star}}_{\substack{(w,C(w))}\in T} & \underbrace{T'}_{\substack{(w,C(w))}\notin T} & \underbrace{T}_{\substack{\text{Text/Corpus}}}& \underbrace{w^{\dagger}}_{\substack{\text{Input Layer}\\\text{Word Idx\ }i}}&\underbrace{w^{\ast}}_{\substack{\text{Output Layer}\\\text{Word Idx\ } i}}\\ \underbrace{\mathbf{h}}_{\substack{\text{Hidden Layer}}} & \\ \underbrace{\mathbf{W}}_{\substack{\text{Embedding Mat}}}& \underbrace{\overrightarrow{W_{c}}}_{\substack{\text{Embedding Vec}\\\text{Context}}}&\underbrace{\overrightarrow{W_{w}}}_{\substack{\text{Embedding Vec}\\\text{Target}}} &\underbrace{\mathbf{W^{*}}}_{\substack{\text{Output Mat}}}& \underbrace{\overrightarrow{W^{*}_{c}}}_{\substack{\text{Output Vec}\\\text{Context}}}&\underbrace{\overrightarrow{W^{*}_{w}}}_{\substack{\text{Output Vec}\\\text{Target}}} \end{matrix}
The following table summarizes how these symbols map to the variables in the word2vec.c code.
\(\begin{matrix} \mathrm{Symbols} & W_{V\times N} & V_{N} & g\cdot W^{\star}& W^{\star}_{N\times V} & \overline{W^{\star}}_{N\times V} & g & \eta\\ \mathrm{Code} & \texttt{syn0} & \texttt{neu1}& \texttt{neu1e}& \texttt{syn1} & \texttt{syn1neg} & \texttt{g} & \texttt{alpha}\\ \mathrm{Row} & \texttt{vocab_size} &\texttt{1} & \texttt{1}& \texttt{layer1_size} & \texttt{layer1_size} & \\ \mathrm{Col} & \texttt{layer1_size} & \texttt{layer1_size} & \texttt{layer1_size} & \texttt{vocab_size} & \texttt{vocab_size} \\ \end{matrix}\)
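One practical detail worth fixing in mind before reading the C code: \(\texttt{syn0}\) and \(\texttt{syn1}\) are flat 1-D arrays laid out row by row, which is why the training loops index them as \(\texttt{syn0[c + word * layer1_size]}\). Below is a minimal sketch of that layout; the helper names are mine and do not appear in word2vec.c, and \(\texttt{real}\) is assumed to be \(\texttt{float}\) as in the original source.
#include <stdlib.h>

typedef float real; // word2vec.c defines real as float

/* Allocate a vocab_size x layer1_size matrix as one contiguous block.
   Row w (the vector of the word with vocabulary index w) occupies
   indices [w * layer1_size, (w + 1) * layer1_size). */
real *alloc_matrix(long long vocab_size, long long layer1_size) {
  return (real *)calloc((size_t)(vocab_size * layer1_size), sizeof(real));
}

/* Pointer to the start of word w's vector: the same addressing as
   syn0[c + w * layer1_size] used in the training loops quoted later. */
real *word_vector(real *m, long long w, long long layer1_size) {
  return m + w * layer1_size;
}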
Vanilla
In this section, we discuss the vanilla word2vec design in theory, without any of the extra practical techniques. Two matrices \(\mathbf{W}_{V\times N}\) and \(\mathbf{W}^{*}_{N\times V}\) are connected by a hidden layer \(\mathbf{h}\). One intuitive way of thinking about these two matrices is that each holds one vector per word: \(\mathbf{W}_{V\times N}\) stores the vector of a word \(w\) when it acts as a target (input) word, while \(\mathbf{W}^{*}_{N\times V}\) stores its vector when it acts as a context (output) word. A single vector per word would not work well, because a word rarely appears in its own context, so the two roles need separate representations.
Consider the simplest case: one input word with index \(i\) and the probability of a specific output word with index \(j\). If the input word \(D_{i}\) (the \(i\)th word in the dictionary) is related to \(D_{j}\), in other words \((D_{i}, D_{j})\in T\), then there is at least one occurrence in the corpus \(T\) where \(D_{i}\) and \(D_{j}\) appear next to each other within a small context window.
During training, we want to update the vectors in both \(\mathbf{W}_{V\times N}\) and \(\mathbf{W}^{*}_{N\times V}\) such that pairs \((D_{i}, D_{j})\in T\) become closer in terms of their inner product4.
For forward propagation, \(\overrightarrow{W_{i}}=\mathbf{h}\), meaning the word \(D_{i}\) with index \(i\) is activated at the input layer; on the output layer, another word \(D_{j}\) with index \(j\) is chosen for the update, either from the dataset or by negative sampling.
D_{i}\xrightleftarrows[]{\mathbf{W}_{V\times N}}\underbrace{\mathbf{h}}_{\substack{\textcolor{red}{\overrightarrow{W_{i}}^{\intercal}}}}\xrightleftarrows[]{\mathbf{W}^{\ast}_{N\times V}} u_{j}=\mathbf{h}^{\intercal}\cdot \textcolor{red}{\overrightarrow{W^{\ast}_{j}}}\xrightleftarrows[]{\large\mathrm{softmax}()}y_{j}\Longleftarrow \textcolor{red}{e_{j}}= y_{j}-t_{j}
\begin{align*} E &= -\log p(w_{O}|w_{I})=-\log\frac{e^{u_{j^{\ast}}}}{\sum_{k=1}^{V} e^{u_k}}=-u_{j^{\ast}}+\log\sum_{k=1}^{V} e^{u_k} \\ \frac{\partial E}{\partial u_{j}} &= \frac{\partial}{\partial u_{j}} \left( - \sum_{k=1}^{V} t_k u_k + \log\left(\sum_{i=1}^{V} e^{u_i}\right) \right) \quad\text{where } t_k=1 \text{ if } k=j^{\ast} \text{ and } 0 \text{ otherwise}\\ &= \frac{\partial}{\partial u_j} \left( - \sum_{k=1}^{V} t_k u_k \right) + \frac{\partial}{\partial u_j} \left(\log\left(\sum_{i=1}^{V} e^{u_i}\right)\right) \\ &= -t_j + \frac{1}{\sum_{i=1}^{V} e^{u_i}} \cdot \frac{\partial}{\partial u_j} \left(\sum_{i=1}^{V} e^{u_i}\right) \\ &= -t_j + \frac{e^{u_j}}{\sum_{i=1}^{V} e^{u_i}} \\ &= -t_j + y_j = y_j - t_j \end{align*}
For backpropagation, we first need to update the values in \(\overrightarrow{W^{\ast}_{j}}\), a vector of size \(N\times 1\) whose entries are \(w^{\ast}_{k,j}\).
\(\displaystyle \frac{\partial E}{\partial w^{\ast}_{k,j}}=\frac{\partial E}{\partial u_{j}}\cdot \frac{\partial u_{j}}{\partial w^{\ast}_{k,j}}=e_{j}\cdot \mathbf{h}_{k}\)
To update the values \(w_{i,k}\) in row \(i\) of the embedding matrix (through the hidden layer \(\mathbf{h}\)), we need the gradient with respect to each hidden unit \(\mathbf{h}_{k}\):
\frac{\partial E}{\partial \mathbf{h}_{k}}=\sum_{j=1}^{V}\frac{\partial E}{\partial \mathbf{u}_{j}}\cdot \frac{\partial \mathbf{u}_{j}}{\partial \mathbf{h}_{k}}=\sum_{j=1}^{V} e_{j}\cdot w^{\ast}_{k,j}
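Putting the two gradients together, a plain stochastic gradient descent step with learning rate \(\eta\) updates both matrices as follows (this matches the update equations derived in Rong's parameter-learning note):
\(\displaystyle w^{\ast}_{k,j} := w^{\ast}_{k,j} - \eta\, e_{j}\, \mathbf{h}_{k} \quad (j=1,\ldots,V),\qquad\quad w_{i,k} := w_{i,k} - \eta \sum_{j=1}^{V} e_{j}\, w^{\ast}_{k,j}\)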
CBOW
Skip-Gram
Extra
Negative Sampling
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method5
Positive examples are the ground-truth \((w, c)\) pairs provided by the text corpus; negative samples are pairs that do not occur in the text. Negative words are drawn from the noise distribution \(P_{n}(w)\propto freq(\texttt{vocab[i]})^{\texttt{power}}\), and the objective is \(\displaystyle \arg \underset{\theta}{\max}\prod_{(w,c) \in D} p(D = 1|w, c; \theta)=\arg \underset{\theta}{\max}\prod_{(w,c) \in D}\frac{1}{1+e^{-v_{c}\cdot v_{w}}}\)
\(\displaystyle \mathcal{L} = \sigma(v_w \cdot v_c) \times \prod_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_n(w)} \left[ \sigma(-v_{w_i} \cdot v_{c}) \right]\Rightarrow \log \sigma(v_{w} \cdot v_{c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v_{w_i} \cdot v_c) \right]\)
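To connect this objective to the code below, write \(f = v_{w}\cdot v_{c}\) for the score of a pair and \(t\in\{0,1\}\) for its label (1 for the observed pair, 0 for a negative sample); these symbols are mine, chosen to match the \(\texttt{f}\) and \(\texttt{label}\) variables in the C code. Each term of the objective is then \(t\log\sigma(f) + (1-t)\log\sigma(-f)\), and
\(\displaystyle \frac{\partial}{\partial f}\Bigl[t\log\sigma(f) + (1-t)\log\sigma(-f)\Bigr] = t - \sigma(f)\)
which is exactly the \(\texttt{(label - sigmoid)}\) factor that the training code stores in \(\texttt{g}\), scaled by the learning rate \(\texttt{alpha}\).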
The unigram distribution is the non-contextual probability of finding a specific word form in a corpus6.
void InitUnigramTable() { // build the frequency-based sampling table for negative sampling
int a, i;
long long train_words_pow = 0; // sum of pow(count, power) over all words
real d1, power = 0.75;
table = (int *)malloc(table_size * sizeof(int));
for (a = 0; a < vocab_size; a++) // iterate over the vocab to accumulate the total "energy" train_words_pow; power is the exponent
train_words_pow += pow(vocab[a].cn, power);
i = 0;
d1 = pow(vocab[i].cn, power) / (real)train_words_pow; // cumulative share of the word at index 0
for (a = 0; a < table_size; a++) { // iterate over the table: a indexes the table, i indexes the vocab
table[a] = i; // table[a] stores vocabulary index i
// the table reflects the distribution of word "energy": words with higher energy
// occupy more slots in the table and are therefore more likely to be sampled
if (a / (real)table_size > d1) {
i++;
d1 += pow(vocab[i].cn, power) / (real)train_words_pow;
}
if (i >= vocab_size) i = vocab_size - 1;
}
}
Consider a corpus of \(\texttt{“We had meal”}\) where (hypothetically) \(\texttt{“We”}\) has the highest count and \(\texttt{“meal”}\) the lowest; in the table, the higher-energy words then occupy more slots, like \([\texttt{“We”,”We”,”We”,”We”,”We”,”We”,”had”,”had”,”had”,”had”,”meal”}]\).
\(\begin{matrix} \mathrm{Corpus} & \texttt{We} & \texttt{had} & \texttt{meal} & & \\ \mathrm{Vocab\ Idx} & 0 & 1 & 2 & & \\ \mathrm{Vocab} & \texttt{We} & \texttt{had} & \texttt{meal} & \\ \mathrm{Table\ Idx} & \texttt{0} & \texttt{1} & \texttt{2} & \texttt{3} & \texttt{4} & \texttt{5} & \texttt{6} & \texttt{7} &\texttt{8} &\texttt{9} &\texttt{10} &\\ \mathrm{Table\ Val}& \texttt{0} & \texttt{0} & \texttt{0} & \texttt{0} & \texttt{0} & \texttt{0} & \texttt{1} & \texttt{1} &\texttt{1} &\texttt{1} &\texttt{2} \\ \mathrm{Table} & \texttt{We} & \texttt{We} & \texttt{We} & \texttt{We} & \texttt{We} &\texttt{We} & \texttt{had} & \texttt{had} &\texttt{had} &\texttt{had} & \texttt{meal} & \\ \end{matrix}\)
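To see how this table is consumed during training, here is a minimal sketch of drawing one negative sample; the helper \(\texttt{sample_negative}\) is mine (the original code inlines the same steps inside \(\texttt{TrainModelThread}\) and simply skips a draw that collides with the positive word instead of redrawing).
/* Draw one negative sample from the pre-built unigram table, using the same
   linear-congruential random update and table lookup as word2vec.c. */
static unsigned long long next_random = 1;

long long sample_negative(const int *table, int table_size,
                          long long vocab_size, long long word) {
  long long target;
  do {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    /* index 0 is reserved in the vocabulary, so remap it to a random word */
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
  } while (target == word); /* redraw if we hit the positive word (the original just skips it) */
  return target;
}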
Hierarchical Softmax
Considering a full binary tree with \(V\) leaf nodes, the overall number of nodes is \(2\cdot V-1\): \(V\) leaves plus \(V-1\) internal nodes.
Huffman tree generated from the exact frequencies of the text “this is an example of a huffman tree”. Encoding the sentence with this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used (This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information). The frequencies and codes of each character are shown in the accompanying table.
Char | Freq | Code |
space ' ' | 7 | 111 |
a | 4 | 010 |
e | 4 | 000 |
f | 3 | 1101 |
h | 2 | 1010 |
i | 2 | 1000 |
m | 2 | 0111 |
n | 2 | 0010 |
s | 2 | 1011 |
t | 2 | 0110 |
l | 1 | 11001 |
o | 1 | 00110 |
p | 1 | 10011 |
r | 1 | 11000 |
u | 1 | 00111 |
x | 1 | 10010 |
An example binary tree for the hierarchical softmax model
Illustration of one path of the Huffman encoding from the \(\texttt{root}\) to the leaf node \(\texttt{Virginia}\). Based on the total node count of the binary tree, the first \(V\) entries correspond to the vocabulary words (the leaf nodes), and the remaining \(V-1\) entries are the non-leaf (internal) nodes.
\(\begin{matrix} \mathrm{depth} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \mathrm{word} & \boxtimes&\boxtimes & \boxtimes& \boxtimes& \boxtimes&\boxtimes &\boxtimes & \texttt{virginia}\\ \mathrm{count} & 800&700&600&500&400&300&200&100\\ \mathrm{point} & \left | C\right |+a & & & & & & &i\\ \mathrm{code} &\qquad\qquad & \quad\;\;0\;\;\quad & \quad\;01\;\quad & \;\;010\;\; & \;\;0101\; & \;\;01010 & \;010101 & 0101010\\ \end{matrix}\)
\(\displaystyle \underbrace{\textcolor{blue}{\bullet}}_{\substack{\text{root}}}\underbrace{\longmapsto}_{\substack{\text{explain}}}\circ \longmapsto \circ \longmapsto \circ \longmapsto\circ \longmapsto\circ\longmapsto\circ\longmapsto\underbrace{\textcolor{green}{\bullet}}_{\substack{\text{leaf}}}\)
\begin{matrix} \underbrace{\textcolor{blue}{\bullet}}_{\substack{\text{root}}} & \underbrace{\longmapsto}_{\substack{\text{left}}} & \circ & \underbrace{\longmapsto}_{\substack{\text{right}}} &\circ & \underbrace{\longmapsto}_{\substack{\text{left}}}& \cdots&\circ& \underbrace{\longmapsto}_{\substack{\text{right}}} & \underbrace{\textcolor{green}{\bullet}}_{\substack{\text{leaf}}} \\ &&\begin{bmatrix} w_{1,1}^{\star} \\ w_{1,2}^{\star} \\ \vdots\\ w_{1,n-1}^{\star}\\ w_{1,n}^{\star} \end{bmatrix}&& \begin{bmatrix} w_{2,1}^{\star} \\ w_{2,2}^{\star} \\ \vdots\\ w_{2,n-1}^{\star}\\ w_{2,n}^{\star} \end{bmatrix} &&& \begin{bmatrix} w_{J-1,1}^{\star} \\ w_{J-1,2}^{\star} \\ \vdots\\ w_{J-1,n-1}^{\star}\\ w_{J-1,n}^{\star} \end{bmatrix}&& \begin{bmatrix} w_{J,1}^{\star} \\ w_{J,2}^{\star} \\ \vdots\\ w_{J,n-1}^{\star}\\ w_{J,n}^{\star} \end{bmatrix} \end{matrix}
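For reference, the per-word path information illustrated above is stored in word2vec.c roughly as follows (the comments are mine; the original declares the fields on one line without comments):
struct vocab_word {
  long long cn;   // word count (frequency) in the corpus
  int *point;     // indices of the inner (non-leaf) nodes on the path from the root
  char *word;     // the word string itself
  char *code;     // the Huffman code: one 0/1 digit per level of the path
  char codelen;   // length of the Huffman code, i.e. the depth of the leaf
};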
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; // compute the inner product
f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]; // sigmoid of the inner product via the lookup table
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden (backpropagate the error from the Huffman tree to the hidden layer)
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output (update the vector of the current inner node)
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
f = 0;
l2 = vocab[word].point[d] * layer1_size;
// Propagate hidden -> output
// inner product of the word vector and the weights of the Huffman-tree non-leaf node
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
// same as the hierarchical softmax discussion for CBOW
f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )];
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha; // note: code[d] actually belongs to the next level; point and code are offset by one level
// Propagate errors output -> hidden
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
The explanation for the update of the Huffman Encoding Tree matrix weights is based on a blog written in Chinese7.
In CBOW, \(\displaystyle H_{w}=\sum_{c\in C(w)}\overrightarrow{W_{c}}\) is the sum of the embedding vectors of the context words \(c\in C(w)\), where the set \(C(w)\) depends on the context window size.
Since the Huffman Encoding Tree is a binary tree, every internal node has two children (a node is never left with only a right child; if only one child existed, it would be the left one). We therefore let the sigmoid function represent the probability of choosing the left child: when \(d_{j}^{w}=0\), i.e. the path goes left from level \(j-1\), \(p(d_{j}^{w}=0\mid\mathbf{h}_{w}, \theta_{j-1}^{w})=\sigma\left(\mathbf{h}_{w}^{\intercal}\theta_{j-1}^{w}\right)\); conversely, when \(d_{j}^{w}=1\) (the path goes right), \(p(d_{j}^{w}=1\mid\mathbf{h}_{w}, \theta_{j-1}^{w})=1-\sigma\left(\mathbf{h}_{w}^{\intercal}\theta_{j-1}^{w}\right)\).
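Combining the two cases into a single expression, and writing \(\mathbf{x}_{w}\) for the hidden vector (so \(\mathbf{x}_{w}=\mathbf{h}_{w}=H_{w}\) in CBOW), the per-node log-likelihood that is differentiated below is
\(\displaystyle \mathcal{L}(w, j) = (1 - d_j^w)\cdot\log \sigma\left( \mathbf{x}_{w}^{\top} \theta_{j-1}^{w} \right) + d_j^w\cdot\log\left[ 1 - \sigma\left( \mathbf{x}_w^{\top} \theta_{j-1}^w \right) \right],\qquad \log p(w\mid \mathbf{x}_w)=\sum_{j=2}^{l^{w}}\mathcal{L}(w,j)\)
where \(l^{w}\) is the number of nodes on the path from the root to the leaf of \(w\).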
\begin{align*} \frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^{w}} &= \frac{\partial}{\partial \theta_{j-1}^w} \left\{ \underbrace{(1 - d_j^w) \cdot \log \left[ \sigma \left( \mathbf{x}_{w}^\top \theta_{j-1}^{w} \right) \right]}_{\substack{\color{navy}\text{Prob of Left Child Node after } \log()}} + \underbrace{d_j^w \cdot \log \left[ 1 - \sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right) \right]}_{\substack{\color{navy}\text{Prob of Right Child Node after }\log()}} \right\}\\ &\underbrace{\looparrowleft}_{\substack{\text{rules}}} \underbrace{\sigma'(x)=\sigma(x)[1-\sigma(x)]}_{\substack{\color{navy}\text{sigmoid derivative}}};\quad \underbrace{(\ln x)'=1/x}_{\substack{\color{navy}\text{log derivative}}}\\ &= (1 - d_j^w)\cdot\underbrace{ \frac{1}{\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)}}_{\substack{\color{navy}\text{log derivative}}}\cdot \underbrace{\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)\cdot [1-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)]}_{\substack{\color{navy}\text{sigmoid derivative}}}\cdot \mathbf{x}_w\\ &\quad+ d_j^w\cdot \underbrace{\frac{1}{1-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)}}_{\substack{\color{navy}\text{log derivative}}}\cdot \underbrace{(-1)\cdot \sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)\cdot [1-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)]}_{\substack{\color{navy}\text{sigmoid derivative}}}\cdot \mathbf{x}_w\\ &= \left[1 - d_j^w-\sigma \left( \mathbf{x}_w^\top \theta_{j-1}^w \right)\right]\cdot \mathbf{x}_w \end{align*}
Based on this derivative of the likelihood, the parameter vector \(\theta_{j-1}^{w}\) at level \(j-1\) on the Huffman Encoding Tree path to the word \(w\) can be updated with the gradient step shown below.
\(\displaystyle \theta_{j-1}^w := \theta_{j-1}^w + \eta \left[ 1 - d_j^w - \sigma(\mathbf{x}_w^{\top}\theta_{j-1}^w) \right] \mathbf{x}_w,\qquad \sigma(\mathbf{x}_w^{\top}\theta_{j-1}^w)=\frac{1}{1+e^{-\mathbf{x}_w^{\top}\theta_{j-1}^w}}\)
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Learn weights hidden -> output (update the vector of the current inner node)
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
\underbrace{\texttt{g}}_{\substack{\text{gradient}\\\text{coefficient}}}\texttt{=}\texttt{(1 - }\underbrace{\texttt{vocab[word].code[d]}}_{\substack{\text{level-}j\text{ code of word }w\text{: 0 or 1}}}\texttt{ - }\underbrace{\texttt{f}}_{\substack{\text{sigmoid}}}\texttt{)}\texttt{ * }\underbrace{\texttt{alpha}}_{\substack{\text{step size}}}\texttt{;}
\underbrace{\texttt{syn1[c + l2]}}_{\substack{\theta_{j-1}^{w}}} \texttt{ += } \underbrace{\texttt{g}}_{\substack{\text{gradient}\\\text{coefficient}}} * \underbrace{\texttt{neu1[c]};}_{\substack{\text{Hidden Layer Vec}}}
Similarly, for the partial derivative with respect to \(\mathbf{x}_{w}\), we can swap the roles of \(\mathbf{x}_{w}\) and \(\theta_{j-1}^{w}\) in the gradient above. Thus, along the path from the root through the non-leaf nodes to the leaf of \(w\), each \(\theta_{j-1}^{w}\) is updated once, while \(\mathbf{x}_{w}\) accumulates a gradient contribution from every internal node on the path, which is applied as a single update at the end.
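Concretely, the accumulated update for \(\mathbf{x}_{w}\) (in the CBOW code this is the \(\texttt{neu1e}\) buffer, which is later added to the embedding vector of every context word) is
\(\displaystyle \mathbf{x}_w := \mathbf{x}_w + \eta \sum_{j=2}^{l^{w}} \left[ 1 - d_j^w - \sigma(\mathbf{x}_w^{\top}\theta_{j-1}^w) \right] \theta_{j-1}^w\)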
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden (backpropagate the error from the Huffman tree to the hidden layer)
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
Backpropagation
The following code is from \(\texttt{void *TrainModelThread(void *id)}\).
Continuous Bag of Words
\mathcal{L} =\prod_{w \in C}p(w \mid \text{Context}(w))\Longrightarrow \sum_{w \in C} \log p(w \mid \text{Context}(w))
where \(C\) denotes the whole corpus (the dataset).
for (a = b; a < window * 2 + 1 - b; a++) if (a != window){ // CBOW: b is the random window offset
// context range: [sentence_position - window + b, sentence_position + window - b]
c = sentence_position - window + a; // c: context position, sentence_position: location of the target word
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
// syn0 is allocated with (long long)vocab_size * layer1_size * sizeof(real)
// syn0[last_word * layer1_size .. last_word * layer1_size + layer1_size - 1] is last_word's embedding vector
for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size]; // accumulate the context embeddings
}
if (hs) for (d = 0; d < vocab[word].codelen; d++){ // Hierarchical Softmax, walking down from the root
// codelen: length of the Huffman code; point: indices of the inner nodes on the path to the leaf
f = 0;
l2 = vocab[word].point[d] * layer1_size; // point records the Huffman path; locate the current inner node's offset
// Propagate hidden -> output
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2]; // compute the inner product
if (f <= -MAX_EXP) continue; // discard if the inner product is out of range
else if (f >= MAX_EXP) continue;
else f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]; // sigmoid of the inner product via the lookup table
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha;
// Propagate errors output -> hidden (backpropagate the error from the Huffman tree to the hidden layer)
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output (update the current inner node's vector)
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
}
if (negative > 0) for (d = 0; d < negative + 1; d++) { // 1 positive sample + 'negative' negative samples
if (d == 0) {
target = word; // the positive sample is the true target word
label = 1;
} else {
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size]; // draw a negative word from the unigram table
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue; // skip draws that collide with the target word
label = 0;
}
l2 = target * layer1_size;
f = 0;
for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2]; // inner product with the sampled word's output vector
if (f > MAX_EXP) g = (label - 1) * alpha; // sigmoid saturates at 1
else if (f < -MAX_EXP) g = (label - 0) * alpha; // sigmoid saturates at 0
else g = (label - expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]) * alpha; // g = (label - sigmoid(f)) * alpha
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2]; // accumulate the error for the hidden layer
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c]; // update the output vector of the sampled word
}
// hidden -> in
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
c = sentence_position - window + a;
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c];//update embeddings
}
\(\blacktriangleright\Box\Box\Box\boxplus\Box\Box\Box\Box\boxtimes\Box\Box\Box\Box\boxplus\Box\Box\Box\blacktriangleleft\)
\([\texttt{sentence_position - window + b}, \texttt{sentence_position + window - b}]\)
Dynamic Window Size8
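In \(\texttt{word2vec.c}\), the dynamic window is implemented by shrinking the window with a random offset \(\texttt{b}\) drawn for every center word, so nearer context words are used more often. A minimal sketch of that draw (the helper function is mine; the original inlines it in \(\texttt{TrainModelThread}\)):
static unsigned long long next_random = 1;

/* Draw b uniformly in [0, window - 1]; the effective window is then
   (window - b) words on each side, which is exactly why the loops above
   run for (a = b; a < window * 2 + 1 - b; a++). */
int draw_window_offset(int window) {
  next_random = next_random * (unsigned long long)25214903917 + 11;
  return (int)(next_random % window);
}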
subsampling and rare-word pruning9
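Subsampling is applied while the sentence is being read: each occurrence of a very frequent word may be randomly discarded before training even sees it. The sketch below (a hypothetical \(\texttt{keep_word}\) helper; the original inlines this test) roughly follows the formula in word2vec.c, where \(\texttt{sample}\) is the \(\texttt{-sample}\) threshold, \(\texttt{cn}\) the word's count, and \(\texttt{train_words}\) the total corpus size:
#include <math.h>

typedef float real;

/* Returns 1 if the word should be kept, 0 if this occurrence is discarded.
   The keep probability shrinks as the word's relative frequency grows. */
int keep_word(long long cn, long long train_words, real sample,
              unsigned long long *next_random) {
  real ran = (sqrt(cn / (sample * train_words)) + 1) * (sample * train_words) / cn;
  *next_random = *next_random * (unsigned long long)25214903917 + 11;
  return ran >= (*next_random & 0xFFFF) / (real)65536;
}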
Skip-Gram
// 4.2.2 the skip-gram model:
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) { // predict the non-center words (the words within the window)
c = sentence_position - window + a;
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
l1 = last_word * layer1_size;
// reset the accumulated error term
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
// HIERARCHICAL SOFTMAX
// 4.2.2 skip-gram model: HIERARCHICAL SOFTMAX
if (hs) for (d = 0; d < vocab[word].codelen; d++) {
f = 0;
l2 = vocab[word].point[d] * layer1_size;
// Propagate hidden -> output
// inner product of the word vector and the weights of the Huffman-tree non-leaf node
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
// same as the hierarchical softmax discussion for CBOW
if (f <= -MAX_EXP) continue;
else if (f >= MAX_EXP) continue;
// the rest mirrors the CBOW case above
else f = expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )];
// 'g' is the gradient multiplied by the learning rate
g = (1 - vocab[word].code[d] - f) * alpha; // note: code[d] actually belongs to the next level; point and code are offset by one level
// Propagate errors output -> hidden
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
// Learn weights hidden -> output
for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
}
// NEGATIVE SAMPLING
// 4.2.2 skip-gram model: NEGATIVE SAMPLING
if (negative > 0) for (d = 0; d < negative + 1; d++) {
if (d == 0) {
target = word;
label = 1;
} else {
next_random = next_random * (unsigned long long)25214903917 + 11;
target = table[(next_random >> 16) % table_size];
if (target == 0) target = next_random % (vocab_size - 1) + 1;
if (target == word) continue;
label = 0;
}
l2 = target * layer1_size;
f = 0;
for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
// the rest mirrors the CBOW case above
if (f > MAX_EXP) g = (label - 1) * alpha;
else if (f < -MAX_EXP) g = (label - 0) * alpha;
else g = (label - expTable[(int)( (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2) )]) * alpha;
for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
}
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
}
Attention in transformers, step-by-step | DL6
Code Repo
Original v 0.1c word2vec.c
Patched v 0.1b: https://github.com/William-Yeh/word2vec-mac. This modified version fixes issues that prevent the original repo from running on a macOS machine.
The following compares the original with the patched version of the v 0.1b code. This comparison is useful because it shows, at a minimum, how to handle platform compatibility issues.
Filename | Original | Patched |
compute-accuracy.c | #include <malloc.h> | #include <stdlib.h> |
demo-*.sh | wget | curl |
distance.c | #include <malloc.h> bestd[a] = 0; | #include <stdlib.h> bestd[a] = -1; |
word-analogy.c | #include <malloc.h> | #include <stdlib.h> |
The good thing is that for most of the C files, only malloc.h needs to be changed to stdlib.h, which is straightforward; similarly, in the demo-*.sh scripts, wget and gzip need to be replaced by curl and unzip. The more complex case is distance.c: one issue is the handling of the bestw character matrix, and another is the initialization of bestd[a].
I cannot believe that the researcher whose work I am learning from met with such a tragic fate.
Rest In Peace Xin Rong, you will be forever remembered.
- Mikolov, Tomas, Kai Chen, Greg S. Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv preprint arXiv:1301.3781 (2013).📁🗒[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]
- Rong, Xin. “word2vec parameter learning explained.” arXiv preprint arXiv:1411.2738 (2014).[↩]
- Rong, Xin. “word2vec parameter learning explained.” arXiv preprint arXiv:1411.2738 (2014).[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]
- Nikkarinen, Irene, Tiago Pimentel, Damián E. Blasi, and Ryan Cotterell. “Modeling the unigram distribution.” arXiv preprint arXiv:2106.02289 (2021).[↩]
- word2vec原理推导与代码分析 (word2vec derivation and code analysis, blog post in Chinese)[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]
- Goldberg, Yoav, and Omer Levy. “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method.” arXiv preprint arXiv:1402.3722 (2014).[↩]