Implementing doc2vec


In a previous post [1], we’ve taken a look at what doc2vec is, who invented it, what it does, what implementations exist, and what has been written about it. In this post, we will work out the details of a version of the simplest doc2vec algorithm: PV-DBOW. Our approach may not be exactly the same as in the original paper [7]. Our aim is a starting point: minimal, easy to understand, and still interesting enough to be expected to work. In a follow-up post, we may develop a Tensorflow implementation. This post should be detailed enough for you to do it yourself, perhaps you’ll beat us to it 🙂 For your inspiration, also refer to the doc2vec Tensorflow code here [18]; that code implements PV-DM, the other doc2vec algorithm.

If you have already watched the videos from the Udacity course on deep learning by Vincent van Hoecken [2] or you already have done some other neural network tutorials, I guess this post should be quite easy to follow. But if this post series is one of the first things you read on neural nets, perhaps you will still be able to follow along, looking up some things along the way yourself.

Before we dive in, I would like to recall the overall strategy from the previous post [1]. Recall that we use doc2vec because we want vector representations of documents, paragraphs or sentences. These vectors should have a fixed, not too large size. We will use these vectors for some task, for example document classification. Doc2vec has the following strategy:

1. We’re training a small network on a task that is somehow related to the end-to-end task.
2. In the middle of this small network we have a small layer.
3. After training, we remove the output layer of our small network (picture literally breaking the network apart).
4. The middle layer is now the output layer of our network, and it produces document vectors now.

It is a quite common pattern to pre-train a network and then use part of that network for some other purpose. Doc2vec is just one example. By working it out in detail, you may get some intuition about why it works. And this way of thinking you can then apply to other tasks, and build similar networks yourself. And, on the side, you may get something out of the details themselves.

The distributed bag of words model (PV-DBOW)

Recall that doc2vec algorithms were originally named paragraph vector algorithms. The authors of the original paper [7] start with a variant of paragraph vectors they call the distributed memory model (PV-DM). Later, they introduce another variant, the distributed bag of words model (PV-DBOW). For this blog post, we’ll start with the latter one, because it is the simpler model of the two. Starting with the simpler model will make things easier to learn. It is also easier to use. It has less parameters that need to be trained. As a consequence, it needs less data to train on. The network we’re training in PV-DBOW looks like this:


It looks a bit different than Figure 3 in the paper [7]. Here, we have drawn each and every unit of the network. Also, we have only room for one word in the output layer on the right. In the paper, Figure 3 has multiple words in the output layer. Our version looks a bit simpler. What we have here is a neural network with three layers. Let’s go over all of them.

The input layer, on the left, hasN units. N is the number of documents in the notation used in [7]. During training, we will present a document in the input layer. The document will be encoded as a one-hot vector. Each unit in the input layer corresponds to a single document-id in your corpus. When we present a document, the unit corresponding to its id will be set to 1, and all other units will be set to 0. Together, the units form the document vector \vec{d}. Surprising, isn’t it? The only information we use as input is the document id.

The middle layer has size pp is a free parameter that is chosen by you. It is the size of the paragraph vectors that doc2vec will output for you. The big fat arrow between the input layer and the hidden layer means that the two layers are fully connected. Each unit in the input layer is connected to each unit in the middle layer. Each connection has a certain weight. We should initialize these weights randomly, with small values, prior to training the network.

If you know a bit about neural networks, you might expect that at this point I am going to say something about a threshold in the middle layer, or about an activation function. But doc2vec, like word2vec, uses a very simple middle layer. One reason for this is that it speeds up computation quite a bit. The activation in that middle layer is just a weighted sum of the activation in the input layer. A compact way to represent this is:

 \hat{\vec{d}} = D\vec{d}

Here, D is an p\times N matrix that contains all the weights. \hat{\vec{d}} is the activation in the middle layer. It is the so-called paragraph or document vector produced by the network for \vec{d}. That is why we denote it here with \hat{\vec{d}}: we view it as an estimation, a representation of \vec{d} in a lower, p-dimensional space. It is also called an “input vector” [7]. Because \vec{d} is a one-hot vector, we have that the n‘th column in D contains \hat{\vec{d_n}}: the paragraph vector for the document with id n. A layer like the middle layer in our network here is often called an embedding layer.

The output layer has M units. In [7], M is used to denote the number of unique words in the vocabulary. The output layer activity in each of its units corresponds to a single word in the vocabulary. The output layer can be seen as a crude estimation of a term vector, I’ve called it \vec{l} in the picture: l for “logits”: a common name for activations in the output layer of a network. As you can see, the output layer is fully connected to the hidden layer. We can represent this with a matrix multiplication again:

 \vec{l} = U\hat{\vec{d}}

Here, U is an M\times p matrix that contains all the weights between the middle and the output layer. It is quite common to also have biases in the output layer, in which case we would write l = U\hat{\vec{d}} + \vec{b}. In expositions of the related word2vec algorithm, biases are sometimes included [5, 7] and sometimes excluded [16]. We omit them here. Note that U has a row for every word in our vocabulary. These rows are sometimes also called “output vectors” for the corresponding words. Let’s refer to these vectors here as \vec{u_m}, where m is the index of the corresponding word in the output layer. Then, we have

 l_m = \vec{u_m}\cdot\hat{\vec{d}}\mathrm{.}

Back to our overall strategy. After training, D contains the document vectors that we will take out and re-use as part of another application. Another way to think about this, is that after training, we will remove the last layer of the network. That would make the middle layer the output layer of the network. And what will it give us if we feed it a document vector? A low-dimensional document vector of a fixed size.

Next, we work on what our network will learn. Or, equivalently, what our objective function is. The function we want to minimise.

The softmax function and a loss function

In [7], we read that we will force the model to predict words randomly sampled from the paragraph. In our model above, we use documents as paragraphs. So what happens is that we give the network a document id, and we ask it to predict a randomly sampled word from the corresponding document. Amazing, isn’t it? Can we hope to achieve this? Well, remember that we can give the network a huge quantity of unique examples: as many as we have words in our corpus! And, in practice, during training, we’ll feed it our entire corpus many times.

Two connected functions we are going to add to our model in this section are the softmax function and a loss function:


The first three layers here are the same as before. The output layer activations are now passed into the softmax function s:

 s(l_i) = \frac{e^{l_i}}{\sum_k{e^{l_k}}}

Some properties of this function, that you can verify yourself:

1. 0 < s(l_i) < 1
2. \sum_i s(l_i) = 1
3. It boosts the activation of the units with the highest activation.
4. If the activation in the logits is higher, s(l_i) values will get closer to one and zero: the network will be more “sure” of its predictions.

The softmax is used for multi-class classification tasks. In multi-class classification tasks each observation is labeled with one of several classes. The task that we set our network, however, is a multi-class multi-label classification [4]: an observation can be labeled with one or more of several classes. Our network is asked to predict the words (multiple labels) in the document from the document id. How do we fix this? Well, as a training example, we’ll feed a document (observation) to our network, each time with a single target word (label). Each time the document is presented as a training example, possibly a different target word is used as label. In this, we follow the same pattern as in the word2vec algorithm Skip-Gram [5, 7]. This algorithm is what inspired PV-DBOW [7].

The result of the softmax function is the fourth layer in the above picture, which I’ve called \hat{\vec{t}}. I’ve called it \hat{\vec{t}} because it is meant to be an approximation of the actual term to be predicted. Because of the first two properties of the softmax function, we can now more or less legally think of the output of our network as an estimated probability distribution over the classes for an input document.

To train our network, we need to measure how well these probabilities were estimated for sampled target terms. For this, we compare \hat{\vec{t}} to the target term vector. That vector is all the way in the right in the picture above, it is called \vec{t}. This is a one-hot vector. We compare \vec{t} and \hat{\vec{t}} using a loss function. The loss function should be low when \hat{\vec{t}} and \vec{t} are similar. A common choice [7, 8, 9, 10] based on cross entropy is:

 E(\hat{\vec{t}}, \vec{t}) = -\sum_i t_i \log{\hat{t}_i}

Here are some properties of E that you can verify for yourself:

1. E(\hat{\vec{t}}, \vec{t}) > 0
2. If t_i is zero, the i‘th element of the sum will be zero, regardless of the value \hat{t}_i}, our prediction.
3. E is low when our prediction, \hat{t}_i, is close to 1.0 for the element of our target vector t_i that is one.

The second fact puzzled me for a bit. Does it not matter at all what a network predicts for units that are zero in the target vector? What if my network predicts a high value for all output neurons? Well, the softmax function prevents this. If all logits are equally high, the normalisation will make them all low. And the network will then incur loss on the unit that is one in the target vector. So, the combination of the softmax and this cross entropy loss is important.

Because \vec{t} is a one-hot vector, we can simplify E a bit further. Let y be the index where t_y = 1. Then

 E(\hat{\vec{t}}, \vec{t}) = -\log{\hat{t}_y}

Training our neural network involves minimising a loss function on a set of training points. For that, we need a loss function that compares a set of training predictions to a set of corresponding correct target vectors. Let’s denote our set of training target vectors as T, and let \vec{t_j} denote the j‘th target vector. As a loss function, we’ll use

 L = \frac{1}{|T|} \sum_{j=1}^{|T|}E(\hat{\vec{t_j}}, \vec{t_j})\mathrm{,}

we are simply averaging E over a set of training examples here.

Combining all our equations up to this point, it is not hard to write out L in full, you can try it yourself:

 L = \frac{1}{|T|} \sum_{j=1}^{|T|}{-\vec{u_{y_j}}\cdot\hat{\vec{d_j}} + \log{\sum_{m=1}^M{ e^{\vec{u_m}\cdot\hat{\vec{d_j}}}}}

Here, y_j is the index of the correct word for training example j in the target vector \vec{t_j}. So, \vec{u_{y_j}} is just the output vector corresponding to that word.

Now there is only one thing to add to our loss function: regularisation. This is usually an important part in trying to prevent a network from overfitting. Still, in the doc2vec and word2vec papers [7, 11, 12] it is not mentioned; we may experiment with omitting it as well. An interesting aspect of doc2vec is that we are not really interested in good prediction on a test set of our network. Rather, we are interested that the paragraph vectors we get in D in the end will be useful for some other purpose. In any case, a common regularisation approach is L2 regularisation:

 L' = L + \lambda \frac{1}{2}||\vec{w}||_2^2

Here, ||\vec{w}||_2^2 is the L2 norm (length) of \vec{w}, squared. \vec{w} here is one big vector with all the network weights in it. In our network, that is all the elements of the matrices U and D\lambda is a parameter that we have to set in advance, it is not learned by stochastic gradient descent. Such a parameter is often called a hyperparameter.

One way to think about this regularisation is that we are rewarding weight configurations with small weights on average. And now if you think about the fourth property of the softmax function as discussed above, then you can translate this as rewarding weight configurations where the network is less sure of itself.

Now that we have defined our loss function, we have the bull by the horns. It constitutes the full definition of the optimisation problem we are solving, and the full definition of our network. In the next section we very briefly sketch how we minimise L.

Stochastic gradient descent

It would require another series of blog posts to fully explain the way neural networks are trained with backpropagation and gradient descent, but fortunately it is not hard to find a lot of material on that. It is important to understand the details [11], but we give only a very brief outline here, for completeness.

For a given set of input vectors \vec{d_j}, everything in L is constant save the weights in D and U. Our goal is to find a configuration for the weights such that L is minimised. This can be analytically solved actually for the complete training set. But for a big training set that would be computationally inefficient. Instead the idea of stochastic gradient descent is to sample some training samples randomly, so \{\vec{d_1}, \ldots, \vec{d_j}, \ldots\}is then a relatively small set of training examples. Such a set of examples is often called a batch. For this batch, again, L can be thought of as a function of the weights in D and U. These weights span a weight space. During training, we compute the gradient of L in the current point in the weight space. This gradient is the direction in which L increases most sharply. What we want to do is move in the opposite direction, but only slowly, using a learning rate that we will have to determine in advance (including a possible decay function and / or decay rate for this learning rate). The SGD iteration results in a set of small adjustments to all the weights: the weight delta’s. This process of finding the weight delta’s for all the weights in the network is often called backpropagation.

Stochastic gradient descent (SGD) is a very successful algorithm and it is the driving force behind the success of neural networks. Now that we have sketched SGD, it is time to define the final nuts and bolts of our first, simple doc2vec implementation. We start with how we will sample training examples.

Generating training batches

A training example is a tuple (docid, word) in our network. When we sample such tuples, there are a couple of choices to make. The first and most important is what to do with the notion of a text window. Then, we have to deal with the extremes in term frequencies: in natural language, there are many extremely rare words, as well as some extremely frequent words. And last but not least, we have to do something about the computational complexity of our loss function, which contains a summation over our entire vocabulary.

Using a text window

From the description of PV-DBOW [7], it is not easy to determine what is done: “(…) at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector.”. The simplest possible interpretation of this is to just randomly sample words. Let’s call this option A. If we go with option A, then we would end up with a model that can predict the words for a given document. Given the document, this would give us a global idea of its contents. This makes option A related to methods like LSA, LDA, and ALS [19].

The description just given suggests more. If we just randomly sample words, why mention a text window at all? The authors mention that PV-DBOW is similar to the word2vec Skip-Gram model [7]. A PV-DBOW approach inspired by the Skip-Gram model is to sample a text window, and then offer all the words in this text window in the same batch together, i.e., in the same iteration of stochastic gradient descent. Let’s call this option B. We could rewrite our loss function a little bit to make it more explicit that we want to correctly predict sets of words that are close together. Then, we could postulate a Naive Bayes assumption: given a document, and a text window, words in the window occur independently of each other. Then we could simplify our loss function and show that option B implements minimising that simplified loss function. See [16] for a similar derivation for Skip-Gram. If we follow option B, we would end up with a model that can predict sets of words that are close together in documents. Words that are close together often have some syntactic and semantic relations to each other. Option B would capture some of that. As do the word2vec algorithms CBOW and Skip-Gram. Since this is a major strength of these algorithms, our first bet would be to use option B for PV-DBOW.

Dealing with term frequencies

Some terms are very infrequent: a common approach is to ignore these terms and focus efforts only on the most frequent, e.g., 100K, 500K, or 1M words. Remaining terms could be mapped to a catch-all token ‘UNKNOWN’ if we want to keep their position in the text [5]. We will do this initially, too.

Other terms are very frequent, words like ‘the’, ‘a’, ‘of’, etc., and do not convey much meaning at all (although they have syntactic functions). In [12], for the Skip-Gram model, training words are sampled with a lower probability if they have a very high frequency, according to a slightly odd heuristic formula. How to combine this with text windows is not immediately obvious. Do we discard each word in our corpus with a probability based on term frequency, and then act as if it was never there, before we even sample text windows? In our implementation, initially, we will not make any adjustments for high frequency words at first; no adjustments are mentioned in [16] either. And if things don’t work, perhaps our first try will be a very crude but deterministic and often used heuristic: remove the top K terms from the corpus before doing anything.

Negative sampling

Finally, we will have to take some steps to make training our network computationally feasible. To see why, consider how big our vocabulary could typically be, and then think of computing the softmax function over and over again during training. We have a choice of methods to fix this, namely hierchical softmax [12] or negative sampling [12, 16]. We’ll opt for negative sampling. The basic idea in terms of our network structure is that in our SGD iteration we will work with a very small output layer. It will contain the correct terms, and only a handful of randomly sampled incorrect terms [2]. This translates to some changes to our loss function [12]. A good exposition of these changes for the word2vec Skip-Gram model is given in [16]. A similar line of reasoning can be followed for PV-DBOW.


We now have a detailed recipe for a PV-DBOW implementation. The level of detail is such that we should now be able to implement it in Tensorflow. Tensorflow has primitives for the first two layers (the embeddings), the output layer (just a matrix multiplication), the softmax-loss-function combo, and even negative sampling. What remains for us to do is to supply the training batches, to tie it all together, and to choose sensible values for the hyperparameters. We may follow-up on this post with an example PV-DBOW implementation, and otherwise, we hope you now feel confident enough to implement it yourself!


7. Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents.
10. Pattern Recognition and Machine Learning. Bishop, 2006, p 209
11. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
12. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.