Coding doc2vec


In two previous posts, we googled doc2vec [1] and “implemented” [2] a simple version of a doc2vec algorithm. You could say we gave the specifications for a doc2vec algorithm. But we did not actually write any code. In this post, we’ll code doc2vec, according to our specificication. Together, these three blog posts give some understanding of how doc2vec works, under the hood.

Understanding by building. We’ve made a Jupyter notebook [3], and we’ll walk you through it by highlighting some key parts. As a starting point for the notebook, we’ve used the word2vec implementation [4] from the Vincent Vanhoucke Udacity course. We’ve modified basically everything in that a bit, and added some new stuff. But the idea remains the same: a minimal code example for educational purposes.

What we’ll do

We’ll get and preprocess some documents from a well known Reuters corpus. We’ll train a small doc2vec network on these documents. While we train it, the document vectors for each document are changing. Every so many steps, we use the document vectors. First, we visualise them with T-SNE in a two-dimensional space. We color code the document vectors with class labels (the news categories from the Reuters corpus), to see if document vectors belong to the same class are getting closer to each other. Second, we train a simple linear classifier using the document vectors as input, to predict class labels. We’ll call this prediction task the end-to-end task. We’ll observe the following: 1. The doc2vec network it’s loss decreases as it trains. This is a good sign. 2. If we train doc2vec longer, the performance for the end-to-end task increases. This is another good sign. 3. The two-dimensional visualisation with color-coded document vectors is not very informative. This is too bad, but not a problem. In the end, it’s just a visualisation.


The first step is preprocessing. The following one liner reads in the entire Reuters data set in memory. If your data is very big, this is not advisable of course. But: early optimization is the root of all evil 🙂


fileid2words = {fileid:
        [normalize(word) for word in word_tokenize(
                reuters.raw(fileid)) if accept(word)] \
        for fileid in reuters.fileids() if accept_doc(fileid)}

In words: we iterate over Reuters files, and accept some of the documents. For each document, we accept some of the words. And the words we do accept, we normalise. Let’s take a closer look at each step.

def accept_doc(fileid):
    return fileid.startswith('training/') \
            and np.random.random() * 100 < PERCENTAGE_DOCS

We only accept documents from the training set of the Reuters corpus. And we select a random percentage of these documents according to the hyperparameter PERCENTAGE_DOCS that we have set by hand at the top of the notebook.


def accept(word):
    # Accept if not only Unicode non-word characters are present
    return re.sub(r'\W', '', word) != ''

We refuse words that consist entirely of non-word characters. The words that are refused here are taken out of the token stream before anything happens. You can play with this and refuse other words, too. For example, stopwords like ‘the’, ‘and’, etc. This may or may not be a good idea. One way of learning more about it is to play with it and see what happens to your performance in the end-to-end task.

def normalize(word):
    return word.lower()

And we lowercase all tokens. This is a first step to reduce data sparsity of natural language. There are other ideas, too. For example, replacing numbers with a special NUMBER token, or spelling out numbers with words, so that ‘123’ becomes ‘one two three’. There is always a tradeoff: normalising tokens may lead to some information loss. After we have obtained the dictionary fileid2words, we build our vocabulary:


count = [['__UNK__', 0], ['__NULL__', 0]]
count.extend([(word, count) for word, count in collections.Counter(
        [word for words in fileid2words.values() \
        for word in words]).most_common(
                VOCAB_SIZE - 2 + REMOVE_TOP_K_TERMS)[
                ] if count >= MIN_TERM_FREQ])

Here, first we flatten the dictionary, our entire dataset, to just a sequence of tokens (words). Then we count the occurence of all the unique words. We add these counts to the counts of two special tokens: __UNK__ and __NULL__. We use the most common words as our vocabulary. We’ll remove the top k most common terms, because these tend to be stopwords. And we require the word frequency to be higher than a certain minimum. That is because we can hardly expect our network to predict a term that would only occur, say, once in the whole corpus. Occurences for words that did not end up in our vocabulary will later on be replaced with the __UNK__ (unknown) token. So at this point no words will be taken out of the corpus anymore, they will only be replaced. One thing that you can try is if it works better to really remove stopwords from the corpus. Don’t worry about the __NULL__ token, it is only used when our documents are too short to even fit a single text window in (remember that in doc2vec, we try to predict words from fixed size text windows that occur in a document). That will not happen often in the Reuters corpus.

Training the network

In Tensorflow, training a network is done in two steps. First, you define the model. You can think of the model as a graph. Second, you run the model. We’ll take a look at the first step, how our model is defined. First: the input.


# Input data
dataset = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
labels = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])

The dataset is defined as a placeholder with the shape of a simple array that contains ints. When we run the model, it will contain document ids. Nothing more, nothing less. It will contain as many document ids as we want to feed the stochastic gradient descent algorithm in a single iteration. The labels placeholder is also a vector. It will contain integers that represent words from the vocabulary. So, basically, for each document id we want to predict a word that occurs in it. In our implementation, we make sure that a batch contains one or more text windows. So if we use a text window of size 8, a batch will contain one or more sequences of eight consecutive words. Next, we take a look at the weights in our neural network:


# Weights
embeddings = tf.Variable(
        tf.random_uniform([len(doclens), EMBEDDING_SIZE],
                          -1.0, 1.0))
softmax_weights = tf.Variable(
                [vocab_size, EMBEDDING_SIZE],
                stddev=1.0 / np.sqrt(EMBEDDING_SIZE)))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

You can think of embeddings as the transpose of the matrix D from our previous post [2]. In its rows, it has a document vector of length EMBEDDING_SIZE for each document id. This document vector is also called an “input vector”. You can also think of embeddings as the weights between the input layer and the middle layer of our small doc2vec network. When we run the session, we will initialize the embeddings variable with random weights between -1.0 and 1.0. The softmax_weights are the weights between the middle layer and the output layer of our network. You can also think of them as the matrix U from our previous post. On its rows, it has an “output vector” of length EMBEDDING_SIZE for each word in the vocabulary. When we run the model in our session, we will initialize these weights with (truncated) normally distributed random variables with mean zero and a standard deviation that is inversely proportional to EMBEDDING_SIZE. Why are these variables initialized using a normal distribution, instead of with a uniform distribution like we used for the embeddings? The short answer is: because this way of initialisation has apparently worked well in the past. You can try different initialisation schemes yourself, and see what it does to your end-to-end performance. The long answer; well, perhaps that’s food for another blog post. The softmax_biases are initialised here with zeroes. In our previous post, we mentioned that softmax biases are often used, but omitted them in our final loss function. Here, we used them, because the word2vec implementation we based this notebook on used them. And the function we use for negative sampling wants them, too. The activation in the middle layer, or, alternatively, the estimated document vector for a document id is given by embed:


embed = tf.nn.embedding_lookup(embeddings, dataset)

tf.nn.embedding_lookup will provide us with fast lookup of a document vector for a given document id. Finally, we are ready to compute the loss function that we’ll minimise:


loss = tf.reduce_mean(
                softmax_weights, softmax_biases, embed,
                labels, NUM_SAMPLED, vocab_size))

Here, tf.nn.sampled_softmax_loss takes care of negative sampling for us. tf.reduce_mean will compute the average loss over all the training examples in our batch. As an aside, if you take a look at the complete source code in the notebook, you’ll notice that we also have a function test_loss. That function does not use negative sampling. It should not, because negative sampling underestimates the true loss of the network. It is only used because it is faster to compute than the real loss. When you run the notebook, you will see that the training losses it prints are always lower than the test losses. One other remark about the test loss is the following: the test examples are taken from the same set of documents as the training examples! This is because in our network, we have no input to represent a document that we have never seen before. The text windows that have to be predicted for a given document id are different though, in the test set.

A two-d visualisation of the document vectors with T-SNE

As we train our network, as we minimise our loss function, we keep making small changes to all the weights in the network. You can think of the weights between the embedding layer and the output layer as weights that affect where the decision boundaries of our output neurons are located. And you can think of the weights between our input layer and the embedding layer as weights that determine where our documents are projected in the embedding space. These weights help determine our document vectors. We train our network to predict words from text windows sampled from documents. So, intuitively, as we train longer, the documents change location in the embedding space in such a way that it becomes easier to predict the words that occur in these documents. And now here comes an important point. If we can predict which words occur in a document, then perhaps we can also say something about the topics in a document, or about the genre or category of the document. This is in fact an important motivation for using doc2vec as a step in a document classification algorithm. Okay. So, as we train to predict words, our document vectors keep changing location in the embedding space. If we want to use the document vectors as input for a document classifier, we hope that documents with the same class are located close to each other in the embedding space. If we want to make a visualisation that will show if this is happening, we need to use a dimensionality reduction algorithm, because we can’t visualise our embedding space if it has more than three dimensions. Because we are interested in the distance between document vectors, we use T-SNE. T-SNE is an algorithm that aims to project points close to each other that were also close to each other in the original high-dimensional space. And the other way around for points that are far apart in the high-dimensional space. And after we visualise the document vectors in 2D, we color code the points: purple points have the most common Reuters class as one of the class labels. In our experiments this was usually the class ‘earn’. The rest of the points does not have this label. Sadly, in experiments we did not see a clear purple cluster yet.
However, it is hard to measure the quality of the document vectors by visualising them like this. To do that more quantitavely, we need to perform the end-to-end task that want to used the document vectors for.

Predicting the most common class label from document vectors

The end-to-end task that we’ll address in this post is to classify for each document whether or not it has the class label for the most common Reuters class (‘earn’, in most of the Reuters samples in my experiments). This is a binary classification task. As input we’ll use the document vectors that are learned by our doc2vec network. As algorithm we’ll use a linear classifier, because the inventors of doc2vec specifically mention that paragraph vectors are suitable as input for any kind of linear classification algorithm. It makes sense, too: doc2vec itself has only one classification layer on top of the embedding layer. Each neuron in its output layer has an activation that is just a weighted sum, just a linear function, of the document vector in the embedding layer. And then all that happens after that is the softmax function. Predicting the class label of a document is easier than predicting the words in a document. So our first try will just be to use a simple linear classifier for document classification as well. Just to show that the classification algorithm we use does not have to be a neural network, we’ll use a support vector machine with a linear kernel. As an evaluation metric, we calculate precision and recall for both classes; then we calculate the harmonic mean of these two for each class: this is the F1 score. Then we take the weighted average of the F1 scores of the two classes: weighted by the number of observations of both classes. We report scores on a random subset of the Reuters training set. We do not report scores on the Reuters test set yet, because we are only in an exploratory phase yet here. In general, you should use test data sparingly, in this post, we’ll not use it at all! Every so many steps of training, we pause training of the network, and we use the current document vectors to train a classifier. In one of hour experiments, where we trained doc2vec on all documents in the Reuters training set, we got the following F1 results:

Number of SGD iterations (steps) average F1 PV-DBOW training loss PV-DBOW test loss
30K 0.53 2.6 4.6
60K 0.57 2.4 4.3
90K 0.60 2.3 4.2
100K 0.61 2.2 4.2

What you can see here in the third and fourth column is that the training loss and test loss of our doc2vec network slowly decreases as we are training. We are slowly getting better at predicting the words from the document ids. But what is more important is that you can see in the second column that the performance on the end-to-end task (document classification) is also increasing as we are training the network longer. That means we are on the right track, even if the absolute performance is not yet that impressive. Now that we have walked you through the main steps of our Jupyter notebook [3], go ahead and run it yourself! All you need is a working installation of Python 3, numpy, pandas, sklearn, Tensorflow and Jupyter Notebook. pandas is not essential, we only use it to output some descriptive statistics quickly. You can implement alternatives for all of the many choices we had along the way. You can try different values for the configuration variables at the top of the notebook. It should be quite possible to improve the performance that we got in our first experiments: good luck!. In the beginning, do set `PERCENTAGE_DOCS` to a lower value, e.g., something like 5 percent. 100K steps of training on all training set documents in the Reuters dataset took about an hour on a single desktop computer with 8 cores. You don’t want to wait that long just to try out if everything works.


3. 4.