Machine learning – an example
In my previous blog post, I tried to give some intuition on what neural networks do. I explained that when given the right features, the neural network can generalize and identify regions of the same class in the feature space. The feature space consisted of only 2 dimensions so that it could be easily visualized. In this post, I want to look into a more practical problem of text classification. Specifically, I will use the Reuters 21578 news article dataset. I will describe a classification algorithm for this dataset that will utilize a novel feature extraction algorithm for text called doc2vec.
I will also make the point that because we use machine learning, which means the machine will do most of the work, the same algorithms can be used on any kind of text data and not just news articles. The algorithm will not contain any business logic that is specific to news articles. The neural network in particular is a very reusable part. In machine learning theory, a neural net is known as a universal approximator. That means that it can be used to approximate many interesting functions. In practical terms, it means you can use the same neural network architecture for image data, text data, audio data and much more. So trying to understand one application of a neural network can help you understand many more machine learning applications.
Training the Doc2vec model
In the previous post, I explained how important it is to select the right features. Doc2vec is an algorithm that extracts features from text documents. As the name implies, it converts documents to vectors. How exactly it does that is beyond the scope of this blog (do see the paper at: https://arxiv.org/pdf/1405.4053v2.pdf), but its interface is pretty simple. Below is the Python code to create vectors from a collection of documents:
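from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk import word_tokenize
from nltk.corpus import reuters
# Load the reuters news articles and convert them to TaggedDocuments
taggedDocuments = [TaggedDocument(words=word_tokenize(reuters.raw(fileId)), tags=[i]) for i, fileId in enumerate(reuters.fileids())]
# Create and train the doc2vec model
doc2vec = Doc2Vec(size=doc2vec_dimensions, min_count=2, iter=10, workers=12)
# Build the vocabulary from the corpus
doc2vec.build_vocab(taggedDocuments)
# Train the doc2vec model on the corpus (explicit counts are required from gensim 2.0 on)
doc2vec.train(taggedDocuments, total_examples=doc2vec.corpus_count, epochs=doc2vec.iter)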
(for the complete script see: reuters-doc2vec-train.py)
To get some intuition on what doc2vec does, let’s convert some documents to vectors and look at their properties. The plotting script converts documents from the topic jobs and documents from the topic trade to document vectors. With the help of dimensionality reduction tools (PCA and t-SNE) we can reduce these high-dimensional vectors to 2 dimensions. See scripts/doc2vec-news-article-plot.py for the code. These tools try to preserve the layout of the data: coordinates that are far apart in the high-dimensional space should end up far apart in the 2-dimensional space, and coordinates that are near each other should stay near each other.
(see the source code at: doc2vec-news-article-plot.py)
What you see here are the document classes: red for the “jobs” topic documents and blue for the “trade” topic documents. You can easily see that there are regions with more red than blue dots. This gives us some intuition that the features we selected can be used to distinguish between these 2 classes. Keep in mind that the classifier can use the high-dimensional features, which probably show a better distinction than this 2-dimensional plot.
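To make the reduction to 2 dimensions a bit less magical, here is a minimal pure-Python sketch of PCA using power iteration. It is an illustration only: the plot script relies on library implementations of PCA and t-SNE, t-SNE is not covered here, and the toy vectors and all function names below are made up for this example.

```python
import math
import random


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def center(rows):
    """Subtract the per-column mean so the data is centred on the origin."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]


def covariance(centred):
    """Covariance matrix of rows that are already centred."""
    n, dims = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: repeatedly multiply and normalise a start vector."""
    rng = random.Random(seed)
    v = [rng.random() for _ in matrix]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project rows onto their two directions of largest variance."""
    centred = center(rows)
    cov = covariance(centred)
    first = top_eigenvector(cov)
    eigenvalue = sum(f * c for f, c in zip(first, mat_vec(cov, first)))
    # Deflate: remove the first direction so the next run finds the second.
    size = len(cov)
    deflated = [[cov[i][j] - eigenvalue * first[i] * first[j]
                 for j in range(size)] for i in range(size)]
    second = top_eigenvector(deflated, seed=1)
    return [(sum(a * b for a, b in zip(r, first)),
             sum(a * b for a, b in zip(r, second))) for r in centred]


# Hypothetical 4-dimensional stand-ins for real document vectors.
jobs = [[1.0, 0.2, 0.0, 0.1], [0.9, 0.1, 0.1, 0.0], [0.8, 0.3, 0.0, 0.2]]
trade = [[0.0, 0.2, 1.0, 0.1], [0.1, 0.1, 0.9, 0.3], [0.0, 0.3, 0.8, 0.0]]

points = pca_2d(jobs + trade)
for label, point in zip(["jobs"] * 3 + ["trade"] * 3, points):
    print(label, point)
```

In this toy data the two groups end up on opposite sides of the first axis, which is the kind of separation you hope to see in the real plot.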
Another thing we can do is calculate the similarity between 2 doc vectors (see the similarity function of doc2vec for that: gensim.models.doc2vec.DocvecsArray#similarity). If I pick 50 jobs vectors, their average similarity to each other is 0.16. The average similarity between 50 trade vectors is 0.13. If we now look at the average similarity between 50 jobs vectors and 50 trade vectors, we get a lower number: 0.02. We see that the trade vectors are farther away from the jobs vectors than they are from each other. This gives us some more intuition that our vectors contain information about the content of the news article.
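The comparison above can be sketched in a few lines of plain Python. I assume cosine similarity here, which is what the gensim similarity function computes as far as I know. The toy vectors and helper names are hypothetical and only stand in for real doc2vec output.

```python
import math
from itertools import combinations, product


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_within(vectors):
    """Average similarity over all distinct pairs inside one group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_a, group_b):
    """Average similarity over all pairs with one vector from each group."""
    pairs = list(product(group_a, group_b))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy vectors, not real doc2vec output.
jobs_vectors = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.8, 0.3, 0.0]]
trade_vectors = [[0.0, 0.2, 1.0], [0.1, 0.1, 0.9], [0.0, 0.3, 0.8]]

within_jobs = mean_within(jobs_vectors)
within_trade = mean_within(trade_vectors)
between = mean_between(jobs_vectors, trade_vectors)
print(within_jobs, within_trade, between)
```

If the features carry topic information, both within-group averages should be higher than the between-group average, just like the 0.16 and 0.13 versus 0.02 above.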
There is also a function that given some example vectors finds the top n similar documents, see gensim.models.doc2vec.DocvecsArray#most_similar. This can also be useful to see if your trained doc2vec model can distinguish between classes. Given a news article, we expect to find more news articles of the same topic nearby.
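As a rough sketch of that check: for every document, look up its nearest neighbours and count how many of them share its topic. This is a plain-Python stand-in, not the gensim implementation. The function names, toy vectors and labels are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def most_similar(query_index, vectors, topn):
    """Indices of the topn vectors closest to the query, excluding itself."""
    scores = [(cosine_similarity(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    scores.sort(reverse=True)
    return [i for _, i in scores[:topn]]


def neighbour_purity(vectors, labels, topn):
    """Fraction of nearest neighbours that share the query document's label."""
    same = total = 0
    for i in range(len(vectors)):
        for j in most_similar(i, vectors, topn):
            same += labels[i] == labels[j]
            total += 1
    return same / total


# Hypothetical toy vectors and topics.
vectors = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.8, 0.3, 0.0],
           [0.0, 0.2, 1.0], [0.1, 0.1, 0.9], [0.0, 0.3, 0.8]]
labels = ["jobs", "jobs", "jobs", "trade", "trade", "trade"]

purity = neighbour_purity(vectors, labels, topn=2)
print(purity)
```

A purity close to 1 suggests that documents of the same topic cluster together. A value near the share of the largest class suggests the vectors carry little topic information.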
Training the classifier
Now that we have a trained doc2vec model that can create a document vector given some text, we can use that vector to train a neural network to recognize the class of a vector.
It is important to understand that the doc2vec algorithm is an unsupervised algorithm. During training, we didn’t give it any information about the topic of the news article. We just gave it the raw text of the news article. The models we create during the training phase will be stored and will later be used in the prediction phase. Schematically our algorithm looks like this (for the training phase):
For the classifier, we will use a neural network that will train on all the articles in the training set. The reuters dataset is split into a training set and a test set, and the test set will later be used to validate the accuracy of the classifier. The code for the classifier looks like this:
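from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
model = Sequential()
model.add(Dense(input_dim=doc2vec_dimensions, output_dim=500, activation='relu'))
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
(this excerpt shows only the first layer; for the complete script, including the remaining layers, see: reuters-classifier-train.py)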
The complete script builds a neural network with 3 hidden layers. For each topic in the Reuters dataset, there will be an output neuron returning a value between 0 and 1 for the probability that the news article is about the topic. Keep in mind that a news article can have several topics, so the classifier can indicate the probability of more than 1 topic at once. This is achieved by using the binary_crossentropy loss function, which scores every output neuron as its own yes/no decision. Schematically the neural network will look like this:
Given a doc vector, the neural network will give a prediction between 0 and 1 for each topic. After the training phase, both the doc2vec model and the neural network model will be stored so that they can later be used for the prediction phase.
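To show why this setup allows several topics at once, here is a small plain-Python sketch of the output side of such a network. It makes two assumptions that are not spelled out above: the output neurons use a sigmoid activation, and a topic counts as predicted when its value is at least 0.5. The topics and raw output values are invented for the example.

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_crossentropy(targets, probabilities):
    """Mean of the per-topic log losses; each topic is its own yes/no question."""
    eps = 1e-7  # keeps log() away from zero
    losses = []
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


def predicted_topics(probabilities, topics, threshold=0.5):
    """Every topic whose output reaches the (assumed) threshold."""
    return [topic for topic, p in zip(topics, probabilities) if p >= threshold]


topics = ["grain", "wheat", "ship"]   # hypothetical subset of the topics
raw_outputs = [2.0, 1.0, -3.0]        # hypothetical values before the sigmoid
probabilities = [sigmoid(x) for x in raw_outputs]
actual = [1, 1, 0]                    # this article is about grain and wheat

print(predicted_topics(probabilities, topics))
print(binary_crossentropy(actual, probabilities))
```

Unlike a softmax output, these values do not have to add up to 1, so two topics can both be rated as likely for the same article.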
When using a neural network, it’s important to have neither too few nor too many dimensions. If the number of dimensions is too low, the coordinates in the feature space will end up too close to each other, which makes it hard to distinguish them. With too many dimensions the feature space becomes too large, and the neural network will have trouble relating data points of the same class. Doc2vec is a great tool that will create vectors that are not that large, the default being 300 dimensions. In the Doc2vec paper, it’s mentioned that this is the main advantage over techniques that create a dimension for every unique word in the text. Those techniques produce vectors with tens or hundreds of thousands of dimensions.
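To see where all those dimensions come from, here is a tiny sketch of the "one dimension per unique word" approach. The three headlines and the helper names are made up for this illustration.

```python
def build_vocabulary(documents):
    """Sorted list of every unique lower-cased word in the corpus."""
    return sorted({word for text in documents for word in text.lower().split()})


def bag_of_words(text, vocabulary):
    """One count per vocabulary word, so the vector is as long as the vocabulary."""
    words = text.lower().split()
    return [words.count(word) for word in vocabulary]


# Hypothetical mini corpus.
documents = ["wheat prices rise", "gold mine opens", "wheat exports fall"]

vocabulary = build_vocabulary(documents)
vector = bag_of_words(documents[0], vocabulary)
print(len(vocabulary), vector)

# Every document with new words makes all vectors longer.
bigger_vocabulary = build_vocabulary(documents + ["tin pact extended"])
print(len(bigger_vocabulary))
```

The vector length follows the vocabulary, so a real corpus quickly ends up with a huge and mostly empty vector per document. A doc2vec vector keeps the same fixed length no matter how many different words the corpus contains.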
For the prediction phase, we load the trained doc2vec model and the trained classifier model.
When we feed the algorithm the text of a news article the doc2vec algorithm will convert it to a doc vector and based on that the classifier will predict a topic. During the training phase, I withheld a small set of news articles from the training of the classifier. We can use that set to evaluate the accuracy of the predictions by comparing the predicted topic with the actual topic. Here are some predicted topics next to their actual topics:
title: AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS HIT
predicted: [‘ship’] – actual: [‘ship’]
title: INDONESIAN COMMODITY EXCHANGE MAY EXPAND
predicted: [‘coffee’] – actual: [‘coffee’, ‘lumber’, ‘palm-oil’, ‘rubber’, ‘veg-oil’]
title: SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE
predicted: [‘grain’, ‘wheat’] – actual: [‘grain’, ‘wheat’]
title: WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRALIA
predicted: [] – actual: [‘gold’]
title: SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERGER
predicted: [‘earn’] – actual: [‘acq’]
title: SUBROTO SAYS INDONESIA SUPPORTS TIN PACT EXTENSION
predicted: [‘acq’] – actual: [‘tin’]
title: BUNDESBANK ALLOCATES 6.1 BILLION MARKS IN TENDER
predicted: [‘interest’, ‘money-fx’] – actual: [‘interest’, ‘money-fx’]
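Comparing predicted and actual topics by eye only gets you so far. Below is a minimal sketch that scores the seven samples listed above with micro-averaged precision and recall. Note that it only covers these seven samples, so it says nothing about the accuracy on the full test set. The function name is my own.

```python
# (predicted topics, actual topics) for the seven samples listed above.
samples = [
    (["ship"], ["ship"]),
    (["coffee"], ["coffee", "lumber", "palm-oil", "rubber", "veg-oil"]),
    (["grain", "wheat"], ["grain", "wheat"]),
    ([], ["gold"]),
    (["earn"], ["acq"]),
    (["acq"], ["tin"]),
    (["interest", "money-fx"], ["interest", "money-fx"]),
]


def micro_scores(samples):
    """Precision, recall and exact-match rate, counted over all topic labels."""
    true_pos = false_pos = false_neg = exact = 0
    for predicted, actual in samples:
        p, a = set(predicted), set(actual)
        true_pos += len(p & a)    # predicted and correct
        false_pos += len(p - a)   # predicted but wrong
        false_neg += len(a - p)   # missed topics
        exact += p == a
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall, exact / len(samples)


precision, recall, exact_match = micro_scores(samples)
print(precision, recall, exact_match)
```

For these samples, 6 of the 8 predicted topics are correct (precision 0.75), 6 of the 13 actual topics are found, and 3 of the 7 articles get exactly the right set of topics.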
Hopefully, by now I’ve given some intuition on what machine learning is. First, your data needs to be converted to meaningful feature vectors with just the right number of dimensions. You can verify the contents of your feature vectors by:
- Reducing them to 2 dimensions and plotting them on a graph to see if similar things end up near each other
- Finding, for a given data point, the closest other data points and seeing if they are similar
You need to divide your dataset into a training set and a test set. Then you can train your classifier on the training set and verify it against the test set.
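For the Reuters corpus the split already exists. As a sketch, assuming the file ids follow the NLTK naming with a "training/" or "test/" prefix, you can divide them like this. The ids in the example are hypothetical and the function name is my own.

```python
def split_by_prefix(file_ids):
    """Divide corpus file ids into training and test ids based on their prefix."""
    training = [f for f in file_ids if f.startswith("training/")]
    test = [f for f in file_ids if f.startswith("test/")]
    return training, test


# Hypothetical ids in the style that reuters.fileids() returns.
file_ids = ["training/1", "test/14826", "training/10", "test/14828", "training/100"]

training_ids, test_ids = split_by_prefix(file_ids)
print(len(training_ids), len(test_ids))
```

The important part is that no article ends up in both sets. Otherwise the test score tells you how well the classifier memorized, not how well it generalizes.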
While this is a process that takes a while to understand and get used to, the very interesting thing is that this algorithm can be used for a lot of different use cases. This post describes classifying news articles. I’ve found that using exactly the same algorithm for other kinds of predictions, sentiment prediction for example, works in exactly the same way. It’s just a matter of swapping out the topics for positive or negative sentiment.
I’ve used the algorithm on other kinds of text documents: user reviews, product descriptions and medical data. The interesting thing is that the code changes required to apply these algorithms to other domains are minimal. You can use exactly the same algorithms, because the machine itself learns the business logic, so the code doesn’t need to change. Understanding the described algorithm is not just learning a way to predict the topic of a news article. Basically, it can predict anything based on text data as long as you have examples. For me as a software engineer, this is quite surprising. Usually, code is specific to an application and cannot be reused in another application. With machine learning, I can make software that can be applied in multiple very different domains. Very cool!
For more practical tips on machine learning see the paper “A Few Useful Things to Know about Machine Learning” at: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf