Stack Overflow, Logstash, Elastic Graph

Lately, everyone interested in what the Elastic guys are doing has heard about their new product called Graph. Graph is going to be part of their commercial solution called x-pack. Time to play around with it.

In this blog post I am going to use the latest released versions of logstash (2.3), elasticsearch (2.3) and the brand new Graph plugin for Kibana (4.5). I am going to try it all out with a dataset exported from Stack Overflow. They provide an interesting database query interface, which we use to export data about answers given to elasticsearch questions. With that data we can try to find the users that give answers, see if there is a correlation between the people asking questions and the people answering them, and find your biggest competitors if you are into giving answers and scoring points.

Exporting data from Stack Overflow

Stack Overflow comes with a web interface to create sql queries against all their data. It is not too hard to find what you need once you get a bit of an understanding of the model. You can find the interface here: data.stackexchange.com. Stack Exchange hosts a lot of sites; Stack Overflow is the one I use most.

First I needed to find the id of the tag elasticsearch. I am not going to explain the complete data model, but the names of tags are in the table Tags. Since each post (answers and questions are in the same Posts table) can have multiple tags, there is a table connecting Posts to Tags called PostTags. The following query shows how to obtain the key for the tag elasticsearch.

SELECT
    *
FROM Tags
WHERE
    TagName LIKE 'elasticsearch'

The response to this query is one row with the number 64388. We are going to reuse this number in the other queries to find only answers to elasticsearch questions. Let's move on to the query that fetches the data we need. If you want to run the query yourself, you can find it here:

Stack overflow query to find all elasticsearch answers

Below is the query:

SELECT
    p.Id,
    p.ParentId,
    p.CreationDate,
    p.Score,
    p.CommentCount,
    u.DisplayName AS Commentor,
    up.DisplayName As Questor,
    pp.Tags
FROM Posts AS p
    INNER JOIN Posts AS pp ON p.ParentId = pp.Id
    INNER JOIN PostTags AS pt ON pt.PostId = p.ParentId
    INNER JOIN Users AS u ON p.OwnerUserId = u.Id
    INNER JOIN Users AS up ON pp.OwnerUserId = up.id
WHERE
    pt.TagId = 64388
    AND p.PostTypeId = 2

A new table called Users is introduced; as you can see, we join it twice. Once to find the name of the person who asked the question (Questor) and once for the person who wrote the answer (Commentor). Another thing that is handy to understand is the condition PostTypeId = 2 on Posts: these are the answers. The question an answer was given to can be found through its ParentId, which is why we join the Posts table with itself. The result is in the following format.

"23039905","23012906","2014-04-13 06:28:52","0","2","Jettro Coenradie","Keyur Patel",""

The next section gives more details on how to use logstash to import this file into elasticsearch.

Importing CSV data using logstash

Logstash is still a very easy way to import csv files into elasticsearch: the file input, some filters like csv and grok, and of course the very easy to use elasticsearch output. For those unfamiliar with logstash I'll demonstrate the configuration. I am not going into installation details, since the logstash website has that covered well enough.

The configuration consists of three elements: input, filter and output. The input is straightforward: the path to the file, plus telling logstash to start loading from the beginning of the file instead of waiting for new lines, which is the default behaviour.

input {
  file {
    path => "/absolute/path/to/QueryResults.csv"
    start_position => beginning
  }
}
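
One thing to be aware of: the file input keeps track of how far it got in a so called sincedb file, so a second run of the same import may silently skip the file. For repeated test runs you can point the sincedb at /dev/null; this is an optional tweak, not something you need for a first import.

input {
  file {
    path => "/absolute/path/to/QueryResults.csv"
    start_position => beginning
    sincedb_path => "/dev/null"    # forget the read position between runs
  }
}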

The next part, the filter part, is more complicated. I currently use 5 filters. The csv filter configures the names of the columns and converts the CreationDate into a date field. The next filters handle the Tags field, which has a format like <tag1><tag2><tag3>. The grok filter strips the first < and the last >, the first mutate filter splits the remainder into multiple tags using >< as the separator, the second mutate filter renames the field back to Tags, and the final mutate filter removes a few fields we do not need.

filter {
    csv {
        columns => ["Id","ParentId","CreationDate","Score","CommentCount","Commentor","Questor","Tags"]
        convert => { "CreationDate" => "date" }
    }
    grok {
        match => { "Tags" => "<%{GREEDYDATA:tags_str}>"}
        remove_field => ["Tags"]
    }
    mutate {
        split => { "tags_str" => "><"}
    }
    mutate {
        rename => { "tags_str" => "Tags" }
    }
    mutate {
        remove_field => ["@timestamp", "@version", "host", "message", "path"]
    }
}
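
To give an idea of what the filters produce: for an input line that does carry tags, the event that reaches the output looks roughly like this (the sample row shown earlier has an empty Tags column, so the tag values below are made up for illustration).

{
    "Id" => "23039905",
    "ParentId" => "23012906",
    "CreationDate" => "2014-04-13 06:28:52",
    "Score" => "0",
    "CommentCount" => "2",
    "Commentor" => "Jettro Coenradie",
    "Questor" => "Keyur Patel",
    "Tags" => ["elasticsearch", "logstash"]
}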

The final bit is sending the data to elasticsearch using the output element. In this case I do not want logstash to manage the template; I insert the mapping myself. What is interesting is the way the Id coming from the database is reused as the document id in elasticsearch. Logstash uses the http connector by default.

output {
    elasticsearch {
        index => "stackoverflow-answers"
        document_id => "%{Id}"
        document_type => "answer"
        manage_template => "false"
    }
    stdout {
        codec => rubydebug
    }
}
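
The elasticsearch output talks to localhost:9200 over http by default. If your cluster runs somewhere else, the hosts option is the place to change that; the address below is just an example.

output {
    elasticsearch {
        hosts => ["myserver:9200"]    # example address, defaults to localhost:9200
        index => "stackoverflow-answers"
        document_id => "%{Id}"
        document_type => "answer"
        manage_template => "false"
    }
}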

Before inserting the data we need to get the mapping right. Check the next section for the mapping.

You need a bit of a mapping

I use sense, the Kibana plugin, to insert the mapping into elasticsearch. It is a convenient way to make changes and try again. Below is the PUT request to send the mapping. Take note of the date format on the CreationDate field.

PUT stackoverflow-answers
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1
  },
  "mappings": {
    "question": {
      "properties": {
        "CreationDate": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        },
        "CommentCount": {
          "type": "integer"
        },
        "Questor": {
          "type": "string",
          "index": "not_analyzed"
        },
        "Commentor": {
          "type": "string",
          "index": "not_analyzed"
        },
        "Tags": {
          "type": "string",
          "index": "not_analyzed"
        },
        "Id": {
          "type": "integer"
        },
        "ParentId": {
          "type": "integer"
        },
        "Score": {
          "type": "integer"
        }
      }
    }
  }
}
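
To double check that elasticsearch stored the mapping the way you sent it, a quick request in sense does the trick:

GET stackoverflow-answers/_mapping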

Time to run logstash and insert all the nice records. At the time of writing there were more than 16k answers to questions tagged with elasticsearch.
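
Assuming the configuration above is saved in a file called stackoverflow.conf (pick any name you like), the import is started from the logstash directory with:

bin/logstash -f stackoverflow.conf

Once logstash is done, a quick count in sense tells you whether all answers made it in:

GET stackoverflow-answers/_count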

The graph plugin in the latest Kibana

Finally, we now have some data to display in a graph. Again, I am not going into the details of installing the plugin; just check the website and follow the few basic steps.

In the default mode you would need a multi-value field to find interesting graphs, but that is not what I want. I want to compare some fields. In this example I am going to look for my competitors: people who answer questions with the same tags I answer questions for, all of them at least elasticsearch related.

In the first screen we select the advanced mode. That way we can select multiple fields; I select Tags and Commentor, and I search for jettro. In the result we see a number of tags I am involved with, as well as a number of people who answer questions with these tags. I do miss a number of users though. One of them is javanna, one of the most active people on the elasticsearch list, so he must be a big competitor. Why isn't he there? That is because the graph tool by default only returns significant terms, the uncommonly common ones, so the most obvious matches are not shown. You can switch off this behaviour, as shown in the images below.

Show the button to call the advanced mode

Choose the two fields using the fields dropdown button

Select the settings wheel and configure the significant terms

Show the graph, now without the significant terms feature
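
If you want a feel for what the significant terms feature does outside of the graph UI, a plain aggregation on the same index comes close. The query below is a rough approximation, not literally what the plugin executes: given the answers I wrote, it asks which tags show up more often than you would expect based on the index as a whole.

POST stackoverflow-answers/_search
{
  "size": 0,
  "query": {
    "term": { "Commentor": "Jettro Coenradie" }
  },
  "aggs": {
    "tags_i_answer": {
      "significant_terms": { "field": "Tags" }
    }
  }
}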

Concluding

The Graph plugin looks really cool; the hard part is finding good data to make use of it. If you have the data, it is not that hard to at least start playing with it. Below is a short video I made demonstrating some of the things you can do. More to come.