Using the new elasticsearch 5 percolator

In the upcoming version 5 of elasticsearch the implementation of the percolator has changed a lot. The percolator moved from being a separate endpoint and API to being part of the search API: in the new version you execute a percolate query. The big advantage is that you can now use everything in a percolator query that you could already use in all other queries. In this blog post I am going to show how to use the new percolator by building a very basic news notification service.

What is the percolator?

For those of you who have never heard about the percolator, let me give a short introduction. The percolator is a sort of inverse search. Usually you store documents and execute a query to find matching documents. With the percolator we store the queries and take a document as input. Based on that document we look for the stored queries that the document matches. The document can be an existing indexed document, or you can provide the document with the query.
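
To make the idea of an inverse search concrete, here is a toy sketch in plain Python. It does not use elasticsearch at all: the function names, the stored queries and the matching rule (every query term must occur in the title) are assumptions made up for this illustration.

```python
# Toy illustration of an inverse search: the queries are stored and a
# document is the input. The matching rule is a deliberately simple stand-in
# for what a real search engine does.

def matches(query_terms, document):
    """Return True if every term of the stored query occurs in the title."""
    title_words = set(document["title"].lower().replace(",", " ").split())
    return all(term.lower() in title_words for term in query_terms)


def percolate(stored_queries, document):
    """Return the ids of all stored queries that the document matches."""
    return [qid for qid, terms in stored_queries.items() if matches(terms, document)]


# Hypothetical stored queries: query id -> required terms.
stored_queries = {"1": ["snow"], "2": ["sun"]}

print(percolate(stored_queries, {"title": "Early snow this year"}))
print(percolate(stored_queries, {"title": "Snow on the ground, sun in the sky"}))
```

The first call returns only query 1, the second returns both queries, because that title contains both "snow" and "sun".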

What are we going to do?

In our case I want to create a notification service for interesting news items. So I give the users of my website the option to create a query based on a search term that is matched against the title field. I also give them the option to filter on specific categories only.
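
As a rough sketch of how such a user choice could be turned into a query body, the following Python helper builds a bool query with a match on title and an optional filter on category. The helper name, its arguments and the meta block are hypothetical. I use a terms filter here so that more than one category can be passed, which is an assumption on my side and not the only way to do it.

```python
import json


def build_notification(username, search_term, categories, create_date):
    """Build a stored-query document from a search term and categories.

    Hypothetical helper: the name and arguments are made up for this sketch.
    """
    query = {"bool": {"must": [{"match": {"title": search_term}}]}}
    if categories:
        # A terms filter accepts a list, so one or more categories can be given.
        query["bool"]["filter"] = [{"terms": {"category": list(categories)}}]
    return {
        "query": query,
        # Extra data about who asked for the notification and when.
        "meta": {"username": username, "create_date": create_date},
    }


doc = build_notification("sander", "snow", ["weather"], "2016-10-13T14:23:00")
print(json.dumps(doc, indent=2))
```

When no categories are selected the filter clause is left out, so the query only matches on the title.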

Setting it up

First we create an index that we use to store news items. And of course we also need two news items. The next code block shows the commands in curl format so you can try it out yourself as well. Personally I prefer the console in Kibana, with code completion and other interesting stuff, but curl will do fine for now.


curl -XPUT "http://localhost:9200/news" -d'
{
  "mappings": {
    "item": {
      "properties": {
        "title": {"type": "text"},
        "body": {"type": "text"},
        "category": {"type": "keyword"},
        "tags": {"type": "keyword"}
      }
    }
  }
}'

curl -XPUT "http://localhost:9200/news/item/1" -d'
{
  "title": "Early snow this year",
  "body": "After a year with hardly any snow, this is going to be a serious winter",
  "category": "weather"
}'

curl -XPUT "http://localhost:9200/news/item/2" -d'
{
  "title": "Snow on the ground, sun in the sky",
  "body": "I am waiting for the day where kids can skate on the water and the dog can play in the snow while we are sitting in the sun.",
  "category": "weather"
}'

Now we have an index with two documents. Notice that we used the new elasticsearch 5 field types: text and keyword. These have replaced the string type. Time to configure the percolator index. The mapping for this index contains two parts. The first part, in our case the type doctype, specifies the fields that we can use in the stored queries. In our case we provide access to the title and category fields. The second part configures the type and field that contain the stored query. In our case we call this type notification, and the field is called query and has the type percolator.

curl -XPUT "http://localhost:9200/news-notify" -d'
{
  "mappings": {
    "doctype": {
      "properties": {
        "title": {
          "type": "text"
        },
        "category": {
          "type": "keyword"
        }
      }
    },
    "notification": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}'

Now we can add the documents containing the queries to store. Notice that we also add metadata fields: the user that we should notify if the query matches a document, and the date when this percolator query was inserted.

curl -XPUT "http://localhost:9200/news-notify/notification/1" -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "snow"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "category": {
              "value": "weather"
            }
          }
        }
      ]
    }
  },
  "meta": {
    "username": "sander",
    "create_date": "2016-10-13T14:23:00"
  }
}'

curl -XPUT "http://localhost:9200/news-notify/notification/2" -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "sun"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "category": {
              "value": "weather"
            }
          }
        }
      ]
    }
  },
  "meta": {
    "username": "jettro",
    "create_date": "2016-10-13T14:21:45"
  }
}'

Execution time

Now imagine that we have just inserted a new document and we want to check if people are interested in it. The following percolate query does just that. The field parameter points to the field of type percolator. The document_type parameter points to the type containing the specification of the fields used in the stored queries. The index, type and id parameters point to the actual document under test: the document we match against the stored queries.

curl -XGET "http://localhost:9200/news-notify/_search" -d'
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "doctype",
      "index": "news",
      "type": "item",
      "id": 1
    }
  },
  "_source": {
    "includes": "meta.*"
  }
}'

That is it. How many queries would match the document with id:1? Only one: Sander would be interested. What about id:2? Both of us: Sander because he likes the snow, and Jettro because he likes the sun. The response for id:2 looks like this:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.25811607,
    "hits": [
      {
        "_index": "news-notify",
        "_type": "notification",
        "_id": "2",
        "_score": 0.25811607,
        "_source": {
          "meta": {
            "create_date": "2016-10-13T14:21:45",
            "username": "jettro"
          }
        }
      },
      {
        "_index": "news-notify",
        "_type": "notification",
        "_id": "1",
        "_score": 0.25811607,
        "_source": {
          "meta": {
            "create_date": "2016-10-13T14:23:00",
            "username": "sander"
          }
        }
      }
    ]
  }
}
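
As mentioned in the introduction, the document under test does not have to be indexed first; it can also travel with the request. The sketch below builds both variants of the search body in Python. The helper name is hypothetical, and the inline variant assumes that the percolate query accepts a document parameter next to field and document_type, which is how I understand the elasticsearch 5 API.

```python
import json


def percolate_body(document=None, index=None, doc_type=None, doc_id=None):
    """Build a percolate search body for an inline or an indexed document.

    Hypothetical helper; "query" and "doctype" are the field and type names
    from the news-notify mapping.
    """
    percolate = {"field": "query", "document_type": "doctype"}
    if document is not None:
        # Inline variant: the document under test is part of the request.
        percolate["document"] = document
    else:
        # Existing-document variant, as in the curl request above.
        percolate.update({"index": index, "type": doc_type, "id": doc_id})
    return {"query": {"percolate": percolate}, "_source": {"includes": "meta.*"}}


existing = percolate_body(index="news", doc_type="item", doc_id=2)
inline = percolate_body(document={"title": "Sun all week", "category": "weather"})
print(json.dumps(inline, indent=2))
```

Either body can be sent to the news-notify/_search endpoint in the same way as the curl request above.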

If you want to be more explicit about why the queries matched the document, you can use highlighting. The next block shows how to enable highlighting on the title field, followed by the results with highlighting enabled.

curl -XGET "http://localhost:9200/news-notify/_search" -d'
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "doctype",
      "index": "news",
      "type": "item",
      "id": 2
    }
  },
  "_source": {
    "includes": "meta.*"
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}'

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.25811607,
    "hits": [
      {
        "_index": "news-notify",
        "_type": "notification",
        "_id": "2",
        "_score": 0.25811607,
        "_source": {
          "meta": {
            "create_date": "2016-10-13T14:21:45",
            "username": "jettro"
          }
        },
        "highlight": {
          "title": [
            "Snow on the ground, <em>sun</em> in the sky"
          ]
        }
      },
      {
        "_index": "news-notify",
        "_type": "notification",
        "_id": "1",
        "_score": 0.25811607,
        "_source": {
          "meta": {
            "create_date": "2016-10-13T14:23:00",
            "username": "sander"
          }
        },
        "highlight": {
          "title": [
            "<em>Snow</em> on the ground, sun in the sky"
          ]
        }
      }
    ]
  }
}

Did you notice that highlighting works on the document you provide to match against, and not on the query that is stored? In a normal search that would be the other way around, but in this case that would be useless. I think the new percolator is a lot more flexible. I love the way you can now add metadata, or just plain data, to your percolator query object. Using this data you can create a notification, send it to the person with the matching username, and tell the users why we think they will find the article interesting.
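
To round this off, here is a minimal sketch of that last step in Python: turning the hits of a percolate response into one message per user. The function name and the message format are made up for the example, and the response is a trimmed-down version of the one shown above; actually sending the messages is left out.

```python
def notifications(response, article_title):
    """Turn percolator hits into one message per interested user."""
    messages = []
    for hit in response["hits"]["hits"]:
        username = hit["_source"]["meta"]["username"]
        # Highlight fragments are optional; fall back to the plain title.
        fragments = hit.get("highlight", {}).get("title", [article_title])
        messages.append(f"To {username}: new article '{fragments[0]}'")
    return messages


# Trimmed-down version of the percolate response for document 2.
response = {
    "hits": {
        "hits": [
            {"_id": "2", "_source": {"meta": {"username": "jettro"}}},
            {"_id": "1", "_source": {"meta": {"username": "sander"}}},
        ]
    }
}

for message in notifications(response, "Snow on the ground, sun in the sky"):
    print(message)
```

If highlighting is enabled, the first highlighted fragment is used instead of the plain title, which is a simple way to show users why the article matched their query.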