Faceted search with Elasticsearch


In this blog, I will present two strategies for implementing faceted search with Elasticsearch. A few days back I had a discussion with my colleague Byron about implementing faceted search when Elasticsearch is used to serve the search results, and this blog is a culmination of that discussion. In Elasticsearch, the aggregation framework provides support for facets and for executing metrics and scripts over those facet results. Following is a simple example wherein each Elasticsearch document contains a field color and we execute a terms aggregation on the document set –

{
  "aggs" : {
    "colors" : {
      "terms" : { "field" : "color" }
    }
  }
}

And we get the following

"buckets": [
    {
        "key": "red",
        "doc_count": 2
     },
     {
        "key": "green",
        "doc_count": 2
     },
     {
        "key": "blue",
        "doc_count": 1
     },
     {
        "key": "yellow",
        "doc_count": 1
     }

So we get each unique color as a bucket key, and doc_count is the number of documents that have the corresponding color field value.

One of the options is to keep going with sub-aggregations, one per hierarchy level (also called a path hierarchy of sub-aggregations), but that is very expensive and not feasible beyond a certain hierarchy depth, as discussed here.
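For context (a sketch only, using the categoryOneLevel/categoryTwoLevel/categoryThreeLevel fields introduced in the first approach below), three levels of such nesting would look roughly like this; every additional level multiplies the number of buckets Elasticsearch has to build –

GET document/_search?size=0
{
  "aggs": {
    "level_one": {
      "terms": { "field": "categoryOneLevel" },
      "aggs": {
        "level_two": {
          "terms": { "field": "categoryTwoLevel" },
          "aggs": {
            "level_three": {
              "terms": { "field": "categoryThreeLevel" }
            }
          }
        }
      }
    }
  }
}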

First Approach

The first approach for faceted search applies when we have a unique field corresponding to each hierarchy level present in every document. For example, if we have documents pertaining to products of an online webshop and 3 levels of hierarchy, then a product document would look something like –

{
  ...
  "categoryOneLevel": [
    "8299"
  ],
  "categoryTwoLevel": [
    "8299-3131"
  ],
  "categoryThreeLevel": [
    "8299-3131-2703",
    "8299-3131-2900"
  ]
}

On the UI we can visualize this as something like – Computer (8299) (level 1) > Laptop (3131) (level 2) > Linux (2703) and Mac (2900) (level 3) > and so on. The reason for storing the parent levels inside each deeper category value is that a product can exist in multiple categories.
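For illustration (the IDs 7100, 4205 and 1180 below are made up), a product that sits in two different top-level categories would simply carry both paths in each level field –

{
  ...
  "categoryOneLevel": [
    "8299",
    "7100"
  ],
  "categoryTwoLevel": [
    "8299-3131",
    "7100-4205"
  ],
  "categoryThreeLevel": [
    "8299-3131-2703",
    "7100-4205-1180"
  ]
}

If we take this approach then sample queries would be something like –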

When the user is on the landing page (no faceting yet)

GET document/_search
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "categoryOneLevel"
      }
    }
  }
}

Now, when the user clicks on one of the level-one categories (here 8299), the next query fired would be

GET document/_search?size=0
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "categoryOneLevel": "8299"
        }
      }
    }
  },
  "aggs": {
    "categories": {
      "terms": {
        "field": "categoryTwoLevel",
        "include": "8299-.*"
      }
    }
  }
}

Here the filter query selects all the documents of categoryOne 8299, and the aggregation then runs over all categoryTwo values that are directly linked to that categoryOne because we set the include: "8299-.*" filter. You can extend this approach further: if the user clicks on a categoryTwo value, then values from categoryThree are returned. In this way we can have faceted search results from Elasticsearch.
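For example (a sketch that just follows the same pattern), if the user clicks on 8299-3131 then the follow-up query would filter on categoryTwoLevel and aggregate on categoryThreeLevel –

GET document/_search?size=0
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "categoryTwoLevel": "8299-3131"
        }
      }
    }
  },
  "aggs": {
    "categories": {
      "terms": {
        "field": "categoryThreeLevel",
        "include": "8299-3131-.*"
      }
    }
  }
}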

Second Approach

Now, let’s look at the second approach, which more or less comes out of the box, using the built-in path hierarchy tokenizer. Here a single field holds the hierarchy values, and while indexing we specify the “path tokenizer” in the mapping settings for that field. For example, the single field would hold a comma-separated value like –

Computer,Laptop,Mac

and the path tokenizer produces the tokens:

Computer
Computer,Laptop
Computer,Laptop,Mac

In this approach I am using the actual user-readable category values instead of IDs, but that is only for example’s sake; using IDs gives you the flexibility to update the category names (Computer, Laptop, etc.) associated with an ID without re-indexing the documents.
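If we went with IDs instead, the same kind of document might be indexed roughly like this (8299, 3131 and 2900 are the IDs from the first approach; 7310, 6420 and 5115 are made-up IDs for Home, Kitchen and Cookware), with the ID-to-name mapping resolved by the application when rendering the facets –

{
  "hierarchy_path": ["8299,3131,2900", "7310,6420,5115"]
}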

Let’s look at some elastic queries –

PUT blog_index/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path-tokenizer"
        }
      },
      "tokenizer": {
        "path-tokenizer": {
          "type": "path_hierarchy",
          "delimiter": ","
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "dynamic": "strict",
      "properties": {
        "hierarchy_path": {
          "type": "string",
          "analyzer": "path-analyzer",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}

Our index is ready with the field “hierarchy_path” having the path tokenizer set via its analyzer, so the terms of this field will now be tokenized by the path_hierarchy tokenizer.
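As a quick sanity check we can ask the index to analyze a sample value (the request below uses the older query-string form of the _analyze API; newer Elasticsearch versions expect a JSON body with "analyzer" and "text" fields) and confirm it returns the three path tokens shown earlier –

GET blog_index/_analyze?analyzer=path-analyzer&text=Computer,Laptop,Mac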

Now, let’s add a document to the index

POST blog_index/my_type/1
{
  "hierarchy_path": ["Computer,Laptop,Mac", "Home,Kitchen,Cookware"]
}

We have added a document whose hierarchy_path field holds two sets of categories, each set being a comma-separated value.

If we execute a terms aggregation on the field “hierarchy_path” –

GET blog_index/my_type/_search?search_type=count
{
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0
      }
    }
  }
}

We get the following buckets

"buckets": [
 {
 "key": "Computer",
 "doc_count": 1
 },
 {
 "key": "Computer,Laptop",
 "doc_count": 1
 },
 {
 "key": "Computer,Laptop,Mac",
 "doc_count": 1
 },
 {
 "key": "Home",
 "doc_count": 1
 },
 {
 "key": "Home,Kitchen",
 "doc_count": 1
 },
 {
 "key": "Home,Kitchen,Cookware",
 "doc_count": 1
 }
 ]

From the above results, we can see that the path tokenizer has split the comma-separated values of the field “hierarchy_path” into one token per hierarchy level.

So now, based on the user activity, we can fire queries to select the documents pertaining to the category the user is looking at. The query for selecting the top-level categories would be

GET blog_index/my_type/_search?search_type=count
{
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0,
        "exclude": ".*\\,.*"
      }
    }
  }
}

and we get

"buckets": [
  {
    "key": "Computer",
    "doc_count": 1
   },
 {
   "key": "Home",
   "doc_count": 1
  }
]

We have used the regular expression exclude ".*\\,.*", which excludes every value containing a comma, i.e. all the sub-levels, and thus we get only the top level of the hierarchy.

If the user wants only the second level, then the query fired would be

GET blog_index/my_type/_search?search_type=count 
{
  "query": {
    "bool": {
      "filter": {
        "prefix": { "hierarchy_path": "Computer" }
      }
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0,
        "include": "Computer\\,.*",
        "exclude": ".*\\,.*\\,.*"
      }
    }
  }
}

Here we specify the include regex to select all values that are part of the Computer hierarchy, and the exclude regex to drop the third level of the hierarchy, so the result only contains the second level.

"buckets": [
 {
 "key": "Computer,Laptop",
 "doc_count": 1
 }
 ]

When the user activity requires the third level of the hierarchy, the query fired would be

GET blog_index/my_type/_search?search_type=count
{
  "query": {
    "bool": {
      "filter": {
        "prefix": { "hierarchy_path": "Computer" }
      }
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0,
        "include": "Computer\\,.*\\,.*"
      }
    }
  }
}

Based on the include regex "Computer\\,.*\\,.*" we will get only the values that contain the third level of the hierarchy as well

"buckets": [
{
"key": "Computer,Laptop,Mac",
"doc_count": 1
}
]

In this way, based on the user activity in our application, we can fetch the corresponding results from Elasticsearch. While indexing our documents we need to make sure each product document has the relevant value in the “hierarchy_path” field, based on the hierarchy level(s) that the product is present in.
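Finally, once the user has drilled down to a specific category, fetching the actual products (and not just the facet counts) is a plain filtered query on the same field; a minimal sketch, assuming the hierarchy_path mapping from the second approach –

GET blog_index/my_type/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "hierarchy_path": "Computer,Laptop" }
      }
    }
  },
  "size": 10
}

The term query matches here because the path tokenizer indexed the full prefix Computer,Laptop as a single token (and a match query would behave the same way, since the search_analyzer is keyword).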