Looking ahead: new field collapsing feature in Elasticsearch


At Luminis Amsterdam, Search is one of our main focus points. Because of that, we keep a close eye on upcoming features.

Only a few weeks ago, I noticed that the pull request “Add field collapsing for search request” was merged into the Elasticsearch code base, tagged for the 5.3/6.x release. This feature allows you to group your search results based on a specific key. In the past, this was only possible by combining a ‘terms’ aggregation with a ‘top hits’ aggregation.

Now a good question would be: ‘why would I want this?’ or ‘what is this grouping you are talking about?’. Imagine having a website where you sell Apple products: MacBooks, iPhones, iPads, etc. Let’s say that, because of functional requirements, we have to create a separate document for each variant of each device (e.g. separate documents for iPad Air 2 32GB Silver, iPad Air 2 32GB Gold, etc.). When a user searches for the word ‘iPad’ and there is no result grouping, they will see a search result for every single iPad variant you are selling. This could mean that your result list looks like the following:

  1. iPad Air 2 32GB Pink
  2. iPad Air 2 128GB Pink
  3. iPad Air 2 32GB Space Grey
  4. iPad Air 2 128GB Space Grey
  5. ..
  6. ..
  7. ..
  8. ..
  9. ..
  10. iPad Pro 12.9 Inch 32GB Space Grey
  11. iPad case with happy colourful pictures on it.

Now for the sake of this example, let’s say we only show 10 products per page. If our user was really looking for an iPad case, they wouldn’t see this product on the first page, but would instead be shown a long list of ‘the same’ iPad. This is not really user-friendly. A better approach would be to group all the iPad Air 2 products into one, so that they take up only 1 spot in the search results list. You would then have to think of a visual presentation that tells the user there are more variants of that same product.
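
To make the idea of grouping concrete, here is a toy sketch in Python that has nothing to do with Elasticsearch itself: it walks a ranked result list and keeps only the first (best ranked) item of each family. The `family` key and the sample titles are made up for illustration.

```python
def collapse_ranked(results, key):
    """Keep only the first (best ranked) item for each distinct value of `key`."""
    seen = set()
    collapsed = []
    for item in results:
        group = item[key]
        if group not in seen:
            seen.add(group)
            collapsed.append(item)
    return collapsed


# Hypothetical ranked results, best match first.
ranked = [
    {"title": "iPad Air 2 32GB Pink", "family": "ipad-air-2"},
    {"title": "iPad Air 2 128GB Pink", "family": "ipad-air-2"},
    {"title": "iPad Pro 12.9 Inch 32GB Space Grey", "family": "ipad-pro"},
    {"title": "iPad case with happy colourful pictures on it", "family": "ipad-case"},
]

for item in collapse_ranked(ranked, "family"):
    print(item["title"])
```

With one entry per family, the iPad case moves up far enough to appear on the first page.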

As mentioned before, grouping of results was already possible in older versions of Elasticsearch, but the downside of the old approach was that it uses a lot of memory on big data sets, and that paginating the results was not (really) possible. An example:

GET shop/_search
{
  "size": 0,
  "query": {
    "match": {
      "title": "iPad"
    }
  },
  "aggs": {
    "collapse_by_id": {
      "terms": {
        "field": "family_id",
        "size": 10,
        "order": {
          "max_score": "desc"
        }
      },
      "aggs": {
        "max_score": {
          "max": {
            "script": "_score"
          }
        },
        "top_hits_for_family": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}
  • We perform a terms aggregation on family_id, which results in the grouping we want, and order the buckets by the best score within each family (the max_score sub-aggregation). Next, we use top_hits to get the best matching documents belonging to each family.

All seems well. Now let’s say we have a website where users are viewing 10 products per page. For users to go to the next page, we would have to execute the same query, increase the size of the terms aggregation to 20 buckets and discard the first 10 results. Aggregations use quite some processing power, so constantly aggregating over the complete result set will not perform well on a big data set. Another way would be to execute the query for page 2 with a filter that excludes the families already shown on page 1. All in all, this is a lot of extra work to achieve field collapsing.
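
The first paging workaround described above could look roughly like the following Python sketch. It only builds the request body and slices the returned buckets; the function names and the 1-based `page` parameter are my own assumptions, and sending the request is left out.

```python
def build_grouped_page_request(query, group_field, page, page_size=10, hits_per_group=3):
    """Build the terms + top_hits request body needed to show a 1-based page."""
    return {
        "size": 0,
        "query": query,
        "aggs": {
            "collapse_by_id": {
                "terms": {
                    "field": group_field,
                    # Every bucket up to and including the requested page
                    # has to be computed again on each request.
                    "size": page * page_size,
                    "order": {"max_score": "desc"},
                },
                "aggs": {
                    "max_score": {"max": {"script": "_score"}},
                    "top_hits_for_family": {"top_hits": {"size": hits_per_group}},
                },
            }
        },
    }


def slice_page(buckets, page, page_size=10):
    """Throw away the buckets of earlier pages and keep those of this page."""
    start = (page - 1) * page_size
    return buckets[start:start + page_size]


body = build_grouped_page_request({"match": {"title": "iPad"}}, "family_id", page=2)
print(body["aggs"]["collapse_by_id"]["terms"]["size"])
```

The deeper a user pages, the more buckets have to be computed and then thrown away, which is exactly the cost described above.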

Now that Elasticsearch has added the field collapsing feature, this becomes a lot easier. You can download my gist with some setup if you want to play along with the example. The gist contains some settings/mappings, test data and the queries which I will be showing you in a minute.

Alongside the existing query, aggregations, suggestions and sorting/pagination options, the search request now accepts a new ‘collapse’ element:

GET shop/_search
{
  "query": {
    "match": {
      "title": "iPad"
    }
  },
  "collapse": {
    "field": "family_id"
  }
}

The simplest version of collapse only takes a field name on which to form the grouping. If we execute this query, it will generate the following result:

"hits": {
    "total": 6,
    "max_score": null,
    "hits": [
      {
        "_index": "shop",
        "_type": "product",
        "_id": "5",
        "_score": 0.078307986,
        "_source": {
          "title": "iPad Pro ipad",
          "colour": "Space Grey",
          "brand": "Apple",
          "size": "128gb",
          "price": 899,
          "family_id": "apple-5678"
        },
        "fields": {
          "family_id": [
            "apple-5678"
          ]
        }
      },
      {
        "_index": "shop",
        "_type": "product",
        "_id": "1",
        "_score": 0.05406233,
        "_source": {
          "title": "iPad Air 2",
          "colour": "Silver",
          "brand": "Apple",
          "size": "32gb",
          "price": 399,
          "family_id": "apple-1234"
        },
        "fields": {
          "family_id": [
            "apple-1234"
          ]
        }
      }
    ]
  }

Notice the total in the query response: it shows the number of documents that matched the query, not the number of groups. Our response only contains 2 hits, and if we look at the ‘fields’ section of each hit, we can see our two unique family_id values. For each family_id, only the best matching document is returned in the search results.
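
As a small illustration of reading such a response on the client side, the sketch below pulls the collapse key and the representative document out of each hit. It assumes the response has already been parsed into a Python dict with the shape shown above (including an integer `total`); the function name is hypothetical.

```python
def collapsed_summary(response, collapse_field):
    """Return (total matching documents, [(collapse key, _id, title), ...])."""
    hits = response["hits"]
    groups = []
    for hit in hits["hits"]:
        # The value the hit was collapsed on is reported under "fields".
        key = hit["fields"][collapse_field][0]
        groups.append((key, hit["_id"], hit["_source"]["title"]))
    return hits["total"], groups


# Trimmed-down version of the response shown above.
response = {
    "hits": {
        "total": 6,
        "hits": [
            {"_id": "5", "_source": {"title": "iPad Pro ipad"},
             "fields": {"family_id": ["apple-5678"]}},
            {"_id": "1", "_source": {"title": "iPad Air 2"},
             "fields": {"family_id": ["apple-1234"]}},
        ],
    }
}

total, groups = collapsed_summary(response, "family_id")
print(total, groups)
```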

It is also possible to retrieve the documents directly for each family_id by adding an inner_hits block inside collapse:

GET shop/_search
{
  "query": {
    "match": {
      "title": "iPad"
    }
  },
  "collapse": {
    "field": "family_id",
    "inner_hits": {
      "name": "collapsed_family_id",
      "from": 1,
      "size": 2
    }
  }
}

You can use ‘from: 1’ to exclude the first hit in each family, since it is already returned as the parent of that family. This results in:

"hits": {
    "total": 6,
    "max_score": null,
    "hits": [
      {
        "_index": "shop",
        "_type": "product",
        "_id": "5",
        "_score": 0.078307986,
        "_source": {
          "title": "iPad Pro ipad",
          "colour": "Space Grey",
          "brand": "Apple",
          "size": "128gb",
          "price": 899,
          "family_id": "apple-5678"
        },
        "fields": {
          "family_id": [
            "apple-5678"
          ]
        },
        "inner_hits": {
          "collapsed_family_id": {
            "hits": {
              "total": 2,
              "max_score": 0.078307986,
              "hits": [
                {
                  "_index": "shop",
                  "_type": "product",
                  "_id": "6",
                  "_score": 0.066075005,
                  "_source": {
                    "title": "iPad Pro",
                    "colour": "Space Grey",
                    "brand": "Apple",
                    "size": "256gb",
                    "price": 999,
                    "family_id": "apple-5678"
                  }
                }
              ]
            }
          }
        }
      },
      {
        "_index": "shop",
        "_type": "product",
        "_id": "1",
        "_score": 0.05406233,
        "_source": {
          "title": "iPad Air 2",
          "colour": "Silver",
          "brand": "Apple",
          "size": "32gb",
          "price": 399,
          "family_id": "apple-1234"
        },
        "fields": {
          "family_id": [
            "apple-1234"
          ]
        },
        "inner_hits": {
          "collapsed_family_id": {
            "hits": {
              "total": 4,
              "max_score": 0.05406233,
              "hits": [
                {
                  "_index": "shop",
                  "_type": "product",
                  "_id": "2",
                  "_score": 0.05406233,
                  "_source": {
                    "title": "iPad Air 2",
                    "colour": "Gold",
                    "brand": "Apple",
                    "size": "32gb",
                    "price": 399,
                    "family_id": "apple-1234"
                  }
                },
                {
                  "_index": "shop",
                  "_type": "product",
                  "_id": "3",
                  "_score": 0.05406233,
                  "_source": {
                    "title": "iPad Air 2",
                    "colour": "Space Grey",
                    "brand": "Apple",
                    "size": "32gb",
                    "price": 399,
                    "family_id": "apple-1234"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
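
To turn a response like the one above into something a product list could render, one option is to map each parent hit to the ids of its other variants. This is only a sketch: it assumes the parsed response is a Python dict, and `inner_hits_name` must be the name under which the inner hits appear in the response.

```python
def family_variants(response, inner_hits_name):
    """Map the _id of each collapsed parent hit to the _ids of its inner hits."""
    variants_by_parent = {}
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
        inner_docs = inner.get("hits", {}).get("hits", [])
        variants_by_parent[hit["_id"]] = [doc["_id"] for doc in inner_docs]
    return variants_by_parent


# Trimmed-down version of the response shown above.
response = {
    "hits": {
        "total": 6,
        "hits": [
            {"_id": "5", "inner_hits": {"collapsed_family_id": {
                "hits": {"total": 2, "hits": [{"_id": "6"}]}}}},
            {"_id": "1", "inner_hits": {"collapsed_family_id": {
                "hits": {"total": 4, "hits": [{"_id": "2"}, {"_id": "3"}]}}}},
        ],
    }
}

print(family_variants(response, "collapsed_family_id"))
```

The `total` inside each inner_hits block can additionally be used to show something like ‘4 variants available’ next to the parent product.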

Paging was an issue with the old approach, but since the documents are now grouped inside the search results themselves, paging works out of the box, in the same way as it does for normal queries and with the same limitations.
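
As a sketch of what that means in practice, the request body for any page of collapsed results only needs the usual top-level ‘from’ and ‘size’ next to the collapse element. The function name and the 1-based `page` parameter are assumptions of mine, and sending the request is again left out.

```python
def build_collapsed_page_request(query, collapse_field, page, page_size=10):
    """Build a collapsed search request body for a 1-based page number."""
    return {
        # Offset and page size apply to the collapsed (grouped) hits.
        "from": (page - 1) * page_size,
        "size": page_size,
        "query": query,
        "collapse": {"field": collapse_field},
    }


body = build_collapsed_page_request({"match": {"title": "iPad"}}, "family_id", page=2)
print(body)
```

Compared to the aggregation workaround, nothing from earlier pages has to be requested and discarded on the client.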

A lot of people in the community have been waiting for this feature and I’m excited that it has finally arrived. You can play around with the data set and try some more ‘collapsing’ (e.g. by colour, brand or size). I hope this gave you a small overview of what’s to come in the upcoming 5.3/6.x release.