Elasticsearch 6 is coming
For some time now, the elasticsearch team has been releasing preview versions of the new major release, elasticsearch 6. At this moment the latest edition is already rc1, so it is time to start thinking about migrating to the latest and greatest. What backwards compatibility issues will you run into, and what new features can you start using? This blog post gives a summary of the items that are most important to me, based on the projects that I do. First we’ll have a look at the breaking changes, then we move on to new features and interesting upgrades.
Breaking changes
Most of the breaking changes below come from the elasticsearch documentation, which you can of course also read yourself.
Migrating indexes from previous versions
As with every major release, only indexes created in the previous major version can be migrated automatically. So if you have an index created in 2.x, migrated it to 5.x, and now want to start using 6.x, you have to use the reindex API to first reindex it into a new 5.x index before migrating.
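The reindex API makes this fairly painless. A minimal sketch of reindexing an old 2.x index into a fresh index before the upgrade could look like this (the index names are of course made up):

POST _reindex
{
  "source": {
    "index": "logs-from-2x"
  },
  "dest": {
    "index": "logs-reindexed"
  }
}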
Index types
In elasticsearch 6 the first step is taken towards indexes without types: a new index can only contain a single type, while indexes migrated from 5.x can keep using multiple types. Starting with elasticsearch 5.6 you can already prevent people from creating indexes with multiple types, which will make it easier to migrate to 6.x when it becomes available. By applying the following configuration option you can prevent people from creating multiple types in one index:
index.mapping.single_type: true
More reasoning about why the types need to be removed can be found in the elasticsearch documentation on the removal of types. Also, if you are into parent-child relationships in elasticsearch and are curious about the implications of no longer being able to use multiple types, check the documentation page on parent-join. Yes, we will get joins in elasticsearch :-), though with very limited use.
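To give you an idea, below is a minimal sketch of the new join field that replaces the old parent/child types; the index, field and relation names are just examples.

PUT my_join_index
{
  "mappings": {
    "doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "question": "answer"
          }
        }
      }
    }
  }
}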
Java High Level REST Client
This was already introduced in 5.6, but it is still good to know, as this client will be the replacement for the Transport Client. As you might know I am also creating some code to use in Java applications on top of the low-level REST client for Java, which is also used by this new client. More information about my work can be found here: part 1 and part 2.
Uniform response for create/update and delete requests
Before version 6, a create request returned a response field created with true/false, and a delete request returned found with true/false. If you parse the response and rely on these fields, you can no longer do so. Use the result field instead. It will have the value created or updated in case of the create request and deleted or not_found in case of the delete request.
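A trimmed sketch of what such a response looks like; the index, type and id are made up, and only the fields relevant here are shown.

DELETE my_index/my_type/1

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "result": "deleted"
}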
Translog change
The translog is used to keep documents that have not been flushed to disk yet by elasticsearch. In prior releases the translog files were removed once elasticsearch had performed a flush. However, due to optimisations made for recovery, keeping the translog around can speed up the recovery process. Therefore the translog is now kept, by default, for 12 hours or up to a maximum size of 512 MB.
More information about the translog can be found here: Translog.
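If you want to tune the retention per index, the defaults can be overridden with the translog retention settings. A small sketch, using the default values mentioned above:

PUT my_index/_settings
{
  "index.translog.retention.age": "12h",
  "index.translog.retention.size": "512mb"
}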
Java Client
In a lot of Java projects the Java client is used; I have used it as well for numerous projects. However, with the introduction of the High Level REST Client, Java projects should move away from the Transport Client. If you want or need to keep using it for now, be aware that some packages have changed and some methods have been removed. The change that affects me the most is the ordering of aggregations: Terms.Order and Histogram.Order have been replaced by BucketOrder.
Index mappings
There are two important changes that can affect your way of working with elastic. The first is the way booleans are handled. In indexes created in version 6, a boolean accepts only two values: true and false. All other values will result in an exception.
The second change is the _all field. In prior versions an _all field was created by default, into which the values of all fields were copied as strings and analysed with the standard analyser. This field was used by queries like the query_string query. There was however a performance penalty, as a potentially big field had to be analysed and indexed. Soon it became a best practice to disable the field. In elasticsearch 6 the field is disabled by default and it can no longer be configured for indices created with elasticsearch 6. If you still use the query_string query, it is now executed against each field. You should be very careful with the query_string query. It comes with a lot of power: users get a lot of options to create their own query. But with great power comes great responsibility. They can create very heavy queries, and they can create queries that break without a lot of feedback. More information can be found in the query_string documentation. If you still want to give your users more control, but the query_string query is one step too far, think about creating your own search DSL. Some ideas can be found in my previous blog posts: Creating a search DSL and Part 2 of creating a search DSL.
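One way to keep the query_string query somewhat under control is to limit it to a known set of fields instead of letting it hit every field. A sketch, with made-up index and field names:

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "(elasticsearch OR elastic) AND release",
      "fields": ["title", "content"]
    }
  }
}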
Booting elasticsearch
Some things have changed with the startup options. You can no longer configure the user elasticsearch runs as if you use the deb or rpm packages, and the location of the configuration files is now set differently. You now have to export the path where all configuration files (elasticsearch.yml, jvm.options and log4j2.properties) can be found, using the environment variable ES_PATH_CONF. I use this regularly on my local machine. As I have multiple projects running, often with different versions of elasticsearch, I have set up a structure where I keep my config files in folders separate from the elasticsearch distributable. Find the structure in the image below. In the beginning I just copy the config files to my project specific folder. When I start the project with the script startNode.sh, the following script is executed.
#!/bin/bash
# Run a project specific elasticsearch node: config, data, logs and backups
# live next to this script, only the distributable is shared between projects.
CURRENT_PROJECT=$(pwd)
export ES_PATH_CONF=$CURRENT_PROJECT/config
DATA=$CURRENT_PROJECT/data
LOGS=$CURRENT_PROJECT/logs
REPO=$CURRENT_PROJECT/backups
NODE_NAME=Node1
CLUSTER_NAME=playground
# Pass the project specific paths and names as -E settings
BASH_ES_OPTS="-Epath.data=$DATA -Epath.logs=$LOGS -Epath.repo=$REPO -Enode.name=$NODE_NAME -Ecluster.name=$CLUSTER_NAME"
ELASTICSEARCH=$HOME/Development/elastic/elasticsearch/elasticsearch-6.0.0-rc1
$ELASTICSEARCH/bin/elasticsearch $BASH_ES_OPTS
Now when you need additional configuration options, add them to the elasticsearch.yml. If you need more memory for the specific project, change the jvm.options file.
Plugins
When indexing pdf documents or word documents, a lot of you out there have been using the mapper-attachments plugin. It was already deprecated, and now it has been removed. You can switch to the ingest-attachment plugin. Never heard about ingest? Ingest can be used to pre-process documents before they are indexed by elasticsearch. It is a lightweight alternative to Logstash, running within elasticsearch. Be warned though that plugins like the attachment processor can be heavy on your cluster, so it is wise to have a separate node for ingest. Curious what you can do to ingest the contents of a pdf? The next few steps show you the commands to create the ingest pipeline, send a document to it, and retrieve it again or create a query.
First create the ingest pipeline:
PUT _ingest/pipeline/attachment
{
"description": "Extract attachment information",
"processors": [
{
"attachment": {
"field": "data"
}
}
]
}
Now when indexing a document that contains the attachment as a base64 encoded string in the field data, we need to tell elasticsearch to use the pipeline. Check the parameter in the URL: pipeline=attachment. This is the name we used when creating the pipeline.
PUT my_index/my_type/my_id?pipeline=attachment
{
"data": ""
}
We could stop here, but how do you get base64 encoded input from, for instance, a pdf? On Linux and the Mac you can use the base64 command for that. Below is a script that reads a specific pdf and creates a base64 encoded string out of it. This string is then pushed to elasticsearch.
#!/bin/bash
# Base64 encode the pdf and index it through the attachment pipeline
pdf_doc=$(base64 ~/Desktop/Programma.pdf)
curl -XPUT "http://localhost:9200/my_index/my_type/my_id?pipeline=attachment" -H 'Content-Type: application/json' -d '{"data" : "'"$pdf_doc"'"}'
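To verify that the extraction worked, you can retrieve the document again or query the extracted text. By default the attachment processor stores its output in the attachment.content field; the query below is just a sketch with an example search term.

GET my_index/_search
{
  "query": {
    "match": {
      "attachment.content": "agenda"
    }
  }
}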
Scripts
If you are heavy into scripting in elasticsearch you need to check a few things. Changes have been made to the use of the lang attribute when obtaining or updating stored scripts: you cannot provide it any more. Also, support for languages other than Painless has been removed.
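For stored scripts the language now goes into the request body instead of the URL. A minimal sketch (the script id and its contents are made up):

PUT _scripts/my_scoring_script
{
  "script": {
    "lang": "painless",
    "source": "doc['rating'].value * params.factor"
  }
}

GET _scripts/my_scoring_script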
Search and query DSL changes
Most of the changes in this area are very specific. I am not going to sum them all up; please check the original documentation. Some of them I do want to mention, as they are important to me.
- If you are constructing queries and can end up with an empty query, you can no longer provide an empty object { }. You will get an exception if you keep doing so.
- Bool queries had a disable_coord parameter; with it you could tell the scoring function not to penalise the score for missing search terms. This option has been removed.
- You could transform a match query into a match_phrase query by specifying a type. This is no longer possible; you should just create a phrase query if you need it (see the sketch below). Consequently, the slop parameter has also been removed from the match query.
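A sketch of such a phrase query, with a made-up index and field, now that the type and slop parameters are gone from the match query:

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick brown fox",
        "slop": 1
      }
    }
  }
}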
Calculating the score
In the beginning of elasticsearch, the score for a document based on an executed query was calculated using an adjusted formula for TF/IDF. It turned out that for fields containing smaller amounts of text TF/IDF was less than ideal, so the default scoring algorithm was replaced by BM25; moving away from TF/IDF to BM25 was one of the topics of version 5. Now with 6 two mechanisms have been removed from the scoring: Query Normalization and the Coordination Factor. Query Normalization was always hard to explain during trainings. It was an attempt to normalise the scores of queries, which should have made it possible to compare them. However, it did not work, and you still should not compare scores of different queries. The Coordination Factor was a penalty: when searching for multiple terms and not all of them were found, it lowered the score. You could easily see this when using the explain API.
That is it for the breaking changes. Again, there are more changes that you might want to investigate if you are really into all the elasticsearch details; have a look at the original documentation.
Next up, cool new features
Now let us zoom in on some of the newer features or interesting upgrades.
Sequence Numbers
Sequence Numbers are now assigned to all index, update and delete operations. Using this number, a shard that went offline for a moment can ask the primary shard for all operations after a certain sequence number. If the translog is still available (remember that we mentioned in the beginning that the translog is now kept around for 12 hours or 512 MB by default), the missing operations can be sent to the shard, preventing a complete resync of the shard's contents.
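You can actually see these numbers: every index, update and delete response now reports the sequence number and primary term it was assigned. A trimmed example with made-up index, type and id:

PUT my_index/my_type/1
{
  "title": "Hello sequence numbers"
}

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "result": "created",
  "_seq_no": 0,
  "_primary_term": 1
}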
Test normalizers using the analyze endpoint
One of the most important parts of elastic is configuring the mapping for your documents: how do you adjust the terms that you can search for based on the provided text? If you are not sure and you want to try out a specific combination of tokeniser and filters, you can use the analyze endpoint. Have a look at the following code sample and response, where we try out a whitespace tokeniser with a lowercase filter.
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["lowercase"],
"text": ["Jettro Coenradie"]
}
{
"tokens": [
{
"token": "jettro",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "coenradie",
"start_offset": 7,
"end_offset": 16,
"type": "word",
"position": 1
}
]
}
As you can see we now get two tokens, and the uppercase characters are replaced by their lowercase counterparts. What if we do not want the text to become two terms, but want it to stay as one term, while still replacing the uppercase characters with their lowercase counterparts? This was not possible in the beginning. However, with the introduction of the normalizer, a special analyser for fields of type keyword, it became possible. In elasticsearch 6 we can now use the analyze endpoint for normalizers as well. Check the following code block for an example.
PUT people
{
"settings": {
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis": {
"normalizer": {
"name_normalizer": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
}
}
GET people/_analyze
{
"normalizer": "name_normalizer",
"text": ["Jettro Coenradie"]
}
{
"tokens": [
{
"token": "jettro coenradie",
"start_offset": 0,
"end_offset": 16,
"type": "word",
"position": 0
}
]
}
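Once you are happy with the normalizer, you can apply it to a keyword field in the mapping. A sketch, assuming a type person with a name field:

PUT people/_mapping/person
{
  "properties": {
    "name": {
      "type": "keyword",
      "normalizer": "name_normalizer"
    }
  }
}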
LogLog-Beta
Ever heard about HyperLogLog or even HyperLogLog++? Well then you will be happy with LogLog-Beta. Some background: elasticsearch comes with a Cardinality Aggregation, which can be used to calculate, or rather estimate, the number of distinct values. If we wanted an exact value, we would have to keep a map with all unique values in it, which would require an extensive amount of memory. You can specify a threshold below which the number of unique values is close to exact; however, the maximum value for this threshold is 40000. Before, elasticsearch used the HyperLogLog++ algorithm to estimate the unique values. With the new algorithm, called LogLog-Beta, there are better results with lower error margins and still the same performance.
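You do not have to do anything to profit from the new algorithm; the Cardinality Aggregation simply becomes more accurate. A sketch of such an aggregation on the dataset used below, assuming it has a source.keyword field:

GET signalmedia/_search
{
  "size": 0,
  "aggs": {
    "unique_sources": {
      "cardinality": {
        "field": "source.keyword",
        "precision_threshold": 40000
      }
    }
  }
}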
Significant Text Aggregation
For some time the Significant Terms Aggregation has been available. The idea behind this aggregation is to find terms that are common to a specific scope and less common to a more general scope. So imagine we want to find, from logs with page visits, the users of our website that place more orders in relation to the pages they visit. You cannot find them by just counting the number of orders; you need to find those users that are more common in the set of orders than in the set of page visits. In prior versions this was already possible with terms, so with not-analysed fields. By enabling fielddata or doc_values you could use small analysed fields, but for larger text fields this was a performance problem. The new Significant Text Aggregation overcomes this problem. It also comes with interesting functionality to deduplicate text (think about emails with the original text in a reply, or retweets).
Sounds a bit too vague? OK, let's have a look at an example. The elasticsearch documentation uses a dataset from Signal Media. As it is an interesting dataset to work with, I will use it as well; you can try it out yourself too. I downloaded the file and imported it into elasticsearch using Logstash. This gist should help you. Now on to the query and the response.
GET signalmedia/_search
{
"query": {
"match": {
"content": "rain"
}
},
"aggs": {
"my_sampler": {
"sampler": {
"shard_size": 200
},
"aggs": {
"keywords": {
"significant_text": {
"field": "content",
"filter_duplicate_text": true
}
}
}
}
},
"size": 0
}
So we are looking for documents with the word rain. In these documents we are going to look up terms that occur more often than they do in the global context.
{
"took": 248,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 11722,
"max_score": 0,
"hits": []
},
"aggregations": {
"my_sampler": {
"doc_count": 600,
"keywords": {
"doc_count": 600,
"bg_count": 1000000,
"buckets": [
{
"key": "rain",
"doc_count": 544,
"score": 69.22167699861609,
"bg_count": 11722
},
{
"key": "showers",
"doc_count": 164,
"score": 32.66807368214775,
"bg_count": 2268
},
{
"key": "rainfall",
"doc_count": 129,
"score": 24.82562838569881,
"bg_count": 1846
},
{
"key": "thundery",
"doc_count": 28,
"score": 20.306396677050884,
"bg_count": 107
},
{
"key": "flooding",
"doc_count": 153,
"score": 17.767450110864743,
"bg_count": 3608
},
{
"key": "meteorologist",
"doc_count": 63,
"score": 16.498915662650603,
"bg_count": 664
},
{
"key": "downpours",
"doc_count": 40,
"score": 13.608547008547008,
"bg_count": 325
},
{
"key": "thunderstorms",
"doc_count": 48,
"score": 11.771851851851853,
"bg_count": 540
},
{
"key": "heatindex",
"doc_count": 5,
"score": 11.56574074074074,
"bg_count": 6
},
{
"key": "southeasterlies",
"doc_count": 4,
"score": 11.104444444444447,
"bg_count": 4
}
]
}
}
}
}
Interesting terms when looking for rain: showers, rainfall, thundery, flooding, etc. These terms could now be returned to the user as possible candidates for improving their search results.
Concluding
That is it for now. I haven’t even scratched the surface of all the new cool stuff in the other components like X-Pack, Logstash and Kibana. More to come.