Elasticsearch 5 is coming: what is new and improved?
The guys at Elastic have been working on the 5.0 release of Elasticsearch and all the other products in their stack. I have been playing around with the new features since the first alpha release and wrote some blog posts about them. With release candidate 1 out, it is time to write a bit about the new features that I like and the (breaking) changes that I feel are important. Since it is a big release I need a big blog post, so don’t say I did not warn you.
Important additions
Let us first have a look at some of the new features.
Ingest
Ingest is a new type of Elasticsearch node. An ingest node is meant as a processor for incoming documents. It shares some functionality with Logstash, but it is lightweight, integrated with Elasticsearch and very easy to use. A few examples of what you can do with ingest are:
- Converting strings to other types: integer, float, boolean
- Date parsing
- Use a date field to store the document in the right time-based index. Think of an index per day or month.
- Use Grok to parse text-based messages like access logs and application logs.
Typical use cases are:
- Sending logs from your servers through Filebeat to Elasticsearch and parsing them using grok patterns.
- Sending events from your application straight to Elasticsearch and parsing them before they are stored.
- Parsing CSV files by sending them line by line.
The base for ingest is a pipeline. You can define multiple pipelines, and when sending a document to Elasticsearch you specify the pipeline to use. Each pipeline consists of a description and a number of processors that are executed in order.
You interact with pipelines using the pipeline API. A nice addition to this API is the simulate option, with which you can test a pipeline, even before you have stored it in Elasticsearch.
The processors can change messages, add fields, combine fields, remove fields, etc. They also have error handling covered: if one processor fails, another one can take over, or the chain can continue as if nothing happened.
There is a list of available processors, and next to that you can also create your own processors.
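As a minimal sketch of what this looks like (the pipeline name and the status and eventDate fields are made up for this example), the first request below stores a pipeline with a convert and a date processor, and the second one simulates it against a test document.
PUT _ingest/pipeline/parse-events
{
  "description": "convert the status field and parse the event date",
  "processors": [
    {
      "convert": {
        "field": "status",
        "type": "integer"
      }
    },
    {
      "date": {
        "field": "eventDate",
        "formats": ["ISO8601"]
      }
    }
  ]
}
POST _ingest/pipeline/parse-events/_simulate
{
  "docs": [
    {
      "_source": {
        "status": "200",
        "eventDate": "2016-10-18T14:38:46.366Z"
      }
    }
  ]
}
Once the pipeline is stored, you use it by adding the pipeline parameter when indexing a document, for instance PUT events/log/1?pipeline=parse-events.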
Matrix stats aggregation
If you are into mathematics and regularly need to do advanced calculations over fields, you should have a look at the matrix stats aggregation. With this aggregation you can determine things like mean, variance, skewness, kurtosis, covariance and correlation.
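A minimal sketch of such a request could look like the one below; the index and the two numeric fields are made up for the example.
GET metrics-*/_search
{
  "size": 0,
  "aggs": {
    "statistics": {
      "matrix_stats": {
        "fields": ["cpu_load", "response_time"]
      }
    }
  }
}
The response then contains the statistics per field, plus the covariance and correlation between the fields.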
Search after
Search after is new functionality to overcome a performance problem with from/size when doing deep pagination. Nowadays there is even a soft limit of 10,000 records by default for from/size; if you move beyond that amount you get an exception. With search_after this problem can be overcome. The idea is to use the results from the previous page to help retrieve the next page.
The query needs a sort section, with a unique field as the final sort key to guarantee a total order of documents. In the response, every document then has a sort part next to the _source. The contents of this sort part can be used in the next request. First have a look at the query and the response; notice the sort by createDate and _uid.
GET raw_events-*/_search
{
"query": {
"term": {
"name": {
"value": "search_click_job"
}
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
},
{
"_uid": {
"order":"desc"
}
}
]
}
The next code block shows one hit of the response.
{
"_index": "raw_events-2016.10.18",
"_type": "logs",
"_id": "AVfYOvw9lbxWedBmE49G",
"_score": null,
"_source": {
"jobId": "1114",
"@timestamp": "2016-10-18T14:38:46.366Z",
"name": "search_click_job",
"@version": "1",
"userAgent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
"locale": "nl_NL",
"createDate": "2016-10-18T14:38:46.366Z",
"visitorId": "87F95035-2674-EC2D-CAB5-B7CC4494DE50"
},
"sort": [
1476801526366,
"logs#AVfYOvw9lbxWedBmE49G"
]
}
Notice the values for sort in the response. We need to use these values in the next request where we add the search_after element.
GET raw_events-*/_search
{
"query": {"term": {"name": {"value": "search_click_job"}}},
"sort": [{"createDate": {"order": "desc"}},{"_uid": {"order":"desc"}}],
"search_after": [1476801526366 , "logs#AVfYOvw9lbxWedBmE49G"]
}
Now you have a more efficient mechanism to implement next-page functionality. Beware though, it does not help you with "give me page 20" functionality.
Task Manager
The task manager has become the central point to monitor all long-running tasks. Tasks like creating snapshots and re-indexing can now be monitored using the _tasks endpoint. You can also filter the tasks based on groups, and you can cancel tasks. This API is already available in 2.4; new in 5 is the integration with the cat API through _cat/tasks.
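A few example requests to give an idea; the task id in the cancel request is a placeholder for an id you get back from the task listing.
GET _tasks
GET _tasks?detailed=true&actions=*reindex
GET _cat/tasks?v
POST _tasks/node_id:task_id/_cancel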
Percolator
The percolator is sort of an inverse search: you store the query and, when searching, use a document as the input to find matching queries. This technology has changed a lot in 5.0. It has moved from being a separate API with its own endpoint to being part of the search API as a special percolate query.
I wrote a longer blogpost about this topic, so if you want more information, read it here.
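To give an impression of the new style, here is a minimal sketch (index, type and field names are made up): you store a query in a field of type percolator and find it back with a percolate query.
PUT alerts
{
  "mappings": {
    "queries": {
      "properties": {
        "query": { "type": "percolator" }
      }
    },
    "events": {
      "properties": {
        "message": { "type": "text" }
      }
    }
  }
}
PUT alerts/queries/1
{
  "query": { "match": { "message": "out of memory" } }
}
GET alerts/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document_type": "events",
      "document": { "message": "the server ran out of memory" }
    }
  }
}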
Painless
There is a new scripting language in town. I must (almost) admit I never really used scripting with Elasticsearch. There have been a lot of issues with the security of scripts. Therefore Elastic has made a lot of changes over time, going from supporting everything to supporting sandboxed languages only (Groovy). Now they have implemented a new scripting language that should overcome a lot of the problems the other scripting languages have.
The performance of the scripts should be better, the language should be easy to learn, and since the syntax is similar to small Groovy scripts, migration should not be too hard. If you want more Painless info, check the docs.
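As a small sketch, the request below uses a Painless script in a script field; the products index, the price field and the factor parameter are made up for the example.
GET products/_search
{
  "script_fields": {
    "price_incl_vat": {
      "script": {
        "lang": "painless",
        "inline": "doc['price'].value * params.factor",
        "params": { "factor": 1.21 }
      }
    }
  }
}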
Rollover endpoint
This is an interesting new feature with which you can ensure that the index you write to does not become too big or too old. The idea is that when the index becomes too big, a new index is created and the alias that first pointed to the old index now points to the newly created index. Documents are written using the alias, so they always end up in the correct index. In the example code I use a number in the index name, but you can also use dates. Check the docs for more info.
First create the initial index to store events and an alias write_events that points to that index. Then add a few documents.
PUT events-000001
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"aliases": {
"write_events": {}
}
}
PUT write_events/error/1
{"message": "First event"}
PUT write_events/error/2
{"message": "Second event"}
PUT write_events/error/3
{"message": "Third event"}
The next code block shows the indexes that are available:
GET _cat/indices/events-*
green open events-000001 hB9RfDUJQtaUjysDtQyisA 1 0 3 0 9.6kb 9.6kb
Doing the same call for aliases gives us:
GET _cat/aliases
write_events events-000001 - - -
Next we can ask for a rollover if the index contains more than 2 docs with the following request.
POST write_events/_rollover
{
"conditions": {
"max_docs": 2
}
}
Asking for the indexes again gives you the following output
green open events-000001 hB9RfDUJQtaUjysDtQyisA 1 0 3 0 9.6kb 9.6kb
yellow open events-000002 XsmOJN90RLmIyfcQF-fApg 5 1 0 0 650b 650b
Beware that you need to create an index template with the mapping and settings you need; without it, defaults are used. In my example, for instance, the second index now has 5 shards instead of the one shard I gave the first index.
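For this example, a template could look something like the sketch below, so that every events-* index gets the same settings (and, in a real setup, the mapping).
PUT _template/events
{
  "template": "events-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}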
Now asking for aliases gives us:
GET _cat/aliases
write_events events-000002 - - -
Notice the difference in the name of the index the alias points to. In the beginning I thought that Elasticsearch would check the rollover condition on each addition of a document, but now I realise how suboptimal that would be. So you have to embed the rollover call into your own process.
Important changes
Default memory usage
The default heap size is now 2 GB for both the minimum and the maximum (-Xms2g and -Xmx2g).
BM25
The default similarity algorithm has changed to BM25, so no longer the TF/IDF variant. This has a potential impact on the scoring of your documents, so beware. The old algorithm is still available as the classic similarity.
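If you want to keep the old behavior for a specific field, you can configure the similarity in the mapping; a small sketch with made-up index, type and field names:
PUT my-index
{
  "mappings": {
    "logs": {
      "properties": {
        "title": {
          "type": "text",
          "similarity": "classic"
        }
      }
    }
  }
}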
Completion suggester
The completion suggester has changed completely and should be re-evaluated if you depend upon it.
Norms
Norms deal with boosting and scoring of fields. If you use a field solely for filtering and aggregations, you should disable norms.
Refresh on GET
If a document has changed and this change has not been sent to Lucene yet (by a refresh), a refresh will be executed before executing the GET. You can disable this using the realtime=false parameter.
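For example (index, type and id are made up):
GET my-index/logs/1?realtime=false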
Index defaults
You can no longer set index level defaults in the elasticsearch.yml file or using the command line. You should use index templates instead. Examples of these kind of properties are number_of_shards and number_of_replicas.
Java 8
Elasticsearch 5 requires Java 8.
Refresh options
Refresh now has three different options: false (do not refresh, the default), true (refresh immediately) and wait_for (wait until the next refresh makes the change visible).
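For example, the following index request (made-up index and document) only returns once the change has become visible to search through a refresh:
PUT my-index/logs/1?refresh=wait_for
{
  "message": "visible to search when this call returns"
}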
Obtaining all values in terms aggregation
I saw customers use size: 0 in a terms aggregation to obtain all items. This is no longer possible; you have to provide an explicit size greater than 0.
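So instead of size: 0, give the terms aggregation an explicit upper bound that is large enough for your data; a sketch with a made-up index and field:
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "all_names": {
      "terms": {
        "field": "name",
        "size": 1000
      }
    }
  }
}
Note that the size: 0 at the top level (return no hits) is still allowed; it is the size inside the terms aggregation that can no longer be 0.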
The final part of the blog is about the most important breaking changes.
Breaking changes
When upgrading your cluster, this is most likely the most important part to read. There is extensive documentation on breaking changes that you should read. This part is just about the things that are important to me.
Indexes created in 1.x cannot be migrated to 5.x, so you have to explicitly re-index them. You have two options: migrate to 2.x first and use the re-index API to create a new index, or create a second cluster and re-index from the 1.x cluster into the 5.x cluster. If you want to migrate an index in 2.x, you can use the migration plugin that comes with a mechanism to re-index an index. Check the blog post:
Upgrade your elasticsearch 1.x cluster to 5.x.
Query features that have been removed
search_type
Search type count and scan have been removed. If you need the count (in case of pure aggregations for instance), use the size: 0 option. If you want to do a scan, use the scroll functionality with a sort based on _doc.
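A sketch of the scan replacement with a made-up index: start a scroll sorted on _doc and keep feeding the returned scroll id to the scroll endpoint.
GET my-index/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"]
}
POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}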
maximum number of shards to query
When executing a query over too many shards, an error is thrown. The default for this behavior is 1000 shards, but you can change this using the setting action.search.shard_count.limit. With the following command you can change it; after that, check the response when you query too many shards.
PUT _cluster/settings
{
"transient": {
"action.search.shard_count.limit":10
}
}
{
"type": "illegal_argument_exception",
"reason": "Trying to query 27 shards, which is over the limit of 10. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time."
}
removed exists api
The exists API is not something I used a lot myself. With this query you could check whether a document matching a query exists. Now you should use size: 0 combined with terminate_after: 1. With the terminate_after option, each shard returns as soon as the given number of documents has been found, in this case 1.
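A sketch of the replacement with a made-up index and field; if hits.total is greater than 0, a matching document exists.
GET my-index/_search
{
  "size": 0,
  "terminate_after": 1,
  "query": {
    "term": {
      "visitorId": "87F95035-2674-EC2D-CAB5-B7CC4494DE50"
    }
  }
}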
removed deprecated queries
A lot of 2.x deprecated queries are now removed and can no longer be used:
- filtered, and, or (use bool queries now; see the example after this list)
- limit (use the terminate_after parameter)
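As an example, a removed filtered query can be rewritten to a bool query like the sketch below (index and fields are made up):
GET my-index/_search
{
  "query": {
    "bool": {
      "must": { "match": { "message": "error" } },
      "filter": { "term": { "status": "published" } }
    }
  }
}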
top level filter parameter
The top-level filter parameter has been removed. Only post_filter can be used now.
inner hits
Beware that the format for source filtering of inner hits has changed; you now need to specify the full path.
Mapping changes
You should no longer use the type string; use the type text for analyzed strings and keyword for non-analyzed strings. When you provide a field containing text (a string) without a prior mapping, the field becomes a multi-field with text as the main type and a sub-field of type keyword. So ditch the default implementations using raw and use the keyword alternative, and provide a mapping if you do not need this functionality.
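A sketch of a mapping (made-up index, type and field) that makes this explicit, mirroring what the dynamic mapping would do:
PUT my-index
{
  "mappings": {
    "logs": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
You then query title for full-text search and title.keyword for sorting and aggregations.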
Numeric fields are now stored using a new mechanism called the BKD tree. One side effect of this is that the document frequency is no longer stored, and therefore the scoring of numbers does not use document frequency. If you do need this behavior, index the number as a keyword as well.
The index property should now have the value true/false instead of not_analyzed/no.
Beware that there is now a limit on the number of fields in an index (1000). Of course there is a setting to change this if you need to. The number of nested fields is limited as well, as is the depth of the nesting.
Settings changes
There are now only three node type settings available: node.master, node.data, node.ingest. So node.client has been removed.
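In elasticsearch.yml a dedicated data node with ingest enabled would, for example, look like:
node.master: false
node.data: true
node.ingest: true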
Elasticsearch can no longer be configured by setting system properties. Instead, use -Ename.of.setting=value.of.setting.
Scripts
Indexed scripts have been replaced by stored scripts, including all related settings.
Most script options now only allow true/false; the sandboxed option, for instance, has been removed.
Java Client
The Java client now lives in its own package:
org.elasticsearch.client:transport
https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.0/java-api.html
DocumentAlreadyExistsException has been removed and can therefore no longer be used.
The ability to pass a boost value using the field(String field) method in the form field^2 has been removed. Use the field(String, float) method instead.
There are a lot of changes in the builders: properties that were deprecated have now been removed.
There is also a new REST based java client available. I wrote two blog posts about that one already: part 1 and part 2. Reference documentation can be found here.
Scripting
The default scripting language is now Painless; the configuration for specifying a file script instead of an inline script has also changed.
Indexed scripts and templates have been replaced by stored scripts, which store the scripts and templates in the cluster state instead of a dedicated .scripts index.
Error handling
In some scenarios errors will be thrown instead of returning no results. Some examples:
- querying an unindexed field
- strict URL query_string parameter parsing, for example when providing analyser instead of analyzer.
In case of fatal errors like OutOfMemoryError that leave the JVM in a questionable state, the JVM is now shut down.
The end
Wow, you have come far. This is it for now. I think the guys at Elastic did a tremendous job. I can’t wait to get this version into production and start using all the cool new features. If you want to upgrade your cluster and need a second opinion or some help doing it, feel free to contact me to discuss the available options.