Elasticsearch on the web week 10


Every week there are a lot of posts on the web about elasticsearch and the other elasticsearch supplied products like Logstash and Kibana. I read a lot of these posts and some of them are interesting enough to share with you. In this week's overview there are posts about Kibana, aggregations and sizing your servers and shards. There are also a few older ones that were still on my to-read list.

Kibana 4

About two weeks ago a new version of Kibana was released. In one of my blog posts I talked about the beta of Kibana, and I also wrote a blog post on using the new Kibana to find abusers of a blog, with Logstash importing the access logs.

You can find the original announcement of Kibana over at elasticsearch. There was also a webinar about Kibana 4; you can now watch the recording. There are a lot of other people blogging about Kibana, and I personally like this one. It is a very nice blog post, but be warned: do not read it if you like a good beer and are thirsty. It gives a very good use case for Kibana 4 and answers a few questions around finding the best Belgian beer based on a Norwegian catalogue. I like the approach, the use of different components and of course the subject.


At the basis of Kibana is the advanced elasticsearch feature called aggregations. This is a very powerful feature, as the Kibana product proves, but you can use it to your own advantage as well. Zachary Tong is writing a series of articles discussing the analytical part of elasticsearch. He starts with the concepts of aggregations in the first post. In the second post he goes a step further and introduces sub aggregations. Combining a bucket aggregation with a metric aggregation gives you a lot of power. In the third blog post Zachary takes a shot at percentiles: what they are and how they work. A percentile is the value below which a certain percentage of the documents fall. So the 75th percentile gives you the value that 75 percent of the documents are below. In some cases you want the reverse: find the percentile that belongs to a certain value. He gives an example with the price of houses in the UK: find the percentile of the average price. You can use the percentile_ranks aggregation to do this.
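
To make the percentile idea a bit more concrete, here is a minimal Python sketch. It is not how elasticsearch computes percentiles (elasticsearch uses an approximation algorithm); it applies the simple nearest-rank definition to a small list. The two request bodies follow the documented shape of the terms, avg and percentile_ranks aggregations, but the field names town and price and the aggregation names by_town, avg_price and price_rank are made up for illustration.

```python
import json
import math


def nearest_rank_percentile(values, percent):
    """Return the smallest value with at least `percent` percent of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest-rank method: the rank is 1-based, so clamp it to at least 1.
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_rank(values, threshold):
    """Return the percentage of values that are at or below `threshold`."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for value in values if value <= threshold)
    return 100.0 * at_or_below / len(values)


def avg_price_per_town_request():
    """Bucket aggregation (terms) with a metric sub aggregation (avg).

    The field names "town" and "price" are hypothetical.
    """
    return {
        "size": 0,
        "aggs": {
            "by_town": {
                "terms": {"field": "town"},
                "aggs": {"avg_price": {"avg": {"field": "price"}}},
            }
        },
    }


def percentile_ranks_request(price):
    """Ask which percentile a given price belongs to (hypothetical "price" field)."""
    return {
        "size": 0,
        "aggs": {
            "price_rank": {
                "percentile_ranks": {"field": "price", "values": [price]}
            }
        },
    }


if __name__ == "__main__":
    prices = [100, 200, 300, 400]
    print(nearest_rank_percentile(prices, 75))
    print(percentile_rank(prices, 250))
    print(json.dumps(percentile_ranks_request(250), indent=2))
```

The two plain functions are each other's mirror: one goes from a percentage to a value, the other from a value to a percentage. That is the same relation as between the percentiles and percentile_ranks aggregations.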


Deleting documents

A very nice read on the effect of deleting documents on Lucene indexes. It talks about what actually happens when you delete a document. Remember that an update is effectively a delete followed by an add. It also explains that some queries cannot handle deleted documents well. The fuzzy query, for instance, can still match the ghost terms that deleted documents leave behind. He then continues to show the performance impact of a delete-heavy index. The impact is substantial, but still not as much as you might expect. There are some knobs you can turn to optimise dealing with deletes. Be careful though: you can break things easily and Lucene already comes with reasonable defaults.
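
The idea that a delete only marks a document, and that an update is a delete plus an add, can be sketched with a toy model in Python. This is not Lucene code and it ignores segments completely; the class ToyIndex and all its method names are invented for this example. It only shows why deleted documents and their ghost terms stay around until a merge cleans them up.

```python
class ToyIndex:
    """Toy model of an append-only index with delete markers (not real Lucene)."""

    def __init__(self):
        self._docs = []        # append-only list of (doc_id, text)
        self._deleted = set()  # positions in _docs that are marked as deleted

    def add(self, doc_id, text):
        self._docs.append((doc_id, text))

    def delete(self, doc_id):
        # A delete only marks the document; the stored data stays in place.
        for position, (stored_id, _) in enumerate(self._docs):
            if stored_id == doc_id:
                self._deleted.add(position)

    def update(self, doc_id, text):
        # An update is a delete of the old version followed by an add.
        self.delete(doc_id)
        self.add(doc_id, text)

    def stored_count(self):
        return len(self._docs)

    def live_count(self):
        return len(self._docs) - len(self._deleted)

    def term_dictionary(self):
        # Terms of deleted documents are still listed: the "ghost terms".
        return {term for _, text in self._docs for term in text.lower().split()}

    def merge(self):
        # A merge rewrites the data without the documents marked as deleted.
        self._docs = [
            doc for position, doc in enumerate(self._docs)
            if position not in self._deleted
        ]
        self._deleted = set()


if __name__ == "__main__":
    index = ToyIndex()
    index.add(1, "old beer")
    index.add(2, "kibana dashboard")
    index.update(1, "new beer")
    print(index.stored_count(), index.live_count())
    print("old" in index.term_dictionary())
    index.merge()
    print(index.stored_count(), "old" in index.term_dictionary())
```

After the update the toy index stores three documents while only two are live, and the term "old" is still in the term dictionary. Only the merge removes both.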


Sizing shards and hardware

People often ask on the Google group or Stack Overflow what the size of their shards should be, whether they should create an index per customer, or what other solution they should use to store their data in elasticsearch. In this blog post the guys from TrackJS tell us their story. It is not necessarily a blueprint for your solution, but it is nice to read about the steps they took.


Elasticsearch also wrote an interesting post about the performance of elasticsearch on multiple AWS cloud installations. So if you are planning on using AWS, this is a must-read.


Other interesting reads

A nice post by try labs about analysers, what they are for and how to use them.

An interesting post about the use of the common terms query and why this query makes the stop words filter obsolete. Using the common terms query gives you much of the same speed-up as using a stop words filter, because the query first matches on the rarer, more important terms. The stop words, or common terms, only come into play in a second step, for documents that already matched. So it has the advantage of speed, but keeps the option to use stop words to find results.
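
As a rough illustration of the idea, the Python sketch below splits the terms of a query into rare and common terms based on a cutoff, and builds a request body for the common terms query. The split_terms function is my own simplification and not the elasticsearch implementation; the document frequencies, the field name body and the cutoff values are made-up examples. The request body follows the documented shape of the common query with its cutoff_frequency parameter.

```python
def split_terms(query_terms, doc_frequency, total_docs, cutoff_frequency):
    """Split terms into (low_frequency, high_frequency) lists.

    A term counts as common when the fraction of documents containing it
    is above `cutoff_frequency`. This is a simplified illustration only.
    """
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    low, high = [], []
    for term in query_terms:
        relative_frequency = doc_frequency.get(term, 0) / total_docs
        if relative_frequency > cutoff_frequency:
            high.append(term)
        else:
            low.append(term)
    return low, high


def common_terms_request(field, text, cutoff_frequency=0.001):
    """Build a request body for the common terms query (field name is hypothetical)."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical document frequencies in an index of 1000 documents.
    frequencies = {"the": 900, "quick": 5, "fox": 3}
    low, high = split_terms(["the", "quick", "fox"], frequencies, 1000, 0.1)
    print(low, high)
    print(common_terms_request("body", "the quick fox"))
```

The point of the split is that nobody has to maintain a stop words list: which terms are common follows from the data in the index itself.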


One of the features that not everybody using elasticsearch knows about is measuring the performance of queries. This is explained in this post.