Elasticon 2016: Day 2
This week we are attending Elasticon 2016. If you want to follow along, check our Twitter feeds; if you prefer a summary, read this post together with the previous and next ones. Our Twitter handles are:
- @rjlinden
- @ByronVoorbach
- @jettroCoenradie
Get the Lay of the (Lucene) Land
(by Adrien Grand)
Lucene started out as a library for full-text search, but today it is used for much more: analytics, geo search, suggestions, structured search, even as a data store. To support these functionalities the inverted index alone was no longer enough, so a column store was introduced in Lucene 4.0. The column store makes it easy to combine inverted-index queries with column-store queries in one API, enabling faster aggregations and sorting, for instance.
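To make that concrete: a single Elasticsearch request can combine a full-text query (served by the inverted index) with an aggregation and a sort (served by the column store). A minimal sketch using the official Python client; the index and field names are made up:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# One request: the match query runs against the inverted index, while the
# aggregation and sort are answered from the column store (doc values).
# The `talks` index and its fields are hypothetical.
response = es.search(
    index="talks",
    body={
        "query": {"match": {"title": "lucene"}},             # inverted index
        "aggs": {"per_year": {"terms": {"field": "year"}}},  # column store
        "sort": [{"year": {"order": "desc"}}],               # column store
    },
)
print(response["aggregations"]["per_year"]["buckets"])
```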
Lucene now provides a faster implementation of the k-d tree, which enables much faster structured search: important for numbers, dates and geo queries. To make structured search fast, numbers are indexed on multiple levels: the data is first filtered using the coarsest level, after which the heavier searching is done on far fewer terms. Combining the levels makes search much faster. The idea behind k-d trees is to split the space into cells that contain the same number of points, and to keep splitting until each cell holds one item. For Lucene this is too fine-grained: Lucene stops splitting once a cell holds around 1,000 points.
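A toy sketch of that splitting idea, not Lucene's actual implementation: recursively split the points on the median of alternating dimensions until a cell holds no more than a leaf-size threshold, around 1,000 points in Lucene's case.

```python
# Toy k-d-tree-style splitting (illustration only, not Lucene's BKD code):
# split points on the median of alternating dimensions until a cell holds
# at most `leaf_size` points -- Lucene stops around 1,000.
def build_cells(points, depth=0, leaf_size=1000):
    if len(points) <= leaf_size:
        return [points]                      # a leaf cell
    dim = depth % len(points[0])             # alternate the splitting dimension
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2                   # median split keeps cells balanced
    return (build_cells(points[:mid], depth + 1, leaf_size) +
            build_cells(points[mid:], depth + 1, leaf_size))

cells = build_cells([(x, x * 7 % 100) for x in range(5000)], leaf_size=1000)
print(len(cells), "cells")  # 8 balanced cells for 5,000 points
```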
Elasticsearch 5.x will support BigDecimal and IPv6 values, which are currently impossible because of the 64-bit limit; the new k-d structure lifts that restriction.
An important change in the upcoming Elastic version is the default scoring algorithm: BM25 replaces TF-IDF as the standard in Elasticsearch 5.x, because it handles common terms better.
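For reference, a sketch of the BM25 formula (with the usual defaults k1 ≈ 1.2 and b ≈ 0.75) shows why common terms behave better: the term-frequency contribution saturates instead of growing without bound, as it can under raw TF-IDF.

```latex
% BM25 score of document D for query Q; f(t,D) is the term frequency,
% |D| the document length, avgdl the average document length.
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```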
Lucene 5.4 has improved support for two-phase iteration: a fast approximation runs first, followed by a slower exact match.
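A toy illustration of the two-phase idea (hypothetical code, not Lucene's): for a phrase query, the cheap approximation checks that all terms occur in the document at all, and only then does the expensive positional check run.

```python
# Two-phase iteration in miniature: a cheap approximation filters
# candidates, an expensive verification confirms the survivors.
def phrase_matches(doc_tokens, phrase):
    # Phase 1: fast approximation -- do all terms occur at all?
    if not set(phrase).issubset(doc_tokens):
        return False
    # Phase 2: slower exact match -- do they occur consecutively?
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))

print(phrase_matches(["get", "the", "lay", "of", "the", "land"],
                     ["lay", "of", "the", "land"]))  # True
```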
Eventbrite’s Search-based Approach to Recommendations
(by John Berryman)
Eventbrite is building an Elasticsearch-powered, content- and behavior-based recommendation system to match users with events they are sure to enjoy.
Organizers sometimes complained that their events were not easy to find. To solve this, Eventbrite combines search and recommendations into a personalized search per user, allowing it to return better results.
This way a user from the United Kingdom gets different results for the same search terms than a user from Canada.
Eventbrite adds this data to its documents at index time to make it easier to query for recommendations.
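A sketch of what such a personalized query could look like; all index and field names here are assumptions, not Eventbrite's actual schema. Data known about the user, such as their country, is turned into a boost, so identical search terms score differently per user:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def personalized_search(terms, user_country):
    # Hypothetical schema: events indexed with a `country` field at index
    # time. Documents matching the user's country get a relevance boost,
    # so two users see different rankings for the same terms.
    return es.search(
        index="events",
        body={
            "query": {
                "bool": {
                    "must": {"match": {"description": terms}},
                    "should": {"term": {"country": {"value": user_country,
                                                    "boost": 2.0}}},
                }
            }
        },
    )

uk_results = personalized_search("food festival", "GB")
ca_results = personalized_search("food festival", "CA")
```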
Security, Alerting, Monitoring & More With the Elastic Stack
(by lots of speakers)
This session was all about X-Pack, the commercial products from Elastic.
Monitoring
Marvel 1.x was great when you knew exactly what you were looking for. The next version of Marvel becomes Monitoring, for Elasticsearch 2.0, and it will support the new Kibana 5 look and feel. Monitoring comes with support for multiple clusters, and it becomes easier to compare the health of different nodes with each other. There is integration with Watcher to create a feature, most likely called Issues, that will automatically recognise problematic situations and make a notification available for you to act upon. Finally, it will provide cross-stack monitoring: Kibana, Logstash and Beats are monitored together with Elasticsearch.
Security with the Elastic stack
A new API is coming for adding users to Shield; currently you have to do this using a CLI tool. Kibana security comes with lots of features to secure your Kibana installation: built-in users, secured dashboards and user management.
Alerting
Watcher 2.0 introduces an API to activate and deactivate a watch, a good addition. Watcher will embed Curator as an action, making it possible to do some cleaning up based on certain states. Another cool addition is the Watcher UI, which will become available in the next version and helps with managing your watches.
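As an illustration of what a watch looks like, assuming the Watcher 2.x REST endpoints (the index name and threshold below are made up): a watch has a trigger, an input, a condition and actions, and the new endpoints toggle it without deleting it.

```python
import requests  # talking to the Watcher REST API directly

ES = "http://localhost:9200"

# A minimal watch (index name and threshold are hypothetical): check every
# 10 minutes for error documents and log a message when any are found.
watch = {
    "trigger": {"schedule": {"interval": "10m"}},
    "input": {"search": {"request": {
        "indices": ["logs-*"],
        "body": {"query": {"match": {"level": "error"}}},
    }}},
    "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
    "actions": {"log_it": {"logging": {
        "text": "Found {{ctx.payload.hits.total}} errors"}}},
}
requests.put(f"{ES}/_watcher/watch/error_watch", json=watch)

# The new 2.0 endpoints to (de)activate a watch without deleting it:
requests.put(f"{ES}/_watcher/watch/error_watch/_deactivate")
requests.put(f"{ES}/_watcher/watch/error_watch/_activate")
```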
Reporting
It will become possible to generate a PDF from a Watcher trigger and send the report using a Watcher action. A management panel will be available that lets you browse the history of generated reports.
Graph Capabilities in the Elastic Stack
(by Mark Harwood and Steve Kearns)
Data is not flat; relationships live in our data. A document can relate two other entities, and two documents can be related because they share the same value for a field. This is ideal to present and use as a graph of relationships, but also for recommendations. Creating the graph and finding the relevant related nodes is usually done based on the count of relationships. With Elasticsearch, graph exploration can be done with relevance instead: follow links not by count but by relevance, and don't skip super-connected entities, account for them. Understand that this does not work in all situations, though, which is why you can explore either count-based or by relevance and choose whichever fits.
An example is finding the source of credit card problems among the people in the room. Imagine everyone here has problems with their credit card. Questions you can ask to identify the source are: who bought something from Amazon? Almost everybody. Who bought a ticket for Elasticon? Again almost everybody, so that is not a good distinction. But if you change the context from people in the room to people on the street, there is a big difference: almost everyone has still bought something at Amazon, but not too many people have bought an Elasticon ticket. The relevance of buying an Elasticon ticket is therefore a lot higher for the people in this room. Elasticsearch queries can provide this context.
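The same count-versus-relevance distinction shows up in plain Elasticsearch aggregations, which is roughly the machinery this builds on: a terms aggregation ranks by raw count (Amazon wins), while a significant_terms aggregation ranks by how much more frequent a term is in the foreground set than in the background (the Elasticon ticket wins). A sketch with a made-up transactions index:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Hypothetical index of transactions with `person_group` and `merchant`
# fields. Foreground: people in the room; background: everyone.
body = {
    "query": {"term": {"person_group": "in_the_room"}},
    "aggs": {
        # Raw popularity: Amazon will dominate here.
        "by_count": {"terms": {"field": "merchant"}},
        # "Uncommonly common": the Elasticon ticket seller surfaces here,
        # because it is far more frequent in the room than on the street.
        "by_relevance": {"significant_terms": {"field": "merchant"}},
    },
}
result = es.search(index="transactions", body=body)
```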
A Kibana plugin that visualises the graph and provides controls to play with it is available in version 5. Using the interface you can store the result of a Graph API call as a percolator query; that way you can use related items to be notified of new albums, for instance. Within the Graph GUI you can easily expand the visible nodes from each node, and you can group found nodes into a single node and interact with it to find new relations between that node and others. All of this by pushing a few buttons.
Ingest Node: Enriching Documents within Elasticsearch
(by Tal Levy and Martijn van Groningen)
The ingest pipeline is comparable to the filter part of Logstash, but without the input and output parts: the input is the interception of all incoming documents in Elasticsearch, and the output is always the Elasticsearch indexer. All nodes can become ingest nodes, although in production you will most likely create dedicated nodes for ingestion. Ingest comes with a number of processors; think of grok, geoip, mutations and converters. Documents that fail the original pipeline can be sent to another pipeline to be handled differently, using on_failure. Debugging and testing pipelines is easy with the simulate API, which lets you try out a pipeline without actually indexing the documents.
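A small example of what this looks like against the ingest endpoints (the field values and pipeline name are made up): define a pipeline with a grok processor and an on_failure handler, then dry-run it with simulate.

```python
import requests  # calling the ingest REST endpoints directly

ES = "http://localhost:9200"

# A pipeline with a grok processor; documents that fail grok get an error
# field set instead of being rejected (field names are hypothetical).
pipeline = {
    "description": "parse apache-style logs",
    "processors": [{
        "grok": {
            "field": "message",
            "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:path}"],
            "on_failure": [{"set": {"field": "error",
                                    "value": "grok failed"}}],
        }
    }],
}
requests.put(f"{ES}/_ingest/pipeline/apache", json=pipeline)

# Simulate: try the pipeline on a sample document without indexing it.
sample = {"docs": [{"_source": {"message": "127.0.0.1 GET /index.html"}}]}
print(requests.post(f"{ES}/_ingest/pipeline/apache/_simulate",
                    json=sample).json())
```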
With ingest it is now possible to intercept every document and, using plugins, add metadata fields. The interception happens without knowledge of how the document will be stored: the mapping is not available, and the shard it will be sent to is unknown. The advantage is that you do not have to do the pre-processing for every shard. The interception and processing take place at the node level, not the shard level, so when working with replica shards all the processing is done only once.
The ingest node also makes the new reindex API more powerful: combining the reindex API with ingest, you can alter the original documents using an ingest pipeline before they are sent to the new index.
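A sketch of the combination (the index names are assumptions, and the pipeline is the hypothetical one defined above): the reindex request names a pipeline on the destination, so every document passes through it on the way to the new index.

```python
import requests

ES = "http://localhost:9200"

# Reindex `logs` into `logs-v2`, running each document through the
# hypothetical `apache` pipeline defined earlier before it is indexed.
reindex = {
    "source": {"index": "logs"},
    "dest": {"index": "logs-v2", "pipeline": "apache"},
}
print(requests.post(f"{ES}/_reindex", json=reindex).json())
```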
The GUI wizard in Kibana helps you create a pipeline. It also generates the index mapping template once you have configured your complete pipeline, and it has a nice feature that helps you write your grok expressions by showing the result immediately.
Time for some drinks; tomorrow is the final day of Elasticon 2016!