What a search PhD is doing at a data solutions company


In this post, after a brief commercial for my somewhat recently defended dissertation [1], I’ll draw a couple of parallels between the work I did in academia and how we approach data science at Luminis.

My dissertation is about search, or information retrieval: how to find relevant bits in the ever-growing multitude of information around us. Machine learning pops up a lot in search these days, and also in my work, whether it is classifying queries, clustering search results, or learning to rank documents in response to a query.

All chapters handle specialised search engines, e.g., people, scientific literature, or blog search engines. In each case, we use background knowledge about the domain to improve general algorithms. As an example, to improve Twitter search, we used hashtags to automatically train a machine learning algorithm (Chapter 7, based on [2]).
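To give a feel for the hashtag idea, here is a minimal sketch in Python. It is not the method of [2]; it assumes, for illustration only, that a hashtag can stand in for a topic and that the tweets carrying it count as pseudo-relevant examples for that topic. The function name `build_pseudo_judgments` and the `min_tweets` parameter are hypothetical.

```python
import re
from collections import defaultdict

# Matches a hashtag and captures the tag without the leading '#'.
HASHTAG = re.compile(r"#(\w+)")


def build_pseudo_judgments(tweets, min_tweets=2):
    """Group tweets under each hashtag they carry (assumed to act as a topic).

    The hashtag is stripped from the tweet text, so a ranker trained on the
    result cannot simply match the topic string. Hashtags that occur in
    fewer than `min_tweets` tweets are dropped.
    """
    by_tag = defaultdict(list)
    for tweet in tweets:
        # Lowercase and de-duplicate the tags within one tweet.
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        for tag in tags:
            pattern = r"#" + re.escape(tag) + r"\b"
            text = re.sub(pattern, "", tweet, flags=re.IGNORECASE)
            # Collapse the whitespace left behind by the removed hashtag.
            by_tag[tag].append(" ".join(text.split()))
    return {tag: docs for tag, docs in by_tag.items() if len(docs) >= min_tweets}
```

Called on a list of tweet strings, this returns a dictionary from hashtag to the tweets that mention it. Such pairs could then serve as automatically generated training material, with no manual labelling involved.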

A first obvious parallel between what I did before and what we do at Luminis Amsterdam is search. Elasticsearch is one of our biggest knowledge areas, as the proportion of blog posts on this site about it indicates.

A second is that the machine learning algorithms used in my dissertation can be applied to a wide range of problems, also outside search. For example, at Luminis we use classification techniques for problems like churn prediction and prospect scoring. The algorithms I personally have most experience with (e.g., decision trees, support vector machines, and Bayesian classifiers) share the property that they are interpretable. At Luminis, we feel it is important for our customers to be able to understand, maintain, and manipulate the predictive algorithms we build for them.
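As a small illustration of what interpretable can mean in practice, the sketch below fits a one-level decision tree (a decision stump) and renders the learned rule as a readable sentence. It is a toy under stated assumptions, not the models we build for customers: the function names `fit_stump` and `describe` are hypothetical, and any feature names passed in are up to the caller.

```python
def fit_stump(rows, labels):
    """Find the single feature/threshold rule with the fewest training errors.

    rows:   list of dicts mapping feature name -> number
    labels: list of bools (True = churned)
    Returns (feature, threshold, above); churn is predicted when the feature
    value is > threshold if `above` is True, and <= threshold otherwise.
    """
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds lie halfway between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for above in (True, False):
                errors = sum(
                    ((row[feature] > threshold) == above) != label
                    for row, label in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above)
    if best is None:
        raise ValueError("need at least two distinct values in some feature")
    _, feature, threshold, above = best
    return feature, threshold, above


def describe(feature, threshold, above):
    """Render the learned rule as a sentence a customer can read."""
    comparison = ">" if above else "<="
    return f"predict churn when {feature} {comparison} {threshold}"
```

The point of the design is that the whole model fits in one sentence: whoever maintains it can read the rule, question the threshold, and change it by hand if the business situation calls for that.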

A third is data. Right at the beginning of my PhD, in one of my favorite projects, we performed a query log analysis of a people search engine [3]. What made this exciting for me was the fact that we were working with real data, from real people. At Luminis, we work with real data as well, e.g., data from schools, hospitals, and businesses.

A fourth is tooling. In my PhD, as my experiments grew more complex and my datasets larger and larger, I appreciated more and more the software engineering challenges associated with data science. Working at Luminis means upgrading my software toolbox in almost every aspect. Python libraries like pandas and scikit-learn, JavaScript frameworks like Angular, the ELK stack (Elasticsearch, Logstash, Kibana), Spark, and Java frameworks like Spring and OSGi are some of the software that I’ve started using a lot more.

A fifth is dissemination. Of course, science is all about dissemination of knowledge (after one has been the first to obtain a publishable bit of it). But at Luminis, too, we believe in sharing our knowledge, and even our code, with large open source projects like Amdatu [4]. For me personally, it means among other things that I was given the chance to prepare a talk about how one might approach a data science project for a retail business starting from just an Excel sheet and without any business input; the video [5] and code [6] are online.

A sixth is experience. At ILPS, led by Maarten de Rijke, where I did my PhD, there was a vast amount of experience with the challenges and opportunities of the full range of search engines, from major web search engines to smaller specialised ones like, for example, that of the Netherlands Institute for Sound and Vision. At Luminis, we can draw on a vast amount of experience with the challenges and opportunities of our customers: businesses, semi-public and public organisations.

Putting these ingredients together, this is one way I like to think about how we approach data science: we enable organisations to build a data-driven culture that can be sustained by their own people, and on which decision makers can base responsible and informed strategies and decisions. That takes interpretable algorithms, insightful reports and experiments, interactive dashboards with visualisations that directly relate to what is going on under the hood in predictive algorithms, and useful applications, all backed by solid software engineering, resulting in lean and maintainable code bases.

[1] Berendsen, R. W. (2015). Finding people, papers, and posts: Vertical search algorithms and evaluation. PhD Thesis, UvA. URL: http://dare.uva.nl/record/1/489897

[2] Berendsen, R., Tsagkias, M., Weerkamp, W., & De Rijke, M. (2013, July). Pseudo test collections for training and tuning microblog rankers. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval (pp. 53-62). ACM. Pre-print: http://wouter.weerkamp.com/downloads/sigir2013-pseudotestcollections.pdf

[3] Weerkamp, W., Berendsen, R., Kovachev, B., Meij, E., Balog, K., & De Rijke, M. (2011, July). People searching for people: Analysis of a people search engine log. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 45-54). ACM. Pre-print: http://www.wouter.weerkamp.com/downloads/sigir2011-peoplesearchlog.pdf

[4] https://www.luminis.eu/what-we-share/open-source/

[5] https://youtu.be/8wNl89zXlKw

[6] https://github.com/luminis-ams/devcon-2016-rb