Blog

  • E-Commerce Site Search Overhaul – Super “selection” year 2021?

    E-Commerce Site Search Overhaul – Super “selection” year 2021?

    It’s not just politics that will be exciting this year. Changes are also on the horizon for e-commerce. Evaluating an e-commerce site search overhaul, and whether to make or buy the solution, is quickly becoming a top trend in 2021.

    But how thoroughly should you prepare for an e-commerce search solution overhaul?
    Increasing economic performance (e.g., CTR, CR) and improving the user experience (e.g., faster loading times, discovery features, and content integration) are both issues that undoubtedly concern all e-commerce retailers and must be dealt with to prevail against the competition.
    Reducing the manual effort required to maintain and control on-site search is an essential task in this regard.
    Beyond that, however, some other important questions need to be answered in advance within the organization.
    The following is a summary of the most critical points.

    E-Commerce Search Overhaul — Make or Buy?

    Algorithmic:

    How good is the search relevance model in full-text search, semantic correlations, long-tail keywords, languages?

    Discovery Features:

    How well are topics like complex price & availability dependencies, as well as guided selling and recommendations covered?

    Content Integration:

    What opportunities exist in terms of the controllable blending of products and promotional content?

    Merchandising Features & Analytics:

    How well can different sales-promoting strategies (including ranking) be combined with business KPIs and evaluated?

    Customization:

    How easily can individual requirements be implemented?

    Intellectual Property:

    How can it be ensured that contributed domain knowledge and other forms of intellectual property remain in house?

    Deployment Model & Architecture:

    How flexible are the deployment model and system architecture?

    Integration & Ease of Use:

    How straightforward are system integration, everyday use, and operation?

    Which solution is the right one? Whether a commercial solution (such as Algolia, Attraqt, FACT-Finder, Findologic) or an open-source framework (such as OCSS https://blog.searchhub.io/introducing-open-commerce-search-stack-ocss) — the decision must be well-prepared.

    Conditions for the selection of an E-Com on-site search

    Before making an informed decision about selecting a new on-site solution/technology, it is vital to understand the implications, dependencies, and scope of such a decision.

    The deployment of such a solution quite often influences future business functions and strategic decisions without this being directly apparent in advance. Therefore, I examine the three most important influencing factors in more detail below.

    The Influence on corporate strategy:

    The core functions responsible for the broader business strategy’s economic success are a natural product of the medium- to long-term corporate strategy.

    The answers to the following questions about corporate strategy are particularly relevant when preparing for a vendor selection:

    1. Is a marketplace game plan a part of the corporate strategy in the next 2-5 years?
    2. To what extent do diversified local prices and corresponding availability need to be mapped via the on-site search solution?
    3. How will you divide your focus across customer channels in the mid-term? How will the ratios look?
    4. Which unique selling points/functionalities provide an anchor for your corporate strategy? For example, content leadership or the expansion of digital advisor functionalities.
    5. Which midterm geographic growth markets are already known?
    6. Is strategic ownership of core technological competencies and technologies part of the corporate strategy?
    7. Are there strategic requirements in terms of technological infrastructure (on-premise, private cloud, open cloud)?
    8. How large is the internal team (professional and technical) available to operate the on-site search?

    Influence of the IT architecture

    On-site search has to support many core functionalities of a digital enterprise. E-commerce search consumes, processes, and makes available various data streams for further processing. As a result, agile integration into the existing enterprise IT architecture is essential for success.

    If these foundational dependencies are not acknowledged, subsequent adjustments, or even fundamental changes to the system landscape, are often DOA (dead on arrival), marred by lengthy, risky, and costly follow-up projects.

    In terms of the IT-organization, the answers to the following questions are particularly relevant when preparing a vendor selection:

    1. Which source and target systems integrations with on-site search currently exist, and which will be considered within the midterm?
    2. What are the data-security requirements? How often does this data need to be updated?
    3. Are there defined requirements concerning service-level agreements?
    4. Are there defined requirements in terms of deployment and infrastructure?
    5. Should the on-site search system-integration reside exclusively at the data level (headless architecture), or are rendering functionalities must-have requirements?
    6. From a technical perspective, should the on-site search system also be used as a product API?
    7. Are there complementary functionalities? For example, recommendation engines, personalization, or guided selling systems that need to be functionally linked or even combined with on-site search?

    Influence of operational resources and organization

    On-site search requires constant maintenance and must react to internal and external factors with agility. For this reason, the selection, implementation, and operation of an on-site search is always only part of the solution. The system must also be continuously managed and maintained, both by data-driven external systems (e.g., searchHub.io) and by operational staff with the appropriate domain knowledge.

    For the planning of operational resources and team organization, the following questions are essential for the professional selection of an on-site search solution:

    1. Does a dedicated team of employees already exist to maintain the on-site search manually? If so, how many?
    2. Does the team have developers, testers, and analysts? If not, is there a plan to expand the team’s skill set in these areas?
    3. Is the On-Site Search Team organized as a vertical business function in itself? I.e., does the team have all the necessary resources and skills to develop the On-Site Search business function on its own?

    Conclusion – E-Commerce Site Search Overhaul:

    Strategic internal deliberations significantly influence the evaluation of a new on-site search solution (make or buy). Answering these questions reveals their far-reaching and strategic nature. Naturally, thorough preparation will take time, and all necessary stakeholders will need to arrive at a consensus about the objective. This process leads to greater clarity regarding the next steps. Even if that means the best approach may be to keep the current, well-integrated platform and instead work on mitigating its weaknesses.

    There’s More than One Way to Skin a Cat

    There are often several ways to fix an on-site search-related deficiency. Resorting to blind action for the hell of it should never be one of them. Regardless of the euphoric high associated with onboarding a new complex piece of kit, if you haven’t done your homework, you’ll inevitably be trading fruit-flies for maggots. Like hoping to exchange your partner for a younger, less judgmental model, if you haven’t come to grips with your own shortcomings, you’re damned to take them with you to the next relationship.

    Know When to Hold ‘em, Know When to Fold ‘em

    I get it. Every so often, there’s nothing left to salvage. It’s best to cut ties and move on. However, between you and me, there are massive benefits in using software like searchHub to boost an existing system quickly. Furthermore, setting up searchHub affords more forward flexibility. This kind of software runs independently of any search solution. Meaning, you can use the logic you built with us and take it to any other search provider you migrate to in the future.

    • Best-case scenario: you turn your current solution into a searchandizing powerhouse.
    • Worst-case scenario: you now clearly understand what type of e-commerce search solution your business requires. And because your search-engine logic is not married to your on-site search, you’re able to migrate to a new solution with next to zero downtime.

     

    searchHub.io offers data-driven support and helps optimize existing search applications without making a corresponding system change.

  • Sustainable Development for ecommerce site search

    Sustainable Development for ecommerce site search

    What’s your first association when you read “sustainable development”? Perhaps it conjures up some dry country in the Southern Hemisphere with lots of potential for present and future development? Maybe IT startups developing solutions that help reduce CO2 emissions are your thing? Or perhaps a Tesla Model S Plaid+ that “develops” you from 0-100 km/h in a sustainable 2.1 seconds?


    © Photo: Johan Eriksson for Dollar Street 2015

    My Association with Sustainable Development? Cache!

    My Association is Cache. Not Cash. Cache.

    These days, everyone’s talking about website performance. The rapid increase in online traffic during the pandemic has contributed to an even greater focus on the topic. And what better way to increase website performance in volatile times than an intelligent caching strategy? The knowledge is not new but bears frequent discussion.

    A clever caching design not only increases the performance of your website; it’s also a smart way to build environmentally friendly, sustainably scaling applications.

    Siegfried Schüle

    Before continuing with this post, it will serve you well to familiarize yourself with the following: Latency numbers every programmer should know.

    Wow, that’s impressive. One quick check in the server’s local memory (on the order of 100 nanoseconds for a main-memory reference) is 1,500,000 times faster than requesting the same information with an HTTP request over the internet (roughly 150 milliseconds for a round trip). Not only is it faster, but it’s more efficient and climate-friendly as well.

    Tiny little numbers aren’t your thing? The following puts the same figures into more humanly relatable terms:

    Compute Performance – Distance of Data as a Measure of Latency

    How to Sustainably Develop Ecommerce Infrastructure?

    I have often stumbled over the same problem while building e-commerce applications over the last 20 years: website users want to see stuff other users have already seen. This behavior has not changed a tiny bit. Things like a product image, the detailed description of the newest Xphone, a search result page, or in most cases, even the “in stock” status. Now, what’s a developer to do, tasked with delivering this information correctly to the customer?

    List of Developer Tasks to Right the World

    • Images: Load the image from the media database (usually stored in high resolution), scale it to the correct resolution for the customer’s device, and send it over the net.
    • Article texts: Load the texts from the PIM (product information management), where all the marketing people could edit the texts at any time and want to see the latest update online as soon as they hit the save button.
    • In Stock status: Send a request to the ERP system asking for the number of articles still available and determine the “in stock” status based on various information like “already put in some other customers’ baskets,” “already ordered, but the customer has still not finished payment process, and so I do not know exactly whether this one-piece has finally been sold or not” and maybe other fancy stuff.
    • Search Results: Send the user query for “ihpone” to the search index, which will try to find products that are more or less similar to the user query and, if lucky, return some iPhones or matching accessories.

     

    Congratulations to the development team. If they followed all the guidelines above, they would have built a rock-solid system that will always show real-time data to the customers. But it will not be sustainable.

    Why Isn’t Your E-commerce Infrastructure Development Sustainable?

    The type of development described above requires servers all over the planet to repeatedly calculate or HTTP-request the same stuff, though not a single kilobyte has changed since the last time they (or some other server) calculated it.

    You Need A Strategic Caching Approach

    It’s all about the Cache

    Let’s look at this practically.

    • If your product images’ source has not changed, there is no need to ask your SaaS image-scaling-service to scale the image to some smaller resolution more than once. It will produce the same output for the first time, the 10th time, or the 1000th time.
    • Suppose your article texts have not been SEO-optimized within the last few minutes, and there has been no other activity connected with this specific product either. Why then should you bother the (maybe distant) database?
    • Or suppose no system has yet registered the status change of a particular article from “in stock” to “unavailable.” As a result, interested parties have yet to receive a notification. Why continue asking the ERP like a three-year-old kid bugging his parents, then?
    • If your domain-specific language hasn’t changed, and as long as customers typing “ihpone” still mean “iPhone,” why should your search engine try finding fuzzy matches all day long? 🤦‍♂️

     

    While the first three aspects are quite obvious and widely implemented throughout the eCommerce landscape, the fourth is not. But its impact is enormous!

    What is the Impact of Poorly Cached Site Search?

    Imagine a search index of product texts, which can easily contain 1,000,000 different words. If a user searches for any given phrase, the index must, to some extent, compare each input word with each indexed word. As long as we are talking about exact matches (“iPhone” → “iPhone”) or matches explicitly produced by analyzers such as stemmers (“iPhones” → “iPhone”), this is no concern. But as soon as we use more sophisticated fuzzy matching, the impact can be huge. Some algorithms are much less efficient in FACT than the ones used by Elastic; some say this is necessary to achieve higher precision.

    I, however, adhere to a very different approach. Imagine you are already sure how relevant a specific user input is for a particular product text. In that case, it would be wise to remember this decision (or load all appropriate decisions into memory) so you don’t have to calculate it again next time. Let me show you a rough calculation of the effect this has on your server load. To simplify the calculation, I’ll measure all server costs in the milliseconds necessary to perform the operation and the resulting CO2 emissions:

    Server Load Relative to CO2 Output

    User input | Matching Algorithm | Costs (ms) | Result | Costs per Search (ms) | CO2 Emissions per Search | CO2 Emissions per 1M Searches
    ihpone | exact | 0.1 | unsuccessful | 0.1 | 0.001 mg | 1 g
    ihpone | Levenshtein Distance 1 | 1 | unsuccessful | 1 | 0.01 mg | 10 g
    ihpone | Levenshtein Distance 2 | 5 | iPhone | 5 | 0.05 mg | 50 g
    ihpone | Sophisticated algorithm | 100 | iPhone | 100 | 1 mg | 1 kg
    ihpone | Sophisticated algorithm with cached result + exact match | 100.1 | iPhone | 100 + 0.1 × search | decreasing | ~1.001 g

    Calculations based on information found here.

    How Site Search Server Load Increases Your Shop’s CO2 Footprint

    The first time the term “ihpone” is entered into your shop, it’s necessary for your eCommerce application to use a sophisticated algorithm to determine that the user intended to find an “iPhone.” Some search engines use sophisticated (i.e., load-intensive) algorithms by default. Admittedly, they are easy to use. Simply provide enough server power to scale them horizontally, and they will return surprisingly good results.

    Mind your ecological footprint

    On the other hand, if we take an ecologically strategic approach to server load compared to its strain on the environment, the story looks dramatically different.

    For example: How often do you think users’ search intent changes for an identically misspelled phrase over, say, hours, days, or even years? Although the product has changed since its 2007 debut, the “ihpone” typo and its intent have remained stable throughout the last 14 years. How many billions of search requests have been executed within that period, requiring search engines to apply more or less sophisticated algorithms and forcing server CPUs to produce heat and the resulting CO2?

    In an ideal world, only the typo’s first appearance needs expensive algorithms to calculate a proper response. After that, every request uses exact (and cheap) matching technologies.
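    To make the idea concrete, here is a minimal sketch in Java of caching a correction once and answering every later occurrence of the same typo from memory. The expensiveFuzzyCorrection method is a hypothetical placeholder for whatever load-intensive algorithm your engine uses.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Sketch: remember expensive fuzzy corrections so repeated typos become cheap exact lookups. */
    public class CachedSpellCorrector {

        private final Map<String, String> correctionCache = new ConcurrentHashMap<>();

        public String correct(String userQuery) {
            // Cheap path: a cache hit costs roughly one in-memory lookup (the 0.1 ms case above).
            return correctionCache.computeIfAbsent(userQuery, this::expensiveFuzzyCorrection);
        }

        // Hypothetical stand-in for the load-intensive algorithm (Levenshtein or smarter);
        // it runs only once per distinct input.
        private String expensiveFuzzyCorrection(String query) {
            return query; // e.g., "ihpone" -> "iPhone" in a real implementation
        }
    }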

    With searchHub, we do our best not only to optimize search result quality. By making exact search easy and using it frequently, we also reduce e-commerce search’s climate footprint, running the sophisticated calculations only once and reusing the results wherever possible.

    American Carbon Footprint is relatively high – Townsquare Media

    With 56 billion optimized search requests, we project to have saved roughly 30 tons of CO2 emissions within the last 12 months. That’s equivalent to approximately the yearly CO2 amount of four Belgians, or just over one American! True, this is a tiny drop upon the hot rock we all call home, but maybe it inspires you to rethink caching strategies for your product or within your eCommerce shop.

    Siegfried Schüle

    CEO

  • Artificial Stupidity – How To Avoid it before it’s too late

    Artificial Stupidity – How To Avoid it before it’s too late

    The realization struck me while holding the hand of my seven-year-old son, standing at the edge of the tallest cliff I had ever looked over. At that moment, his boundless freedom to explore his surroundings took a back seat to his safety. In that precarious and volatile moment, my natural intelligence as a human outweighed philosophical notions of parenting. Anything less would have been artificially stupid.

    Machine Learning and Real-World Consequences

    Assuming my parental judgment, described above, is sound, we could safely say that most parents, placed in a similar situation, would make a similar judgment call. Suppose it is true that we can make intelligent, rational decisions in the interest of posterity. Why are we so sluggish about transferring this embedded natural intelligence to the machine learning algorithms we develop and implement into, arguably, equally precarious business situations?

    When AI is your lover — you extrapolate all over the place

    Our infatuation with artificial intelligence leads to a mindless disregard for natural intelligence. Unsurprisingly, in the words of Vincent Warmerdam, this makes our machine learning algorithms artificially stupid.

    Algorithms merely automate, approximate, and interpolate. Extrapolation is the dangerous part.

    Vincent Warmerdam, 2019

    Image by Gerd Altmann from Pixabay

    The danger of getting emotionally involved

    This post pays open homage to Vincent’s enlightening talk from 2019 entitled “How to Constrain Artificial Stupidity”– a topic increasingly deserving of a more watchful eye. What follows is part 1 of a series, in which we will take a closer look at several of Vincent’s fixes for Artificial Stupidity in the field of machine learning.

    Artificial Stupidity: the lack of real-world consensus (or natural intelligence) reflected in machine learning algorithms.

    This complacency around natural intelligence, and how to implement it in our machine learning models, dumbs down the output of our otherwise ingenious AI creations and can lead to disastrous real-world consequences.

    Example of Artificial Stupidity in the Wild

    The Boston Housing Data Set is used broadly to run probability tests on the housing market. One of its data columns delineates the “number of black people in your town.” If left unquestioned, running probabilities against this data set will ironically reinforce a preexisting bias within the very data thought to provide a “fair” estimation of housing trends.

    This example makes strikingly clear how important it is to remain curious about your data’s sources and content before reporting any algorithmic successes.

    Artificial: Made or produced by human beings rather than occurring naturally, especially as a copy of something natural.

    Stupidity: Behavior that shows a lack of good sense or judgment.

    How wrong can an AI Model be?

    There are usually two things that can go HorriblyWrong™ with models.

    1. Models don’t perform well on a metric people are attached to.
    2. Models do something that you don’t want them to.

    My thesis is that the industry is too focused on the performance; we need to worry more about what happens when models fail.

    Vincent Warmerdam, 2019

    Image by succo from Pixabay

    Avoiding the AS (Artificial Stupidity) — “Love is Blind” Trap

    If the above thesis is confirmed, a stronger focus on understanding why models fail and taking necessary steps to fix them is in order. It would better serve us if we began approaching machine learning like people in physics: study a system until it becomes clear which model will explain everything.

    The following is the first in a set of four suggested fixes. The remaining three will follow in future posts.

    Fix #1: Predict Less, and more carefully

    We must be honest about what AI does. AI does not, in fact, deliver a probability. Honestly put, AI gives us an approximation of a proxy, given certain known factors.

    AI cannot determine how unknown factors will influence what we do know. As a result, any missing data or data we are unaware of will dramatically affect our model’s output. Without all the data, we are unable to illustrate at which point the AI model will fail.

    This wouldn’t be a problem if machine learning models weren’t always designed to return a result. We need to build safeguards that constrain when a model returns a result, and determine at which threshold those constraints should prevent an artificially stupid prediction.

    In short: If we don’t know, don’t predict!
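    As a minimal sketch of such a safeguard (the Model interface, Prediction type, and confidence score below are hypothetical, not taken from any specific library), a thin wrapper can simply refuse to answer below a configured threshold:

    import java.util.Optional;

    /** Sketch: a model wrapper that abstains instead of returning a low-confidence prediction. */
    public class GuardedModel {

        interface Model {
            Prediction predict(double[] features); // hypothetical underlying model
        }

        record Prediction(String label, double confidence) {}

        private final Model model;
        private final double minConfidence;

        GuardedModel(Model model, double minConfidence) {
            this.model = model;
            this.minConfidence = minConfidence;
        }

        /** "If we don't know, don't predict": empty result below the confidence threshold. */
        Optional<Prediction> predictOrAbstain(double[] features) {
            Prediction p = model.predict(features);
            return p.confidence() >= minConfidence ? Optional.of(p) : Optional.empty();
        }
    }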

    Missing data or wrong data means unwittingly solving the wrong problem. In the real world, our model will fail. It’s okay to approach failure with humility, take a step back, and use natural human intelligence to evaluate whether we can arrive at a more valuable human solution. This humility will help us better articulate what we are solving for. Maybe it will lead us to realize that we missed something in the data or asked the wrong questions of it.

    Algorithms merely automate, approximate, and interpolate. Extrapolation is the dangerous part.

    Try not to report an AI miracle until we understand when the model will fail.

    Fairness at the cost of privacy?

    What are the practical implications? If I am looking to build a model that grants the highest possible fairness across my data set, I need to calculate at what point the model is unfair. Having information like gender, race, and income within the data set provides more transparency into how fairness is defined within that specific dataset. Baffling as it may be, businesses that are not honest about how this type of data influences their models, hiding instead behind well-intentioned data-privacy conventions, can legitimately refuse transparency into their algorithmic predictions on the grounds of anti-discrimination.

    In this way, an algorithm whose original purpose was, for example, to generate greater fairness among demographics in the housing market could become the basis for intensified segregation and systemic racism.

    This is ethically debased and demands a solution, something this post is far from providing. Suffice it to say: honest digital business looks different.

    At the very least, we need to identify sensitive variables and do our best to correct for them. This means we must do everything we can to better understand the data going into our models.

    If the predictions coming out of your model are your responsibility, so too should be the data going into the model.

    Rediscover a Whole new World — Design-Thinking

    Having this knowledge raises the stakes of machine learning! At the same time, approaching machine learning and AI in this way reopens a whole world of design thinking (read Andreas Wagner’s interpretation of a findability score to get an idea of what I mean!). Suddenly, we are once again the creators of our own design, no longer blindly plugging data into models whose outcomes we are powerless to influence. Understanding and giving merit to the human intelligence behind the models we use positions us to ask critical questions of the data we plug into them.

    As a result, we can move away from a OneSizeForAll().fit() or a generic model().fit(), and toward more meaningful, bespoke models: tailor.model().

    In this way, we increase how articulate a system is while at the same time answering questions about assumptions without resorting to basic metrics.

    From this perspective, making a model is: learning from data x whatever constraints I have.

    Maybe we should start learning to accept that model.fit() is incredibly naive. Perhaps we would be better served if we began approaching machine learning like people in physics: study a system until it becomes clear which model will explain everything.

    Vincent Warmerdam

    Most importantly

    Take a step back and consider which use case your model should be a proxy for. Does it mimic its real-world, naturally intelligent counterpart? Or is your model out to lunch concerning real-world application? Beware: you don’t want to be the person designing an algorithm responsible for quoting less-than-fair housing rates due to the number of black people in a neighborhood! What naturally thinking person would do that?

    Natural Intelligence isn’t such a bad thing

    Grant yourself the creative freedom to understand the problem. Your solution design will be better as a result.

    Check out Vincent’s open-source project called scikit-lego (an environment to play around with these different types of strategies in real-world scenarios) and his YouTube video which inspired this blog post.

    Summary

    Artificial intelligence isn’t such a bad thing if we are willing to give credit to the beautiful, natural intelligence of humans. This approach is lacking in our machine learning models today. If intelligently implemented into our models, this natural intelligence approach has excellent potential to deliver more meaningful results.

    We’ll be talking more about the remaining three fixes for artificial stupidity in future posts. Stay with us!!

  • Quick-Start with OCSS – Creating a Silver Bullet

    Quick-Start with OCSS – Creating a Silver Bullet

    Last week, I took pains to share with you my experience building Elasticsearch Product Search Queries. I explained there is no silver bullet. And if you want excellence, you’ll have to build it. And that’s tough. Today, I want to show how our OCSS Quick-Start endeavors to do just that. So, here you have it: a Quick-Start framework to ensure Elasticsearch Product Search performs at an exceptional level, as it ought.

    How-To Quick-Start with OCSS

    Do you have some data you can get your hands on? Let’s begin by indexing it and trying to work with it. To quickly start with OCSS, you need docker-compose. Check out at least the “operations” folder of the project and run docker-compose up inside its “docker-compose” folder. It might also be necessary to run docker-compose restart indexer, since the indexer will fail to set up properly if the Elasticsearch container is not ready at the start.

    You’ll find a script to index CSV data into OCSS in the “operations” folder. Run it without parameters to view all options. Now, use this script to push your data into Elasticsearch. With the “preset” profile of the docker-compose setup active by default, data fields like “EAN,” “title,” “brand,” “description,” and “price” are indexed for search and facet usage. Have a look at the “preset” configuration if more fields need to be indexed for search or faceting.

    Configure Query Relaxation

    True to the OCSS Quick-Start philosophy, the “preset” configuration already comes with various query stages. Let’s take a look at it; afterward, you should be able to configure your own query logic.

    How to configure “EAN-search” and “art-nr-search”

    The first two query configurations “EAN-search” and “art-nr-search” are very similar:

    				
    ocs:
      default-tenant-config:
        query-configuration:
          ean-search:
            strategy: "ConfigurableQuery"          1️⃣
            condition:                             2️⃣
              matchingRegex: "\s*\d{13}\s*(\s+\d{13})*"
              maxTermCount: 42
            settings:                              3️⃣
              operator: "OR"
              tieBreaker: 0
              analyzer: "whitespace"
              allowParallelSpellcheck: false
              acceptNoResult: true
            weightedFields:                        4️⃣
              "[ean]": 1
          art-nr-search:
            strategy: "ConfigurableQuery"          1️⃣
            condition:                             2️⃣
              matchingRegex: "\s*(\d+\w?\d+\s*)+"
              maxTermCount: 42
            settings:                              3️⃣
              operator: "OR"
              tieBreaker: 0
              analyzer: "whitespace"
              allowParallelSpellcheck: false
              acceptNoResult: true
            weightedFields:                        4️⃣
              "[artNr]": 2
              "[masterNr]": 1.5
    				
    

    1️⃣ OCSS distinguishes between several query strategies. The “ConfigurableQuery” is the most flexible and exposes several Elasticsearch query options (more to come). See further query strategies below.

    2️⃣ The condition clause configures when to use a query. These two conditions (“matchingRegex” and “maxTermCount”) specify that the user input must match a specific regular expression and contain at most 42 terms. (A user query is split by whitespace into separate “terms” in order to verify this condition.)

    3️⃣ The “settings” govern how the query is built and how it should be used. These settings are documented in the QueryBuildingSettings. Not all settings are supported by all strategies, and some are still missing; this is subject to change. “acceptNoResult” is essential here: if a numeric string does not match the relevant fields, no other query is sent to Elasticsearch, and no results are returned to the client.

    4️⃣ Use the “weightedFields” property to specify which fields should be searched with a given query. Non-existent fields will be ignored with a minor warning in the logs.

    How to configure “default-query” the OCSS Quick-Start way

    Next, the “default-query” is available to catch most queries:

    				
    ocs:
      default-tenant-config:
        query-configuration:
          default-query:
            strategy: "ConfigurableQuery"
            condition:                            1️⃣
              minTermCount: 1
              maxTermCount: 10
            settings:
              operator: "AND"
              tieBreaker: 0.7
              multimatch_type: "CROSS_FIELDS"
              analyzer: "standard"                2️⃣
              isQueryWithShingles: true           3️⃣
              allowParallelSpellcheck: false      4️⃣
            weightedFields:
              "[title]": 3
              "[title.standard]": 2.5             5️⃣
              "[brand]": 2
              "[brand.standard]": 1.5
              "[category]": 2
              "[category.standard]": 1.7
    				
    

    1️⃣ This condition applies to all queries with one to ten terms. The limit of 10 is arbitrary and can, naturally, be increased, depending on your users’ search patterns.

    2️⃣ “Analyzer” uses the “standard” analyzer on search terms. This means it applies stemming and stopwords. These analyzed terms are then searched within the various fields and subfields (see point #5 below). Simultaneously, the “quote analyzer” is set to “whitespace” to match search phrases exactly.

    3️⃣ The option “isQueryWithShingles” is a unique feature we implemented in OCSS. It combines neighboring terms into single tokens and searches them alongside the individual terms, giving the combined form nearly double the weight. The goal is to also find compound words in the data.

    Example: “living room lamp” will result in “(living room lamp) OR (livingroom^2 lamp)^0.9 OR (living roomlamp^2)^0.9”.

    4️⃣ “allowParallelSpellcheck” is set to false here because spellchecking requires extra time, which we don’t want to waste in the majority of cases where users spell their query correctly. If enabled, a parallel “suggest query” is sent to Elasticsearch. If the first try yields no results and it’s possible to correct some terms, the same query is fired again using the corrected words.

    5️⃣ As you can see here, subfields can be uniquely applied congruent to their function.

    How to configure additional query strategies

    I will not go into any great detail regarding the following query stages configured within the “preset” configuration. They are all quite similar; here are just a few notes concerning the additionally available query strategies.

    • DefaultQueryBuilder: This query tries to balance precision and recall using a minShouldMatch value of 80% and automatic fuzziness. Use if you don’t have the time to configure a unique default query.
    • PredictionQuery: This is a special implementation that necessitates a blog post all its own. Simply put, this query performs an initial query against Elasticsearch to determine which terms match well. The final query is built based on the returned data. As a result, it might selectively remove terms that would, otherwise, lead to 0 results. Other optimizations are also performed, including shingle creation and spell correction. It’s most suitable for multi-term requests.
    • NgramQueryBuilder: This query builder divides the input terms into short chunks and searches them within the analyzed fields in the same manner. In this way, even partial matches can return results. This is a very sloppy approach to search and should only be used as a last resort to ensure products are shown instead of a no-results page.

    How to configure your own query handling

    Now, use the “application.search-service.yml” to configure your own query handling:

    				
    ocs:
      tenant-config:
        your-index-name:
          query-configuration:
            your-first-query:
              strategy: "ConfigurableQuery"
              condition:
                # ..
              settings:
                #...
              weightedFields:
                #...
    				
    

    As you can see, we are trying our best to give you a quick-start with OCSS. It already comes pre-packed with excellent queries, preset configurations, and the ability to use query relaxation without touching a single line of code. And that’s pretty sick! I’m looking forward to increasing the power behind the configuration and leveraging all Elasticsearch options.

    Stay tuned for more insights into OCSS.

    And if you haven’t noticed already, all the code is freely available. Don’t hesitate to get your hands dirty! We appreciate Pull Requests! 😀

  • My Journey Building Elasticsearch for Retail

    My Journey Building Elasticsearch for Retail

    If, like me, you’ve taken the journey that is building an Elasticsearch retail project, you’ve inevitably experienced many challenges. Challenges like: how do I index data, use the query API to build facets, page through the results, or sort them? One aspect of optimization that frequently receives too little attention is the correct configuration of search analyzers, which define how your data and your queries are tokenized and normalized. Admittedly, it isn’t straightforward!

    The Elasticsearch documentation provides good examples for every kind of query and explains which query is best for a given scenario. For example, “Phrase Match” queries find matches where the search terms appear together as a phrase. Or “Multi Match” with the “most_fields” type, which is “useful when querying multiple fields that contain the same text analyzed in different ways”.

    All sounds good to me. But how do I know which one to use, based on the search input?

    Elasticsearch works like cogs within a Rolex

    Where to Begin? Search query examples for Retail.

    Let’s pretend we have a data feed for an electronics store. I will demonstrate a few different kinds of search inputs. Afterward, I will briefly describe how search should work in each case.

    Case #1: Product name.

    For example: “MacBook Air”

    Here we want to have a query that matches both terms in the same field, most likely the title field.

    Case #2: A brand name and a product type

    For example: “Samsung Smartphone”

    In this case, we want each term to match a different field: brand and product type. Additionally, we want to find both terms as a pair. Modifying the query in this way prevents other smartphones or other Samsung products from appearing in the result.

    Case #3: The specific query that includes attributes or other details

    For example: “notebook 16 GB memory”

    This one is tricky because you want “notebook” to match the product type, or perhaps a category with that name. On the other hand, you want “16 GB” to match the memory attribute field as a unit. The number “16” shouldn’t match some model number or other attribute.

    For example, a “MacBook Pro 16 inch” is also in the “notebook” category and has some “GB” of “memory”. To further complicate matters, the searchable texts might not contain the term “memory” at all, because it’s the name of the attribute rather than part of its value.

    As you might guess, there are many more. And we haven’t even considered word composition, synonyms, or typos yet. So how do we build one query that handles all cases?

    Know where you come from to know where you’re headed

    Preparation

    Before striving for a solution, take two steps back and prepare yourself.

    Analyze your data

    First, take a closer look at the data in question.

    • How do people search on your site?
    • What are the most common query types?
    • Which data fields hold the required content?
    • Which data fields are most relevant?

    Of course, it’s best if you already have a site search running and can, at least, collect query data there. If you don’t have site search analytics, even access logs will do the trick. Moreover, be sure to measure which queries work well and which do not provide proper results. More specifically, I recommend taking a closer look at how to implement tracking, analysis, and evaluation.

    You are welcome to contact us if you need help with this step. We enjoy learning new things ourselves. Adding searchHub to your mix gives you a tool that combines different variations of the same queries (compound & spelling errors, word order variations, etc.). This way, you get a much better view of popular queries.

    Track your progress

    You’ll achieve good results for the respective queries once you begin tuning them. But don’t get complacent about the ones you’ve already solved! Later optimizations can easily break the queries you previously fixed.

    The solution might simply be to document all those queries. Write down the examples you used, what was wrong with the result before, and how you solved it. Then, perform regression tests on the old cases, following each optimization step.

    Take a look at Quepid if you’re interested in a tool that can help you with that. Quepid helps keep track of optimized queries and checks the quality after each optimization step. This way, you immediately see if you’re about to break something.

    The fabled, elusive silver-bullet.

    The Silver-Bullet Query

    Now, let’s get it done! Let me show you the perfect query that solves all your problems…

    Ok, I admit it, there is none. Why? Because it heavily depends on the data and all the ways people search.

    Instead, I want to share my experience with these types of projects and, in so doing, present our approach to search with Open Commerce Search Stack (OCSS):

    Similarity Setting

    When dealing with structured data, the Elasticsearch scoring algorithms TF/IDF and BM25 will most likely screw things up. These approaches work well for full-text search, like Wikipedia articles or other kinds of content. And, in the unfortunate case where your product data is smashed into one or two fields, you might also find them helpful. With OCSS (Open Commerce Search Stack), however, we took a different approach and set the similarity to “boolean”. This change makes it much easier to comprehend the scores of the retrieved results.
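    For illustration, this is roughly how the default similarity could be switched to “boolean” when creating an index with the Elasticsearch high-level REST client. The index name is just an example; this is a sketch, not the actual OCSS code.

    import java.io.IOException;

    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.common.settings.Settings;

    public class IndexCreator {

        /** Creates an index whose default similarity is "boolean" instead of BM25. */
        public void createProductIndex(RestHighLevelClient client) throws IOException {
            CreateIndexRequest request = new CreateIndexRequest("products")
                    .settings(Settings.builder()
                            .put("index.similarity.default.type", "boolean"));
            client.indices().create(request, RequestOptions.DEFAULT);
        }
    }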

    Multiple Analyzers

    Let Elasticsearch analyze your data using different types of analyzers. Do as little normalization as possible and as much as necessary for your base search fields. Use an analyzer that doesn’t remove information; by this I mean no stemming, stop words, or anything like that. Instead, create sub-fields with different analyzer approaches. These “base fields” should always have a greater weight at search time than their analyzed counterparts.

    The following shows how we configure search data mappings within OCSS:

    				
    {
      "search_data": {
        "path_match": "*searchData.*",
        "mapping": {
          "norms": false,
          "fielddata": true,
          "type": "text",
          "copy_to": "searchable_numeric_patterns",
          "analyzer": "minimal",
          "fields": {
            "standard": {
              "norms": false,
              "analyzer": "standard",
              "type": "text"
            },
            "shingles": {
              "norms": false,
              "analyzer": "shingles",
              "type": "text"
            },
            "ngram": {
              "norms": false,
              "analyzer": "ngram",
              "type": "text"
            }
          }
        }
      }
    }
    				
    			
    Analyzers used above explained

    Let’s break down the different types of analyzers used above.

    • The base field uses a customized “minimal” analyzer that removes HTML tags and non-word characters, transforms the text to lowercase, and splits it by whitespace.
    • With the subfield “standard”, we use the “standard analyzer”, which is responsible for stemming, stop words, and the like.
    • With the subfield “shingles”, we deal with unwanted compounding within search queries. For example, someone searches for “jackwolfskin”, but it’s actually “jack wolfskin”.
    • With the subfield “ngram”, we split the search data into small chunks. We use that if our best-case query doesn’t find anything; more about that in the next section, “Query Relaxation”.
    • Additionally, we copy the content to the “searchable_numeric_patterns” field, which uses an analyzer that removes everything but numeric attributes, like “16 inch”.

    The most powerful Elasticsearch Query

    Use the “query string query” to build your final Elasticsearch query. This query type gives you all the features of all other query types, so you can optimize your single query without needing to switch to another query type. However, you should strip “syntax tokens” from the user input first; otherwise, you might end up with an invalid search query.

    Alternatively, use the “simple query string query,” which can also handle most cases if you’re uncomfortable with the above method.

    My recommendation is to use the “cross_fields” type. It’s not suitable for all kinds of data and queries, but it returns good results in most cases. Additionally, place the search text into quotes and use a different quote_analyzer so that the quoted phrase is not analyzed with the same analyzer as the rest of the input. If the quoted string receives a higher weight, results with a matching phrase are boosted. This is how the query string could look: “search input”^2 OR search input.
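    Sketched with the Elasticsearch Java high-level client, and with purely illustrative field names, boosts, and escaping, such a query could be assembled roughly like this:

    import org.elasticsearch.index.query.Operator;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.query.QueryStringQueryBuilder;

    public class RetailQueryFactory {

        /** Sketch: boost the exact phrase over the analyzed terms, roughly "<input>"^2 OR <input>. */
        public QueryStringQueryBuilder buildQuery(String userInput) {
            // Naively strip query_string syntax tokens so user input cannot break the query;
            // production code needs a more careful escaper.
            String cleaned = userInput.replaceAll("[+\\-=&|><!(){}\\[\\]^\"~*?:\\\\/]", " ").trim();

            String queryString = "\"" + cleaned + "\"^2 OR (" + cleaned + ")";

            return QueryBuilders.queryStringQuery(queryString)
                    .defaultOperator(Operator.AND)
                    .tieBreaker(0.7f)
                    .analyzer("standard")        // analyze the unquoted part
                    .quoteAnalyzer("whitespace") // match the quoted phrase more literally
                    .field("title", 3f)          // illustrative field weights
                    .field("brand", 2f)
                    .field("category", 2f);
        }
    }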

    And remember, since there is no “one query to rule them all,” use query relaxation.

    How do I use Query Relaxation?

    After optimizing a few dozen queries, you realize you have to make some compromises. It’s almost impossible to find a single query that works for all searches.

    For this reason, most implementations I’ve seen opt for the “OR” operator, which allows a single term to match when there are multiple terms in the search input. The issue here is that you still end up with results that only partially match. It’s possible to combine the “OR” operator with a “minimum_should_match” definition to boost documents that match more terms to the top and to control the behavior.

    Nevertheless, this may have some unintended consequences. First, it could pollute your facets with irrelevant attributes. For example, the price slider might show a low price range just because the result contains unrelated cheap products. It may also have the unwanted effect of making ranking the results according to business rules more difficult. Irrelevant matches might rank toward the top simply because of their strong scoring values.

    So instead of the silver-bullet query – build several queries!

    Relax queries, divide the responsibility, use several

    The first query is the most accurate and works for most searches while avoiding unnecessary matches. If it leads to zero results, run a second query that is sloppier and allows partial matches. This more flexible approach should work for the majority of the remaining queries. For the rest, try a third query. Within OCSS, we use the “ngram” query at the final stage, which allows for partial word matches.
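    As a rough sketch of this cascade with the Elasticsearch Java high-level client (the index name is a placeholder, and building the individual query stages is left out), the relaxation loop could look like this:

    import java.io.IOException;
    import java.util.List;

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class RelaxingSearcher {

        private final RestHighLevelClient client;
        private final String index;

        public RelaxingSearcher(RestHighLevelClient client, String index) {
            this.client = client;
            this.index = index;
        }

        /** Try the most precise query first; fall back to sloppier stages only on zero hits. */
        public SearchResponse search(List<QueryBuilder> stages) throws IOException {
            SearchResponse lastResponse = null;
            for (QueryBuilder stage : stages) {
                SearchRequest request = new SearchRequest(index)
                        .source(new SearchSourceBuilder().query(stage));
                lastResponse = client.search(request, RequestOptions.DEFAULT);
                if (lastResponse.getHits().getTotalHits().value > 0) {
                    return lastResponse; // good enough, stop relaxing
                }
            }
            return lastResponse; // all stages returned zero hits
        }
    }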

    “But sending three queries to Elasticsearch will take so much time,” you might think. Well, yes, it adds some overhead. At the same time, it will only be necessary for about 20% of your searches. Also, zero-match responses come back relatively fast; they are calculated quickly on 0 results, even if you request aggregations.

    Sometimes, it’s even possible to decide in advance which query works best and pick it right away. For example, identifying a numeric search is easy, so it’s simple to search only numeric fields. Likewise, single-term searches can be handled in a dedicated way, as there is no second term to analyze. Try to improve this process even further by using an external spell-checker like SmartQuery and a query-caching layer.

    Conclusion

    I hope you’re able to learn from my many years of experience and from my mistakes. Frankly, praying your life away (e.g., googling till the wee hours of the morning), hoping and waiting for a silver-bullet query, is entirely useless and a waste of time. Learning to combine different query analysis types, and accepting realistic compromises, will bring you closer, faster, to your desired outcome: search results that convert more visitors, more of the time, than what you previously had.

    We’ve shown you several types of analyzers and queries that will bring you a few steps closer to this goal today. Strap in and tune in next week to find out more about OCSS if you are interested in a more automated version of the above.

  • How To DIY Site search analytics – made easy

    How To DIY Site search analytics – made easy

    In my first post, I talked about the importance of site search analytics for e-commerce optimization. In this follow-up, I would like to show one way to easily build a site search analytics system at scale, without spending much time and effort on answering these ever-present questions:

    1. Which database is best for analytics?
    2. How do I operate that database at scale?
    3. What are the operating costs for the database?

    How-To Site-Search Analytics without the Headache

    These questions are important and necessary. Thankfully, in the age of cloud computing, others have already thought about them and found solutions that abstract away the complexity. One of them is Amazon Athena, which will help us build a powerful analysis tool from, in the simplest case, plain CSV files. Amazon Athena, explained in its own words:

    Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Amazon Athena

    This introductory sentence from the Amazon website already answers our questions 1 and 2. All that remains is to answer question 3: how much does it cost? This is answered quickly enough:

    • $5.00 per TB of data scanned by Athena
    • Standard AWS S3 rates for storage, requests, and data transfer

     

    AWS offers a calculator to roughly estimate the cost. Because Amazon Athena uses Presto under the hood, it works with a variety of data formats, including CSV, JSON, ORC, Apache Parquet, and Apache Avro. Choosing the right file format can save you up to a third of the cost.

    No data, no DIY analytics

    A site search analytics tool requires a data foundation. Either data from an e-commerce system or from any site search tracking tool like the searchhub search-collector will suffice. For now, we will focus on how to convert the data into the best possible format and leave the question of how to extract data from the various systems for a separate post.

    Since the database needn’t scan a complete row, only the columns referenced in the SQL query, a columnar data format is preferred to achieve optimal read performance. To reduce overall size, the file format should also support data compression. In the case of Athena, this means we can choose between ORC, Apache Parquet, and Apache Avro. The company bryteflow provides a good comparison of these three formats here. These file formats are efficient and intelligent; nevertheless, they make it hard to inspect the data in a humanly readable way. For this reason, consider adding an intermediate file format to your ETL pipeline and use it to store the original data in an easy-to-read format like CSV or JSON. This will make your life easier when debugging any strange-looking query results.

    What are we going to build?

    We’ll now build a minimal Spring Boot web application that is capable of the following:

    1. Creating dummy data in a humanly readable way
    2. Converting that data into Apache Parquet
    3. Uploading the Parquet files to AWS S3
    4. Querying the data from AWS Athena via the Athena JDBC driver, using JOOQ to create type-safe SQL queries.

    Creating the application skeleton

    Head over to Spring initializr and generate a new application with the following dependencies:

    • Spring Boot DevTools
    • Lombok
    • Spring Web
    • JOOQ Access Layer
    • Spring Configuration Processor

    Hit the generate button to download the project. Afterward, you need to extract the zip file and import the maven project into your favorite IDE.

    Our minimal database table will have the following columns:

    1. query
    2. searches
    3. clicks
    4. transactions

     

    To build type-safe queries with JOOQ, we will use the jooq-codegen-maven plugin, which generates the necessary code for us. The plugin can be configured to generate code based on SQL DDL commands. Create a file called jooq.sql inside src/main/resources/db and add the following content to it:

    				
    CREATE TABLE analytics (
        query VARCHAR,
        searches INT ,
        clicks INT,
        transactions INT,
        dt VARCHAR
    );
    				
    			

    Next, add the plugin to the existing build/plugins section of our projects pom.xml:

    				
    <plugin>
        <groupId>org.jooq</groupId>
        <artifactId>jooq-codegen-maven</artifactId>
        <executions>
            <execution>
                <id>generate-jooq-sources</id>
                <phase>generate-sources</phase>
                <goals>
                    <goal>generate</goal>
                </goals>
                <configuration>
                    <generator>
                        <generate>
                            <pojos>true</pojos>
                            <pojosEqualsAndHashCode>true</pojosEqualsAndHashCode>
                            <javaTimeTypes>true</javaTimeTypes>
                        </generate>
                        <database>
                            <name>org.jooq.meta.extensions.ddl.DDLDatabase</name>
                            <inputCatalog></inputCatalog>
                            <inputSchema>PUBLIC</inputSchema>
                            <outputSchemaToDefault>true</outputSchemaToDefault>
                            <outputCatalogToDefault>true</outputCatalogToDefault>
                            <properties>
                                <property>
                                    <key>sort</key>
                                    <value>semantic</value>
                                </property>
                                <property>
                                    <key>scripts</key>
                                    <value>src/main/resources/db/jooq.sql</value>
                                </property>
                            </properties>
                        </database>
                        <target>
                            <clean>true</clean>
                            <packageName>com.example.searchinsightsdemo.db</packageName>
                            <directory>target/generated-sources/jooq</directory>
                        </target>
                    </generator>
                </configuration>
            </execution>
        </executions>
        <dependencies>
            <dependency>
                <groupId>org.jooq</groupId>
                <artifactId>jooq-meta-extensions</artifactId>
                <version>${jooq.version}</version>
            </dependency>
        </dependencies>
    </plugin>
    				
    			

    The IDE may require the maven project to be updated before it can be recompiled. Once done, you should be able to see the generated code under target/generated-sources/jooq.

    Before creating SQL queries with JOOQ, we first need a DSLContext backed by an SQL connection to AWS Athena. This assumes we have a corresponding Athena JDBC driver on our classpath. Unfortunately, Maven Central provides only an older version (2.0.2) of the driver, which isn’t an issue for our demo. For production, however, you should use the most recent version from the AWS website and publish it to your Maven repository, or add it as an external library to your project if you don’t have a repository. Now, we need to add the following dependency to our pom.xml:

    				
    <dependency>
        <groupId>com.syncron.amazonaws</groupId>
        <artifactId>simba-athena-jdbc-driver</artifactId>
        <version>2.0.2</version>
    </dependency>
    				
    			

    Under src/main/resources rename the file application.properties to application.yml and paste the following content into it:

    				
    spring:
      datasource:
        url: jdbc:awsathena://<REGION>.amazonaws.com:443;S3OutputLocation=s3://athena-demo-qr;Schema=demo
        username: ${ATHENA_USER}
        password: ${ATHENA_SECRET}
        driver-class-name: com.simba.athena.jdbc.Driver
    				
    			

    This will auto-configure a JDBC connection to Athena, and Spring will provide a DSLContext bean which we can auto-wire into our service class. Please note that I assume you have an AWS IAM user with access to S3 and Athena. Do not store sensitive credentials in the configuration file; rather, pass them as environment variables to your application. You can do this easily if you are working with Spring Tool Suite: simply select the demo application in the Boot Dashboard, click the pen icon to open the launch configuration, navigate to the Environment tab, and add entries for ATHENA_USER and ATHENA_SECRET.

    Please also note the datasource URL property in the application.yml, where you need to replace the following placeholders with proper values:

    1. REGION: The region you created your Athena database in. We will cover this step shortly.
    2. S3OutputLocation: The bucket where Athena will store query results.
    3. Schema: The name of the Athena database we are going to create shortly.

     

    We are almost ready to start our Spring Boot application. However, our Athena database is still missing, and the application won't start without it.

    Creating the Athena database

    Log in to the AWS console and navigate to the S3 service. Hit the Create bucket button and choose a name. You won't be able to use the same bucket name as in this tutorial because S3 bucket names must be unique, but the concept should be clear. For this tutorial, we will use the name search-insights-demo and skip any further configuration. This is the location to which we will later upload our analytics files. Press Create bucket, and navigate over to the Athena service.

    Paste the following SQL command into the New query 1 tab:

    CREATE DATABASE IF NOT EXISTS demo;

    Hit Run query. The result should look similar to this:

    Now that we have successfully created a database, open the Database drop-down on the left-hand side and select it. Next, we create a table by running the following query:

    				
    					CREATE EXTERNAL TABLE IF NOT EXISTS analytics (
        query STRING,
        searches INT ,
        clicks INT,
        transactions INT
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://search-insights-demo/'
    				
    			

    The result should look similar to this:

    Please note some important details here:

    1. We partition our table by a string called dt. By partitioning, we can restrict the amount of data scanned by each query, which improves performance and reduces cost. Analytics data can be partitioned perfectly into daily slices (see the query sketch after this list).
    2. We state that our stored files are in Apache Parquet format.
    3. We point the table to the previously created S3 bucket. Please adjust the name to the one you have chosen. Important: the location must end with a slash, otherwise you will face an IllegalArgumentException.
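
    To illustrate the effect of partitioning, here is a minimal sketch of how a partition-pruned count could look later on with the JOOQ classes generated from jooq.sql. It assumes that the DDL in jooq.sql mirrors the Athena table and therefore exposes a DT column on the generated ANALYTICS table; adjust the names to whatever your jooq.sql actually defines.

    import static com.example.searchinsightsdemo.db.tables.Analytics.ANALYTICS;

    import org.jooq.DSLContext;

    public class PartitionedCountExample {

        private final DSLContext context;

        public PartitionedCountExample(DSLContext context) {
            this.context = context;
        }

        // Counts only the rows of one daily partition, e.g. "2021-05-01".
        // Athena then scans just that partition's data instead of the whole table.
        public int countForDay(String dt) {
            return context.fetchCount(ANALYTICS, ANALYTICS.DT.eq(dt));
        }
    }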

    Adding the first query to our application

    Now that everything is set up, we can add a REST controller to our application that counts all records in our table. Naturally, we expect the result to be 0, as we have yet to upload any data, but this is enough to prove that everything is working.

    Now, return to the IDE and, in the package com.example.searchinsightsdemo.service, create a new class called AthenaQueryService and paste the following code into it:

    				
    package com.example.searchinsightsdemo.service;

    import static com.example.searchinsightsdemo.db.tables.Analytics.ANALYTICS;

    import org.jooq.DSLContext;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    public class AthenaQueryService {

        // DSLContext auto-configured by Spring Boot from the Athena datasource settings
        @Autowired
        private DSLContext context;

        // Counts all records in the generated ANALYTICS table
        public int getCount() {
            return context.fetchCount(ANALYTICS);
        }
    }
    				
    			

    Note that we auto-wire the DSLContext which Spring Boot has already auto-configured based on our settings in the application.yml. The service contains a single method that uses the context to execute a fetch-count query on the ANALYTICS table, which the JOOQ code generator has already created (see the static import).

    A Spring service is nothing without a controller exposing it to the outside world, so let’s create a new class, in the package com.example.searchinsightsdemo.rest, called AthenaQueryController. Go there now and add the following:

    				
    package com.example.searchinsightsdemo.rest;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;
    import com.example.searchinsightsdemo.service.AthenaQueryService;

    @RestController
    @RequestMapping("/insights")
    public class AthenaQueryController {
        @Autowired
        private AthenaQueryService queryService;
        @GetMapping("/count")
        public ResponseEntity<Integer> getCount() {
            return ResponseEntity.ok(queryService.getCount());
        }
    }
    				
    			

    Nothing special here. Just some Spring magic that exposes the REST endpoint /insights/count. This in turn calls our service method and returns the results as a ResponseEntity.

    We need to add one more configuration block to the application.yml before launching the application for the first time:

    				
    					logging:
      level:
        org.jooq: DEBUG
    				
    			

    This enables debug logging for JOOQ, so we can see the SQL queries it generates as plain text in our IDE's console.

    That was quite a piece of work. Fingers crossed that the application boots. Give it a try by selecting it in the Boot Dashboard and pressing the run button. If everything works as expected you should be able to curl the REST endpoint via:

    curl -s localhost:8080/insights/count

    The response should match the expected value of 0, and you should be able to see the following log message in your console:

    				
    					Executing query          : select count(*) from "ANALYTICS"
    Fetched result           : +-----+
                             : |count|
                             : +-----+
                             : |    0|
                             : +-----+                                                                  
    Fetched row(s)           : 1   
    				
    			

    Summary

    In this first part of our series, we introduced AWS Athena as a cost-effective way of creating an analytics application. We illustrated how to build this yourself by using a Spring Boot web application and JOOQ for type-safe SQL queries. The application doesn't have any analytics capabilities yet; these will be added in part two, where we create fake data for the database. To achieve this, we will first show how to create Apache Parquet files, partition them by date, and upload them via the AWS S3 Java SDK. Once uploaded, we will look at how to inform Athena about new data.

    Stay tuned and come back soon!

    The source code for part one can be found on GitHub.

  • Part 2: Search Quality for Discovery & Inspiration

    Part 2: Search Quality for Discovery & Inspiration

    Series: Three Pillars of Search Quality in eCommerce

    In the first part of our series, we learned about Search Quality dimensions. We then introduced the Findability metric and explained how it relates to search quality. This metric is helpful when considering how well your search engine handles the information retrieval step. Unfortunately, it completely disregards the emotionally important discovery phase, which is essential for eCommerce as well as retail in general. To better grasp this relationship, we need to understand how search quality influences discovery and inspiration.

    What is the Secret behind the Most Successful high-growth Ecommerce Shops?

    If we analyze the success of high-growth shops, three unique areas set them apart from their average counterparts.

    Photo by Sigmund on Unsplash – if retail could grow like plants

    What Sets High-Growth Retail Apart from the Rest?

    1. Narrative: The store becomes the story

    Your visitors are not inspired by the same presentation of trending products every time they land on your site. What’s the use of shopping if a customer already knows what’s going to be offered (merchandised) to them?

    Customers are intrigued by visual merchandising, which is, in essence, brand storytelling. Done correctly, this will transform a shop into an exciting destination that both inspires and entices shoppers. An effective in-store narrative emotionally sparks customers’ imagination, while leveraging store ambience to transmit the personality of the brand. Perhaps using a “hero” to focus attention on a high-impact collection of bold new items. Or an elaborate holiday display that nudges shoppers toward a purchase.

    Shopping is most fun, and rewarding, when it involves a sense of discovery or journey. Shoppers are more likely to return when they see new merchandise related to their tastes, and local or global trends.

    2. Visibility: What’s seen is sold (from pure retrieval to inspiration)

    Whether in-store or online, visibility encourages retailers to feature items that align with a unique brand narrative. All the while helping shoppers easily and quickly find the items they’re after. The principle of visibility prioritizes which products retailers push the most. Products with a high margin or those exclusive enough to drive loyalty, whether by word of mouth, or social sharing.

    Online, the e-commerce information architecture and sitemap flow help retailers prominently showcase the products most likely to sell. This prevents items from being buried deep in the e-commerce site. Merchandisers use data analytics to know which products are most popular and trending. This influences which items are most prominently displayed. These will be the color palettes, fabrics, and cuts that will wow shoppers all the way to the checkout page.

    So why treat search simply as a functional information retrieval tool? Try rethinking it from the perspective of how a shopper might look for something in a brick and mortar scenario.

    3. Balance: Bringing buyer’s and seller’s interests together in harmony

    In stores and online, successful visual merchandising addresses consumers’ felt needs around things like quality, variety, and sensory appeal. However, deeper emotional aspects like trust are strongly encouraged through online product reviews. These inspire shoppers’ wants: to feel attractive, confident, and hopeful. We can agree that merchandisers’ foremost task is to attend to the merchandise and the associated cues to communicate it properly. It’s necessary to showcase sufficient product variety, while at the same time remaining consistent with the core brand theme. This balancing act requires they strike a happy medium between neither overwhelming nor disengaging their audience.

    An example for the sake of clarity:

    Imagine you are a leading apparel company with a decently sized product catalog. Every day, a few hundred customers come to your site and search for “jeans”. Your company offers over 140 different types of jeans, about 40 different jeans jackets, and roughly 80 jeans shirts.

    Now the big question is: which products deserve the most prominent placement in the search result?

    Indeed, this is a very common challenge for our customers, and yet all of them struggle to address it. But why is it so challenging? Mainly because we are facing a multi-dimensional and multi-objective optimization problem.

    1. When we receive a query like “jeans”, it is not 100% clear what the user is looking for. Trousers, jackets, shirts – we just don’t know. As a result, we have to make some assumptions and present different paths for users to discover the desired information or receive the inspiration they need. In other words: for the most probable product types “k” and the given query, we need to identify related products.
    2. Once we have found the most probable set of product types, we need to determine which products are displayed at the top of each corresponding result set. Which pairs of jeans, which jeans jackets, which jeans shirts? Or, more formally: for each product type “k”, find the top-“n” products related to this product type and the given query.

    Or in simple words: diversify the result set into multiple result sets. Then, learn to rank them independently.
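
    To make this concrete, here is a minimal sketch of such a grouping step, assuming each retrieved product carries a product type and the relevance score assigned by the engine. All class and field names are purely illustrative; this is not the API of any particular platform.

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ResultDiversifier {

        public static class Product {
            final String id;
            final String productType; // e.g. "jeans trousers", "jeans jacket", "jeans shirt"
            final double score;       // relevance weight assigned by the search engine

            Product(String id, String productType, double score) {
                this.id = id;
                this.productType = productType;
                this.score = score;
            }
        }

        // Splits one flat result list into per-product-type result sets and keeps
        // the top-n products of each set, each set ranked independently by score.
        public Map<String, List<Product>> diversify(List<Product> retrieved, int n) {
            return retrieved.stream().collect(Collectors.groupingBy(
                    p -> p.productType,
                    LinkedHashMap::new,
                    Collectors.collectingAndThen(
                            Collectors.toList(),
                            group -> group.stream()
                                    .sorted(Comparator.comparingDouble((Product p) -> p.score).reversed())
                                    .limit(n)
                                    .collect(Collectors.toList()))));
        }
    }

    Each map entry can then be rendered as its own row or tile group, which is exactly the diversified presentation discussed below.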

    Now, you may think this is exactly what a search & discovery platform was built for. But unfortunately, 99% of these platforms are designed to work as single-dimension-rank applications. They retrieve documents for a given query, assign weights to the retrieved documents, and finally rank these documents by weight. This dramatically limits your ability to rank the retrieved documents by your own set of, potentially, completely different dimensions. This is the reason most search results for generic terms tend to look messy. Let’s visualize this scenario to clarify what I mean by “messy”.

    You will agree that the image on the left-hand side is pretty difficult for a user to process and understand, even if the ranking is mathematically correct. The reason for this is simple: the underlying natural grouping of product types is lost to the user.

    Diversification of a search for “jeans”

    Now, let’s take a look at a different approach. On the right-hand side, you will notice that we diversify the search result while maintaining the natural product type grouping. Doesn’t this look more intuitive and visually appealing? I will assume you agree. After all, this is the most prominent type of product presentation retail has used over the last 100 years.

    Grouping products based on visual similarity

    You may argue that the customer could easily narrow the offering with facets/filters. Data reveals, however, that this is not always the case – even less so on mobile devices. The big conundrum is that you have no clue what the customer wants: to be inspired, to be guided in the buying process, or just to quickly transact. Additionally, you never know for sure what type of customer you are dealing with – even with the hot new stuff called “personalization”, which unfortunately fails frequently. Using visual merchandising puts us into conversation with the customer: we ask them to confirm their interests by choosing a “product type”. Yet another reason why diversification is important.

    Still not convinced this is what separates high-growth retail from the rest?

    Here is another brilliant example of how you could use the natural grouping by product type to diversify your result. This time, let’s take a look at a seasonal topic – another very challenging task. This way, we give customers the perfect starting point to explore your assortment.

    Row-based diversification – explore product catalog

    If you have ever tried creating such a page, with a single search request, you know this is almost an impossible task. Not to mention trying to maintain the correct facet counts, product stock values, etc.

    However, the approach I am presenting offers much more. This type of result grouping also solves another well-known problem: the multi-objective optimization ranking problem. This makes the approach truly game-changing.

    What’s a Multi-Objective Optimization Problem?

    Never heard of it? Pretend for a moment you are the customer. This time you’re browsing a site searching for “jeans”. The type you have in mind is something close to trousers. Unaware of all the different types of jeans the shop has to offer, you have to go rogue. This means navigating your way through new territory to the product you are most interested in. Using filters and various search terms for things like color, shape, price, size, fabric, and the like. Keep in mind that you can’t be interested in what you can’t see. At the same time, you may be keeping an eye on the best value for your money.

    We now turn the tables and pick up from the seller’s perspective. As a seller, you want to present products ranked based on stock, margin, and popularity. If you run a well-oiled machine, you may even throw in some fancy Customer Lifetime Value models.

    So, our job is to strike the right balance between the seller’s goals and the customer’s desire. The methodology that attempts to strike such a balance is called the multi-objective optimization problem in ranking.
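
    Before we get to the visualization, here is a minimal sketch of one common way to express such a balance: scoring every product within an already diversified group by a weighted blend of buyer- and seller-oriented signals. The signals, field names, and weights are purely illustrative assumptions, not a prescription for any particular platform.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class BalancedGroupRanker {

        public static class Product {
            final String id;
            final double relevance;  // buyer: how well the product matches the query (normalized 0..1)
            final double margin;     // seller: normalized net margin (0..1)
            final double popularity; // both: normalized demand signal (0..1)

            Product(String id, double relevance, double margin, double popularity) {
                this.id = id;
                this.relevance = relevance;
                this.margin = margin;
                this.popularity = popularity;
            }
        }

        // Illustrative weights; tuning them is exactly the balancing act described above.
        private static final double W_RELEVANCE = 0.6;
        private static final double W_MARGIN = 0.25;
        private static final double W_POPULARITY = 0.15;

        static double score(Product p) {
            return W_RELEVANCE * p.relevance + W_MARGIN * p.margin + W_POPULARITY * p.popularity;
        }

        // Ranks one product-type group (e.g. all jeans trousers) on its own.
        static List<Product> rankGroup(List<Product> group) {
            return group.stream()
                    .sorted(Comparator.comparingDouble(BalancedGroupRanker::score).reversed())
                    .collect(Collectors.toList());
        }
    }

    Because each diversified group is ranked on its own, such weights could even differ per product type without disturbing the overall presentation.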

    Let’s use a visualization to illustrate a straightforward solution to the problem, by a diversified result-set grouping.

    Row-based ranking diversification

    Interested in how this approach could be integrated into your Search & Discovery Platform? Reach out to us @searchHub. The beta test phase of the Visual-Merchandising open-source module for our OCSS (Open Commerce Search Stack) begins soon. We hope it will soon help deliver more engaging and joyful digital experiences.

    High-Street Visual Merchandising Wisdom Come Home to Roost

    None of this is new; it has simply never found its way into digital retailing. For decades, finding the right diversified set of products to attract window shoppers, paired with the right location, was the undisputed most important skill in classical high street retail. Later, this type of shopping engagement was termed “Visual Merchandising”: the process of closing the gap between what the seller wants to sell and what the customer will buy. And, of course, how best to manufacture that desire.

    Visual merchandising is one of the most sustainable, as well as differentiating, core assets of the retail industry. Nevertheless, it remains totally underrated.

    Still don’t believe in the value of Visual Merchandising? Give me a couple of sentences and one more chart to validate my assumptions.

    Before I present the chart to make you believe, we need to align on some terminology.

    Product Exposure Rate (PER): The goal of the product exposure rate is to measure if certain products are under- or over-exposed in our store. The product exposure rate is the “sum of all product views for a given product” divided by “the sum of all product views from all products”.

    Product Net Profit Margin (PNPM): With this metric, we try to find the products with the highest net profit margin. Please be aware: it’s sensible to include all product-related costs in your calculation (customer acquisition costs, cost of product returns, etc.). The Product Net Profit Margin is the “Product Revenue” minus “All Product Costs”, divided by the “Product Revenue”.
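
    As a minimal sketch, the two definitions translate directly into code, assuming you have already aggregated per-product view counts, revenue, and total costs (method and parameter names are illustrative):

    public class MerchandisingMetrics {

        // PER = product views of a single product / product views of all products
        public static double productExposureRate(long productViews, long totalProductViews) {
            return totalProductViews == 0 ? 0.0 : (double) productViews / totalProductViews;
        }

        // PNPM = (product revenue - all product costs) / product revenue
        public static double productNetProfitMargin(double productRevenue, double allProductCosts) {
            return productRevenue == 0.0 ? 0.0 : (productRevenue - allProductCosts) / productRevenue;
        }
    }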

    Now that we have established some common ground, let’s continue calculating these metrics for all active products you sell. We will then visualize them in a graph.

    Product Exposure Rate vs. Product Net Profit Margin

    The data above represents a random sample of 10,000 products from our customers. It may look a bit different for your product data, but the overall tendency should be similar. Please reach out to me if this is not the case! According to the graph it seems that the products with high PER (Product Exposure Rate) tend to have a significantly lower PNPM (Product Net Profit Margin).

    We were able to spot the following two reasons as the most important for this behaviour:

    Two Reasons for Significantly Low Product Net Profit Margin

    1. Higher customer acquisition costs for trending products, mainly because of competition. Because of this, you may even spot several products with a negative PNPM.
    2. The natural tendency for low-priced products to dominate the trending items. This type of over-exposure encourages high-value visitors – customers to whom you would expect to sell higher-margin products under normal circumstances – to purchase cheaper trending products with a lower PNPM.

    I simply can’t over-emphasize how crucial digital merchandising is for a successful and sustainable eCommerce business. This is the secret weapon for engaging your shoppers and guiding them towards making a purchase. To take full advantage of the breadth of your product catalog, you must diversify and segment. Done intelligently, shoppers are more likely to buy from you. Not only that, they’ll also enjoy engaging with, and handing over their hard-earned money to your digital store. For retailers, this means a significant increase in conversions, higher AOV, higher margins, and more loyal customers.

    Conclusion

    Initially, I was going to close this post right after describing how this problem can be solved, conceptually. However, I would have missed an essential, if not the most important part of the story.

    Yes, we all know that we live in a data-driven world. Believe me, we get it. At searchHub, we process billions of data points every day to help our customers understand their users at scale. But in the end, data alone won’t make you successful. Unless, of course, you are in the fortunate position of having a data monopoly.

    To be more concrete: data will help you spot or detect patterns and/or anomalies. It will also help you scale your operations more efficiently. But there are many areas where data can’t help, especially when faced with sparse and biased data. In retail, this is the kind of situation we are dealing with roughly 80% of the time. All digital retailers I am aware of with a product catalog greater than 10,000 SKUs face the product exposure bias. This means only 50-65% of those 10,000 SKUs will ever be seen (exposed) by their users. The rest remain hidden somewhere in the endless digital aisle. Not only does this cost money, it also means a lot of missed potential revenue. Simply put: you can’t judge the value of a product that has never been seen. Perhaps it could have been the top seller you were always looking for, were it only given the chance to shine.

    Keep in mind that retailers offer a service to their customers. Only two things make customers loyal to a service.

    What makes loyal customers?

    • deliver a superior experience
    • be the only one to offer a unique type of service

    Being the one that “also” offers the same type of service won’t help to differentiate.

    I’m one hundred percent sure that today’s successful retail & commerce players are the ones that:

    1. Grasp the importance of connecting brand and commerce
    2. Comprehend how shoppers behave
    3. Learn their data inside and out
    4. Develop an eye for the visual
    5. Connect visual experiences to business goals
    6. Predict what shoppers will search for
    7. Understand the customer journey and how to optimize for it
    8. Think differently when it comes to personalizing for customers
    9. Realize it’s about the consumer, not the device or channel

    I can imagine many eCommerce Managers might feel overwhelmed by the thought of delivering an eCommerce experience that sets their store apart. I admit, it’s a challenge connecting all those insights and capabilities practically. And while we’re not going to minimize the effort involved, we have identified an area that will elevate your digital merchandising to new levels and truly differentiate you from the competition.

  • Search Event Data Collection – Progress So Far

    Search Event Data Collection – Progress So Far

    More than 2 years ago we open-sourced our search-collector, a lightweight JavaScript SDK that allows you to run search KPI collection from your e-commerce website. This post will illustrate our progress with search event data collection to date. Since launch, our search event collector has gathered close to a billion events, all while maintaining utmost user privacy – the collector SDK does not track any personally identifiable information and uses no fingerprinting or any associated techniques. The sole focus of our collector is simply to record search events and pass them to an endpoint.

    Why the search collector and why you need it too

    One may argue that Google Analytics provides everything you need. However, once you dive deeper into site search analytics, its deficiencies become apparent.

    • Google Analytics runs on sampled data. As a result, it does not paint an accurate picture.
    • It’s not possible to implement certain KPIs, for example product click tracking per keyword.
    • Google Analytics often lacks optimum configuration within the web-shop, and fixing it requires engineering resources that are rarely available.

    These types of scenarios led to the birth of the search event collector. As we would rather not impose a particular type of configuration, we structured the collector as an SDK. This strategy gives every team the flexibility to assemble a unique search metric collection solution fit for a particular purpose.

    How does search-collector work?

    Search-collector has two key concepts:

    Collectors

    Collectors are simple JavaScript classes that attach to a DOM element, e.g., the search box. When an event of interest happens at that DOM element, the collector reacts, packages the relevant data, and passes it on to a Writer (see below).

    We’ve provided many out-of-the-box collectors that can help track events like typing, searches, refinements, add-to-baskets, and more.

    Writers

    Writers receive data from the collectors and deliver it to a storage location. Chaining Writers together provides separation of concerns (SoC) and prepares them for reuse. For example, we offer a BufferingWriter whose only role is to buffer the incoming data for a certain amount of time before sending the package on to the endpoint. This is necessary to prevent an HTTP request from firing upon each keypress within the search box.

    Two key writers of interest to the readers of this post are the RestEventWriter and the SQSEventWriter, sending data either to a specified REST endpoint or to Amazon’s SQS. In production, we mostly use the SQS writer, due to its low cost and reliability.

    Search-Collector: Progress vs. Room For Improvement

    The Progress

    • The search-collector has reliably collected close to a billion events on both desktop and mobile.
    • We have not encountered any client issues, and the appeal of precise search tracking immediately captures the interest of web-shop and e-commerce owners. The resulting data is easy to digest and manage.
    • We package the collector as a single script bundle, so a single line adds the search-collector to the web-shop. This streamlined initial setup allows the event collection setup to be updated flexibly later.
    • The SQS mechanism turned out to be a cheap and reliable option for search event storage.
    • The composable Collectors and Writers are flexible enough to capture almost any case we’ve encountered to date.

    Room For Improvement

    The tight coupling of the collector code to the DOM model within the web-shop sometimes creates issues.

    • For example, when DOM structure changes are made without notice. We’re working on a best practice document and a new code version that encourages the use of custom client-side events.
      • For example, soon, we will recommend web-shops send a custom searchEvent when the search is triggered. At the same time, the collector code will register as a page listener for these events.
    • Impression tracking on mobile is difficult. Events are fired differently, and detecting whether a product was within the visible screen area does not work consistently across devices. Although impressions are rarely used, we’re working on improving in this area.
    • Combining Google Analytics data (web-shops usually have it and use it) with Search-Collector data is not trivial. We’re close to launching our Search Insights product that does just that. This will be a considerable help in the event you need to combine these data sources manually – mind the bot traffic.

    Summary – Making Search More Measurable and Profitable

    2 years in, we’ve learned much from our Search-Collector SDK project. On the one hand, we are collecting more data with seamless web-shop integration than ever before. This ultimately allows for a broader understanding of things like findability. On the other hand, the more information we gather, the more maintenance the collection pipelines require. It’s clear, however, that the value we add to our customers’ e-commerce shops far outweighs any limitations we may have encountered.

    As a result, we continue on this journey and look forward to the next version of our search-collector. This new version will offer the benefits of streamlined integration and added transparency into Google Analytics site-search data, all while maintaining integration flexibility to ensure continuity of the collected data even after sudden, unforeseen changes to web-shop code.

    We’ll be launching soon, so please watch this space.

    Footnotes

    1. The Document Object Model (DOM) defines the logical structure of documents and the way a document is accessed and manipulated.
    2. Separation of concerns (SoC) is a design principle for separating a computer program into distinct sections such that each section addresses a separate concern.
    3. SQS is a queue from which your services pull data; standard queues provide at-least-once delivery, while FIFO queues offer exactly-once processing.
  • The Art of Abstraction – Revisiting Webshop Architecture

    The Art of Abstraction – Revisiting Webshop Architecture

    Why Abstraction is Necessary for Modern Web Architecture

    Why abstraction, and why should I reconsider my web-shop architecture? In the next few minutes, I will attempt to make clear the increase in architectural flexibility, and the associated profit gains. This is especially true when abstraction is considered foundational rather than cosmetic, operational, or even departmental.

    TL;DR

    Use Abstraction! It will save you money and increase flexibility!

    OK, that was more of a compression than an abstraction 😉

    The long story – abstraction, a forgotten art

    The human brain is bursting with wonder all its own. Just think of the capabilities each of us has balanced between our shoulders.

    One such capability is the core concept of using abstraction to grasp the complex world around us and store it in a condensed way.

    This, in turn, makes it possible for us humans to talk about objects, structures, and concepts which would be impossible if we had to cope with all the details all the time.

    What is Abstraction?

    Abstraction is also one of the main principles of programming, making software solutions more flexible, maintainable and extensible.

    We programmers are notoriously lazy. As such, not reinventing the wheel is one of the major axioms by which each and every one of us guides our lives.

    Besides saving time, abstraction also reduces the chance of bugs. Should you find any crawling around inside your code, you simply need to squash them in one location, not over and over again in multiple places – provided you’ve got your program structure right.

    Using abstract definitions to derive concrete implementations helps accomplish precisely this.

    Where have you forgotten to implement abstraction?

    Nevertheless, there is one location where you might not be adhering to this general concept of abstraction: the central interface between your shop and your underlying search engine. Here you may have opted for quick integration over decoupled code. As a result, you’ve most likely directly linked these two systems, as in the image below, where the search engine sits directly atop the web-shop architecture, which itself is most often abstracted.

    Perhaps you were lucky enough, when you opened the API documentation of your company’s proprietary site-search engine, to discover well-developed APIs making the integration easy like Sunday morning.

    However, I want to challenge you to consider what there is to gain, by adding another layer of abstraction between shop and search engine.

    Who needs more abstraction? Don’t make my life more complicated!

    At first, you might think: why should I add yet another program or service to my ecosystem? Isn’t that just one more thing I need to take care of?

    This depends heavily on what your overall system looks like. For a small pure player online shop, you may be right.

    However, the bigger you grow, the more consumers of search results you have. Naturally, this increases the number of search results and related variations across the board. It follows that the need within your company to enhance or manipulate the results will grow accordingly. A situation like this markedly increases the rate at which your business stands to profit from abstracted access to the search engine.

    One of the main advantages of structuring your system in this way is the greater autonomy you achieve from the site search engine.

    Why do I want search engine autonomy?

    At this point, it’s necessary to mention that site-search engines, largely, provide the same functionality. Each in its own unique way, of course. So, where’s the problem?

    Site-search APIs are unlikely to be the same among different engines, whether you compare open-source solutions like Solr and Elasticsearch, or commercial solutions like Algolia, FACT-Finder, and Fredhopper. Switching between or migrating systems will be a bear.

    But why is that? All differences aside, the site-search engine use case is the same across the board. Core functionalities must be consistent:

    • searching
    • category navigation
    • filtering
    • faceting
    • sorting
    • suggesting

    Site-Search abstraction puts the focus on core functionalities – not APIs

    The flexibility you gain through an abstraction-based solution cannot be underplayed.

    Once you have created a layer to abstract out these functionalities and made them generally usable for every consumer of search within your company, it is simple to integrate any other solution and switch over just like that.
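
    To make this tangible, here is a minimal sketch of what such a layer could look like for a very reduced feature set. The interface and class names are illustrative and are not the OCSS API; the point is only that consumers depend on a neutral contract, while engine-specific details live in interchangeable adapters.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    public class SearchAbstractionSketch {

        // The neutral contract every consumer of search inside the company codes against.
        public interface SearchService {
            SearchResult search(String query, Map<String, String> filters, int page, int pageSize);
        }

        // Engine-agnostic result model: products plus facets, nothing engine-specific leaks out.
        public static class SearchResult {
            public final List<String> productIds;
            public final Map<String, List<String>> facets;

            public SearchResult(List<String> productIds, Map<String, List<String>> facets) {
                this.productIds = productIds;
                this.facets = facets;
            }
        }

        // One adapter per engine: this stub stands in for an Elasticsearch-backed implementation.
        public static class ElasticsearchSearchService implements SearchService {
            @Override
            public SearchResult search(String query, Map<String, String> filters, int page, int pageSize) {
                // Translate the neutral request into an engine-specific query here,
                // call the engine, and map the response back to SearchResult.
                return new SearchResult(Collections.emptyList(), Collections.emptyMap());
            }
        }

        // A decorator adds cross-cutting features (e.g. real-time prices) without touching any adapter.
        public static class PriceEnrichingSearchService implements SearchService {
            private final SearchService delegate;

            public PriceEnrichingSearchService(SearchService delegate) {
                this.delegate = delegate;
            }

            @Override
            public SearchResult search(String query, Map<String, String> filters, int page, int pageSize) {
                SearchResult result = delegate.search(query, filters, page, pageSize);
                // Post-processing step: enrich result.productIds with real-time prices
                // from another data source before returning to the consumer.
                return result;
            }
        }
    }

    An A/B test between two engines then boils down to deciding which adapter instance to hand to the consumer, and cross-cutting additions such as real-time prices become decorators instead of engine-specific patches.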

    And, since there is no need to deeply integrate the different adapters into your shop’s software, you can more easily enable simple A/B tests.

    Furthermore, if another department also integrates search functionalities, it could be easier for them to use your well-designed abstracted API without re-inventing the wheel locally. Details like “how does Solr create facets” or “how do I boost the matching terms in a certain field” do not need to be rehashed by each department.

    Solve this once in your abstraction layer, and everyone profits.

    A real-world example worth having a look at is our Open Commerce Search Stack (OCSS). You can find an overview of the architecture in a previous blog post [https://blog.searchhub.io/introducing-open-commerce-search-stack-ocss]. The OCSS abstracts the underlying Elasticsearch component and makes it easier to use and integrate. And, because this adapter is Open Source, it can also be used for other search solutions.

    By the way, this method also gives the ability to add functionalities on top. An advantage which cannot be overstated. Let’s have a look at a couple.

    Examples of increased webshop flexibility with increased abstraction:

    • You want to add real-time prices from another data source to the results found? Just add this as a post-processing step after the search engine has retrieved the list of products.
    • You want to map visitor queries to their best performing equivalent with our SmartQuery solution? Easy! Just plug in our JAR file, add a few lines of code, and BAAAM, you’re done.

     

    This also enables the use of our redirect module, getting your customers to the right target page with campaigns, content, or the category they are looking for.

    Oh, and if you simply want to update your engine to a new version, any related API changes can be “hidden” from the consuming services, making it easy to stay up to date. Or at least making new features an optional enhancement that every department can start using whenever they have the time to integrate the necessary changes and switch to the new version of your centrally abstracted API.

    Conclusion

    Depending on the complexity of your webshop’s ecosystem and the variety of services you already use or plan to integrate, abstracting the architecture of your internal site-search solution and related connections can make a noticeable difference.

    In the long run, it can save you a lot of time, and headaches. And in the end increase profits without having to reinvent the wheel.

  • Monitor Elasticsearch in Kubernetes Using Prometheus

    Monitor Elasticsearch in Kubernetes Using Prometheus

    In this article, I will show how to monitor Elasticsearch running inside Kubernetes using the Prometheus-operator, and later Grafana for visualization. Our sample will be based on the cluster described in my last article.

    There are plenty of businesses that have to run and operate Elasticsearch on their own. This can be handled pretty well because of the wide range of deployment types and the large community (an overview here). However, if you’re serious about running Elasticsearch, perhaps as a critical part of your application, you MUST monitor it. In this article, I will show how to monitor Elasticsearch running inside Kubernetes using Prometheus as the monitoring software. We will use the Prometheus-operator for Kubernetes, but it will work with a plain Prometheus in the same way.

    Overview of Elasticsearch Monitoring using Prometheus

    If we talk about monitoring Elasticsearch, we have to keep in mind that there are multiple layers to monitor:

    It is worth noting that every one of these methods uses the Elasticsearch internal stats gathering logic to collect data about the underlying JVM and Elasticsearch itself.

    The Motivation Behind Monitoring Elasticsearch Independently

    Elasticsearch already contains monitoring functionality, so why try to monitor Elasticsearch with an external monitoring system? Some reasons to consider:

    • If Elasticsearch is broken, the internal monitoring is broken
    • You already have a functioning monitoring system with processes for alerting, user management, etc.

    In our case, this second point was the impetus for using Prometheus to monitor Elasticsearch.

    Let’s Get Started – Install the Plugin

    To monitor Elasticsearch with Prometheus, we have to export the monitoring data in the Prometheus exposition format. To this end, we have to install a plugin in our Elasticsearch cluster which exposes the information in the right format under /_prometheus/metrics. If we are using the Elasticsearch operator, we can install the plugin in the same way as the S3 plugin, from the last post, using the init container:

    				
    					version: 7.7.0
     ...
     nodeSets:
     - name: master-zone-a
       ...
       podTemplate:
         spec:
           initContainers:
           - name: sysctl
             securityContext:
               privileged: true
             command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
           - name: install-plugins
             command:
             - sh
             - -c
             - |
               bin/elasticsearch-plugin install -b repository-s3 https://github.com/vvanholl/elasticsearch-prometheus-exporter/releases/download/7.7.0.0/prometheus-exporter-7.7.0.0.zip
       ...
    				
    			

    If you are not using the Elasticsearch-operator, you have to follow the Elasticsearch plugin installation instructions.

    Please note: there is more than one plugin on the market for exposing Elasticsearch monitoring data in the Prometheus format, but the elasticsearch-prometheus-exporter we are using is one of the larger projects, is actively maintained, and has a big community.

    If you are using elasticsearch > 7.17.7 (including 8.x), take a look at the following plugin instead: https://github.com/mindw/elasticsearch-prometheus-exporter/

    After installing the plugin, we should now be able to fetch monitoring data from the /_prometheus/metrics endpoint. To test the plugin, we can use Kibana to perform a request against the endpoint (e.g., GET /_prometheus/metrics in the Dev Tools console). See the picture below:

    How To Configure Prometheus

    At this point, it’s time to connect Elasticsearch to Prometheus. Now, we can create a ServiceMonitor because we are using the Prometheus-operator for monitoring internal Kubernetes applications. See an example below, which can be used to monitor the Elasticsearch cluster we created in my last post:

    				
    					apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
     labels:
       app: prometheus
       prometheus: kube-prometheus
       chart: prometheus-operator-8.13.8
       release: prometheus-operator
     name: blogpost-es
     namespace: monitoring
    spec:
     endpoints:
       - interval: 30s
         path: "/_prometheus/metrics"
         port: https
         scheme: https
         tlsConfig:
           insecureSkipVerify: true
         basicAuth:
           password:
             name: basic-auth-es
             key: password
           username:
             name: basic-auth-es
             key: user
     namespaceSelector:
       matchNames:
       - blog
     selector:
       matchLabels:
         common.k8s.elastic.co/type: elasticsearch
         elasticsearch.k8s.elastic.co/cluster-name: blogpost
    				
    			

    For those unfamiliar with the Prometheus-operator, or those using plain Prometheus to monitor Elasticsearch: the ServiceMonitor will create a Prometheus job like the one below:

    				
    					- job_name: monitoring/blogpost-es/0
      honor_timestamps: true
      scrape_interval: 30s
      scrape_timeout: 10s
      metrics_path: /_prometheus/metrics
      scheme: https
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - blog
      basic_auth:
        username: elastic
        password: io3Ahnae2ieW8Ei3aeZahshi
      tls_config:
        insecure_skip_verify: true
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_common_k8s_elastic_co_type]
        separator: ;
        regex: elasticsearch
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_elasticsearch_k8s_elastic_co_cluster_name]
        separator: ;
        regex: ui
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_name]
        separator: ;
        regex: (.*)
        target_label: pod
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https
        action: replace
    				
    			

    Warning: in our example, the scrape interval is 30 seconds. It may be necessary to adjust the interval for your production cluster. Proceed with caution! Gathering information for every scrape creates a heavy load on your Elasticsearch cluster, especially on the master nodes. A short scrape interval can easily kill your cluster.

    If your configuration of Prometheus was successful, you will now see the cluster under the “Targets” section of Prometheus under “All”. See the picture below:

    Import Grafana-Dashboard

    Theoretically, we are now finished. However, because most people out there use Prometheus with Grafana, I want to show how to import the dashboard especially made for this plugin. You can find it here on grafana.com. The screenshots below explain how to import the Dashboard:


    Following the dashboard import, you should see the Elasticsearch monitoring graphs as in the following screenshot:

    Wrapping Up

    In this article, we briefly covered the possible monitoring options and why it makes sense to monitor Elasticsearch using an external monitoring system. Finally, I showed how to monitor Elasticsearch with Prometheus, using Grafana for visualization.