Understanding ElasticSearch analyzers

Searching for lemons
<code class='javascript'><span class='line'><span class="nx">query</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">"query"</span> <span class="o">:</span> <span class="p">{</span> <span class="s2">"term"</span> <span class="o">:</span> <span class="p">{</span> <span class="s2">"ingredients"</span> <span class="o">:</span> <span class="s2">"lemons"</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span>
</span><span class='line'>
</span><span class='line'><span class="nx">client</span><span class="p">.</span><span class="nx">search</span><span class="p">(</span><span class="s1">'beer_recipes'</span><span class="p">,</span> <span class="s1">'beer'</span><span class="p">,</span> <span class="nx">query</span><span class="p">).</span><span class="nx">on</span><span class="p">(</span><span class="s1">'data'</span><span class="p">,</span> <span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="o">-></span>
</span><span class='line'>  <span class="nx">data</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span>
</span><span class='line'>  <span class="k">for</span> <span class="nx">doc</span> <span class="k">in</span> <span class="nx">data</span><span class="p">.</span><span class="nx">hits</span><span class="p">.</span><span class="nx">hits</span>
</span><span class='line'>      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">_source</span><span class="p">.</span><span class="nx">style</span>
</span><span class='line'>      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">_source</span><span class="p">.</span><span class="nx">name</span>
</span><span class='line'>      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">_source</span><span class="p">.</span><span class="nx">ingredients</span>
</span><span class='line'><span class="p">).</span><span class="nx">exec</span><span class="p">()</span>
</span></code>

Lo and behold, this search returns a hit! But that’s inconvenient, to say the least. Because the words in the ingredients field are tokenized as is, a search for “lemons” works while a search for “lemon” doesn’t. Note: there are various mechanisms for searching, and a wildcard search on “lemon*” should have returned a result.
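The wildcard search mentioned in the note can be written in the query DSL as a `wildcard` query. The sketch below only builds the request body as a plain JavaScript object and prints it; the helper name `wildcardQuery` is made up for illustration and is not part of any client library.

```javascript
// Build a query body that matches any token starting with "lemon".
// wildcardQuery is a hypothetical helper used only for this illustration.
function wildcardQuery(field, pattern) {
  return { query: { wildcard: { [field]: pattern } } };
}

const query = wildcardQuery("ingredients", "lemon*");
console.log(JSON.stringify(query));
// prints {"query":{"wildcard":{"ingredients":"lemon*"}}}
```

An object like this should be usable in place of the term query in the search call above, assuming the client passes the body through unchanged.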

When a document is added into an ElasticSearch index, its fields are analyzed and converted into tokens. When you execute a search against an index, you search against those tokens. How ElasticSearch tokenizes a document is configurable.
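To make the idea of searching against tokens concrete, here is a toy inverted index in plain JavaScript. It is a simplified model rather than ElasticSearch's actual implementation, and the sample documents and the names `buildIndex`, `termSearch`, and `simpleAnalyze` are invented for the example.

```javascript
// Toy inverted index: maps each token to the set of document ids containing it.
// The analyzer is a parameter so the same index code works with different
// tokenization rules.
function buildIndex(docs, analyze) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const token of analyze(text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  }
  return index;
}

// Exact-match lookup, similar in spirit to a term query.
function termSearch(index, term) {
  return [...(index.get(term) || [])].sort();
}

// Simplified analyzer: lowercase, then split on anything that is not a letter.
const simpleAnalyze = (text) =>
  text.toLowerCase().split(/[^a-z]+/).filter(Boolean);

// Made-up sample documents, not the recipes from this article.
const docs = {
  1: "Hops, barley malt, lemons",
  2: "Hops, wheat, coriander",
};

const index = buildIndex(docs, simpleAnalyze);
console.log(termSearch(index, "lemons")); // matches document 1
console.log(termSearch(index, "lemon"));  // matches nothing: no such token
```

Because the lookup is an exact match on stored tokens, the only way to make “lemon” match here is to change what the analyzer emits at index time.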

ElasticSearch offers a range of analyzers, from language analyzers that support non-English searches to the snowball analyzer, which converts a word into its root, or stem, yielding a simpler token. This process is called stemming. For example, the snowball stem of “lemons” is “lemon”. Likewise, if the words “knocks” and “knocking” appeared in a snowball-analyzed document, both would be indexed under the stem “knock”.
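The effect of stemming can be imitated with a deliberately naive suffix stripper. This is not the snowball algorithm, which applies a much more careful set of rules; `toyStem` is a hypothetical function that handles only the two suffixes needed for the words above.

```javascript
// Toy suffix stripper for illustration only; the real snowball stemmer
// uses far more elaborate rules.
function toyStem(word) {
  const w = word.toLowerCase();
  for (const suffix of ["ing", "s"]) {
    // Keep at least three characters so short words are left alone.
    if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
      return w.slice(0, w.length - suffix.length);
    }
  }
  return w;
}

for (const word of ["lemons", "lemon", "knocks", "knocking"]) {
  console.log(`${word} -> ${toyStem(word)}`);
}
```

Both “lemons” and “lemon” come out as “lemon”, and both “knocks” and “knocking” come out as “knock”, which is why a stemmed index can match either form.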

You can change how documents are tokenized via the index mapping API like so:

Changing the mapping for an index using cURL
<code class='bash'><span class='line'>curl -XPUT <span class="s1">'http://localhost:9200/beer_recipes'</span> -d <span class="s1">'{ "mappings" : {</span>
</span><span class='line'><span class="s1">  "beer" : {</span>
</span><span class='line'><span class="s1">    "properties" : {</span>
</span><span class='line'><span class="s1">      "ingredients" : { "type" : "string", "analyzer" : "snowball" }</span>
</span><span class='line'><span class="s1">    }</span>
</span><span class='line'><span class="s1">   }</span>
</span><span class='line'><span class="s1"> }</span>
</span><span class='line'><span class="s1">}'</span>
</span></code>

Note how the above mapping specifies that the ingredients field will be analyzed via the snowball analyzer. Also note, you have to change the mapping of an index before you begin to add documents to it! So, in this case, I’ll need to drop the index, run the mapping call above, and then re-add those two recipes.
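The drop, remap, and re-add sequence can be laid out as an ordered list of HTTP requests. The sketch below only describes the requests as plain objects and does not contact a server; `reindexPlan` is a hypothetical helper, the document ids are assumed to be sequential, and the recipe bodies are placeholders rather than the recipes used earlier.

```javascript
// Describe the three steps as plain request objects, in the order they must run.
// reindexPlan is a hypothetical helper; it does not talk to a server.
function reindexPlan(baseUrl, index, type, mappings, docs) {
  const indexUrl = `${baseUrl}/${index}`;
  return [
    // 1. Drop the existing index and everything in it.
    { method: "DELETE", url: indexUrl },
    // 2. Recreate the index with the new mapping.
    { method: "PUT", url: indexUrl, body: { mappings } },
    // 3. Re-add each document so it is analyzed under the new mapping.
    ...docs.map((doc, i) => ({
      method: "PUT",
      url: `${indexUrl}/${type}/${i + 1}`,
      body: doc,
    })),
  ];
}

const mappings = {
  beer: {
    properties: {
      ingredients: { type: "string", analyzer: "snowball" },
    },
  },
};

// Placeholder documents; substitute the real recipes.
const recipes = [{ name: "recipe one" }, { name: "recipe two" }];

const plan = reindexPlan("http://localhost:9200", "beer_recipes", "beer", mappings, recipes);
for (const step of plan) console.log(step.method, step.url);
```

The order matters: documents indexed before the mapping is in place would still be tokenized as is.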

Now I can begin searching recipes for the ingredient “lemon” or “lemons”.

Searching for lemon now works!
<code class='javascript'><span class='line'><span class="nx">query</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">"query"</span> <span class="o">:</span> <span class="p">{</span> <span class="s2">"term"</span> <span class="o">:</span> <span class="p">{</span> <span class="s2">"ingredients"</span> <span class="o">:</span> <span class="s2">"lemon"</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span>
</span><span class='line'>
</span><span class='line'><span class="nx">client</span><span class="p">.</span><span class="nx">search</span><span class="p">(</span><span class="s1">'beer_recipes'</span><span class="p">,</span> <span class="s1">'beer'</span><span class="p">,</span> <span class="nx">query</span><span class="p">).</span><span class="nx">on</span><span class="p">(</span><span class="s1">'data'</span><span class="p">,</span> <span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="o">-></span>
</span><span class='line'>  <span class="nx">data</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span>
</span><span class='line'>  <span class="k">for</span> <span class="nx">doc</span> <span class="k">in</span> <span class="nx">data</span><span class="p">.</span><span class="nx">hits</span><span class="p">.</span><span class="nx">hits</span>
</span><span class='line'>      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">_source</span><span class="p">.</span><span class="nx">style</span>
</span><span class='line'>      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">_source</span><span class="p">.</span><span class="nx">name</span>
</span><span class='line'>      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">_source</span><span class="p">.</span><span class="nx">ingredients</span>
</span><span class='line'><span class="p">).</span><span class="nx">exec</span><span class="p">()</span>
</span></code>

Keep in mind that snowballing can inadvertently make your search results less relevant. Long words can be stemmed into more common but completely different words. For example, if stemming were to reduce the word “sextant” in a document to “sex”, searches for “sextant” would also return documents that contain the word “sex” (and vice versa).

ElasticSearch puts a powerful search engine into your clutches; plus, with a little forethought about how a document’s contents are analyzed, you’ll make searches even more relevant.

This story, "Understanding ElasticSearch analyzers" was originally published by JavaWorld.

Copyright © 2013 IDG Communications, Inc.
