Monday, March 7, 2016

Spell check with Solr


In this post I present examples and explain how to set up a spell check or "did you mean" (DYM) feature with your Solr instance. I organize this as a set of steps so it is easier to follow. This post assumes you already have Solr installed and working. For installing Solr, please refer to the Solr wiki.

1. Decide what will be the source of spell suggestions: the index or a dictionary file.
One of the first things you need to consider is what will be the source of your spell check. You can base your spell suggestions on the index or on a dictionary file. You can also use both by defining multiple spellcheckers and combining them in your query. The example in this post uses both index-based and file-based spellcheckers. Another consideration is the set of languages for which you want to provide spell suggestions. In general, you will want separate fields and/or separate dictionary files for each language. In this post I use English as an example, but the steps are similar for any other language.

2. Declare the field type and field for use with index based spellchecker.

To base the spell suggestions on the index, you need to declare the field type and field that will serve as the source of spell suggestions. You do not want too much analysis to be applied to this field. The standard tokenizer should be sufficient for most uses. It splits the text field into tokens, treating whitespace and punctuation as delimiters. Similarly, the Standard filter should be good for most cases. You may want to apply a stop words filter to remove any terms not appropriate for spell suggestions. In the example below, I define a field type called 'text_suggest':

schema.xml:

    <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StandardFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
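
To get a feel for which terms end up in the spellcheck field, here is a rough Python approximation of the analysis chain above. It is only a sketch: the regular expression is a crude stand-in for the standard tokenizer, and the stop word list is a hypothetical sample (in Solr it comes from stopwords.txt).

```python
import re

# Hypothetical stop word list; in Solr this is read from stopwords.txt.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}


def approximate_text_suggest(text):
    """Roughly mimic the text_suggest chain: tokenize, drop stop words, lowercase."""
    # Split on whitespace and punctuation (a crude stand-in for StandardTokenizer).
    tokens = re.findall(r"\w+", text)
    # ignoreCase="true": compare against the stop list case-insensitively.
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    # The lowercase filter runs last in the chain.
    return [t.lower() for t in kept]


print(approximate_text_suggest("The Wireless-Setup guide, for Laptops"))
# prints ['wireless', 'setup', 'guide', 'laptops']
```

The terms that survive this chain are the ones that can later be offered as suggestions, which is why heavy analysis such as stemming is best avoided on this field.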

Next define a field for spell suggestions.
This field will be of the type declared previously. The field needs to be indexed and stored. You may set omitNorms to true if you are not going to use this field for regular search.

<field name="fieldspellcheck_en" type="text_suggest" indexed="true" stored="true" omitNorms="true" />

You may then use the copyField directive to copy content from your language-specific text fields to the spellcheck field, so that the spellcheck field gets populated with the appropriate language-specific text.

<copyField source="title_en" dest="fieldspellcheck_en" />

3. Define the spell check component in the solrconfig.xml file.
You need to define the spellcheck component in the solrconfig.xml file. Here you can define one or more 'spellcheckers'. The 'name' element within a spellchecker declares its name, which can be referred to at query time or inside a request handler definition. The 'classname' element defines the specific spellcheck implementation that the spellchecker uses.

In this example I define three spellcheckers using different implementations. The first is index based, working off the 'fieldspellcheck_en' field that we defined above. The second uses the Solr WordBreakSolrSpellChecker, which offers suggestions by combining adjacent query terms and/or breaking terms into multiple words. The third uses an external file as its spell check dictionary. I have included comments explaining some of the parameters used.

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <!-- a spellchecker built from a field of the main index -->
    <lst name="spellchecker">
      <str name="name">spellcheck_en</str>
      <str name="field">fieldspellcheck_en</str>
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
      <str name="distanceMeasure">org.apache.lucene.search.spell.LevensteinDistance</str>
      <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
      <str name="accuracy">0.5</str>
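      <!-- NOTE: maxEdits, minPrefix, maxInspections, minQueryLength and maxQueryFrequency are documented
           for solr.DirectSolrSpellChecker; IndexBasedSpellChecker may simply ignore them -->
      <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->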
      <int name="maxEdits">1</int>
      <!-- the minimum shared prefix when enumerating terms -->
      <int name="minPrefix">1</int>
      <!-- maximum number of inspections per result. -->
      <int name="maxInspections">5</int>
      <!-- minimum length of a query term to be considered for correction -->
      <int name="minQueryLength">4</int>
      <!-- maximum threshold of documents a query term can appear to be considered for correction -->
      <float name="maxQueryFrequency">0.01</float>
      <!-- uncomment this to require suggestions to occur in 1% of the documents
        <float name="thresholdTokenFrequency">.01</float>
      -->
    </lst>
    
    <lst name="spellchecker">
      <str name="name">wordbreak_en</str>
      <str name="classname">solr.WordBreakSolrSpellChecker</str>      
      <str name="field">fieldspellcheck_en</str>
      <!-- should multiple words from the query be combined in a dictionary search -->
      <str name="combineWords">true</str>
      <!-- should words from the query be broken in a dictionary search -->
      <str name="breakWords">true</str>
      <int name="maxChanges">5</int>
      <int name="minBreakLength">3</int>
    </lst>

    <!-- A spellchecker that reads the list of words from a file -->
    <lst name="spellchecker">
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="name">filespellcheck</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">spellcheckerFile</str>
    </lst>
</searchComponent>
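
Since the spellchecker names are referred to again when building the dictionaries and at query time, it can help to list them straight from the config. The following is a minimal sketch using Python's standard library; the inline CONFIG string is a trimmed copy of the component above, and the function name is only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the searchComponent above, kept inline so the sketch is self-contained.
CONFIG = """
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">spellcheck_en</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak_en</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="name">filespellcheck</str>
  </lst>
</searchComponent>
"""


def list_spellcheckers(xml_text):
    """Return (name, classname) for each spellchecker in a searchComponent."""
    root = ET.fromstring(xml_text)
    result = []
    for lst in root.findall("lst[@name='spellchecker']"):
        name = lst.findtext("str[@name='name']")
        classname = lst.findtext("str[@name='classname']")
        result.append((name, classname))
    return result


for name, classname in list_spellcheckers(CONFIG):
    print(name, "->", classname)
```

For the config above this lists spellcheck_en, wordbreak_en and filespellcheck, in that order.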

4. Now you are ready to index your content.
Reload the Solr core and make sure the schema and solrconfig.xml are loaded without errors. You can also open the 'schema browser' from the Solr console to check that the new field and field type are configured properly.
You can then go ahead and index your content.

5. Build the Solr spell check dictionaries/index
Once you have your content indexed, you need to build the spell check dictionaries before they can be used for returning suggestions. You can use the following type of request for this purpose:
http://<server ip>:<solr port>/solr/<request handler name>?q=*:*&spellcheck.build=true&spellcheck.dictionary=<spellchecker name>&spellcheck.q=<query for spellcheck>&spellcheck=true

where <spellchecker name> is the name you specified when declaring your spellcheckers in step 3.

A concrete example may look like this:
http://localhost:8983/solr/select?q=*:*&spellcheck.build=true&spellcheck.dictionary=wordbreak_en&spellcheck.q=wirelesssetup&spellcheck=true

Make sure you build all the spellcheckers you defined in step 3.
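
If you have several spellcheckers, generating the build requests in a loop avoids forgetting one. Below is a small sketch that only constructs the URLs; the base URL and the function name are assumptions for illustration, and the spellcheck.q parameter from the template above is left out for brevity.

```python
from urllib.parse import urlencode


def build_dictionary_urls(base_url, spellcheckers):
    """Return one spellcheck.build request URL per spellchecker name."""
    urls = []
    for name in spellcheckers:
        params = [
            ("q", "*:*"),
            ("spellcheck", "true"),
            ("spellcheck.build", "true"),
            ("spellcheck.dictionary", name),
        ]
        # urlencode percent-encodes the query (for example *:* becomes %2A%3A%2A).
        urls.append(base_url + "?" + urlencode(params))
    return urls


# Assumed base URL; adjust host, port and request handler to your setup.
for url in build_dictionary_urls(
    "http://localhost:8983/solr/select",
    ["spellcheck_en", "wordbreak_en", "filespellcheck"],
):
    print(url)
```

Each printed URL can then be requested with a browser, curl, or urllib.request.urlopen.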

6. You are now ready to get back spell suggestions at query time.
You can use a request of the following form:

http://<server ip>:<solr port>/solr/<request handler name>?q=<search query>&spellcheck.dictionary=<spellchecker name>&spellcheck=true&spellcheck.count=<number of suggestions>&spellcheck.collate=<false|true>&spellcheck.maxCollations=<N>&spellcheck.collateExtendedResults=<false|true>

where:

spellcheck.dictionary - the name of one of the spellcheckers you defined in step 3; it selects the spellchecker to use. You can pass multiple spellcheck.dictionary parameters, each referring to a different spellchecker. Solr consults all of them and interleaves the results.

spellcheck.collate - optional; turns collations on or off. Setting it to true makes Solr build a new query from the best suggestion for each term in the submitted query. Solr can also verify that a collation returns results when re-run by the client with the original fq params applied (this check is governed by the spellcheck.maxCollationTries parameter). Collations are especially helpful when a query needs more than one correction.

spellcheck.maxCollations - the maximum number of collations to return. Only applicable if collations are enabled with spellcheck.collate=true.

spellcheck.collateExtendedResults - if true, returns an expanded response for each collation, showing the number of hits. Only applicable if collations are enabled with spellcheck.collate=true.
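
To make the idea of a collation concrete, here is a deliberately simplified sketch of "take the best suggestion for each term and rebuild the query". It is not how Solr implements collation internally (Solr works with token offsets and can test candidates against the index); the function name and the misspelled term "chargr" are made up for illustration.

```python
def naive_collation(query, best_suggestion):
    """Replace each whitespace-separated term with its best suggestion, if any."""
    # Terms without a suggestion are kept unchanged.
    return " ".join(best_suggestion.get(term, term) for term in query.split())


print(naive_collation("lattop chargr", {"lattop": "laptop", "chargr": "charger"}))
# prints laptop charger
```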

A concrete example query may look like the following. Note that this example shows only a handful of the parameters you can pass; for the full list of spell check parameters, refer to the Solr wiki.

http://localhost:8983/solr/sitewide/select?fl=id,score&wt=json&defType=edismax&qf=title&q=lattop&rows=10&spellcheck=on&spellcheck.count=3&spellcheck.collate=true&spellcheck.dictionary=wordbreak_en&spellcheck.dictionary=spellcheck_en&spellcheck.collateExtendedResults=true&spellcheck.maxCollations=3
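
Long query strings like this are easy to get wrong by hand, especially with a repeated parameter such as spellcheck.dictionary. A minimal sketch of assembling the URL in Python follows; the function name and defaults are assumptions, and only a subset of the parameters from the example is included.

```python
from urllib.parse import urlencode


def spellcheck_query_url(base_url, query, dictionaries, count=3, max_collations=3):
    """Build a query URL that consults several spellcheckers at once."""
    params = {
        "q": query,
        "wt": "json",
        "spellcheck": "on",
        "spellcheck.count": count,
        "spellcheck.collate": "true",
        "spellcheck.collateExtendedResults": "true",
        "spellcheck.maxCollations": max_collations,
        # A list value is expanded into one parameter per element by doseq=True.
        "spellcheck.dictionary": list(dictionaries),
    }
    return base_url + "?" + urlencode(params, doseq=True)


url = spellcheck_query_url(
    "http://localhost:8983/solr/sitewide/select",
    "lattop",
    ["wordbreak_en", "spellcheck_en"],
)
print(url)
```

The doseq=True flag is what turns the list of dictionaries into two separate spellcheck.dictionary parameters.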

The response should have the spellcheck section similar to the one below:

  "spellcheck": {
    "suggestions": [
      "lattop",
      {
        "numFound": 3,
        "startOffset": 0,
        "endOffset": 6,
        "suggestion": [
          "laptops",
          "palmtop",
          "attny"
        ]
      }
    ],
    "collations": [
      "collation",
      {
        "collationQuery": "laptop",
        "hits": 2366,
        "misspellingsAndCorrections": [
          "lattop",
          "laptop"
        ]
      },
      "collation",
      {
        "collationQuery": "laptops",
        "hits": 689,
        "misspellingsAndCorrections": [
          "lattop",
          "laptops"
        ]
      },
      "collation",
      {
        "collationQuery": "button",
        "hits": 3,
        "misspellingsAndCorrections": [
          "lattop",
          "button"
        ]
      }
    ]
  }
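
Note that the suggestions and collations arrive as flat lists that alternate between a label and a value. A client has to pair them up before use. The sketch below assumes exactly the layout shown above (it can differ between Solr versions and JSON response settings); the helper names are illustrative.

```python
def pairs(flat):
    """Turn [k1, v1, k2, v2, ...] into [(k1, v1), (k2, v2), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


def summarize_spellcheck(spellcheck):
    """Return ({term: [suggestions]}, [(collationQuery, hits), ...])."""
    suggestions = {
        term: info["suggestion"]
        for term, info in pairs(spellcheck.get("suggestions", []))
    }
    collations = [
        (value["collationQuery"], value["hits"])
        for label, value in pairs(spellcheck.get("collations", []))
        if label == "collation"
    ]
    return suggestions, collations


# The spellcheck section from the response above, trimmed to the fields used here.
sample = {
    "suggestions": [
        "lattop",
        {"numFound": 3, "startOffset": 0, "endOffset": 6,
         "suggestion": ["laptops", "palmtop", "attny"]},
    ],
    "collations": [
        "collation", {"collationQuery": "laptop", "hits": 2366},
        "collation", {"collationQuery": "laptops", "hits": 689},
        "collation", {"collationQuery": "button", "hits": 3},
    ],
}

suggestions, collations = summarize_spellcheck(sample)
print(suggestions)  # {'lattop': ['laptops', 'palmtop', 'attny']}
print(collations)   # [('laptop', 2366), ('laptops', 689), ('button', 3)]
```

A "did you mean" link would typically use the first collation, here "laptop" with 2366 hits.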

And that is it! Hope this has been helpful.