Working with Nicknames: Dictionary or Thesaurus?

11.27.2018

Robbie, Bobby, Rob, and Bob derive from Robert. Johnny, John, and Jon derive from Jonathan.

When dealing with person names, nicknames can make it hard to tell whether two records refer to the same person unless you have a tool that recognizes the variants. But should you use a custom stemming dictionary? A thesaurus? Are there other options? Here, we compare the options for stemming person names in MarkLogic to help you decide which approach is right for you.

Stemming Dictionary

A custom stemming dictionary:

  • applies to all databases
  • applies to a single language
  • is configured through special APIs that load and configure the dictionary
  • uses single-term lookup
  • keeps the query simpler
  • makes ingest slightly slower (but probably negligibly)
  • uses a special compact in-memory representation, replicated for each instance of the stemmer (but stemmers are pooled and shared)
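
To make the single-term lookup concrete, here is a minimal conceptual sketch in Python. It is not MarkLogic code and does not use any MarkLogic API: the `NAME_STEMS` table and the `stem`, `build_index`, and `search` helpers are hypothetical names invented for illustration. It only shows the general idea that each nickname maps to one stem, applied both when content is indexed and when a query term is looked up.

```python
# Hypothetical nickname-to-stem table; this is NOT MarkLogic's dictionary format.
NAME_STEMS = {
    "robbie": "robert",
    "bobby": "robert",
    "rob": "robert",
    "bob": "robert",
    "johnny": "jonathan",
    "john": "jonathan",
    "jon": "jonathan",
}


def stem(term):
    """Single-term lookup: return the canonical stem, or the lowercased term."""
    key = term.lower()
    return NAME_STEMS.get(key, key)


def build_index(docs):
    """Index each document under the stems of its words (the ingest-time work)."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(stem(word), set()).add(doc_id)
    return index


def search(index, term):
    """The query stays a single term: stem it and probe the index once."""
    return index.get(stem(term), set())


docs = {"d1": "Bob Smith", "d2": "Robert Smith", "d3": "Jon Jones"}
index = build_index(docs)
```

In this sketch, `search(index, "Robbie")` finds both the "Bob Smith" and "Robert Smith" documents with one lookup, because the extra work happened when the documents were indexed.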

Stemming Thesaurus

A thesaurus:

  • applies to a single database
  • is not language-aware at all
  • is configured by loading a document
  • works through query expansion
  • uses multiple-term lookup
  • makes the query slightly slower (also probably negligibly)
  • is stored in the expanded tree cache; falling out of the cache can be a performance hit
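
By contrast, query expansion leaves the indexed content alone and widens the query instead. The following Python sketch illustrates that idea only; it is not MarkLogic's thesaurus API or document format, and `THESAURUS`, `expand`, `build_plain_index`, and `search_expanded` are hypothetical names made up for this example.

```python
# Hypothetical synonym entries; this is NOT MarkLogic's thesaurus document format.
THESAURUS = {
    "robert": ["robbie", "bobby", "rob", "bob"],
    "jonathan": ["johnny", "john", "jon"],
}


def expand(term):
    """Query expansion: return the term plus every listed alternative, sorted."""
    key = term.lower()
    for head, synonyms in THESAURUS.items():
        if key == head or key in synonyms:
            return sorted({head, *synonyms})
    return [key]


def build_plain_index(docs):
    """Index words exactly as written; nothing extra happens at ingest."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search_expanded(index, term):
    """Multiple-term lookup: one index probe per expanded term, results unioned."""
    hits = set()
    for candidate in expand(term):
        hits |= index.get(candidate, set())
    return hits


docs = {"d1": "Bob Smith", "d2": "Robert Smith", "d3": "Jon Jones"}
plain_index = build_plain_index(docs)
```

Here a query for "Bob" turns into five probes (bob, bobby, rob, robbie, robert), which is where the slightly slower query in the list above comes from.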

And…Entity Extraction?

Entity extraction would be overkill for this person-name stemming use case, but one trick that uses it is worth pointing out. Feed query strings to cts:parse with function bindings to turn each query string into a tagged query, which you can then expand and interpret according to whatever criteria you like, whether or not you run entity extraction on the actual content. An entity extraction approach:

  • applies to a single database
  • is language-aware
  • expands queries (unless you also run entity extraction on the content, of course)
  • requires more code to write (binding functions), but also gives you more flexibility
  • is configured via special APIs (which amounts to a document load)
  • is stored in a special compact data structure, or a dedicated cache
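
The binding-function idea can be sketched in a few lines of Python. This is a conceptual illustration only: it does not reproduce cts:parse, its grammar, or its binding signatures, and `parse_query`, `person_binding`, the `person:` tag, and the tuple-based query shape are all hypothetical. It shows a tagged term in a query string being handed to a function that decides how to expand it.

```python
# Hypothetical nickname data used by the binding function below.
NICKNAMES = {"robert": ["robbie", "bobby", "rob", "bob"]}


def person_binding(value):
    """Hypothetical binding function: expand a tagged name into an OR query."""
    key = value.lower()
    for head, alternatives in NICKNAMES.items():
        if key == head or key in alternatives:
            return ("or", sorted({head, *alternatives}))
    return ("or", [key])


def parse_query(query_string, bindings):
    """Split on whitespace; 'tag:value' tokens with a bound tag go to the
    binding function, and everything else becomes a plain word query."""
    clauses = []
    for token in query_string.split():
        tag, sep, value = token.partition(":")
        if sep and tag in bindings:
            clauses.append(bindings[tag](value))
        else:
            clauses.append(("word", token.lower()))
    return ("and", clauses)


query = parse_query("person:Bob smith", {"person": person_binding})
```

Because the expansion lives in ordinary code, the binding function could apply any criteria you like, which is the extra flexibility (and extra code) noted in the list above.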

Bottom Line

If you have a large set of alternatives, or care about language context, go with the stemming dictionary.

Mary Holstege
