Working with Nicknames: Dictionary or Thesaurus?

Robbie, Bobby, Rob, and Bob derive from Robert.  Johnny, John, and Jon derive from Jonathan.

When dealing with person names, nicknames can make it hard to tell whether two people are in fact the same person unless you have a tool to help you identify these names. But should you use a custom stemming dictionary? A stemming thesaurus? Are there other options? Here, we compare options for stemming person names in MarkLogic to help you decide which approach is right for you.

Stemming Dictionary

When you stem names using a dictionary, all of the following apply. A stemming dictionary:

  • applies to all databases
  • applies to a single language
  • is loaded and configured using special APIs
  • performs a single-term lookup
  • keeps the query simpler
  • makes ingest slightly slower (but probably negligibly so)
  • uses a special compact in-memory representation, replicated for each instance of the stemmer (but stemmers are pooled/shared)
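
To make the "single term lookup" and "simpler query" points concrete, here is a minimal conceptual sketch in plain Python. It is not MarkLogic code and does not use MarkLogic's dictionary format or APIs; the `NAME_STEMS` table and the `stem`, `build_index`, and `search` helpers are hypothetical names made up for illustration. The idea it shows is that nicknames are reduced to a stem at ingest, so the query only has to stem one term and look it up.

```python
# Hypothetical nickname-to-stem table (illustration only, not MarkLogic's format).
NAME_STEMS = {
    "robbie": "robert",
    "bobby": "robert",
    "rob": "robert",
    "bob": "robert",
}

def stem(term):
    """Single-term lookup: map a word to its stem, or to itself if unknown."""
    word = term.lower()
    return NAME_STEMS.get(word, word)

def build_index(docs):
    """Stem every word at ingest so the index stores stems only."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(stem(word), set()).add(doc_id)
    return index

def search(index, term):
    """The query stays simple: stem the one term and look it up."""
    return index.get(stem(term), set())

docs = {"a": "Bob Smith", "b": "Robert Smith", "c": "Alice Jones"}
index = build_index(docs)
matches = search(index, "Robbie")  # documents "a" and "b"
```

The extra work happens once per document at ingest, which is consistent with the "slightly slower ingest" trade-off listed above.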

Stemming Thesaurus

When you stem names using a thesaurus, consider the following. A stemming thesaurus:

  • applies to a single database
  • is not language-aware at all
  • is configured by loading a document
  • works by query expansion
  • performs a multiple-term lookup
  • makes queries slightly slower (also probably negligibly so)
  • is stored in the expanded tree cache; falling out of the cache can be a performance hit
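
The contrast with the dictionary approach can be sketched the same way. Again, this is plain Python for illustration, not MarkLogic's thesaurus API; `THESAURUS`, `expand`, `build_index`, and `search` are hypothetical names. Here the content is indexed as written, and the query term is expanded into several alternatives, each of which is looked up.

```python
# Hypothetical thesaurus entries (illustration only, not MarkLogic's format).
THESAURUS = {
    "robert": ["robbie", "bobby", "rob", "bob"],
    "jonathan": ["johnny", "john", "jon"],
}

def expand(term):
    """Query expansion: the term itself plus any listed alternatives."""
    word = term.lower()
    return [word] + THESAURUS.get(word, [])

def build_index(docs):
    """Index words as written; nothing is normalized at ingest."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, term):
    """Multiple-term lookup: one index probe per expanded alternative."""
    hits = set()
    for candidate in expand(term):
        hits |= index.get(candidate, set())
    return hits

docs = {"a": "Bob Smith", "b": "Robert Smith", "c": "Jon Jones"}
index = build_index(docs)
```

In this sketch, searching for "Robert" finds both documents "a" and "b", while searching for "Bob" finds only "a", because the table has no entry keyed on "bob". Expansion only goes in the directions you define.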

And…Entity Extraction?

Entity extraction would be overkill for this person-name stemming use case, but it is worth pointing out a trick that uses it. Feed query strings to cts:parse with function bindings to turn each query string into a tagged query, which you can then expand and interpret according to whatever criteria you like, whether or not you run entity extraction on the actual content. An entity extraction approach:

  • applies to a single database
  • is language-aware
  • expands queries (unless you also run entity extraction on the content, of course)
  • requires more code to write (binding functions), but also offers more flexibility
  • is configured via special APIs (which amounts to a document load)
  • is stored in a special compact data structure, or a dedicated cache
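
The tagged-query idea can be sketched outside MarkLogic as well. The following plain Python is a loose stand-in for parsing with function bindings; it does not reproduce cts:parse or its signature, and `NAME_GROUPS`, `expand_person`, `parse`, and the `person:` tag are all hypothetical. Each `tag:value` token is handed to the function bound to that tag, which decides how to interpret or expand the value.

```python
# Hypothetical groups of equivalent person names (illustration only).
NAME_GROUPS = [
    {"robert", "robbie", "bobby", "rob", "bob"},
]

def expand_person(value):
    """Binding function: return every name in the same group as the value."""
    word = value.lower()
    for group in NAME_GROUPS:
        if word in group:
            return sorted(group)
    return [word]

def parse(query_string, bindings):
    """Turn a query string into (tag, alternatives) pairs.

    Tokens of the form tag:value go through the function bound to that tag;
    everything else is treated as a plain word.
    """
    tagged = []
    for token in query_string.split():
        tag, sep, value = token.partition(":")
        if sep and tag in bindings:
            tagged.append((tag, bindings[tag](value)))
        else:
            tagged.append(("word", [token.lower()]))
    return tagged

bindings = {"person": expand_person}
tagged_query = parse("person:Bob merger", bindings)
```

Because the binding is ordinary code, it could just as easily consult a language setting or other context before expanding, which is where the extra flexibility (and the extra code) comes from.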

Bottom Line

If you have a large set of alternatives, or care about language context, go with the stemming dictionary.
