Recommendations about Stemming Options

12.08.2015
1 minute read

We have some internal email lists at MarkLogic and sometimes the information that pops up is too good not to share. Recently, we had this question:

Are there any recommendations regarding the stemming option to use among basic, advanced and decompounding? Would it be a good approach to always use the “advanced” option when enabling stemming for French?

The answer came from Mary Holstege, who built many of the MarkLogic search features.

In languages with a lot of inflections, alternative stems are fairly common: you end up with homonyms colliding, especially for short words, so a single word form can have more than one valid stem. For those languages you should use advanced stemming. That means pretty much everything except English and Chinese. Most European languages will also see certain verb forms produce both an adjective stem and a verb stem (e.g. English “crowded” or “flying”). In English, with few inflections, this is the main case where advanced stemming buys you anything; even in the case of homonyms the stems end up the same anyway.

Decompounding is mainly useful for Germanic languages that do a lot of noun compound formation (German, Dutch, Norwegian) and, to a lesser extent, Japanese. English would be in this camp, except that at some point in our linguistic past we decided to put spaces in our noun compounds (French influence, probably), so you don’t get anything out of decompounding.
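The guidance above can be condensed into a small lookup. The Python sketch below is only an illustration: the function name `recommend_stemming` and the language groupings are assumptions drawn from the advice in this post, not a MarkLogic API, and the returned strings simply mirror the option names (basic, advanced, decompounding) from the question. Returning "basic" for English is also a judgment call, since advanced stemming still helps with forms like “crowded” or “flying”.

```python
# Hypothetical helper that encodes the rules of thumb from this post.
# It is not part of any MarkLogic API. Language codes are ISO 639-1.

# The post names English and Chinese as the languages where
# alternative stems rarely buy you much.
BASIC_IS_USUALLY_ENOUGH = {"en", "zh"}

# Languages the post names as heavy noun compounders:
# German, Dutch, Norwegian, and (to a lesser extent) Japanese.
DECOMPOUNDING_HELPS = {"de", "nl", "no", "ja"}


def recommend_stemming(lang: str) -> str:
    """Return 'basic', 'advanced', or 'decompounding' for a language code."""
    code = lang.strip().lower()
    if code in DECOMPOUNDING_HELPS:
        return "decompounding"
    if code in BASIC_IS_USUALLY_ENOUGH:
        return "basic"
    # Pretty much everything else is inflected enough that
    # alternative stems are common.
    return "advanced"


if __name__ == "__main__":
    for code in ("fr", "de", "en"):
        print(code, "->", recommend_stemming(code))
```

For the original question, `recommend_stemming("fr")` returns `"advanced"`, which matches the answer given for French.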

I would also add that if you are doing stemmed searches in languages that care about accents (like French), you’ll get better results with explicitly diacritic-sensitive searches (assuming you spelled your French words with the correct accents). Likewise, for German you’ll get better results if you spell your nouns with Capital Letters the German Way and use case-sensitive searches. It so happens that the stemmers are sensitive to those details.
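As a rough sketch of the advice above, the helper below picks query-time sensitivity options per language. The function name `search_options` is hypothetical. The option strings follow the names commonly used for MarkLogic word-query options (`stemmed`, `lang=...`, `diacritic-sensitive`, `case-sensitive`), but treat their exact spelling as an assumption to check against the documentation for your version.

```python
from typing import List

# Hypothetical helper, not a MarkLogic API. It only assembles option
# strings according to the rules of thumb in this post.

# Languages where accents matter to the stemmer. The post names French
# as the example, so any other entry would be your own addition.
ACCENT_SENSITIVE = {"fr"}

# Languages where capitalization matters to the stemmer.
# The post names German, where nouns are capitalized.
CASE_SENSITIVE = {"de"}


def search_options(lang: str) -> List[str]:
    """Build a list of search option strings for a stemmed search."""
    code = lang.strip().lower()
    options = ["stemmed", f"lang={code}"]
    if code in ACCENT_SENSITIVE:
        options.append("diacritic-sensitive")
    if code in CASE_SENSITIVE:
        options.append("case-sensitive")
    return options


if __name__ == "__main__":
    print(search_options("fr"))
    print(search_options("de"))
```

The sensitivity options only help if both the content and the query text carry the correct accents or capitalization, as the caveat above notes.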

Mary Holstege
