Recommendations about Stemming Options

We have some internal email lists at MarkLogic and sometimes the information that pops up is too good not to share. Recently, we had this question:

Are there any recommendations regarding the stemming option to use among basic, advanced and decompounding? Would it be a good approach to always use the “advanced” option when enabling stemming for French?

The answer came from Mary Holstege, who built many of the MarkLogic search features.

In languages with a lot of inflections, alternative stems are fairly common, because homonyms end up colliding, especially for short words, so you should use advanced stemming. That means pretty much everything except English and Chinese. Most European languages will also see certain verb forms produce both an adjective stem and a verb stem (e.g. English “crowded” or “flying”). In English, with few inflections, this is the main case where advanced stemming buys you anything; even in the case of homonyms, the stems end up the same anyway.

Decompounding is mainly useful for Germanic languages that do a lot of noun compound formation (German, Dutch, Norwegian) and, to a lesser extent, Japanese. English would be in this camp, except that at some point in our linguistic past we decided to put spaces in our noun compounds (French influence, probably), so you don’t get anything out of decompounding.
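To make the difference between the three options concrete, here is a minimal toy sketch in Python. It is not MarkLogic's stemmer: the tiny lexicons, the `stems` function, and the way each mode is modeled (basic returns one stem, advanced returns every candidate stem, decompounding also adds the parts of a compound) are assumptions chosen only to illustrate the behavior described above.

```python
# Hypothetical mini-lexicons for illustration only; real stemmers use
# full language dictionaries and rules.
STEMS = {
    "crowded": ["crowd", "crowded"],  # verb stem and adjective stem
    "flying": ["fly", "flying"],      # verb stem and adjective stem
    "suis": ["être", "suivre"],       # French homonym: "je suis" of être or of suivre
}

COMPOUNDS = {
    "Handschuh": ["Hand", "Schuh"],   # German "glove" = hand + shoe
}


def stems(word, mode="basic"):
    """Return the stems a toy stemmer would produce for `word`.

    mode is one of "basic", "advanced", "decompounding" (names taken from
    the question above; the behavior here is a simplified assumption).
    """
    candidates = STEMS.get(word, [word])
    if mode == "basic":
        # Basic: keep a single stem, so alternative readings are lost.
        return candidates[:1]
    result = list(candidates)
    if mode == "decompounding":
        # Decompounding: also add the components of a compound word.
        for part in COMPOUNDS.get(word, []):
            if part not in result:
                result.append(part)
    return result


if __name__ == "__main__":
    for mode in ("basic", "advanced", "decompounding"):
        print(mode, stems("suis", mode), stems("Handschuh", mode))
```

In this sketch, a basic-stemmed search for a form of “suivre” would miss text containing “suis”, because only the “être” stem was kept; the advanced mode keeps both readings, and only the decompounding mode lets a search for “Schuh” reach “Handschuh”.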

I would also add that if you are doing stemmed searches in languages that care about accents (like French), you’ll get better results with explicitly diacritic-sensitive searches (assuming you spelled your French words with the correct accents). Likewise, for German you’ll get better results if you spell your nouns with Capital Letters the German Way and use case-sensitive searches. It so happens the stemmers are sensitive to that detail.
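A rough way to see why accents and capitals matter is the toy Python sketch below. It is an illustration under stated assumptions, not MarkLogic behavior: the `LEXICON` entries, `fold`, and `lookup` are hypothetical, and the only point is that folding away diacritics and case makes otherwise distinct words collide before a stemmer gets to tell them apart.

```python
import unicodedata

# Hypothetical lexicon: surface form -> stems. Illustration only.
LEXICON = {
    "pêche": ["pêche"],   # French noun: fishing / peach
    "péché": ["péché"],   # French noun: sin
    "pêché": ["pêcher"],  # past participle of pêcher, "to fish"
    "Essen": ["Essen"],   # German noun: meal
    "essen": ["essen"],   # German verb: to eat
}


def fold(word):
    """Strip diacritics and lowercase, as an insensitive search would."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def lookup(word, sensitive=True):
    """Return candidate stems for `word`.

    sensitive=True matches the exact spelling; sensitive=False matches
    every lexicon entry that folds to the same string.
    """
    if sensitive:
        return LEXICON.get(word, [word])
    key = fold(word)
    found = []
    for form, form_stems in LEXICON.items():
        if fold(form) == key:
            for stem in form_stems:
                if stem not in found:
                    found.append(stem)
    return found or [word]


if __name__ == "__main__":
    print(lookup("pêché"))                   # exact spelling, one reading
    print(lookup("peche", sensitive=False))  # folded, several readings collide
    print(lookup("Essen"))                   # capitalized noun only
    print(lookup("essen", sensitive=False))  # noun and verb collide
```

With the correct accents and capitals, each query resolves to a single reading; once they are folded away, the French forms and the German noun and verb become indistinguishable.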
