Stemming is the process of reducing the morphological variants of a word to a common root/base form.
Programs that reduce words in this way are commonly referred to as stemming algorithms or stemmers.
Stemming can go wrong in two ways: overstemming and understemming.
Overstemming occurs when two words that have different stems are reduced to the same root.
Understemming occurs when two words that share the same stem are not reduced to the same root.
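To make the two error types concrete, here is a minimal sketch using a deliberately naive suffix stripper. The names SUFFIXES and toy_stem are hypothetical and exist only for this illustration; this is not the Porter algorithm or any nltk API.

```python
# Toy suffix-stripping stemmer (hypothetical, for illustration only;
# this is NOT the Porter algorithm).
SUFFIXES = ("ity", "al", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Overstemming: words with different meanings collapse to one stem.
over = {w: toy_stem(w) for w in ("universal", "university", "universe")}
print(over)

# Understemming: related words end up with different stems.
under = {w: toy_stem(w) for w in ("alumnus", "alumni", "alumnae")}
print(under)
```

With these rules the first group collapses to the single stem "univers", while the second group keeps three separate stems even though the words are forms of the same noun.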
Stemming is used in information retrieval systems such as search engines, and in domain analysis to determine domain vocabularies.
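As a rough sketch of the search engine case, the example below indexes documents by stem so that a query for "engine" also finds "engines". The functions simple_stem, build_index and search are hypothetical names made up for this illustration, and simple_stem is only a placeholder that drops a trailing "s"; a real system would plug in a proper stemmer.

```python
from collections import defaultdict


def simple_stem(word):
    """Hypothetical placeholder stemmer: lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(docs, stem=simple_stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=simple_stem):
    """Return the ids of documents that contain every stemmed query term."""
    terms = [stem(t) for t in query.split()]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {
    1: "search engines index documents",
    2: "a search engine ranks a document",
    3: "stemmers reduce words",
}
index = build_index(docs)
print(search(index, "engine documents"))
```

Because both the documents and the query pass through the same stemming function, singular and plural forms land on the same index entry.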
Let's install the Python module named nltk with the pip tool:
C:\Python373\Scripts>pip install nltk
Collecting nltk
...
Successfully installed nltk-3.4.1 six-1.12.0
The nltk Python module works with human language data and is used in statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning, along with graphical demonstrations and sample data sets.
The next step is to download the models and data; see more at this official webpage.
First run these lines of code to open the nltk downloader and fetch the data packages:
import nltk
nltk.download()
Let's test a simple implementation of stemming words using the nltk Python module:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
my_porter = PorterStemmer()
quote = "Deep in the human unconscious is a pervasive need for a logical universe that makes sense."
words = word_tokenize(quote)
for w in words:
    print(w, " : ", my_porter.stem(w))
The result is something like this:
C:\Users\catafest>python stemming_001.py
Deep : deep
in : in
the : the
human : human
unconscious : unconsci
is : is
a : a
pervasive : pervas
need : need
for : for
a : a
logical : logic
universe : univers
that : that
makes : make
sense : sens
. : .
C:\Users\catafest>
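The output above shows one stem per token. A small follow-up sketch, assuming the same PorterStemmer class from nltk, groups the words of a text under their shared stem, which makes the collapsing of variants easier to see. The function group_by_stem is a hypothetical helper, and the regular expression tokenizer is a simplification chosen here so the example does not depend on the tokenizer data used by word_tokenize.

```python
import re
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(text):
    """Group the distinct lowercase words of `text` under their Porter stem."""
    stemmer = PorterStemmer()
    groups = defaultdict(set)
    # Simple regex tokenizer (an assumption for this sketch, used instead
    # of word_tokenize so no extra nltk data is required).
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        groups[stemmer.stem(word)].add(word)
    return dict(groups)


text = "Connected devices keep connecting; each connection makes new connections."
for stem, variants in sorted(group_by_stem(text).items()):
    print(stem, ":", sorted(variants))
```

Words such as "connected", "connecting", "connection" and "connections" should all appear under a single stem, while unrelated words keep their own entries.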
You can read more about stemming at Wikipedia.