Class GermanStemmer


  • public class GermanStemmer
    extends java.lang.Object
    A stemmer for German words.

    The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns at isst.fhg.de).

    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static java.util.Locale locale  
      private java.lang.StringBuilder sb
      Buffer for the terms while stemming them.
      private int substCount
      Number of characters removed by substitute() while stemming.
    • Constructor Summary

      Constructors 
      Constructor Description
      GermanStemmer()  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private boolean isStemmable​(java.lang.String term)
      Checks if a term could be stemmed.
      private void optimize​(java.lang.StringBuilder buffer)
      Does some optimizations on the term.
      private void removeParticleDenotion​(java.lang.StringBuilder buffer)
      Removes a particle denotation ("ge") from a term.
      private void resubstitute​(java.lang.StringBuilder buffer)
      Undoes the changes made by substitute().
      protected java.lang.String stem​(java.lang.String term)
      Stems the given term to a unique discriminator.
      private void strip​(java.lang.StringBuilder buffer)
      Performs suffix stripping (stemming) on the current term.
      private void substitute​(java.lang.StringBuilder buffer)
      Does some substitutions on the term to reduce overstemming.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • sb

        private java.lang.StringBuilder sb
        Buffer for the terms while stemming them.
      • substCount

        private int substCount
        Number of characters removed by substitute() while stemming.
      • locale

        private static final java.util.Locale locale
    • Constructor Detail

      • GermanStemmer

        public GermanStemmer()
    • Method Detail

      • stem

        protected java.lang.String stem​(java.lang.String term)
        Stems the given term to a unique discriminator.
        Parameters:
        term - The term that should be stemmed.
        Returns:
        Discriminator for term
      • isStemmable

        private boolean isStemmable​(java.lang.String term)
        Checks if a term could be stemmed.
        Returns:
        true if, and only if, the given term consists only of letters.
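
        A letters-only check of this kind can be sketched as follows. This is an illustration, not the class's actual code: the class name StemmableCheckSketch is hypothetical, and the use of Character.isLetter (which also accepts umlauts and "ß") is an assumption about what "letters" means here.

```java
// Hypothetical stand-alone sketch; not taken from GermanStemmer itself.
final class StemmableCheckSketch {

    // Returns true only if every character of the term is a letter.
    // Assumption: Character.isLetter is the letter test, so umlauts count.
    // Note that an empty term trivially passes this check.
    static boolean isStemmable(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (!Character.isLetter(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

        Under these assumptions, a term such as "haus" passes the check, while "haus1" or "e-mail" does not and would be left unstemmed.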
      • strip

        private void strip​(java.lang.StringBuilder buffer)
        Performs suffix stripping (stemming) on the current term. The stripping is reduced to the seven "base" suffixes "e", "s", "n", "t", "em", "er" and "nd", from which all regular suffixes are built. This simplification causes some overstemming and many more irregular stems, but still provides unique discriminators in most of those cases. The algorithm is context-free, except for the length restrictions.
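
        The idea of repeatedly removing one of the seven base suffixes can be sketched as below. This is a minimal illustration, not the class's implementation: the class name SuffixStripSketch is hypothetical, and the concrete length thresholds (3, 4 and 5) are assumptions chosen for the example, since the description above only states that length restrictions exist.

```java
// Hypothetical stand-alone sketch; thresholds are illustrative assumptions.
final class SuffixStripSketch {

    // Strips base suffixes from the end of the buffer until none applies
    // or the remaining term becomes too short.
    static void strip(StringBuilder buffer) {
        boolean doMore = true;
        while (doMore && buffer.length() > 3) {           // assumed minimum stem length
            int len = buffer.length();
            char last = buffer.charAt(len - 1);
            if (len > 5 && endsWith(buffer, "nd")) {      // assumed restriction for "nd"
                buffer.setLength(len - 2);
            } else if (len > 4
                    && (endsWith(buffer, "em") || endsWith(buffer, "er"))) {
                buffer.setLength(len - 2);                // assumed restriction for "em"/"er"
            } else if (last == 'e' || last == 's' || last == 'n' || last == 't') {
                buffer.setLength(len - 1);
            } else {
                doMore = false;                           // no base suffix left
            }
        }
    }

    // True if the buffer ends with the given suffix.
    private static boolean endsWith(StringBuilder sb, String suffix) {
        int start = sb.length() - suffix.length();
        return start >= 0 && sb.indexOf(suffix, start) == start;
    }
}
```

        With these assumed thresholds, "kinder" is reduced to "kind" and "laufend" to "lauf", while "haus" becomes "hau", an instance of the overstemming mentioned above.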
      • optimize

        private void optimize​(java.lang.StringBuilder buffer)
        Does some optimizations on the term. These optimizations are contextual.
      • removeParticleDenotion

        private void removeParticleDenotion​(java.lang.StringBuilder buffer)
        Removes a particle denotation ("ge") from a term.
      • substitute

        private void substitute​(java.lang.StringBuilder buffer)
        Does some substitutions on the term to reduce overstemming:

        - Substitute Umlauts with their corresponding vowel: äöü -> aou, "ß" is substituted by "ss"
        - Substitute a second char of a pair of equal characters with an asterisk: ?? -> ?*
        - Substitute some common character combinations with a token: sch/ch/ei/ie/ig/st -> $/§/%/&/#/!
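
        The three substitution rules above can be sketched as a two-pass rewrite. This is an illustration under stated assumptions, not the class's code: the class name SubstituteSketch is hypothetical, it returns a new String instead of modifying a buffer in place, it does not track substCount, and the order in which the rules are applied is an assumption. Non-ASCII characters are written as Unicode escapes so the source is encoding-independent.

```java
// Hypothetical stand-alone sketch of the substitution rules.
final class SubstituteSketch {

    static String substitute(String term) {
        // Pass 1: replace umlauts by their vowel and sharp s by "ss".
        StringBuilder plain = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': plain.append('a'); break;   // a-umlaut
                case '\u00f6': plain.append('o'); break;   // o-umlaut
                case '\u00fc': plain.append('u'); break;   // u-umlaut
                case '\u00df': plain.append("ss"); break;  // sharp s
                default: plain.append(c);
            }
        }
        String t = plain.toString();

        // Pass 2: mask doubled characters and common combinations.
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < t.length()) {
            char c = t.charAt(i);
            if (i > 0 && c == t.charAt(i - 1)) {
                out.append('*'); i += 1;                   // ?? -> ?*
            } else if (t.startsWith("sch", i)) {
                out.append('$'); i += 3;
            } else if (t.startsWith("ch", i)) {
                out.append('\u00a7'); i += 2;              // section sign token
            } else if (t.startsWith("ei", i)) {
                out.append('%'); i += 2;
            } else if (t.startsWith("ie", i)) {
                out.append('&'); i += 2;
            } else if (t.startsWith("ig", i)) {
                out.append('#'); i += 2;
            } else if (t.startsWith("st", i)) {
                out.append('!'); i += 2;
            } else {
                out.append(c); i += 1;
            }
        }
        return out.toString();
    }
}
```

        Masking a combination such as "sch" as a single token keeps a later suffix-stripping step from treating its letters as removable suffixes, which is how the substitutions reduce overstemming.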

      • resubstitute

        private void resubstitute​(java.lang.StringBuilder buffer)
        Undoes the changes made by substitute(), that is, the masked character pairs and character combinations. Umlauts remain as their corresponding vowels, and "ß" remains as "ss".
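
        The reverse mapping can be sketched as follows, using the token table given for substitute() (sch/ch/ei/ie/ig/st -> $/§/%/&/#/!). This is an illustration, not the class's code: the class name ResubstituteSketch is hypothetical, and it returns a new String rather than modifying a buffer in place.

```java
// Hypothetical stand-alone sketch of expanding the masking tokens again.
final class ResubstituteSketch {

    static String resubstitute(String masked) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < masked.length(); i++) {
            char c = masked.charAt(i);
            switch (c) {
                case '*':
                    // Asterisk stands for a repeat of the preceding character.
                    // A leading asterisk has nothing to repeat and is dropped.
                    if (out.length() > 0) {
                        out.append(out.charAt(out.length() - 1));
                    }
                    break;
                case '$': out.append("sch"); break;
                case '\u00a7': out.append("ch"); break;   // section sign token
                case '%': out.append("ei"); break;
                case '&': out.append("ie"); break;
                case '#': out.append("ig"); break;
                case '!': out.append("st"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

        Because umlauts and "ß" are not encoded as tokens, nothing in the masked term records them, so a masked "!ras*e" expands to "strasse" rather than the original spelling with "ß".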