Blog

Advanced approximate sentence matching in Python

In our last post, we went over a range of options for performing approximate sentence matching in Python, an important task for many natural language processing and machine learning applications.  To begin, we defined terms like:

  • tokens: a word, number, or other “discrete” unit of text.
  • stems: words that have had their “inflected” pieces removed based on simple rules, approximating their core meaning.
  • lemmas: words that have had their “inflected” pieces removed based on complex databases, providing accurate, context-aware meaning.
  • stopwords: low-information, repetitive, grammatical, or auxiliary words that are removed from a corpus before performing approximate matching.

We used these concepts to match sentences in the following ways:

  • Exact case-insensitive token matching after stopword removal
  • Exact case-insensitive stem matching after stopword removal
  • Exact case-insensitive lemma matching after stopword removal

The source for these examples is available in the IPython notebook here.

These methods all work by transforming or removing elements from the input sequences, then comparing the output sequences for an exact match.  If any element of the sequences differs, or their orders differ, then the match fails, and the sentences are not identified as equivalent.

In order to locate more true positives, we need to relax our definition of equivalence.  The next most logical way to do this is to swap our exact sequence matching with a set similarity measure.  One of the most common set similarity measures is the Jaccard similarity index, which is based on the simple set operations union and intersection.

[Figure: Jaccard similarity for approximate sentence matching]

The diagram above shows the intuition behind the Jaccard similarity measure.  We are comparing two sentences: A and B.  We represent each sentence as a set of tokens, stems, or lemmas, and then we compare the two sets.  The larger their overlap, the higher the degree of similarity, ranging from 0% to 100%.  If we set a threshold percentage or ratio, then we have a matching criterion to use.  Let’s walk through an example.

Let’s say we define sentences to be equivalent if 50% or more of their tokens are equivalent.  Here are two sample sentences:

  • The young cat is hungry.
  • The cat is very hungry.

When we parse these sentences to remove stopwords, we end up with the following two sets:

  • {young, cat, hungry}
  • {cat, very, hungry}

The Jaccard index is composed of a numerator and a denominator.  In the numerator, we count the number of items shared between the sets.  In the denominator, we count the total number of distinct items across both sets.  In our example above, the intersection is {cat, hungry}, which has a count of two.  The union of the sets is {young, cat, very, hungry}, which has a count of four.  Therefore, our Jaccard similarity index is two divided by four, or 50%.  Given our threshold above, we would consider this to be a match.
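The arithmetic above takes only a few lines of Python.  This is a minimal sketch (the `jaccard` function name and its edge-case handling are ours, not from the original notebook):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# The worked example from above, after stopword removal:
sent_a = {"young", "cat", "hungry"}
sent_b = {"cat", "very", "hungry"}

score = jaccard(sent_a, sent_b)
print(score)  # 0.5, i.e. 50% -- a match at our 50% threshold
```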

So how can we apply this logic to perform better approximate sentence matching?  Let’s repeat each of our methods from the last post, listed above, but replace our exact sequence matching with set similarity instead.

APPROACH #4 – CASE-INSENSITIVE TOKEN SET SIMILARITY AFTER STOPWORD REMOVAL

In this approach, we follow the method of approach #1 from our last post by stripping stopwords and applying the Punkt tokenizer to the sentences.  However, instead of looking for an exact match between the sequences of tokens, we calculate the Jaccard similarity of the token sets.  The code sample below shows us how to do this.
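The original notebook used NLTK's tokenizer and English stopword corpus; to keep this sketch self-contained, the version below substitutes a simple regex tokenizer and a tiny hand-picked stopword list, so `STOPWORDS` and the `match_tokens` helper are our assumptions rather than the post's actual code:

```python
import re

# Tiny stand-in stopword list; the original post used NLTK's English stopwords.
STOPWORDS = {"in", "the", "it", "was", "to", "as", "a", "is", "very", "by"}

def token_set(sentence):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}

def match_tokens(a, b, threshold=0.5):
    """Approach #4: compare Jaccard similarity of token sets to a threshold."""
    set_a, set_b = token_set(a), token_set(b)
    score = len(set_a & set_b) / len(set_a | set_b)
    return score >= threshold

target = "In the eighteenth century it was often convenient to regard man as a clockwork automaton."
reordered = "It was convenient to regard man as a clockwork automaton in the eighteenth century."

print(match_tokens(target, reordered))  # True: reordering no longer breaks the match
```

Because sets are unordered, sentence #7 (the reordered variant) now matches, where exact sequence comparison failed.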

Looking back at where we last left off, we were matching the following sentences in green: 

  1. “In the eighteenth century it was often convenient to regard man as a clockwork automaton.”
  2. “in the eighteenth century      it was often convenient to regard man as a clockwork automaton”
  3. “In the eighteenth century, it was often convenient to regard man as a clockwork automaton.”
  4. “In the eighteenth century, it was not accepted to regard man as a clockwork automaton.”
  5. “In the eighteenth century, it was often convenient to regard man as clockwork automata.”
  6. “In the eighteenth century, it was often convenient to regard man as clockwork automatons.”
  7. “It was convenient to regard man as a clockwork automaton in the eighteenth century.”
  8. “In the 1700s, it was common to regard man as a clockwork automaton.”
  9. “In the 1700s, it was convenient to regard man as a clockwork automaton.”
  10. “In the eighteenth century.”
  11. “Man as a clockwork automaton.”
  12. “In past centuries, man was often regarded as a clockwork automaton.”
  13. “The eighteenth century was characterized by man as a clockwork automaton.”
  14. “Very long ago in the eighteenth century, many scholars regarded man as merely a clockwork automaton.”

Once we apply this method, we are now able to match sentences #4, #7, #9, and #13.  The key difference is that we are now able to survive small insertions or order changes in the sequence of tokens, unlike our previous matching methods.

APPROACH #5 – CASE-INSENSITIVE STEM SIMILARITY AFTER STOPWORD REMOVAL

Our next approach, like approach #2 from the last post, replaces tokens with stems.  As we discussed then, stemming is the process of “uninflecting” or “reducing” words to their stems, which allows us to equate words with a common meaning.  The code sample below shows us how to perform this approximate matching.
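The original post used an NLTK stemmer for this step; as a self-contained illustration, the sketch below substitutes a deliberately crude suffix-stripping rule (`crude_stem` is our invention, not a real stemmer) just to show why stem sets match where token sets cannot:

```python
import re

STOPWORDS = {"in", "the", "it", "was", "to", "as", "a", "is", "by", "often"}

def crude_stem(token):
    """Very rough stand-in for a real stemmer (e.g. NLTK's PorterStemmer):
    strips a few common suffixes so 'regarded' and 'regard' collapse together."""
    for suffix in ("ed", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_set(sentence):
    """Lowercase, tokenize, drop stopwords, and map each token to its stem."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}

a = stem_set("Man was often regarded as a clockwork automaton.")
b = stem_set("It was convenient to regard man as a clockwork automaton.")
print(sorted(a & b))  # ['automaton', 'clockwork', 'man', 'regard']
```

Here “regarded” and “regard” stem to the same element, so the overlap between the sets grows, which is exactly why sentence #12 starts matching at this stage.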

By switching from tokens to stems, we now match sentence #12 in addition to previous sentences.

APPROACH #6 – CASE-INSENSITIVE NOUN LEMMA SIMILARITY AFTER STOPWORD REMOVAL

Our sixth approach for this section is to replace stems with lemmas.  As we discussed last time, lemmatization is a more accurate (although more expensive) approach to reducing words.

However, in addition to just lemmatizing tokens, we are also going to ignore all lemmas that do not correspond to nouns.  The motivation here is that we care about the “things” in the sentence, not the “actions” in the sentence; as a result, using only noun lemma sets will allow us to achieve more positive matches.  The code sample below demonstrates this approach.
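The original relied on NLTK's part-of-speech tagger and the WordNet lemmatizer; the sketch below fakes both with a tiny hand-written noun-lemma dictionary (`NOUN_LEMMAS` is entirely our assumption) purely to illustrate the “keep only noun lemmas” filtering step:

```python
import re

# Hypothetical stand-in for POS tagging + WordNet lemmatization: maps noun
# surface forms to their lemma; anything absent is treated as a non-noun.
NOUN_LEMMAS = {
    "century": "century", "centuries": "century",
    "man": "man", "men": "man",
    "clockwork": "clockwork",
    "automaton": "automaton", "automata": "automaton", "automatons": "automaton",
    "scholar": "scholar", "scholars": "scholar",
}

def noun_lemma_set(sentence):
    """Keep only tokens recognized as nouns, mapped to their lemma."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {NOUN_LEMMAS[t] for t in tokens if t in NOUN_LEMMAS}

target = noun_lemma_set("In the eighteenth century it was often convenient to regard man as a clockwork automaton.")
plural = noun_lemma_set("In the eighteenth century, it was often convenient to regard man as clockwork automata.")
print(target == plural)  # True: 'automata' lemmatizes to 'automaton'
```

Dropping verbs and adjectives discards exactly the words that varied most across the sample sentences, which is why nearly all of them match once this filter is in place.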

Once we apply this change, we now match all sentences except #10, which is the one sentence substantially different from our target.

NEXT STEPS IN APPROXIMATE SENTENCE MATCHING

In our next post, we’ll walk through a few additional approaches to sentence matching, including pairwise token fuzzy string matching and part-of-speech filtering using WordNet.  These refinements will allow us to more finely control our matching logic from a natural language perspective, which is an important way to control for false positives.  Stay tuned!

8 Comments

  1. Laertis Pappas

    Very nice article. When will the next post for Wordnet tutorial?

  2. Evelyn Guo

    AttributeError: module ‘nltk.tokenize.punkt’ has no attribute ‘PunktWordTokenizer’

    It occurs when I run the code.

    1. shikha

      Excellent article! really helpfull
      but
      AttributeError: module ‘nltk.tokenize.punkt’ has no attribute ‘PunktWordTokenizer’
      It occurs when I run the code.
      pls anyone fix this error.

      1. Matt

        Hi, I made a quick fix here.

        Instead of using:
        tokenizer = nltk.tokenize.punkt.PunktWordTokenizer()
        I removed it and replaced it with:

        def tokenize(text):
            tokens = nltk.word_tokenize(text)
            return tokens

        Then I changed:
        pos_x = map(get_wordnet_pos, nltk.pos_tag(tokenizer.tokenize(x)))
        to
        pos_a = map(get_wordnet_pos, nltk.pos_tag(tokenize(x)))

        and it ran well for me and replicated the expected results.

        Hope that helps:)

      2. Johan Steunenberg

        @shikha, @Evelyn Guo, according to https://github.com/nltk/nltk/pull/746 the PunktWordTokenizer should not be used. That’s the reason it was removed from the API. So you might switch to the function nltk.word_tokenize

      3. Ishaan Kapoor

        Newest nltk versions > 3.0.0 do not have PunktWordTokenizer class any more.
        Appropriate changes have to be made in estnltk so this error would not occur any more.

      4. Brian

        PunktWordTokenizer has been dropped from nltk – you can delete the tokeniser declaration line and use “nltk.tokenize.word_tokenize” instead.

        So:

        import nltk.tokenize

        tokenizer = nltk.tokenize.punkt.PunktWordTokenizer() # <– DELETE THIS LINE

        tokens_a = [token.lower().strip(string.punctuation) for token in nltk.tokenize.word_tokenize(a) \
        if token.lower().strip(string.punctuation) not in stopwords]

  3. Melvin

    Awesome tutorial…Thanks a lot
