Context snippet returns first occurrence even if the word appears as a substring #5

sethdandridge · 2019-06-03T18:00:44Z

You have a small bug in NYT-first-said.parsers.simple_scrape.context: if the word appears as a substring of another word before it appears on its own, the context snippet is taken from that first occurrence rather than from the standalone word.

This bug manifests when a new word appears first in the plural (with an s at the end) and then in the singular: the snippet always shows the context of the plural, since str.find() returns the index of the first occurrence. See: https://twitter.com/NYT_first_said/status/1135591139413778433

One possible fix would be to find the shortest word (token) in the article that contains the new word and use that to determine the snippet:

def context(content, word):
    # collect every whitespace-delimited token that contains the new word
    tokens_containing_word = [token for token in content.split() if word in token]
    # you also might want to write a custom key function here that calculates length after
    # removing punctuation, otherwise "crocodyliforms" is the same length as "crocodyliform."
    context_token = min(tokens_containing_word, key=len)
    loc = content.find(context_token)
    # existing logic proceeds...
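
One caveat with the approach above: when the shortest token is exactly the bare word, content.find(context_token) can still land inside an earlier, longer token such as the plural. A word-boundary regex might sidestep that. The following is only a rough sketch, not the project's code; the helper name find_standalone is hypothetical, and it assumes the new word starts and ends with word characters so that \b behaves as expected.

```python
import re


def find_standalone(content, word):
    """Return the index of the first whole-word match of `word` in `content`.

    Hypothetical helper: falls back to a plain substring search when the
    word never appears on its own, and returns -1 when it is absent.
    """
    # \b is a word boundary, so "crocodyliform" does not match inside
    # "crocodyliforms", but it does match before a trailing period.
    match = re.search(r"\b" + re.escape(word) + r"\b", content)
    if match:
        return match.start()
    # No standalone occurrence: behave like the str.find() lookup.
    return content.find(word)
```

The returned index could then stand in for loc in the snippet logic, so the context is cut around the standalone word whenever one exists.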

MaxBittker · 2019-06-10T05:19:06Z

I agree with your analysis & solution, thanks for opening the detailed issue!

I don't have time to fix this right now, but I will get to it eventually. I suspect there are some edge cases related to the way I split/tokenize.
