You have a small bug in NYT-first-said.parsers.simple_scrape.context: if the new word first appears as a substring of a longer word before appearing on its own, the context snippet is built around that first occurrence rather than around the standalone word.
This shows up when a new word appears first in the plural (with an s at the end) and then in the singular: the snippet always shows the context of the plural, since str.find() returns the index of the first occurrence. See: https://twitter.com/NYT_first_said/status/1135591139413778433
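To make the behaviour concrete, here is a minimal reproduction. The sample sentence is invented for illustration and is not taken from the article in the tweet; it only assumes that the location is computed with str.find() as described above.

```python
# Invented sample text: the plural appears before the singular.
content = "Fossil crocodyliforms are common. One crocodyliform stood out."
word = "crocodyliform"

# str.find() returns the index of the first occurrence of the substring,
# which here sits inside the plural token.
loc = content.find(word)

# The character right after the match shows which token was hit.
following_char = content[loc + len(word)]
print(loc, repr(following_char))
```

The character following the match is the trailing "s" of the plural, so any snippet centred on loc is centred on the plural rather than on the standalone word.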
One possible fix would be to find the shortest word (token) in the article that contains the new word and use that to determine the snippet:
    def context(content, word):
        tokens_containing_word = []
        tokens = content.split()
        for token in tokens:
            if word in token:
                tokens_containing_word.append(token)
        # you also might want to write a custom key function here that
        # calculates length after removing punctuation, otherwise
        # "crocodyliforms" is the same length as "crocodyliform."
        context_token = min(tokens_containing_word, key=len)
        loc = content.find(context_token)
        # existing logic proceeds...
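One caveat with the snippet above: when the shortest token is exactly the word itself, content.find(context_token) still lands inside the earlier plural, because the plural contains that token as a substring. A rough sketch of a variant that avoids this by keeping each token's offset, and that also applies the punctuation-stripping key mentioned in the comment, could look like the following. The helper name find_context_loc is hypothetical, and returning the index of the word inside the chosen token (rather than the token's start) is an assumption about what the existing logic expects.

```python
import re
import string


def find_context_loc(content, word):
    """Return the index of `word` inside the shortest token containing it.

    Hypothetical helper, not part of NYT-first-said. Returns -1 if `word`
    does not occur, mirroring str.find().
    """
    best_start = -1
    best_len = None
    # re.finditer keeps each token's offset in `content`, unlike str.split().
    for match in re.finditer(r"\S+", content):
        token = match.group()
        if word not in token:
            continue
        # Compare lengths with surrounding punctuation removed, so that
        # "crocodyliform." does not tie with "crocodyliforms".
        stripped_len = len(token.strip(string.punctuation))
        # Strict comparison: on a tie, the earliest token wins.
        if best_len is None or stripped_len < best_len:
            best_len = stripped_len
            best_start = match.start() + token.find(word)
    return best_start


content = "Fossil crocodyliforms are common. One crocodyliform. stood out."
loc = find_context_loc(content, "crocodyliform")
```

With this sample text, loc points into the second, standalone occurrence even though it carries a trailing period, whereas content.find("crocodyliform") would point into the plural.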