Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39191: [R] throw error when string_replace is passed vector of values in pattern #39219

Merged
merged 7 commits into from
Dec 19, 2023

Conversation

abfleishman
Copy link
Contributor

@abfleishman abfleishman commented Dec 13, 2023

Rationale for this change

See #39191 This PR will hopefully throw an informative error message to let the user know that while the stringr::str_replace_all function can handle a named vector of values as the pattern argument, the arrow R package implementation cannot.

What changes are included in this PR?

  • add tests for passing vector to the pattern argument
  • add check for length > 1 to the string replace bindings

Are these changes tested?

yes (though I need help!)

Are there any user-facing changes?

yes. Hopefully the user will be alerted by an informative error message that they cannot pass a vector to the pattern argument. No breaking changes are expected.

Copy link

⚠️ GitHub issue #39191 has been automatically assigned in GitHub to PR creator.

@abfleishman
Copy link
Contributor Author

@thisisnic I have started this PR with my failed attempt to add the check and tests I thought might be useful. I could not figure it all out and my tests fail.

@thisisnic
Copy link
Member

Awesome, thanks for submitting this PR @abfleishman!

What's happening here is that we have some code which pulls the data in R if there is an error trying to run it in Arrow - in a lot of cases, this might be because Arrow doesn't support something that the original R function can, and so we chose to make sure the code can still run if possible.

There are other functions which trigger errors which we have tested the error and warning messages for, which can be found here:

test_that("errors and warnings in string splitting", {
# These conditions generate an error, but abandon_ship() catches the error,
# issues a warning, and pulls the data into R (if computing on InMemoryDataset)
# Elsewhere we test that abandon_ship() works,
# so here we can just call the functions directly
x <- Expression$field_ref("x")
expect_error(
call_binding("str_split", x, fixed("and", ignore_case = TRUE)),
"Case-insensitive string splitting not supported in Arrow"
)
expect_error(
call_binding("str_split", x, coll("and.?")),
"Pattern modifier `coll()` not supported in Arrow",
fixed = TRUE
)
expect_error(
call_binding("str_split", x, boundary(type = "word")),
"Pattern modifier `boundary()` not supported in Arrow",
fixed = TRUE
)
expect_error(
call_binding("str_split", x, "and", n = 0),
"Splitting strings into zero parts not supported in Arrow"
)
# This condition generates a warning
expect_warning(
call_binding("str_split", x, fixed("and"), simplify = TRUE),
"Argument 'simplify = TRUE' will be ignored"
)
})

So, if you update your test to be like those ones, you should be able to test for the error message you have created.

A question for you also, to fix the reported bug, do we need to check the input length for both pattern and replacement, or just one?

@abfleishman
Copy link
Contributor Author

A question for you also, to fix the reported bug, do we need to check the input length for both pattern and replacement, or just one?
Looks like if pattern is length 1, then replacement has to be length 1. If pattern is length >1 then replacement can be length 1 or the same as pattern. So I think checking for length(pattern)==1 should be enough.

@thisisnic can you explain what the call_binding is doing?

also, I think this is ready? LMK if there is more I should do?

@thisisnic
Copy link
Member

@thisisnic can you explain what the call_binding is doing?

Sure. Here's the code for call_binding():

arrow/r/R/dplyr-funcs.R

Lines 136 to 138 in 75c6b64

call_binding <- function(fun_name, ...) {
nse_funcs[[fun_name]](...)
}

In the arrow R package, we create "bindings" which create a link between the call to the R code and an expression which can be understood and ran by Arrow - in your case in this PR, the binding to stringr::string_replace(). We keep a list of bindings in a registry - which is that the nse_funcs which is being accessed in that function is. When we call call_binding(), it finds the relevant function in the registry and runs it, to create the correct Arrow Expression based on the R call.

Here's an example of using call_binding(). It needs to know the column we want to run it on (in this case we'll call it x) and any parameter to pass in (in this case, the pattern and replacement). You can see below that it maps the stringr function str_replace to the Arrow function replace_substring_regex.

> x <- Expression$field_ref("x")
> call_binding("str_replace", x, "a", "A")
Expression
replace_substring_regex(x, {pattern="a", replacement="A", max_replacements=1})

On its own call_binding() isn't much use in day-to-day arrow usage - it's used internally in more complex functions. But when we are running tests and want to isolate that the error messages we have designed appear correctly, then we can use it and so not have to worry about the other components of the functions which use call_binding() (for example when they ignore error messages to just run the code in R if it's possible, like you found earlier).

@thisisnic
Copy link
Member

also, I think this is ready? LMK if there is more I should do?

My apologies, I had a brain fart moment when I asked if we need to test them both, and was getting the input data column mixed up with replacement. I think we do need to test both pattern and replacement.

This PR is looking excellent, and is most of the way there; just needs the extra check (for replacement added in, a test for it), and then, if you run styler::style_file() on each of the changed files, it'll fix the linting issues which are causing one of the CI jobs to fail at the moment.

Comment on lines 436 to 453
expect_error(
arrow_table(df) %>%
transmute(x = call_binding("str_replace_all", x, c("F" = "_", "b" = ""))) %>%
collect(),
regexp = "`pattern` must be a length 1 character vector",
)
expect_error(
arrow_table(df) %>%
transmute(x = call_binding("str_replace_all", x, c("F", "b"), c("_", ""))) %>%
collect(),
regexp = "`pattern` must be a length 1 character vector",
)
expect_error(
arrow_table(df) %>%
transmute(x = call_binding("str_replace_all", x, c("F"), c("_", ""))) %>%
collect(),
regexp = "`replacement` must be a length 1 character vector",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One tiny last change to suggest, and then this will be ready to merge! You don't need the arrow_table() etc stuff here as call_binding() can be called on its own. What is needed instead is to create a field reference so call_binding() has something to refer to in the expression it creates, and then expect_error() can just wrap call_binding().

For example, in that final test, you'll need something a bit shorter, like this:

  x <- Expression$field_ref("x")
  expect_error(
    call_binding("str_replace_all", x, c("F"), c("_", ""))),
    regexp = "`replacement` must be a length 1 character vector"
  )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I'll be honest. I really am lost in this whole call_bindings/Expression$field_ref("x") stuff. I guess thats what I get for being a field biologist dabbling in computer science. Thanks for walking me through this stuff!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've dived into a really tricky bit of the codebase, but you've done a great job! A lot of my own early PRs to arrow involved drawing analogies between different bits of the codebase, but not understanding exactly what was going on (and this is still the case for any of my PRs which involve any C++).

Congratulations on your first PR to Arrow! Once the CI passes I'll merge it. Welcome to the project :)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Dec 18, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Dec 18, 2023
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful, thank you for making this PR!

@thisisnic thisisnic merged commit 64fed4e into apache:main Dec 19, 2023
12 checks passed
@thisisnic thisisnic removed the awaiting changes Awaiting changes label Dec 19, 2023
@abfleishman
Copy link
Contributor Author

Nice! Thanks @thisisnic! Slightly confusing but great (small) issue to tackle as a first issue! Thanks for walking me through making the edits. I imagine you spent more time responding to me than it would have taken you to make the changes yourself, but I guess that is part of the open-source philosophy!

@thisisnic
Copy link
Member

Nice! Thanks @thisisnic! Slightly confusing but great (small) issue to tackle as a first issue! Thanks for walking me through making the edits. I imagine you spent more time responding to me than it would have taken you to make the changes yourself, but I guess that is part of the open-source philosophy!

Happy to do it again any time :)

Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 64fed4e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…r of values in `pattern` (apache#39219)

### Rationale for this change
See apache#39191 This PR will hopefully throw an informative error message to let the user know that while the stringr::str_replace_all function can handle a named vector of values as the pattern argument, the arrow R package implementation cannot.  

### What changes are included in this PR?
- [ ] add tests for passing vector to the pattern argument
- [ ] add check for length > 1 to the string replace bindings

### Are these changes tested?
yes (though I need help!)

### Are there any user-facing changes?
 yes. Hopefully the user will be alerted by an informative error message that they cannot pass a vector to the pattern argument. No breaking changes are expected.

* Closes: apache#39191

Authored-by: Abram B. Fleishman <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…r of values in `pattern` (apache#39219)

### Rationale for this change
See apache#39191 This PR will hopefully throw an informative error message to let the user know that while the stringr::str_replace_all function can handle a named vector of values as the pattern argument, the arrow R package implementation cannot.  

### What changes are included in this PR?
- [ ] add tests for passing vector to the pattern argument
- [ ] add check for length > 1 to the string replace bindings

### Are these changes tested?
yes (though I need help!)

### Are there any user-facing changes?
 yes. Hopefully the user will be alerted by an informative error message that they cannot pass a vector to the pattern argument. No breaking changes are expected.

* Closes: apache#39191

Authored-by: Abram B. Fleishman <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
@baarthur
Copy link

baarthur commented Jan 6, 2025

hey guys!

i'd like to suggest including a caveat on this page to highlight that pattern/replacement vectors are not supported.

also, user feedback: i think the error message could be a little more informative, by explicitly saying that arrow cannot support multiple pattern/replacement vectors, it would be clearer to me since i thought it was a message from R and not from arrow

@thisisnic
Copy link
Member

Hi @baarthur - as this PR has now been marged, I don't suppose you'd mind opening a new issue with this? It'd be great to include your suggestion for the updated error message. If you were interested, we'd be happy to accept a PR for those changes too :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] arrow R package: support for stringr::str_replace_all() incomplete
3 participants