Character extraction in R?


76 views

4 answers

Author reputation: 27

How would I extract 'mpg' from the following formula in R? I understand that it would be useful to convert the formula into character first and then use some kind of regex. But I don't know which one.

mpg ~ x1 + x2
Author: mDe Posted: 24.04.2017 06:36

Answers (4)

5 upvotes

Author reputation: 465226

We can use all.vars

all.vars(form)[1]
#[1] "mpg"

Or with terms

as.character(attr(terms(form), "variables")[[2]])
#[1] "mpg"

Or another option is

#[1] "mpg"
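
The code for this last option is not shown here. One expression that prints the same result, offered as an assumption rather than as the answer's own code, indexes the formula as a call. In a two-sided formula, element 2 of that call is the left side:

```r
# Assumed reconstruction, not confirmed by the answer text.
form <- mpg ~ x1 + x2

# A two-sided formula is a call of length 3: `~`, left side, right side.
lhs <- as.character(form[[2]])
print(lhs)
```

This only makes sense for a two-sided formula; on a one-sided formula, element 2 is the right side.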


The examples above assume:

form <- mpg ~ x1 + x2
Author: akrun Posted: 24.04.2017 06:38

2 upvotes

Author reputation: 14891

Given the formula:

frm <- as.formula(mpg ~ x1 + x2)

it is possible to extract the term on the left side simply using:

[1] "mpg"
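
The call itself is missing from the text above. A minimal sketch of one expression consistent with the printed output, assuming frm as defined in this answer (this is a guess at the approach, not the answer's confirmed code):

```r
# Assumed reconstruction: coerce the formula to character and take element 2.
frm <- as.formula(mpg ~ x1 + x2)

# For a two-sided formula, as.character() gives operator, left side, right side.
parts <- as.character(frm)
print(parts[2])
```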
Author: Marco Sandri Posted: 24.04.2017 06:57

3 upvotes

Author reputation: 23010

Here's an approach that uses regex

x = mpg ~ x1 + x2
gsub(" ","",gsub("~.*", "", deparse(x)))
#[1] "mpg"
Author: d.b Posted: 24.04.2017 07:08

3 upvotes

Author reputation: 12883

All of the given answers will work for your specific use case. But if you wish to use this in a more general setting, there are some caveats to be aware of. To discuss these, we'll define a few formulae:

fm <- mpg ~ x1 + x2
fm_one <- ~ x1 + x2
fm_multi <- mpg + y1 ~ x1 + x2

all.vars will return a character vector of all of the variables in the formula. It is the fastest of the options given to this point. However, it does not distinguish between variables on the left hand and right hand side of the formula, so taking the first element only gives a response variable when the formula is two-sided. Whether or not this is acceptable depends on your use case.

all.vars(fm)[1]         # "mpg"
all.vars(fm_one)[1]     # "x1" (this is a right hand side variable)
all.vars(fm_multi)[1]   # "mpg"  (missing other left hand side variables)

The terms approach (as.character(attr(terms(fm), "variables"))) generates a similar vector, but the entries start in the second position (the list call takes up the first element). Like all.vars, it returns a right hand side variable when the formula is one-sided. Unlike all.vars, it deparses each term whole, so a multi-term left side comes back as a single string rather than as separate variable names.

as.character(attr(terms(fm), "variables"))[2]        # "mpg"
as.character(attr(terms(fm_one), "variables"))[2]    # "x1"
as.character(attr(terms(fm_multi), "variables"))[2]  # "mpg + y1"

Using as.character produces a character vector of length 3 or 2, depending on whether or not the formula has a left hand side. This at least returns the entire left side, but as one deparsed string rather than as a character vector of the individual response variables. It also cannot, by position alone, distinguish the left side from the right side: the second element is the left side of a two-sided formula but the right side of a one-sided one.

as.character(fm)        # "~" "mpg" "x1 + x2"
as.character(fm_one)    # "~" "x1 + x2"
as.character(fm_multi)  # "~" "mpg + y1" "x1 + x2"
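
Because the length differs between the two cases, a length check can tell them apart. A minimal sketch, assuming a hypothetical helper named lhs_string (the name and the NA return value are illustrative choices, not part of any answer above):

```r
# Hypothetical helper: the left side as one string, or NA if there is none.
lhs_string <- function(f) {
  parts <- as.character(f)
  # Length 3 means operator, left side, right side; length 2 means no left side.
  if (length(parts) == 3L) parts[2L] else NA_character_
}

fm       <- mpg ~ x1 + x2
fm_one   <- ~ x1 + x2
fm_multi <- mpg + y1 ~ x1 + x2

res_two   <- lhs_string(fm)        # "mpg"
res_one   <- lhs_string(fm_one)    # NA
res_multi <- lhs_string(fm_multi)  # "mpg + y1"
```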

The deparse method is somewhat slower than all.vars (but still measured in nanoseconds), and has the primary advantage of distinguishing left hand side from right hand side.

gsub(" ","",gsub("~.*", "", deparse(fm)))        # "mpg"
gsub(" ","",gsub("~.*", "", deparse(fm_one)))    # ""
gsub(" ","",gsub("~.*", "", deparse(fm_multi)))  # "mpg+y1"
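
One caveat worth checking before relying on deparse: it breaks long expressions into several strings, in which case the gsub calls above return more than one element. A minimal sketch that collapses the pieces first, assuming a hypothetical long formula fm_long and a hypothetical helper lhs_text:

```r
# Hypothetical long formula, chosen only to exceed deparse()'s default line width.
fm_long <- response_variable ~ predictor_one + predictor_two +
  predictor_three + predictor_four + predictor_five

n_pieces <- length(deparse(fm_long))  # more than 1 for a formula this long

# Hypothetical helper: join the deparsed pieces, then cut at the tilde.
lhs_text <- function(f) {
  txt <- paste(deparse(f), collapse = " ")
  gsub(" ", "", sub("~.*", "", txt))
}

long_lhs  <- lhs_text(fm_long)
short_lhs <- lhs_text(mpg ~ x1 + x2)
```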

Depending on your actual needs, you may not need to protect against one-sided or multivariate formulae. If you are working in a system where it is known that all of your formulae will be univariate and two-sided, all.vars is probably your best bet. If you can't be sure of that, I'd recommend using the deparse method. That will at least ensure that what you get back comes from the left hand side when you are looking for response variables.
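
If the goal is a character vector of response variable names rather than a single string, one further option is to index the formula as a call and run all.vars on the left side only. This is not among the answers above, so treat it as a sketch under that assumption; lhs_vars is a hypothetical helper name:

```r
# Hypothetical helper: names of the variables on the left side only.
lhs_vars <- function(f) {
  stopifnot(inherits(f, "formula"))
  # A one-sided formula is a call of length 2 and has no left side.
  if (length(f) < 3L) return(character(0))
  all.vars(f[[2L]])
}

vars_two   <- lhs_vars(mpg ~ x1 + x2)        # "mpg"
vars_one   <- lhs_vars(~ x1 + x2)            # character(0)
vars_multi <- lhs_vars(mpg + y1 ~ x1 + x2)   # "mpg" "y1"
```

Note that all.vars strips function calls, so a left side such as log(mpg) would also yield "mpg".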

Author: Benjamin Posted: 24.04.2017 07:43