dplyr do: Some Tips for Using and Programming
This post was originally published on the Quantide blog. Read the full article here.
If you want to compute arbitrary operations on a data frame that return more than one number, use dplyr do()!
This post explores some basic concepts of do(), along with some advice on using it and programming with it.
do() is a verb (function) of dplyr. dplyr is a powerful R package for data manipulation, written and maintained by Hadley Wickham. It lets you perform common data manipulation tasks on data frames, such as filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data and computing arbitrary operations.
First of all, you have to install the dplyr package:
install.packages("dplyr")
and then load it:
require(dplyr)
We will analyze the use of do() with the following dataset, created with random data:
set.seed(100)
ds <- data.frame(group=c(rep("a",100), rep("b",100), rep("c",100)),
x=rnorm(n = 300, mean = 3, sd = 2), y=rnorm(n = 300, mean = 2, sd = 2))
We first transform it into a tbl_df object to get a better print method. This does not change the data in the input data frame.
ds <- tbl_df(ds)
ds
Source: local data frame [300 x 3]
group x y
(fctr) (dbl) (dbl)
1 a 1.995615 -1.71089045
2 a 3.263062 -0.03712943
3 a 2.842166 -0.09022217
4 a 4.773570 0.69742469
5 a 3.233943 2.76536531
6 a 3.637260 4.06379942
7 a 1.836419 2.26214995
8 a 4.429065 2.75438347
9 a 1.349481 -1.77539016
10 a 2.280276 3.04043881
.. ... ... ...
Basic Concepts of do() (Non-Standard Evaluation Version)
As we already said, do() computes arbitrary operations on a data frame that return more than one number.
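As a minimal sketch of what "more than one number" can mean, the example below computes three quantiles per group, so each group contributes three rows to the result. The `toy` data frame and the `probs` vector are hypothetical names introduced only for this illustration; they are not part of the original post.

```r
library(dplyr)

set.seed(100)
# Hypothetical toy data, smaller than the dataset used in this post
toy <- data.frame(group = rep(c("a", "b"), each = 50),
                  x = rnorm(100, mean = 3, sd = 2))

probs <- c(0.25, 0.50, 0.75)

# Each group returns a three-row data frame; do() stacks them
quartiles <- toy %>%
  group_by(group) %>%
  do(data.frame(prob = probs, q = unname(quantile(.$x, probs))))

quartiles
```

The result has one row per group and probability, with the group label repeated on each row.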
To use do(), you must know that:
- it always returns a data frame
- unlike the other data manipulation verbs of dplyr, do() needs the . placeholder to be specified inside the function to apply, referring to the data it has to work with:
# Head of ds
ds %>% do(head(.))
Source: local data frame [6 x 3]
group x y
(fctr) (dbl) (dbl)
1 a 1.995615 -1.71089045
2 a 3.263062 -0.03712943
3 a 2.842166 -0.09022217
4 a 4.773570 0.69742469
5 a 3.233943 2.76536531
6 a 3.637260 4.06379942
- it is conceived to be used with dplyr group_by() to compute operations within groups:
# Head of ds by group
ds %>% group_by(group) %>% do(head(.))
Source: local data frame [18 x 3]
Groups: group [3]
group x y
(fctr) (dbl) (dbl)
1 a 1.99561530 -1.71089045
2 a 3.26306233 -0.03712943
3 a 2.84216582 -0.09022217
4 a 4.77356962 0.69742469
5 a 3.23394254 2.76536531
6 a 3.63726018 4.06379942
7 b 2.33415330 -0.56965729
8 b 5.72622741 1.71643653
9 b 2.06170532 4.87756954
10 b 4.68575126 -0.08011508
11 b 0.08401255 -0.04767590
12 b 2.19938816 4.18954758
13 c 3.05634353 -0.89257491
14 c 2.28659319 2.63171152
15 c 4.70525275 1.31450497
16 c 4.02673050 -1.86270620
17 c 5.03640599 2.48564201
18 c 0.95704183 1.27446410
- the argument of do() can be named or unnamed:
  - named arguments (more than one can be supplied) become list-columns, with one element for each group:
# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(out=tail(.$x, 3))
Source: local data frame [3 x 2]
Groups: <by row>
group out
(fctr) (chr)
1 a
2 b
3 c
  - an unnamed argument (only one can be supplied) must be a data frame, and labels will be duplicated accordingly:
# Tail (last 3 obs) of x by group
ds %>% group_by(group) %>% do(data.frame(out=tail(.$x, 3)))
Source: local data frame [9 x 2]
Groups: group [3]
group out
(fctr) (dbl)
1 a 3.8270397
2 a 0.6426337
3 a 0.6519305
4 b 3.3238824
5 b 0.8290942
6 b 4.1538746
7 c 6.5861213
8 c 4.6280643
9 c 0.3599512
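The difference between the two forms can be checked directly. The sketch below uses a hypothetical `toy` data frame (not the post's ds) and assumes a dplyr version in which do() is available; it shows that the named form stores each group's result in a list-column, while the unnamed form stacks the per-group data frames.

```r
library(dplyr)

set.seed(100)
# Hypothetical toy data with three groups of ten rows each
toy <- data.frame(group = rep(c("a", "b", "c"), each = 10),
                  x = rnorm(30, mean = 3, sd = 2))

# Named argument: one row per group, results kept in the list-column `out`
named <- toy %>% group_by(group) %>% do(out = tail(.$x, 3))
named$out[[1]]   # the three values kept for the first group

# Unnamed argument: one data frame per group, stacked with the label repeated
unnamed <- toy %>% group_by(group) %>% do(data.frame(out = tail(.$x, 3)))

# Same numbers in both results, only the shape differs
all.equal(unlist(named$out), unnamed$out)
```

The list-column form is handy when the per-group result is not a data frame; the unnamed form is more convenient when the result should go straight into further dplyr verbs.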
do() works in the same way with custom functions.
Let us define the following function, which performs two simple operations returning a data frame:
my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  return(data.frame(res_x, res_y))
}
If the argument is named, the result is:
# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(out=my_fun(x=.$x, y=.$y))
Source: local data frame [3 x 2]
Groups: <by row>
group out
(fctr) (chr)
1 a <data.frame [1,2]>
2 b <data.frame [1,2]>
3 c <data.frame [1,2]>
Otherwise, if the argument is unnamed, the result is:
# Apply my_fun() function to ds by group
ds %>% group_by(group) %>% do(my_fun(x=.$x, y=.$y))
Source: local data frame [3 x 3]
Groups: group [3]
group res_x res_y
(fctr) (dbl) (dbl)
1 a 5.005825 9.167546
2 b 5.022282 8.683619
3 c 5.025586 11.240558
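When the named form has been used, the one-row data frames stored in the list-column can still be stacked afterwards. The sketch below is one possible way to do that with bind_rows(); the `toy` data frame and the object names `nested`, `unpacked` and `flat` are hypothetical and introduced only for this illustration.

```r
library(dplyr)

set.seed(100)
# Hypothetical toy data with three groups of ten rows each
toy <- data.frame(group = rep(c("a", "b", "c"), each = 10),
                  x = rnorm(30, mean = 3, sd = 2),
                  y = rnorm(30, mean = 2, sd = 2))

# Same function as defined in this post
my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  data.frame(res_x, res_y)
}

# Named form: `out` is a list-column of one-row data frames
nested <- toy %>% group_by(group) %>% do(out = my_fun(x = .$x, y = .$y))

# Stack the stored data frames and re-attach the group labels
unpacked <- bind_rows(nested$out)
unpacked$group <- nested$group

# Unnamed form, for comparison
flat <- toy %>% group_by(group) %>% do(my_fun(x = .$x, y = .$y))

all.equal(unpacked$res_x, flat$res_x)
```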
Programming with do_() (Standard Evaluation Version)
How can we enclose the previous operations inside a function? Simple! Using do_() (the SE version of do()) and the interp() function of the lazyeval package.
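The do_() and interp() approach is covered in the full article and is not reproduced here. As a different, simpler sketch that avoids both functions, the wrapper below relies on the fact that the . placeholder is an ordinary data frame, so columns can be picked by name with [[ ]]. The function name `apply_my_fun`, its arguments and the `toy` data are hypothetical; group_by_at() is assumed to be available in the installed dplyr version.

```r
library(dplyr)

# Same function as defined in this post
my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  data.frame(res_x, res_y)
}

# Hypothetical wrapper: column names are passed as character strings
apply_my_fun <- function(data, group_col, x_col, y_col) {
  data %>%
    group_by_at(group_col) %>%                    # group by a column given as a string
    do(my_fun(x = .[[x_col]], y = .[[y_col]]))    # `.` is the data for one group
}

set.seed(100)
# Hypothetical toy data
toy <- data.frame(group = rep(c("a", "b", "c"), each = 10),
                  x = rnorm(30, mean = 3, sd = 2),
                  y = rnorm(30, mean = 2, sd = 2))

res <- apply_my_fun(toy, group_col = "group", x_col = "x", y_col = "y")
res
```

This keeps the function free of non-standard evaluation on the caller's side, at the cost of requiring column names as strings.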
Continue reading on Quantide blog…
The post dplyr do: Some Tips for Using and Programming appeared first on MilanoR.