dplyr do: Some Tips for Using and Programming

秒速五厘米, 2022-07-20 15:10

This post was originally published on the Quantide blog. Read the full article here.

If you want to apply an arbitrary operation to a data frame and get more than one value back, use dplyr's do()!

This post explores some basic concepts of do() and gives some advice on using it and programming with it.

do() is a verb (function) of dplyr. dplyr is a powerful R package for data manipulation, written and maintained by Hadley Wickham. The package covers the common data manipulation tasks on data frames, such as filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data and computing arbitrary operations.

First of all, you have to install the dplyr package:

  install.packages("dplyr")

and then load it:

  library(dplyr)

We will analyze the use of do() with the following dataset, created with random data:

  # Reproducible random data: three groups of 100 observations each
  set.seed(100)
  ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                   x = rnorm(n = 300, mean = 3, sd = 2),
                   y = rnorm(n = 300, mean = 2, sd = 2))

We first convert it to a tbl_df object to get a more compact print method. The data in the input data frame are not changed.

  ds <- tbl_df(ds)
  ds
  #> Source: local data frame [300 x 3]
  #>
  #>     group        x           y
  #>    (fctr)    (dbl)       (dbl)
  #> 1       a 1.995615 -1.71089045
  #> 2       a 3.263062 -0.03712943
  #> 3       a 2.842166 -0.09022217
  #> 4       a 4.773570  0.69742469
  #> 5       a 3.233943  2.76536531
  #> 6       a 3.637260  4.06379942
  #> 7       a 1.836419  2.26214995
  #> 8       a 4.429065  2.75438347
  #> 9       a 1.349481 -1.77539016
  #> 10      a 2.280276  3.04043881
  #> ..    ...      ...         ...
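
In later dplyr releases tbl_df() is deprecated in favour of as_tibble(), so the call above may warn or fail there. A minimal sketch of the same step, assuming dplyr 1.0 or later (the name ds_tbl is ours and does not appear in the original post):

```r
library(dplyr)

# Same random data as above, rebuilt so the snippet runs on its own
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# as_tibble() is re-exported by dplyr from the tibble package
ds_tbl <- as_tibble(ds)
ds_tbl
```

On R 4.0 or later, group will print as a character column rather than a factor unless stringsAsFactors = TRUE is passed to data.frame().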

Basic Concepts of do() (Non-Standard Evaluation Version)

As we already said, do() applies an arbitrary operation to a data frame and can return more than one value.

To use do(), you must know that:

  • it always returns a data frame
  • unlike the other data manipulation verbs of dplyr, do() needs the . placeholder inside the expression it applies; . refers to the data do() has to work on (the whole data frame, or the current group).

    # Head of ds
    ds %>% do(head(.))
    #> Source: local data frame [6 x 3]
    #>
    #>    group        x           y
    #>   (fctr)    (dbl)       (dbl)
    #> 1      a 1.995615 -1.71089045
    #> 2      a 3.263062 -0.03712943
    #> 3      a 2.842166 -0.09022217
    #> 4      a 4.773570  0.69742469
    #> 5      a 3.233943  2.76536531
    #> 6      a 3.637260  4.06379942

  • it is designed to be used with dplyr's group_by() to compute operations within groups:

    # Head of ds by group
    ds %>% group_by(group) %>% do(head(.))
    #> Source: local data frame [18 x 3]
    #> Groups: group [3]
    #>
    #>     group          x           y
    #>    (fctr)      (dbl)       (dbl)
    #> 1       a 1.99561530 -1.71089045
    #> 2       a 3.26306233 -0.03712943
    #> 3       a 2.84216582 -0.09022217
    #> 4       a 4.77356962  0.69742469
    #> 5       a 3.23394254  2.76536531
    #> 6       a 3.63726018  4.06379942
    #> 7       b 2.33415330 -0.56965729
    #> 8       b 5.72622741  1.71643653
    #> 9       b 2.06170532  4.87756954
    #> 10      b 4.68575126 -0.08011508
    #> 11      b 0.08401255 -0.04767590
    #> 12      b 2.19938816  4.18954758
    #> 13      c 3.05634353 -0.89257491
    #> 14      c 2.28659319  2.63171152
    #> 15      c 4.70525275  1.31450497
    #> 16      c 4.02673050 -1.86270620
    #> 17      c 5.03640599  2.48564201
    #> 18      c 0.95704183  1.27446410

  • the argument of do() can be named or unnamed:

    • named arguments (more than one may be supplied) become list-columns, with one element for each group:

      # Tail (last 3 obs) of x by group
      ds %>% group_by(group) %>% do(out = tail(.$x, 3))
      #> Source: local data frame [3 x 2]
      #> Groups: <by row>
      #>
      #>    group      out
      #>   (fctr)    (chr)
      #> 1      a <dbl[3]>
      #> 2      b <dbl[3]>
      #> 3      c <dbl[3]>

    • an unnamed argument (only one may be supplied) must evaluate to a data frame; the group labels are repeated for each row of the result:

      # Tail (last 3 obs) of x by group
      ds %>% group_by(group) %>% do(data.frame(out = tail(.$x, 3)))
      #> Source: local data frame [9 x 2]
      #> Groups: group [3]
      #>
      #>    group       out
      #>   (fctr)     (dbl)
      #> 1      a 3.8270397
      #> 2      a 0.6426337
      #> 3      a 0.6519305
      #> 4      b 3.3238824
      #> 5      b 0.8290942
      #> 6      b 4.1538746
      #> 7      c 6.5861213
      #> 8      c 4.6280643
      #> 9      c 0.3599512
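
One common use of a named argument is to keep one fitted model per group in a list-column and then pull values out of it. The sketch below is an illustration only and is not part of the original post: the column name fit, the regression of y on x, and the object names fits and slopes are assumptions chosen for the example.

```r
library(dplyr)

# Same random data as above, rebuilt so the snippet runs on its own
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# Named argument: each group's lm object is stored in the list-column 'fit'
fits <- ds %>% group_by(group) %>% do(fit = lm(y ~ x, data = .))

# The list-column is an ordinary list, so base tools can walk over it
slopes <- vapply(fits$fit, function(m) coef(m)[["x"]], numeric(1))
names(slopes) <- as.character(fits$group)
slopes
```

Because the models are stored whole, anything that can be computed from an lm object (residuals, summaries, predictions) remains available per group afterwards.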

do() works the same way with user-defined functions.

Let us define the following function, which performs two simple operations returning a data frame:

  my_fun <- function(x, y) {
    res_x <- mean(x) + 2   # mean of x, shifted by 2
    res_y <- mean(y) * 5   # mean of y, scaled by 5
    return(data.frame(res_x, res_y))
  }

If the argument is named, the result is a list-column holding one data frame per group:

  # Apply my_fun() function to ds by group
  ds %>% group_by(group) %>% do(out = my_fun(x = .$x, y = .$y))
  #> Source: local data frame [3 x 2]
  #> Groups: <by row>
  #>
  #>    group                out
  #>   (fctr)              (chr)
  #> 1      a <data.frame [1,2]>
  #> 2      b <data.frame [1,2]>
  #> 3      c <data.frame [1,2]>

Otherwise, if the argument is unnamed, the columns of the returned data frame are bound directly to the group labels:

  # Apply my_fun() function to ds by group
  ds %>% group_by(group) %>% do(my_fun(x = .$x, y = .$y))
  #> Source: local data frame [3 x 3]
  #> Groups: group [3]
  #>
  #>    group    res_x     res_y
  #>   (fctr)    (dbl)     (dbl)
  #> 1      a 5.005825  9.167546
  #> 2      b 5.022282  8.683619
  #> 3      c 5.025586 11.240558
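
The unnamed form is not limited to one row per group: the function may return any number of rows, and the group label is repeated for each of them. A minimal sketch, assuming a helper called my_quantiles (a hypothetical name, not from the original post) that returns one row per requested quantile:

```r
library(dplyr)

# Same random data as above, rebuilt so the snippet runs on its own
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# Hypothetical helper: returns a data frame with one row per probability
my_quantiles <- function(v, probs = c(0.25, 0.5, 0.75)) {
  data.frame(prob = probs, value = unname(quantile(v, probs)))
}

# Unnamed argument: three rows per group, labelled with the group
res <- ds %>% group_by(group) %>% do(my_quantiles(.$x))
res
```

With three groups and three probabilities, the result has nine rows and the columns group, prob and value.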

Programming with do_() (Standard Evaluation Version)

How can we wrap the previous operations inside a function? Simple! We use do_() (the SE version of do()) together with the interp() function of the lazyeval package.
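
The do_() code from the original post is not reproduced here. As a rough alternative, and assuming dplyr 1.0 or later (where the underscore verbs such as do_() are deprecated), the same kind of wrapper can be sketched with plain do() and string-based column lookup. This is a swapped-in technique, not the do_()/interp() approach itself; the function name apply_my_fun and its arguments are hypothetical.

```r
library(dplyr)

# Same random data and helper as above, rebuilt so the snippet runs on its own
set.seed(100)
ds <- data.frame(group = rep(c("a", "b", "c"), each = 100),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

my_fun <- function(x, y) {
  data.frame(res_x = mean(x) + 2, res_y = mean(y) * 5)
}

# Hypothetical wrapper: grouping and value columns are passed as strings
apply_my_fun <- function(data, group, x, y) {
  data %>%
    group_by(across(all_of(group))) %>%       # group by a column given by name
    do(my_fun(x = .[[x]], y = .[[y]]))        # look columns up by name inside .
}

res <- apply_my_fun(ds, group = "group", x = "x", y = "y")
res
```

Inside do() only the . placeholder refers to the data, so x and y in .[[x]] and .[[y]] resolve to the wrapper's string arguments rather than to columns of ds.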

Continue reading on Quantide blog…

The post dplyr do: Some Tips for Using and Programming appeared first on MilanoR.
