ㅣseplyrㅣ dplyr을 reusable하게 - YA-Hwang 기술 블로그

벡터 개념을 활용하여 deplyr 코드를 reusable하게 만들어준다. –

seplyr은 Standard Evaluation Ply R이라는 뜻으로, dplyr의 function은 그대로 유지하면서 인터페이스를 향상시켰다.

타이틀은 Standard Evaluation Improved Interfaces for Common Data Manipulation Tasks

Standard Evaluation

많은 프로그래밍 언어에서 쓰이는 call by value를 말한다.

즉, function( x )에서 변수 x는 value값만을 function으로 가져온다. X는 value에 따라 여러 값이 될 수 있다.

dplyr은 매번 코드 안에 컬럼명을 직접 입력해야 하지만 seplyr은 매개변수만 수정하면 번거로움을 줄일 수 있다.

dplyr ㅣVSㅣ seplyr

seplyr은 dplyr 함수에 _se 또는 _nse를 붙여서 사용하면 된다. nse는 컬럼명에만 ‘ ‘를 취한다.

chain은 %.>% 형태를 사용한다. ( .이 하나 더 붙은 모습)

chain을 사용할 때는 함수명( . , Vector) 형태를 가진다.

select_se ㅣ&ㅣ deselct

dplyr에서는 -컬럼명으로 컬럼을 제거할 수 있다. seplyr에서는 deselect라는 함수를 사용해야 한다.

mtcars %>% select(mpg, cyl)
# seplyr
selectTerms <- c('mpg','cyl')
mtcars %.>% select_se(., selectTerms)

mtcars %>% select(-vs, -am, -gear, -carb)
# seplyr
deSelectTerms <- c('vs','am','gear','carb')
mtcars %.>% deselect(., deSelectTerms)
#
                     mpg cyl  disp  hp drat    wt  qsec
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02

filter

mtcars %>% filter(disp > 110, qsec < 17) %>% head()
# seplyr
# 조건이 하나의 character 값이다.
filterTerms <- c('disp > 110','qsec < 17')
mtcars %.>% filter_se(., filterTerms) %.>% head(.)
#
   mpg cyl  disp  hp drat   wt  qsec vs am gear carb
1 21.0   6 160.0 110 3.90 2.62 16.46  0  1    4    4
2 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4

arrange

mtcars %>% arrange(cyl, desc(gear)) %>% head()
# seplyr
orderTerms <- c('cyl', 'desc(gear)')
mtcars %.>% arrange_se(., orderTerms) %.>% head(.)
# 
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
2 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

mutate

'컬럼명' := '조건식' 활용

starwars %>% select(name, height, mass) %>% mutate(mass=floor(mass), PASS=ifelse(height>175,TRUE,FALSE))
# seplyr
selectTerms <- c('name','height','mass')
mutateTerms <- c('mass' := 'floor(mass)', 'PASS' := 'ifelse(height>175, TRUE, FALSE)')
starwars %.>% select_se(., selectTerms) %.>% mutate_se(., mutateTerms)
# A tibble: 87 x 4
                 name height  mass  PASS
                <chr>  <int> <dbl> <lgl>
 1     Luke Skywalker    172    77 FALSE
 2              C-3PO    167    75 FALSE

계산을 나눠서 연속적으로 하는 mutate

partition_mutate_se ㅣ&ㅣ mutate_seb 활용

벡터에 나열되는 순서대로 계산을 진행한다. 조건식이 복잡해지는 경우, 유용하게 사용할 수 있다.

# 100을 나눈 height를 다시 받아 BMI 계산에 활용한다.
plan <- partition_mutate_se(
        c('height' := 'height/100',
          'BMI' := 'round(mass/(height**2),2)'))
starwars %.>% select_se(., selectTerms) %.>% mutate_seb(., plan)
# A tibble: 87 x 4
                 name height  mass   BMI
                <chr>  <dbl> <dbl> <dbl>
 1     Luke Skywalker   1.72    77 26.03
 2              C-3PO   1.67    75 26.89

if_else_device()

ifelse문을 변수로 만들어 사용할 수 있다. 단점은 컬럼 생성은 안 되고 기존 컬럼값에 대한 변경만 가능하다.

ifebtest_로 시작하는 컬럼이 생성되는데 grepdf라는 함수로 해당 컬럼을 제외할 수 있다.

wrapr::grepdf - 정규식으로 컬럼명을 가져오는 함수, invert=TRUE - 해당 컬럼을 제외한 모든 컬럼명을 가져온다.

condition <- if_else_device(
  testexpr = "height<170",
  thenexprs = "height" := "NA",
  elseexprs = "mass" := "NA")

starwars %.>% mutate_se(., condition)
# A tibble: 6 x 14
            name height  mass  ifebtest_03ayk0mjkts9
           <chr>  <int> <dbl>      <lgl>
1 Luke Skywalker    172    NA      FALSE
2          C-3PO     NA    75      TRUE
# grepdf('^ifebtest_.*', ., invert=TRUE) => "name"   "height" "mass" ...
starwars %.>% mutate_se(., condition) %.>% select_se(., grepdf('^ifebtest_.*', ., invert=TRUE))
# A tibble: 6 x 14
            name height  mass
           <chr>  <int> <dbl>      
1 Luke Skywalker    172    NA  
2          C-3PO     NA    75      

그 외 다양한 함수도 한 번 알아볼 필요가 있다.

dplyr과 seplyr 속도 비교

dplyr이 seplyr보다는 속도가 약간 빠른 듯하다.

data.table은 seplyr과 함께 사용하면 매우 느려지는 경향이 있다.

# filter 비교
              test elapsed relative replications
DT.setkey.dplyr    0.88    1.000          100
       DF.dplyr    0.97    1.102          100
       DT.dplyr    1.06    1.205          100
DT.setkey.seplyr    1.19    1.352          100
      DF.seplyr    1.26    1.432          100
      DT.seplyr    1.46    1.659          100

# mutate 비교
       test replications elapsed relative
      DT          100    0.37    1.000    
DF.dplyr          100    0.64    1.730    
DT.dplyr          100    0.70    1.892      
DF.seplyr          100    1.06    2.865    
DT.seplyr          100    1.35    3.649   

seplyr은 속도보다 인터페이스에 초점이 맞춰져 있는 듯 하여 적절한 상황에 사용하는 것이 관건인 듯하다.

References :

An introduction to seplyr