データの操作

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

理想的なデータ構造

  • 1 行 = 1 観測値
  • 1 列 = 1 変数

データ加工・操作用関数

データの結合 (mutating join)

x, y, by は関数の引数です。by で指定したキー(変数名)が一致するように行を合わせることができる。

  • full_join(x, y, by):全ての xy 行と列を結合する。
  • inner_join(x, y, by)xy で共通する行と列を結合する。
  • left_join(x, y, by):左側()第 1 引数のtibble に y の変数を追加する。
  • right_join(x, y, by):右側()第 2 引数のtibble に x の変数を追加する。

データの結合 (join)

  • bind_cols():渡したtibbleを横に結合する(行数が異なったらエラーが発生する)。
  • bind_rows():渡した tibble を立てに結合する(一致する変数名を合わせてくれます)。

mutating join のつかいかた

X = tibble(x = c("A", "B", "C", "G"), y = c(NA, rnorm(3, mean = 5)))
Y = tibble(x = c("A", "C", "D", "E"), z = c(rpois(3, lambda = 5), NA))
X
# A tibble: 4 × 2
  x         y
  <chr> <dbl>
1 A     NA   
2 B      4.28
3 C      3.72
4 G      3.37
Y
# A tibble: 4 × 2
  x         z
  <chr> <int>
1 A         4
2 C         2
3 D         5
4 E        NA
full_join(X,Y, by = "x")
# A tibble: 6 × 3
  x         y     z
  <chr> <dbl> <int>
1 A     NA        4
2 B      4.28    NA
3 C      3.72     2
4 G      3.37    NA
5 D     NA        5
6 E     NA       NA
inner_join(X, Y, by = "x")
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <int>
1 A     NA        4
2 C      3.72     2
left_join(X, Y, by = "x")
# A tibble: 4 × 3
  x         y     z
  <chr> <dbl> <int>
1 A     NA        4
2 B      4.28    NA
3 C      3.72     2
4 G      3.37    NA
right_join(X, Y, by = "x")
# A tibble: 4 × 3
  x         y     z
  <chr> <dbl> <int>
1 A     NA        4
2 C      3.72     2
3 D     NA        5
4 E     NA       NA
bind_rows(X, Y)
# A tibble: 8 × 3
  x         y     z
  <chr> <dbl> <int>
1 A     NA       NA
2 B      4.28    NA
3 C      3.72    NA
4 G      3.37    NA
5 A     NA        4
6 C     NA        2
7 D     NA        5
8 E     NA       NA
bind_rows("X" = X, "Y" = Y, .id = "origin")
# A tibble: 8 × 4
  origin x         y     z
  <chr>  <chr> <dbl> <int>
1 X      A     NA       NA
2 X      B      4.28    NA
3 X      C      3.72    NA
4 X      G      3.37    NA
5 Y      A     NA        4
6 Y      C     NA        2
7 Y      D     NA        5
8 Y      E     NA       NA
bind_cols(X, Y)
New names:
• `x` -> `x...1`
• `x` -> `x...3`
# A tibble: 4 × 4
  x...1     y x...3     z
  <chr> <dbl> <chr> <int>
1 A     NA    A         4
2 B      4.28 C         2
3 C      3.72 D         5
4 G      3.37 E        NA

行と列の加工・操作用関数

列における操作

  • mutate():既存の変数の書き換えや変数の追加する
  • select():既存の変数を選らぶ
  • rename():既存の変数の名前を変える
  • pull():既存の変数をリストとして抽出する
  • `relocate()``:指定した列の位置を変える

行における操作

  • filter():条件を満たした行を返す
  • distinct():指定した変数から重複している行を外す
  • slice():指定した行インデックスを返す
  • arrange():指定した列の昇順で行を並べ替える

列の加工

iris |> as_tibble() |> mutate(P2 = Petal.Length^2)
# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    P2
          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
 1          5.1         3.5          1.4         0.2 setosa   1.96
 2          4.9         3            1.4         0.2 setosa   1.96
 3          4.7         3.2          1.3         0.2 setosa   1.69
 4          4.6         3.1          1.5         0.2 setosa   2.25
 5          5           3.6          1.4         0.2 setosa   1.96
 6          5.4         3.9          1.7         0.4 setosa   2.89
 7          4.6         3.4          1.4         0.3 setosa   1.96
 8          5           3.4          1.5         0.2 setosa   2.25
 9          4.4         2.9          1.4         0.2 setosa   1.96
10          4.9         3.1          1.5         0.1 setosa   2.25
# … with 140 more rows
iris |> as_tibble() |> select(Species, Petal.Length)
# A tibble: 150 × 2
   Species Petal.Length
   <fct>          <dbl>
 1 setosa           1.4
 2 setosa           1.4
 3 setosa           1.3
 4 setosa           1.5
 5 setosa           1.4
 6 setosa           1.7
 7 setosa           1.4
 8 setosa           1.5
 9 setosa           1.4
10 setosa           1.5
# … with 140 more rows
iris |> as_tibble() |> select(matches("Length"))
# A tibble: 150 × 2
   Sepal.Length Petal.Length
          <dbl>        <dbl>
 1          5.1          1.4
 2          4.9          1.4
 3          4.7          1.3
 4          4.6          1.5
 5          5            1.4
 6          5.4          1.7
 7          4.6          1.4
 8          5            1.5
 9          4.4          1.4
10          4.9          1.5
# … with 140 more rows
iris |> as_tibble() |> rename(PL = Petal.Length)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width    PL Petal.Width Species
          <dbl>       <dbl> <dbl>       <dbl> <fct>  
 1          5.1         3.5   1.4         0.2 setosa 
 2          4.9         3     1.4         0.2 setosa 
 3          4.7         3.2   1.3         0.2 setosa 
 4          4.6         3.1   1.5         0.2 setosa 
 5          5           3.6   1.4         0.2 setosa 
 6          5.4         3.9   1.7         0.4 setosa 
 7          4.6         3.4   1.4         0.3 setosa 
 8          5           3.4   1.5         0.2 setosa 
 9          4.4         2.9   1.4         0.2 setosa 
10          4.9         3.1   1.5         0.1 setosa 
# … with 140 more rows
iris |> as_tibble() |> 
  rename_with(~str_replace_all(.x, "[(a-z.)]", ""), .cols = matches("(Pet)|(Sep)"))
# A tibble: 150 × 5
      SL    SW    PL    PW Species
   <dbl> <dbl> <dbl> <dbl> <fct>  
 1   5.1   3.5   1.4   0.2 setosa 
 2   4.9   3     1.4   0.2 setosa 
 3   4.7   3.2   1.3   0.2 setosa 
 4   4.6   3.1   1.5   0.2 setosa 
 5   5     3.6   1.4   0.2 setosa 
 6   5.4   3.9   1.7   0.4 setosa 
 7   4.6   3.4   1.4   0.3 setosa 
 8   5     3.4   1.5   0.2 setosa 
 9   4.4   2.9   1.4   0.2 setosa 
10   4.9   3.1   1.5   0.1 setosa 
# … with 140 more rows
iris |> as_tibble() |> pull(Species)
  [1] setosa     setosa     setosa     setosa     setosa     setosa    
  [7] setosa     setosa     setosa     setosa     setosa     setosa    
 [13] setosa     setosa     setosa     setosa     setosa     setosa    
 [19] setosa     setosa     setosa     setosa     setosa     setosa    
 [25] setosa     setosa     setosa     setosa     setosa     setosa    
 [31] setosa     setosa     setosa     setosa     setosa     setosa    
 [37] setosa     setosa     setosa     setosa     setosa     setosa    
 [43] setosa     setosa     setosa     setosa     setosa     setosa    
 [49] setosa     setosa     versicolor versicolor versicolor versicolor
 [55] versicolor versicolor versicolor versicolor versicolor versicolor
 [61] versicolor versicolor versicolor versicolor versicolor versicolor
 [67] versicolor versicolor versicolor versicolor versicolor versicolor
 [73] versicolor versicolor versicolor versicolor versicolor versicolor
 [79] versicolor versicolor versicolor versicolor versicolor versicolor
 [85] versicolor versicolor versicolor versicolor versicolor versicolor
 [91] versicolor versicolor versicolor versicolor versicolor versicolor
 [97] versicolor versicolor versicolor versicolor virginica  virginica 
[103] virginica  virginica  virginica  virginica  virginica  virginica 
[109] virginica  virginica  virginica  virginica  virginica  virginica 
[115] virginica  virginica  virginica  virginica  virginica  virginica 
[121] virginica  virginica  virginica  virginica  virginica  virginica 
[127] virginica  virginica  virginica  virginica  virginica  virginica 
[133] virginica  virginica  virginica  virginica  virginica  virginica 
[139] virginica  virginica  virginica  virginica  virginica  virginica 
[145] virginica  virginica  virginica  virginica  virginica  virginica 
Levels: setosa versicolor virginica
iris |> as_tibble() |> relocate(Species, .before = "Sepal.Length")
# A tibble: 150 × 5
   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>          <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa           5.1         3.5          1.4         0.2
 2 setosa           4.9         3            1.4         0.2
 3 setosa           4.7         3.2          1.3         0.2
 4 setosa           4.6         3.1          1.5         0.2
 5 setosa           5           3.6          1.4         0.2
 6 setosa           5.4         3.9          1.7         0.4
 7 setosa           4.6         3.4          1.4         0.3
 8 setosa           5           3.4          1.5         0.2
 9 setosa           4.4         2.9          1.4         0.2
10 setosa           4.9         3.1          1.5         0.1
# … with 140 more rows
iris |> as_tibble() |> relocate(Species, matches("Length"), .before = "Sepal.Length")
# A tibble: 150 × 5
   Species Sepal.Length Petal.Length Sepal.Width Petal.Width
   <fct>          <dbl>        <dbl>       <dbl>       <dbl>
 1 setosa           5.1          1.4         3.5         0.2
 2 setosa           4.9          1.4         3           0.2
 3 setosa           4.7          1.3         3.2         0.2
 4 setosa           4.6          1.5         3.1         0.2
 5 setosa           5            1.4         3.6         0.2
 6 setosa           5.4          1.7         3.9         0.4
 7 setosa           4.6          1.4         3.4         0.3
 8 setosa           5            1.5         3.4         0.2
 9 setosa           4.4          1.4         2.9         0.2
10 setosa           4.9          1.5         3.1         0.1
# … with 140 more rows

行の加工

iris |> as_tibble() |> filter(str_detect(Species, "versicolor"))
# A tibble: 50 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          7           3.2          4.7         1.4 versicolor
 2          6.4         3.2          4.5         1.5 versicolor
 3          6.9         3.1          4.9         1.5 versicolor
 4          5.5         2.3          4           1.3 versicolor
 5          6.5         2.8          4.6         1.5 versicolor
 6          5.7         2.8          4.5         1.3 versicolor
 7          6.3         3.3          4.7         1.6 versicolor
 8          4.9         2.4          3.3         1   versicolor
 9          6.6         2.9          4.6         1.3 versicolor
10          5.2         2.7          3.9         1.4 versicolor
# … with 40 more rows
iris |> as_tibble() |> filter(Petal.Length > 6 & Sepal.Length > 7.5)
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <fct>    
1          7.6         3            6.6         2.1 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica
4          7.7         2.8          6.7         2   virginica
5          7.9         3.8          6.4         2   virginica
6          7.7         3            6.1         2.3 virginica
iris |> as_tibble() |> distinct(Species)
# A tibble: 3 × 1
  Species   
  <fct>     
1 setosa    
2 versicolor
3 virginica 
iris |> as_tibble() |> distinct(Petal.Length, .keep_all = T)
# A tibble: 43 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.1         3.5          1.4         0.2 setosa    
 2          4.7         3.2          1.3         0.2 setosa    
 3          4.6         3.1          1.5         0.2 setosa    
 4          5.4         3.9          1.7         0.4 setosa    
 5          4.8         3.4          1.6         0.2 setosa    
 6          4.3         3            1.1         0.1 setosa    
 7          5.8         4            1.2         0.2 setosa    
 8          4.6         3.6          1           0.2 setosa    
 9          4.8         3.4          1.9         0.2 setosa    
10          7           3.2          4.7         1.4 versicolor
# … with 33 more rows
iris |> as_tibble() |> slice(1:5)
# A tibble: 5 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
iris |> as_tibble() |> slice_head(n = 2)
# A tibble: 2 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
iris |> as_tibble() |> slice_tail(n = 2)
# A tibble: 2 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <fct>    
1          6.2         3.4          5.4         2.3 virginica
2          5.9         3            5.1         1.8 virginica
iris |> as_tibble() |> slice_min(Petal.Length)
# A tibble: 1 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          4.6         3.6            1         0.2 setosa 
iris |> as_tibble() |> slice_max(Petal.Length)
# A tibble: 1 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <fct>    
1          7.7         2.6          6.9         2.3 virginica
iris |> as_tibble() |> slice_sample(n = 3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5.7         4.4          1.5         0.4 setosa    
2          6.1         2.8          4.7         1.2 versicolor
3          6.3         3.3          4.7         1.6 versicolor
iris |> as_tibble() |> arrange(Sepal.Length)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          4.3         3            1.1         0.1 setosa 
 2          4.4         2.9          1.4         0.2 setosa 
 3          4.4         3            1.3         0.2 setosa 
 4          4.4         3.2          1.3         0.2 setosa 
 5          4.5         2.3          1.3         0.3 setosa 
 6          4.6         3.1          1.5         0.2 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          4.6         3.6          1           0.2 setosa 
 9          4.6         3.2          1.4         0.2 setosa 
10          4.7         3.2          1.3         0.2 setosa 
# … with 140 more rows
iris |> as_tibble() |> 
  arrange(desc(Sepal.Length), desc(Sepal.Width))
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
 1          7.9         3.8          6.4         2   virginica
 2          7.7         3.8          6.7         2.2 virginica
 3          7.7         3            6.1         2.3 virginica
 4          7.7         2.8          6.7         2   virginica
 5          7.7         2.6          6.9         2.3 virginica
 6          7.6         3            6.6         2.1 virginica
 7          7.4         2.8          6.1         1.9 virginica
 8          7.3         2.9          6.3         1.8 virginica
 9          7.2         3.6          6.1         2.5 virginica
10          7.2         3.2          6           1.8 virginica
# … with 140 more rows

グループ化・ネストに関する関数

  • group_by()tibble をグループ化する
  • group_nest():グループ化した tibble をネスト(入れ子)する
  • nest():渡した列をネストする
  • unnest():ネストされている列を展開(アンネスト)する
  • group_map():グループ化した tibble に関数を適応して、リストを返す
  • group_modify():グループ化した tibble に関数を適応して、tibble を返す

tibble のグループ化

iris |> as_tibble() |> select(1:3)
# A tibble: 150 × 3
   Sepal.Length Sepal.Width Petal.Length
          <dbl>       <dbl>        <dbl>
 1          5.1         3.5          1.4
 2          4.9         3            1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5           3.6          1.4
 6          5.4         3.9          1.7
 7          4.6         3.4          1.4
 8          5           3.4          1.5
 9          4.4         2.9          1.4
10          4.9         3.1          1.5
# … with 140 more rows
iris |> as_tibble() |> group_by(Species) |> select(1:3)
Adding missing grouping variables: `Species`
# A tibble: 150 × 4
# Groups:   Species [3]
   Species Sepal.Length Sepal.Width Petal.Length
   <fct>          <dbl>       <dbl>        <dbl>
 1 setosa           5.1         3.5          1.4
 2 setosa           4.9         3            1.4
 3 setosa           4.7         3.2          1.3
 4 setosa           4.6         3.1          1.5
 5 setosa           5           3.6          1.4
 6 setosa           5.4         3.9          1.7
 7 setosa           4.6         3.4          1.4
 8 setosa           5           3.4          1.5
 9 setosa           4.4         2.9          1.4
10 setosa           4.9         3.1          1.5
# … with 140 more rows
iris |> as_tibble() |> group_nest(Species)
# A tibble: 3 × 2
  Species                  data
  <fct>      <list<tibble[,4]>>
1 setosa               [50 × 4]
2 versicolor           [50 × 4]
3 virginica            [50 × 4]
iris |> as_tibble() |> nest(data = matches("Length|Width"))
# A tibble: 3 × 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>
iris |> as_tibble() |> group_nest(Species) |> unnest(data)
# A tibble: 150 × 5
   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>          <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa           5.1         3.5          1.4         0.2
 2 setosa           4.9         3            1.4         0.2
 3 setosa           4.7         3.2          1.3         0.2
 4 setosa           4.6         3.1          1.5         0.2
 5 setosa           5           3.6          1.4         0.2
 6 setosa           5.4         3.9          1.7         0.4
 7 setosa           4.6         3.4          1.4         0.3
 8 setosa           5           3.4          1.5         0.2
 9 setosa           4.4         2.9          1.4         0.2
10 setosa           4.9         3.1          1.5         0.1
# … with 140 more rows
iris |> as_tibble() |> group_by(Species) |> group_map(~head(.x, n = 2))
[[1]]
# A tibble: 2 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          5.1         3.5          1.4         0.2
2          4.9         3            1.4         0.2

[[2]]
# A tibble: 2 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          7           3.2          4.7         1.4
2          6.4         3.2          4.5         1.5

[[3]]
# A tibble: 2 × 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1          6.3         3.3          6           2.5
2          5.8         2.7          5.1         1.9
iris |> as_tibble() |> group_by(Species) |> group_modify(~head(.x, n = 2))
# A tibble: 6 × 5
# Groups:   Species [3]
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5.1         3.5          1.4         0.2
2 setosa              4.9         3            1.4         0.2
3 versicolor          7           3.2          4.7         1.4
4 versicolor          6.4         3.2          4.5         1.5
5 virginica           6.3         3.3          6           2.5
6 virginica           5.8         2.7          5.1         1.9

その他の関数

  • drop_na()NA(欠損値)を含む行を削除

  • replace_na()NAを他の値と書き換える

  • fill()NAを直前の値で埋める

  • separate():文字列の変数を任意の区切りで複数変数に分裂する

  • unite():複数の変数を任意の区切りで 1 列にまとめる

X
# A tibble: 4 × 2
  x         y
  <chr> <dbl>
1 A     NA   
2 B      4.28
3 C      3.72
4 G      3.37
X |> drop_na()
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 B      4.28
2 C      3.72
3 G      3.37
X |> replace_na(list(x = "Z", y = 0))
# A tibble: 4 × 2
  x         y
  <chr> <dbl>
1 A      0   
2 B      4.28
3 C      3.72
4 G      3.37
X |> mutate(y = replace_na(y, 0))
# A tibble: 4 × 2
  x         y
  <chr> <dbl>
1 A      0   
2 B      4.28
3 C      3.72
4 G      3.37
Y |> fill(z)
# A tibble: 4 × 2
  x         z
  <chr> <int>
1 A         4
2 C         2
3 D         5
4 E         5
tibble(x = c(NA, "Iris.setosa", "Iris.virginica", "Iris.versicolor")) |> 
  separate(x, into = c("Genus", "Species"))
# A tibble: 4 × 2
  Genus Species   
  <chr> <chr>     
1 <NA>  <NA>      
2 Iris  setosa    
3 Iris  virginica 
4 Iris  versicolor
tibble(x = rep("Iris", 3), y = c("setosa", "virginica", "versicolor")) |> 
  unite(Species, x, y, sep = "_")
# A tibble: 3 × 1
  Species        
  <chr>          
1 Iris_setosa    
2 Iris_virginica 
3 Iris_versicolor

ピボット・tibbleを変形する関数

  • pivot_longer()tibble を wide format (横広)から long format (縦長)に変える
  • pivot_wider()tibble をlong format から wide format に変える

重要な引数

pivot_longer()

  • cols:動かす変数
  • names_to:動かした変数の名前の移動先
  • values_to:動かした変数の値の移動先
  • names_transform:移動先の変数のタイプを変換

pivot_wider()

  • id_cols:行(値)を区別するための列名
  • names_from:移動先の列名になる変数
  • values_from:移動したい値
  • values_fill:存在しない要素の埋め込み方法
  • values_fn:行の区別ができないときの処理(デフォルトはリスト)

pivot_longer() の使い方

relig_income |> as_tibble()
# A tibble: 18 × 11
   religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Agnostic           27      34      60      81      76     137     122     109
 2 Atheist            12      27      37      52      35      70      73      59
 3 Buddhist           27      21      30      34      33      58      62      39
 4 Catholic          418     617     732     670     638    1116     949     792
 5 Don’t know/r…      15      14      15      11      10      35      21      17
 6 Evangelical …     575     869    1064     982     881    1486     949     723
 7 Hindu               1       9       7       9      11      34      47      48
 8 Historically…     228     244     236     238     197     223     131      81
 9 Jehovah's Wi…      20      27      24      24      21      30      15      11
10 Jewish             19      19      25      25      30      95      69      87
11 Mainline Prot     289     495     619     655     651    1107     939     753
12 Mormon             29      40      48      51      56     112      85      49
13 Muslim              6       7       9      10       9      23      16       8
14 Orthodox           13      17      23      32      32      47      38      42
15 Other Christ…       9       7      11      13      13      14      18      14
16 Other Faiths       20      33      40      46      49      63      46      40
17 Other World …       5       2       3       4       2       7       3       4
18 Unaffiliated      217     299     374     365     341     528     407     321
# … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
#   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
#   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`
relig_income |> as_tibble() |> 
pivot_longer(!religion, names_to = "income", values_to = "count")
# A tibble: 180 × 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# … with 170 more rows
billboard |> as_tibble()
# A tibble: 317 × 79
   artist track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9
   <chr>  <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac  Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA
 2 2Ge+h… The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA
 3 3 Doo… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51
 4 3 Doo… Loser 2000-10-21    76    76    72    69    67    65    55    59    62
 5 504 B… Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53
 6 98^0   Give… 2000-08-19    51    39    34    26    26    19     2     2     3
 7 A*Tee… Danc… 2000-07-08    97    97    96    95   100    NA    NA    NA    NA
 8 Aaliy… I Do… 2000-01-29    84    62    51    41    38    35    35    38    38
 9 Aaliy… Try … 2000-03-18    59    53    38    28    21    18    16    14    12
10 Adams… Open… 2000-08-26    76    76    74    69    68    67    61    58    57
# … with 307 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
billboard |> as_tibble() |> 
  pivot_longer(col = starts_with("wk"),
               names_to = "week", names_prefix = "wk",
               values_to = "rank", values_drop_na = TRUE)
# A tibble: 5,307 × 5
   artist  track                   date.entered week   rank
   <chr>   <chr>                   <date>       <chr> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
 8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
 9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
# … with 5,297 more rows
who |> as_tibble()
# A tibble: 7,240 × 60
   country     iso2  iso3   year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
   <chr>       <chr> <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Afghanistan AF    AFG    1980      NA      NA      NA      NA      NA      NA
 2 Afghanistan AF    AFG    1981      NA      NA      NA      NA      NA      NA
 3 Afghanistan AF    AFG    1982      NA      NA      NA      NA      NA      NA
 4 Afghanistan AF    AFG    1983      NA      NA      NA      NA      NA      NA
 5 Afghanistan AF    AFG    1984      NA      NA      NA      NA      NA      NA
 6 Afghanistan AF    AFG    1985      NA      NA      NA      NA      NA      NA
 7 Afghanistan AF    AFG    1986      NA      NA      NA      NA      NA      NA
 8 Afghanistan AF    AFG    1987      NA      NA      NA      NA      NA      NA
 9 Afghanistan AF    AFG    1988      NA      NA      NA      NA      NA      NA
10 Afghanistan AF    AFG    1989      NA      NA      NA      NA      NA      NA
# … with 7,230 more rows, 50 more variables: new_sp_m65 <dbl>,
#   new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
#   new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
#   new_sp_f65 <dbl>, new_sn_m014 <dbl>, new_sn_m1524 <dbl>,
#   new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
#   new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_f014 <dbl>,
#   new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, …
who %>% as_tibble() |>
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = c("diagnosis", "gender", "age"),
               names_pattern = "new_?(.*)_(.)(.*)",
               values_to = "count", values_drop_na = TRUE)
# A tibble: 76,046 × 8
   country     iso2  iso3   year diagnosis gender age   count
   <chr>       <chr> <chr> <dbl> <chr>     <chr>  <chr> <dbl>
 1 Afghanistan AF    AFG    1997 sp        m      014       0
 2 Afghanistan AF    AFG    1997 sp        m      1524     10
 3 Afghanistan AF    AFG    1997 sp        m      2534      6
 4 Afghanistan AF    AFG    1997 sp        m      3544      3
 5 Afghanistan AF    AFG    1997 sp        m      4554      5
 6 Afghanistan AF    AFG    1997 sp        m      5564      2
 7 Afghanistan AF    AFG    1997 sp        m      65        0
 8 Afghanistan AF    AFG    1997 sp        f      014       5
 9 Afghanistan AF    AFG    1997 sp        f      1524     38
10 Afghanistan AF    AFG    1997 sp        f      2534     36
# … with 76,036 more rows
anscombe |> as_tibble()
# A tibble: 11 × 8
      x1    x2    x3    x4    y1    y2    y3    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10    10    10     8  8.04  9.14  7.46  6.58
 2     8     8     8     8  6.95  8.14  6.77  5.76
 3    13    13    13     8  7.58  8.74 12.7   7.71
 4     9     9     9     8  8.81  8.77  7.11  8.84
 5    11    11    11     8  8.33  9.26  7.81  8.47
 6    14    14    14     8  9.96  8.1   8.84  7.04
 7     6     6     6     8  7.24  6.13  6.08  5.25
 8     4     4     4    19  4.26  3.1   5.39 12.5 
 9    12    12    12     8 10.8   9.13  8.15  5.56
10     7     7     7     8  4.82  7.26  6.42  7.91
11     5     5     5     8  5.68  4.74  5.73  6.89
anscombe %>% as_tibble() |> 
 pivot_longer(everything(), names_to = c(".value", "set"), names_pattern = "(.)(.)"
 )
# A tibble: 44 × 3
   set       x     y
   <chr> <dbl> <dbl>
 1 1        10  8.04
 2 2        10  9.14
 3 3        10  7.46
 4 4         8  6.58
 5 1         8  6.95
 6 2         8  8.14
 7 3         8  6.77
 8 4         8  5.76
 9 1        13  7.58
10 2        13  8.74
# … with 34 more rows

pivot_wider() の使い方

fish_encounters
# A tibble: 114 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# … with 104 more rows
fish_encounters |> as_tibble() |> 
  pivot_wider(names_from = station, values_from = seen)
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA    NA
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
fish_encounters
# A tibble: 114 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# … with 104 more rows
# 存在しない組み合わせの要素を埋める
fish_encounters |>  as_tibble() |> 
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1     0     0     0     0     0     0
 5 4847        1     1      1     0       0     0     0     0     0     0     0
 6 4848        1     1      1     1       0     0     0     0     0     0     0
 7 4849        1     1      0     0       0     0     0     0     0     0     0
 8 4850        1     1      0     1       1     1     1     0     0     0     0
 9 4851        1     1      0     0       0     0     0     0     0     0     0
10 4854        1     1      0     0       0     0     0     0     0     0     0
11 4855        1     1      1     1       1     0     0     0     0     0     0
12 4857        1     1      1     1       1     1     1     1     1     0     0
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1     0     0     0     0     0     0
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1     0     0
17 4863        1     1      0     0       0     0     0     0     0     0     0
18 4864        1     1      0     0       0     0     0     0     0     0     0
19 4865        1     1      1     0       0     0     0     0     0     0     0
us_rent_income |> as_tibble()
# A tibble: 104 × 5
   GEOID NAME       variable estimate   moe
   <chr> <chr>      <chr>       <dbl> <dbl>
 1 01    Alabama    income      24476   136
 2 01    Alabama    rent          747     3
 3 02    Alaska     income      32940   508
 4 02    Alaska     rent         1200    13
 5 04    Arizona    income      27517   148
 6 04    Arizona    rent          972     4
 7 05    Arkansas   income      23789   165
 8 05    Arkansas   rent          709     5
 9 06    California income      29454   109
10 06    California rent         1358     3
# … with 94 more rows
us_rent_income |> as_tibble() |> 
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
# A tibble: 52 × 6
   GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
   <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
 1 01    Alabama                        24476           747        136        3
 2 02    Alaska                         32940          1200        508       13
 3 04    Arizona                        27517           972        148        4
 4 05    Arkansas                       23789           709        165        5
 5 06    California                     29454          1358        109        3
 6 08    Colorado                       32401          1125        109        5
 7 09    Connecticut                    35326          1123        195        5
 8 10    Delaware                       31560          1076        247       10
 9 11    District of Columbia           43198          1424        681       17
10 12    Florida                        25952          1077         70        3
# … with 42 more rows
# us_rent_income  |> as_tibble() |> 
#   pivot_wider(names_from = variable,
#               names_sep = ".",
#               values_from = c(estimate, moe))

# us_rent_income  |> as_tibble() |> 
#   pivot_wider(names_from = variable,
#               names_glue = "{variable}_{.value}",
#               values_from = c(estimate, moe))
warpbreaks |> as_tibble()
# A tibble: 54 × 3
   breaks wool  tension
    <dbl> <fct> <fct>  
 1     26 A     L      
 2     30 A     L      
 3     54 A     L      
 4     25 A     L      
 5     70 A     L      
 6     52 A     L      
 7     51 A     L      
 8     26 A     L      
 9     67 A     L      
10     18 A     M      
# … with 44 more rows
warpbreaks |> as_tibble() |> 
  pivot_wider(names_from = wool,
              values_from = breaks)
Warning: Values from `breaks` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(tension, wool) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)
# A tibble: 3 × 3
  tension A         B        
  <fct>   <list>    <list>   
1 L       <dbl [9]> <dbl [9]>
2 M       <dbl [9]> <dbl [9]>
3 H       <dbl [9]> <dbl [9]>
warpbreaks |> as_tibble()
# A tibble: 54 × 3
   breaks wool  tension
    <dbl> <fct> <fct>  
 1     26 A     L      
 2     30 A     L      
 3     54 A     L      
 4     25 A     L      
 5     70 A     L      
 6     52 A     L      
 7     51 A     L      
 8     26 A     L      
 9     67 A     L      
10     18 A     M      
# … with 44 more rows
warpbreaks |> as_tibble() |> 
  pivot_wider(names_from = wool,
              values_from = breaks,
              values_fn = mean)
# A tibble: 3 × 3
  tension     A     B
  <fct>   <dbl> <dbl>
1 L        44.6  28.2
2 M        24    28.8
3 H        24.6  18.8

不都合なデータ構造

fname = "photosynthesis1_low.csv"
dset1_low = read_csv(fname)
dset1_low
# A tibble: 35 × 12
   sample   min   `0`   `5`  `10`  `15`  `20`  `25`  `30`  `35`  `40`  `45`
    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1     0  9.99 10.1  10.0  10.0  10.1   9.81 10.0  10.1  10.0  10.2 
 2      2     0  9.97  9.99 10.1  10.0   9.99  9.94  9.91  9.95  9.79  9.97
 3      3     0 10.0  10.1  10.1   9.97 10.1  10.0  10.0  10.0   9.88 10.0 
 4      4     0 10.0   9.87 10.1   9.87 10.0   9.95 10.2  10.0  10.1  10.1 
 5      5     0  9.92  9.97  9.88  9.86 10.0   9.91 10.0   9.94 10.0   9.95
 6      1     5  9.67 10.0   9.93 10.5  10.4  10.7  10.8  10.6  10.7  10.6 
 7      2     5  9.40  9.87 10.1  10.0  10.4  10.6  10.7  10.6  10.6  10.7 
 8      3     5  9.37  9.84 10.2  10.3  10.5  10.6  10.5  10.6  10.7  10.7 
 9      4     5  9.52  9.71  9.92 10.1  10.5  10.5  10.7  10.5  10.7  10.8 
10      5     5  9.65  9.83 10.1  10.4  10.4  10.6  10.5  10.5  10.7  10.7 
# … with 25 more rows

samplemin の列はサンプル番号と時間 (minutes) の変数です。 それぞれに、サンプル番号と時間の値が入っています。 0 から 45 の列には溶存酸素濃度の値が入っています。 この時の変数名は光条件ですね。

ワイドからロングへ変換

dset1_low |> 
  pivot_longer(cols = matches("[0-9]+"), names_to = "light",
               names_transform  = list(light = as.numeric))
# A tibble: 350 × 4
   sample   min light value
    <dbl> <dbl> <dbl> <dbl>
 1      1     0     0  9.99
 2      1     0     5 10.1 
 3      1     0    10 10.0 
 4      1     0    15 10.0 
 5      1     0    20 10.1 
 6      1     0    25  9.81
 7      1     0    30 10.0 
 8      1     0    35 10.1 
 9      1     0    40 10.0 
10      1     0    45 10.2 
# … with 340 more rows

残りのデータの読み込み

dset1_high = read_csv("photosynthesis1_high.csv")
dset2_low  = read_csv("photosynthesis2_low.csv")
dset2_high = read_csv("photosynthesis2_high.csv")

ピボットしてから結合

dset1_low  = dset1_low  |> pivot_longer(cols = matches("[0-9]+"), names_to = "light", names_transform = list(light = as.numeric))
dset1_high = dset1_high |> pivot_longer(cols = matches("[0-9]+"), names_to = "light", names_transform = list(light = as.numeric))
dset2_low  = dset2_low  |> pivot_longer(cols = matches("[0-9]+"), names_to = "light", names_transform = list(light = as.numeric))
dset2_high = dset2_high |> pivot_longer(cols = matches("[0-9]+"), names_to = "light", names_transform = list(light = as.numeric))
alldata = bind_rows(dset1_low, dset2_low, dset1_high, dset2_high)
alldata
# A tibble: 910 × 4
   sample   min light value
    <dbl> <dbl> <dbl> <dbl>
 1      1     0     0  9.99
 2      1     0     5 10.1 
 3      1     0    10 10.0 
 4      1     0    15 10.0 
 5      1     0    20 10.1 
 6      1     0    25  9.81
 7      1     0    30 10.0 
 8      1     0    35 10.1 
 9      1     0    40 10.0 
10      1     0    45 10.2 
# … with 900 more rows

結合してからピボット

Rows: 35 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (12): sample, min, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 35 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): sample, min, 50, 75, 100

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 35 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (12): sample, min, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 35 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): sample, min, 50, 75, 100

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dset1 = full_join(dset1_low, dset1_high, by = c("sample", "min"))
dset2 = full_join(dset2_low, dset2_high, by = c("sample", "min"))
alldata = bind_rows(dset1, dset2)
alldata = alldata |> 
  pivot_longer(cols = matches("[0-9]+"), names_to = "light", 
               names_transform = list(light = as.numeric))