A lot of your time working in R will likely be spent organizing, cleaning, and subsetting data to prepare it for plotting and statistical analysis. The tidyverse is a set of R packages that make this easier by establishing a shared set of standards for how to represent and work with data sets. It includes commonly used packages for managing data like dplyr and tidyr, and the ggplot2 graphics package. This module focuses on data management with dplyr and tidyr. For more an introduction to ggplot2, check out this module.
The goal of this module is to introduce R users to the philosophy of tidyverse and illustrate how its features can make data management and analysis easier. More specifically, this module guides users through
First thing’s first – let’s install tidyverse!
install.packages("tidyverse") # install the package
Note that installing a package just gets it onto our computer– not into R! To access a package in R, we have to load the package using the library()
function.
library(tidyverse) # load it in the current R session.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.1.4 ✔ stringr 1.4.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
You can see that loading tidyverse actually loads a set of other packages – ggplot2, purrr, tibble, etc. These are the workhorses of the tidyverse. You can also see a list of “conflicts”. These are cases where a package we just loaded (e.g., dplyr) has its own version of a base R function (e.g., filter, lag), so it’s good to be mindful of this if we use those functions.
The package dplyr within the tidyverse family has a lot of helpful functions for getting your data ready to visualize and analyze. I’ll focus on the most commonly used ones in this module, but there’s a great cheatsheet available here.
Let’s try our hand at some “data wrangling” using some published data on Antarctic penguins. These data were collected by Kristen Gorman with the Palmer Station Long Term Ecological Research Program and later developed into an R package for educational uses by Allison Horst, and Alison Hill, and Kristen Gorman. The published manuscript focused on differences in sexual size dimorphism among 3 species of penguin and their relation to sex differences in foraging ecology. The full data set has really neat info on stable isotope blood measurements and reproductive success, but for the sake of brevity we will focus on the morphological measures used to quantify sexual size dimorphism.
First let’s install the palmerpenguins package.
install.packages("palmerpenguins") # install the package
And now let’s load it.
library(palmerpenguins) # load it in current R session
The data we’ll be using are now available in an object called penguins
. The full data are in penguins_raw
, which I encourage you to explore on your own later.
Let’s take a look at our data set using the print()
function in base R.
print(penguins)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
Interesting! It tells us that the penguins dataset is saved in R as a tibble
. Tibbles are one of the main features of the tidyverse approach to data wrangling. A tibble
is like a data.frame
from base R, but with a few important differences. For one, viewing and printing tibbles is generally tidier (get it?) than data.frame
s. Notice how viewing our tibble didn’t drown the console with the entire dataset at once?? Instead we get a nice summary. It also tells us what type of information is stored in each column – factors, integers, machine precision numbers (“dbl”), and so on. The authors of palmerpenguins like the tidyverse approach to data, so the data come prepackaged as a tibble.
A data.frame
is not quite as friendly. However, you may sometimes need to turn a tibble to a data.frame to take advantage of functions in base R or other packages. Let’s transform our penguin tibble into a data.frame and take a look at it.
penguins_df = as.data.frame(penguins) # creata a data.frame by applying the as.data.frame() function to the penguin tibble
print(penguins_df)
## species island bill_length_mm bill_depth_mm flipper_length_mm
## 1 Adelie Torgersen 39.1 18.7 181
## 2 Adelie Torgersen 39.5 17.4 186
## 3 Adelie Torgersen 40.3 18.0 195
## 4 Adelie Torgersen NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193
## 6 Adelie Torgersen 39.3 20.6 190
## 7 Adelie Torgersen 38.9 17.8 181
## 8 Adelie Torgersen 39.2 19.6 195
## 9 Adelie Torgersen 34.1 18.1 193
## 10 Adelie Torgersen 42.0 20.2 190
## 11 Adelie Torgersen 37.8 17.1 186
## 12 Adelie Torgersen 37.8 17.3 180
## 13 Adelie Torgersen 41.1 17.6 182
## 14 Adelie Torgersen 38.6 21.2 191
## 15 Adelie Torgersen 34.6 21.1 198
## 16 Adelie Torgersen 36.6 17.8 185
## 17 Adelie Torgersen 38.7 19.0 195
## 18 Adelie Torgersen 42.5 20.7 197
## 19 Adelie Torgersen 34.4 18.4 184
## 20 Adelie Torgersen 46.0 21.5 194
## 21 Adelie Biscoe 37.8 18.3 174
## 22 Adelie Biscoe 37.7 18.7 180
## 23 Adelie Biscoe 35.9 19.2 189
## 24 Adelie Biscoe 38.2 18.1 185
## 25 Adelie Biscoe 38.8 17.2 180
## 26 Adelie Biscoe 35.3 18.9 187
## 27 Adelie Biscoe 40.6 18.6 183
## 28 Adelie Biscoe 40.5 17.9 187
## 29 Adelie Biscoe 37.9 18.6 172
## 30 Adelie Biscoe 40.5 18.9 180
## 31 Adelie Dream 39.5 16.7 178
## 32 Adelie Dream 37.2 18.1 178
## 33 Adelie Dream 39.5 17.8 188
## 34 Adelie Dream 40.9 18.9 184
## 35 Adelie Dream 36.4 17.0 195
## 36 Adelie Dream 39.2 21.1 196
## 37 Adelie Dream 38.8 20.0 190
## 38 Adelie Dream 42.2 18.5 180
## 39 Adelie Dream 37.6 19.3 181
## 40 Adelie Dream 39.8 19.1 184
## 41 Adelie Dream 36.5 18.0 182
## 42 Adelie Dream 40.8 18.4 195
## 43 Adelie Dream 36.0 18.5 186
## 44 Adelie Dream 44.1 19.7 196
## 45 Adelie Dream 37.0 16.9 185
## 46 Adelie Dream 39.6 18.8 190
## 47 Adelie Dream 41.1 19.0 182
## 48 Adelie Dream 37.5 18.9 179
## 49 Adelie Dream 36.0 17.9 190
## 50 Adelie Dream 42.3 21.2 191
## 51 Adelie Biscoe 39.6 17.7 186
## 52 Adelie Biscoe 40.1 18.9 188
## 53 Adelie Biscoe 35.0 17.9 190
## 54 Adelie Biscoe 42.0 19.5 200
## 55 Adelie Biscoe 34.5 18.1 187
## 56 Adelie Biscoe 41.4 18.6 191
## 57 Adelie Biscoe 39.0 17.5 186
## 58 Adelie Biscoe 40.6 18.8 193
## 59 Adelie Biscoe 36.5 16.6 181
## 60 Adelie Biscoe 37.6 19.1 194
## 61 Adelie Biscoe 35.7 16.9 185
## 62 Adelie Biscoe 41.3 21.1 195
## 63 Adelie Biscoe 37.6 17.0 185
## 64 Adelie Biscoe 41.1 18.2 192
## 65 Adelie Biscoe 36.4 17.1 184
## 66 Adelie Biscoe 41.6 18.0 192
## 67 Adelie Biscoe 35.5 16.2 195
## 68 Adelie Biscoe 41.1 19.1 188
## 69 Adelie Torgersen 35.9 16.6 190
## 70 Adelie Torgersen 41.8 19.4 198
## 71 Adelie Torgersen 33.5 19.0 190
## 72 Adelie Torgersen 39.7 18.4 190
## 73 Adelie Torgersen 39.6 17.2 196
## 74 Adelie Torgersen 45.8 18.9 197
## 75 Adelie Torgersen 35.5 17.5 190
## 76 Adelie Torgersen 42.8 18.5 195
## 77 Adelie Torgersen 40.9 16.8 191
## 78 Adelie Torgersen 37.2 19.4 184
## 79 Adelie Torgersen 36.2 16.1 187
## 80 Adelie Torgersen 42.1 19.1 195
## 81 Adelie Torgersen 34.6 17.2 189
## 82 Adelie Torgersen 42.9 17.6 196
## 83 Adelie Torgersen 36.7 18.8 187
## 84 Adelie Torgersen 35.1 19.4 193
## 85 Adelie Dream 37.3 17.8 191
## 86 Adelie Dream 41.3 20.3 194
## 87 Adelie Dream 36.3 19.5 190
## 88 Adelie Dream 36.9 18.6 189
## 89 Adelie Dream 38.3 19.2 189
## 90 Adelie Dream 38.9 18.8 190
## 91 Adelie Dream 35.7 18.0 202
## 92 Adelie Dream 41.1 18.1 205
## 93 Adelie Dream 34.0 17.1 185
## 94 Adelie Dream 39.6 18.1 186
## 95 Adelie Dream 36.2 17.3 187
## 96 Adelie Dream 40.8 18.9 208
## 97 Adelie Dream 38.1 18.6 190
## 98 Adelie Dream 40.3 18.5 196
## 99 Adelie Dream 33.1 16.1 178
## 100 Adelie Dream 43.2 18.5 192
## 101 Adelie Biscoe 35.0 17.9 192
## 102 Adelie Biscoe 41.0 20.0 203
## 103 Adelie Biscoe 37.7 16.0 183
## 104 Adelie Biscoe 37.8 20.0 190
## 105 Adelie Biscoe 37.9 18.6 193
## 106 Adelie Biscoe 39.7 18.9 184
## 107 Adelie Biscoe 38.6 17.2 199
## 108 Adelie Biscoe 38.2 20.0 190
## 109 Adelie Biscoe 38.1 17.0 181
## 110 Adelie Biscoe 43.2 19.0 197
## 111 Adelie Biscoe 38.1 16.5 198
## 112 Adelie Biscoe 45.6 20.3 191
## 113 Adelie Biscoe 39.7 17.7 193
## 114 Adelie Biscoe 42.2 19.5 197
## 115 Adelie Biscoe 39.6 20.7 191
## 116 Adelie Biscoe 42.7 18.3 196
## 117 Adelie Torgersen 38.6 17.0 188
## 118 Adelie Torgersen 37.3 20.5 199
## 119 Adelie Torgersen 35.7 17.0 189
## 120 Adelie Torgersen 41.1 18.6 189
## 121 Adelie Torgersen 36.2 17.2 187
## 122 Adelie Torgersen 37.7 19.8 198
## 123 Adelie Torgersen 40.2 17.0 176
## 124 Adelie Torgersen 41.4 18.5 202
## 125 Adelie Torgersen 35.2 15.9 186
## 126 Adelie Torgersen 40.6 19.0 199
## 127 Adelie Torgersen 38.8 17.6 191
## 128 Adelie Torgersen 41.5 18.3 195
## 129 Adelie Torgersen 39.0 17.1 191
## 130 Adelie Torgersen 44.1 18.0 210
## 131 Adelie Torgersen 38.5 17.9 190
## 132 Adelie Torgersen 43.1 19.2 197
## 133 Adelie Dream 36.8 18.5 193
## 134 Adelie Dream 37.5 18.5 199
## 135 Adelie Dream 38.1 17.6 187
## 136 Adelie Dream 41.1 17.5 190
## 137 Adelie Dream 35.6 17.5 191
## 138 Adelie Dream 40.2 20.1 200
## 139 Adelie Dream 37.0 16.5 185
## 140 Adelie Dream 39.7 17.9 193
## 141 Adelie Dream 40.2 17.1 193
## 142 Adelie Dream 40.6 17.2 187
## 143 Adelie Dream 32.1 15.5 188
## 144 Adelie Dream 40.7 17.0 190
## 145 Adelie Dream 37.3 16.8 192
## 146 Adelie Dream 39.0 18.7 185
## 147 Adelie Dream 39.2 18.6 190
## 148 Adelie Dream 36.6 18.4 184
## 149 Adelie Dream 36.0 17.8 195
## 150 Adelie Dream 37.8 18.1 193
## 151 Adelie Dream 36.0 17.1 187
## 152 Adelie Dream 41.5 18.5 201
## 153 Gentoo Biscoe 46.1 13.2 211
## 154 Gentoo Biscoe 50.0 16.3 230
## 155 Gentoo Biscoe 48.7 14.1 210
## 156 Gentoo Biscoe 50.0 15.2 218
## 157 Gentoo Biscoe 47.6 14.5 215
## 158 Gentoo Biscoe 46.5 13.5 210
## 159 Gentoo Biscoe 45.4 14.6 211
## 160 Gentoo Biscoe 46.7 15.3 219
## 161 Gentoo Biscoe 43.3 13.4 209
## 162 Gentoo Biscoe 46.8 15.4 215
## 163 Gentoo Biscoe 40.9 13.7 214
## 164 Gentoo Biscoe 49.0 16.1 216
## 165 Gentoo Biscoe 45.5 13.7 214
## 166 Gentoo Biscoe 48.4 14.6 213
## 167 Gentoo Biscoe 45.8 14.6 210
## 168 Gentoo Biscoe 49.3 15.7 217
## 169 Gentoo Biscoe 42.0 13.5 210
## 170 Gentoo Biscoe 49.2 15.2 221
## 171 Gentoo Biscoe 46.2 14.5 209
## 172 Gentoo Biscoe 48.7 15.1 222
## 173 Gentoo Biscoe 50.2 14.3 218
## 174 Gentoo Biscoe 45.1 14.5 215
## 175 Gentoo Biscoe 46.5 14.5 213
## 176 Gentoo Biscoe 46.3 15.8 215
## 177 Gentoo Biscoe 42.9 13.1 215
## 178 Gentoo Biscoe 46.1 15.1 215
## 179 Gentoo Biscoe 44.5 14.3 216
## 180 Gentoo Biscoe 47.8 15.0 215
## 181 Gentoo Biscoe 48.2 14.3 210
## 182 Gentoo Biscoe 50.0 15.3 220
## 183 Gentoo Biscoe 47.3 15.3 222
## 184 Gentoo Biscoe 42.8 14.2 209
## 185 Gentoo Biscoe 45.1 14.5 207
## 186 Gentoo Biscoe 59.6 17.0 230
## 187 Gentoo Biscoe 49.1 14.8 220
## 188 Gentoo Biscoe 48.4 16.3 220
## 189 Gentoo Biscoe 42.6 13.7 213
## 190 Gentoo Biscoe 44.4 17.3 219
## 191 Gentoo Biscoe 44.0 13.6 208
## 192 Gentoo Biscoe 48.7 15.7 208
## 193 Gentoo Biscoe 42.7 13.7 208
## 194 Gentoo Biscoe 49.6 16.0 225
## 195 Gentoo Biscoe 45.3 13.7 210
## 196 Gentoo Biscoe 49.6 15.0 216
## 197 Gentoo Biscoe 50.5 15.9 222
## 198 Gentoo Biscoe 43.6 13.9 217
## 199 Gentoo Biscoe 45.5 13.9 210
## 200 Gentoo Biscoe 50.5 15.9 225
## 201 Gentoo Biscoe 44.9 13.3 213
## 202 Gentoo Biscoe 45.2 15.8 215
## 203 Gentoo Biscoe 46.6 14.2 210
## 204 Gentoo Biscoe 48.5 14.1 220
## 205 Gentoo Biscoe 45.1 14.4 210
## 206 Gentoo Biscoe 50.1 15.0 225
## 207 Gentoo Biscoe 46.5 14.4 217
## 208 Gentoo Biscoe 45.0 15.4 220
## 209 Gentoo Biscoe 43.8 13.9 208
## 210 Gentoo Biscoe 45.5 15.0 220
## 211 Gentoo Biscoe 43.2 14.5 208
## 212 Gentoo Biscoe 50.4 15.3 224
## 213 Gentoo Biscoe 45.3 13.8 208
## 214 Gentoo Biscoe 46.2 14.9 221
## 215 Gentoo Biscoe 45.7 13.9 214
## 216 Gentoo Biscoe 54.3 15.7 231
## 217 Gentoo Biscoe 45.8 14.2 219
## 218 Gentoo Biscoe 49.8 16.8 230
## 219 Gentoo Biscoe 46.2 14.4 214
## 220 Gentoo Biscoe 49.5 16.2 229
## 221 Gentoo Biscoe 43.5 14.2 220
## 222 Gentoo Biscoe 50.7 15.0 223
## 223 Gentoo Biscoe 47.7 15.0 216
## 224 Gentoo Biscoe 46.4 15.6 221
## 225 Gentoo Biscoe 48.2 15.6 221
## 226 Gentoo Biscoe 46.5 14.8 217
## 227 Gentoo Biscoe 46.4 15.0 216
## 228 Gentoo Biscoe 48.6 16.0 230
## 229 Gentoo Biscoe 47.5 14.2 209
## 230 Gentoo Biscoe 51.1 16.3 220
## 231 Gentoo Biscoe 45.2 13.8 215
## 232 Gentoo Biscoe 45.2 16.4 223
## 233 Gentoo Biscoe 49.1 14.5 212
## 234 Gentoo Biscoe 52.5 15.6 221
## 235 Gentoo Biscoe 47.4 14.6 212
## 236 Gentoo Biscoe 50.0 15.9 224
## 237 Gentoo Biscoe 44.9 13.8 212
## 238 Gentoo Biscoe 50.8 17.3 228
## 239 Gentoo Biscoe 43.4 14.4 218
## 240 Gentoo Biscoe 51.3 14.2 218
## 241 Gentoo Biscoe 47.5 14.0 212
## 242 Gentoo Biscoe 52.1 17.0 230
## 243 Gentoo Biscoe 47.5 15.0 218
## 244 Gentoo Biscoe 52.2 17.1 228
## 245 Gentoo Biscoe 45.5 14.5 212
## 246 Gentoo Biscoe 49.5 16.1 224
## 247 Gentoo Biscoe 44.5 14.7 214
## 248 Gentoo Biscoe 50.8 15.7 226
## 249 Gentoo Biscoe 49.4 15.8 216
## 250 Gentoo Biscoe 46.9 14.6 222
## 251 Gentoo Biscoe 48.4 14.4 203
## 252 Gentoo Biscoe 51.1 16.5 225
## 253 Gentoo Biscoe 48.5 15.0 219
## 254 Gentoo Biscoe 55.9 17.0 228
## 255 Gentoo Biscoe 47.2 15.5 215
## 256 Gentoo Biscoe 49.1 15.0 228
## 257 Gentoo Biscoe 47.3 13.8 216
## 258 Gentoo Biscoe 46.8 16.1 215
## 259 Gentoo Biscoe 41.7 14.7 210
## 260 Gentoo Biscoe 53.4 15.8 219
## 261 Gentoo Biscoe 43.3 14.0 208
## 262 Gentoo Biscoe 48.1 15.1 209
## 263 Gentoo Biscoe 50.5 15.2 216
## 264 Gentoo Biscoe 49.8 15.9 229
## 265 Gentoo Biscoe 43.5 15.2 213
## 266 Gentoo Biscoe 51.5 16.3 230
## 267 Gentoo Biscoe 46.2 14.1 217
## 268 Gentoo Biscoe 55.1 16.0 230
## 269 Gentoo Biscoe 44.5 15.7 217
## 270 Gentoo Biscoe 48.8 16.2 222
## 271 Gentoo Biscoe 47.2 13.7 214
## 272 Gentoo Biscoe NA NA NA
## 273 Gentoo Biscoe 46.8 14.3 215
## 274 Gentoo Biscoe 50.4 15.7 222
## 275 Gentoo Biscoe 45.2 14.8 212
## 276 Gentoo Biscoe 49.9 16.1 213
## 277 Chinstrap Dream 46.5 17.9 192
## 278 Chinstrap Dream 50.0 19.5 196
## 279 Chinstrap Dream 51.3 19.2 193
## 280 Chinstrap Dream 45.4 18.7 188
## 281 Chinstrap Dream 52.7 19.8 197
## 282 Chinstrap Dream 45.2 17.8 198
## 283 Chinstrap Dream 46.1 18.2 178
## 284 Chinstrap Dream 51.3 18.2 197
## 285 Chinstrap Dream 46.0 18.9 195
## 286 Chinstrap Dream 51.3 19.9 198
## 287 Chinstrap Dream 46.6 17.8 193
## 288 Chinstrap Dream 51.7 20.3 194
## 289 Chinstrap Dream 47.0 17.3 185
## 290 Chinstrap Dream 52.0 18.1 201
## 291 Chinstrap Dream 45.9 17.1 190
## 292 Chinstrap Dream 50.5 19.6 201
## 293 Chinstrap Dream 50.3 20.0 197
## 294 Chinstrap Dream 58.0 17.8 181
## 295 Chinstrap Dream 46.4 18.6 190
## 296 Chinstrap Dream 49.2 18.2 195
## 297 Chinstrap Dream 42.4 17.3 181
## 298 Chinstrap Dream 48.5 17.5 191
## 299 Chinstrap Dream 43.2 16.6 187
## 300 Chinstrap Dream 50.6 19.4 193
## 301 Chinstrap Dream 46.7 17.9 195
## 302 Chinstrap Dream 52.0 19.0 197
## 303 Chinstrap Dream 50.5 18.4 200
## 304 Chinstrap Dream 49.5 19.0 200
## 305 Chinstrap Dream 46.4 17.8 191
## 306 Chinstrap Dream 52.8 20.0 205
## 307 Chinstrap Dream 40.9 16.6 187
## 308 Chinstrap Dream 54.2 20.8 201
## 309 Chinstrap Dream 42.5 16.7 187
## 310 Chinstrap Dream 51.0 18.8 203
## 311 Chinstrap Dream 49.7 18.6 195
## 312 Chinstrap Dream 47.5 16.8 199
## 313 Chinstrap Dream 47.6 18.3 195
## 314 Chinstrap Dream 52.0 20.7 210
## 315 Chinstrap Dream 46.9 16.6 192
## 316 Chinstrap Dream 53.5 19.9 205
## 317 Chinstrap Dream 49.0 19.5 210
## 318 Chinstrap Dream 46.2 17.5 187
## 319 Chinstrap Dream 50.9 19.1 196
## 320 Chinstrap Dream 45.5 17.0 196
## 321 Chinstrap Dream 50.9 17.9 196
## 322 Chinstrap Dream 50.8 18.5 201
## 323 Chinstrap Dream 50.1 17.9 190
## 324 Chinstrap Dream 49.0 19.6 212
## 325 Chinstrap Dream 51.5 18.7 187
## 326 Chinstrap Dream 49.8 17.3 198
## 327 Chinstrap Dream 48.1 16.4 199
## 328 Chinstrap Dream 51.4 19.0 201
## 329 Chinstrap Dream 45.7 17.3 193
## 330 Chinstrap Dream 50.7 19.7 203
## 331 Chinstrap Dream 42.5 17.3 187
## 332 Chinstrap Dream 52.2 18.8 197
## 333 Chinstrap Dream 45.2 16.6 191
## 334 Chinstrap Dream 49.3 19.9 203
## 335 Chinstrap Dream 50.2 18.8 202
## 336 Chinstrap Dream 45.6 19.4 194
## 337 Chinstrap Dream 51.9 19.5 206
## 338 Chinstrap Dream 46.8 16.5 189
## 339 Chinstrap Dream 45.7 17.0 195
## 340 Chinstrap Dream 55.8 19.8 207
## 341 Chinstrap Dream 43.5 18.1 202
## 342 Chinstrap Dream 49.6 18.2 193
## 343 Chinstrap Dream 50.8 19.0 210
## 344 Chinstrap Dream 50.2 18.7 198
## body_mass_g sex year
## 1 3750 male 2007
## 2 3800 female 2007
## 3 3250 female 2007
## 4 NA <NA> 2007
## 5 3450 female 2007
## 6 3650 male 2007
## 7 3625 female 2007
## 8 4675 male 2007
## 9 3475 <NA> 2007
## 10 4250 <NA> 2007
## 11 3300 <NA> 2007
## 12 3700 <NA> 2007
## 13 3200 female 2007
## 14 3800 male 2007
## 15 4400 male 2007
## 16 3700 female 2007
## 17 3450 female 2007
## 18 4500 male 2007
## 19 3325 female 2007
## 20 4200 male 2007
## 21 3400 female 2007
## 22 3600 male 2007
## 23 3800 female 2007
## 24 3950 male 2007
## 25 3800 male 2007
## 26 3800 female 2007
## 27 3550 male 2007
## 28 3200 female 2007
## 29 3150 female 2007
## 30 3950 male 2007
## 31 3250 female 2007
## 32 3900 male 2007
## 33 3300 female 2007
## 34 3900 male 2007
## 35 3325 female 2007
## 36 4150 male 2007
## 37 3950 male 2007
## 38 3550 female 2007
## 39 3300 female 2007
## 40 4650 male 2007
## 41 3150 female 2007
## 42 3900 male 2007
## 43 3100 female 2007
## 44 4400 male 2007
## 45 3000 female 2007
## 46 4600 male 2007
## 47 3425 male 2007
## 48 2975 <NA> 2007
## 49 3450 female 2007
## 50 4150 male 2007
## 51 3500 female 2008
## 52 4300 male 2008
## 53 3450 female 2008
## 54 4050 male 2008
## 55 2900 female 2008
## 56 3700 male 2008
## 57 3550 female 2008
## 58 3800 male 2008
## 59 2850 female 2008
## 60 3750 male 2008
## 61 3150 female 2008
## 62 4400 male 2008
## 63 3600 female 2008
## 64 4050 male 2008
## 65 2850 female 2008
## 66 3950 male 2008
## 67 3350 female 2008
## 68 4100 male 2008
## 69 3050 female 2008
## 70 4450 male 2008
## 71 3600 female 2008
## 72 3900 male 2008
## 73 3550 female 2008
## 74 4150 male 2008
## 75 3700 female 2008
## 76 4250 male 2008
## 77 3700 female 2008
## 78 3900 male 2008
## 79 3550 female 2008
## 80 4000 male 2008
## 81 3200 female 2008
## 82 4700 male 2008
## 83 3800 female 2008
## 84 4200 male 2008
## 85 3350 female 2008
## 86 3550 male 2008
## 87 3800 male 2008
## 88 3500 female 2008
## 89 3950 male 2008
## 90 3600 female 2008
## 91 3550 female 2008
## 92 4300 male 2008
## 93 3400 female 2008
## 94 4450 male 2008
## 95 3300 female 2008
## 96 4300 male 2008
## 97 3700 female 2008
## 98 4350 male 2008
## 99 2900 female 2008
## 100 4100 male 2008
## 101 3725 female 2009
## 102 4725 male 2009
## 103 3075 female 2009
## 104 4250 male 2009
## 105 2925 female 2009
## 106 3550 male 2009
## 107 3750 female 2009
## 108 3900 male 2009
## 109 3175 female 2009
## 110 4775 male 2009
## 111 3825 female 2009
## 112 4600 male 2009
## 113 3200 female 2009
## 114 4275 male 2009
## 115 3900 female 2009
## 116 4075 male 2009
## 117 2900 female 2009
## 118 3775 male 2009
## 119 3350 female 2009
## 120 3325 male 2009
## 121 3150 female 2009
## 122 3500 male 2009
## 123 3450 female 2009
## 124 3875 male 2009
## 125 3050 female 2009
## 126 4000 male 2009
## 127 3275 female 2009
## 128 4300 male 2009
## 129 3050 female 2009
## 130 4000 male 2009
## 131 3325 female 2009
## 132 3500 male 2009
## 133 3500 female 2009
## 134 4475 male 2009
## 135 3425 female 2009
## 136 3900 male 2009
## 137 3175 female 2009
## 138 3975 male 2009
## 139 3400 female 2009
## 140 4250 male 2009
## 141 3400 female 2009
## 142 3475 male 2009
## 143 3050 female 2009
## 144 3725 male 2009
## 145 3000 female 2009
## 146 3650 male 2009
## 147 4250 male 2009
## 148 3475 female 2009
## 149 3450 female 2009
## 150 3750 male 2009
## 151 3700 female 2009
## 152 4000 male 2009
## 153 4500 female 2007
## 154 5700 male 2007
## 155 4450 female 2007
## 156 5700 male 2007
## 157 5400 male 2007
## 158 4550 female 2007
## 159 4800 female 2007
## 160 5200 male 2007
## 161 4400 female 2007
## 162 5150 male 2007
## 163 4650 female 2007
## 164 5550 male 2007
## 165 4650 female 2007
## 166 5850 male 2007
## 167 4200 female 2007
## 168 5850 male 2007
## 169 4150 female 2007
## 170 6300 male 2007
## 171 4800 female 2007
## 172 5350 male 2007
## 173 5700 male 2007
## 174 5000 female 2007
## 175 4400 female 2007
## 176 5050 male 2007
## 177 5000 female 2007
## 178 5100 male 2007
## 179 4100 <NA> 2007
## 180 5650 male 2007
## 181 4600 female 2007
## 182 5550 male 2007
## 183 5250 male 2007
## 184 4700 female 2007
## 185 5050 female 2007
## 186 6050 male 2007
## 187 5150 female 2008
## 188 5400 male 2008
## 189 4950 female 2008
## 190 5250 male 2008
## 191 4350 female 2008
## 192 5350 male 2008
## 193 3950 female 2008
## 194 5700 male 2008
## 195 4300 female 2008
## 196 4750 male 2008
## 197 5550 male 2008
## 198 4900 female 2008
## 199 4200 female 2008
## 200 5400 male 2008
## 201 5100 female 2008
## 202 5300 male 2008
## 203 4850 female 2008
## 204 5300 male 2008
## 205 4400 female 2008
## 206 5000 male 2008
## 207 4900 female 2008
## 208 5050 male 2008
## 209 4300 female 2008
## 210 5000 male 2008
## 211 4450 female 2008
## 212 5550 male 2008
## 213 4200 female 2008
## 214 5300 male 2008
## 215 4400 female 2008
## 216 5650 male 2008
## 217 4700 female 2008
## 218 5700 male 2008
## 219 4650 <NA> 2008
## 220 5800 male 2008
## 221 4700 female 2008
## 222 5550 male 2008
## 223 4750 female 2008
## 224 5000 male 2008
## 225 5100 male 2008
## 226 5200 female 2008
## 227 4700 female 2008
## 228 5800 male 2008
## 229 4600 female 2008
## 230 6000 male 2008
## 231 4750 female 2008
## 232 5950 male 2008
## 233 4625 female 2009
## 234 5450 male 2009
## 235 4725 female 2009
## 236 5350 male 2009
## 237 4750 female 2009
## 238 5600 male 2009
## 239 4600 female 2009
## 240 5300 male 2009
## 241 4875 female 2009
## 242 5550 male 2009
## 243 4950 female 2009
## 244 5400 male 2009
## 245 4750 female 2009
## 246 5650 male 2009
## 247 4850 female 2009
## 248 5200 male 2009
## 249 4925 male 2009
## 250 4875 female 2009
## 251 4625 female 2009
## 252 5250 male 2009
## 253 4850 female 2009
## 254 5600 male 2009
## 255 4975 female 2009
## 256 5500 male 2009
## 257 4725 <NA> 2009
## 258 5500 male 2009
## 259 4700 female 2009
## 260 5500 male 2009
## 261 4575 female 2009
## 262 5500 male 2009
## 263 5000 female 2009
## 264 5950 male 2009
## 265 4650 female 2009
## 266 5500 male 2009
## 267 4375 female 2009
## 268 5850 male 2009
## 269 4875 <NA> 2009
## 270 6000 male 2009
## 271 4925 female 2009
## 272 NA <NA> 2009
## 273 4850 female 2009
## 274 5750 male 2009
## 275 5200 female 2009
## 276 5400 male 2009
## 277 3500 female 2007
## 278 3900 male 2007
## 279 3650 male 2007
## 280 3525 female 2007
## 281 3725 male 2007
## 282 3950 female 2007
## 283 3250 female 2007
## 284 3750 male 2007
## 285 4150 female 2007
## 286 3700 male 2007
## 287 3800 female 2007
## 288 3775 male 2007
## 289 3700 female 2007
## 290 4050 male 2007
## 291 3575 female 2007
## 292 4050 male 2007
## 293 3300 male 2007
## 294 3700 female 2007
## 295 3450 female 2007
## 296 4400 male 2007
## 297 3600 female 2007
## 298 3400 male 2007
## 299 2900 female 2007
## 300 3800 male 2007
## 301 3300 female 2007
## 302 4150 male 2007
## 303 3400 female 2008
## 304 3800 male 2008
## 305 3700 female 2008
## 306 4550 male 2008
## 307 3200 female 2008
## 308 4300 male 2008
## 309 3350 female 2008
## 310 4100 male 2008
## 311 3600 male 2008
## 312 3900 female 2008
## 313 3850 female 2008
## 314 4800 male 2008
## 315 2700 female 2008
## 316 4500 male 2008
## 317 3950 male 2008
## 318 3650 female 2008
## 319 3550 male 2008
## 320 3500 female 2008
## 321 3675 female 2009
## 322 4450 male 2009
## 323 3400 female 2009
## 324 4300 male 2009
## 325 3250 male 2009
## 326 3675 female 2009
## 327 3325 female 2009
## 328 3950 male 2009
## 329 3600 female 2009
## 330 4050 male 2009
## 331 3350 female 2009
## 332 3450 male 2009
## 333 3250 female 2009
## 334 4050 male 2009
## 335 3800 male 2009
## 336 3525 female 2009
## 337 3950 male 2009
## 338 3650 female 2009
## 339 3650 female 2009
## 340 4000 male 2009
## 341 3400 female 2009
## 342 3775 male 2009
## 343 4100 male 2009
## 344 3775 female 2009
So that’s a little more like R just vomited data into your console with no description of its size or the types of variables in it. To see that, we would need the str() function in base R.
str(penguins_df)
## 'data.frame': 344 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Tibbles streamline this for us a bit.
You can always get data into a tibble using as_tibble()
. Let’s try that with the penguins_df we just created and double-check that it looks like our original data set again.
penguins_tibble = as_tibble(penguins_df)
print(penguins_tibble)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
Beautiful.
- You can change a
tibble
to adata.frame
using theas.data.frame()
function.- You can change a
data.frame
to atibble
using theas_tibble()
function.- You can create a
tibble
using thetibble()
function [the syntax is largely similar to data.frame(), but with a few handy differences].
Getting back to the data, we see that we have variables for species, island, 4 different morphological variables, sex, and the year of observation. Let’s explore some of the tools that dplyr gives us for working with these data.
If we’re interested in sexual dimorphism of penguins, we’ll certainly need to visualize and analyze some of those morphological variables. But how do we work with a single variable within this data set that has many variables? We can use the select()
function to subset the data.
Let’s look at body mass first, since this seems to be the most direct measure of overall size.
select(penguins, body_mass_g) ## select(tibble, column)
## # A tibble: 344 × 1
## body_mass_g
## <int>
## 1 3750
## 2 3800
## 3 3250
## 4 NA
## 5 3450
## 6 3650
## 7 3625
## 8 4675
## 9 3475
## 10 4250
## # … with 334 more rows
Note that this is still a tibble, even though one could just as easily think of it as one string of numbers (what R calls a vector
). Also, the data are still considered to be integer data. Using select()
never changes the type of data. In the tidyverse, we have to be explicit about any changes we wish to make to the representation of our data. For example, if obtaining the variable as a vector
really is the goal, we can accomplish that using the pull()
function.
pull(penguins, body_mass_g)
## [1] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 3300 3700 3200 3800 4400
## [16] 3700 3450 4500 3325 4200 3400 3600 3800 3950 3800 3800 3550 3200 3150 3950
## [31] 3250 3900 3300 3900 3325 4150 3950 3550 3300 4650 3150 3900 3100 4400 3000
## [46] 4600 3425 2975 3450 4150 3500 4300 3450 4050 2900 3700 3550 3800 2850 3750
## [61] 3150 4400 3600 4050 2850 3950 3350 4100 3050 4450 3600 3900 3550 4150 3700
## [76] 4250 3700 3900 3550 4000 3200 4700 3800 4200 3350 3550 3800 3500 3950 3600
## [91] 3550 4300 3400 4450 3300 4300 3700 4350 2900 4100 3725 4725 3075 4250 2925
## [106] 3550 3750 3900 3175 4775 3825 4600 3200 4275 3900 4075 2900 3775 3350 3325
## [121] 3150 3500 3450 3875 3050 4000 3275 4300 3050 4000 3325 3500 3500 4475 3425
## [136] 3900 3175 3975 3400 4250 3400 3475 3050 3725 3000 3650 4250 3475 3450 3750
## [151] 3700 4000 4500 5700 4450 5700 5400 4550 4800 5200 4400 5150 4650 5550 4650
## [166] 5850 4200 5850 4150 6300 4800 5350 5700 5000 4400 5050 5000 5100 4100 5650
## [181] 4600 5550 5250 4700 5050 6050 5150 5400 4950 5250 4350 5350 3950 5700 4300
## [196] 4750 5550 4900 4200 5400 5100 5300 4850 5300 4400 5000 4900 5050 4300 5000
## [211] 4450 5550 4200 5300 4400 5650 4700 5700 4650 5800 4700 5550 4750 5000 5100
## [226] 5200 4700 5800 4600 6000 4750 5950 4625 5450 4725 5350 4750 5600 4600 5300
## [241] 4875 5550 4950 5400 4750 5650 4850 5200 4925 4875 4625 5250 4850 5600 4975
## [256] 5500 4725 5500 4700 5500 4575 5500 5000 5950 4650 5500 4375 5850 4875 6000
## [271] 4925 NA 4850 5750 5200 5400 3500 3900 3650 3525 3725 3950 3250 3750 4150
## [286] 3700 3800 3775 3700 4050 3575 4050 3300 3700 3450 4400 3600 3400 2900 3800
## [301] 3300 4150 3400 3800 3700 4550 3200 4300 3350 4100 3600 3900 3850 4800 2700
## [316] 4500 3950 3650 3550 3500 3675 4450 3400 4300 3250 3675 3325 3950 3600 4050
## [331] 3350 3450 3250 4050 3800 3525 3950 3650 3650 4000 3400 3775 4100 3775
How does this differ from base R? Well, let’s use our data.frame we created a moment ago to see. We can select a single column from a data.frame in base R using the $
operator.
penguins_df$body_mass_g ## subsetting by variable using base R's $ notation
## [1] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 3300 3700 3200 3800 4400
## [16] 3700 3450 4500 3325 4200 3400 3600 3800 3950 3800 3800 3550 3200 3150 3950
## [31] 3250 3900 3300 3900 3325 4150 3950 3550 3300 4650 3150 3900 3100 4400 3000
## [46] 4600 3425 2975 3450 4150 3500 4300 3450 4050 2900 3700 3550 3800 2850 3750
## [61] 3150 4400 3600 4050 2850 3950 3350 4100 3050 4450 3600 3900 3550 4150 3700
## [76] 4250 3700 3900 3550 4000 3200 4700 3800 4200 3350 3550 3800 3500 3950 3600
## [91] 3550 4300 3400 4450 3300 4300 3700 4350 2900 4100 3725 4725 3075 4250 2925
## [106] 3550 3750 3900 3175 4775 3825 4600 3200 4275 3900 4075 2900 3775 3350 3325
## [121] 3150 3500 3450 3875 3050 4000 3275 4300 3050 4000 3325 3500 3500 4475 3425
## [136] 3900 3175 3975 3400 4250 3400 3475 3050 3725 3000 3650 4250 3475 3450 3750
## [151] 3700 4000 4500 5700 4450 5700 5400 4550 4800 5200 4400 5150 4650 5550 4650
## [166] 5850 4200 5850 4150 6300 4800 5350 5700 5000 4400 5050 5000 5100 4100 5650
## [181] 4600 5550 5250 4700 5050 6050 5150 5400 4950 5250 4350 5350 3950 5700 4300
## [196] 4750 5550 4900 4200 5400 5100 5300 4850 5300 4400 5000 4900 5050 4300 5000
## [211] 4450 5550 4200 5300 4400 5650 4700 5700 4650 5800 4700 5550 4750 5000 5100
## [226] 5200 4700 5800 4600 6000 4750 5950 4625 5450 4725 5350 4750 5600 4600 5300
## [241] 4875 5550 4950 5400 4750 5650 4850 5200 4925 4875 4625 5250 4850 5600 4975
## [256] 5500 4725 5500 4700 5500 4575 5500 5000 5950 4650 5500 4375 5850 4875 6000
## [271] 4925 NA 4850 5750 5200 5400 3500 3900 3650 3525 3725 3950 3250 3750 4150
## [286] 3700 3800 3775 3700 4050 3575 4050 3300 3700 3450 4400 3600 3400 2900 3800
## [301] 3300 4150 3400 3800 3700 4550 3200 4300 3350 4100 3600 3900 3850 4800 2700
## [316] 4500 3950 3650 3550 3500 3675 4450 3400 4300 3250 3675 3325 3950 3600 4050
## [331] 3350 3450 3250 4050 3800 3525 3950 3650 3650 4000 3400 3775 4100 3775
Right, so base R would have transformed our data into a vector without us asking it to do so. This may seem trivial, but in more complicated situations these unintended changes to data upon subsetting can create a lot of problems.
Back to sexual dimorphism…
What we’d really like to do is to look at morphological measures like body mass grouped by sex. Can we get a tibble
with both of those variables?
select(penguins, body_mass_g, sex) ## select(tibble, column)
## # A tibble: 344 × 2
## body_mass_g sex
## <int> <fct>
## 1 3750 male
## 2 3800 female
## 3 3250 female
## 4 NA <NA>
## 5 3450 female
## 6 3650 male
## 7 3625 female
## 8 4675 male
## 9 3475 <NA>
## 10 4250 <NA>
## # … with 334 more rows
Sure! Just add more of the variables in the original tibble
to the select()
function’s arguments to retain them in the selection. There are lots of other interesting options for selecting and combining multiple variables that you can find in the help page for the select() function.
If we’re interested in sexual dimorphism, we’ll undoubtedly want to be able to look at the data for each sex separately at some point. We can do that with the filter()
function, which will take only the rows of our data that meet some criterion. The criterion can be a lot of different things, but usually it’s based on the value of one or more of the variables in the data set. For example, let’s see if we can get the body mass measurements for only females (that is, rows of the data set for which the sex variable is equal to “female”).
To do this, we need to select() the body mass and sex variables and then filter() the data for only the females. There are a couple of ways to package these two steps together.
%>%
operator to forward output from one operation to the next (tidyverse specific!)The first option is easy to follow, but can quickly clutter your R environment with a lot of similarly named intermediate objects. The second option simply nests functions to avoid creating intermediate objects – this is very handy in moderation, but overzealous nesting brings many, many parentheses and much confusion. The last option is the tidyverse’s solution to this tradeoff. Using the “pipe” operator %>%
, we can simply push the output of one operation to the next in a “pipeline” that both removes intermediates and is easy to follow.
# Intermediate Steps #
bodyMassDat = select(penguins, body_mass_g, sex) # select body mass and sex and save as a new tibble
filter(bodyMassDat, sex == "female") # filter bodyMassDat for only the rows for which sex is female
## # A tibble: 165 × 2
## body_mass_g sex
## <int> <fct>
## 1 3800 female
## 2 3250 female
## 3 3450 female
## 4 3625 female
## 5 3200 female
## 6 3700 female
## 7 3450 female
## 8 3325 female
## 9 3400 female
## 10 3800 female
## # … with 155 more rows
# Nested Steps #
filter(select(penguins, body_mass_g, sex), sex == "female") # use select() inside filter() to select variables before filter
## # A tibble: 165 × 2
## body_mass_g sex
## <int> <fct>
## 1 3800 female
## 2 3250 female
## 3 3450 female
## 4 3625 female
## 5 3200 female
## 6 3700 female
## 7 3450 female
## 8 3325 female
## 9 3400 female
## 10 3800 female
## # … with 155 more rows
# Piping #
select(penguins, body_mass_g, sex) %>% filter(sex == "female") # use select, "pipe" output forward to filter()
## # A tibble: 165 × 2
## body_mass_g sex
## <int> <fct>
## 1 3800 female
## 2 3250 female
## 3 3450 female
## 4 3625 female
## 5 3200 female
## 6 3700 female
## 7 3450 female
## 8 3325 female
## 9 3400 female
## 10 3800 female
## # … with 155 more rows
Here I used the ==
operator to specify the criterion that the variable sex
must be equal to “female”. There are lots of filter options available. We could filter rows for which a certain variable is less than or greater than a certain quantity using <
and <=
type operators, and we can find rows where a certain variable is missing using is.na()
. This can be especially helpful with the complement operator !
, for example to get the all the rows where a certian variable is not missing (try filter(data, !is.na(variable))
).
You can also filter by multiple criteria as well using logical operators like and (&
) and or (|
). Maybe we only want to look at data from females on Torgersen Island, for example. We’d need to select()
the island variable from our data set as well, and then filter()
for the rows where sex == female and island == Torgersen.
select(penguins, island, body_mass_g, sex) %>% # use select to get island, body mass, sex from the penguins data set, pipe output forward
filter(sex == "female" & island == "Torgersen") # then filter for rows where sex is female and island is Torgersen
## # A tibble: 24 × 3
## island body_mass_g sex
## <fct> <int> <fct>
## 1 Torgersen 3800 female
## 2 Torgersen 3250 female
## 3 Torgersen 3450 female
## 4 Torgersen 3625 female
## 5 Torgersen 3200 female
## 6 Torgersen 3700 female
## 7 Torgersen 3450 female
## 8 Torgersen 3325 female
## 9 Torgersen 3050 female
## 10 Torgersen 3600 female
## # … with 14 more rows
Note that the “pipeline” starts to get too long for one line. Just make sure to place returns after a %>%
and R will recognize that you are still mid-pipeline and continue to the next line.
You may have also noticed that when we used the intermediate steps or nested steps approaches, we had to specify the data used for the filter function. When we used a pipe, however, we only needed to specify the filtering criteria– the data is assumed to be the output of the preceding step. This is another handy feature of the “piping” approach.
Often, raw data do not have all of the information we would like to analyse. For example, I would be interested to know if female and male penguins differ not only in their size or mass, but in their body condition. We might imagine a “penguin plumpness” index, where individuals that have large values have greater mass for their length. One (rather crude) way of calculating such a measure might be to divide each penguin’s mass by its flipper length. Then, individuals with larger penguin plumpness indices (PPIs) will be those with more mass given (roughly) their size.
mutate(penguins, pengPlumpInd = body_mass_g / flipper_length_mm) # mutate(data, newVariableName = ...)
## # A tibble: 344 × 9
## species island bill_length_mm bill_d…¹ flipp…² body_…³ sex year pengP…⁴
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 20.7
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 20.4
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 16.7
## 4 Adelie Torgersen NA NA NA NA <NA> 2007 NA
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 17.9
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 19.2
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007 20.0
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 24.0
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007 18.0
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007 22.4
## # … with 334 more rows, and abbreviated variable names ¹bill_depth_mm,
## # ²flipper_length_mm, ³body_mass_g, ⁴pengPlumpInd
Great, so the output of the mutate function is the original data, but with a new variable calculated from existing variables using some function (here, just simple division). Very handy. You can apply all manner of functions using mutate()
, including ones you have written yourself.
If we want to make sure that we don’t calculate PPIs for penguins whose sex was unknown (there are a few in the data!), we could filter before mutating…
filter(penguins, !is.na(sex)) %>% mutate(pengPlumpInd = body_mass_g / flipper_length_mm)
## # A tibble: 333 × 9
## species island bill_length_mm bill_d…¹ flipp…² body_…³ sex year pengP…⁴
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 20.7
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 20.4
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 16.7
## 4 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 17.9
## 5 Adelie Torgersen 39.3 20.6 190 3650 male 2007 19.2
## 6 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007 20.0
## 7 Adelie Torgersen 39.2 19.6 195 4675 male 2007 24.0
## 8 Adelie Torgersen 41.1 17.6 182 3200 fema… 2007 17.6
## 9 Adelie Torgersen 38.6 21.2 191 3800 male 2007 19.9
## 10 Adelie Torgersen 34.6 21.1 198 4400 male 2007 22.2
## # … with 323 more rows, and abbreviated variable names ¹bill_depth_mm,
## # ²flipper_length_mm, ³body_mass_g, ⁴pengPlumpInd
And we get a slightly smaller tibble where we’ve not calculated PPI values for individuals whose sex is unknown anyway. Nice! This is the power of piping.
Note again that we can drop the data argument to mutate()
because the pipe tells it to use the output from the filter()
function that comes before.
Now that we know how to select and filter, we could get split our data up to get body masses and PPIs for females and males separately, apply various analyses to summarize the data for each group, and then combine that into a new tibble summarizing sexual size dimorphism. But then we’d probably want to look at each species separately, and it might be worth asking to what extent sexual size dimorphism varies among islands, or if there is variation among years, and whether these patterns are the same in the remaining morphological measures… whew! If all the required splitting and combining is starting to sound tedious, worry not! The tidyverse is here for you.
There are two functions that streamline this “split-apply-combine” approach to data analysis: group_by()
and summarize()
.
The group_by()
function tells R that any tidyverse functions that happen “downstream” in our pipeline should be applied at the level of the group, where the group corresponds to values of some variable (e.g., sex or island).
filter(penguins, !is.na(sex)) %>% group_by(sex) # filter for rows for which sex is *not* NA, and then group the result by sex.
## # A tibble: 333 × 8
## # Groups: sex [2]
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 6 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 8 Adelie Torgersen 41.1 17.6 182 3200 fema… 2007
## 9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
## 10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
## # … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
Note that this doesn’t change how the data looks (other than the tibble now mentions that it is grouped), but it drastically changes how it interacts with subsequent operations. We can group by multiple variables, and even expressions based on variables. For example, let’s create 6 groups, one for each sex on each island…
filter(penguins, !is.na(sex)) %>% group_by(sex, island) # filter for rows for which sex is *not* NA, and then group the result by sex and island.
## # A tibble: 333 × 8
## # Groups: sex, island [6]
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 6 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 8 Adelie Torgersen 41.1 17.6 182 3200 fema… 2007
## 9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
## 10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
## # … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
Minor changes to this pipeline would now allow us to quickly break our data down by species, sex, island, etc. With these groups in hand, we’d like to calculate some summary statistics at the level of the groups. This is where the summarize()
function comes in handy.
The summarize()
function creates a new data set with one row for each combination of the grouping variable(s) and one column for each summary statistic we specify. Let’s try using that to get a data set summarizing the mean body mass for each sex.
We’ll filter out the NAs again, group our data by sex, and then create a summary data frame with one row for each sex and one column called “meanBodyMassg” containing the average body mass measurement for individuals in that group (sex).
# filter penguins to remove rows for which sex is unknown, then group the data by sex, and then calculate mean body mass for each group.
filter(penguins, !is.na(sex)) %>% group_by(sex) %>% summarize(meanBodyMassg = mean(body_mass_g))
## # A tibble: 2 × 2
## sex meanBodyMassg
## <fct> <dbl>
## 1 female 3862.
## 2 male 4546.
Cool! So with one small line of code, we can get rid of NAs, group our data by sex, calculate summary statistics for each group, and make a nice tibble of the results. How delightful. Also, I know now that male penguins are, on average, much more massive than females. This is very different than the spiders I am used to studying!
Sex differences in mean body mass are central to our question of sexual size dimorphism, but we’d like to know whether this difference is small or large relative to the natural range of penguin mass. Let’s add a measure of variance to our summary. Simply head back to the pipeline and add the standard deviation in body mass to the summarize()
function’s arguments.
filter(penguins, !is.na(sex)) %>% group_by(sex) %>% summarize(meanBodyMassg = mean(body_mass_g), sdBodyMassg = sd(body_mass_g))
## # A tibble: 2 × 3
## sex meanBodyMassg sdBodyMassg
## <fct> <dbl> <dbl>
## 1 female 3862. 666.
## 2 male 4546. 788.
So while there’s a roughly 700 g difference in the mean mass of females and males, the average penguin differs from their sex-specific average by about that much (666 g for females, 788 g for males.). So while females and males are, on average, pretty different in mass, there’s a lot of overlap. I probably wouldn’t want to go out and start weighing penguins to determine their sex. That’s helpful to know.
We can add as many summary measures as we would like. Let’s go ahead and add the mean and standard deviation for PPI as well.
NOTE: we have to add our PPI calculation back into our pipeline to accomplish this. This is because our pipelines don’t alter the original data set. They simply call on the data set, perform some operation, and then pass the output down the pipeline. Of course, we could save the output data as an object (e.g., a new tibble) at any time. However, the fact that we can accomplish a lot without necessarily needing to do this is part of what makes piping so nice.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(sex) %>%
# summarize groups in terms of mean body manss, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd))
## # A tibble: 2 × 5
## sex meanBodyMassg sdBodyMassg meanPPI sdPPI
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 female 3862. 666. 19.5 2.30
## 2 male 4546. 788. 22.1 2.57
We can even calculate summary measures based on other summary measures in the same line of code, provided we make sure that we ask R to calculate them in order. For example, we might want to know the coefficient of variation (CV) in body mass for each sex (i.e., the standard deviation in body size divided by the mean body size). As long as we ask to calculate this after we calculate the mean and standard deviation, summarize()
can handle this for us. In this way, we can sort of sneak a “mutate” step into our summarizing.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(sex) %>%
# summarize groups in terms of mean body manss, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
CVBodyMassg = sdBodyMassg / meanBodyMassg,
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd))
## # A tibble: 2 × 6
## sex meanBodyMassg sdBodyMassg CVBodyMassg meanPPI sdPPI
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 3862. 666. 0.172 19.5 2.30
## 2 male 4546. 788. 0.173 22.1 2.57
Now we can see that while males are more variable in body mass than females (i.e., their sdBodyMassg
is larger), the variation is really pretty similar once we consider that males are just larger overall (i.e., their CVBodyMassg
is almost identical).
Let’s take this all the way to its limit with these data! We have 3 species, 3 islands, and two sexes we’d like to compare for each. This seems like a logical way to organize the data and summarize them. Let’s also get the sample size for each of these sub-groups – we can do this including the n()
function as one of our summary measures.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(species, island, sex) %>%
# summarize groups in terms of mean body manss, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
CVBodyMassg = sdBodyMassg / meanBodyMassg,
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd),
sampleSize = n())
## `summarise()` has grouped output by 'species', 'island'. You can override using
## the `.groups` argument.
## # A tibble: 10 × 9
## # Groups: species, island [5]
## species island sex meanBodyMa…¹ sdBod…² CVBod…³ meanPPI sdPPI sampl…⁴
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Adelie Biscoe female 3369. 343. 0.102 18.0 1.67 22
## 2 Adelie Biscoe male 4050 356. 0.0878 21.3 1.52 22
## 3 Adelie Dream female 3344. 212. 0.0634 17.8 1.09 27
## 4 Adelie Dream male 4046. 331. 0.0817 21.1 1.68 28
## 5 Adelie Torgersen female 3396. 259. 0.0763 18.0 1.48 24
## 6 Adelie Torgersen male 4035. 372. 0.0923 20.7 1.85 23
## 7 Chinstrap Dream female 3527. 285. 0.0809 18.4 1.46 34
## 8 Chinstrap Dream male 3939. 362. 0.0919 19.7 1.49 34
## 9 Gentoo Biscoe female 4680. 282. 0.0602 22.0 1.19 58
## 10 Gentoo Biscoe male 5485. 313. 0.0571 24.8 1.35 61
## # … with abbreviated variable names ¹meanBodyMassg, ²sdBodyMassg, ³CVBodyMassg,
## # ⁴sampleSize
In some cases it might be useful to sort this summary by some variable of interest. For example, if we were interested in identifying the combination of species-island-sex with the biggest penguins, we could use the arrange
function to sort the output in order of descending body mass.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(species, island, sex) %>%
# summarize groups in terms of mean body manss, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
CVBodyMassg = sdBodyMassg / meanBodyMassg,
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd),
sampleSize = n()) %>%
# sort by descending body mass to get largest at top -- careful that we sort on the new summary variables, not theoriginal variables!
arrange(desc(meanBodyMassg))
## `summarise()` has grouped output by 'species', 'island'. You can override using
## the `.groups` argument.
## # A tibble: 10 × 9
## # Groups: species, island [5]
## species island sex meanBodyMa…¹ sdBod…² CVBod…³ meanPPI sdPPI sampl…⁴
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Gentoo Biscoe male 5485. 313. 0.0571 24.8 1.35 61
## 2 Gentoo Biscoe female 4680. 282. 0.0602 22.0 1.19 58
## 3 Adelie Biscoe male 4050 356. 0.0878 21.3 1.52 22
## 4 Adelie Dream male 4046. 331. 0.0817 21.1 1.68 28
## 5 Adelie Torgersen male 4035. 372. 0.0923 20.7 1.85 23
## 6 Chinstrap Dream male 3939. 362. 0.0919 19.7 1.49 34
## 7 Chinstrap Dream female 3527. 285. 0.0809 18.4 1.46 34
## 8 Adelie Torgersen female 3396. 259. 0.0763 18.0 1.48 24
## 9 Adelie Biscoe female 3369. 343. 0.102 18.0 1.67 22
## 10 Adelie Dream female 3344. 212. 0.0634 17.8 1.09 27
## # … with abbreviated variable names ¹meanBodyMassg, ²sdBodyMassg, ³CVBodyMassg,
## # ⁴sampleSize
A someone with absolutely no penguin-related knowledge, this is pretty neat to see! I didn’t know that Gentoo penguins are generally more massive and plumper for their body size than Adelie and Chinstrap penguins. And I certainly would’ve never guessed that Adelie penguins can be bigger or smaller than Chinstrap penguins, depending on whether the Adelie in question is male (bigger) or female (smaller). It’s also interesting to me that for Adelie penguins, the one species that occurs across all 3 islands, the island with the biggest females (Torgersen) is not the same island with the biggest males (Biscoe). That said, the differences in mean body masses and plumpnesses are very small relative to the their standard deviations, so penguins are virtually identical in mass across the islands.
Also, it’s interesting to note that the most massive penguins also tend to be more massive for their size (our “PPI”). This probably makes sense given that a small change in a penguin’s length produces a larger change in its volume.
If you’re like me, you’re probably itching for a statistical test of whether these differences in size are greater than one would expect by chance. While I won’t get into the details here of how to choose and implement statistical analyses, it’s helpful to know the general approach for working one into your data analysis pipeline.
As an example, let’s see if we can filter()
the data to exclude the rows for which sex is unknown, mutate
the data to calculate PPI, and then use lm()
from base R to fit a linear model predicting PPI as a function of species and sex (we’ll ignore different islands for now, for simplicity). After fitting the model, we need to test the significance of the estimated effects. We’ll use an Analyis of Variance (ANOVA) for this, which is available using the anova()
function in base R. That’s right, we can use base R functions within our pipelines too! If you’re keeping count, doing this without pipes would involve either (1) creating 3 intermediate objects or (2) 3 instances of nesting a function inside another function. Pipes are nice for cases like this.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# fit a linear model where pengPlumpInd is the response variable, and sex, species, and their interaction is the predictor
lm(formula = pengPlumpInd ~ species + sex + species:sex) %>%
# ask r to analyse that linear model using an ANalysis Of VAriance (ANOVA) using the anova() function in base R.
anova()
## Analysis of Variance Table
##
## Response: pengPlumpInd
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 1277.20 638.60 308.2048 < 2.2e-16 ***
## sex 1 561.69 561.69 271.0833 < 2.2e-16 ***
## species:sex 2 38.45 19.23 9.2787 0.0001204 ***
## Residuals 327 677.54 2.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output kindly reminds us what our response variable was (pengPlumpInd) and provides us with a table containing rows for each predictor variable or interaction term and columns for the various parameters calculated during the ANOVA on the linear model. These are all important in their own right, but you’ll most often find yourself reporting the the degrees of freedom (df), the test stastic (an F-value, in this case), and the p-value (Pr(>F)).
It seems like there are, on average, significant species and sex differences in PPI (note that the p-values associated with these terms fall below the conventional cut-off of 0.05). However, there is also a significant interaction between species and sex. This tells us that the difference between species depends on which sex you consider, and vice versa. This is what we inferred from the summary table we made, but it’s good to know that a preliminary statistical analysis supports this intuition.
Finally, we can also run pipelines into plots. There are really neat tools for doing this that are part of the tidyverse, largely within the ggplot2 package. Although I won’t get deep into ggplot here, it’s useful to see the general approach to implementing plotting in a pipeline.
Let’s make sure ggplot2 is installed.
install.packages("ggplot2") # install the package
And now we need to load it.
library(ggplot2) # load it in the current R session.
Let’s try to visualize the result from our ANOVA on the PPI of different species and sexes of penguins above. Ideally, we’d like to have a plot that has our response variable on the y-axis and the predictor on the x-axis. Our response variable is PPI, but we have two predictors, sex and species. Since we set out to look at sexual dimorphism, we probably want the male and female PPIs of a given species right next to each other to emphasize sex differences, and then replicate that visual comparison across the three species. We can do with a grouped boxplot, where we plot PPI for each species, but group the data in each species into that for males and that for females.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# use ggplot() to plot PPI for each species, grouped by sex.
# the first term (ggplot()) just sets up the plot object by saying what the x, y, and grouping variables will be.
# the second term (geom_boxplot()) specifies a "layer" of the plot's appearance-- in this case, a boxplot.
# Note that we don't have to specify our data set for either argument because it is inherited from the pipe.
# Also note that we don't have to re-specify our x, y, and grouping variable in geom_boxplot because it inherits
# these from the first ggplot term (unless we tell it otherwise).
ggplot(aes(x = species, y = pengPlumpInd, by = sex)) + # use plus signs to add on additional ggplot "layers"
geom_boxplot()
We can add an aes(color = sex)
argument to the boxplot function to get the sexes in different colors…
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# use ggplot() to plot PPI for each species, grouped by sex.
# This time, specify that boxplots should differ in color depending on sex.
ggplot(aes(x = species, y = pengPlumpInd, by = sex)) +
geom_boxplot(aes(color = sex))
Note that if we specify how colors are to be applied by groups in geom_boxplot()
, we can omit the “by” argument from the first term because applying the color by sex groups the data by sex anyway.
While we’re at it, let’s add nice axis labels using xlab()
and ylab()
.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# use ggplot() to plot PPI for each species, grouped by sex.
# Can omit the "by" argument in ggplot() because we apply colors by the same grouping structure in geom_boxplot.
ggplot(aes(x = species, y = pengPlumpInd)) +
geom_boxplot(aes(color = sex)) +
xlab("Species") +
ylab("Penguin Plumpness Index (PPI)")
The tools available through ggplot2 are pretty staggering and certainly exceed the scope of what I can cover here. The takeaway here is that these tools are developed to work with piping and the tidyverse approach in general.
It’s occasionally necessary to change the format of the data to work with a particular function or workflow in R. Most of the time, R functions will use data that are in “long format”. Long format data are data where each variable observed appears in only one column, and there is therefore only one value for a given variable in any row. Let’s look at our summary table of the penguins
data as an example.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(species, island, sex) %>%
# summarize groups in terms of mean body mass, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
CVBodyMassg = sdBodyMassg / meanBodyMassg,
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd),
sampleSize = n())
## `summarise()` has grouped output by 'species', 'island'. You can override using
## the `.groups` argument.
## # A tibble: 10 × 9
## # Groups: species, island [5]
## species island sex meanBodyMa…¹ sdBod…² CVBod…³ meanPPI sdPPI sampl…⁴
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Adelie Biscoe female 3369. 343. 0.102 18.0 1.67 22
## 2 Adelie Biscoe male 4050 356. 0.0878 21.3 1.52 22
## 3 Adelie Dream female 3344. 212. 0.0634 17.8 1.09 27
## 4 Adelie Dream male 4046. 331. 0.0817 21.1 1.68 28
## 5 Adelie Torgersen female 3396. 259. 0.0763 18.0 1.48 24
## 6 Adelie Torgersen male 4035. 372. 0.0923 20.7 1.85 23
## 7 Chinstrap Dream female 3527. 285. 0.0809 18.4 1.46 34
## 8 Chinstrap Dream male 3939. 362. 0.0919 19.7 1.49 34
## 9 Gentoo Biscoe female 4680. 282. 0.0602 22.0 1.19 58
## 10 Gentoo Biscoe male 5485. 313. 0.0571 24.8 1.35 61
## # … with abbreviated variable names ¹meanBodyMassg, ²sdBodyMassg, ³CVBodyMassg,
## # ⁴sampleSize
We can see that each variable appears in only one column, and was therefore measured only once per row. For example, there is only one column containing measures for mean body mass, so there is only one such measurement for each row. Thus, this is long-format data. Adding more observations to the data set (e.g., more species, islands, sexes, or combinations thereof) means adding more rows (lengthening the data), each with a single set of values for each variable.
To see how this differs from wide-format data, let’s change it to wide format using pivot_wider()
. This function requires the following arguments:
Let’s try taking names from the variable island
and values from the variable meanBodyMassg
.
# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(species, island, sex) %>%
# summarize groups in terms of mean body manss, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
CVBodyMassg = sdBodyMassg / meanBodyMassg,
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd),
sampleSize = n()) %>%
# be sure to select *only* the variables you need in your wider table, else things will get messy.
select(island, sex, meanBodyMassg) %>%
# use pivot_wider() to turn values of island into their own columns, with the corresponding value of mean body mass contained in each.
pivot_wider(names_from = island, values_from = meanBodyMassg)
## `summarise()` has grouped output by 'species', 'island'. You can override using
## the `.groups` argument.
## Adding missing grouping variables: `species`
## # A tibble: 6 × 5
## # Groups: species [3]
## species sex Biscoe Dream Torgersen
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Adelie female 3369. 3344. 3396.
## 2 Adelie male 4050 4046. 4035.
## 3 Chinstrap female NA 3527. NA
## 4 Chinstrap male NA 3939. NA
## 5 Gentoo female 4680. NA NA
## 6 Gentoo male 5485. NA NA
Now we can see that “island” is no longer a variable that exists in a single column. Instead, every value of “island” is given its own column, and for each combination of species and sex (row) we simply put the mean body mass for that particular island (if any) in the appropriate column. Each row therefore corresponds to a single combination of species and sex but contains multiple observations of mean body mass (one for each island). Adding more islands to this data set would mean adding columns, or “widening” the data.
More often, you will find data in wide format and need to transform it to long format for compatibility with R’s functions. Let’s see if we can get our data back the way we had it! This requires the following arguments to the pivot_longer()
function:
c()
. Another way would be to exclude all the other columns using -c(species, sex)
.# filter() to remove rows where sex is NA,
filter(penguins, !is.na(sex)) %>%
# mutate() to calculate penguin plumpness index from body mass and flipper length
mutate(pengPlumpInd = body_mass_g / flipper_length_mm) %>%
# group data by sex
group_by(species, island, sex) %>%
# summarize groups in terms of mean body manss, sd in body mass, mean PPI, and sd in PPI
summarize(meanBodyMassg = mean(body_mass_g),
sdBodyMassg = sd(body_mass_g),
CVBodyMassg = sdBodyMassg / meanBodyMassg,
meanPPI = mean(pengPlumpInd),
sdPPI = sd(pengPlumpInd),
sampleSize = n()) %>%
# be sure to select *only* the variables you need in your wider table, else things will get messy.
select(island, sex, meanBodyMassg) %>%
# use pivot_wider() to turn values of island into their own columns, with the corresponding value of mean body mass contained in each.
pivot_wider(names_from = island, values_from = meanBodyMassg) %>%
# use pivot_longer() to collapse values for different islands into a single column and create a new column with the body mass measure for
# each island.
pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "meanBodyMassg")
## `summarise()` has grouped output by 'species', 'island'. You can override using
## the `.groups` argument.
## Adding missing grouping variables: `species`
## # A tibble: 18 × 4
## # Groups: species [3]
## species sex island meanBodyMassg
## <fct> <fct> <chr> <dbl>
## 1 Adelie female Biscoe 3369.
## 2 Adelie female Dream 3344.
## 3 Adelie female Torgersen 3396.
## 4 Adelie male Biscoe 4050
## 5 Adelie male Dream 4046.
## 6 Adelie male Torgersen 4035.
## 7 Chinstrap female Biscoe NA
## 8 Chinstrap female Dream 3527.
## 9 Chinstrap female Torgersen NA
## 10 Chinstrap male Biscoe NA
## 11 Chinstrap male Dream 3939.
## 12 Chinstrap male Torgersen NA
## 13 Gentoo female Biscoe 4680.
## 14 Gentoo female Dream NA
## 15 Gentoo female Torgersen NA
## 16 Gentoo male Biscoe 5485.
## 17 Gentoo male Dream NA
## 18 Gentoo male Torgersen NA
Great, now we are back to one column for each variable, and one row for each observation. This is obviously a really silly pipeline at this point, but it illustrates the point.