#install.packages(c("here", "dplyr", "vroom", "tidyr", "purrr"))
library(here)
library(dplyr)
library(vroom)
library(tidyr)
library(purrr)
The amount of data becoming available in ecology has been increasing rather fast in the recent years. Therefore, being able to wrangling, analyzing and visualizing this data, associated with a good scientific question, is timely for ecologists. Here, I will introduce some tidyverse
packages and functions that can make this part of the job a bit easier.
Most information I will present here was gotten from our outstanding R community (StackOverflow, GitHub, Twitter posts - #rstats - and R books), based on daily issues faced during my PhD. Also, everything here can be achieved by using base R. However, some tidyverse
packages can be helpful to avoid unnecessary loops and functions, saving some lines of code and time. The intention of this post is, therefore, to show some useful functions that can facilitate our life but also to gather information in one place for my own purpose :).
In this post, I will be using the BioTime database as an example, which is a biodiversity time series data with multiple taxa and several sites around the world (more details at: http://biotime.st-andrews.ac.uk/). I anticipate that this is not an extensive analysis of the data but just an example of how we can use some tidyverse
packages and functions to any data we want. Please download both data and metadata, in case you want to follow this post.
This post is split into three parts:
1st: Presenting
here
,archive
andvroom
packages and how to wrangling/exploring data withdplyr
andpurrr
2nd: Data visualization with
ggplot2
andcowplot
3rd: Statistical modelling and bootstraps with
tidymodels::broom
here
package
For those using R projects, here
is quite handy. With here
we don’t need to set our working directory (setwd)
anymore. This is particularly interesting whether we are collaborating and have to share our scripts and data with colleagues or need to read files saved at different folders and work on different computers. When using here
we only need to set the path where the files are saved within the R project:
For example, let’s say the project is in “MyFolder” and the data in “data”. Using the traditional way, before starting working on the data, we would have to set up the working directory:
setwd("C:\Users\yourname\Desktop\MyFolder\data")
<- read.csv("myfile.csv") myfile
With here
, instead of setting the working directory (setwd)
and read the file afterwards (read.csv
), we can go straight to the file part.
For instance, the project is in “MyFolder” and we saved your file in “data”. You can simply read your file as:
<- read.csv(here("data", "myfile.csv"), header=T, sep=";")
myfile
or
<- read.csv(here("data\myfile.csv", header=T, sep=";")) myfile
If you have a file in a folder within the “data” folder you just need to add this new “path”:
<- read.csv(here("data", "differentfolder",
myfile2 "mysecondfile.csv"), header=T, sep=";")
You can use this for everything you want and it helps to keep files organized within your project. For instance, to save figures, you simply create a new folder to store figures and change the path (i.e. “data” by “figures”):
ggsave(here("figures", file = "fig1.png"), fig_object)
File size
Specially if you are storing your projects on GitHub, large files can be annoying, since GitHub allows us to store files up to 100 mb. To tackle this issue, we can either use the lfs storage method (more details at: https://git-lfs.github.com) or compress the data as .zip
or .gz
files. Sometimes, though, the file is too large that even after compressing it, you still need to use the lsf method to push it to GitHub - which is the biotime case. Anyway, it is good to know how to compress your files in case you need it for a different data.
There are different ways to do that in R. I’m going to use here the archive
package. This package is a good alternative for both compressing and reading data.
As the biotime file is already zipped, let’s use the metadata as an example.
First we read the file
<- read.csv(here("posts", "rstats",
metadata "biotime_meta.csv"), header=T, sep=",")
- Note that I’m using the path where my project is saved at.
To compress this file you have to use the archive_write
function:
#renv::install("archive")
library(archive)
write_csv(metadata, archive_write("metadata.zip", "metadata.csv"))
I saved the file as .zip, but you can use .gz or other formats. More information at: https://github.com/jimhester/archive
The biotime file is already compressed (134 mb). If you extract the .csv, the size will increase to more than 1gb. Working with compressed files can save a lot of time and space when pushing it to GitHub, as I said before.
vroom
If you don’t want to extract the .csv file but read it in the compressed format (e.g. .zip, .gz etc), there are different ways to do so. I’m going to use the vroom
package to do the job (faster) - other options would be readr
package or fread()
function in data.table
package.
For reading the .zip file we simply change from read.csv
(base R) to vroom
.
#devtools::install_dev("vroom")
library(vroom)
<- vroom(here("posts", "rstats",
biotime_df "biotime.zip"), delim = ",")
<- vroom(here("posts", "rstats",
biotime_metadata "biotime_meta.csv"), delim = ",")
Cleaning and exploring BioTime with dplyr
dplyr
is a very useful package to clean and explore data in R. I’m not going to spend much time talking about this package because there are many other great material available out there. Rather, I will just show some important functions, specially for beginner users. You can check a detailed information about this package at: https://dplyr.tidyverse.org
With dplyr
we can select and rename columns, filter lines, create new variables and make plenty of cool/complex stuff. Throughout this post, I’m going to use mostly select
, filter
, group_by
and mutate/summarize
.
Let’s start by cleaning both data and metadata using some dplyr
functions. There are many columns in both files and I’m going to select only those I’m interested in. For that, we can use select
:
<- biotime_df %>%
biotime select(id = STUDY_ID, year = YEAR, plot = PLOT,
abundance = sum.allrawdata.ABUNDANCE,
biomass = sum.allrawdata.BIOMASS, sp = GENUS_SPECIES)
<- biotime_metadata %>%
metadata select(STUDY_ID:ORGANISMS, CENT_LAT, CENT_LONG) %>%
rename(id = STUDY_ID)
If you are not used to dplyr, this %>%
is called pipe, and we can use it to nest our functions. We first select the object we want, then use the %>%
to nest more functions, separating each one by a new pipe, like I did to create the metadata
object.
Now, in biotime object, we have species’ name, abundance, plot, year and site.
head(biotime)
# A tibble: 6 × 6
id year plot abundance biomass sp
<dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 10 1984 12 1 0 Acer rubrum
2 10 1984 12 3 0 Acer saccharum
3 10 1984 12 1 0 Acer spicatum
4 10 1984 12 12 0 Corylus cornuta
5 10 1984 12 1 0 Populus pinnata
6 10 1984 12 20 0 Acer rubrum
Whereas in metadata we have realm, taxa, habitat, whether the area is protected or not etc.
head(biotime_metadata)
# A tibble: 6 × 38
STUDY_ID REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS TITLE
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr>
1 10 Terre… Temper… Woodla… FALSE Temperat… Terr… woody pl… Wind…
2 18 Terre… Temper… Sagebr… FALSE Deserts … Terr… sagebrus… Mapp…
3 33 Marine Temper… Seawee… FALSE Temperat… Mari… phytopla… Long…
4 39 Terre… Temper… Decidu… FALSE Temperat… Birds birds Bird…
5 41 Terre… Temper… Woodla… FALSE Temperat… Birds birds Time…
6 45 Marine Tropic… Reef FALSE Tropical… Fish tropical… MCR …
# ℹ 29 more variables: AB_BIO <chr>, HAS_PLOT <chr>, DATA_POINTS <dbl>,
# START_YEAR <dbl>, END_YEAR <dbl>, CENT_LAT <dbl>, CENT_LONG <dbl>,
# NUMBER_OF_SPECIES <dbl>, NUMBER_OF_SAMPLES <dbl>, NUMBER_LAT_LONG <dbl>,
# TOTAL <dbl>, GRAIN_SIZE_TEXT <chr>, GRAIN_SQ_KM <dbl>, AREA_SQ_KM <dbl>,
# CONTACT_1 <chr>, CONTACT_2 <chr>, CONT_1_MAIL <chr>, CONT_2_MAIL <chr>,
# LICENSE <chr>, WEB_LINK <chr>, DATA_SOURCE <chr>, METHODS <chr>,
# SUMMARY_METHODS <chr>, LINK_ID <dbl>, COMMENTS <chr>, …
- Note that, in biotime object, I took the opportunity and renamed the columns within
select
. You can also userename
if you want (as I did in metadata object). Also, if you have to rename/fix several column names, you can use thejanitor
package.
We use select
for columns and filter
for rows. Let´s say we are interested only in Terrestrial plants from the metadata. We can go for:
%>%
metadata filter(TAXA == "Terrestrial plants") %>%
head()
# A tibble: 6 × 10
id REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS CENT_LAT
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
1 10 Terre… Temper… Woodla… FALSE Temperat… Terr… woody pl… 47.4
2 18 Terre… Temper… Sagebr… FALSE Deserts … Terr… sagebrus… 44.3
3 60 Terre… Tropic… Tropic… FALSE Tropical… Terr… tropical… 9.15
4 201 Terre… Tropic… Tropic… FALSE Tropical… Terr… Tropical… 2.98
5 202 Terre… Tropic… Tropic… FALSE Tropical… Terr… Tropical… 11.6
6 214 Terre… Temper… Conife… FALSE Temperat… Terr… Trees 45.3
# ℹ 1 more variable: CENT_LONG <dbl>
Of course we can add more arguments to the filter
, for example, only terrestrial plants in tropical climate and so on.
%>%
metadata filter(TAXA == "Terrestrial plants",
== "Tropical") %>%
CLIMATE head()
# A tibble: 6 × 10
id REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS CENT_LAT
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
1 60 Terre… Tropic… Tropic… FALSE Tropical… Terr… tropical… 9.15
2 201 Terre… Tropic… Tropic… FALSE Tropical… Terr… Tropical… 2.98
3 202 Terre… Tropic… Tropic… FALSE Tropical… Terr… Tropical… 11.6
4 241 Terre… Tropic… Tropic… FALSE Tropical… Terr… Woody pl… 9.36
5 302 Terre… Tropic… Semi d… TRUE Tropical… Terr… Trees -22.4
6 322 Terre… Tropic… Semi d… TRUE Tropical… Terr… Trees -22.8
# ℹ 1 more variable: CENT_LONG <dbl>
We can also do the opposite (filter rows out), adding !
to the column’s name:
%>%
metadata filter(!TAXA == "Terrestrial plants",
!CLIMATE == "Tropical") %>%
head()
# A tibble: 6 × 10
id REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS CENT_LAT
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
1 33 Marine Temper… Seawee… FALSE Temperat… Mari… phytopla… 50.2
2 39 Terre… Temper… Decidu… FALSE Temperat… Birds birds 43.9
3 41 Terre… Temper… Woodla… FALSE Temperat… Birds birds 39.5
4 46 Terre… Temper… Woodla… FALSE Temperat… Birds breeding… 51.7
5 47 Terre… Temper… Ponds FALSE Temperat… Birds ducks 50.8
6 51 Terre… Temper… Woodla… FALSE Boreal f… Birds tetraoni… 61.9
# ℹ 1 more variable: CENT_LONG <dbl>
We can filter values higher or lower than:
%>%
metadata filter(id > 100) %>%
head()
# A tibble: 6 × 10
id REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS CENT_LAT
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
1 108 Marine Global Oceani… FALSE Multiple… Birds birds -27.2
2 110 Marine Temper… Estuar… FALSE Temperat… Bent… Benthos 55.5
3 112 Marine Temper… Pelagi… FALSE Multiple… Fish pelagic … 25.0
4 113 Marine Global Oceani… FALSE Multiple… Mari… marine i… 2.70
5 117 Marine Temper… Tidal … FALSE Temperat… Mari… Starfish -39.4
6 119 Marine Temper… Pelagi… FALSE Temperat… Fish Fish 44.0
# ℹ 1 more variable: CENT_LONG <dbl>
Sometimes we need to filter several specific rows. For that, we use %in%
.
%>%
metadata filter(TAXA %in% c("Terrestrial plants",
"Birds",
"Fish")) %>%
head()
# A tibble: 6 × 10
id REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS CENT_LAT
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
1 10 Terre… Temper… Woodla… FALSE Temperat… Terr… woody pl… 47.4
2 18 Terre… Temper… Sagebr… FALSE Deserts … Terr… sagebrus… 44.3
3 39 Terre… Temper… Decidu… FALSE Temperat… Birds birds 43.9
4 41 Terre… Temper… Woodla… FALSE Temperat… Birds birds 39.5
5 45 Marine Tropic… Reef FALSE Tropical… Fish tropical… -17.5
6 46 Terre… Temper… Woodla… FALSE Temperat… Birds breeding… 51.7
# ℹ 1 more variable: CENT_LONG <dbl>
Besides selecting columns and filtering rows we can also create new columns with mutate
. Let’s filter only the study number 10 and create a new column to check how many species and years they have sampled in this study. We can use mutate
and n_distinct()
to achieve that.
%>%
biotime filter(id == 10) %>%
mutate(total_sp = n_distinct(sp),
total_y = n_distinct(year))
# A tibble: 1,406 × 8
id year plot abundance biomass sp total_sp total_y
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int>
1 10 1984 12 1 0 Acer rubrum 25 3
2 10 1984 12 3 0 Acer saccharum 25 3
3 10 1984 12 1 0 Acer spicatum 25 3
4 10 1984 12 12 0 Corylus cornuta 25 3
5 10 1984 12 1 0 Populus pinnata 25 3
6 10 1984 12 20 0 Acer rubrum 25 3
7 10 1984 12 14 0 Acer saccharum 25 3
8 10 1984 12 1 0 Toxicodendron radicans 25 3
9 10 1984 12 5 0 Acer spicatum 25 3
10 10 1984 12 7 0 Corylus cornuta 25 3
# ℹ 1,396 more rows
In this study we have 25 species and three different years.
Now let’s say we want to know how many species we have per year. To get that, we can use the function group_by
.
%>%
biotime filter(id == 10) %>%
group_by(year) %>%
mutate(total_sp = n_distinct(sp))
# A tibble: 1,406 × 7
# Groups: year [3]
id year plot abundance biomass sp total_sp
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <int>
1 10 1984 12 1 0 Acer rubrum 22
2 10 1984 12 3 0 Acer saccharum 22
3 10 1984 12 1 0 Acer spicatum 22
4 10 1984 12 12 0 Corylus cornuta 22
5 10 1984 12 1 0 Populus pinnata 22
6 10 1984 12 20 0 Acer rubrum 22
7 10 1984 12 14 0 Acer saccharum 22
8 10 1984 12 1 0 Toxicodendron radicans 22
9 10 1984 12 5 0 Acer spicatum 22
10 10 1984 12 7 0 Corylus cornuta 22
# ℹ 1,396 more rows
We can see that we have 22 species in 1984, 20 species in 1992, we can’t see the third year though. That’s because mutate
creates a new column and repeats the values for each row we have. To get a summarized overview, we can use summarise
instead.
%>%
biotime filter(id == 10) %>%
group_by(year) %>%
summarise(total_sp = n_distinct(sp))
# A tibble: 3 × 2
year total_sp
<dbl> <int>
1 1984 22
2 1992 20
3 1996 20
Now we can see the number of species per year, but the other columns were deleted. Therefore, both mutate and summaries give us the same result but in different ways, you have to choose which one fits better your purpose when dealing with the data.
We can also combine summarise
and mutate
to get, for example, the frequency of species
%>%
biotime filter(id == 10) %>%
group_by(sp) %>%
summarise (total = n()) %>%
mutate(freq = total / sum(total))
# A tibble: 25 × 3
sp total freq
<chr> <int> <dbl>
1 Acer rubrum 185 0.132
2 Acer saccharum 290 0.206
3 Acer spicatum 146 0.104
4 Amelanchier gravida 1 0.000711
5 Amelanchier laevis 30 0.0213
6 Betula papyrifera 10 0.00711
7 Cornus racemosa 6 0.00427
8 Cornus rugosa 35 0.0249
9 Corylus cornuta 244 0.174
10 Diervilla lonicera 14 0.00996
# ℹ 15 more rows
Note that I’ve grouped by “species”. We could also check the frequency of each species per year, adding the variable “year” in group_by()
.
%>%
biotime filter(id == 10) %>%
group_by(sp,year) %>%
summarise (total = n()) %>%
mutate(freq = total / sum(total))
# A tibble: 62 × 4
# Groups: sp [25]
sp year total freq
<chr> <dbl> <int> <dbl>
1 Acer rubrum 1984 77 0.416
2 Acer rubrum 1992 68 0.368
3 Acer rubrum 1996 40 0.216
4 Acer saccharum 1984 105 0.362
5 Acer saccharum 1992 109 0.376
6 Acer saccharum 1996 76 0.262
7 Acer spicatum 1984 62 0.425
8 Acer spicatum 1992 55 0.377
9 Acer spicatum 1996 29 0.199
10 Amelanchier gravida 1984 1 1
# ℹ 52 more rows
tidyr
- the art of pivoting data
As ecologists, we have to deal with data frames either in long or wide formats all the time. Knowing how to transform the data into both formats, in R, is very handy and safe, because we don’t need to create several .csv files, maintaining the original data intact, decreasing the chances of messing up it.
The most important functions to transform data frames in tidy
are pivot_longer()
and pivot_wider()
. These functions have been improved constantly. Out there, you will find reshape2
package, melt
, spread
and gather
. Those are the old school functions to transform data frames. I’m slowly moving from spread
/gather
into pivot_wider()
/pivot_longer()
. Then, let’s try to transform our data into both wide and long formats using the new functions.
Let’s use the same example (study id 10) to transform the data into the wide format. We use pivot_wider()
, specifying that the species names present in “sp” column will be the columns of the new object (names_from = "sp"
) and the values present in “abundance” will be the row values (values_from = "abundance"
).
%>%
biotime filter(id == 10) %>%
pivot_wider(names_from = "sp", values_from = "abundance")
# A tibble: 47 × 29
id year plot biomass `Acer rubrum` `Acer saccharum` `Acer spicatum`
<dbl> <dbl> <chr> <dbl> <list> <list> <list>
1 10 1984 12 0 <dbl [4]> <dbl [3]> <dbl [2]>
2 10 1984 1 0 <dbl [4]> <dbl [4]> <dbl [4]>
3 10 1984 2 0 <dbl [4]> <dbl [8]> <dbl [8]>
4 10 1984 34 0 <dbl [4]> <dbl [6]> <dbl [8]>
5 10 1984 35 0 <dbl [4]> <dbl [10]> <dbl [5]>
6 10 1984 36 0 <dbl [4]> <dbl [8]> <NULL>
7 10 1984 37 0 <dbl [3]> <dbl [10]> <NULL>
8 10 1984 38 0 <dbl [4]> <dbl [6]> <NULL>
9 10 1984 39 0 <dbl [3]> <dbl [4]> <NULL>
10 10 1984 40 0 <dbl [2]> <dbl [2]> <dbl [6]>
# ℹ 37 more rows
# ℹ 22 more variables: `Corylus cornuta` <list>, `Populus pinnata` <list>,
# `Toxicodendron radicans` <list>, `Diervilla lonicera` <list>,
# `Populus sp` <list>, `Cornus rugosa` <list>, `Prunus virginiana` <list>,
# `Rubus flagellaris` <list>, `Viburnum rafinesquianum` <list>,
# `Quercus rubra` <list>, `Vaccinium angustifolium` <list>,
# `Pinus strobus` <list>, `Ostrya virginiana` <list>, …
Oops.. something went wrong. It transformed our data frame but in an odd way. The warning says that our data is not uniquely identified, meaning that the function can’t recognize each row, separately, in our data frame. To tackle this issue, we can simply bring the row names to a column, identifying each row, with the function rownames_to_column()
. Let´s try again.
%>%
biotime filter(id == 10) %>%
rownames_to_column() %>%
pivot_wider(names_from = "sp", values_from = "abundance")
# A tibble: 1,406 × 30
rowname id year plot biomass `Acer rubrum` `Acer saccharum`
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 10 1984 12 0 1 NA
2 2 10 1984 12 0 NA 3
3 3 10 1984 12 0 NA NA
4 4 10 1984 12 0 NA NA
5 5 10 1984 12 0 NA NA
6 6 10 1984 12 0 20 NA
7 7 10 1984 12 0 NA 14
8 8 10 1984 12 0 NA NA
9 9 10 1984 12 0 NA NA
10 10 10 1984 12 0 NA NA
# ℹ 1,396 more rows
# ℹ 23 more variables: `Acer spicatum` <dbl>, `Corylus cornuta` <dbl>,
# `Populus pinnata` <dbl>, `Toxicodendron radicans` <dbl>,
# `Diervilla lonicera` <dbl>, `Populus sp` <dbl>, `Cornus rugosa` <dbl>,
# `Prunus virginiana` <dbl>, `Rubus flagellaris` <dbl>,
# `Viburnum rafinesquianum` <dbl>, `Quercus rubra` <dbl>,
# `Vaccinium angustifolium` <dbl>, `Pinus strobus` <dbl>, …
Ok. It seems we are getting closer. Now we have the study 10 data frame in the wide format but when the species is absent we have NA’s. To tackle this, we add the argument values_fill
to our pivot_wider()
function.
%>%
biotime filter(id == 10) %>%
rownames_to_column() %>%
pivot_wider(names_from = "sp", values_from = "abundance",
values_fill = list(abundance=0))
# A tibble: 1,406 × 30
rowname id year plot biomass `Acer rubrum` `Acer saccharum`
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 10 1984 12 0 1 0
2 2 10 1984 12 0 0 3
3 3 10 1984 12 0 0 0
4 4 10 1984 12 0 0 0
5 5 10 1984 12 0 0 0
6 6 10 1984 12 0 20 0
7 7 10 1984 12 0 0 14
8 8 10 1984 12 0 0 0
9 9 10 1984 12 0 0 0
10 10 10 1984 12 0 0 0
# ℹ 1,396 more rows
# ℹ 23 more variables: `Acer spicatum` <dbl>, `Corylus cornuta` <dbl>,
# `Populus pinnata` <dbl>, `Toxicodendron radicans` <dbl>,
# `Diervilla lonicera` <dbl>, `Populus sp` <dbl>, `Cornus rugosa` <dbl>,
# `Prunus virginiana` <dbl>, `Rubus flagellaris` <dbl>,
# `Viburnum rafinesquianum` <dbl>, `Quercus rubra` <dbl>,
# `Vaccinium angustifolium` <dbl>, `Pinus strobus` <dbl>, …
Now we have the data transformed into the wide format, with 0’s instead of NA’s. Note that, sometimes, the row names are names instead of numbers (i.e. sometimes we use species names as row names). If you face that, you can create a new column, labeling the rows, using mutate()
and row_number()
instead of rownames_to_column()
. For example:
<- biotime %>%
biotime_wider filter(id == 10) %>%
#rownames_to_column() %>%
mutate(id_rows = row_number()) %>%
pivot_wider(names_from = "sp", values_from = "abundance", values_fill = list(abundance=0))
Now let´s transform our data into the other way round, from wide into long format. For that, we can use the function pivot_longer()
%>%
biotime_wider pivot_longer(names_to = "sp", values_to = "abundance")
Error in `pivot_longer()`:
! `cols` must select at least one column.
Error. When pivoting longer, we have to pay attention to the other columns we may have. In the biotime_wider
object we have id
, year
, plot
, biomass
, id_rows
and species names
. We want to transform only species names
and values
(abundance) though. In this case, we have to omit the other columns we don’t want to transform.
%>%
biotime_wider pivot_longer(-c(id:id_rows),names_to = "sp", values_to = "abundance")
# A tibble: 35,150 × 7
id year plot biomass id_rows sp abundance
<dbl> <dbl> <chr> <dbl> <int> <chr> <dbl>
1 10 1984 12 0 1 Acer rubrum 1
2 10 1984 12 0 1 Acer saccharum 0
3 10 1984 12 0 1 Acer spicatum 0
4 10 1984 12 0 1 Corylus cornuta 0
5 10 1984 12 0 1 Populus pinnata 0
6 10 1984 12 0 1 Toxicodendron radicans 0
7 10 1984 12 0 1 Diervilla lonicera 0
8 10 1984 12 0 1 Populus sp 0
9 10 1984 12 0 1 Cornus rugosa 0
10 10 1984 12 0 1 Prunus virginiana 0
# ℹ 35,140 more rows
With pivot_longer()
and pivot_wider()
you can make a lot of more complex transformations, I’m not going to cover those here though, please check them out at: https://tidyr.tidyverse.org/reference/pivot_wider.html
Merging data and working within lists with purrr
Now let´s try to use this knowledge to create more complex objects, using new features, combining functions etc.
When analyzing ecological data, most of the time, both data and metadata are useful, like in BioTime
example.
Let’s check both data frames
BioTime data
head(biotime)
# A tibble: 6 × 6
id year plot abundance biomass sp
<dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 10 1984 12 1 0 Acer rubrum
2 10 1984 12 3 0 Acer saccharum
3 10 1984 12 1 0 Acer spicatum
4 10 1984 12 12 0 Corylus cornuta
5 10 1984 12 1 0 Populus pinnata
6 10 1984 12 20 0 Acer rubrum
Metadata
head(metadata)
# A tibble: 6 × 10
id REALM CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA ORGANISMS CENT_LAT
<dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <dbl>
1 10 Terre… Temper… Woodla… FALSE Temperat… Terr… woody pl… 47.4
2 18 Terre… Temper… Sagebr… FALSE Deserts … Terr… sagebrus… 44.3
3 33 Marine Temper… Seawee… FALSE Temperat… Mari… phytopla… 50.2
4 39 Terre… Temper… Decidu… FALSE Temperat… Birds birds 43.9
5 41 Terre… Temper… Woodla… FALSE Temperat… Birds birds 39.5
6 45 Marine Tropic… Reef FALSE Tropical… Fish tropical… -17.5
# ℹ 1 more variable: CENT_LONG <dbl>
As you can see, in biotime we have id
, year
, plot
, abundance
, biomass
and species
, whereas in metadata we have id
, realm
, habitat
etc. So, let’s say we are interested in examining terrestrial plants species from tropical forests, but as species are in one data frame and biomes in the other, we first have to merge both data frames. For merging data frames into one we can use either bind_cols() / bind_rows()
functions, for binding columns and rows, respectively, or the join family
available in dplyr
(https://dplyr.tidyverse.org/reference/join.html). Here, we can use the letf_join()
option to merge both data and metadata.
<- left_join(biotime, metadata, by = "id")
merge_df head(merge_df)
# A tibble: 6 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT PROTECTED_AREA
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <lgl>
1 10 1984 12 1 0 Acer… Terr… Temper… Woodla… FALSE
2 10 1984 12 3 0 Acer… Terr… Temper… Woodla… FALSE
3 10 1984 12 1 0 Acer… Terr… Temper… Woodla… FALSE
4 10 1984 12 12 0 Cory… Terr… Temper… Woodla… FALSE
5 10 1984 12 1 0 Popu… Terr… Temper… Woodla… FALSE
6 10 1984 12 20 0 Acer… Terr… Temper… Woodla… FALSE
# ℹ 5 more variables: BIOME_MAP <chr>, TAXA <chr>, ORGANISMS <chr>,
# CENT_LAT <dbl>, CENT_LONG <dbl>
left_join()
merged the data frames based on the study id. This is a safe way to merge data frames because the function will match exactly the same rows present in each data frame based on the argument you give in "by = ..."
; in this case I chose by = "id"
, but depending on your data you can add more variables.
Now we have both data and metadata merged. Then, we can finally check terrestrial plant species from tropical forests.
%>%
merge_df filter(TAXA == "Terrestrial plants",
== "Tropical") %>%
CLIMATE head()
# A tibble: 6 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT PROTECTED_AREA
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <lgl>
1 201 1987 <NA> 2454 0 Aidi… Terr… Tropic… Tropic… FALSE
2 201 1987 <NA> 97 0 Fagr… Terr… Tropic… Tropic… FALSE
3 201 1987 <NA> 112 0 Fahr… Terr… Tropic… Tropic… FALSE
4 201 1987 <NA> 45 0 Ficu… Terr… Tropic… Tropic… FALSE
5 201 1987 <NA> 2913 0 Alan… Terr… Tropic… Tropic… FALSE
6 201 1987 <NA> 3 0 Ficu… Terr… Tropic… Tropic… FALSE
# ℹ 5 more variables: BIOME_MAP <chr>, TAXA <chr>, ORGANISMS <chr>,
# CENT_LAT <dbl>, CENT_LONG <dbl>
or Birds etc
%>%
merge_df filter(TAXA == "Birds",
== "Tropical") %>%
CLIMATE head()
# A tibble: 6 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT PROTECTED_AREA
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <lgl>
1 58 1994 Plot… 2 0 Vire… Terr… Tropic… Long t… FALSE
2 58 1991 Plot9 1 0 Paru… Terr… Tropic… Long t… FALSE
3 58 1994 Plot… 1 0 Neso… Terr… Tropic… Long t… FALSE
4 58 1994 Plot6 7 0 Coer… Terr… Tropic… Long t… FALSE
5 58 1991 Plot6 9 0 Neso… Terr… Tropic… Long t… FALSE
6 58 1994 Plot6 1 0 Marg… Terr… Tropic… Long t… FALSE
# ℹ 5 more variables: BIOME_MAP <chr>, TAXA <chr>, ORGANISMS <chr>,
# CENT_LAT <dbl>, CENT_LONG <dbl>
Although those functions are useful to explore big data sets, most of the time, specially when working with multiple variables, filtering each one and examining them separately can be time consuming. For example, let’s say we want to transform our current data frame (merge_df) from long into wide format, as we did before. The first time we transformed our data, we had a single study (id=10) and taxa. Now we have multiple studies and taxa. If we try pivot_wider()
, plant and bird species names, for example, will be assigned as columns, making things difficult to disentangle afterwards. An useful way to avoid this is to split our data frame into lists based on each taxa we have. Then, we can apply the function we want for all elements of that list at once. We can do that using the package purrr
, which has plenty of functions that make ecologists` life a bit easier in that sense. Let’s try to understand how this package works.
First, let’s create different lists based on different taxa with the split()
function:
<- merge_df %>%
split_taxa split(.$TAXA)
head(split_taxa)
$All
# A tibble: 326,527 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 166 1979 <NA> 5 0 Puffinus gravis Marine Global Oceanic w…
2 166 1979 <NA> 2 0 Puffinus gravis Marine Global Oceanic w…
3 166 1979 <NA> 15 0 Puffinus gravis Marine Global Oceanic w…
4 166 1979 <NA> 23 0 Puffinus gravis Marine Global Oceanic w…
5 166 1979 <NA> 20 0 Puffinus gravis Marine Global Oceanic w…
6 166 1979 <NA> 20 0 Puffinus gravis Marine Global Oceanic w…
7 166 1979 <NA> 60 0 Puffinus gravis Marine Global Oceanic w…
8 166 1979 <NA> 16 0 Puffinus gravis Marine Global Oceanic w…
9 166 1979 <NA> 1 0 Puffinus gravis Marine Global Oceanic w…
10 166 1979 <NA> 150 0 Puffinus gravis Marine Global Oceanic w…
# ℹ 326,517 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
# ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>
$Amphibians
# A tibble: 2,446 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 305 1980 <NA> 12 0 Desmognathus quadr… Terr… Temper… Appala…
2 305 1981 <NA> 4 0 Desmognathus aeneus Terr… Temper… Appala…
3 305 1981 <NA> 20 0 Desmognathus monti… Terr… Temper… Appala…
4 305 1976 <NA> 8 0 Desmognathus aeneus Terr… Temper… Appala…
5 305 1976 <NA> 27 0 Desmognathus monti… Terr… Temper… Appala…
6 305 1976 <NA> 15 0 Desmognathus ochro… Terr… Temper… Appala…
7 305 1976 <NA> 7 0 Desmognathus quadr… Terr… Temper… Appala…
8 305 1977 <NA> 7 0 Desmognathus aeneus Terr… Temper… Appala…
9 305 1977 <NA> 22 0 Desmognathus monti… Terr… Temper… Appala…
10 305 1977 <NA> 18 0 Desmognathus ochro… Terr… Temper… Appala…
# ℹ 2,436 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
# ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>
$Benthos
# A tibble: 1,711,785 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 76 1992 <NA> 2 0 Aphroditoidea Mari… Temper… Estuar…
2 76 1992 <NA> 5 0 Ascidiella scabra Mari… Temper… Estuar…
3 76 1992 <NA> 4 0 Asterias rubens Mari… Temper… Estuar…
4 76 1992 <NA> 9 0 Calliostoma zyziph… Mari… Temper… Estuar…
5 76 1992 <NA> 10 0 Crepidula fornicata Mari… Temper… Estuar…
6 76 1992 <NA> 1 0 Gibbula tumida Mari… Temper… Estuar…
7 76 1992 <NA> 1 0 Hippolyte varians Mari… Temper… Estuar…
8 76 1992 <NA> 1 0 Hyas coarctatus Mari… Temper… Estuar…
9 76 1992 <NA> 1 0 Liocarcinus puber Mari… Temper… Estuar…
10 76 1992 <NA> 1 0 Liparis sp Mari… Temper… Estuar…
# ℹ 1,711,775 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
# ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>
$Birds
# A tibble: 1,487,942 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 39 1976 <NA> 7.5 NA Catharus guttatus Terr… Temper… Decidu…
2 39 1976 <NA> 8 NA Catharus ustulatus Terr… Temper… Decidu…
3 39 1976 <NA> 12 NA Dendroica caerules… Terr… Temper… Decidu…
4 39 1977 <NA> 2.5 NA Catharus fuscescens Terr… Temper… Decidu…
5 39 1977 <NA> 5 NA Catharus guttatus Terr… Temper… Decidu…
6 39 1977 <NA> 7 NA Catharus ustulatus Terr… Temper… Decidu…
7 39 1977 <NA> 11 NA Dendroica caerules… Terr… Temper… Decidu…
8 39 1977 <NA> 2 NA Dendroica fusca Terr… Temper… Decidu…
9 39 1977 <NA> 28 NA Empidonax minimus Terr… Temper… Decidu…
10 39 1977 <NA> 5 NA Hylocichla musteli… Terr… Temper… Decidu…
# ℹ 1,487,932 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
# ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>
$Fish
# A tibble: 2,294,445 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 45 2006 <NA> 1 4.80 Canthigaster solan… Mari… Tropic… Reef
2 45 2006 <NA> 2 0.630 Ctenogobiops feroc… Mari… Tropic… Reef
3 45 2006 <NA> 1 26.8 Echidna nebulosa Mari… Tropic… Reef
4 45 2006 <NA> 1 1.87 Epinephelus merra Mari… Tropic… Reef
5 45 2006 <NA> 1 8.31 Paracirrhites arca… Mari… Tropic… Reef
6 45 2006 <NA> 1 0.200 Pseudocheilinus he… Mari… Tropic… Reef
7 45 2006 <NA> 12 70.7 Stegastes nigricans Mari… Tropic… Reef
8 45 2006 <NA> 2 38.7 Acanthurus nigrofu… Mari… Tropic… Reef
9 45 2006 <NA> 1 46.2 Acanthurus trioste… Mari… Tropic… Reef
10 45 2006 <NA> 1 18.3 Anampses twistii Mari… Tropic… Reef
# ℹ 2,294,435 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
# ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>
$`Freshwater invertebrates`
# A tibble: 93,966 × 15
id year plot abundance biomass sp REALM CLIMATE HABITAT
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 237 1984 <NA> 2000 0 Daphnia pulicaria Fres… Temper… Temper…
2 237 1983 <NA> 28000 0 Notholca acuminata Fres… Temper… Temper…
3 237 1984 <NA> 8000 0 Keratella quadrata Fres… Temper… Temper…
4 237 1984 <NA> 19000 0 Notholca Fres… Temper… Temper…
5 237 1984 <NA> 11000 0 Synchaeta Fres… Temper… Temper…
6 237 1984 <NA> 768000 0 Conochiloides nata… Fres… Temper… Temper…
7 237 1984 <NA> 156000 0 Cyclopoida copepod… Fres… Temper… Temper…
8 237 1984 <NA> 16000 0 Daphnia mendotae Fres… Temper… Temper…
9 237 1983 <NA> 194000 0 Copepod nauplii Fres… Temper… Temper…
10 237 1983 <NA> 9000 0 Polyarthra dolicho… Fres… Temper… Temper…
# ℹ 93,956 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
# ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>
To work with lists in purrr
the map
function will be our loyal servant, and they are many: i.e. map
, map2
, map_dfr
, map_dfc
etc.
map
is used when we want to apply a function to each element of one list (it works likelapply
)map2
is used when we want to apply a function to each element of two lists (it works likemapply
)map_dfr
andmap_dfc
are used when we want to combine lists into data frames by rows or columns, respectively.
I will use here map
and map2
(specially in data viz and statistical modelling post); if you want to know more, take a look at:
https://dcl-prog.stanford.edu/purrr-basics.html
and
https://jennybc.github.io/purrr-tutorial/
Let’s start with map()
, filtering studies that have been sampled over more than 10 years.
<- split_taxa %>% map(~ .x %>%
filter_year group_by(id) %>%
mutate(n_year=n_distinct(year)) %>% # here we create a column showing how many years we have for each taxa, grouped by study id
filter(n_year > 10)) # and filter only those studies with more than 10 years
In case you are not used to tidyverse, map()
works like lapply()
in base R. Tilda(~
) represents function(x)
and (.)
would be (x)
. Just to show it, in a very non-sense way, let’s sum up the overall number of individuals of each taxa. To make it simple (and even poorer), let’s select only the abundance column first.
<- filter_year %>%
select_abundance map(~ .x %>% ungroup %>% select(abundance))
head(select_abundance)
$All
# A tibble: 313,561 × 1
abundance
<dbl>
1 5
2 2
3 15
4 23
5 20
6 20
7 60
8 16
9 1
10 150
# ℹ 313,551 more rows
$Amphibians
# A tibble: 327 × 1
abundance
<dbl>
1 12
2 4
3 20
4 8
5 27
6 15
7 7
8 7
9 22
10 18
# ℹ 317 more rows
$Benthos
# A tibble: 1,594,662 × 1
abundance
<dbl>
1 18
2 36
3 2
4 1038
5 158
6 130
7 48
8 36
9 30
10 6
# ℹ 1,594,652 more rows
$Birds
# A tibble: 1,288,906 × 1
abundance
<dbl>
1 7.5
2 8
3 12
4 2.5
5 5
6 7
7 11
8 2
9 28
10 5
# ℹ 1,288,896 more rows
$Fish
# A tibble: 2,026,702 × 1
abundance
<dbl>
1 7
2 5
3 2
4 1
5 1
6 6
7 6
8 52
9 16
10 1
# ℹ 2,026,692 more rows
$`Freshwater invertebrates`
# A tibble: 91,528 × 1
abundance
<dbl>
1 2000
2 28000
3 8000
4 19000
5 11000
6 768000
7 156000
8 16000
9 194000
10 9000
# ℹ 91,518 more rows
Using the base R approach, to sum up a column over lists, we would use something like this:
lapply(select_abundance, function(x) { colSums (x)} )
$All
abundance
10525821
$Amphibians
abundance
115344
$Benthos
abundance
23503327052
$Birds
abundance
14673312
$Fish
abundance
52881387
$`Freshwater invertebrates`
abundance
2898542291
$`Freshwater plants`
abundance
NA
$Mammals
abundance
679713.1
$`Marine invertebrates`
abundance
256799039
$`Marine plants`
abundance
25161115487
$Reptiles
abundance
23555
$`Terrestrial invertebrates`
abundance
1022759
$`Terrestrial plants`
abundance
2427471
With purrr
:
%>%
select_abundance map(~ .x %>%
summarise(row_sums = colSums(.)))
or
%>%
select_abundance map(~summarise(., row_sums = colSums(.)))
$All
# A tibble: 1 × 1
row_sums
<dbl>
1 10525821
$Amphibians
# A tibble: 1 × 1
row_sums
<dbl>
1 115344
$Benthos
# A tibble: 1 × 1
row_sums
<dbl>
1 23503327052.
$Birds
# A tibble: 1 × 1
row_sums
<dbl>
1 14673312.
$Fish
# A tibble: 1 × 1
row_sums
<dbl>
1 52881387.
$`Freshwater invertebrates`
# A tibble: 1 × 1
row_sums
<dbl>
1 2898542291.
$`Freshwater plants`
# A tibble: 1 × 1
row_sums
<dbl>
1 NA
$Mammals
# A tibble: 1 × 1
row_sums
<dbl>
1 679713.
$`Marine invertebrates`
# A tibble: 1 × 1
row_sums
<dbl>
1 256799039.
$`Marine plants`
# A tibble: 1 × 1
row_sums
<dbl>
1 25161115487.
$Reptiles
# A tibble: 1 × 1
row_sums
<dbl>
1 23555
$`Terrestrial invertebrates`
# A tibble: 1 × 1
row_sums
<dbl>
1 1022759
$`Terrestrial plants`
# A tibble: 1 × 1
row_sums
<dbl>
1 2427471.
We filtered our data by n_years > 10
, let’s increase this number selecting only studies with more that 30 years of sampling.
<- split_taxa %>% map(~ .x %>%
bio_list group_by(id) %>%
mutate(nyear = n_distinct(year)) %>%
filter(nyear > 30) %>%
select(id:sp,-nyear))
Now we have some elements in our list with 0, meaning that there is no study with more than 30 years for that specific taxa. We can use either discard()
or keep()
to exclude or keep elements in our list, respectively. Let’s exclude those elements without any data.
<- bio_list %>% discard(., ~nrow(.) == 0) discard_elements
Now, instead of 13 taxa we have seven.
If you want to exclude elements in a list by name, we can use list_modify("listElementName" = zap())
. Note that, previously, we could use list_modify = "NULL"
, but this option is now deprecated. For example, let’s exclude “Benthos” from our object
<- discard_elements %>%
no_benthos list_modify("Benthos" = zap())
Let’s create a more complex object now. As we have both abundance and biomass column, let’s create a new column representing presence/absence (pres_ab). First, though, we have to deal with the NA’s in our data. We can use replace_na()
for that.
<- no_benthos %>%
presence_abs map(~ .x %>%
mutate(i = row_number()) %>%
group_by(id) %>%
replace_na(list(abundance = 0, biomass = 0)) %>%
mutate(pres_ab = abundance+biomass,
pres_ab = case_when(pres_ab > 0 ~ 1,
~ 0))) T
Here I introduced a new (very useful) function case_when()
. In the current example, I created the “pres_ab” column and asked the function to assign 1 (presence) case when the sum of abundance+biomass is higher than 0 and assign 0 when the opposite is true (T ~ 0
) . You can use case_when()
in many different ways, being very useful when we have to create new columns based on other columns or change values within columns.
As in the long format we don’t have species assigned with 0 either in abundance or biomass, “pres_ab” column will be always 1 (my bad haha). Let’s transform the data into the wide format to get presences and absences indeed. If we were not working with lists, we should transform each taxa separately. We can now combine purrr
with pivot_wider()
and transform all elements of our list.
<- presence_abs %>%
wide_list map(~ .x %>%
group_by(id,year,sp) %>%
select(-plot,-abundance, -biomass) %>%
summarise(pres_ab = sum(pres_ab)) %>%
mutate(pres_ab = case_when(pres_ab > 0 ~ 1,
~ 0))%>%
T %>%
ungroup arrange(sp) %>% # sp names in alphabetic order
pivot_wider(names_from = "sp", values_from = "pres_ab",
values_fill = list(pres_ab=0)))
Now we have 0’s and 1’s. As we have species presences absences per year, we can check species gains and losses. For that we transform our data back into the long format (now with 0’s and 1’s) and use case_when()
again to create the new column
As we are working with temporal data, we can use case_when()
combined with lag()
or lead()
to analyse either the previous or next row in our data set, respectively. For example, let’s create a new column to check species gains and losses per year.
Now I will transform the lists into the long format again and check species gains and losses using case_when()
.
<-wide_list %>%
long_list map(~.x %>%
pivot_longer(-c(id,year),names_to = "sp", values_to = "pres_ab"))
Now we can check how each species is changing over time
<- long_list %>%
gain_loss map(~.x %>%
group_by(id,sp) %>%
arrange(year, .by_group = TRUE)%>%
mutate(gain_loss = case_when(row_number() == 1 ~ "NA",
lag(pres_ab == 0) & pres_ab == 1 ~ "gain",
lag(pres_ab == 1) & pres_ab == 0 ~ "loss",
lag(pres_ab == 0) & pres_ab == 0 ~ "persist_ab",
~ "persist_pres"))) T
Here I first grouped the data by study id and species, then I ordered the year, separately, by group. With mutate()
, I created the column gain_loss, putting NA to the first year of each study, since we cant have gains or losses at the first year. Further, I assigned gain when the species was absent at the previous year (lag(pres_ab==0)
) and present in the next one (pres_ab==1
) and the opposite to loss. Also, if the species persisted absent or present, but we could also use “NA” if the idea is to check only gains and losses.
Another nice trick, when dealing with lists, is that sometimes we need to use the name of the elements in a list as a column. For example, here the name of the elements in our list is the taxa’s name. We can use imap
and mutate
to create a new column based on these names. With mutate
(in dplyr 1.0.0
) we can also select the position of the new column with .before
and .after
, otherwise the new column will be the last one.
<- gain_loss %>% imap(., ~mutate(.x, TAXA = .y, .after = year)) taxa_list
Some other tricks with purrr
Sometimes, we have to deal with nested lists. I will not focus on nested lists here, but just out of curiosity, let’s split our list once more, now based on the different studies we have. So we will have lists based on taxa and nested based on studies.
<- wide_list %>%
nest_list map(~split(.,.$id))
Usually, when I have such kind of data, I use nested maps
- not sure if this is the most elegant approach though. For example, let’s say we want to check the species richness for each taxa in each study based on years of sampling. I’d do something like this:
<- nest_list %>% map(map, ~mutate(.,richness = rowSums(.[-1:-2]), .after = year)) double_map
To work with two lists, we can use map2()
. A simple example:
<- as.list(c(a = 1, b = 2 ,c = 3))
list1
<- as.list(c(a = 1, b = 2 ,c = 3))
list2
map2(list1, list2, sum)
$a
[1] 2
$b
[1] 4
$c
[1] 6
I will use more map2()
in the following posts, but this is particular interesting for those working with functional diversity of multiple taxa. You can do the same we did here (split the data by taxa), in both community and trait data, and then use map2()
to calculate functional diversity. It will not work here of course, but something like this might work for you:
library(FD)
<- map2(list_trait, list_comm, dbFD) # list with traits, list with communities and the function you want to apply (in this case, dbFD) t1
and then you can get the CWM, Rao or whatever metric you want from the dbFD()
function
<- t1 %>% map("CWM")
list_CWM
<- t1 %>% map("RaoQ") %>%
list_RaoQ map(~ .x %>% as.data.frame())
Another important trick we can do with purrr
is to load multiple data into R. Sometimes we download ecological data and the .csv
files come separated into multiple zip files. If you have faced that, and had to extract each file one by one, guess what? You can run them all at once with purrr
:
<- list.files(here("data"), pattern = "*.csv", full.names = T) %>%
manyfiles map_df(~vroom(.)) #here("data"), as I explained at the beginning of this post, is where your files are located in your Rproject
In summary, tidyverse
has many functions that can boost our data analysis and how / when to use them will depend on your data and what you want to get from it. I hope to have kicked off some new useful functions and tricks.