#1 Data wrangling in ecology with tidyverse

Author

Diego P.F. Trindade

Published

July 15, 2020

Image credit: Alisson Horst
#install.packages(c("here", "dplyr", "vroom", "tidyr", "purrr"))
library(here)
library(dplyr)
library(vroom)
library(tidyr)
library(purrr)

The amount of data becoming available in ecology has been increasing rather fast in the recent years. Therefore, being able to wrangling, analyzing and visualizing this data, associated with a good scientific question, is timely for ecologists. Here, I will introduce some tidyverse packages and functions that can make this part of the job a bit easier.

Most information I will present here was gotten from our outstanding R community (StackOverflow, GitHub, Twitter posts - #rstats - and R books), based on daily issues faced during my PhD. Also, everything here can be achieved by using base R. However, some tidyverse packages can be helpful to avoid unnecessary loops and functions, saving some lines of code and time. The intention of this post is, therefore, to show some useful functions that can facilitate our life but also to gather information in one place for my own purpose :).

In this post, I will be using the BioTime database as an example, which is a biodiversity time series data with multiple taxa and several sites around the world (more details at: http://biotime.st-andrews.ac.uk/). I anticipate that this is not an extensive analysis of the data but just an example of how we can use some tidyverse packages and functions to any data we want. Please download both data and metadata, in case you want to follow this post.

This post is split into three parts:

here package

For those using R projects, here is quite handy. With here we don’t need to set our working directory (setwd) anymore. This is particularly interesting whether we are collaborating and have to share our scripts and data with colleagues or need to read files saved at different folders and work on different computers. When using here we only need to set the path where the files are saved within the R project:

For example, let’s say the project is in “MyFolder” and the data in “data”. Using the traditional way, before starting working on the data, we would have to set up the working directory:

setwd("C:\Users\yourname\Desktop\MyFolder\data")

myfile <- read.csv("myfile.csv")

With here, instead of setting the working directory (setwd) and read the file afterwards (read.csv), we can go straight to the file part.

For instance, the project is in “MyFolder” and we saved your file in “data”. You can simply read your file as:

myfile <- read.csv(here("data", "myfile.csv"), header=T, sep=";")

or

myfile <- read.csv(here("data\myfile.csv", header=T, sep=";"))

If you have a file in a folder within the “data” folder you just need to add this new “path”:

myfile2 <- read.csv(here("data", "differentfolder", 
                         "mysecondfile.csv"), header=T, sep=";")

You can use this for everything you want and it helps to keep files organized within your project. For instance, to save figures, you simply create a new folder to store figures and change the path (i.e. “data” by “figures”):

ggsave(here("figures", file = "fig1.png"), fig_object)

File size

Specially if you are storing your projects on GitHub, large files can be annoying, since GitHub allows us to store files up to 100 mb. To tackle this issue, we can either use the lfs storage method (more details at: https://git-lfs.github.com) or compress the data as .zip or .gz files. Sometimes, though, the file is too large that even after compressing it, you still need to use the lsf method to push it to GitHub - which is the biotime case. Anyway, it is good to know how to compress your files in case you need it for a different data.

There are different ways to do that in R. I’m going to use here the archive package. This package is a good alternative for both compressing and reading data.

As the biotime file is already zipped, let’s use the metadata as an example.

First we read the file

metadata <- read.csv(here("posts", "rstats", 
                          "biotime_meta.csv"), header=T, sep=",") 
  • Note that I’m using the path where my project is saved at.

To compress this file you have to use the archive_write function:

#renv::install("archive")
library(archive)
write_csv(metadata, archive_write("metadata.zip", "metadata.csv"))

I saved the file as .zip, but you can use .gz or other formats. More information at: https://github.com/jimhester/archive

The biotime file is already compressed (134 mb). If you extract the .csv, the size will increase to more than 1gb. Working with compressed files can save a lot of time and space when pushing it to GitHub, as I said before.

vroom

If you don’t want to extract the .csv file but read it in the compressed format (e.g. .zip, .gz etc), there are different ways to do so. I’m going to use the vroom package to do the job (faster) - other options would be readr package or fread() function in data.table package.

For reading the .zip file we simply change from read.csv(base R) to vroom.

#devtools::install_dev("vroom")
library(vroom)

biotime_df <- vroom(here("posts", "rstats", 
                         "biotime.zip"), delim = ",")

biotime_metadata <- vroom(here("posts", "rstats", 
                               "biotime_meta.csv"), delim = ",")

Cleaning and exploring BioTime with dplyr

dplyr is a very useful package to clean and explore data in R. I’m not going to spend much time talking about this package because there are many other great material available out there. Rather, I will just show some important functions, specially for beginner users. You can check a detailed information about this package at: https://dplyr.tidyverse.org

With dplyr we can select and rename columns, filter lines, create new variables and make plenty of cool/complex stuff. Throughout this post, I’m going to use mostly select, filter, group_by and mutate/summarize.

Let’s start by cleaning both data and metadata using some dplyr functions. There are many columns in both files and I’m going to select only those I’m interested in. For that, we can use select:

biotime <- biotime_df %>% 
  select(id = STUDY_ID, year = YEAR, plot = PLOT, 
         abundance = sum.allrawdata.ABUNDANCE, 
         biomass = sum.allrawdata.BIOMASS, sp = GENUS_SPECIES)


metadata <- biotime_metadata %>% 
  select(STUDY_ID:ORGANISMS, CENT_LAT, CENT_LONG) %>% 
  rename(id = STUDY_ID)

If you are not used to dplyr, this %>% is called pipe, and we can use it to nest our functions. We first select the object we want, then use the %>% to nest more functions, separating each one by a new pipe, like I did to create the metadata object.

Now, in biotime object, we have species’ name, abundance, plot, year and site.

head(biotime)
# A tibble: 6 × 6
     id  year plot  abundance biomass sp             
  <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>          
1    10  1984 12            1       0 Acer rubrum    
2    10  1984 12            3       0 Acer saccharum 
3    10  1984 12            1       0 Acer spicatum  
4    10  1984 12           12       0 Corylus cornuta
5    10  1984 12            1       0 Populus pinnata
6    10  1984 12           20       0 Acer rubrum    

Whereas in metadata we have realm, taxa, habitat, whether the area is protected or not etc.

head(biotime_metadata)
# A tibble: 6 × 38
  STUDY_ID REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS TITLE
     <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>     <chr>
1       10 Terre… Temper… Woodla… FALSE          Temperat… Terr… woody pl… Wind…
2       18 Terre… Temper… Sagebr… FALSE          Deserts … Terr… sagebrus… Mapp…
3       33 Marine Temper… Seawee… FALSE          Temperat… Mari… phytopla… Long…
4       39 Terre… Temper… Decidu… FALSE          Temperat… Birds birds     Bird…
5       41 Terre… Temper… Woodla… FALSE          Temperat… Birds birds     Time…
6       45 Marine Tropic… Reef    FALSE          Tropical… Fish  tropical… MCR …
# ℹ 29 more variables: AB_BIO <chr>, HAS_PLOT <chr>, DATA_POINTS <dbl>,
#   START_YEAR <dbl>, END_YEAR <dbl>, CENT_LAT <dbl>, CENT_LONG <dbl>,
#   NUMBER_OF_SPECIES <dbl>, NUMBER_OF_SAMPLES <dbl>, NUMBER_LAT_LONG <dbl>,
#   TOTAL <dbl>, GRAIN_SIZE_TEXT <chr>, GRAIN_SQ_KM <dbl>, AREA_SQ_KM <dbl>,
#   CONTACT_1 <chr>, CONTACT_2 <chr>, CONT_1_MAIL <chr>, CONT_2_MAIL <chr>,
#   LICENSE <chr>, WEB_LINK <chr>, DATA_SOURCE <chr>, METHODS <chr>,
#   SUMMARY_METHODS <chr>, LINK_ID <dbl>, COMMENTS <chr>, …
  • Note that, in biotime object, I took the opportunity and renamed the columns within select. You can also use rename if you want (as I did in metadata object). Also, if you have to rename/fix several column names, you can use the janitor package.

We use select for columns and filter for rows. Let´s say we are interested only in Terrestrial plants from the metadata. We can go for:

metadata %>% 
filter(TAXA == "Terrestrial plants") %>% 
head()
# A tibble: 6 × 10
     id REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS CENT_LAT
  <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>        <dbl>
1    10 Terre… Temper… Woodla… FALSE          Temperat… Terr… woody pl…    47.4 
2    18 Terre… Temper… Sagebr… FALSE          Deserts … Terr… sagebrus…    44.3 
3    60 Terre… Tropic… Tropic… FALSE          Tropical… Terr… tropical…     9.15
4   201 Terre… Tropic… Tropic… FALSE          Tropical… Terr… Tropical…     2.98
5   202 Terre… Tropic… Tropic… FALSE          Tropical… Terr… Tropical…    11.6 
6   214 Terre… Temper… Conife… FALSE          Temperat… Terr… Trees        45.3 
# ℹ 1 more variable: CENT_LONG <dbl>

Of course we can add more arguments to the filter, for example, only terrestrial plants in tropical climate and so on.

metadata %>% 
filter(TAXA == "Terrestrial plants",
       CLIMATE == "Tropical") %>% 
head()
# A tibble: 6 × 10
     id REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS CENT_LAT
  <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>        <dbl>
1    60 Terre… Tropic… Tropic… FALSE          Tropical… Terr… tropical…     9.15
2   201 Terre… Tropic… Tropic… FALSE          Tropical… Terr… Tropical…     2.98
3   202 Terre… Tropic… Tropic… FALSE          Tropical… Terr… Tropical…    11.6 
4   241 Terre… Tropic… Tropic… FALSE          Tropical… Terr… Woody pl…     9.36
5   302 Terre… Tropic… Semi d… TRUE           Tropical… Terr… Trees       -22.4 
6   322 Terre… Tropic… Semi d… TRUE           Tropical… Terr… Trees       -22.8 
# ℹ 1 more variable: CENT_LONG <dbl>

We can also do the opposite (filter rows out), adding ! to the column’s name:

metadata %>% 
filter(!TAXA == "Terrestrial plants",
       !CLIMATE == "Tropical") %>% 
head()
# A tibble: 6 × 10
     id REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS CENT_LAT
  <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>        <dbl>
1    33 Marine Temper… Seawee… FALSE          Temperat… Mari… phytopla…     50.2
2    39 Terre… Temper… Decidu… FALSE          Temperat… Birds birds         43.9
3    41 Terre… Temper… Woodla… FALSE          Temperat… Birds birds         39.5
4    46 Terre… Temper… Woodla… FALSE          Temperat… Birds breeding…     51.7
5    47 Terre… Temper… Ponds   FALSE          Temperat… Birds ducks         50.8
6    51 Terre… Temper… Woodla… FALSE          Boreal f… Birds tetraoni…     61.9
# ℹ 1 more variable: CENT_LONG <dbl>

We can filter values higher or lower than:

metadata %>% 
filter(id > 100) %>% 
head()
# A tibble: 6 × 10
     id REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS CENT_LAT
  <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>        <dbl>
1   108 Marine Global  Oceani… FALSE          Multiple… Birds birds       -27.2 
2   110 Marine Temper… Estuar… FALSE          Temperat… Bent… Benthos      55.5 
3   112 Marine Temper… Pelagi… FALSE          Multiple… Fish  pelagic …    25.0 
4   113 Marine Global  Oceani… FALSE          Multiple… Mari… marine i…     2.70
5   117 Marine Temper… Tidal … FALSE          Temperat… Mari… Starfish    -39.4 
6   119 Marine Temper… Pelagi… FALSE          Temperat… Fish  Fish         44.0 
# ℹ 1 more variable: CENT_LONG <dbl>

Sometimes we need to filter several specific rows. For that, we use %in%.

metadata %>% 
filter(TAXA %in% c("Terrestrial plants",
                   "Birds",
                   "Fish")) %>% 
head()
# A tibble: 6 × 10
     id REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS CENT_LAT
  <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>        <dbl>
1    10 Terre… Temper… Woodla… FALSE          Temperat… Terr… woody pl…     47.4
2    18 Terre… Temper… Sagebr… FALSE          Deserts … Terr… sagebrus…     44.3
3    39 Terre… Temper… Decidu… FALSE          Temperat… Birds birds         43.9
4    41 Terre… Temper… Woodla… FALSE          Temperat… Birds birds         39.5
5    45 Marine Tropic… Reef    FALSE          Tropical… Fish  tropical…    -17.5
6    46 Terre… Temper… Woodla… FALSE          Temperat… Birds breeding…     51.7
# ℹ 1 more variable: CENT_LONG <dbl>

Besides selecting columns and filtering rows we can also create new columns with mutate. Let’s filter only the study number 10 and create a new column to check how many species and years they have sampled in this study. We can use mutate and n_distinct() to achieve that.

biotime %>% 
  filter(id == 10) %>%
  mutate(total_sp = n_distinct(sp),
         total_y = n_distinct(year))
# A tibble: 1,406 × 8
      id  year plot  abundance biomass sp                     total_sp total_y
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>                     <int>   <int>
 1    10  1984 12            1       0 Acer rubrum                  25       3
 2    10  1984 12            3       0 Acer saccharum               25       3
 3    10  1984 12            1       0 Acer spicatum                25       3
 4    10  1984 12           12       0 Corylus cornuta              25       3
 5    10  1984 12            1       0 Populus pinnata              25       3
 6    10  1984 12           20       0 Acer rubrum                  25       3
 7    10  1984 12           14       0 Acer saccharum               25       3
 8    10  1984 12            1       0 Toxicodendron radicans       25       3
 9    10  1984 12            5       0 Acer spicatum                25       3
10    10  1984 12            7       0 Corylus cornuta              25       3
# ℹ 1,396 more rows

In this study we have 25 species and three different years.

Now let’s say we want to know how many species we have per year. To get that, we can use the function group_by.

biotime %>% 
  filter(id == 10) %>%
  group_by(year) %>% 
  mutate(total_sp = n_distinct(sp))
# A tibble: 1,406 × 7
# Groups:   year [3]
      id  year plot  abundance biomass sp                     total_sp
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>                     <int>
 1    10  1984 12            1       0 Acer rubrum                  22
 2    10  1984 12            3       0 Acer saccharum               22
 3    10  1984 12            1       0 Acer spicatum                22
 4    10  1984 12           12       0 Corylus cornuta              22
 5    10  1984 12            1       0 Populus pinnata              22
 6    10  1984 12           20       0 Acer rubrum                  22
 7    10  1984 12           14       0 Acer saccharum               22
 8    10  1984 12            1       0 Toxicodendron radicans       22
 9    10  1984 12            5       0 Acer spicatum                22
10    10  1984 12            7       0 Corylus cornuta              22
# ℹ 1,396 more rows

We can see that we have 22 species in 1984, 20 species in 1992, we can’t see the third year though. That’s because mutate creates a new column and repeats the values for each row we have. To get a summarized overview, we can use summarise instead.

biotime %>% 
  filter(id == 10) %>%
  group_by(year) %>% 
  summarise(total_sp = n_distinct(sp))
# A tibble: 3 × 2
   year total_sp
  <dbl>    <int>
1  1984       22
2  1992       20
3  1996       20

Now we can see the number of species per year, but the other columns were deleted. Therefore, both mutate and summaries give us the same result but in different ways, you have to choose which one fits better your purpose when dealing with the data.

We can also combine summarise and mutate to get, for example, the frequency of species

biotime %>%
  filter(id == 10) %>%
  group_by(sp) %>% 
  summarise (total = n()) %>%
  mutate(freq = total / sum(total))
# A tibble: 25 × 3
   sp                  total     freq
   <chr>               <int>    <dbl>
 1 Acer rubrum           185 0.132   
 2 Acer saccharum        290 0.206   
 3 Acer spicatum         146 0.104   
 4 Amelanchier gravida     1 0.000711
 5 Amelanchier laevis     30 0.0213  
 6 Betula papyrifera      10 0.00711 
 7 Cornus racemosa         6 0.00427 
 8 Cornus rugosa          35 0.0249  
 9 Corylus cornuta       244 0.174   
10 Diervilla lonicera     14 0.00996 
# ℹ 15 more rows

Note that I’ve grouped by “species”. We could also check the frequency of each species per year, adding the variable “year” in group_by().

biotime %>% 
  filter(id == 10) %>%
  group_by(sp,year) %>% 
  summarise (total = n()) %>%
  mutate(freq = total / sum(total))
# A tibble: 62 × 4
# Groups:   sp [25]
   sp                   year total  freq
   <chr>               <dbl> <int> <dbl>
 1 Acer rubrum          1984    77 0.416
 2 Acer rubrum          1992    68 0.368
 3 Acer rubrum          1996    40 0.216
 4 Acer saccharum       1984   105 0.362
 5 Acer saccharum       1992   109 0.376
 6 Acer saccharum       1996    76 0.262
 7 Acer spicatum        1984    62 0.425
 8 Acer spicatum        1992    55 0.377
 9 Acer spicatum        1996    29 0.199
10 Amelanchier gravida  1984     1 1    
# ℹ 52 more rows

tidyr - the art of pivoting data

As ecologists, we have to deal with data frames either in long or wide formats all the time. Knowing how to transform the data into both formats, in R, is very handy and safe, because we don’t need to create several .csv files, maintaining the original data intact, decreasing the chances of messing up it.

The most important functions to transform data frames in tidy are pivot_longer() and pivot_wider(). These functions have been improved constantly. Out there, you will find reshape2 package, melt, spread and gather. Those are the old school functions to transform data frames. I’m slowly moving from spread/gather into pivot_wider()/pivot_longer(). Then, let’s try to transform our data into both wide and long formats using the new functions.

Let’s use the same example (study id 10) to transform the data into the wide format. We use pivot_wider(), specifying that the species names present in “sp” column will be the columns of the new object (names_from = "sp") and the values present in “abundance” will be the row values (values_from = "abundance").

biotime %>% 
  filter(id == 10) %>%
  pivot_wider(names_from = "sp", values_from = "abundance")
# A tibble: 47 × 29
      id  year plot  biomass `Acer rubrum` `Acer saccharum` `Acer spicatum`
   <dbl> <dbl> <chr>   <dbl> <list>        <list>           <list>         
 1    10  1984 12          0 <dbl [4]>     <dbl [3]>        <dbl [2]>      
 2    10  1984 1           0 <dbl [4]>     <dbl [4]>        <dbl [4]>      
 3    10  1984 2           0 <dbl [4]>     <dbl [8]>        <dbl [8]>      
 4    10  1984 34          0 <dbl [4]>     <dbl [6]>        <dbl [8]>      
 5    10  1984 35          0 <dbl [4]>     <dbl [10]>       <dbl [5]>      
 6    10  1984 36          0 <dbl [4]>     <dbl [8]>        <NULL>         
 7    10  1984 37          0 <dbl [3]>     <dbl [10]>       <NULL>         
 8    10  1984 38          0 <dbl [4]>     <dbl [6]>        <NULL>         
 9    10  1984 39          0 <dbl [3]>     <dbl [4]>        <NULL>         
10    10  1984 40          0 <dbl [2]>     <dbl [2]>        <dbl [6]>      
# ℹ 37 more rows
# ℹ 22 more variables: `Corylus cornuta` <list>, `Populus pinnata` <list>,
#   `Toxicodendron radicans` <list>, `Diervilla lonicera` <list>,
#   `Populus sp` <list>, `Cornus rugosa` <list>, `Prunus virginiana` <list>,
#   `Rubus flagellaris` <list>, `Viburnum rafinesquianum` <list>,
#   `Quercus rubra` <list>, `Vaccinium angustifolium` <list>,
#   `Pinus strobus` <list>, `Ostrya virginiana` <list>, …

Oops.. something went wrong. It transformed our data frame but in an odd way. The warning says that our data is not uniquely identified, meaning that the function can’t recognize each row, separately, in our data frame. To tackle this issue, we can simply bring the row names to a column, identifying each row, with the function rownames_to_column(). Let´s try again.

biotime %>% 
  filter(id == 10) %>%
  rownames_to_column() %>% 
  pivot_wider(names_from = "sp", values_from = "abundance")
# A tibble: 1,406 × 30
   rowname    id  year plot  biomass `Acer rubrum` `Acer saccharum`
   <chr>   <dbl> <dbl> <chr>   <dbl>         <dbl>            <dbl>
 1 1          10  1984 12          0             1               NA
 2 2          10  1984 12          0            NA                3
 3 3          10  1984 12          0            NA               NA
 4 4          10  1984 12          0            NA               NA
 5 5          10  1984 12          0            NA               NA
 6 6          10  1984 12          0            20               NA
 7 7          10  1984 12          0            NA               14
 8 8          10  1984 12          0            NA               NA
 9 9          10  1984 12          0            NA               NA
10 10         10  1984 12          0            NA               NA
# ℹ 1,396 more rows
# ℹ 23 more variables: `Acer spicatum` <dbl>, `Corylus cornuta` <dbl>,
#   `Populus pinnata` <dbl>, `Toxicodendron radicans` <dbl>,
#   `Diervilla lonicera` <dbl>, `Populus sp` <dbl>, `Cornus rugosa` <dbl>,
#   `Prunus virginiana` <dbl>, `Rubus flagellaris` <dbl>,
#   `Viburnum rafinesquianum` <dbl>, `Quercus rubra` <dbl>,
#   `Vaccinium angustifolium` <dbl>, `Pinus strobus` <dbl>, …

Ok. It seems we are getting closer. Now we have the study 10 data frame in the wide format but when the species is absent we have NA’s. To tackle this, we add the argument values_fill to our pivot_wider() function.

biotime %>% 
  filter(id == 10) %>%
  rownames_to_column() %>% 
  pivot_wider(names_from = "sp", values_from = "abundance", 
              values_fill = list(abundance=0))
# A tibble: 1,406 × 30
   rowname    id  year plot  biomass `Acer rubrum` `Acer saccharum`
   <chr>   <dbl> <dbl> <chr>   <dbl>         <dbl>            <dbl>
 1 1          10  1984 12          0             1                0
 2 2          10  1984 12          0             0                3
 3 3          10  1984 12          0             0                0
 4 4          10  1984 12          0             0                0
 5 5          10  1984 12          0             0                0
 6 6          10  1984 12          0            20                0
 7 7          10  1984 12          0             0               14
 8 8          10  1984 12          0             0                0
 9 9          10  1984 12          0             0                0
10 10         10  1984 12          0             0                0
# ℹ 1,396 more rows
# ℹ 23 more variables: `Acer spicatum` <dbl>, `Corylus cornuta` <dbl>,
#   `Populus pinnata` <dbl>, `Toxicodendron radicans` <dbl>,
#   `Diervilla lonicera` <dbl>, `Populus sp` <dbl>, `Cornus rugosa` <dbl>,
#   `Prunus virginiana` <dbl>, `Rubus flagellaris` <dbl>,
#   `Viburnum rafinesquianum` <dbl>, `Quercus rubra` <dbl>,
#   `Vaccinium angustifolium` <dbl>, `Pinus strobus` <dbl>, …

Now we have the data transformed into the wide format, with 0’s instead of NA’s. Note that, sometimes, the row names are names instead of numbers (i.e. sometimes we use species names as row names). If you face that, you can create a new column, labeling the rows, using mutate() and row_number() instead of rownames_to_column(). For example:

biotime_wider <- biotime %>% 
  filter(id == 10) %>%
  #rownames_to_column() %>% 
  mutate(id_rows = row_number()) %>% 
  pivot_wider(names_from = "sp", values_from = "abundance", values_fill = list(abundance=0))

Now let´s transform our data into the other way round, from wide into long format. For that, we can use the function pivot_longer()

biotime_wider %>% 
  pivot_longer(names_to = "sp", values_to = "abundance")
Error in `pivot_longer()`:
! `cols` must select at least one column.

Error. When pivoting longer, we have to pay attention to the other columns we may have. In the biotime_wider object we have id, year, plot, biomass, id_rows and species names. We want to transform only species names and values (abundance) though. In this case, we have to omit the other columns we don’t want to transform.

biotime_wider %>% 
  pivot_longer(-c(id:id_rows),names_to = "sp", values_to = "abundance")
# A tibble: 35,150 × 7
      id  year plot  biomass id_rows sp                     abundance
   <dbl> <dbl> <chr>   <dbl>   <int> <chr>                      <dbl>
 1    10  1984 12          0       1 Acer rubrum                    1
 2    10  1984 12          0       1 Acer saccharum                 0
 3    10  1984 12          0       1 Acer spicatum                  0
 4    10  1984 12          0       1 Corylus cornuta                0
 5    10  1984 12          0       1 Populus pinnata                0
 6    10  1984 12          0       1 Toxicodendron radicans         0
 7    10  1984 12          0       1 Diervilla lonicera             0
 8    10  1984 12          0       1 Populus sp                     0
 9    10  1984 12          0       1 Cornus rugosa                  0
10    10  1984 12          0       1 Prunus virginiana              0
# ℹ 35,140 more rows

With pivot_longer() and pivot_wider() you can make a lot of more complex transformations, I’m not going to cover those here though, please check them out at: https://tidyr.tidyverse.org/reference/pivot_wider.html

Merging data and working within lists with purrr

Now let´s try to use this knowledge to create more complex objects, using new features, combining functions etc.

When analyzing ecological data, most of the time, both data and metadata are useful, like in BioTime example.

Let’s check both data frames

BioTime data

head(biotime)
# A tibble: 6 × 6
     id  year plot  abundance biomass sp             
  <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>          
1    10  1984 12            1       0 Acer rubrum    
2    10  1984 12            3       0 Acer saccharum 
3    10  1984 12            1       0 Acer spicatum  
4    10  1984 12           12       0 Corylus cornuta
5    10  1984 12            1       0 Populus pinnata
6    10  1984 12           20       0 Acer rubrum    

Metadata

head(metadata)
# A tibble: 6 × 10
     id REALM  CLIMATE HABITAT PROTECTED_AREA BIOME_MAP TAXA  ORGANISMS CENT_LAT
  <dbl> <chr>  <chr>   <chr>   <lgl>          <chr>     <chr> <chr>        <dbl>
1    10 Terre… Temper… Woodla… FALSE          Temperat… Terr… woody pl…     47.4
2    18 Terre… Temper… Sagebr… FALSE          Deserts … Terr… sagebrus…     44.3
3    33 Marine Temper… Seawee… FALSE          Temperat… Mari… phytopla…     50.2
4    39 Terre… Temper… Decidu… FALSE          Temperat… Birds birds         43.9
5    41 Terre… Temper… Woodla… FALSE          Temperat… Birds birds         39.5
6    45 Marine Tropic… Reef    FALSE          Tropical… Fish  tropical…    -17.5
# ℹ 1 more variable: CENT_LONG <dbl>

As you can see, in biotime we have id, year, plot, abundance, biomass and species, whereas in metadata we have id, realm, habitat etc. So, let’s say we are interested in examining terrestrial plants species from tropical forests, but as species are in one data frame and biomes in the other, we first have to merge both data frames. For merging data frames into one we can use either bind_cols() / bind_rows() functions, for binding columns and rows, respectively, or the join family available in dplyr (https://dplyr.tidyverse.org/reference/join.html). Here, we can use the letf_join() option to merge both data and metadata.

merge_df <- left_join(biotime, metadata, by = "id")
head(merge_df)
# A tibble: 6 × 15
     id  year plot  abundance biomass sp    REALM CLIMATE HABITAT PROTECTED_AREA
  <dbl> <dbl> <chr>     <dbl>   <dbl> <chr> <chr> <chr>   <chr>   <lgl>         
1    10  1984 12            1       0 Acer… Terr… Temper… Woodla… FALSE         
2    10  1984 12            3       0 Acer… Terr… Temper… Woodla… FALSE         
3    10  1984 12            1       0 Acer… Terr… Temper… Woodla… FALSE         
4    10  1984 12           12       0 Cory… Terr… Temper… Woodla… FALSE         
5    10  1984 12            1       0 Popu… Terr… Temper… Woodla… FALSE         
6    10  1984 12           20       0 Acer… Terr… Temper… Woodla… FALSE         
# ℹ 5 more variables: BIOME_MAP <chr>, TAXA <chr>, ORGANISMS <chr>,
#   CENT_LAT <dbl>, CENT_LONG <dbl>

left_join() merged the data frames based on the study id. This is a safe way to merge data frames because the function will match exactly the same rows present in each data frame based on the argument you give in "by = ..."; in this case I chose by = "id", but depending on your data you can add more variables.

Now we have both data and metadata merged. Then, we can finally check terrestrial plant species from tropical forests.

merge_df %>% 
  filter(TAXA == "Terrestrial plants",
       CLIMATE == "Tropical") %>% 
head()
# A tibble: 6 × 15
     id  year plot  abundance biomass sp    REALM CLIMATE HABITAT PROTECTED_AREA
  <dbl> <dbl> <chr>     <dbl>   <dbl> <chr> <chr> <chr>   <chr>   <lgl>         
1   201  1987 <NA>       2454       0 Aidi… Terr… Tropic… Tropic… FALSE         
2   201  1987 <NA>         97       0 Fagr… Terr… Tropic… Tropic… FALSE         
3   201  1987 <NA>        112       0 Fahr… Terr… Tropic… Tropic… FALSE         
4   201  1987 <NA>         45       0 Ficu… Terr… Tropic… Tropic… FALSE         
5   201  1987 <NA>       2913       0 Alan… Terr… Tropic… Tropic… FALSE         
6   201  1987 <NA>          3       0 Ficu… Terr… Tropic… Tropic… FALSE         
# ℹ 5 more variables: BIOME_MAP <chr>, TAXA <chr>, ORGANISMS <chr>,
#   CENT_LAT <dbl>, CENT_LONG <dbl>

or Birds etc

merge_df %>% 
filter(TAXA == "Birds",
       CLIMATE == "Tropical") %>% 
  head()
# A tibble: 6 × 15
     id  year plot  abundance biomass sp    REALM CLIMATE HABITAT PROTECTED_AREA
  <dbl> <dbl> <chr>     <dbl>   <dbl> <chr> <chr> <chr>   <chr>   <lgl>         
1    58  1994 Plot…         2       0 Vire… Terr… Tropic… Long t… FALSE         
2    58  1991 Plot9         1       0 Paru… Terr… Tropic… Long t… FALSE         
3    58  1994 Plot…         1       0 Neso… Terr… Tropic… Long t… FALSE         
4    58  1994 Plot6         7       0 Coer… Terr… Tropic… Long t… FALSE         
5    58  1991 Plot6         9       0 Neso… Terr… Tropic… Long t… FALSE         
6    58  1994 Plot6         1       0 Marg… Terr… Tropic… Long t… FALSE         
# ℹ 5 more variables: BIOME_MAP <chr>, TAXA <chr>, ORGANISMS <chr>,
#   CENT_LAT <dbl>, CENT_LONG <dbl>

Although those functions are useful to explore big data sets, most of the time, specially when working with multiple variables, filtering each one and examining them separately can be time consuming. For example, let’s say we want to transform our current data frame (merge_df) from long into wide format, as we did before. The first time we transformed our data, we had a single study (id=10) and taxa. Now we have multiple studies and taxa. If we try pivot_wider(), plant and bird species names, for example, will be assigned as columns, making things difficult to disentangle afterwards. An useful way to avoid this is to split our data frame into lists based on each taxa we have. Then, we can apply the function we want for all elements of that list at once. We can do that using the package purrr, which has plenty of functions that make ecologists` life a bit easier in that sense. Let’s try to understand how this package works.

First, let’s create different lists based on different taxa with the split() function:

split_taxa <- merge_df %>% 
             split(.$TAXA)

head(split_taxa)
$All
# A tibble: 326,527 × 15
      id  year plot  abundance biomass sp              REALM  CLIMATE HABITAT   
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>           <chr>  <chr>   <chr>     
 1   166  1979 <NA>          5       0 Puffinus gravis Marine Global  Oceanic w…
 2   166  1979 <NA>          2       0 Puffinus gravis Marine Global  Oceanic w…
 3   166  1979 <NA>         15       0 Puffinus gravis Marine Global  Oceanic w…
 4   166  1979 <NA>         23       0 Puffinus gravis Marine Global  Oceanic w…
 5   166  1979 <NA>         20       0 Puffinus gravis Marine Global  Oceanic w…
 6   166  1979 <NA>         20       0 Puffinus gravis Marine Global  Oceanic w…
 7   166  1979 <NA>         60       0 Puffinus gravis Marine Global  Oceanic w…
 8   166  1979 <NA>         16       0 Puffinus gravis Marine Global  Oceanic w…
 9   166  1979 <NA>          1       0 Puffinus gravis Marine Global  Oceanic w…
10   166  1979 <NA>        150       0 Puffinus gravis Marine Global  Oceanic w…
# ℹ 326,517 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
#   ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>

$Amphibians
# A tibble: 2,446 × 15
      id  year plot  abundance biomass sp                  REALM CLIMATE HABITAT
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>               <chr> <chr>   <chr>  
 1   305  1980 <NA>         12       0 Desmognathus quadr… Terr… Temper… Appala…
 2   305  1981 <NA>          4       0 Desmognathus aeneus Terr… Temper… Appala…
 3   305  1981 <NA>         20       0 Desmognathus monti… Terr… Temper… Appala…
 4   305  1976 <NA>          8       0 Desmognathus aeneus Terr… Temper… Appala…
 5   305  1976 <NA>         27       0 Desmognathus monti… Terr… Temper… Appala…
 6   305  1976 <NA>         15       0 Desmognathus ochro… Terr… Temper… Appala…
 7   305  1976 <NA>          7       0 Desmognathus quadr… Terr… Temper… Appala…
 8   305  1977 <NA>          7       0 Desmognathus aeneus Terr… Temper… Appala…
 9   305  1977 <NA>         22       0 Desmognathus monti… Terr… Temper… Appala…
10   305  1977 <NA>         18       0 Desmognathus ochro… Terr… Temper… Appala…
# ℹ 2,436 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
#   ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>

$Benthos
# A tibble: 1,711,785 × 15
      id  year plot  abundance biomass sp                  REALM CLIMATE HABITAT
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>               <chr> <chr>   <chr>  
 1    76  1992 <NA>          2       0 Aphroditoidea       Mari… Temper… Estuar…
 2    76  1992 <NA>          5       0 Ascidiella scabra   Mari… Temper… Estuar…
 3    76  1992 <NA>          4       0 Asterias rubens     Mari… Temper… Estuar…
 4    76  1992 <NA>          9       0 Calliostoma zyziph… Mari… Temper… Estuar…
 5    76  1992 <NA>         10       0 Crepidula fornicata Mari… Temper… Estuar…
 6    76  1992 <NA>          1       0 Gibbula tumida      Mari… Temper… Estuar…
 7    76  1992 <NA>          1       0 Hippolyte varians   Mari… Temper… Estuar…
 8    76  1992 <NA>          1       0 Hyas coarctatus     Mari… Temper… Estuar…
 9    76  1992 <NA>          1       0 Liocarcinus puber   Mari… Temper… Estuar…
10    76  1992 <NA>          1       0 Liparis sp          Mari… Temper… Estuar…
# ℹ 1,711,775 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
#   ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>

$Birds
# A tibble: 1,487,942 × 15
      id  year plot  abundance biomass sp                  REALM CLIMATE HABITAT
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>               <chr> <chr>   <chr>  
 1    39  1976 <NA>        7.5      NA Catharus guttatus   Terr… Temper… Decidu…
 2    39  1976 <NA>        8        NA Catharus ustulatus  Terr… Temper… Decidu…
 3    39  1976 <NA>       12        NA Dendroica caerules… Terr… Temper… Decidu…
 4    39  1977 <NA>        2.5      NA Catharus fuscescens Terr… Temper… Decidu…
 5    39  1977 <NA>        5        NA Catharus guttatus   Terr… Temper… Decidu…
 6    39  1977 <NA>        7        NA Catharus ustulatus  Terr… Temper… Decidu…
 7    39  1977 <NA>       11        NA Dendroica caerules… Terr… Temper… Decidu…
 8    39  1977 <NA>        2        NA Dendroica fusca     Terr… Temper… Decidu…
 9    39  1977 <NA>       28        NA Empidonax minimus   Terr… Temper… Decidu…
10    39  1977 <NA>        5        NA Hylocichla musteli… Terr… Temper… Decidu…
# ℹ 1,487,932 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
#   ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>

$Fish
# A tibble: 2,294,445 × 15
      id  year plot  abundance biomass sp                  REALM CLIMATE HABITAT
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>               <chr> <chr>   <chr>  
 1    45  2006 <NA>          1   4.80  Canthigaster solan… Mari… Tropic… Reef   
 2    45  2006 <NA>          2   0.630 Ctenogobiops feroc… Mari… Tropic… Reef   
 3    45  2006 <NA>          1  26.8   Echidna nebulosa    Mari… Tropic… Reef   
 4    45  2006 <NA>          1   1.87  Epinephelus merra   Mari… Tropic… Reef   
 5    45  2006 <NA>          1   8.31  Paracirrhites arca… Mari… Tropic… Reef   
 6    45  2006 <NA>          1   0.200 Pseudocheilinus he… Mari… Tropic… Reef   
 7    45  2006 <NA>         12  70.7   Stegastes nigricans Mari… Tropic… Reef   
 8    45  2006 <NA>          2  38.7   Acanthurus nigrofu… Mari… Tropic… Reef   
 9    45  2006 <NA>          1  46.2   Acanthurus trioste… Mari… Tropic… Reef   
10    45  2006 <NA>          1  18.3   Anampses twistii    Mari… Tropic… Reef   
# ℹ 2,294,435 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
#   ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>

$`Freshwater invertebrates`
# A tibble: 93,966 × 15
      id  year plot  abundance biomass sp                  REALM CLIMATE HABITAT
   <dbl> <dbl> <chr>     <dbl>   <dbl> <chr>               <chr> <chr>   <chr>  
 1   237  1984 <NA>       2000       0 Daphnia pulicaria   Fres… Temper… Temper…
 2   237  1983 <NA>      28000       0 Notholca acuminata  Fres… Temper… Temper…
 3   237  1984 <NA>       8000       0 Keratella quadrata  Fres… Temper… Temper…
 4   237  1984 <NA>      19000       0 Notholca            Fres… Temper… Temper…
 5   237  1984 <NA>      11000       0 Synchaeta           Fres… Temper… Temper…
 6   237  1984 <NA>     768000       0 Conochiloides nata… Fres… Temper… Temper…
 7   237  1984 <NA>     156000       0 Cyclopoida copepod… Fres… Temper… Temper…
 8   237  1984 <NA>      16000       0 Daphnia mendotae    Fres… Temper… Temper…
 9   237  1983 <NA>     194000       0 Copepod nauplii     Fres… Temper… Temper…
10   237  1983 <NA>       9000       0 Polyarthra dolicho… Fres… Temper… Temper…
# ℹ 93,956 more rows
# ℹ 6 more variables: PROTECTED_AREA <lgl>, BIOME_MAP <chr>, TAXA <chr>,
#   ORGANISMS <chr>, CENT_LAT <dbl>, CENT_LONG <dbl>

To work with lists in purrr the map function will be our loyal servant, and they are many: i.e. map, map2, map_dfr, map_dfc etc.

  • map is used when we want to apply a function to each element of one list (it works like lapply)

  • map2 is used when we want to apply a function to each element of two lists (it works like mapply)

  • map_dfr and map_dfc are used when we want to combine lists into data frames by rows or columns, respectively.

I will use here map and map2 (specially in data viz and statistical modelling post); if you want to know more, take a look at:

https://dcl-prog.stanford.edu/purrr-basics.html

and

https://jennybc.github.io/purrr-tutorial/

Let’s start with map(), filtering studies that have been sampled over more than 10 years.

filter_year <- split_taxa %>% map(~ .x %>%
                                  group_by(id) %>% 
                                  mutate(n_year=n_distinct(year)) %>% # here we create a column showing how many years we have for each taxa, grouped by study id 
                                  filter(n_year > 10)) # and filter only those studies with more than 10 years

In case you are not used to tidyverse, map() works like lapply() in base R. Tilda(~) represents function(x) and (.) would be (x). Just to show it, in a very non-sense way, let’s sum up the overall number of individuals of each taxa. To make it simple (and even poorer), let’s select only the abundance column first.

select_abundance <- filter_year %>% 
  map(~ .x %>% ungroup %>% select(abundance))

head(select_abundance)
$All
# A tibble: 313,561 × 1
   abundance
       <dbl>
 1         5
 2         2
 3        15
 4        23
 5        20
 6        20
 7        60
 8        16
 9         1
10       150
# ℹ 313,551 more rows

$Amphibians
# A tibble: 327 × 1
   abundance
       <dbl>
 1        12
 2         4
 3        20
 4         8
 5        27
 6        15
 7         7
 8         7
 9        22
10        18
# ℹ 317 more rows

$Benthos
# A tibble: 1,594,662 × 1
   abundance
       <dbl>
 1        18
 2        36
 3         2
 4      1038
 5       158
 6       130
 7        48
 8        36
 9        30
10         6
# ℹ 1,594,652 more rows

$Birds
# A tibble: 1,288,906 × 1
   abundance
       <dbl>
 1       7.5
 2       8  
 3      12  
 4       2.5
 5       5  
 6       7  
 7      11  
 8       2  
 9      28  
10       5  
# ℹ 1,288,896 more rows

$Fish
# A tibble: 2,026,702 × 1
   abundance
       <dbl>
 1         7
 2         5
 3         2
 4         1
 5         1
 6         6
 7         6
 8        52
 9        16
10         1
# ℹ 2,026,692 more rows

$`Freshwater invertebrates`
# A tibble: 91,528 × 1
   abundance
       <dbl>
 1      2000
 2     28000
 3      8000
 4     19000
 5     11000
 6    768000
 7    156000
 8     16000
 9    194000
10      9000
# ℹ 91,518 more rows

Using the base R approach, to sum up a column over lists, we would use something like this:

lapply(select_abundance, function(x) { colSums (x)} )
$All
abundance 
 10525821 

$Amphibians
abundance 
   115344 

$Benthos
  abundance 
23503327052 

$Birds
abundance 
 14673312 

$Fish
abundance 
 52881387 

$`Freshwater invertebrates`
 abundance 
2898542291 

$`Freshwater plants`
abundance 
       NA 

$Mammals
abundance 
 679713.1 

$`Marine invertebrates`
abundance 
256799039 

$`Marine plants`
  abundance 
25161115487 

$Reptiles
abundance 
    23555 

$`Terrestrial invertebrates`
abundance 
  1022759 

$`Terrestrial plants`
abundance 
  2427471 

With purrr:

select_abundance %>% 
  map(~ .x %>%  
       summarise(row_sums = colSums(.)))

or

select_abundance %>% 
  map(~summarise(., row_sums = colSums(.)))
$All
# A tibble: 1 × 1
  row_sums
     <dbl>
1 10525821

$Amphibians
# A tibble: 1 × 1
  row_sums
     <dbl>
1   115344

$Benthos
# A tibble: 1 × 1
      row_sums
         <dbl>
1 23503327052.

$Birds
# A tibble: 1 × 1
   row_sums
      <dbl>
1 14673312.

$Fish
# A tibble: 1 × 1
   row_sums
      <dbl>
1 52881387.

$`Freshwater invertebrates`
# A tibble: 1 × 1
     row_sums
        <dbl>
1 2898542291.

$`Freshwater plants`
# A tibble: 1 × 1
  row_sums
     <dbl>
1       NA

$Mammals
# A tibble: 1 × 1
  row_sums
     <dbl>
1  679713.

$`Marine invertebrates`
# A tibble: 1 × 1
    row_sums
       <dbl>
1 256799039.

$`Marine plants`
# A tibble: 1 × 1
      row_sums
         <dbl>
1 25161115487.

$Reptiles
# A tibble: 1 × 1
  row_sums
     <dbl>
1    23555

$`Terrestrial invertebrates`
# A tibble: 1 × 1
  row_sums
     <dbl>
1  1022759

$`Terrestrial plants`
# A tibble: 1 × 1
  row_sums
     <dbl>
1 2427471.

We filtered our data by n_years > 10, let’s increase this number selecting only studies with more that 30 years of sampling.

bio_list <- split_taxa %>% map(~ .x %>%
                      group_by(id) %>% 
                      mutate(nyear = n_distinct(year)) %>%
                      filter(nyear > 30) %>% 
                      select(id:sp,-nyear))

Now we have some elements in our list with 0, meaning that there is no study with more than 30 years for that specific taxa. We can use either discard() or keep() to exclude or keep elements in our list, respectively. Let’s exclude those elements without any data.

discard_elements <- bio_list %>% discard(., ~nrow(.) == 0)

Now, instead of 13 taxa we have seven.

If you want to exclude elements in a list by name, we can use list_modify("listElementName" = zap()). Note that, previously, we could use list_modify = "NULL", but this option is now deprecated. For example, let’s exclude “Benthos” from our object

no_benthos <- discard_elements %>% 
  list_modify("Benthos" = zap())

Let’s create a more complex object now. As we have both abundance and biomass column, let’s create a new column representing presence/absence (pres_ab). First, though, we have to deal with the NA’s in our data. We can use replace_na() for that.

presence_abs <- no_benthos %>% 
  map(~ .x %>%
                      mutate(i = row_number()) %>% 
                      group_by(id) %>% 
                      replace_na(list(abundance = 0, biomass = 0)) %>% 
                      mutate(pres_ab = abundance+biomass,
                             pres_ab = case_when(pres_ab > 0 ~ 1,
                                   T ~ 0)))

Here I introduced a new (very useful) function case_when(). In the current example, I created the “pres_ab” column and asked the function to assign 1 (presence) case when the sum of abundance+biomass is higher than 0 and assign 0 when the opposite is true (T ~ 0) . You can use case_when() in many different ways, being very useful when we have to create new columns based on other columns or change values within columns.

As in the long format we don’t have species assigned with 0 either in abundance or biomass, “pres_ab” column will be always 1 (my bad haha). Let’s transform the data into the wide format to get presences and absences indeed. If we were not working with lists, we should transform each taxa separately. We can now combine purrr with pivot_wider() and transform all elements of our list.

wide_list <- presence_abs %>% 
  map(~ .x %>%
        group_by(id,year,sp) %>% 
        select(-plot,-abundance, -biomass) %>%
        summarise(pres_ab = sum(pres_ab)) %>%
        mutate(pres_ab = case_when(pres_ab > 0 ~ 1,
                                   T ~ 0))%>% 
        ungroup %>% 
        arrange(sp) %>% # sp names in alphabetic order
        pivot_wider(names_from = "sp", values_from = "pres_ab", 
                    values_fill = list(pres_ab=0)))

Now we have 0’s and 1’s. As we have species presences absences per year, we can check species gains and losses. For that we transform our data back into the long format (now with 0’s and 1’s) and use case_when() again to create the new column

As we are working with temporal data, we can use case_when() combined with lag() or lead() to analyse either the previous or next row in our data set, respectively. For example, let’s create a new column to check species gains and losses per year.

Now I will transform the lists into the long format again and check species gains and losses using case_when().

long_list <-wide_list %>% 
  map(~.x %>% 
        pivot_longer(-c(id,year),names_to = "sp", values_to = "pres_ab"))

Now we can check how each species is changing over time

gain_loss <- long_list %>% 
  map(~.x %>% 
        group_by(id,sp) %>% 
        arrange(year, .by_group = TRUE)%>%
    mutate(gain_loss = case_when(row_number() == 1 ~ "NA",
                                 lag(pres_ab == 0) & pres_ab == 1  ~ "gain", 
                                 lag(pres_ab == 1) & pres_ab == 0 ~ "loss",
                                 lag(pres_ab == 0) & pres_ab == 0  ~ "persist_ab",
                         T ~ "persist_pres")))

Here I first grouped the data by study id and species, then I ordered the year, separately, by group. With mutate(), I created the column gain_loss, putting NA to the first year of each study, since we cant have gains or losses at the first year. Further, I assigned gain when the species was absent at the previous year (lag(pres_ab==0)) and present in the next one (pres_ab==1) and the opposite to loss. Also, if the species persisted absent or present, but we could also use “NA” if the idea is to check only gains and losses.

Another nice trick, when dealing with lists, is that sometimes we need to use the name of the elements in a list as a column. For example, here the name of the elements in our list is the taxa’s name. We can use imap and mutate to create a new column based on these names. With mutate (in dplyr 1.0.0) we can also select the position of the new column with .before and .after, otherwise the new column will be the last one.

taxa_list <- gain_loss %>% imap(., ~mutate(.x, TAXA = .y, .after = year))

Some other tricks with purrr

Sometimes, we have to deal with nested lists. I will not focus on nested lists here, but just out of curiosity, let’s split our list once more, now based on the different studies we have. So we will have lists based on taxa and nested based on studies.

nest_list <- wide_list %>% 
  map(~split(.,.$id))

Usually, when I have such kind of data, I use nested maps - not sure if this is the most elegant approach though. For example, let’s say we want to check the species richness for each taxa in each study based on years of sampling. I’d do something like this:

double_map <- nest_list %>% map(map, ~mutate(.,richness = rowSums(.[-1:-2]), .after = year))

To work with two lists, we can use map2() . A simple example:

list1<- as.list(c(a = 1, b = 2 ,c = 3))

list2<- as.list(c(a = 1, b = 2 ,c = 3))

map2(list1, list2, sum)
$a
[1] 2

$b
[1] 4

$c
[1] 6

I will use more map2() in the following posts, but this is particular interesting for those working with functional diversity of multiple taxa. You can do the same we did here (split the data by taxa), in both community and trait data, and then use map2() to calculate functional diversity. It will not work here of course, but something like this might work for you:

library(FD)
t1 <- map2(list_trait, list_comm, dbFD) # list with traits, list with communities and the function you want to apply (in this case, dbFD)

and then you can get the CWM, Rao or whatever metric you want from the dbFD() function

list_CWM <- t1 %>% map("CWM")

list_RaoQ <- t1 %>% map("RaoQ") %>% 
  map(~ .x %>% as.data.frame())

Another important trick we can do with purrr is to load multiple data into R. Sometimes we download ecological data and the .csv files come separated into multiple zip files. If you have faced that, and had to extract each file one by one, guess what? You can run them all at once with purrr:

manyfiles <- list.files(here("data"), pattern = "*.csv", full.names = T) %>%
        map_df(~vroom(.)) #here("data"), as I explained at the beginning of this post, is where your files are located in your Rproject

In summary, tidyverse has many functions that can boost our data analysis and how / when to use them will depend on your data and what you want to get from it. I hope to have kicked off some new useful functions and tricks.