R Programming

Preamble

Over the past couple of years, I’ve had the privilege of advancing my R skills and acquiring useful functions that should aid anyone using R for behavioral science. This list is not exhaustive; it is simply a collection of my most used functions, packages, and useful tips!

Pipe %>% Operator

The pipe operator, written as %>%, takes the output of one function and passes it into another function as an argument. This allows us to link together a sequence of analysis steps.

As a mathematical analogy, f(x) can be rewritten as x %>% f.

library(magrittr) # load the package that provides the %>% pipe (dplyr re-exports it as well)

## compute the logarithm of `x`

x <- 1

log(x)
## [1] 0
## compute the logarithm of `x` using the pipe

x %>% log()
## [1] 0

Why is this useful, though? R is a functional language, which means that your code often contains a lot of parentheses, ( and ). With complex code, this often means nesting function calls inside one another, which makes your R code hard to read and understand.

# Initialize `x`
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of `x`, return suitably lagged and 
# iterated differences, 
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1
# The same computation as above, but using the pipe operator

x %>% log() %>%
    diff() %>%
    exp() %>%
    round(1)
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

In short, here are four reasons why you should be using pipes in R:

  1. You’ll structure the sequence of your data operations from left to right, as opposed to from the inside out;

  2. You’ll avoid nested function calls;

  3. You’ll minimize the need for local variables and function definitions;

  4. You’ll make it easy to add steps anywhere in the sequence of operations.
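To illustrate that last point, here is the earlier pipeline with one extra step slotted in. Adding the step costs a single line rather than another layer of parentheses; the sort() call here is purely for illustration and was not part of the original computation.

# same pipeline as before, with an extra sort() step added in the middle
x %>% log() %>%
    diff() %>%
    sort() %>%
    exp() %>%
    round(1)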

dplyr package

By far my most used package in R is dplyr. See documentation here

dplyr is part of the tidyverse collection of R packages for data science. At its core are five verbs, plus group_by(), which I use (typically chained together with the pipe operator %>%) in every single analysis:

  1. mutate() adds new variables that are functions of existing variables.

  2. select() picks variables based on their names.

  3. filter() picks cases based on their values.

  4. summarise() reduces multiple values down to a single summary.

  5. arrange() changes the ordering of the rows.

  6. group_by() allows for group operations in the “split-apply-combine” concept.

I’ll demonstrate below, using strictly dplyr functions, with two datasets: PlantGrowth, the results of an experiment on plant growth under 3 conditions, and mtcars, which records fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

library(dplyr)

summary(PlantGrowth)
##      weight       group   
##  Min.   :3.590   ctrl:10  
##  1st Qu.:4.550   trt1:10  
##  Median :5.155   trt2:10  
##  Mean   :5.073            
##  3rd Qu.:5.530            
##  Max.   :6.310
# calculate the average weight of the plants by condition

PlantGrowth %>%
  group_by(group) %>%
  summarise(mean_growth = mean(weight))
## # A tibble: 3 x 2
##   group mean_growth
##   <fct>       <dbl>
## 1 ctrl         5.03
## 2 trt1         4.66
## 3 trt2         5.53
# create a new column converting weight from lbs to kg (1 lb ≈ 0.453 kg)
# filter 6-cylinder cars only
# isolate model name, mpg, and wt in kg
# arrange the data from lightest to heaviest

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars %>%
  mutate(wt_kg = (wt*1000)*0.453) %>%
  filter(cyl == 6) %>%
  select(mpg, wt_kg) %>%
  arrange(wt_kg)
##                 mpg    wt_kg
## Mazda RX4      21.0 1186.860
## Ferrari Dino   19.7 1254.810
## Mazda RX4 Wag  21.0 1302.375
## Hornet 4 Drive 21.4 1456.395
## Merc 280       19.2 1558.320
## Merc 280C      17.8 1558.320
## Valiant        18.1 1567.380

As you can see, with only a few lines of code, we can chain various cleaning commands together and produce the desired output. I highly recommend the dplyr package for all data cleaning purposes. Here’s a very nice cheat sheet that you should bookmark.
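One more pattern worth showing: group_by() also pairs with mutate() for group-wise calculations, the “split-apply-combine” idea mentioned above. Here is a quick sketch (not part of the analysis above) that compares each car’s mpg to the average of its cylinder group; the column name mpg_vs_group is just one I made up for illustration.

# compare each car's mpg to the average mpg of its cylinder group
mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_vs_group = mpg - mean(mpg)) %>%
  ungroup() %>%
  select(cyl, mpg, mpg_vs_group) %>%
  head()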

tidy data

I believe most of the time spent doing data analysis is actually spent doing data cleaning. While cleaning is typically the first step, it usually must be repeated many times over the course of an analysis as new problems come to light or new data is collected. To this end, tidying data is a way of structuring datasets to facilitate analysis.

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Every column is a variable.

  2. Every row is an observation.

  3. Every cell is a single value.

While these are the main principles behind tidy data, there is a lot of nuance, and hundreds of datasets break these rules. Practice is the best lesson here, and you’ll find that once you have assembled a tidy dataset, conducting statistical analyses and visualizations becomes far easier. Below, I’ll provide two examples of non-tidy data, each followed by its tidied version.

# FIRST EXAMPLE

library(tidyr) # provides pivot_longer() and the relig_income / billboard datasets

head(relig_income)
## # A tibble: 6 x 11
##   religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Agnostic       27        34        60        81        76       137        122
## 2 Atheist        12        27        37        52        35        70         73
## 3 Buddhist       27        21        30        34        33        58         62
## 4 Catholic      418       617       732       670       638      1116        949
## 5 Don’t kn~      15        14        15        11        10        35         21
## 6 Evangeli~     575       869      1064       982       881      1486        949
## # ... with 3 more variables: $100-150k <dbl>, >150k <dbl>,
## #   Don't know/refused <dbl>
# notice the column names, let's fix that

relig_income %>% 
  pivot_longer(-religion, names_to = "income", values_to = "frequency")
## # A tibble: 180 x 3
##    religion income             frequency
##    <chr>    <chr>                  <dbl>
##  1 Agnostic <$10k                     27
##  2 Agnostic $10-20k                   34
##  3 Agnostic $20-30k                   60
##  4 Agnostic $30-40k                   81
##  5 Agnostic $40-50k                   76
##  6 Agnostic $50-75k                  137
##  7 Agnostic $75-100k                 122
##  8 Agnostic $100-150k                109
##  9 Agnostic >150k                     84
## 10 Agnostic Don't know/refused        96
## # ... with 170 more rows

This dataset has three variables: religion, income and frequency. To tidy it, we needed to pivot the non-variable columns into a two-column key-value pair. This action is often described as making a wide dataset longer.

When pivoting, we need to provide the names of the new key-value columns to create. After defining the columns to pivot (every column except religion), we give the name of the key column, which is the variable defined by the values of the column headings; in this case it’s income, passed to names_to. The second argument is the name of the value column, frequency, passed to values_to.
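For readers who prefer explicit argument names, the same call can be written with every argument spelled out; this is equivalent to the piped version above and only restates it.

# the same pivot, with all arguments named
relig_income %>% 
  pivot_longer(
    cols = -religion,        # pivot every column except religion
    names_to = "income",     # key column, built from the old column headings
    values_to = "frequency"  # value column, built from the old cell values
  )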

# SECOND EXAMPLE

head(billboard)
## # A tibble: 6 x 79
##   artist   track    date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>    <chr>    <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac    Baby Do~ 2000-02-26      87    82    72    77    87    94    99    NA
## 2 2Ge+her  The Har~ 2000-09-02      91    87    92    NA    NA    NA    NA    NA
## 3 3 Doors~ Krypton~ 2000-04-08      81    70    68    67    66    57    54    53
## 4 3 Doors~ Loser    2000-10-21      76    76    72    69    67    65    55    59
## 5 504 Boyz Wobble ~ 2000-04-15      57    34    25    17    17    31    36    49
## 6 98^0     Give Me~ 2000-08-19      51    39    34    26    26    19     2     2
## # ... with 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, ...

The above dataset records the date a song first entered the Billboard Top 100. It has variables for artist, track, date.entered, rank and week. The rank in each week after the song enters the Top 100 is recorded in 76 columns, wk1 to wk76. This form of storage is not tidy, but it is useful for data entry. It reduces duplication: otherwise, each song in each week would need its own row, and song metadata like title and artist would need to be repeated.

billboard %>% 
  pivot_longer(
    wk1:wk76, 
    names_to = "week", 
    values_to = "rank", 
    values_drop_na = TRUE
  ) %>%
  mutate(week = as.integer(gsub("wk", "", week)),
         date = as.Date(as.Date(date.entered) + 7 * (week - 1)),
         date.entered = NULL)
## # A tibble: 5,307 x 5
##    artist  track                    week  rank date      
##    <chr>   <chr>                   <int> <dbl> <date>    
##  1 2 Pac   Baby Don't Cry (Keep...     1    87 2000-02-26
##  2 2 Pac   Baby Don't Cry (Keep...     2    82 2000-03-04
##  3 2 Pac   Baby Don't Cry (Keep...     3    72 2000-03-11
##  4 2 Pac   Baby Don't Cry (Keep...     4    77 2000-03-18
##  5 2 Pac   Baby Don't Cry (Keep...     5    87 2000-03-25
##  6 2 Pac   Baby Don't Cry (Keep...     6    94 2000-04-01
##  7 2 Pac   Baby Don't Cry (Keep...     7    99 2000-04-08
##  8 2Ge+her The Hardest Part Of ...     1    91 2000-09-02
##  9 2Ge+her The Hardest Part Of ...     2    87 2000-09-09
## 10 2Ge+her The Hardest Part Of ...     3    92 2000-09-16
## # ... with 5,297 more rows

To tidy this dataset, we first use pivot_longer() to make the dataset longer. We transform the columns wk1 to wk76, making a new column for their names, week, and a new column for their values, rank. Next, we use values_drop_na = TRUE to drop any missing values from the rank column. In this data, missing values represent weeks that the song wasn’t on the charts, so they can be safely dropped.

In this case it’s also nice to do a little cleaning, converting the week variable to a number, and figuring out the date corresponding to each week on the charts.
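To show why the tidy form pays off, here is a short sketch that feeds the tidied data into the dplyr verbs from earlier. The name billboard_tidy is simply one I’m introducing here to hold the result of the pivot above.

# store the tidied data, then summarise it per track
billboard_tidy <- billboard %>% 
  pivot_longer(wk1:wk76, names_to = "week", values_to = "rank",
               values_drop_na = TRUE) %>%
  mutate(week = as.integer(gsub("wk", "", week)))

# how many weeks did each track chart, and what was its best rank?
billboard_tidy %>%
  group_by(artist, track) %>%
  summarise(weeks_on_chart = n(), best_rank = min(rank), .groups = "drop") %>%
  arrange(best_rank)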

These are just a couple of examples of how to tidy data, but having worked with hundreds of datasets from different sources, I can say there will always be unique challenges that require creative thinking and patience.