Exploring Simpson’s Paradox Within the Penguin Dataset

And simultaneously demonstrating the capabilities of Quarto.

This document is a short analysis of the Penguin Dataset. It explores the relationship between bill length and bill depth and shows how important it is to consider group effects.

Yan Holtz

Independent 😀


February 26, 2024


Quarto, Paradox, Data Analysis

This Quarto document serves as a practical illustration of the concepts covered in the Productive R Workflow online course. It’s designed primarily for educational purposes, so the focus is on demonstrating Quarto techniques rather than on the rigor of its scientific content.

1 Introduction

This document offers a straightforward analysis of the well-known penguin dataset. It is designed to complement the Productive R Workflow online course.

You can read more about the penguin dataset here.

Let’s load libraries before we start!

# load the tidyverse
library(tidyverse)     # data wrangling (dplyr, %>%) and charts (ggplot2)
library(hrbrthemes)    # ipsum theme for ggplot2 charts
library(patchwork)     # combine charts together
library(DT)            # interactive tables
library(knitr)         # static table with the kable() function
library(plotly)        # interactive graphs

2 Loading data

The dataset has already been loaded and cleaned in the previous step of this pipeline.

Let’s load the clean version, together with a few functions available in functions.R.

# Source functions
source(file = "functions.R")  # assumes functions.R sits next to this document

# Read the clean dataset
data <- readRDS(file = "../input/clean_data.rds")

Note that bill_length_mm and bill_depth_mm have the following meaning.

Bill measurement explanation

In case you’re wondering what the original dataset looks like, here is a searchable version of it, made using the DT package:

datatable(data, options = list(pageLength = 5), filter = "top")

3 Bill Length and Bill Depth

Now, let’s run some descriptive analysis, including summary statistics and graphs.

What’s striking is the slightly negative relationship between bill length and bill depth. One would expect the opposite: a longer bill should also tend to be a deeper one.

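# Scatterplot of bill depth against bill length, all species pooled
p <- data %>%
  ggplot(
    aes(x = bill_length_mm, y = bill_depth_mm)
  ) +
    geom_point(color="#69b3a2") +
    labs(
      x = "Bill Length (mm)",
      y = "Bill Depth (mm)",
      title = paste("Surprising relationship?")
    )

# Display as an interactive chart (plotly is loaded above; this call is assumed)
ggplotly(p)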

Relationship between bill length and bill depth. All data points included.

It is also interesting to note that bill length and bill depth differ considerably from one species to another. The average of a variable can be computed as follows:

\[\mathrm{Avg} = \frac{1}{n}\sum_{i=1}^{n} a_{i} = \frac{a_{1} + a_{2} + \cdots + a_{n}}{n}\]
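As a quick illustration of this formula, here is a minimal sketch in base R. The values in `a` are made up for the example and are not taken from the penguin dataset. It also shows why the summaries in this document pass `na.rm = TRUE`: a single missing value otherwise turns the whole average into `NA`.

```r
# Hypothetical values, not penguin measurements
a <- c(2, 4, 9)
n <- length(a)

# Average written out as in the formula: sum of the a_i divided by n
avg_manual <- sum(a) / n

# Same computation with the built-in helper
avg_builtin <- mean(a)

# With a missing value, mean() returns NA unless na.rm = TRUE is set
a_with_na <- c(2, 4, NA, 9)
avg_with_na <- mean(a_with_na)
avg_na_removed <- mean(a_with_na, na.rm = TRUE)
```

Both `avg_manual` and `avg_builtin` hold the same value, and `avg_na_removed` matches them because the missing entry is simply dropped before averaging.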

Bill length and bill depth averages are summarized in the two tables below.

bill_length_per_specie <- data %>%
 group_by(species) %>% 
  summarise(average_bill_length = mean(bill_length_mm, na.rm = TRUE))

bill_depth_per_specie <- data %>%
 group_by(species) %>% 
  summarise(average_bill_depth = mean(bill_depth_mm, na.rm = TRUE))

bill_length_adelie <- bill_length_per_specie %>%
  filter(species == "Adelie") %>%
  pull(average_bill_length) %>%
  round(2)

species     average_bill_length
Adelie      38.80872
Chinstrap   48.83382
Gentoo      47.50488

species     average_bill_depth
Adelie      18.34228
Chinstrap   18.42059
Gentoo      14.98211

For instance, the average bill length for the Adelie species is 38.81.

Now, let’s check the relationship between bill depth and bill length for each species separately:

# Use the function in functions.R
p1 <- create_scatterplot(data, "Adelie", "#6689c6")
p2 <- create_scatterplot(data, "Chinstrap", "#e85252")
p3 <- create_scatterplot(data, "Gentoo", "#9a6fb0")

p1 + p2 + p3
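The create_scatterplot() helper lives in functions.R, which is not reproduced in this document. As a rough sketch only, a helper accepting the same three arguments could look like the following. The function name `create_scatterplot_sketch`, its argument names, the filtering logic, and the toy data frame are all assumptions made for illustration, not the course's actual code.

```r
library(ggplot2)

# Hypothetical stand-in for the helper defined in functions.R
create_scatterplot_sketch <- function(data, selected_species, color) {
  # Keep only the rows belonging to the requested species
  keep <- !is.na(data$species) & data$species == selected_species
  species_data <- data[keep, ]

  # Same axes as the pooled chart, restricted to one species
  ggplot(species_data, aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point(color = color, na.rm = TRUE) +
    labs(
      x = "Bill Length (mm)",
      y = "Bill Depth (mm)",
      title = selected_species
    )
}

# Made-up rows, only to show the call; not real penguin measurements
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  bill_length_mm = c(38, 40, 47),
  bill_depth_mm = c(18, 19, 15)
)
p_toy <- create_scatterplot_sketch(toy, "Adelie", "#6689c6")
```

Returning the ggplot object, rather than printing it inside the function, is what allows the charts to be combined afterwards with patchwork's `+` operator.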

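There is actually a positive correlation within each species. The negative trend seen when all data points are pooled comes from the differences between species, not from the relationship inside any one of them: this is Simpson’s paradox.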
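The reversal can be reproduced on a tiny example. The sketch below uses base R only and a made-up data frame (`toy_groups`, with hypothetical columns `group`, `x` and `y`), not the penguin measurements: each group has a rising trend, but one group sits lower and further right than the other, so the pooled correlation turns negative.

```r
# Made-up data: two groups, each with a perfectly increasing y
toy_groups <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x = 1:10,
  y = c(11:15, 3:7)
)

# Correlation computed on all rows at once
overall_cor <- cor(toy_groups$x, toy_groups$y)

# Correlation computed separately inside each group
within_cor <- sapply(
  split(toy_groups, toy_groups$group),
  function(d) cor(d$x, d$y)
)

# TRUE when pooling flips the sign of the relationship
sign_flips <- overall_cor < 0 && all(within_cor > 0)
```

The same check could be run on the penguin data by replacing `x`, `y` and `group` with `bill_length_mm`, `bill_depth_mm` and `species`, and passing `use = "complete.obs"` to `cor()` so that rows with missing values are skipped.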