Class 10: Halloween Mini Project

Laura Lu (PID: A17844089)

Data Import
Quick overview of the dataset
Overall Candy Rankings
Winpercent and Pricepercent
Exploring the correlation structure
Principal Component Analysis

As it is nearly Halloween and a half way point in the quarter let’s do a mini project to help us figure out the best candy!

Our come from the 538 website and is available as a CSV file:

Data Import

candy <- read.csv( "candy-data.csv", row.names = 1)
head(candy) 

             chocolate fruity caramel peanutyalmondy nougat crispedricewafer
100 Grand            1      0       1              0      0                1
3 Musketeers         1      0       0              0      1                0
One dime             0      0       0              0      0                0
One quarter          0      0       0              0      0                0
Air Heads            0      1       0              0      0                0
Almond Joy           1      0       0              1      0                0
             hard bar pluribus sugarpercent pricepercent winpercent
100 Grand       0   1        0        0.732        0.860   66.97173
3 Musketeers    0   1        0        0.604        0.511   67.60294
One dime        0   0        0        0.011        0.116   32.26109
One quarter     0   0        0        0.011        0.511   46.11650
Air Heads       0   0        0        0.906        0.511   52.34146
Almond Joy      0   1        0        0.465        0.767   50.34755

flextable::flextable(head(candy,10))

Q1. How many different candy types are in this dataset?

There are 6 different candy types.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

candy %>%
  nrow()

[1] 85

Q2. how many druity candy types are in the dataset?

sum(candy$fruity)

[1] 38

My favorite winpercent

 candy["Twix",]$winpercent 

[1] 81.64291

library(dplyr) 

candy |> 
  filter(rownames(candy)=="Almond Joy") |>
  select(winpercent)

           winpercent
Almond Joy   50.34755

Quick overview of the dataset

library(skimr) 

skim(candy) 


Name	candy
Number of rows	85
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Data summary

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
chocolate	1	0.44	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
fruity	1	0.45	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
caramel	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
peanutyalmondy	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
nougat	1	0.08	0.28	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
crispedricewafer	1	0.08	0.28	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
hard	1	0.18	0.38	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
bar	1	0.25	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
pluribus	1	0.52	0.50	0.00	0.00	1.00	1.00	1.00	▇▁▁▁▇
sugarpercent	1	0.48	0.28	0.01	0.22	0.47	0.73	0.99	▇▇▇▇▆
pricepercent	1	0.47	0.29	0.01	0.26	0.47	0.65	0.98	▇▇▇▇▆
winpercent	1	50.32	14.71	22.45	39.14	47.83	59.86	84.18	▃▇▆▅▂

skimr::skim(candy)


Name	candy
Number of rows	85
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Data summary

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
chocolate	1	0.44	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
fruity	1	0.45	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
caramel	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
peanutyalmondy	1	0.16	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
nougat	1	0.08	0.28	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
crispedricewafer	1	0.08	0.28	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
hard	1	0.18	0.38	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
bar	1	0.25	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
pluribus	1	0.52	0.50	0.00	0.00	1.00	1.00	1.00	▇▁▁▁▇
sugarpercent	1	0.48	0.28	0.01	0.22	0.47	0.73	0.99	▇▇▇▇▆
pricepercent	1	0.47	0.29	0.01	0.26	0.47	0.65	0.98	▇▇▇▇▆
winpercent	1	50.32	14.71	22.45	39.14	47.83	59.86	84.18	▃▇▆▅▂

Q6. Is there any variable/column that looks to be on a different scale tothe majority of the other columns in the dataset?

Q7. What do you think a zero and one represent for the candy$chocolate column?

That the candy does not contain chocolate.

Q8. Plot a histogram of winpercent values

library(ggplot2)

ggplot(candy) + 
  aes(winpercent) + 
  geom_histogram(bins=20) 

Q9. Is the distribution of winpercent values symmetrical?

library(ggplot2)

ggplot(candy) + 
  aes(winpercent) + 
  geom_density() 

Q10. Is the center of the distribution above or below 50%?

mean(candy$winpercent)

[1] 50.31676

summary(candy$winpercent)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.45   39.14   47.83   50.32   59.86   84.18 

Q11. On average is chocolate candy higher or lower ranked than fruit candy?

# 1. Find all chocolate candy in the dataset 
# 2. Find their winpercent values 
# 3. Calculate the mean of these values 

# 4-6. Do the same for fruity candy 
# 7. Compare mean winpercents of chocolate vs fruity 
# 8. Pick the highest as the winner 

choc.inds <- candy$chocolate==1
choc.win <- candy[choc.inds, ]$winpercent 
choc.mean <- mean(choc.win) 
choc.mean 

[1] 60.92153

mean(candy[candy$chocolate==1,]$winpercent)

[1] 60.92153

mean(candy[candy$fruity==1,]$winpercent)

[1] 44.11974

fruit.ind <- candy$fruity==1 
fruit.win <- candy[fruit.ind,]$winpercent 
fruit.mean <- mean(fruit.win) 
fruit.mean 

[1] 44.11974

candy |> 
  filter(chocolate==1) |> 
  select(winpercent) 

                            winpercent
100 Grand                     66.97173
3 Musketeers                  67.60294
Almond Joy                    50.34755
Baby Ruth                     56.91455
Charleston Chew               38.97504
Hershey's Kisses              55.37545
Hershey's Krackel             62.28448
Hershey's Milk Chocolate      56.49050
Hershey's Special Dark        59.23612
Junior Mints                  57.21925
Kit Kat                       76.76860
Peanut butter M&M's           71.46505
M&M's                         66.57458
Milk Duds                     55.06407
Milky Way                     73.09956
Milky Way Midnight            60.80070
Milky Way Simply Caramel      64.35334
Mounds                        47.82975
Mr Good Bar                   54.52645
Nestle Butterfinger           70.73564
Nestle Crunch                 66.47068
Peanut M&Ms                   69.48379
Reese's Miniatures            81.86626
Reese's Peanut Butter cup     84.18029
Reese's pieces                73.43499
Reese's stuffed with pieces   72.88790
Rolo                          65.71629
Sixlets                       34.72200
Nestle Smarties               37.88719
Snickers                      76.67378
Snickers Crisper              59.52925
Tootsie Pop                   48.98265
Tootsie Roll Juniors          43.06890
Tootsie Roll Midgies          45.73675
Tootsie Roll Snack Bars       49.65350
Twix                          81.64291
Whoppers                      49.52411

Q12. Is this difference statistically significant?

t.test(choc.win, fruit.win ) 

    Welch Two Sample t-test

data:  choc.win and fruit.win
t = 6.2582, df = 68.882, p-value = 2.871e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.44563 22.15795
sample estimates:
mean of x mean of y 
 60.92153  44.11974 

Overall Candy Rankings

Q13. What are the five least liked candy types in this set?

candy |>
  arrange(winpercent) |>
  head(5)

                   chocolate fruity caramel peanutyalmondy nougat
Nik L Nip                  0      1       0              0      0
Boston Baked Beans         0      0       0              1      0
Chiclets                   0      1       0              0      0
Super Bubble               0      1       0              0      0
Jawbusters                 0      1       0              0      0
                   crispedricewafer hard bar pluribus sugarpercent pricepercent
Nik L Nip                         0    0   0        1        0.197        0.976
Boston Baked Beans                0    0   0        1        0.313        0.511
Chiclets                          0    0   0        1        0.046        0.325
Super Bubble                      0    0   0        0        0.162        0.116
Jawbusters                        0    1   0        1        0.093        0.511
                   winpercent
Nik L Nip            22.44534
Boston Baked Beans   23.41782
Chiclets             24.52499
Super Bubble         27.30386
Jawbusters           28.12744

x <- c(5,1,10,4) 
#sort(x)
order(x) 

[1] 2 4 1 3

#(candy$winpercent)

ord.ind <- order( candy$winpercent ) 
head( candy[ord.ind,], 5) 

                   chocolate fruity caramel peanutyalmondy nougat
Nik L Nip                  0      1       0              0      0
Boston Baked Beans         0      0       0              1      0
Chiclets                   0      1       0              0      0
Super Bubble               0      1       0              0      0
Jawbusters                 0      1       0              0      0
                   crispedricewafer hard bar pluribus sugarpercent pricepercent
Nik L Nip                         0    0   0        1        0.197        0.976
Boston Baked Beans                0    0   0        1        0.313        0.511
Chiclets                          0    0   0        1        0.046        0.325
Super Bubble                      0    0   0        0        0.162        0.116
Jawbusters                        0    1   0        1        0.093        0.511
                   winpercent
Nik L Nip            22.44534
Boston Baked Beans   23.41782
Chiclets             24.52499
Super Bubble         27.30386
Jawbusters           28.12744

candy |> 
  arrange(-winpercent) |> 
  head(5) 

                          chocolate fruity caramel peanutyalmondy nougat
Reese's Peanut Butter cup         1      0       0              1      0
Reese's Miniatures                1      0       0              1      0
Twix                              1      0       1              0      0
Kit Kat                           1      0       0              0      0
Snickers                          1      0       1              1      1
                          crispedricewafer hard bar pluribus sugarpercent
Reese's Peanut Butter cup                0    0   0        0        0.720
Reese's Miniatures                       0    0   0        0        0.034
Twix                                     1    0   1        0        0.546
Kit Kat                                  1    0   1        0        0.313
Snickers                                 0    0   1        0        0.546
                          pricepercent winpercent
Reese's Peanut Butter cup        0.651   84.18029
Reese's Miniatures               0.279   81.86626
Twix                             0.906   81.64291
Kit Kat                          0.511   76.76860
Snickers                         0.651   76.67378

Q15. Make a first barplot of candy ranking based on winpercent values.

ggplot(candy) +
  aes(winpercent, rownames(candy)) + 
  geom_col()

Q16. This is quite ugly, use the reorder() function to get the bars sorted by winpercent?

ggplot(candy) + 
  aes(x=winpercent,
      y=reorder(rownames(candy), winpercent)) + 
  geom_col() 

Add some color based on the “type of candy”

my_cols <- rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] <- "chocolate"
my_cols[as.logical(candy$fruity)] <- "pink" 
my_cols[as.logical(candy$bar)] <- "purple" 
my_cols 

 [1] "purple"    "purple"    "black"     "black"     "pink"      "purple"   
 [7] "purple"    "black"     "black"     "pink"      "purple"    "pink"     
[13] "pink"      "pink"      "pink"      "pink"      "pink"      "pink"     
[19] "pink"      "black"     "pink"      "pink"      "chocolate" "purple"   
[25] "purple"    "purple"    "pink"      "chocolate" "purple"    "pink"     
[31] "pink"      "pink"      "chocolate" "chocolate" "pink"      "chocolate"
[37] "purple"    "purple"    "purple"    "purple"    "purple"    "pink"     
[43] "purple"    "purple"    "pink"      "pink"      "purple"    "chocolate"
[49] "black"     "pink"      "pink"      "chocolate" "chocolate" "chocolate"
[55] "chocolate" "pink"      "chocolate" "black"     "pink"      "chocolate"
[61] "pink"      "pink"      "chocolate" "pink"      "purple"    "purple"   
[67] "pink"      "pink"      "pink"      "pink"      "black"     "black"    
[73] "pink"      "pink"      "pink"      "chocolate" "chocolate" "purple"   
[79] "pink"      "purple"    "pink"      "pink"      "pink"      "black"    
[85] "chocolate"

ggplot(candy) + 
  aes(x=winpercent,
      y=reorder(rownames(candy), winpercent)) + 
  geom_col(fill=my_cols) 

Winpercent and Pricepercent

A plot with both variables/columns winpercent and pricepercent

my_cols[as.logical(candy$fruity)]  <- "red"


ggplot(candy) + 
  aes(x=winpercent, 
      y=pricepercent, 
      label=rownames(candy)) + 
  geom_point(color=my_cols) + 
  geom_text() 

library(ggrepel) 


ggplot(candy) + 
  aes(x=winpercent, 
      y=pricepercent, 
      label=rownames(candy)) + 
  geom_point(color=my_cols) + 
  geom_text_repel(max.overlaps = 7) 

Warning: ggrepel: 45 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Exploring the correlation structure

Now that we’ve explored the dataset a little, we’ll see how the variables interact with one another. We’ll use correlation and view the results with the corrplot package to plot a correlation matrix.

library(corrplot) 

corrplot 0.95 loaded

cij <- cor(candy)
corrplot(cij) 

Principal Component Analysis

The function to use is called prcomp() with an optional scale=T/F argument.

pca <- prcomp(candy, scale=TRUE) 
summary(pca)

Importance of components:
                          PC1    PC2    PC3     PC4    PC5     PC6     PC7
Standard deviation     2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
Cumulative Proportion  0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
                           PC8     PC9    PC10    PC11    PC12
Standard deviation     0.74530 0.67824 0.62349 0.43974 0.39760
Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
Cumulative Proportion  0.89998 0.93832 0.97071 0.98683 1.00000

Our main PCA result figure

ggplot(pca$x) + 
  aes(PC1, PC2, label=rownames(pca$x)) + 
  geom_point(col=my_cols)

  geom_text_repel(col=my_cols) 

geom_text_repel: parse = FALSE, na.rm = FALSE, box.padding = 0.25, point.padding = 1e-06, min.segment.length = 0.5, arrow = NULL, force = 1, force_pull = 1, max.time = 0.5, max.iter = 10000, max.overlaps = 10, nudge_x = 0, nudge_y = 0, xlim = c(NA, NA), ylim = c(NA, NA), direction = both, seed = NA, verbose = FALSE
stat_identity: na.rm = FALSE
position_identity 

We should also examine the variable “loadings” or contributions of the original variables to the new PCs

ggplot(pca$rotation) + 
  aes(PC1, rownames(pca$rotation)) + 
  geom_col() 

p <- ggplot(pca$x) + 
  aes(PC1, PC2, label=rownames(pca$x)) + 
  geom_point(col=my_cols) + 
  geom_text_repel(col=my_cols) 

Interactive plots that can be zoomed on and “brushed” over can be made wit the *ploty** package. It’s output is interactive and will not render to PDF

library(plotly) 

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

#plotly(p)

par(mar=c(8,4,2,2))
barplot(pca$rotation[,1], las=2, ylab="PC1 Contribution")