It seems fairly pointless to plot the length of both flaccid and erect states combined, but for completion’s sake, let’s just get it over with.

penis %>% 
  gather(key = state, value = length, length_flaccid, length_erect) %>%
  mutate(state = factor(state, 
                        levels = c("length_flaccid", "length_erect"), 
                        labels =  c("Flaccid", "Erect"))) %>%
  {
    ggplot(., aes(x = length)) +
    geom_histogram(binwidth = .5, alpha = .7, position = "dodge") +
    scale_x_continuous(breaks = seq(0, 100, 2), 
                       minor_breaks = seq(0, 100, .5),
                       labels = label_cm,
                       sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
    labs(title = "World Penis Data", subtitle = "Length",
         x = "Length", y = "Frequency", 
         fill = "State", color = "State",
         caption = plot_caption) +
    theme(legend.position = "top")
  }

By State

penis %>% 
  gather(key = state, value = length, length_flaccid, length_erect) %>%
  mutate(state = factor(state, 
                        levels = c("length_flaccid", "length_erect"), 
                        labels =  c("Flaccid", "Erect"))) %>%
  {
    ggplot(., aes(x = length, fill = state, color = state)) +
    geom_histogram(binwidth = .5, alpha = .7, position = "dodge") +
    scale_x_continuous(breaks = seq(0, 100, 2), 
                       minor_breaks = seq(0, 100, .5),
                       labels = label_cm,
                       sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
    scale_fill_brewer(palette = "Paired") +
    scale_color_brewer(palette = "Paired") +
    labs(title = "World Penis Data", subtitle = "Length by State",
         x = "Length", y = "Frequency", 
         fill = "State", color = "State",
         caption = plot_caption) +
    theme(legend.position = "top")
  }

By State and Region

penis %>% 
  gather(key = state, value = length, length_flaccid, length_erect) %>%
  mutate(state = factor(state, 
                        levels = c("length_flaccid", "length_erect"), 
                        labels =  c("Flaccid", "Erect"))) %>%
  group_by(Region) %>%
  mutate(order_length = mean(length)) %>%
  {
    ggplot(., aes(x = reorder(Region, order_length), y = length, 
                  fill = state, color = state)) +
    geom_boxplot(alpha = .7) +
    coord_flip() +
    scale_y_continuous(breaks = seq(0, 100, 2), 
                       minor_breaks = seq(0, 100, .5),
                       labels = label_cm,
                       sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
    scale_fill_brewer(palette = "Paired") +
    scale_color_brewer(palette = "Paired") +
    labs(title = "World Penis Data", subtitle = "Length by State",
         x = "", y = "Length", fill = "State", color = "State",
         caption = plot_caption) +
    theme(axis.text.y = element_text(size = rel(1.2)),
          legend.position = "top")
  }

# CIs
penis_long %>%
  group_by(Region) %>%
  mutate(order_length = mean(length)) %>%
  {
    ggplot(data = ., aes(x = reorder(Region, order_length), y = length, color = state)) +
    stat_summary(fun.data = mean_cl_normal, geom = "errorbar") +
    stat_summary(fun.y = mean, geom = "point", size = 2) +
    coord_flip() +
    scale_y_continuous(breaks = seq(0, 100, 2), 
                       minor_breaks = seq(0, 100, .5),
                       labels = label_cm,
                       sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
    scale_color_brewer(palette = "Paired") +
    labs(title = "World Penis Data", subtitle = "Length by State (mean with 95% CI)",
         x = "", y = "Length", fill = "State", color = "State",
         caption = plot_caption) +
    theme(axis.text.y = element_text(size = rel(1.2)),
          legend.position = "top")
  }

Growth Factor

I calculated the length-wise growth factor based on \(\frac{\text{erect length}}{\text{flaccid length}}\), which reduces the relationship of erect and flaccid length into a single variable.
Let’s look what that’s all about.

ggplot(data = penis, aes(x = growth_length)) +
  geom_histogram(binwidth = .05, alpha = .7) +
  labs(title = "World Penis Data", subtitle = "Length-wise growth factor",
         x = "Growth Factor", y = "Frequency",
         caption = plot_caption)

Well that seems like it justifies further investigation.

By Region

ggplot(data = penis, aes(x = reorder(Region, growth_length), y = growth_length)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar") +
  stat_summary(fun.y = mean, geom = "point", size = 2) +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 2, .05)) +
  labs(title = "World Penis Data", 
       subtitle = "Length-wise growth factor by region (mean with 95% CI)",
       x = "", y = "Growth Factor",
       caption = plot_caption)

Erect vs. Flaccid

Seems fairly self-explanatory to suspect a relationship there, so let’s see.

ggplot(data = penis, aes(x = length_flaccid, y = length_erect)) +
  geom_smooth(method = lm, se = F, color = "gray") +
  geom_point(size = 2) +
  scale_x_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
  scale_y_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
  labs(title = "World Penis Data",
       subtitle = "Flaccid by Erect Length",
       x = "Flaccid Length", y = "Erect Length",
       caption = plot_caption)

Well that looks like a pretty decent linear relationship with some other effect we haven’t identified yet.
Since there’s only so much data in this dataset, the only thing we can really do is to look at the effect of different regions.

Averaged by Region

penis %>% 
  group_by(Region) %>%
  summarize(length_flaccid = mean(length_flaccid),
            length_erect   = mean(length_erect)) %>%
  ggplot(data = ., aes(x = length_flaccid, y = length_erect, color = Region)) +
  geom_smooth(method = lm, se = F, color = "gray") +
  geom_point(size = 3, color = "black") +
  geom_point(size = 2) +
  scale_x_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
  scale_y_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
  scale_color_brewer(palette = "Paired") +
  labs(title = "World Penis Data",
       subtitle = "Flaccid by Erect Length (averaged by region)",
       x = "Flaccid Length", y = "Erect Length",
       caption = plot_caption)

That seems… reasonably informative. Let’s to that again without the averaging though.

by Region

ggplot(data = penis, aes(x = length_flaccid, y = length_erect, color = Region)) +
  geom_smooth(method = lm, se = F, color = "gray") +
  geom_point(size = 3, color = "black") +
  geom_point(size = 2) +
  scale_x_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
  scale_y_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54, 
                                           labels = label_in)) +
  scale_color_brewer(palette = "Paired") +
  labs(title = "World Penis Data",
       subtitle = "Flaccid by Erect Length",
       x = "Flaccid Length", y = "Erect Length",
       caption = plot_caption)

Correlation

Well, we’ve seen enough of that I guess.
Let’s do some math:

model <- lm(length_erect ~ length_flaccid, data = penis)

tab1 <- sjPlot::sjt.lm(model, 
                       pred.labels = c("Length (Flaccid)"), 
                       depvar.labels = c("Length (Erect)"), 
                       show.se = T, 
                       use.viewer = F, no.output = T)
    Length (Erect)
    B CI std. Error p
(Intercept)   0.44 -0.88 – 1.77 0.67 .509
Length (Flaccid)   1.41 1.27 – 1.55 0.07 <.001
Observations   139
R2 / adj. R2   .747 / .745


So… yeah. There’s a perfectly valid linear relationship with a nice big \(R^2\) and it basically tells us that the expected size of an erect penis will be an increase of roughly 1.4 cm of the flaccid penis.
Seems… realistic, I guess?

Lastly, here’s a version of the above plot for you to get your hands on:

text <- ~paste0("<b>", Region, "</b>",
               "<br /><b>Country:</b> ", Country,
               "<br /><b>N</b> = ", N,
               "<br /><b>Source:</b> ", Source)

plot_ly(data = penis, x = ~length_flaccid, y = ~length_erect, colors = "Paired",
        width = 768) %>%
  add_markers(color = I("black"), size = I(12), hoverinfo = "none") %>%
  add_markers(color = ~Region, hoverinfo = "text", size = I(10),
              text = text) %>%
  layout(title = "World Penis Data:<br />Erect by Flaccid Length",
         xaxis = list(title = "Flaccid Length (cm)"), 
         yaxis = list(title = "Erect Length (cm)"),
         paper_bgcolor = "#ffffff", plot_bgcolor = "#ffffff", hovermode = "closest",
         showlegend = F,
         autosize = T)

Method

Probably the most interesting aspect of our data is the column about how the data was obtained, i.e. if it was self-reported or measured.
The hypothesis in the back of our heads is probably pretty obvious, right? Self-reported values must be slightly bigger than measured values.
Let’s see if that’s true.

penis_means <- penis_long %>%
  group_by(Method) %>%
  summarize(mean = mean(length))


ggplot(penis_long, aes(x = length, fill = Method, color = Method)) +
  geom_density(aes(y = ..count..), alpha = .3) +
  geom_histogram(binwidth = .5, alpha = .8, position = "dodge") +
  geom_vline(data = penis_means, aes(xintercept = mean, color = Method),
             linetype = "dotted", size = 1.5) +
  scale_x_continuous(breaks = seq(0, 100, 2),
                     minor_breaks = seq(0, 100, .5),
                     labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                         labels = label_in)) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "World Penis Data", subtitle = "Length by Method",
       x = "Length", y = "Frequency",
       fill = "Method", color = "Method",
       caption = plot_caption) +
  theme(legend.position = "top")

Method by State

penis_means <- penis_long %>%
  group_by(Method, state) %>%
  summarize(mean = mean(length))

ggplot(penis_long, aes(x = length, fill = Method, color = Method)) +
  geom_density(aes(y = ..count..), alpha = .3) +
  geom_histogram(binwidth = .5, alpha = .8, position = "dodge") +
  geom_vline(data = penis_means, aes(xintercept = mean, color = Method),
             linetype = "dotted", size = 1.5) +
  facet_wrap(~state, scales = "free_x") +
  scale_x_continuous(breaks = seq(0, 100, 2),
                     minor_breaks = seq(0, 100, .5),
                     labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                         labels = label_in)) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "World Penis Data", subtitle = "Length by Method and State",
       x = "Length", y = "Frequency",
       fill = "Method", color = "Method",
       caption = plot_caption) +
  theme(legend.position = "top")

Scatterplot

ggplot(penis, aes(x = length_flaccid, y = length_erect, color = Method)) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(0, 100, 1),
                     minor_breaks = seq(0, 100, .5),
                     labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                         labels = label_in)) +
  scale_y_continuous(breaks = seq(0, 100, 2),
                     minor_breaks = seq(0, 100, .5),
                     labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                         labels = label_in)) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "World Penis Data", subtitle = "Length by Method",
       x = "Flaccid Length", y = "Erect Length",
       fill = "Method", color = "Method",
       caption = plot_caption) +
  theme(legend.position = "top")

Method by Region and State

ggplot(data = penis_long, aes(x = reorder(Region, length), y = length, color = Method)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar") +
  stat_summary(fun.y = mean, geom = "point", size = 1) +
  coord_flip() +
  facet_wrap(~state, ncol = 2, scales = "free_x") +
  scale_y_continuous(breaks = seq(0, 100, 2),
                     minor_breaks = seq(0, 100, .5),
                     labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                         labels = label_in)) +
  scale_color_brewer(palette = "Set2") +
  labs(title = "World Penis Data", subtitle = "Length by Method, Region & State",
       x = "", y = "Length",
       fill = "Method", color = "Method",
       caption = plot_caption) +
  theme(legend.position = "top")

Method: Tests

t-Test and lulz

So all that seems to point in the direction we expected. Let’s do a t-test for good measure.
We’ll use length (both states) and method as variables.

tadaa_t.test(penis_long, length, Method, print = "markdown")
Measured Self reported t p df conf_low conf_high method alternative d power
11.296 12.344 -3.254 < 0.01 276 -1.682 -0.414 Two Sample t-test two.sided 0.405 0.9



Let’s also try an ANOVA of length on method and state, just for giggles:

tadaa_aov(length ~ Method + state, data = penis_long, print = "markdown")
term df sumsq meansq F p.value part.eta.sq cohens.f
Method 1 70.914 70.914 36.646 < 0.001 0.118 0.365
state 1 1316.289 1316.289 680.21 < 0.001 0.712 1.573
Residuals 275 532.159 1.935 NA NA NA NA



So… yeah. Erect vs. flaccid obviously has the biggest effect, but the method definitely plays a role there.

Also, here’s an interaction plot of the means, just because.

penis_means <- penis_long %>%
  group_by(Method, state) %>%
  summarize(mean = mean(length))

ggplot(data = penis_means, aes(x = Method, y = mean, color = state)) +
  geom_point(size = 2) +
  geom_line(aes(group = state), size = 1.5) +
  scale_color_brewer(palette = "Paired") +
  scale_y_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                        labels = label_in)) +
  labs(title = "World Penis Data", subtitle = "Length by Method & State",
       x = "", y = expression(paste("Length (", bar(x), ")")),
       color = "State",
       caption = plot_caption) +
  theme(legend.position = "top")

ggplot(data = penis_means, aes(x = state, y = mean, color = Method)) +
  geom_point(size = 2) +
  geom_line(aes(group = Method), size = 1.5) +
  scale_color_brewer(palette = "Set2") +
  scale_y_continuous(labels = label_cm,
                     sec.axis = sec_axis(trans = ~./2.54,
                                        labels = label_in)) +
  labs(title = "World Penis Data", subtitle = "Length by State & Method",
       x = "", y = expression(paste("Length (", bar(x), ")")),
       color = "State",
       caption = plot_caption) +
  theme(legend.position = "top")

Linear Relationship

So now that we know the method of measurement has a sizeable effect on the reported penis length, how do we know the method is a good predictor? By having a look at the descriptives from earlier we can see quite a few huge differences regarding the regions. Let’s start with looking at the linear relationship between erect penis length and method:

lm1 <- lm(length_erect ~ Method, penis)

sj.m1 <- sjPlot::sjt.lm(lm1, depvar.labels = "Penis length (erect)",
                        show.se = T, use.viewer = F, no.output = T)
    Penis length (erect)
    B CI std. Error p
(Intercept)   13.36 13.01 – 13.71 0.18 <.001
Method (Self reported)   1.34 0.76 – 1.92 0.29 <.001
Observations   139
R2 / adj. R2   .133 / .127


A meager \(R^2\) of just 0.13 ain’t too much, so how do we do when also looking at the regions?

lm2 <- lm(length_erect ~ Method + Region, penis)

sj.m2 <- sjPlot::sjt.lm(lm2, depvar.labels = "Penis length (erect)",
                        show.se = T, use.viewer = F, no.output = T)
    Penis length (erect)
    B CI std. Error p
(Intercept)   14.94 14.42 – 15.47 0.26 <.001
Method (Self reported)   0.68 0.25 – 1.10 0.21 .002
Region
Asia   -3.40 -4.19 – -2.60 0.40 <.001
Australia   0.13 -1.54 – 1.79 0.84 .881
Central America/Caribean   -0.42 -1.28 – 0.43 0.43 .330
Central Asia   -1.71 -3.36 – -0.05 0.83 .043
Europe   -1.07 -1.66 – -0.47 0.30 <.001
North America   -1.27 -2.49 – -0.06 0.61 .040
Pacific Islands   -1.68 -2.73 – -0.63 0.53 .002
South America   0.21 -0.60 – 1.01 0.41 .613
South Asia   -3.46 -4.58 – -2.33 0.57 <.001
Southeast Asia   -4.43 -5.41 – -3.44 0.50 <.001
Western Asia   -1.90 -2.64 – -1.16 0.37 <.001
Observations   139
R2 / adj. R2   .633 / .598


An \(R^2\) of whooping 0.6! Now that’s something to predict with. What this tells us is the following: if you want get an estimate of a man’s junk size, you shouldn’t hand him a measuring tape but just ask where he’s from. Except for Australia, Central and South America. Or something along those lines. Maybe.

Analysis Of Variance

“But dear Mr. Analyst”, you might say, “how on earth are you going to justify the assumption of a linear relationship?” And you’re absolutely right, maybe that’s not quite the way to go. Maybe we should just do an ANOVA and look at the effect sizes:

tadaa_aov(length_erect ~ Method + Region, data = penis, print = "markdown")
term df sumsq meansq F p.value part.eta.sq cohens.f
Method 1 58.312 58.312 45.648 < 0.001 0.266 0.602
Region 11 219.346 19.941 15.61 < 0.001 0.577 1.167
Residuals 126 160.957 1.277 NA NA NA NA



And there you go: while the method still makes a significant difference, the effect of the region seems way more important.

Choropleth-Penismap

A worldmapwise look at penile length distribution.

highchart() %>% 
  hc_add_series_map(worldgeojson, map, value = "length_erect", joinBy = "iso3",
                    name = "Penis Length (erect)") %>% 
  hc_title(text = "Erect Penis Length by Country") %>% 
  hc_tooltip(valueDecimals = 2, valuePrefix = "<b>", valueSuffix = " cm</b>", 
             useHTML = TRUE) %>% 
  hc_colorAxis(stops = idk, min = 10, max = 17.5) %>% 
  hc_legend(enabled = TRUE) %>% 
  hc_mapNavigation(enabled = TRUE)