A while ago, I analyzed the World Penis Data, because let’s face it, the inclined data enthusiast can’t resist a dataset like this, which I found ages ago somewhere on reddit. In the meantime the original page went offline, so there’s no other source of the original data as far as I know. Fortunately I saved the old dataset, which now lives on in our tadaadata/loldata R-package and on this project on data.world, so if you want to play around with the data yourself, you’re all set.

Anyway, long story short: I don’t like the look of my old analysis, because that was done in a time before I bothered with ggplot2-prettification. So now we try this again.

Disclaimer

Since a part of this analysis has gone slightly viral and many people seem to be misinterpreting the visualizations, we’d like to re-emphasize that everything here is based on a collection of data we did not collect ourselves, also we do not have access to individual measurements, because our source data is only a collection of average measurements and the dataset only contains one observation per country. Also, some of these studies are based on measured and some on self-reported data. There is no “measured vs. self-reported” data-point. We only compared the averages of studies with each method to each other. Again, the original data we have can be viewed further down this page and is taken from http://www.everyoneweb.com/worldpenissize.

Also, we do not claim that this is a rigorous scientific study or something. There’s already research to look at.

A Look at the Data Sources

First up, let’s see where the data is from in the first place.

loldata::penis %>%
  group_by(Region) %>%
  tally() %>%
  ggplot(aes(x = reorder(Region, n), y = n)) +
  geom_col(alpha = .7) +
  coord_flip() +
  labs(title = "World Penis Data", 
       subtitle = "Number of Studies in Dataset by Region",
       x = "", y = "Number of Studies", caption = plot_caption)

loldata::penis %>%
  group_by(Region) %>%
  summarize(n = sum(N)) %>%
  ggplot(aes(x = reorder(Region, n), y = n)) +
  geom_col(alpha = .7) +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 10^5, 5000), 
                     minor_breaks = seq(0, 10^5, 1000)) +
  labs(title = "World Penis Data", 
       subtitle = "Total N of Studies by Region",
       x = "", y = "Total N", caption = plot_caption)

As we can see, our data seems to be most reliable for Europe, South America, multiple Asias and Africa.
Oddly enough, we have relatively little data on North America, and our data on the Pacific Islands and Central Asia should probably be taken with a grain of salt. However, I have no idea what an appropriate sample size for data like this would be, so I’m mainly eyeballing and trying to sound knowledgable here.

Let’s look at the N of the individual datapoints (because every row in our dataset represents the data of a specific study). We’ll first look at the overall distribution, and then look at studies with a reported N smaller than 200, just to see how we’re doing in that area.

loldata::penis %>%
  ggplot(data = ., aes(x = N)) +
  geom_histogram(binwidth = 100, alpha = .7) +
  scale_x_continuous(breaks = seq(0, 20000, 1000),
                     minor_breaks = seq(0, 20000, 500)) +
  labs(title = "World Penis Data", subtitle = "Reported N per Study",
       x = "Reported N", y = "Frequency", caption = plot_caption)

loldata::penis %>%
  filter(N <= 200) %>%
  ggplot(data = ., aes(x = N)) +
  geom_histogram(binwidth = 5, alpha = .7) +
  scale_x_continuous(breaks = seq(0, 200, 10),
                     minor_breaks = seq(0, 200, 5)) +
  labs(title = "World Penis Data", 
       subtitle = expression(~Reported ~N ~per ~Study ~(N<=200)),
       x = "Reported N", y = "Frequency", caption = plot_caption)

Seems like we have a few data points based on fairly small samples.
For the remainder of the analysis, we’ll be excluding these data points with an N smaller than 50, because that’s the arbitrary judgement of precision I just came up with.

The Raw Data

If you’re curious about the data itself, here’s an embedded version of the dataset.
We arranged the data into two datasets, one wide and one long, but let’s just look at the difference to explain that.

penis %>%
  select(-Method, -N, -Source, everything(), Method, N, Source) %>%
  datatable(style = "bootstrap", rownames = F, 
            fillContainer = T, 
            height = 100,
            options = list(
              bInfo = F,
              paging = F,
              autoWidth = TRUE,
              sDom  = '<"top">lrt<"bottom">ip'), 
            caption = "Penis Data (wide format)") %>%
  formatRound(columns = c(4, 6:11), digits = 1)

As you can see, with have some more or less detailed geographical information, various measurements (and calculated measurements, as volume is calculated by length and circumference, assuming a perfect cylinder, and the growth factor, which is simply erect divided by flaccid), details on the method the data was obtained (either self-reported or measured, and yes, we’ll look at that in detail), and details of the data source (source study, N).

We also have two variables for both flaccid and erect states, which is fine for comparing both states, but sometimes it can be inconvenient.
That’s why we data-wrangled the above dataset in a long format, which excludes the growth_ variables but adds a state variable:

penis_long %>%
  select(-Method, -N, -Source, everything(), Method, N, Source) %>%
  datatable(style = "bootstrap", rownames = F, 
            fillContainer = T, 
            height = 100,
            options = list(
              bInfo = F,
              paging = F,
              autoWidth = TRUE,
              sDom  = '<"top">lrt<"bottom">ip'), 
            caption = "Penis Data (long format)") %>%
  formatRound(columns = 3:5, digits = 1)

Some Basic Descriptives

All units in metric (either \(cm\) or \(cm^3\)).

penis %>% 
  gather(Variable, value, contains("length"), 
         contains("circumf"), contains("volume"), contains("growth")) %>%
  group_by(Variable) %>%
  summarize(N      = n(),
            min    = min(value),
            q1     = quantile(value, .25),
            median = median(value),
            mean   = mean(value),
            q3     = quantile(value, .75),
            max    = max(value),
            sd     = sd(value)) %>%
  datatable(style = "bootstrap", rownames = F, 
            options = list(
              bInfo = F,
              paging = F,
              sDom  = '<"top">lrt<"bottom">ip'), 
            caption = "Penis Data: Descriptives of Numerical Variables") %>%
  formatRound(columns = 3:9, digits = 2)

So this is what the data looks like.
As already mentioned, this is the original source where we got the data from, in case anyone asks… again.

The Data, for your Convenience

And lastly, here’s a full frame version of the dataset, searchable, filterable and everything:

penis %>%
  select(-Method, -N, -Source, everything(), Method, N, Source) %>%
  datatable(style = "bootstrap", rownames = F, 
            options = list(height = 200),
            caption = "Penis Data (wide format, full frame)") %>%
  formatRound(columns = c(4, 6:11), digits = 2)