Nitty Gritties of Exploratory Data Analysis (EDA) in R

Set up

library(kableExtra)
library(tidyverse)
library(tvthemes)
library(ggthemes)
library(scales)
library(magrittr)

out_new<-vroom::vroom("movies.csv")
out_new |> 
  head(10) |> 
  kable(table.attr = "style = \"color: black;\"") |> 
  kable_styling(fixed_thead = T) |> 
  scroll_box(height = "400px")
| ...1 | release_date | movie | production_budget | domestic_gross | worldwide_gross | distributor | mpaa_rating | genre |
|---|---|---|---|---|---|---|---|---|
| 1 | 6/22/2007 | Evan Almighty | 1.75e+08 | 100289690 | 174131329 | Universal | PG | Comedy |
| 2 | 7/28/1995 | Waterworld | 1.75e+08 | 88246220 | 264246220 | Universal | PG-13 | Action |
| 3 | 5/12/2017 | King Arthur: Legend of the Sword | 1.75e+08 | 39175066 | 139950708 | Warner Bros. | PG-13 | Adventure |
| 4 | 12/25/2013 | 47 Ronin | 1.75e+08 | 38362475 | 151716815 | Universal | PG-13 | Action |
| 5 | 6/22/2018 | Jurassic World: Fallen Kingdom | 1.70e+08 | 416769345 | 1304866322 | Universal | PG-13 | Action |
| 6 | 8/1/2014 | Guardians of the Galaxy | 1.70e+08 | 333172112 | 771051335 | Walt Disney | PG-13 | Action |
| 7 | 5/7/2010 | Iron Man 2 | 1.70e+08 | 312433331 | 621156389 | Paramount Pictures | PG-13 | Action |
| 8 | 4/4/2014 | Captain America: The Winter Soldier | 1.70e+08 | 259746958 | 714401889 | Walt Disney | PG-13 | Action |
| 9 | 7/11/2014 | Dawn of the Planet of the Apes | 1.70e+08 | 208545589 | 710644566 | 20th Century Fox | PG-13 | Adventure |
| 10 | 11/10/2004 | The Polar Express | 1.70e+08 | 186493587 | 310634169 | Warner Bros. | G | Adventure |

First things first:

what variables do we have?

names(out_new)
#> [1] "...1"              "release_date"      "movie"            
#> [4] "production_budget" "domestic_gross"    "worldwide_gross"  
#> [7] "distributor"       "mpaa_rating"       "genre"

Secondly, what data types do we have?

map_dfr(out_new,class)
#> # A tibble: 1 x 9
#>   ...1    release_date movie    production_budget domestic_gross worldwide_gross
#>   <chr>   <chr>        <chr>    <chr>             <chr>          <chr>          
#> 1 numeric character    charact~ numeric           numeric        numeric        
#> # i 3 more variables: distributor <chr>, mpaa_rating <chr>, genre <chr>

This data was featured in the FiveThirtyEight article, "Scary Movies Are The Best Investment In Hollywood".

Now let's describe the dataset

| Header | Description |
|---|---|
| `release_date` | month-day-year |
| `movie` | Movie title |
| `production_budget` | Money spent to create the film |
| `domestic_gross` | Gross revenue from the USA |
| `worldwide_gross` | Gross worldwide revenue |
| `distributor` | The distribution company |
| `mpaa_rating` | Appropriate age rating by the US-based rating agency |
| `genre` | Film category |

Let's do a few touch-ups on the dataset:

  • Get rid of the blank `...1` variable.
  • Change release_date into an actual date.
  • Change character variables to factors.
  • Calculate the return on investment as worldwide_gross / production_budget.
  • Calculate the percentage of the worldwide gross that came from domestic revenue.
  • Get the year, month, and day out of the release date.
  • Remove rows where the revenue is $0 (unreleased movies, or data integrity problems), and remove rows missing information about the distributor. Go ahead and remove any rows where the rating is unavailable as well.

… but before that, let's skim a bit!

out_new |> skimr::skim()
Table 1: Data summary

| | |
|---|---|
| Name | out_new |
| Number of rows | 3401 |
| Number of columns | 9 |
| Column type frequency: | |
| character | 5 |
| numeric | 4 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| release_date | 0 | 1.00 | 8 | 10 | 0 | 1768 | 0 |
| movie | 0 | 1.00 | 1 | 35 | 0 | 3400 | 0 |
| distributor | 48 | 0.99 | 3 | 22 | 0 | 201 | 0 |
| mpaa_rating | 137 | 0.96 | 1 | 5 | 0 | 4 | 0 |
| genre | 0 | 1.00 | 5 | 9 | 0 | 5 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ...1 | 0 | 1 | 1701 | 981.93 | 1 | 851 | 1701 | 2551 | 3401 | ▇▇▇▇▇ |
| production_budget | 0 | 1 | 33284743 | 34892390.59 | 250000 | 9000000 | 20000000 | 45000000 | 175000000 | ▇▂▁▁▁ |
| domestic_gross | 0 | 1 | 45421793 | 58825660.56 | 0 | 6118683 | 25533818 | 60323786 | 474544677 | ▇▁▁▁▁ |
| worldwide_gross | 0 | 1 | 94115117 | 140918241.82 | 0 | 10618813 | 40159017 | 117615211 | 1304866322 | ▇▁▁▁▁ |

mov <- out_new |>
  select(-1) |>
  mutate(release_date = mdy(release_date)) |> # mdy() parses month-day-year dates
  mutate_if(is.character,as.factor) |> 
  mutate(roi = worldwide_gross / production_budget) |>
  mutate(pct_domestic = domestic_gross / worldwide_gross) |>
  mutate(year = year(release_date)) |> 
  mutate(month = month(release_date)) |> 
  mutate(day = as.factor(wday(release_date))) |> 
  arrange(desc(release_date)) |>
  filter(worldwide_gross > 0) |>
  filter(!is.na(distributor)) |>
  filter(!is.na(mpaa_rating))
mov
#> # A tibble: 3,202 x 13
#>    release_date movie           production_budget domestic_gross worldwide_gross
#>    <date>       <fct>                       <dbl>          <dbl>           <dbl>
#>  1 2018-10-12   First Man                60000000       30000050        55500050
#>  2 2018-10-12   Goosebumps 2: ~          35000000       28804812        39904812
#>  3 2018-10-05   Venom                   100000000      171125095       461825095
#>  4 2018-10-05   A Star is Born           36000000      126181246       200881246
#>  5 2018-09-28   Smallfoot                80000000       66361035       137161035
#>  6 2018-09-28   Night School             29000000       66906825        84406825
#>  7 2018-09-28   Hell Fest                 5500000       10751601        12527795
#>  8 2018-09-14   The Predator             88000000       50787159       127987159
#>  9 2018-09-14   White Boy Rick           30000000       23851700        23851700
#> 10 2018-08-17   Mile 22                  35000000       36108758        64708758
#> # i 3,192 more rows
#> # i 8 more variables: distributor <fct>, mpaa_rating <fct>, genre <fct>,
#> #   roi <dbl>, pct_domestic <dbl>, year <dbl>, month <dbl>, day <fct>
  • Fair enough, the date variable looks pretty good now!
  • Let us look at the distribution of the year variable.
ggplot(mov, aes(year)) + 
  geom_histogram(bins=40, fill=avatar_pal()(1))+
  theme_avatar()
  • There doesn't appear to be much documented before 1975, so let's restrict (read: filter) the dataset to movies made since 1975. Also, we're going to be doing some analyses by year, and the data for 2018 is still incomplete, so let's remove all of 2018. That is, keep anything produced in 1975 or later (>= 1975) but before 2018.

Filter to the years described above

mov<-mov |> 
  filter(year>= 1975 & year < 2018)

ggplot(mov, aes(year)) + 
  geom_histogram(bins=40, fill=avatar_pal()(1))+
  theme_avatar()+
  labs(title="distribution of year")
  • That looks better. We can also picture that by genre or rating.
ggplot(mov, aes(year)) + 
  geom_histogram(bins=40, fill=avatar_pal()(1))+
  theme_avatar()+
  facet_wrap(~genre,scales="free")+
  labs(title="distribution of year")
ggplot(mov, aes(year)) + 
  geom_histogram(bins=40, fill=avatar_pal()(1))+
  theme_avatar()+
  facet_wrap(~mpaa_rating,scales="free")+
  labs(title="distribution of year")

Days the movies were released

library(ggthemes)

mov |> 
  count(day, sort=TRUE) |> 
  ggplot(aes(y=n,x=fct_reorder(day,n),fill=day)) + 
  geom_col() + 
  labs(x="", y="Number of movies released", 
       title="Which days are movies released on?") + 
  theme_avatar() + scale_fill_avatar()
  • Most movies are released on a Friday (day 6 under lubridate's default week coding, which starts on Sunday); see the small sketch below for readable day labels.
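This is a hedged check, not part of the original analysis: recomputing the weekday with `label = TRUE` makes the Friday peak explicit without having to remember the numeric coding.

# Optional: recompute the weekday as a labelled factor (lubridate)
mov |> 
  mutate(day_name = wday(release_date, label = TRUE, abbr = FALSE)) |> 
  count(day_name, sort = TRUE)
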
library(scales)
mov |> 
  ggplot(aes(day, worldwide_gross,fill=day)) + 
  geom_boxplot() + 
  scale_y_log10(labels=dollar_format()) +
  labs(x="Release day",
       y="Worldwide gross revenue", 
       title="Does the day a movie is release affect revenue?") + 
  scale_fill_avatar()+
  theme_avatar()

Let us perform a statistical test for this

  • Does the mean gross differ?
mov |> 
  group_by(day) |> 
  summarise(average=mean(worldwide_gross),
            median = median(worldwide_gross),
            std.dev= sd(worldwide_gross))
#> # A tibble: 7 x 4
#>   day      average     median    std.dev
#>   <fct>      <dbl>      <dbl>      <dbl>
#> 1 1      70256412.  36674010   78293203.
#> 2 2     141521289.  50446776. 233127207.
#> 3 3     177233110.  80013623  228434437.
#> 4 4     130794183.  62076141  178563855.
#> 5 5     194466996. 121318930. 216828529.
#> 6 6      90769834.  41166033  131301918.
#> 7 7      89889497.  41486381  114314385.

Run an ANOVA test on this!

model<-aov(worldwide_gross~day,data=mov)
summary(model)
#>               Df    Sum Sq   Mean Sq F value   Pr(>F)    
#> day            6 1.178e+18 1.963e+17   9.961 6.44e-11 ***
#> Residuals   3110 6.129e+19 1.971e+16                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • \(H_0\) : there is no difference in means
  • \(H_1\) : means are different

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the difference in mean gross across release days is statistically significant.

what about month?

library(scales)
mov |> 
  ggplot(aes(factor(month), worldwide_gross,fill=factor(month))) + 
  geom_boxplot() + 
  scale_y_log10(labels=dollar_format()) +
  labs(x="Release month",
       y="Worldwide gross revenue", 
       title="Does the month a movie is release affect revenue?",
       fill="month") + 
  scale_fill_tableau()+
  theme_avatar()
mov |> 
  group_by(month) |> 
  summarize(rev=mean(worldwide_gross))
mov |> 
  mutate(month=factor(month, ordered=FALSE)) %$%
  lm(worldwide_gross~month) |> 
  summary()

What does the worldwide movie market look like by decade? Let’s first group by year and genre and compute the sum of the worldwide gross revenue. After we do that, let’s plot a barplot showing year on the x-axis and the sum of the revenue on the y-axis, where we’re passing the genre variable to the fill aesthetic of the bar.

mov |> 
  group_by(year, genre) |> 
  summarise(revenue=sum(worldwide_gross)) |> 
  ggplot(aes(year, revenue)) + 
  geom_col(aes(fill=genre)) + 
  scale_y_continuous(labels=dollar_format()) + 
  labs(x="", y="Worldwide revenue", title="Worldwide Film Market by Decade")+
  theme_avatar()+
  scale_fill_gravityFalls()

Which genres produce the highest return on investment?
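The chunk behind this comparison isn't shown above, so here is a minimal sketch of one way to check it: summarise the median ROI per genre and plot it.

mov |> 
  group_by(genre) |> 
  summarise(median_roi = median(roi)) |> 
  ggplot(aes(fct_reorder(genre, median_roi), median_roi, fill = genre)) +
  geom_col() +
  coord_flip() +
  scale_fill_avatar() +
  theme_avatar() +
  labs(x = "Genre", y = "Median return on investment",
       title = "Which genres produce the highest ROI?")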

  • Looks like horror and drama movies take the lead.

next up

Let’s make a scatter plot showing the worldwide gross revenue over the production budget and let us facet by genre.

mov |>
  ggplot(aes(production_budget, worldwide_gross)) +
  geom_point(aes(size = roi)) +
  geom_abline(slope = 1, intercept = 0, col = "red") +
  facet_wrap( ~ genre) +
  theme_avatar()+
  scale_x_log10(labels = dollar_format()) +
  scale_y_log10(labels = dollar_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Production Budget", 
       y = "Worldwide gross revenue", 
       size = "Return on Investment")

Generally, most of the points lie above the "breakeven" line. This is good: if movies weren't profitable, studios wouldn't keep making them. Proportionally, there seem to be many more large points in the Horror genre, indicative of a higher ROI.

Which are some of the most profitable movies?

mov |> 
  arrange(desc(roi)) |> 
  head(20) |> 
  mutate(movie=fct_reorder(movie, roi)) |>
  ggplot(aes(movie, roi)) +
  geom_col(aes(fill=genre)) + 
  scale_fill_avatar()+
  theme_avatar()+
  labs(x="Movie", 
       y="Return On Investment", 
       title="Top 20 most profitable movies") + 
  coord_flip() + 
  geom_text(aes(label=paste0(round(roi), "x "), hjust=1), col="white")

let’s look at movie ratings

R-rated movies have a lower average revenue, but their ROI isn't substantially less. We can see that while G-rated movies have the highest mean revenue, relatively few of them were produced, and they had a lower total revenue. There were more R-rated movies, but PG-13 movies really drove the total revenue worldwide.

mov |>
  group_by(mpaa_rating) |>
  summarize(
    meanrev = mean(worldwide_gross),
    totrev = sum(worldwide_gross),
    roi = mean(roi),
    number = n()
  )
#> # A tibble: 4 x 5
#>   mpaa_rating    meanrev       totrev   roi number
#>   <fct>            <dbl>        <dbl> <dbl>  <int>
#> 1 G           189913348   13863674404  4.42     73
#> 2 PG          147227422.  78324988428  4.64    532
#> 3 PG-13       113477939. 120173136920  3.06   1059
#> 4 R            63627931.  92451383780  4.42   1453

Are there fewer R-rated movies being produced? Not really. Let’s look at the overall number of movies with any particular rating faceted by genre.

mov |> 
  count(mpaa_rating, genre) |> 
  ggplot(aes(mpaa_rating, n,fill=mpaa_rating)) + 
  geom_col() + 
  theme_avatar()+
  scale_fill_avatar()+
  facet_wrap(~genre) +
  labs(x="MPAA Rating",
       y="Number of films", 
       title="Number of films by rating for each genre")

What about the distributions of ratings?

mov |> 
  ggplot(aes(worldwide_gross)) + 
  geom_histogram(fill=avatar_pal()(1)) + 
  facet_wrap(~mpaa_rating) +
  theme_avatar()+
  scale_x_log10(labels=dollar_format()) + 
  labs(x="Worldwide gross revenue", 
       y="Count",
       title="Distribution of revenue by genre")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
mov |> 
  ggplot(aes(mpaa_rating, worldwide_gross,fill=mpaa_rating)) + 
  scale_fill_avatar()+
  geom_boxplot() + 
  theme_avatar()+
  scale_y_log10(labels=dollar_format()) + 
  labs(x="MPAA Rating", y="Worldwide gross revenue", title="Revenue by rating")
  • Yes, on average G-rated movies look to perform better. But there aren't that many of them being produced, and they aren't bringing in the lion's share of revenue.
mov |> 
  count(mpaa_rating) |> 
  ggplot(aes(mpaa_rating, n,fill=mpaa_rating)) + 
  theme_avatar()+
  scale_fill_avatar()+
  geom_col() + 
  labs(x="MPAA Rating", 
       y="Count",
       title="Total number of movies produced for each rating")
mov |> 
  group_by(mpaa_rating) |> 
  summarize(total_revenue=sum(worldwide_gross)) |> 
  ggplot(aes(mpaa_rating, total_revenue ,fill=mpaa_rating)) + 
  geom_col() + 
  scale_fill_tableau()+
  theme_avatar()+
  scale_y_continuous(label=dollar_format()) + 
  labs(x="MPAA Rating", 
       y="Total worldwide revenue",
       title="Total worldwide revenue for each rating")
  • PG-13 seems to bring in the most revenue worldwide.

But wait, is there any association between genre and mpaa_rating?

# Create frequency table, save for reuse
ptable <- mov %>%           # Save table for reuse
  select(mpaa_rating, genre) %>%  # Variables for table
  table() %>%              # Create 2 x 2 table
  print()                  # Show table
#>            genre
#> mpaa_rating Action Adventure Comedy Drama Horror
#>       G          0        62      4     7      0
#>       PG        23       293     77   133      6
#>       PG-13    215        80    319   388     57
#>       R        268        14    356   621    194

CHI-SQUARED TEST

# Get chi-squared test for mpaa_rating and genre
ptable %>% chisq.test()
#> 
#> 	Pearson's Chi-squared test
#> 
#> data:  .
#> X-squared = 1343.7, df = 12, p-value < 2.2e-16
  • Great, the p-value is less than 0.05, hence we can conclude that genre and mpaa_rating are significantly associated.

Let us join the IMDB reviews dataset and get more insights

imdb <- read_csv("movies_imdb.csv")
head(imdb)

Let us inner join the two datasets

  • Do not worry, I will share another tutorial dedicated to performing joins; in the meantime, you can check one of my tutorials that compares SQL and R.
movimdb <- inner_join(mov, imdb, by="movie")
head(movimdb)

What's next?

let’s see some correlations here

Correlation

Correlation measures the strength and direction of association between two variables. There are three common correlation tests: the Pearson product-moment correlation (Pearson's r), Spearman's rank-order correlation (Spearman's rho), and Kendall's rank correlation (Kendall's tau).

Use Pearson's r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.

Spearman's rho and Kendall's tau are non-parametric measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson's does not apply.
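For reference (not shown in the original output), the `method` argument of `cor()` and `cor.test()` selects which coefficient is computed:

# Pearson is the default; Spearman and Kendall are the rank-based alternatives
cor(mov$production_budget, mov$worldwide_gross)                       # Pearson's r
cor(mov$production_budget, mov$worldwide_gross, method = "spearman")  # Spearman's rho
cor(mov$production_budget, mov$worldwide_gross, method = "kendall")   # Kendall's tau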

let’s correlate

df<- mov |> 
  select_if(is.numeric)

# Correlation matrix for data frame
df %>% cor()
#>                   production_budget domestic_gross worldwide_gross         roi
#> production_budget        1.00000000     0.56946432      0.64987504 -0.08666254
#> domestic_gross           0.56946432     1.00000000      0.92468680  0.18692435
#> worldwide_gross          0.64987504     0.92468680      1.00000000  0.16047814
#> roi                     -0.08666254     0.18692435      0.16047814  1.00000000
#> pct_domestic            -0.31400653    -0.17286869     -0.34792146 -0.05975148
#> year                     0.10558412    -0.06889791      0.03379454 -0.07545929
#> month                    0.02019088     0.03681648      0.02885615  0.02167627
#>                   pct_domestic        year       month
#> production_budget  -0.31400653  0.10558412  0.02019088
#> domestic_gross     -0.17286869 -0.06889791  0.03681648
#> worldwide_gross    -0.34792146  0.03379454  0.02885615
#> roi                -0.05975148 -0.07545929  0.02167627
#> pct_domestic        1.00000000 -0.32547167 -0.05291129
#> year               -0.32547167  1.00000000 -0.07293380
#> month              -0.05291129 -0.07293380  1.00000000

That looks a bit messy!

# Fewer decimal places
df %>%
  cor() %>%     # Compute correlations
  round(2) %>%  # Round to 2 decimals
  print()
#>                   production_budget domestic_gross worldwide_gross   roi
#> production_budget              1.00           0.57            0.65 -0.09
#> domestic_gross                 0.57           1.00            0.92  0.19
#> worldwide_gross                0.65           0.92            1.00  0.16
#> roi                           -0.09           0.19            0.16  1.00
#> pct_domestic                  -0.31          -0.17           -0.35 -0.06
#> year                           0.11          -0.07            0.03 -0.08
#> month                          0.02           0.04            0.03  0.02
#>                   pct_domestic  year month
#> production_budget        -0.31  0.11  0.02
#> domestic_gross           -0.17 -0.07  0.04
#> worldwide_gross          -0.35  0.03  0.03
#> roi                      -0.06 -0.08  0.02
#> pct_domestic              1.00 -0.33 -0.05
#> year                     -0.33  1.00 -0.07
#> month                    -0.05 -0.07  1.00

Visualize the correlation matrix with corrplot() from the corrplot package

library(corrplot)

df %>%
  cor() %>%
  corrplot(
    type   = "upper",     # Matrix: full, upper, or lower
    diag   = F,           # Remove diagonal
    order  = "original",  # Order for labels
    tl.col = "black",     # Font color
    tl.srt = 45           # Label angle
  )
  • Production budget, worldwide gross, and domestic gross all seem to be inter-correlated.
  • But is it significant?
# SINGLE CORRELATION #######################################

# Use cor.test() to test one pair of variables at a time.
# cor.test() gives r, the hypothesis test, and the
# confidence interval. This command uses the "exposition
# pipe," %$%, from magrittr, which passes the columns from
# the data frame (and not the data frame itself)

df %$% cor.test(production_budget,worldwide_gross)
#> 
#> 	Pearson's product-moment correlation
#> 
#> data:  production_budget and worldwide_gross
#> t = 47.722, df = 3115, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.6291207 0.6697034
#> sample estimates:
#>      cor 
#> 0.649875
  • Of course, yes: the correlation is statistically significant.

  • Separately for each MPAA rating, I will display the mean IMDB rating and the mean number of votes cast (the summarising chunk is sketched below).
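This is a hedged sketch of that summary, assuming the joined data holds the IMDB rating in `imdb` (the column used in the ANOVA further down) and the vote count in a column named `votes` (an assumption based on the `meanvotes` header):

movimdb |> 
  group_by(mpaa_rating) |> 
  summarise(meanimdb = mean(imdb),
            meanvotes = mean(votes))  # `votes` is an assumed column name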

#> # A tibble: 4 x 3
#>   mpaa_rating meanimdb meanvotes
#>   <fct>          <dbl>     <dbl>
#> 1 G               6.54   132015.
#> 2 PG              6.31    81841.
#> 3 PG-13           6.25   102740.
#> 4 R               6.58   107575.

let’s try to visualise the above results using boxplots and compare means
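A hedged sketch of that comparison (this chunk is not from the original output): boxplots of the IMDB rating by MPAA rating, with the group means overlaid.

movimdb |> 
  ggplot(aes(mpaa_rating, imdb, fill = mpaa_rating)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +  # overlay the means
  scale_fill_avatar() +
  theme_avatar() +
  labs(x = "MPAA Rating", y = "IMDB rating",
       title = "IMDB ratings by MPAA rating")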

  • As seen from the means, the mean ratings look quite similar here; let's run an ANOVA.
model<-aov(imdb~mpaa_rating,data=movimdb)
summary(model)
#>               Df Sum Sq Mean Sq F value   Pr(>F)    
#> mpaa_rating    3   63.9  21.314   20.57 3.57e-13 ***
#> Residuals   2587 2680.0   1.036                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

comments

  • The p-value is less than 0.05, hence there seem to be significant mean differences here.

but which ratings actually differ?

  • Let's run a post-hoc analysis.
model_tukey<-TukeyHSD(model)
model_tukey
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = imdb ~ mpaa_rating, data = movimdb)
#> 
#> $mpaa_rating
#>                 diff        lwr        upr     p adj
#> PG-G     -0.22557339 -0.5969917 0.14584492 0.4011330
#> PG-13-G  -0.29020880 -0.6507315 0.07031389 0.1634754
#> R-G       0.04403339 -0.3135888 0.40165555 0.9890254
#> PG-13-PG -0.06463541 -0.2176996 0.08842880 0.6984264
#> R-PG      0.26960678  0.1235053 0.41570829 0.0000131
#> R-PG-13   0.33424219  0.2186105 0.44987391 0.0000000
  • The output suggests that the pairs R vs PG and R vs PG-13 have statistically significant mean differences.
  • We can also visualise these results below.
plot(model_tukey)

let’s do it for the genre as well

  • We can now repeat the same analysis using the genre variable (a sketch of the summary follows).
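As before, the summarising chunk isn't shown; a sketch, with the same assumed `votes` column, would be:

movimdb |> 
  group_by(genre) |> 
  summarise(meanimdb = mean(imdb),
            meanvotes = mean(votes))  # `votes` assumed, as above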
#> # A tibble: 5 x 3
#>   genre     meanimdb meanvotes
#>   <fct>        <dbl>     <dbl>
#> 1 Action        6.28   154681.
#> 2 Adventure     6.27   130027.
#> 3 Comedy        6.08    71288.
#> 4 Drama         6.88    91101.
#> 5 Horror        5.90    89890.
model<-aov(imdb~genre,data=movimdb)
summary(model)
#>               Df Sum Sq Mean Sq F value Pr(>F)    
#> genre          4  352.3   88.07   95.22 <2e-16 ***
#> Residuals   2586 2391.6    0.92                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

comments

  • The p-value is less than 0.05, hence there seem to be significant mean differences here.

but which genres actually differ?

  • Let's run a post-hoc analysis.
model_tukey<-TukeyHSD(model)
model_tukey
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = imdb ~ genre, data = movimdb)
#> 
#> $genre
#>                         diff        lwr         upr     p adj
#> Adventure-Action -0.01295985 -0.2012072  0.17528750 0.9997218
#> Comedy-Action    -0.20902492 -0.3729487 -0.04510117 0.0046218
#> Drama-Action      0.59576901  0.4438527  0.74768536 0.0000000
#> Horror-Action    -0.38546653 -0.6048229 -0.16611019 0.0000168
#> Comedy-Adventure -0.19606507 -0.3706441 -0.02148608 0.0186456
#> Drama-Adventure   0.60872886  0.4453722  0.77208554 0.0000000
#> Horror-Adventure -0.37250669 -0.5999359 -0.14507750 0.0000795
#> Drama-Comedy      0.80479393  0.6701859  0.93940201 0.0000000
#> Horror-Comedy    -0.17644161 -0.3841866  0.03130334 0.1392985
#> Horror-Drama     -0.98123555 -1.1796431 -0.78282803 0.0000000
  • Looking at the p adj column, we note that most pairs have statistically significant differences in mean rating (p adj < 0.05).

IMDB Ratings by Genre and MPAA Rating

To explore further, let's plot ratings by genre with a fill of MPAA rating as well, as sketched below.
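A minimal sketch of that plot (not from the original output):

movimdb |> 
  ggplot(aes(genre, imdb, fill = mpaa_rating)) +
  geom_boxplot() +
  scale_fill_avatar() +
  theme_avatar() +
  labs(x = "Genre", y = "IMDB rating", fill = "MPAA Rating",
       title = "IMDB ratings by genre and MPAA rating")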

How does rating compare with worldwide gross?

  • Let's create a scatter plot of worldwide gross revenue by IMDB rating, with the gross revenue on a log scale (sketched below).
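A sketch of that scatter plot, under the same column-name assumptions:

movimdb |> 
  ggplot(aes(imdb, worldwide_gross)) +
  geom_point(alpha = 0.5) +
  scale_y_log10(labels = dollar_format()) +
  theme_avatar() +
  labs(x = "IMDB rating", y = "Worldwide gross revenue",
       title = "Worldwide gross revenue by IMDB rating")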

How do ROI and ratings compare?
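And a sketch for ROI against the IMDB rating (again, not from the original output):

movimdb |> 
  ggplot(aes(imdb, roi)) +
  geom_point(alpha = 0.5) +
  scale_y_log10() +
  theme_avatar() +
  labs(x = "IMDB rating", y = "Return on investment (log scale)",
       title = "ROI by IMDB rating")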
