Introduction and Objective

Which video game sold the most copies, which genres are the most popular, and which platform has sold the most? This dataset helps answer such questions. It is a collection of video games that each sold more than 100,000 copies, released from 1980 through 2020. For each game it records the copies sold in each region and globally (in millions), the release year, and the genre. The data contains more than 15,000 observations and 10 variables.

Dataset: https://www.kaggle.com/datasets/gregorut/videogamesales

| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | Field 9 | Field 10 |
|---|---|---|---|---|---|---|---|---|---|
| Name | Platform | Year | Genre | Publisher | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |

Importing the downloaded dataset

library(RColorBrewer) # used to make colors

vgsales = read.csv("G:\\My Drive\\Purdue\\Semester 7-8 [Y4]\\07_Statistical Computing\\stats data\\midterm\\vgsales.csv", header = TRUE, na.strings = "N/A")  # treat the text "N/A" as missing
vgsales = vgsales[,-1]      # removes rank
vgsales = na.omit(vgsales)  # remove any NA
head(vgsales)
##                       Name Platform Year        Genre Publisher NA_Sales
## 1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26
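The role of na.strings = "N/A" in the import can be seen on a tiny inline sample. The sketch below is only an illustration: the rows are made up (they are not taken from vgsales.csv), only some of the columns are included, and the names csv_text, raw, and demo are hypothetical.

```r
# Hypothetical mini-file with a layout similar to vgsales.csv (rows are made up)
csv_text <- "Rank,Name,Platform,Year,Genre,Publisher,Global_Sales
1,Game A,Wii,2006,Sports,Pub X,1.50
2,Game B,PS2,N/A,Action,Pub Y,0.75
3,Game C,PC,2011,Puzzle,N/A,0.20"

# Without na.strings, the text "N/A" keeps Year from being read as a number
raw <- read.csv(text = csv_text, header = TRUE)

# With na.strings, "N/A" becomes a real NA and Year stays numeric
demo <- read.csv(text = csv_text, header = TRUE, na.strings = "N/A")

demo <- demo[, -1]      # drop the Rank column
demo <- na.omit(demo)   # drop rows with any missing value

is.numeric(raw$Year)    # Year was not parsed as a number
is.numeric(demo$Year)   # Year is numeric once "N/A" is treated as missing
nrow(demo)              # only the complete row is left
```

Because Year is numeric after this step, it can be compared with numbers directly in later subsets.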

Descriptive Statistics and Graphical Methods

Five Number Summary on Region and Global Sales

For each region and for global sales, the output shows the five-number summary (minimum, first quartile, median, third quartile, and maximum) plus the mean number of copies sold per game, in millions. Note that summary() does not report standard deviations.

summary(vgsales$Global_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.0600  0.1700  0.5409  0.4800 82.7400
summary(vgsales$NA_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0800  0.2656  0.2400 41.4900
summary(vgsales$EU_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0200  0.1477  0.1100 29.0200
summary(vgsales$JP_Sales)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07883  0.04000 10.22000
summary(vgsales$Other_Sales)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.01000  0.04843  0.04000 10.57000
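Since summary() leaves out the standard deviation, one possible way to get all of these statistics in a single table is sketched below. It assumes a made-up data frame called toy whose column names mirror the dataset; the numbers are invented for illustration.

```r
# Made-up sales figures (millions of copies); column names mirror the dataset
toy <- data.frame(
  NA_Sales    = c(0.10, 0.40, 1.20, 0.00, 2.30),
  EU_Sales    = c(0.05, 0.30, 0.80, 0.02, 1.10),
  JP_Sales    = c(0.00, 0.10, 0.00, 0.50, 0.20),
  Other_Sales = c(0.01, 0.05, 0.10, 0.00, 0.30)
)
toy$Global_Sales <- rowSums(toy)  # global = sum of the four regions (toy data only)

sales_cols <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales")

# One column of summary statistics per sales variable
stats <- sapply(toy[sales_cols], summary)

# Standard deviation is not part of summary(), so compute it separately
sds <- sapply(toy[sales_cols], sd)

# Combine both into one table, rounded for display
round(rbind(stats, SD = sds), 3)
```

With the real data, toy would be replaced by vgsales, which already has these five columns.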

Boxplot on Platforms sales in North America Region

I created subsets for each major group of video game platforms (Nintendo, PlayStation, Xbox, PC, and Others), then plotted the North American sales of each platform within its group.

# platform codes belonging to each family
nintendo = c("NES", "SNES", "N64", "GC", "Wii", "WiiU", "DS", "3DS", "GB", "GBA")
playstation = c("PS", "PS2", "PS3", "PS4", "PSP", "PSV")
xbox = c("XB", "X360", "XOne")

vgNintendo = subset(vgsales, Platform %in% nintendo)
vgPC = subset(vgsales, Platform == "PC")
vgPS = subset(vgsales, Platform %in% playstation)
vgXbox = subset(vgsales, Platform %in% xbox)
# everything that is not Nintendo, PlayStation, Xbox, or PC
vgOther = subset(vgsales, !(Platform %in% c(nintendo, playstation, xbox, "PC")))
par(las=2)

boxplot(vgNintendo$NA_Sales~vgNintendo$Platform,  col="red", ylim=c(0,30), main="Video Game Sales in NA on Nintendo Platforms", xlab="Nintendo Consoles", ylab="North American Copies Sold (in millions)")

boxplot(vgPS$NA_Sales~vgPS$Platform, col="cyan", ylim=c(0,10), main="Video Game Sales in NA on PlayStation Platforms", xlab="PlayStation Consoles", ylab="North American Copies Sold (in millions)")

boxplot(vgXbox$NA_Sales~vgXbox$Platform, col="green", ylim=c(0,15) , main="Video Game Sales in NA on Xbox Platforms",xlab="Xbox Consoles", ylab="North American Copies Sold (in millions)")

boxplot(vgPC$NA_Sales, col="grey", ylim=c(0,5) , main="Video Game Sales in NA on PC Platform",xlab="PC Platform", ylab="North American Copies Sold (in millions)")

boxplot(vgOther$NA_Sales~vgOther$Platform, col="pink", ylim=c(0,8) , main="Video Game Sales in NA on Other Platforms",xlab="Other Platforms", ylab="North American Copies Sold (in millions)")
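As an alternative to keeping five separate subsets, the platform family can be stored as one extra column. The sketch below assumes a hypothetical lookup vector family_of covering only a few platform codes, and a made-up data frame toy; it is meant to show the idea, not to reproduce the grouping above exactly.

```r
# Hypothetical lookup from platform code to platform family (partial list)
family_of <- c(NES = "Nintendo", Wii = "Nintendo", DS = "Nintendo",
               PS2 = "PlayStation", PS3 = "PlayStation",
               X360 = "Xbox", XOne = "Xbox", PC = "PC")

# Made-up rows; "2600" is deliberately missing from the lookup
toy <- data.frame(
  Platform = c("Wii", "PS2", "X360", "PC", "2600", "DS", "PS3"),
  NA_Sales = c(1.2, 0.8, 0.9, 0.1, 0.3, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Look up each platform by name; codes missing from the table become "Other"
toy$Family <- unname(family_of[toy$Platform])
toy$Family[is.na(toy$Family)] <- "Other"

# Median North American sales per family, one row per group
aggregate(NA_Sales ~ Family, data = toy, FUN = median)
```

A single Family column also allows one boxplot across all families, for example boxplot(NA_Sales ~ Family, data = toy).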

Bar Graph on number of Publishers

This graph shows each publisher with the number of games it has published.
(Note: the labels on the graph are crowded together, so many publisher names are hidden.)

barplot(table(vgsales$Publisher), col=brewer.pal(12,"Set3"), main="Amount of Video Games from Publishers", xlab="Publishers", ylab ="Number of Games", ylim=c(0,1400))
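One way to deal with the crowded labels is to plot only the largest publishers. The sketch below uses a made-up publisher vector in place of vgsales$Publisher, and the cutoff of 3 is an arbitrary choice for the example.

```r
# Made-up publisher column; in the report this would be vgsales$Publisher
publisher <- c(rep("Pub A", 5), rep("Pub B", 3), rep("Pub C", 8),
               rep("Pub D", 1), rep("Pub E", 2))

# Count games per publisher and sort from most to fewest
counts <- sort(table(publisher), decreasing = TRUE)

# Keep only the largest publishers (3 here; the cutoff is arbitrary)
top <- head(counts, 3)

par(las = 2)  # rotate axis labels so publisher names stay readable
barplot(top, main = "Games per Publisher (top 3)", ylab = "Number of Games")
```

With the full dataset a larger cutoff such as 10 or 20 would likely be more useful.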

Bar Graph of Platform Frequency

This graph shows how many of the recorded video games were released on each platform.

par(las=2)
barplot(table(vgsales$Platform), col= brewer.pal(12,"Set3"), main="Video Games Platform Distribution", ylim=c(0,2500), xlab="Platforms", ylab="Frequency", width = .5)

Dot Plot of Global Video Game Copies Sales in each Year:

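This graph shows the global copies sold by each game against its release year. The y-axis is capped at 45 million, so the top seller (Wii Sports, 82.74 million) falls outside the plot.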

plot(vgsales$Global_Sales~vgsales$Year, col="purple4", main ="Video Game Global Sales by Year", xlab="Release Year", ylab="Copies Sold (in millions)", ylim=c(0,45))
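The plot above has one point per game. If the total sold per release year is of interest instead, the sales can be summed by year first. This is a sketch on made-up numbers; toy and yearly are hypothetical names.

```r
# Made-up games: release year and global sales in millions
toy <- data.frame(
  Year         = c(2006, 2006, 2008, 2008, 2008, 2010),
  Global_Sales = c(3.00, 1.00, 2.50, 0.50, 0.25, 2.00)
)

# Total global sales per release year
yearly <- aggregate(Global_Sales ~ Year, data = toy, FUN = sum)
yearly

# Line-and-point plot of the yearly totals
plot(yearly$Year, yearly$Global_Sales, type = "b", col = "purple4",
     main = "Total Global Sales by Release Year",
     xlab = "Release Year", ylab = "Copies Sold (in millions)")
```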

Histogram on the Video Game Year Release Frequency:

This histogram shows how many of the recorded video games were released in each span of years.

par(las=2)
hist(vgsales$Year, col=brewer.pal(12,"Set3"), main="Video Games Year Release Frequency", xlab="Year",ylim=c(0,3000))

Video Game Genre Frequency and Distribution

Bar Graph on the frequency of genres from the video game observations

par(las=2)
barplot(table(vgsales$Genre), col=brewer.pal(12,"Set3"), main="Genre Distribution Count", ylim=c(0,3500), xlab="", ylab="Frequency")

Genre Distribution via Pie Graph from the video game observations

pie(table(vgsales$Genre), radius = 1, col= brewer.pal(12,"Set3"), main="Video Games Genre Distribution")
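Pie slices are hard to compare by eye, so it can help to print each genre's share as a percentage and put it in the label. The sketch below uses a made-up genre vector in place of vgsales$Genre.

```r
# Made-up genre column; in the report this would be vgsales$Genre
genre <- c(rep("Action", 6), rep("Sports", 3), rep("Puzzle", 1))

counts <- table(genre)

# Share of each genre as a percentage of all games
shares <- round(100 * prop.table(counts), 1)
shares

# Labels that combine the genre name with its share
labels <- paste0(names(shares), " (", shares, "%)")
pie(counts, labels = labels, main = "Genre Distribution (toy data)")
```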

Hypothesis Testing and Discussion

Testing Pre 2010 and Post 2010 Global Sales

I wanted to test whether global sales differ between games released from 2010 onward and games released before 2010. To do this, I created pre-2010 and post-2010 subsets in R and used the two-sample t.test function to check whether games released from 2010 onward have a greater mean than games released before 2010. I use global copies sold, since it measures how much each game sold in total.

# subset of post 2010 (2010 onward; 2018 and 2019 are not listed)
post = subset(vgsales, Year %in% c(2010:2017, 2020))
# subset of pre 2010 (1980 to 2009; 1982 is not listed)
pre = subset(vgsales, Year %in% c(1980:1981, 1983:2009))

t.test(post$Global_Sales,pre$Global_Sales,alt ="greater")
## 
##  Welch Two Sample t-test
## 
## data:  post$Global_Sales and pre$Global_Sales
## t = -3.0525, df = 13240, p-value = 0.9989
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.1111381        Inf
## sample estimates:
## mean of x mean of y 
## 0.4909233 0.5631427

This fails to reject the null hypothesis, since the p-value is greater than the alpha value (0.99 > 0.05). There is not enough evidence to conclude that games released from 2010 onward sold more on average than games released before 2010. Next, let's see whether there is any difference between the two, using a two-sided test of whether the mean post-2010 sales equal the mean pre-2010 sales.

t.test(post$Global_Sales,pre$Global_Sales,alt="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  post$Global_Sales and pre$Global_Sales
## t = -3.0525, df = 13240, p-value = 0.002274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.11859484 -0.02584405
## sample estimates:
## mean of x mean of y 
## 0.4909233 0.5631427

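It rejects the null hypothesis (0.002 < 0.05). The two means differ. Because the post-2010 sample mean (0.49) is lower than the pre-2010 sample mean (0.56) and the confidence interval lies entirely below zero, games released before 2010 sold more copies on average than those released from 2010 onward.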
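The two tests above share the same t statistic (-3.0525); only the tail used for the p-value changes. With a negative t, the "greater" p-value is one minus half of the two-sided one: 1 - 0.002274/2 is about 0.9989, which matches the first output. The sketch below reproduces the Welch statistic, degrees of freedom, and both p-values by hand on simulated data (post_toy and pre_toy are made up, not the real subsets).

```r
# Simulated, skewed "sales" for two groups (made-up data, not the real subsets)
set.seed(1)
post_toy <- rexp(200, rate = 2.0)
pre_toy  <- rexp(300, rate = 1.8)

# Variance of each sample mean
v_post <- var(post_toy) / length(post_toy)
v_pre  <- var(pre_toy)  / length(pre_toy)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(post_toy) - mean(pre_toy)) / sqrt(v_post + v_pre)
df_manual <- (v_post + v_pre)^2 /
  (v_post^2 / (length(post_toy) - 1) + v_pre^2 / (length(pre_toy) - 1))

# p-values from the t distribution: both tails, then upper tail only
p_two     <- 2 * pt(-abs(t_manual), df_manual)
p_greater <- pt(t_manual, df_manual, lower.tail = FALSE)

# The same quantities as reported by t.test()
two_sided <- t.test(post_toy, pre_toy, alternative = "two.sided")
greater   <- t.test(post_toy, pre_toy, alternative = "greater")

c(manual = p_two, t.test = two_sided$p.value)
c(manual = p_greater, t.test = greater$p.value)
```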

Testing Sales Comparison with North America and Japan

We can also run a hypothesis test to see whether the mean number of copies sold per game in North America differs from that in Japan.

t.test(vgsales$NA_Sales,vgsales$JP_Sales,alt="two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  vgsales$NA_Sales and vgsales$JP_Sales
## t = 27.109, df = 20880, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1733061 0.2003211
## sample estimates:
## mean of x mean of y 
## 0.2656467 0.0788331

It rejects the null hypothesis (p < 2.2e-16, far below 0.05). There is a significant difference between the mean copies sold per game in the North America and Japan regions, with the North American mean (0.27) well above the Japanese mean (0.08).
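NA_Sales and JP_Sales are recorded for the same games, one pair of values per row, so a paired test is another option worth considering alongside the Welch test used above. The sketch below is only an illustration on simulated data (na_toy and jp_toy are made up and are correlated by construction); it does not say what the paired result would be on the real data.

```r
# Simulated per-game sales in two regions (made-up data)
set.seed(2)
n <- 50
na_toy <- rexp(n, rate = 4)
jp_toy <- 0.3 * na_toy + rexp(n, rate = 20)  # depends on na_toy by construction

# Welch test: treats the two columns as independent samples
welch <- t.test(na_toy, jp_toy)

# Paired test: works on the per-game differences
paired <- t.test(na_toy, jp_toy, paired = TRUE)

# A paired test is the same as a one-sample test on the differences
one_sample <- t.test(na_toy - jp_toy, mu = 0)

c(welch = welch$p.value, paired = paired$p.value, one_sample = one_sample$p.value)
```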

Testing Platform Sales

Finally, I test whether global video game sales differ significantly among the three major platform families. I do this with two-sample tests that compare the families one pair at a time.

Nintendo v PlayStation
t.test(vgNintendo$Global_Sales,vgPS$Global_Sales)
## 
##  Welch Two Sample t-test
## 
## data:  vgNintendo$Global_Sales and vgPS$Global_Sales
## t = 0.91399, df = 9213.4, p-value = 0.3607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03159010  0.08678471
## sample estimates:
## mean of x mean of y 
## 0.5672301 0.5396328

Failed to reject the null hypothesis (0.36 > 0.05). There is not enough evidence to say that mean global sales per game differ between Nintendo and PlayStation platforms.

Nintendo v Xbox
t.test(vgNintendo$Global_Sales,vgXbox$Global_Sales)

Failed to reject the null hypothesis (0.31 > 0.05). There is not enough evidence to say that mean global sales per game differ between Nintendo and Xbox platforms; the sample means are 0.567 (Nintendo) and 0.606 (Xbox).

PlayStation v Xbox

t.test(vgPS$Global_Sales,vgXbox$Global_Sales)
## 
##  Welch Two Sample t-test
## 
## data:  vgPS$Global_Sales and vgXbox$Global_Sales
## t = -2.1362, df = 3436.7, p-value = 0.03273
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.126640579 -0.005427174
## sample estimates:
## mean of x mean of y 
## 0.5396328 0.6056667

Reject the null hypothesis (0.03 < 0.05). There is enough evidence to say that mean global sales per game differ between PlayStation and Xbox platforms, with the Xbox sample mean (0.61) above the PlayStation sample mean (0.54).
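The three pairwise comparisons can also be run in a single call with pairwise.t.test() from base R, which can adjust the p-values for the fact that several tests are made. The sketch below uses simulated data with a made-up grouping factor maker; the Holm adjustment is one possible choice, not something the analysis above used.

```r
# Simulated global sales for three platform families (made-up data)
set.seed(3)
sales <- c(rexp(120, rate = 1.8), rexp(150, rate = 1.9), rexp(60, rate = 1.6))
maker <- factor(rep(c("Nintendo", "PlayStation", "Xbox"), times = c(120, 150, 60)))

# All pairwise Welch tests (pool.sd = FALSE), Holm-adjusted p-values
res <- pairwise.t.test(sales, maker, p.adjust.method = "holm", pool.sd = FALSE)

# Lower-triangular matrix: rows and columns are the families being compared
res$p.value

# Same tests without any adjustment, for comparison with separate t.test() calls
res_raw <- pairwise.t.test(sales, maker, p.adjust.method = "none", pool.sd = FALSE)
res_raw$p.value
```

An adjusted p-value is never smaller than the unadjusted one, so a borderline result such as 0.03 may or may not stay below 0.05 after adjustment.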

Conclusion

The descriptive statistics show how sales are distributed over the years and how many games were released on each company's platforms. The observations also show which genres are represented and how large a share each one takes up. The hypothesis tests indicate that sales differ by release period (before 2010 versus 2010 onward), between the North America and Japan regions, and between PlayStation and Xbox platforms, while no significant difference was found between Nintendo and PlayStation or between Nintendo and Xbox. Together, these results give insight into how video games have sold over the past decades.

References

Smith, G. (2023). Video Game Sales (Version No. 2) [Data set]. Kaggle. https://www.kaggle.com/datasets/gregorut/videogamesales