People often ask which video game has sold the most copies, which genres are the most popular, or which platform has sold the most. This dataset is a collection of video games that each sold more than 100,000 copies. It covers games released from 1980 through 2020 and records the copies sold in each region and globally, the release year, and the genre of each game. The data contain over 15,000 observations and 10 variables.
Dataset: https://www.kaggle.com/datasets/gregorut/videogamesales
Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | Field 9 | Field 10 |
---|---|---|---|---|---|---|---|---|---|
Name | Platform | Year | Genre | Publisher | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |
library(RColorBrewer) # used to make colors
vgsales = read.csv("G:\\My Drive\\Purdue\\Semester 7-8 [Y4]\\07_Statistical Computing\\stats data\\midterm\\vgsales.csv",header=T, na.strings = "N/A")
vgsales = vgsales[,-1] # removes rank
vgsales = na.omit(vgsales) # remove any NA
head(vgsales)
## Name Platform Year Genre Publisher NA_Sales
## 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
The summaries below give, for global sales and for each region, the minimum, quartiles, median, mean, and maximum copies sold per game (in millions). Note that summary() does not report standard deviations.
summary(vgsales$Global_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0600 0.1700 0.5409 0.4800 82.7400
summary(vgsales$NA_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0800 0.2656 0.2400 41.4900
summary(vgsales$EU_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0200 0.1477 0.1100 29.0200
summary(vgsales$JP_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07883 0.04000 10.22000
summary(vgsales$Other_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.01000 0.04843 0.04000 10.57000
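Since summary() reports quartiles and means but not spread, a standard deviation per column has to be computed separately. The sketch below is one way to do that; it uses only the six rows printed by head(vgsales) above as toy data, so the numbers it produces describe those six games and not the full dataset.

```r
# Toy data: the six rows printed by head(vgsales) above (sales in millions).
toy <- data.frame(
  NA_Sales     = c(41.49, 29.08, 15.85, 15.75, 11.27, 23.20),
  EU_Sales     = c(29.02, 3.58, 12.88, 11.01, 8.89, 2.26),
  JP_Sales     = c(3.77, 6.81, 3.79, 3.28, 10.22, 4.22),
  Other_Sales  = c(8.46, 0.77, 3.31, 2.96, 1.00, 0.58),
  Global_Sales = c(82.74, 40.24, 35.82, 33.00, 31.37, 30.26)
)

# Apply sd() to every sales column; the result is a named numeric vector.
sales_sd <- sapply(toy, sd)
round(sales_sd, 2)
```

On the full data the same call would presumably be applied to the five sales columns of vgsales, selected by name.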
I created subsets for each major platform family (Nintendo, PlayStation, Xbox, PC, and Others), then plotted the North American sales of each platform within its family.
vgOther = subset(vgsales,
( Platform != "NES")&( Platform != "SNES")&
( Platform != "N64")&( Platform != "GC")&
( Platform != "Wii")&( Platform != "WiiU")&
( Platform != "DS")&( Platform != "3DS")&
( Platform != "GB")&( Platform != "GBA")&
( Platform != "PC")&( Platform != "PS4")&
( Platform != "PS3")&( Platform != "PS2")&
( Platform != "PSP")&( Platform != "PSV")&
( Platform != "PS")&( Platform != "XB")&
( Platform != "X360")&( Platform != "XOne")
)
vgNintendo = subset(vgsales, (
( Platform == "NES")|( Platform == "SNES")|
( Platform == "N64")|( Platform == "GC")|
( Platform == "Wii")|( Platform == "WiiU")|
( Platform == "DS")|( Platform == "3DS")|
( Platform == "GB")|( Platform == "GBA")))
vgPC = subset(vgsales, Platform == "PC")
vgPS = subset(vgsales, (
( Platform == "PS4")|( Platform == "PS3")|
( Platform == "PS2")|( Platform == "PSP")|
( Platform == "PSV")|( Platform == "PS")))
vgXbox = subset(vgsales, (( Platform == "XB")|
( Platform == "X360")|( Platform == "XOne")))
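The same grouping can be written more compactly with %in% and one vector of platform codes per family, which also keeps the "Other" group in sync with the named families automatically. This is only a sketch: the data frame toy and its rows are invented for illustration, while the platform codes are the ones used in the subsets above.

```r
# Hypothetical toy data; only the Platform codes matter here.
toy <- data.frame(
  Name     = c("A", "B", "C", "D", "E", "F"),
  Platform = c("Wii", "PS2", "X360", "PC", "2600", "GB"),
  stringsAsFactors = FALSE
)

# Platform codes per family, copied from the subsets above.
nintendo    <- c("NES", "SNES", "N64", "GC", "Wii", "WiiU", "DS", "3DS", "GB", "GBA")
playstation <- c("PS", "PS2", "PS3", "PS4", "PSP", "PSV")
xbox        <- c("XB", "X360", "XOne")

toyNintendo <- subset(toy, Platform %in% nintendo)
toyPS       <- subset(toy, Platform %in% playstation)
toyXbox     <- subset(toy, Platform %in% xbox)
toyPC       <- subset(toy, Platform == "PC")
# Everything not claimed by a named family falls into "Other".
toyOther    <- subset(toy, !(Platform %in% c(nintendo, playstation, xbox, "PC")))
```

Adding a platform to one of the vectors then updates both its family subset and the "Other" subset in a single place.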
par(las=2)
boxplot(vgNintendo$NA_Sales~vgNintendo$Platform, col="red", ylim=c(0,30), main="Video Game Sales in NA on Nintendo Platforms", xlab="Nintendo Consoles", ylab="North American Copies Sold (in millions)")
boxplot(vgPS$NA_Sales~vgPS$Platform, col="cyan", ylim=c(0,10),main="Video Game Sales in NA on Playstation Platforms",xlab="Playstation Consoles", ylab="North American Copies Sold (in millions)")
boxplot(vgXbox$NA_Sales~vgXbox$Platform, col="green", ylim=c(0,15) , main="Video Game Sales in NA on Xbox Platforms",xlab="Xbox Consoles", ylab="North American Copies Sold (in millions)")
boxplot(vgPC$NA_Sales, col="grey", ylim=c(0,5) , main="Video Game Sales in NA on PC Platform",xlab="PC Platform", ylab="North American Copies Sold (in millions)")
boxplot(vgOther$NA_Sales~vgOther$Platform, col="pink", ylim=c(0,8) , main="Video Game Sales in NA on Other Platforms",xlab="Other Platforms", ylab="North American Copies Sold (in millions)")
This graph shows the number of games in the dataset from each publisher.
(Note: the bars are crowded together, so many publisher labels are hidden.)
barplot(table(vgsales$Publisher), col=brewer.pal(12,"Set3"), main="Amount of Video Games from Publishers", xlab="Publishers", ylab ="Number of Games", ylim=c(0,1400))
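One way to keep the publisher labels readable is to plot only the publishers with the most games. The sketch below sorts the counts and keeps the top n; the publisher vector and the cutoff n = 3 are invented for illustration and would be replaced by vgsales$Publisher and a larger cutoff on the real data.

```r
# Hypothetical publisher column standing in for vgsales$Publisher.
publisher <- c("Nintendo", "Nintendo", "Nintendo", "Activision", "Activision",
               "Ubisoft", "Sega", "Sega", "Sega", "Sega")

# Count games per publisher, sort in descending order, keep the top n.
n <- 3
top <- head(sort(table(publisher), decreasing = TRUE), n)

par(las = 2)  # rotate axis labels so publisher names stay readable
barplot(top, main = "Top Publishers by Number of Games",
        ylab = "Number of Games")
```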
This graph shows how many of the recorded video games were released on each platform.
par(las=2)
barplot(table(vgsales$Platform), col= brewer.pal(12,"Set3"), main="Video Games Platform Distribution", ylim=c(0,2500), xlab="Platforms", ylab="Frequency", width = .5)
This graph shows the global sales of each video game plotted against its release year.
plot(vgsales$Global_Sales~vgsales$Year, col="purple4", main ="Video Game Global Sales by Year", xlab="Release Year", ylab="Copies Sold (in millions)", ylim=c(0,45))
This graph shows the number of video games released in each year.
par(las=2)
hist(vgsales$Year, col=brewer.pal(12,"Set3"), main="Video Games Year Release Frequency", xlab="Year",ylim=c(0,3000))
Bar Graph on the frequency of genres from the video game observations
par(las=2)
barplot(table(vgsales$Genre), col=brewer.pal(12,"Set3"), main="Genre Distribution Count", ylim=c(0,3500), xlab="", ylab="Frequency")
Genre Distribution via Pie Graph from the video game observations
pie(table(vgsales$Genre), radius = 1, col= brewer.pal(12,"Set3"), main="Video Games Genre Distribution")
I wanted to test whether global sales differ between games released before 2010 and games released from 2010 onward. I created two subsets in R, pre (before 2010) and post (2010 and later), and used R's two-sample t.test function to check whether the post-2010 games sold more on average than the pre-2010 games. I use global copies sold so that the comparison reflects total sales.
#subset of post 2010
post = subset(vgsales, (
( Year == "2010")|( Year == "2011")|
( Year == "2012")|( Year == "2013")|
( Year == "2014")|( Year == "2015")|
( Year == "2016")|( Year == "2017")|
( Year == "2020")))
#subset of pre 2010
pre = subset(vgsales, (
( Year == "1980")|( Year == "1981")|
( Year == "1983")|( Year == "1984")|
( Year == "1985")|( Year == "1986")|
( Year == "1987")|( Year == "1988")|
( Year == "1989")|( Year == "1990")|
( Year == "1991")|( Year == "1992")|
( Year == "1993")|( Year == "1994")|
( Year == "1995")|( Year == "1996")|
( Year == "1997")|( Year == "1998")|
( Year == "1999")|( Year == "2000")|
( Year == "2001")|( Year == "2002")|
( Year == "2003")|( Year == "2004")|
( Year == "2005")|( Year == "2006")|
( Year == "2007")|( Year == "2008")|
( Year == "2009")))
t.test(post$Global_Sales,pre$Global_Sales,alt ="greater")
##
## Welch Two Sample t-test
##
## data: post$Global_Sales and pre$Global_Sales
## t = -3.0525, df = 13240, p-value = 0.9989
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.1111381 Inf
## sample estimates:
## mean of x mean of y
## 0.4909233 0.5631427
We fail to reject the null hypothesis because the p-value is greater than alpha (0.9989 > 0.05). There is not enough evidence to conclude that games released from 2010 onward sold more on average than games released before 2010. Next, let's check whether there is any difference between the two groups, this time using a two-sided test of whether mean global sales are equal before and after 2010.
t.test(post$Global_Sales,pre$Global_Sales,alt="two.sided")
##
## Welch Two Sample t-test
##
## data: post$Global_Sales and pre$Global_Sales
## t = -3.0525, df = 13240, p-value = 0.002274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.11859484 -0.02584405
## sample estimates:
## mean of x mean of y
## 0.4909233 0.5631427
We reject the null hypothesis (0.002 < 0.05). The two means differ, and because the post-2010 mean (0.49 million) is lower than the pre-2010 mean (0.56 million), the data suggest that games released before 2010 sold more copies on average.
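The pre and post subsets above list years one by one, and the lists skip 1982 (pre) and 2018 and 2019 (post), so any games from those years would be left out of both groups. A range comparison avoids that kind of gap. The sketch below assumes Year is numeric, as the head() output suggests; the toy data frame and its sales figures are invented for illustration, and results on the real data could differ slightly from the output printed above.

```r
# Hypothetical toy data; the sales values are made up.
toy <- data.frame(
  Year         = c(1982, 1985, 1996, 2006, 2009, 2010, 2013, 2016),
  Global_Sales = c(1.0, 2.0, 1.5, 3.0, 0.5, 0.8, 1.2, 0.4)
)

# Split on a single cutoff so no year can be skipped by accident.
post <- subset(toy, Year >= 2010)
pre  <- subset(toy, Year < 2010)

# Every row lands in exactly one of the two groups.
nrow(pre) + nrow(post) == nrow(toy)
```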
We can also test whether mean sales per game in North America differ from mean sales per game in Japan.
t.test(vgsales$NA_Sales,vgsales$JP_Sales,alt="two.sided")
##
## Welch Two Sample t-test
##
## data: vgsales$NA_Sales and vgsales$JP_Sales
## t = 27.109, df = 20880, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1733061 0.2003211
## sample estimates:
## mean of x mean of y
## 0.2656467 0.0788331
We reject the null hypothesis (p < 2.2e-16, far below 0.05). There is a significant difference between the mean copies sold per game in North America and in Japan.
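Because every game contributes both a North American and a Japanese figure, the two columns are matched game by game, so a paired t-test is an alternative worth considering alongside the Welch test above. The sketch below is only an illustration on the six games printed by head(vgsales); it is not a rerun of the analysis on the full data.

```r
# NA and JP sales (millions) for the six games shown by head(vgsales).
na_sales <- c(41.49, 29.08, 15.85, 15.75, 11.27, 23.20)
jp_sales <- c(3.77, 6.81, 3.79, 3.28, 10.22, 4.22)

# paired = TRUE tests the mean of the per-game differences (NA minus JP).
res <- t.test(na_sales, jp_sales, paired = TRUE)
res$estimate   # mean per-game difference
res$p.value
```

On the full data the call would presumably be t.test(vgsales$NA_Sales, vgsales$JP_Sales, paired = TRUE).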
Finally, I test whether global sales per game differ significantly among the three major platform families. I do this with two-sample t-tests that compare the families pairwise.
t.test(vgNintendo$Global_Sales,vgPS$Global_Sales)
##
## Welch Two Sample t-test
##
## data: vgNintendo$Global_Sales and vgPS$Global_Sales
## t = 0.91399, df = 9213.4, p-value = 0.3607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03159010 0.08678471
## sample estimates:
## mean of x mean of y
## 0.5672301 0.5396328
We fail to reject the null hypothesis (0.36 > 0.05). There isn't enough evidence to say that mean global sales per game differ between Nintendo and PlayStation platforms.
t.test(vgNintendo$Global_Sales,vgXbox$Global_Sales)
We fail to reject the null hypothesis (0.31 > 0.05). There isn't enough evidence to say that mean global sales per game differ between Nintendo and Xbox platforms.
##### Playstation v Xbox
t.test(vgPS$Global_Sales,vgXbox$Global_Sales)
##
## Welch Two Sample t-test
##
## data: vgPS$Global_Sales and vgXbox$Global_Sales
## t = -2.1362, df = 3436.7, p-value = 0.03273
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.126640579 -0.005427174
## sample estimates:
## mean of x mean of y
## 0.5396328 0.6056667
We reject the null hypothesis (0.03 < 0.05). There is enough evidence of a difference in mean global sales per game between PlayStation and Xbox platforms, with the Xbox mean (0.61 million) higher than the PlayStation mean (0.54 million).
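Running three pairwise tests at the 0.05 level raises the chance of at least one false positive, so it may be worth adjusting the p-values for multiple comparisons. The sketch below applies Holm's method with p.adjust() to the three p-values reported above (0.3607, 0.31, and 0.03273); the vector names are made up for readability, and whether an adjustment is appropriate here is a judgment call.

```r
# The three p-values reported in the pairwise platform comparisons above.
p_raw <- c(nintendo_ps   = 0.3607,
           nintendo_xbox = 0.31,
           ps_xbox       = 0.03273)

# Holm's step-down adjustment for a family of three tests.
p_holm <- p.adjust(p_raw, method = "holm")
p_holm
```

Under Holm's method the smallest p-value is multiplied by the number of tests, so the PlayStation versus Xbox value becomes about 0.098 and would no longer fall below 0.05.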
We can look at how sales have changed over the years and how many games each company has on its platforms. We can also verify that sales differ not only across platforms but also by release year. The observations also show which genres are represented and how large a share each genre takes. The hypothesis tests indicate whether one platform truly differs in sales from another and whether one region has bought more copies than another. Overall, the data on video game sales from 1980 to 2020 offer a lot of insight.
Smith, G. (2023). Video Game Sales (Version No. 2) [Data set]. Kaggle. https://www.kaggle.com/datasets/gregorut/videogamesales