Sunday, January 24, 2016

Finding out the team which won the EPL football tournament in the year 2013-2014

I am new to R programming and the tutorials I read were very helpful but I needed a use case to understand the subject better. As a football enthusiast, I couldn't find a better dataset than the football data of 2013-14. I am not going to go deep into statistical analysis, all I am trying to do is to experiment the basic functionalities of R. I have used excel and know finding the winner is a very simple task with Excel. To answer the question,why use R then, please google.

To those who are unaware of who won the league in the year 2013-14, it was a club called Manchester City. They lead the table by a total of 86 points.
"There are 20 clubs in the Premier League. During the course of a season (from August to May) each club plays the others twice (a double round-robin system), once at their home stadium and once at that of their opponents, for a total of 38 games. Teams receive three points for a win and one point for a draw. No points are awarded for a loss. Teams are ranked by total points, then goal difference, and then goals scored. If still equal, teams are deemed to occupy the same position. If there is a tie for the championship, for relegation, or for qualification to other competitions, a play-off match at a neutral venue decides rank.[40] The three lowest placed teams are relegated into the Football League Championship, and the top two teams from the Championship, together with the winner of play-offs involving the third to sixth placed Championship clubs, are promoted in their place." - Wikipedia

You can get the data from here. Scroll down to season 2013/2014 and click on 'Premier League' to download the csv file.Move the csv file to your R repository. To know more about the basics of R, I found this guide very helpful.
The basic way in which a winner is determined is as follows,
The basic steps in finding a winner is, 
1. Find out the total points earned at 'Home' for each team
2. Find out the total points earned 'Away' for each team
3. Add them both for each team
4. Find the team with the highest points
Load the csv file into R, I have named the csv file as 'foot.csv'

fd <- read.csv("foot.csv")
Then type,
View(fd) #this command will show the table in R studio.

You will see that most of the columns are irrelevant to you, So we will just use the first 7 columns which seem relevant to our use case
There are 2 ways of doing this,
1. fd <- fd[1:7] will do the trick. Use command 'View(fd)' to visualize your table again.
Also, I realise some columns 5,6 are irrelevant as well. I am not sure if there is a better way to trim down the data, but this is what I do.
fd$FTHG <- NULL
fd$FTAG <- NULL
This command will remove the two columns.

2. Alternatively, you can choose to use the select function in the dplyr package. For this to work you have to install the dplyr package and the library.
Syntax : select(datasetname, columnname1,columnname2,...)
Usage  :  View(newfd<-select(fd,Date,HomeTeam,AwayTeam,FTR)) #Do not use fd$Date,fd$HomeTeam in the select function.)
Use command 'View(fd)' to visualize your table again.

Now I see that there are 380 matches which were played and I need to find out the home wins and the away wins for each of those matches. So I add an extra 2 columns in the table called 'homewins' and 'awaywins' and by default set their value to 0. To add an extra column simply use datasetname$newcolumnname < -new column value
fd$homewins <- 0
fd$awaywins <- 0

Now we need to find out the points each team had obtained. Something similiar to this Excel's IF statement.
#Executing Excel's if statement in R
Excel : if(value in FTR == "H",3,if(value in FTR == "D",1,0))
fd$homewins[fd$FTR %in% "H"]<-3
fd$homewins[fd$FTR %in% "D"]<-1
fd$homewins[fd$FTR %in% "A"]<-0
Again, not sure if there is an easier way to do this. Please let me know if you have a better solution.
Going ahead, do the same for the 'awaywins' column.

Now all we need to do is to find out the total points each team scored in their Home Ground and in the Away Ground.
For this purpose we use the function called aggregate which works like a SUMIF function in excel.
Syntax : aggregate( column to be summed ~ filter column, datasetname, function name(sum,mean..etc))
View(Homewins <- aggreagte(fd$homewins~fd$HomeTeam,fd,sum))
Similiarly, calculate the same for Awaywins
View(Awaywins <- aggreagte(fd$awaywins~fd$AwayTeam,fd,sum))
Now, we need to find out the total points for each team. For this, I merge the two dataframes into one and then sum the values. Merge requires a unique column to be present in both dataframes. In our case the column is the team name. In order for merge to work we need to change the the column name has to be the same in both data frames.
Let us change the column name to "ID"
Syntax: colnames(dataset name)[colnumber] <- "New column name"
colnames(Homewins)[1] <- "ID"colnames(Awaywins)[1] <- "ID"
We then merge the two dataframes
Syntax: merge(dataframe1,dataframe2,...,by="unique identifying colname")
Total <- merge(Awaywins, Homewins,by="ID")
Next we create a new column called sum(whenever you want to add a new column, the syntax is datasetname$newcolumnname <- column value ) and add the homewins and awaywins in this column
Total$winningpoints <- Total$`fd$homewins`+Total$`fd$awaywins`
Now we need to find out the maximum value in this column which will give our winning team for the season
View(subset(Total, winningpoints == max(winningpoints),))

The R script for finding the winner of the premier league is as follows,
fd <- read.csv("pl20142015.csv")
fd <- fd[1:7]
fd$FTHG <- NULL
fd$FTAG <- NULL
fd$homewins <- 0
fd$awaywins <- 0
fd$homewins[fd$FTR %in% "H"]<-3
fd$homewins[fd$FTR %in% "D"]<-1
fd$homewins[fd$FTR %in% "A"]<-0
fd$awaywins[fd$FTR %in% "A"]<-3
fd$awaywins[fd$FTR %in% "D"]<-1
fd$awaywins[fd$FTR %in% "H"]<-0
Homewins <- aggregate(fd$homewins~fd$HomeTeam,fd,sum)
Awaywins <- aggregate(fd$awaywins~fd$AwayTeam,fd,sum)
colnames(Homewins)[1] <- "ID"
colnames(Awaywins)[1] <- "ID"
Total <- merge(Homewins,Awaywins,by="ID")
Total$winningpoints <- Total$`fd$homewins`+Total$`fd$awaywins`
View(subset(Total,winningpoints == max(winningpoints),))
If anyone has a better solution, please let me know. Thank you in advance.








No comments:

Post a Comment