<- "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2012" url
Athletics rankings
Using R to gather data from an athletics rankings website.
As a keen athlete I spent many hours on athletics rankings website https://www.thepowerof10.info. For athletics nerds, the site is fantastic and provides all the information you need to keep up to date with the latest results. However, I wondered if it might be possible to explore some alternative ways of presenting the results…
Getting the data
Initially, I had to figure out a way to first gather the data from the power of 10 website. As you can see in the screenshot below, the rankings are already in a table format, so thankfully it wasn’t too difficult to figure something out.
::include_graphics("images/powerof10homepage.PNG") knitr
This was my first time using the rvest
package, so I’m sure this code can be improved somewhat… It does, however, seem to do the trick. To start, I identified a url from one of the ranking pages that would serve as a starting point to access the site.
To access other pages it was just a case of supplying new values to the arguments in the url. For example to get the 200m rankings I could just swap the “event=100” for “event=200” in the string. To do this, I wrote a simple function that would generate a vector of strings with the desired events, years and for male or female athletes.
library(tidyverse)
library(zoo)
library(knitr)
library(lubridate)
<- function(years, events, gender) {
urlmaker <- "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2012"
url <- as.character(str_extract_all(url, "[0-9]+")[[1]][[2]])
baseevent <- as.character(str_extract_all(url, "[0-9]+")[[1]][[3]])
baseyear <- rep(years, times = length(events))
newyear <- str_replace(url, baseyear, newyear)
list_of_years <- rep(events, each = length(years))
newevent <- str_replace(list_of_years, baseevent, newevent)
yearandevent
<- str_c("sex=", gender, sep = "")
sex <- str_replace(yearandevent, "sex=M", sex)
yearandevent return(yearandevent)
}
I then created a string of events and years I wanted to scrape. In this instance I only wanted male athletes, so also set gender to “M”. The result was a vector of urls as character strings that I could use to access the necessary pages.
<- c("100", "200", "400", "800", "1500", "3000", "5000", "10000")
events <- as.character(seq(from = 2006, to = 2018, by = 1))
years <- "M"
gender
# create list of urls to use for scraping
<- urlmaker(years, events, gender)
urls
print(head(urls, 3))
[1] "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2006"
[2] "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2007"
[3] "https://www.thepowerof10.info/rankings/rankinglist.aspx?event=100&agegroup=ALL&sex=M&year=2008"
Next, I had to figure out a way to actually get the data. Inspecting the html code on the page helped to identify the labels for the tables that I needed.
::include_graphics("images/po10_window.png") knitr
Once I knew how the different elements on the page were identified, it was as easy as copying and pasting the table id into html_nodes()
and converting to a table with html_table()
. After that it was just the usual wrangling you’d expect with a messy data frame.
library(rvest)
<- function(url) {
readtable <- read_html(url)
main <- main %>%
rankings html_nodes(xpath = '//*[@id="cphBody_lblCachedRankingList"]/table') %>%
html_table() %>%
data.frame() %>%
select(1:13) %>%
set_names(c("rank",
"perf",
"windy",
"windspeed",
"PB",
"newpb",
"name",
"agegroup",
"month_year",
"coach",
"club",
"venue",
"date")) %>%
mutate(
year = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[3]]),
event = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[2]])
%>%
) mutate_at(vars(name), list(~ replace(., . == "", NA))) %>%
mutate(name = na.locf(name))
}
Now I simply had to pass the url to the function, and we’ll have the table from the webpage stored as a dataframe.
<- readtable(url)
ranks
%>%
ranks slice(4:6) %>%
select(rank, perf, name, year, club, venue) %>%
kable()
rank | perf | name | year | club | venue |
---|---|---|---|---|---|
1 | 10.02 | Dwain Chambers | 2012 | Belgrave | Olympic Park |
2 | 10.05 | Adam Gemili | 2012 | Blackheath & Bromley | Barcelona, ESP |
3 | 10.13 | James Dasaolu | 2012 | Croydon | Olympic Park |
If I wanted to scrape multiple urls at once I would use map_df
. Here I can pass my vector of urls and map_df
will perform the scraping function readtable
on each element in the vector and append the result together (this can take a while to complete so I’ll just use the first 3 urls from our vector of 104 urls).
<- urls[1:3]
urls_short
<- urls_short %>%
male_rankings map_df(readtable)
Now we can view the top ranked athlete for each year.
%>%
male_rankings filter(rank == "1") %>%
select(year, rank, perf, name, club, venue) %>%
kable()
year | rank | perf | name | club | venue |
---|---|---|---|---|---|
2006 | 1 | 10.07 | Dwain Chambers | Belgrave | Gateshead |
2007 | 1 | 10.06 | Marlon Devonish | Coventry | Lausanne, SUI |
2008 | 1 | 10.00 | Dwain Chambers | Belgrave | Birmingham |
Individual athletes
The PO10 also provides detailed performance history for individual athletes. I wanted to be able to have access to this data as well, however I wanted to avoid downloading every individual athletes data to disk as I imagine that may have taken a while…
Instead the solution I came up with was a function get each individual athlete’s unique identifier number. With this number I could get each athletes individual rankings when required using an “on the fly” scrape, as a single athletes page is not very much data at all.
The screenshot below shows how the code that I needed to extract
::include_graphics("images/po10id.png") knitr
The function below collects the athltes name, unique url, year and event.
<- function(url) {
athleteurl <- read_html(url)
main <- tibble(
athleteinfo name = html_text(html_nodes(main, "td:nth-child(7) a"), "href"),
athleteurl = html_attr(html_nodes(main, "td:nth-child(7) a"), "href"),
year = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[3]]),
event = as.numeric(str_extract_all(url, "[0-9]+")[[1]][[2]])
%>%
) filter(name != "")
}
<- athleteurl(url)
ids
head(ids)
# A tibble: 6 × 4
name athleteurl year event
<chr> <chr> <dbl> <dbl>
1 Dwain Chambers /athletes/profile.aspx?athleteid=31816 2012 100
2 Adam Gemili /athletes/profile.aspx?athleteid=208735 2012 100
3 James Dasaolu /athletes/profile.aspx?athleteid=22721 2012 100
4 Harry Aikines-Aryeetey /athletes/profile.aspx?athleteid=19988 2012 100
5 Mark Lewis-Francis /athletes/profile.aspx?athleteid=21139 2012 100
6 James Alaka /athletes/profile.aspx?athleteid=22255 2012 100
In another function I join these two tables together so I had one data frame that had all the results and rankings as well as each athletes individual id that I could use to get their individual data. I also appended the full address to each individual athletes id.
<- function(ranks, ids, gender) {
finaljoin <- ranks %>%
yeartimes group_by(year, event) %>%
filter(str_detect(rank, "[:alpha:]")) %>%
select(year, rank) %>%
filter(str_detect(rank, "^UK")) %>%
separate(rank, c("topn", "timing"), ": ") %>%
spread(key = topn, value = timing) %>%
ungroup()
<- ranks %>%
cleanrankings filter(!str_detect(rank, "[:alpha:]")) %>%
inner_join(., ids, by = c("name", "year", "event")) %>%
mutate(athleteurl = paste("https://www.thepowerof10.info", athleteurl, sep = ""))
<- left_join(cleanrankings, yeartimes, by = c("event", "year")) %>%
cleanrankings ::clean_names() %>%
janitormutate(gender = gender)
}
This left me with a complete dataframe I could work with, with the options of getting individual athletes data as needed,
<- finaljoin(ranks = ranks, ids = ids, gender = "M") clean_ranks
Adding missing grouping variables: `event`
%>%
clean_ranks head() %>%
select(rank, perf, name, year, club, venue, athleteurl) %>%
kable()
rank | perf | name | year | club | venue | athleteurl |
---|---|---|---|---|---|---|
1 | 10.02 | Dwain Chambers | 2012 | Belgrave | Olympic Park | https://www.thepowerof10.info/athletes/profile.aspx?athleteid=31816 |
2 | 10.05 | Adam Gemili | 2012 | Blackheath & Bromley | Barcelona, ESP | https://www.thepowerof10.info/athletes/profile.aspx?athleteid=208735 |
3 | 10.13 | James Dasaolu | 2012 | Croydon | Olympic Park | https://www.thepowerof10.info/athletes/profile.aspx?athleteid=22721 |
4 | 10.20 | Harry Aikines-Aryeetey | 2012 | Sutton & District | Rovereto, ITA | https://www.thepowerof10.info/athletes/profile.aspx?athleteid=19988 |
5 | 10.21 | Mark Lewis-Francis | 2012 | Birchfield H | Mesa AZ, USA | https://www.thepowerof10.info/athletes/profile.aspx?athleteid=21139 |
6 | 10.22 | James Alaka | 2012 | Blackheath & Bromley | Eugene OR, USA | https://www.thepowerof10.info/athletes/profile.aspx?athleteid=22255 |
The final function I used was to scrape and clean each individuals rankings “on the fly”. The only input is an athletes individual url. The result was another dataframe that contains an individual athletes history of performances.
<- function(athlete) {
individual <- athlete %>%
history read_html() %>%
html_nodes(xpath = '//*[@id = "cphBody_pnlPerformances"]/table') %>%
html_table(fill = TRUE) %>%
2]] %>%
.[[select(-c(X4, X5, X8, X9)) %>%
set_names(c("event",
"perf",
"indoor",
"position",
"heat",
"venue",
"meeting",
"date")) %>%
filter(!str_detect(event, "[:alpha:]")) %>%
mutate(
date = dmy(date),
year = substr(date, 1, 4),
perf_time = case_when(
str_detect(perf, ":") == FALSE ~ str_c("00:00:", perf),
str_length(str_split_fixed(perf, ".", 4)[4]) == 3 ~ str_c(perf, "0"),
str_length(perf) == 6 | str_length(perf) == 7 ~ str_c("00:0", perf),
str_length(perf) == 8 | str_length(perf) == 9 ~ str_c("00:", perf),
TRUE ~ perf
)
)<- read_html(athlete)
name <- html_text(html_nodes(name, css = "h2"), trim = TRUE)
name <- history %>%
history mutate(name = name)
return(history)
}
<- "https://www.thepowerof10.info/athletes/profile.aspx?athleteid=31816"
athlete
individual(athlete) %>%
head(5) %>%
select(name, event, perf, position, venue, meeting, date) %>%
kable()
name | event | perf | position | venue | meeting | date |
---|---|---|---|---|---|---|
Dwain Chambers | 100 | 10.91 | 4 | Sportcity | Müller British Championships inc. Invitation 10000m | 2021-06-25 |
Dwain Chambers | 60 | 6.73 | 1 | Lee Valley | South of England AA U20 / Senior Championships | 2020-02-01 |
Dwain Chambers | 60 | 6.74 | 1 | Lee Valley | South of England AA U20 / Senior Championships | 2020-02-01 |
Dwain Chambers | 60 | 6.76 | 2 | Lee Valley | London U20 / Senior Games | 2020-01-19 |
Dwain Chambers | 60 | 6.77 | 5 | Lee Valley | Scienhealth Athletics Invitational | 2020-01-05 |
So that’s it! I am currently in the process of creating an interactive dashboard to visualise these results. You can see an early version here:
https://harryfish.shinyapps.io/resultsdashboard/
Lots of work left to do, but any comments or feedback are always welcome. The source code is on my github if you want to try anything out.
Thanks for reading!