Introduction
In this post I describe how to create a dataset by scraping information from the HTML pages of a wiki. Perrypedia is a wiki holding information about Perry Rhodan: “Perry Rhodan is the eponymous hero of a German science fiction novel series which has been published each week since 8 September 1961 in the ‘Romanhefte’ format (digest-sized booklets, usually containing 66 pages, the German equivalent of the now-defunct American pulp magazine) by Pabel-Moewig Verlag, a subsidiary of Bauer Media Group.”
The data needed to answer my questions of interest can be found in the aforementioned Perrypedia wiki. For each issue, metadata is provided in an HTML table, e.g. Issue 1. The metadata includes series, season, title, subtitle, author, title illustrator, inbound illustrator, release, main characters, event timeframe, and locations.
The workflow that I follow is:
- Setup
- Save HTML files locally from Perrypedia
- Scrape information from HTML
- Analyze results and improve scraping, repeat step 3
- Amend the data with, for example, gender information
- Save the resulting datasets to CSV files
My personal motivation to perform this task is to learn how to scrape information from websites. Moreover, I would like to learn about the tidyr package to clean data for further analysis.
Legal disclaimer: This article is based on an article from Perrypedia and is available under the terms of the GNU FDL. The authors and the source text are available there.
Setup
I use the following libraries:
- futile.logger for logging
- tidyverse for all relevant packages to manipulate and clean data
- rvest to scrape information from websites
- gender to infer gender from first names
# load the tidyverse and other libraries
library(futile.logger)
library(tidyverse)
library(rvest)
library(stringr)
library(gender)
flog.trace("libraries loaded")
Setup of useful constants:
# set logging threshold
flog.threshold(INFO)
## NULL
# issue numbers to handle
c_issue_load_from <- 2899
c_issue_load_to <- 2898
c_issue_scrape_from <- 2899
c_issue_scrape_to <- 2899
c_url_base <- "https://www.perrypedia.proc.org/wiki/Quelle:PR"
c_folder_html <- "C:/tmp/"
C_folder_csv <- "C:/tmp/"
c_file_issues_cvs <- "PR_main_issues.csv"
c_file_characters_cvs <- "PR_main_characters.csv"
c_file_character_combinations_cvs <- "PR_main_character_combinations.csv"
c_file_locations_cvs <- "PR_main_locations.csv"
c_xpath_base <-
'//*[@id="mw-content-text"]/div[contains(@class, "perrypedia_std_rframe")]/table/tr['
flog.trace("constants set")
Then I set up a new dataframe that represents the information structure (a sketch follows the list):
- series
- season
- title
- subtitle
- author
- title_pic
- inbound_ill
- release
- maincharacters
- eventtimeframe
- locations
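A minimal sketch of this target structure as an empty tibble is shown below; note that the scraping section further down initialises pr_main_series differently (as a one-row tibble of column names that is dropped again after scraping).
# illustration only: the target structure as an empty tibble with typed columns
pr_main_series_structure <- tibble(
  series = character(), season = character(), title = character(),
  subtitle = character(), author = character(), title_pic = character(),
  inbound_ill = character(), release = character(),
  maincharacters = character(), eventtimeframe = character(),
  locations = character()
)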
Save HTML files locally from Perrypedia
Downloading all HTML files first has two benefits: I can perform any number of scraping attempts without fetching the pages from the web every time, which saves resources and time, and I can work on this project while I am offline.
The URL for each issue of the main series looks like https://www.perrypedia.proc.org/wiki/Quelle:PR2903, where the last four digits are the issue number. To build the URLs I need, I created the following function.
pr_get_issue_urls <- function(from = 2900, to = 2901) {
# build one URL per issue number and collect them row by row
for (i in from:to) {
tmp <- str_c(c_url_base, i)
if (i == from) {
# first iteration: start the result vector
result_vector <- rbind(tmp)
} else {
# later iterations: append the next URL
result_vector <- result_vector %>%
rbind(tmp)
}
}
return(result_vector)
}
#pr_get_issue_urls()
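As an aside, str_c() is vectorised, so the same URLs can also be built without a loop; a minimal sketch that returns a plain character vector instead of a matrix:
# sketch: str_c() recycles c_url_base over the whole sequence of issue numbers
pr_get_issue_urls_vec <- function(from = 2900, to = 2901) {
  str_c(c_url_base, from:to)
}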
The following function reads a web page and saves it to the given folder.
pr_get_html <- function(issue_number) {
# read the HTML page; use a separate variable name so the rvest function
# read_html() is not shadowed
page <- read_html(paste(c_url_base, issue_number, sep = ""))
flog.trace("got html for url: %s", paste(c_url_base, issue_number, sep = ""))
# save the HTML locally
write_xml(page, file = paste(c_folder_html, issue_number, ".html", sep = ""))
flog.trace("wrote html to: %s", paste(c_folder_html, issue_number, ".html", sep = ""))
}
The following loop then saves the issues from c_issue_load_from to c_issue_load_to (for the full run, issues 1 to 2899) to the folder.
for (i in c_issue_load_from : c_issue_load_to) {
pr_get_html(i)
}
flog.info("got HTML from issue %s to issue %s", c_issue_load_from, c_issue_load_to)
## INFO [2017-09-12 10:04:01] got HTML from issue 2899 to issue 2898
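To be polite to the server, the download loop can also pause briefly between requests; a sketch with an assumed delay of one second:
for (i in c_issue_load_from : c_issue_load_to) {
  pr_get_html(i)
  # short pause between downloads; the one-second delay is an assumption
  Sys.sleep(1)
}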
Scrape information from HTML
In order to scrape the information contained in the HTML page, I need the XPath query //*[@id="mw-content-text"]/div[contains(@class, "perrypedia_std_rframe")]/table/tr[i], where i is a number between 2 and 12. I store the query up to the row index in the variable c_xpath_base.
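Before running the full loop, a single row can be checked in isolation. A quick sketch, assuming issue 2899 has already been saved locally and using row 9, which the code below treats as the release date:
# sketch: extract one table row from a locally saved page
page_2899 <- read_html(paste(c_folder_html, 2899, ".html", sep = ""))
page_2899 %>%
  html_nodes(xpath = str_c(c_xpath_base, 9, "]")) %>%
  html_text()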
pr_main_series <- tibble("series", "season", "title", "subtitle", "author", "title_pic", "inbound_ill", "release", "maincharacters", "eventtimeframe", "locations")
flog.trace("structure setup")
# for all issues to analyze
for (k in c_issue_scrape_from : c_issue_scrape_to) {
# get the local file path for the issue
tmpurl = paste(c_folder_html, k, ".html", sep = "")
# position of table rows
from <- 2
to <- 12
# initiate result vector
result_vector <- c()
# read html page
read_html <- read_html(tmpurl)
# for all attributes
for (i in from:to) {
tmp <- read_html %>%
html_nodes(xpath = str_c(c_xpath_base, i, "]")) %>%
html_text() %>%
str_replace_all("\n", "") %>%
str_split(":")
#flog.trace("try to bind metadata to result_vector in function pr_read_metadata: %s",
# tmp)
result_vector <- tryCatch({
result_vector %>%
rbind(tmp[[1]][[2]])
},
error = function(cond) {
flog.error("could not bind: %s", cond)
# Choose a return value in case of error
#return(NA)
})
}
pr_main_series <- pr_main_series %>%
rbind(result_vector[, 1])
flog.trace("bound %s: ", result_vector[, 1])
}
flog.info("scraped metadata from issue %s to issue %s", c_issue_scrape_from, c_issue_scrape_to)
## INFO [2017-09-12 10:04:01] scraped metadata from issue 2899 to issue 2899
Analyze results and improve scraping, repeat step 3
For some issues (and hence HTML pages and tables), reading table rows 2 through 12 does not work. For example, issue 500 is laid out differently: its information is stored in rows 3 through 13. Hence, I amend the script to account for that. Moreover, there are exactly two issues without a release date and some issues without an inbound illustrator.
pr_main_series <- tibble("series", "season", "title", "subtitle", "author", "title_pic", "inbound_ill", "release", "maincharacters", "eventtimeframe", "locations")
flog.trace("structure setup")
# vector with issues that do not have a release date -> need to add a NA
issues_wo_releasedate <- c(1292, 1573)
# vector with issues that do not have an inbound illustrator -> need to add a NA
issues_wo_inill <- c(1301, 1576, 1795, 1796, 1797, 1799, 1907, 1908, 1909, 1910,
1912, 1913, 1914, 1915, 1917, 1919, 1920, 1921, 1922, 1927, 1929, 1931, 1935,
1936, 1941, 1943, 1951, 1954, 1955, 1956, 1959, 1960, 1961, 1962, 1965, 2383, 2500)
# for all issues to analyze
for (k in c_issue_scrape_from : c_issue_scrape_to) {
# get the local file path for the issue
tmpurl = paste(c_folder_html, k, ".html", sep = "")
# position of table rows
from <- 2
to <- 12
# issues that start a new cycle have their metadata in table rows 3 to 13
if (k %in% c(500, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600,
1700, 1800, 1900, 2000, 2700, 2750, 2800)) {
from <- 3
to <- 13
flog.trace("must be a start of a new cycle")
}
# initiate result vector
result_vector <- c()
# read html page
read_html <- read_html(tmpurl)
# for all attributes
for (i in from:to) {
# for all issues without an inbound illustrator
if (k %in% issues_wo_inill & i == 8) {
result_vector <- result_vector %>%
rbind(NA)
}
# for all issues without a release date
if (k %in% issues_wo_releasedate & i == 9) {
result_vector <- result_vector %>%
rbind(NA)
}
tmp <- read_html %>%
html_nodes(xpath = str_c(c_xpath_base, i, "]")) %>%
html_text() %>%
str_replace_all("\n", "") %>%
str_split(":")
#flog.trace("try to bind metadata to result_vector in function pr_read_metadata: %s",
# tmp)
result_vector <- tryCatch({
result_vector %>%
rbind(tmp[[1]][[2]])
},
error = function(cond) {
flog.error("could not bind: %s", cond)
# Choose a return value in case of error
#return(NA)
})
}
pr_main_series <- pr_main_series %>%
rbind(result_vector[, 1])
flog.trace("bound %s: ", result_vector[, 1])
}
flog.info("scraped metadata from issue %s to issue %s", c_issue_scrape_from, c_issue_scrape_to)
## INFO [2017-09-12 10:04:01] scraped metadata from issue 2899 to issue 2899
# drop the dummy first row that only holds the column names
pr_main_series <- pr_main_series[-1,]
# remove the quotes from the column names
names(pr_main_series) <- gsub("\"", "", names(pr_main_series), fixed = TRUE)
flog.trace("cleaned data structure")
Amend data with, for example, gender information
The data set is now scraped. Nevertheless, additional variables can be computed from the existing ones. I try to capture the following additional information:
- The release year from the release information, which is not structured
- Gender information for all authors based on first names
- Extract issue number
- Extract additional data sets for locations and main characters
Release year
Release information is provided in different formats, for example (1) Freitag, 8. September 1961, (2) 1961, or (3) April 1999. To get the year, I take the last four characters.
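A quick check of this approach on the three example formats might look like this:
# check the last-four-characters idea on the three example formats
x <- c("Freitag, 8. September 1961", "1961", "April 1999")
as.numeric(substr(x, nchar(x) - 3, nchar(x)))
## [1] 1961 1961 1999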
head(pr_main_series$release)
## [1] "Freitag, 10. März 2017"
pr_main_series$release_year <- as.numeric(substr(pr_main_series$release, nchar(pr_main_series$release)-3, nchar(pr_main_series$release)))
pr_main_series %>%
filter(is.na(release_year))
## # A tibble: 0 × 12
## # ... with 12 variables: series <chr>, season <chr>, title <chr>,
## # subtitle <chr>, author <chr>, title_pic <chr>, inbound_ill <chr>,
## # release <chr>, maincharacters <chr>, eventtimeframe <chr>,
## # locations <chr>, release_year <dbl>
# set all NAs to correct values
pr_main_series <- pr_main_series %>%
mutate(release_year = replace(release_year, grepl('1292', series), 1986),
release_year = replace(release_year, grepl('1573', series), 1991),
release_year = replace(release_year, grepl('2472', series), 2008),
release_year = replace(release_year, grepl('2680', series), 2012))
Extract issue number
# keep only the text between the parentheses, which holds the issue number
pr_main_series$series <- sub("\\).*", "", sub(".*\\(", "", pr_main_series$series))
pr_main_series
## # A tibble: 1 × 12
## series season title
## <chr> <chr> <chr>
## 1 Band 2899 Sternengruft Die Sternengruft
## # ... with 9 more variables: subtitle <chr>, author <chr>,
## # title_pic <chr>, inbound_ill <chr>, release <chr>,
## # maincharacters <chr>, eventtimeframe <chr>, locations <chr>,
## # release_year <dbl>
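To illustrate what the two nested sub() calls do, here is the transformation applied to a hypothetical raw value (the real values contain the issue number in parentheses as "(Band <number>)"):
# hypothetical example value; only the text between the parentheses is kept
sub("\\).*", "", sub(".*\\(", "", "Perry Rhodan 2899 (Band 2899)"))
## [1] "Band 2899"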
Extract additional data sets for locations and main characters
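The selections below also use a gender and an author_pos column for the authors, derived with the gender package. A minimal sketch of how these columns can be derived, assuming that multiple authors are separated by ", " in the author column and that the "ssa" method of the gender package is an acceptable approximation for the mostly German first names:
# sketch: one row per author, assuming authors are separated by ", "
pr_main_series <- pr_main_series %>%
  separate(author, c("A1", "A2"), sep = ", ", fill = "right") %>%
  gather(author_pos, author, A1, A2, na.rm = TRUE)

# look up gender by first name (the "ssa" method is an assumption)
first_names <- word(pr_main_series$author, 1)
gender_lookup <- gender(unique(first_names), method = "ssa") %>%
  select(name, gender)

pr_main_series <- pr_main_series %>%
  mutate(first_name = word(author, 1)) %>%
  left_join(gender_lookup, by = c("first_name" = "name")) %>%
  select(-first_name)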
# Main characters
pr_main_character <- pr_main_series %>%
select(series, maincharacters, author, gender, author_pos, release_year)
# first split the characters
pr_main_character <- pr_main_character %>%
separate(maincharacters, c("C1", "C2", "C3", "C4", "C5", "C6"), sep = ", ")
# second gather characters (making a wide dataframe into a long one)
pr_main_character <- pr_main_character %>%
gather(character_pos, maincharacter_name, C1, C2, C3, C4, C5, C6, na.rm = TRUE) %>%
arrange(series)
# main character combinations
tmp <- pr_main_character %>%
select(series, maincharacter_name, release_year)
pr_main_character_combinations <-
expand.grid(unique(as.factor(tmp$maincharacter_name)), unique(as.factor(tmp$maincharacter_name)))
# get the vector of issues that each main character appears in
lauth <- tapply(as.factor(tmp$series), as.factor(tmp$maincharacter_name), FUN=function(x) paste(x))
# count how many issues two characters have in common
myfun <- function(x,y) sum(lauth[[x]] %in% lauth[[y]])
flog.trace("apply function to columns of dataframe, might take a long time")
pr_main_character_combinations$count <- mapply(myfun, x=pr_main_character_combinations$Var2, y=pr_main_character_combinations$Var1)
# remove all combinations with count 0
pr_main_character_combinations <- pr_main_character_combinations %>%
filter(count > 0)
# change var names
pr_main_character_combinations <- pr_main_character_combinations %>%
select(character_1 = Var1, character_2 = Var2, count = count)
rm(tmp, lauth, myfun)
# locations
pr_location <- pr_main_series %>%
select(series, locations, author, gender, author_pos, release_year)
# first split the locations
pr_location <- pr_location %>%
separate(locations, c("l1", "l2", "l3", "l4", "l5", "l6", "l7"), sep = ", ")
# second gather locations (making a wide dataframe into a long one)
pr_location <- pr_location %>%
gather(location_pos, location_name, l1, l2, l3, l4, l5, l6, l7, na.rm = TRUE) %>%
arrange(series)
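As a quick sanity check of the character combinations computed above, one can look at the most frequent pairings, for example:
# show the most frequent character pairings, excluding self-pairs
pr_main_character_combinations %>%
  filter(character_1 != character_2) %>%
  arrange(desc(count)) %>%
  head()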
Save the resulting datasets to CSV files
write_csv(pr_main_series, paste(C_folder_csv, c_file_issues_cvs, sep = ""), append = FALSE)
write_csv(pr_main_character, paste(C_folder_csv, c_file_characters_cvs, sep = ""), append = FALSE)
write_csv(pr_location, paste(C_folder_csv, c_file_locations_cvs, sep = ""), append = FALSE)
write_csv(pr_main_character_combinations, paste(C_folder_csv, c_file_character_combinations_cvs, sep = ""), append = FALSE)
flog.info("wrote csv")
## INFO [2017-09-12 10:04:03] wrote csv
Conclusion
In this blog post, I created four Perry Rhodan datasets (issues, main characters, character combinations, and locations) from the information available in the Perrypedia wiki. I learned and improved my understanding of:
- How to scrape information from web pages
- How to use basics like the %in% operator
- How to clean data
- How to apply the gender package
You can find the generated files in my GitHub repo.
Please feel free to ask me any questions or to point out better solutions to some of the things I did.
gresch