As I am taking an online class on getting and cleaning data in R, I am learning about different data formats including xml. This seemed like a good opportunity to lear how to convert a simple xml to a tibble and do a little data cleaning.
Wikipedia defines xml as:
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
The course I am taking refered to a very simple xml file from w3school:
<breakfast_menu> <food> <name>Belgian Waffles</name> <price>$5.95</price> <description> Two of our famous Belgian Waffles with plenty of real maple syrup </description> <calories>650</calories> </food> <food> <name>Strawberry Belgian Waffles</name> <price>$7.95</price> <description> Light Belgian waffles covered with strawberries and whipped cream </description> <calories>900</calories> </food> <food> <name>Berry-Berry Belgian Waffles</name> <price>$8.95</price> <description> Light Belgian waffles covered with an assortment of fresh berries and whipped cream </description> <calories>900</calories> </food> <food> <name>French Toast</name> <price>$4.50</price> <description> Thick slices made from our homemade sourdough bread </description> <calories>600</calories> </food> <food> <name>Homestyle Breakfast</name> <price>$6.95</price> <description> Two eggs, bacon or sausage, toast, and our ever-popular hash browns </description> <calories>950</calories> </food> </breakfast_menu>
The above xml file is wrapped in the root element breakfast menu which has four food items has child elements. Each menu has the same four elements being name, price, description and calories. So we have four menus with four elements each, in other words, a dataset amenable to be converted in tabular form.
Tibbles are part of the tidyverse where they are defined as:
a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.
As I am trying to learn more about the Tidyverse, I will be converting the xml to a dataframe and then to a tibble.
The following steps were done to convert a simple xml in to a simple tibble:
- load packages
- get the data
- parse the data
- convert xml to dataframe
- convert the data.frame to a tibble and data cleaning
I uploaded the script for the above here and explained each step below.
1. load the packages
This simple exercise can be done in R base, however, as I am learning about the tidyverse this is the approach I used.
At the time of writing this post, the latest version of tidyverse is 1.2.0 and the core tidyverse includes eight packages being ggplot, dplyr, tidyr, readr, purrr, tibble, stringr and forcats. Loading the tidyverse package implicitly loads all core tidyverse packages. However, as I am learning about which library does what, I explicitly loaded the libraries I needed from the core tidyverse:
- tibble: as stated above, a tibble is modern version of a dataframe. I used the
as_tibble()function from this package.
- dplyr: as described in the package, “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”. I used the
transmute()function from this package.
- readr: as described in the package, ” the goal of ‘readr’ is to provide a fast and friendly way to read rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.” I used the
parse_number()function from this package.
In addition to packages from the tidyverse, I also needed:
- XML: as described inthe package, “XML is a collection of functions allow us to add, remove and replace children from an XML node andalso to and and remove attributes on an XML node.” I used the
xmlToDataFrame()functions from this package.
- xml2: as described in the package, “xml2 turns an XML document (or node or nodeset) into the equivalent R list.” I used the
read_xml()function from the package. Note that it can handle xml sourced from https sites.
I always start my scripts by clearing all objects from the working space with
rm(list=ls()). Note that this command will however not remove previously loaded packages.
The set-up for this script was the following:
rm(list=ls()) library(tibble) library(dplyr) library(readr) library(XML) library(xml2)
2. get the data
I saved the url into a variable and then use the
read_xml() function to get the data. This could also have been done in a single line.
file_url <- "https://www.w3schools.com/xml/simple.xml" data <- read_xml(file_url)
At this point,
data is a xml_document that holds the content of the xml document including the labels and the text.
3. parse the data
I then used the
xmlParse() function from the XML package to recognize the grammar of xml and extract it in a xml format.
doc <- xmlParse(data)
At this point,
doc is a XMLInternalDocument that contains the same information as
data but in an xml format with the tree structure.
4. convert xml to data.frame
This is when I could convert xml into a data.frame using the
xmlToDataFrame() function from the XML package.
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//food"))
At this point,
df is is a data.frame with five observations and four variables.
5. convert data.frame to a tibble and data cleaning
Once I had the data.frame, I could finally convert it to a tibble with the
as_tibble() function from the tibble package. While I was at it, I decided to do a little data clean-up with the
transmute() function from the dplyr package.
- as the initial price column included prices stored as characters with the dollar sign symbol, I renamed the column to price_dollars through the
transmute()function and removed the dollar sign with the
parse_number()function from the readr package which also automatically converted the class of the variable from character to numeric. Note that this could have been done in R base with
- as the calories column was also stored as characters, I used the
as.numeric()function from R base to convert the class of the variable from character to numeric. Note that this could have been done in tidyverse with
parse_number(calories)but it seemed unnecessary.
tb <- as_tibble(df) %>% transmute( name = name, price_dollars = parse_number(price), description = description, calories = as.numeric(calories) )
Note that the
transmute() function drops all variables from initial tibble, hence the need to include the name and the description columns in the code above. Obviously, this was acceptable for this simple example, but in the case of a larger dataset, another strategy would be needed.
The above is just one way of converting a simple xml to tibble. I am sure numerous other approaches could be taken.