So Frank and I were scraping data off the web (with permission, of course) using R’s XML package when suddenly a wild error appeared! A quick search brought up a few StackOverflow posts and blogs offering common solutions. At first glance, we thought that the URL query string was the cause of our woes - the xmlParse() function could not read the unescaped ampersands!

library(XML)
doc <- "http://somesite.com/xml.php?Y2=2005&max=100"
xmlParse(doc)

Error: EntityRef: expecting ';'

However, whether we passed the URL without the escaped ampersands:

"somesite.com/results_xml.php?Y2=2005**&**max=100"

or with the escaped ampersands:

"somesite.com/results_xml.php?Y2=2005**&amp;**max=100"

we still received the exact same error:

Error: EntityRef: expecting ';'

On the verge of pulling our hair out, we decided to examine the XML data more thoroughly. Only then did we find our problem child:

[Image: excerpt of the XML data showing the text "M&M's"]

Never thought I'd say this, but M&M's just left a bad taste in my mouth.

There were unescaped ampersands in the XML file itself! These sneaky guys had been triggering the “EntityRef” error in the xmlParse() function all along. With this in mind, we tweaked our approach: download the XML as plain text, replace every unescaped ampersand with its escaped form, and only then parse it.
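
Here is a minimal sketch of that kind of cleanup, not necessarily the exact code we ran. The helper names (find_bare_ampersands, escape_bare_ampersands) and the sample data are made up for illustration, and the pattern is a simplification: it treats an ampersand as "already escaped" only when it starts something shaped like a named, decimal, or hexadecimal reference, and it does not treat CDATA sections or comments specially.

```r
# Matches "&" that is NOT the start of a named (&amp;), decimal (&#38;),
# or hexadecimal (&#x26;) reference. Needs perl = TRUE for the lookahead.
bare_amp <- "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"

# Hypothetical helper: indices of the lines that contain a bare ampersand.
find_bare_ampersands <- function(xml_lines) {
  grep(bare_amp, xml_lines, perl = TRUE)
}

# Hypothetical helper: escape bare ampersands, leave existing references alone.
escape_bare_ampersands <- function(xml_lines) {
  gsub(bare_amp, "&amp;", xml_lines, perl = TRUE)
}

# Made-up sample standing in for the downloaded file.
sample_xml <- c(
  "<candies>",
  "  <candy>M&M's</candy>",
  "  <candy>Salt &amp; Vinegar</candy>",
  "</candies>"
)

find_bare_ampersands(sample_xml)  # returns 2, the M&M's line
fixed <- escape_bare_ampersands(sample_xml)

# Parse the cleaned text only if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(paste(fixed, collapse = "\n"), asText = TRUE)
}
```

With real data, the lines would come from something like readLines() on the URL rather than a hard-coded vector; the important part is that the text is cleaned before xmlParse() ever sees it, which is why asText = TRUE is used.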

So lesson learned: always check to see if you have a valid XML file before attempting to parse the data using any sort of scraper.