AndrewPearson.org

Andrew Pearson's Little Corner of the Internet...

Friday, July 30, 2010

Android HTML Parsing

Another common task in smartphone app development is parsing a webpage. Maybe you want to display a subset of the data on the page to the user of your app. Maybe you are parsing the webpage for information that your app will use internally. Either way, the Android API does not provide an easy way to do this; thus the necessity of this
blogpost.

There are many approaches one can take to accomplish this task. Some (see: idiots) advocate parsing HTML pages like long strings, using regexes or some other "roll-your-own" approach. Some prescribe using a SAX parser (treating HTML like XML), which is bug-prone if the HTML isn't properly formed. I recommend using a free HTML parsing library. A good choice is the aptly, yet unoriginally, named HtmlCleaner. Though it doesn't fully support XPath (more on this in a bit) like its competitor, TagSoup, it is a bit smaller, which matters because you have to bundle the library inside your app. If you want to use TagSoup instead of HtmlCleaner, I would bet that the steps in the rest of this tutorial are more or less the same, though I have not tested them.

Anyway, let's outline exactly what we want to do.
  • Open up some webpage
  • Programmatically extract some information from it.
  • Do something with that information.

As an example of a webpage to parse, I will once again draw from the development of my archive.org app. Below is a screenshot of the type of page that we will be parsing. The information that we are looking for is the URL and title for each song listed on the page.

We can see the information that we want in the table titled "Audio Files", which is itself inside a table titled "Individual Files" a little way down the page. Viewing the HTML source for the page reveals a tangle of nested table, tr, and td tags, all with various attributes. Though it might appear difficult to sort through this mess, we can clean things up with just a few lines of code. Below is the code that will parse this page for exactly what we want:

// Create an HtmlCleaner object to turn the page into
// XML that we can analyze to get the songs from the page.
HtmlCleaner pageParser = new HtmlCleaner();
CleanerProperties props = pageParser.getProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);

try {
    URLConnection conn = url[0].openConnection();
    TagNode node = pageParser.clean(new InputStreamReader(conn.getInputStream()));

    // XPath string for locating download links...
    // XPath says "select all 'tr' elements contained in any 'table' element
    // whose 'class' attribute equals 'fileFormats'".
    String xPathExpression = "//table[@class='fileFormats']//tr";
    try {
        // Stupid API returns Object[]... Why not TagNodes? We'll cast them later.
        Object[] downloadNodes = node.evaluateXPath(xPathExpression);

        // Iterate through the rows selected by the XPath expression...
        boolean reachedSongs = false;
        for (Object linkNode : downloadNodes) {
            // The song titles and locations are listed between two header rows.
            // Ignore all other rows to save a little time and battery...
            String s = pageParser.getInnerHtml(((TagNode) linkNode).getChildTags()[0]);
            if (!reachedSongs) {
                if (s.equals("Audio Files")) {
                    reachedSongs = true;
                }
                continue;
            }
            if (s.equals("Information") || s.equals("Other Files")) {
                break;
            }

            // Recursively find all nodes that have "href" (link) attributes, and
            // store the link values in an ArrayList. The title of the track is the
            // inner HTML of the row's first child node.
            TagNode[] links = ((TagNode) linkNode).getElementsHavingAttribute("href", true);
            ArrayList<String> stringLinks = new ArrayList<String>();
            for (TagNode t : links) {
                stringLinks.add(t.getAttributeByName("href"));
            }
            String title = pageParser.getInnerHtml(
                    (TagNode) ((TagNode) linkNode).getChildren().get(0)).trim();
            System.out.println(title);
            System.out.println(stringLinks);
        }
    } catch (XPatherException e) {
        Log.e("ERROR", e.getMessage());
    }
} catch (IOException e) {
    Log.e("ERROR", e.getMessage());
}

The first thing that we do is set up an HtmlCleaner object. We set a few properties on it, and then it is ready to use. We call its clean() method on the URL's input stream. This returns a TagNode for the root node of the document. A TagNode is a crucial part of the HtmlCleaner API: it represents a node in an XML document, and you can use the API to work with its elements, attributes, and child nodes.

The next step greatly reduces the amount of processing that we have to do on the webpage. Instead of having to worry about EVERY subnode of the root node of the document, we can use an XPath String to ask for only a subset of those nodes. We define the String xPathExpression to be "//table[@class='fileFormats']//tr". Calling evaluateXPath() with this String says, in effect, "return all tr elements (table rows) contained in any table element whose class attribute equals 'fileFormats'". We receive an array of Objects (which are really TagNodes) from this method.
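To see the expression work in isolation, here is the same query run with the standard javax.xml.xpath API against a small well-formed snippet. This is only a sketch: countRows is a made-up helper name, and note that HtmlCleaner's own evaluateXPath supports just a subset of full XPath, so not every expression that works here will work there.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class XPathDemo {
    // Count the rows matched by the same XPath expression used above,
    // here against the standard javax.xml DOM instead of HtmlCleaner.
    static int countRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate(
                "//table[@class='fileFormats']//tr", doc, XPathConstants.NODESET);
        return rows.getLength();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><table class=\"fileFormats\">"
                + "<tr><td>Audio Files</td></tr>"
                + "<tr><td><a href=\"song.mp3\">Song</a></td></tr>"
                + "</table></body></html>";
        System.out.println(countRows(page)); // prints 2
    }
}
```

Tables without class="fileFormats" are simply never matched, which is what lets us ignore the rest of the page.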

Now we have a collection of TagNodes that make up the table containing the information we want. The problem is that the table also holds lots of extraneous information that we don't want. In fact, we don't care about anything in the table before the "Audio Files" subheading, and we don't care about anything after those files have been listed. Instead of wasting time (and battery power) processing those TagNodes, I define a boolean called reachedSongs and use it to skip over nodes until we reach the information we care about. The "Audio Files" subheading will be the inner HTML of the first child of one of the nodes returned from our XPath evaluation. After the files, there is a subheading called "Information"; we know to break out of our loop once we hit it.
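That skipping logic boils down to a tiny state machine over each row's first-cell text. A standalone sketch (RowFilter and between are names of my own invention, not part of HtmlCleaner):

```java
import java.util.ArrayList;
import java.util.List;

public class RowFilter {
    // Keep only the entries strictly between the "Audio Files" header
    // and the first terminator header ("Information" or "Other Files").
    static List<String> between(List<String> firstCells) {
        List<String> kept = new ArrayList<>();
        boolean reachedSongs = false;
        for (String s : firstCells) {
            if (!reachedSongs) {
                // Still above the songs: flip the flag on the header row,
                // but never keep anything in this region.
                if (s.equals("Audio Files")) reachedSongs = true;
                continue;
            }
            if (s.equals("Information") || s.equals("Other Files")) break;
            kept.add(s);
        }
        return kept;
    }
}
```

Rows before "Audio Files" and after "Information" are never examined beyond a string comparison, which is the whole battery-saving point.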

In between the "Audio Files" and "Information" subheadings is where we actually analyze our nodes. Each node represents a tr (table row) element. Each row contains several td elements: the inner HTML of the first td element is the song title, and any link (href attribute) within the row points to a particular version of the song (64kb, VBR, FLAC, etc.). We grab this information for each song.
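As a rough standalone sketch of that per-row extraction, here is the same idea written against the standard org.w3c.dom API. RowExtractor and its methods are hypothetical helper names; with HtmlCleaner itself you would use getChildTags(), getInnerHtml(), and getElementsHavingAttribute("href", true) as in the listing above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class RowExtractor {
    // Parse one well-formed <tr>...</tr> fragment into a DOM element.
    static Element parseRow(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // The song title is the text of the row's first td cell.
    static String title(Element row) {
        return row.getElementsByTagName("td").item(0).getTextContent().trim();
    }

    // Each anchor's href in the row is a download link for one format.
    static List<String> links(Element row) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = row.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }
}
```

The title and the list of links are exactly the two pieces you would hand to a song object (the ArchiveSongObj of my app) or, as in the listing above, just print out.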

7 comments:

  1. Thanks for this! There is a lack of a good htmlcleaner tutorials, and yours really helped me get things working.

  2. im looking to just clear out all the html from an rss feed and leave the plaintxt because the html is messing up my app, and cant figure it out =(

  3. hello
    first of all thank u for sharing ur experience with us..
    i've tried ur code on eclipse in an android project and on the field of
    "TagNode[] links = ((TagNode)linkNode).getElementsHavingAttribute("href", true);" i got an error of -- linkNode cannot be resolved to a variable...
    do u have any idea this error code is standing for..
    and another thing it might be related to java scope error which is i am new on.. :) if so sorry for bothering .. but ill be happy if u still can help thanx...

  4. Hi Andrew,
    Your example is supper simple using the htmlcleaner.
    I am a beginner of Java and Android. It takes two day job to run example. if you provides two information. It will be save time for beginner. One is that where is htmlcleaner, how to compile the htmlcleaner source or how to generate library for using it. I found it needs "ant.jar" and "jdom.jar" libraries.
    Now I can try to parse any web page. Thanks you again!,
    Redsock

  5. i want to get links of videos from youtube.i try a lot of code but get no success yet. anyone help me how i get it?

  6. hey where is the link of the page in above code..from u are getting the href of the songs...
