AndrewPearson.org

Andrew Pearson's Little Corner of the Internet...

Tuesday, January 17, 2012

How To Bypass The Wikipedia SOPA / PIPA Blackout Page

So Wikipedia is blacking itself out from midnight tonight until midnight tomorrow. I was interested to see whether Wikipedia had actually blocked traffic to its English-language website, so I poked around in the HTML for the main English Wikipedia page: http://en.wikipedia.org/wiki/Main_Page . It turns out to be extremely easy to get around the blackout page, and I describe how to do so below. If this works for you, or you have any interesting observations, leave a comment for everyone to see.

What I discovered is that Wikipedia just uses a simple script to pop up a banner over the regular page, which is still sitting right below it. The banner is trivial to remove: just disable JavaScript on Wikipedia, or run a script-blocking program when you view the site. NoScript (http://noscript.net/) should work for Firefox, or NotScripts (https://chrome.google.com/webstore/detail/odjhifogjcknibkahlpidmdajjpkkcfn) for Chrome, if you choose to go this route. You could also just block the blackout banner script that I describe below.

If you are interested in the mechanics of this "blackout" banner, read on:

Inspecting the scripts running on the page, I see this one:
http://meta.wikimedia.org/w/index.php?title=Special:BannerLoader&banner=blackout&campaign=English+Wikipedia+Blackout&userlang=en&db=enwiki&sitename=Wikipedia&country=US
Inside, we get this:

insertBanner({
"bannerName": "blackout",
"bannerHtml": "<style>\n#mw-sopaOverlay {\n \/* Opera Mini doens't like position absolute *\/\n \/* iOS doesn't like position fixed *\/\n top: 0;\n left: 0;\n width: 100%;\n height: 100%;\n z-index: 500;\n color: #dedede;\n background: black url(\/\/upload.wikimedia.org\/wikipedia\/commons\/9\/98\/WP_SOPA_Splash_Full.jpg) no-repeat 0 0;\n overflow: auto;\n font-family:Times New Roman;\n}\n\n\/* Monobook requires position, otherwise background is white+book instead of black *\/\nbody.skin-monobook #mw-sopaOverlay {\n position: absolute;\n}\n\n#mw-sopaColumn {\n position: absolute;\n top: 80px;\n left: 420px;\n width: 400px;\n color: #dedede;\n padding-bottom: 30px;\n}\n#mw-sopaHeadline {\n font-size: 1.7em;\n margin-bottom: 0.5em;\n color: #fff;\n overflow: hidden;\n text-align: justify;\n}\n#mw-sopaText { \n margin-bottom: 1.5em;\n text-align: justify;\n}\n#mw-sopaColumn a {\n color: #eee;\n text-decoration: underline;\n}\n#mw-sopaColumn a:hover {\n color: #fff;\n cursor: pointer;\n text-decoration: underline;\n}\n#mw-sopaColumn a.action {\n margin-top: 2px;\n}\n.mw-sopaActionDiv {\n margin-left: 1em;\n margin-bottom: 1em;\n}\n\n.mw-sopaActionHead {\n font-weight: bold;\n}\n.mw-sopaSocial {\n float: left;\n text-align: center;\n margin-right: 12px;\n margin-bottom: 3px;\n font-size: small;\n}\n.mw-sopaSocial a {\n text-decoration: none;\n}\n<\/style>\n\n\n\n<script type=\"text\/javascript\">\n( function ($) {\n\t\/\/ n.b. community has decided for full blackout including Special pages,\n\t\/\/ but it is hard to do stuff in the meantime without this.\n\tvar namespaceWhitelist = ['Special'];\n\n\t\/\/ see Wikipedia:SOPA_initiative\/Blackout_screen_testing\n\tvar pageWhitelist = [\n'Stop Online Piracy Act',\n'PROTECT IP Act',\n'Online Protection and Enforcement of Digital Trade Act',\n'Censorship',\n'Special:CongressLookup',\n'Special:NoticeTemplate',\n'Special:NoticeTemplate\/view',\n'Wikipedia:Text of Creative Commons Attribution-ShareAlike 3.0 Unported License',\n'Wikipedia:General disclaimer',\n'Wikipedia:Contact us',\n'Wikipedia:About',\n'Wikipedia:Copyright violations',\n'Wikipedia:Copyrights',\n'Wikipedia:Five pillars',\n'Digital Millennium Copyright Act',\n'DNS cache poisoning',\n'Censorship',\n'Wikipedia:SOPA initiative',\n'Wikipedia:SOPA initiative\/Action',\n'Wikipedia:SOPA initiative\/Actions by other communities',\n'Wikipedia:SOPA initiative\/Media',\n'Wikipedia:SOPA initiative\/Learn more',\n'Wikipedia:SOPA initiative\/Legal overview',\n'Wikipedia:SOPA initiative\/Take action',\n'SOPA',\n'PIPA',\n'OPEN',\n'Censorship',\n'Special:CongressLookup'\n\t];\n\tvar geoHasUsRep = [\n\t\t'US', \/\/ USA\n\t\t'PR', \/\/ Puerto Rico\n\t\t'VI', \/\/ Virgin Islands\n\t\t'MP', \/\/ Northern Mariana Islands\n\t\t'AS', \/\/ American Samoa\n\t\t'GU' \/\/ Guam\n\t];\n\tvar preload = [];\n\tvar i;\n\n\t\/\/ Exclude some namespaces\n\tif ( $.inArray( wgCanonicalNamespace, namespaceWhitelist ) !== -1 ) {\n\t\treturn;\n\t}\n\n\t\/\/ Exclude some individual pages\n\tfor ( i = 0; i < pageWhitelist.length; i++ ) {\n\t\tif ( pageWhitelist[i] === wgPageName || pageWhitelist[i] === wgPageName.replace( \/_\/g, ' ' ) ) {\n\t\t\treturn;\n\t\t}\n\t}\n\n var urlParams = {};\n (function () {\n var e,\n a = \/\\+\/g, \n r = \/([^&=]+)=?([^&]*)\/g,\n d = function (s) { return decodeURIComponent(s.replace(a, \" \")); },\n q = window.location.search.substring(1);\n\n while (e = r.exec(q)) {\n urlParams[d(e[1])] = d(e[2]);\n }\n })();\n\n var country = 'ZZ';\n if ( urlParams.country ) {\n country = urlParams.country;\n 
} else if ( window.Geo && window.Geo.country ) {\n country = window.Geo.country;\n }\n \n\tvar hasUsRep = false;\n for ( i = 0; i < geoHasUsRep.length; i++ ) {\n\t\tif ( geoHasUsRep[i] === country ) {\n\t\t hasUsRep = true;\n\t\t break;\n\t\t}\n\t}\n\n\tvar overlay = $('<div id=\"mw-sopaOverlay\"><\/div>');\n\tvar column = $('<div id=\"mw-sopaColumn\"><\/div>');\n\tvar headline = $('<div id=\"mw-sopaHeadline\">Imagine a World<br \/>Without Free Knowledge<\/div>');\n\tvar intro = $('<div id=\"mw-sopaText\"><p>For over a decade, we have spent millions of hours building the largest encyclopedia in human history. Right now, the U.S. Congress is considering legislation that could fatally damage the free and open internet. For 24 hours, to raise awareness, we are blacking out Wikipedia. <a href=\"http:\/\/en.wikipedia.org\/wiki\/Wikipedia:SOPA_initiative\/Learn_more\" target=\"_blank\">Learn more.<\/a><\/p><\/div>');\n\tvar validateZip = function(zip) {\n\t\treturn \/^\\s*[0-9]{5}([- ]?[0-9]{4})?\\s*$\/.test(zip);\n\t};\n\n\tvar action = $('<div id=\"mw-sopaAction\"><\/div>');\n\tif ( hasUsRep ) {\n\t\taction.append( $('<p class=\"mw-sopaActionHead\">Contact your representatives.<\/p><div class=\"mw-sopaActionDiv\"><form action=\"\/wiki\/Special:CongressLookup\" action=\"GET\"><label for=\"zip\">Your zip code:<\/label> <input name=\"zip\" type=\"text\" size=\"5\"> <input id=\"sopa-zipform-submit\" type=\"submit\" value=\"Look up\"><\/form><\/div>' ) );\n\n\t\t\/*\n\t\taction.find('#sopa-zipform-submit').click(\n\t\t\tfunction(e) {\n\t\t\t\tvar enteredZip = action.find('input[name=\"zip\"]').val();\n\t\t\t\tif ( ! validateZip( enteredZip ) ) {\n\t\t\t\t\talert( 'You've entered an invalid zip code.');\n\t\t\t\t\te.preventDefault();\n\t\t\t\t\taction.find('input[name=\"zip\"]').focus();\n\t\t\t\t}\n\t\t\t} );\n\t\t*\/\n\n\t} else {\n var $socialDiv = $('<div>');\n \n \n var socialSites = [\n {\n url: 'https:\/\/www.facebook.com\/sharer.php?u=' + encodeURIComponent( 'http:\/\/tinyurl.com\/7vq4o8g' ),\n title: 'Facebook',\n hi: '\/\/upload.wikimedia.org\/wikipedia\/commons\/b\/b9\/WP_SOPA_sm_icon_facebook_ffffff.png',\n icon: '\/\/upload.wikimedia.org\/wikipedia\/commons\/2\/2a\/WP_SOPA_sm_icon_facebook_dedede.png',\n 'popup': false\n },\n {\n url: 'https:\/\/m.google.com\/app\/plus\/x\/?v=compose&content=' + encodeURIComponent( 'I support the January 18th Wikipedia blackout to protest SOPA and PIPA. Show your support here http:\/\/tinyurl.com\/7vq4o8g' ),\n title: 'Google+',\n hi: '\/\/upload.wikimedia.org\/wikipedia\/commons\/a\/a1\/WP_SOPA_sm_icon_gplus_ffffff.png',\n icon: '\/\/upload.wikimedia.org\/wikipedia\/commons\/0\/08\/WP_SOPA_sm_icon_gplus_dedede.png',\n 'popup': true\n },\n {\n url: 'https:\/\/twitter.com\/intent\/tweet?original_referer=' + encodeURIComponent( window.location ) + '&text=' + encodeURIComponent( 'I support #wikipediablackout! 
Show your support here http:\/\/tinyurl.com\/7vq4o8g' ),\n title: 'Twitter',\n hi: '\/\/upload.wikimedia.org\/wikipedia\/commons\/8\/8a\/WP_SOPA_sm_icon_twitter_ffffff.png',\n icon: '\/\/upload.wikimedia.org\/wikipedia\/commons\/4\/45\/WP_SOPA_sm_icon_twitter_dedede.png',\n 'popup': false\n }\n ];\n\n for ( i = 0; i < socialSites.length; i++ ) {\n ( function ( site ) {\n function linkify( $item ) {\n var $link = $( '<a><\/a>' )\n .css( 'text-decoration', 'none' )\n .attr( 'href', site.url )\n .append( $item );\n var target = 'wpblackout_' + site.title + '_share';\n if ( site.popup ) {\n $link.click( function() {\n window.open(\n site.url,\n target,\n 'resizable=yes,width=500,height=300,left=' + (screen.availWidth\/2-250) + ',top=' + (screen.availHeight\/2-150)\n );\n return false;\n } );\n } else {\n $link.attr( 'target', target );\n }\n return $link;\n }\n var $icon = $( '<img><\/img>' ).attr( { 'width': 33, 'height': 33, 'src': site.icon } );\n var $iconLink = linkify( $icon );\n preload.push( site.hi );\n var $wordLink = linkify( site.title );\n var $div = $( '<div class=\"mw-sopaSocial\"><\/div>' );\n $div.hover(\n function() {\n $icon.attr( 'src', site.hi );\n $wordLink.css( 'color', '#fff' );\n },\n function() {\n $icon.attr( 'src', site.icon );\n $wordLink.css( 'color', '#dedede' );\n });\n $div.append( $iconLink, $('<br>'), $wordLink );\n $socialDiv.append($div);\n } )( socialSites[i] );\n }\n action.append(\n $( '<p class=\"mw-sopaActionHead\">Make your voice heard<\/p>' ),\n $( '<div class=\"mw-sopaActionDiv\"><\/div>' ).append(\n $socialDiv,\n $( '<div style=\"clear: both;\"><\/div>' )\n )\n );\n\n\t}\n\n\tcolumn.append( headline, intro, action );\n\toverlay.append( column );\n\n\t$('body').children().hide();\n\t$('body').append(overlay);\n\t$('<style id=\"mw-sopa-blackout\">#mw-page-base, #mw-head-base, #content, #mw-head, #mw-panel, #footer { display: none; }<\/style>').appendTo('head');\n\n\tvar preloaded = [];\n\tfor ( i = 0; i < preload.length; i++ ) {\n\t\tpreloaded[i] = new Image();\n\t\tpreloaded[i].src = preload[i];\n\t}\n\n} )(jQuery);\n<\/script>",
"campaign": "English Wikipedia Blackout",
"fundraising": "0",
"autolink": "0",
"landingPages": ""
});


Notice that the third line ("bannerHtml") contains the HTML for the entire blackout banner. The above code calls a function named insertBanner, which is defined here:
http://en.wikipedia.org/w/index.php?title=Special:BannerController&cache=/cn.js&303-4

It contains the following:
function insertBanner( bannerJson ) {
jQuery( 'div#centralNotice' ).prepend( bannerJson.bannerHtml );
if ( bannerJson.autolink ) {
var url = 'https://wikimediafoundation.org/wiki/Special:LandingCheck';
if ( ( bannerJson.landingPages !== null ) && bannerJson.landingPages.length ) {
targets = String( bannerJson.landingPages ).split(',');
url += "?" + jQuery.param( {
'landing_page': targets[Math.floor( Math.random() * targets.length )].replace( /^\s+|\s+$/, '' )
} );
url += "&" + jQuery.param( {
'utm_medium': 'sitenotice', 'utm_campaign': bannerJson.campaign,
'utm_source': bannerJson.bannerName, 'language': wgUserLanguage,
'country': Geo.country
} );
jQuery( '#cn-landingpage-link' ).attr( 'href', url );
}
}
}
function hideBanner() {
jQuery( '#centralNotice' ).hide(); // Hide current banner
var bannerType = $.centralNotice.data.bannerType;
if ( bannerType === undefined ) bannerType = 'default';
setBannerHidingCookie( bannerType ); // Hide future banners of the same type
}
function setBannerHidingCookie( bannerType ) {
var e = new Date();
e.setTime( e.getTime() + (14*24*60*60*1000) ); // two weeks
var work='centralnotice_'+bannerType+'=hide; expires=' + e.toGMTString() + '; path=/';
document.cookie = work;
}
// This function is deprecated
function toggleNotice() {
hideBanner();
}
var wgNoticeToggleState = (document.cookie.indexOf('hidesnmessage=1')==-1);

( function( $ ) {
$.ajaxSetup({ cache: true });
$.centralNotice = {
'data': {
'getVars': {},
'bannerType': 'default'
},
'fn': {
'loadBanner': function( bannerName, campaign, bannerType ) {
// Store the bannerType in case we need to set a banner hiding cookie later
$.centralNotice.data.bannerType = bannerType;
// Get the requested banner
var bannerPageQuery = $.param( {
'banner': bannerName, 'campaign': campaign, 'userlang': wgUserLanguage,
'db': wgDBname, 'sitename': wgSiteName, 'country': Geo.country
} );
var bannerPage = '?title=Special:BannerLoader&' + bannerPageQuery;
var bannerScript = '<script type="text/javascript" src="//meta.wikimedia.org/w/index.php' + bannerPage + '"></script>';
if ( document.cookie.indexOf( 'centralnotice_'+bannerType+'=hide' ) == -1 ) {
jQuery( '#siteNotice' ).prepend( '<div id="centralNotice" class="' +
( wgNoticeToggleState ? 'expanded' : 'collapsed' ) +
' cn-' + bannerType + '">' + bannerScript + '</div>' );
}
},
'loadBannerList': function( geoOverride ) {
if ( geoOverride ) {
var geoLocation = geoOverride; // override the geo info
} else {
var geoLocation = Geo.country; // pull the geo info
}
var bannerListQuery = $.param( { 'language': wgContentLanguage, 'project': wgNoticeProject, 'country': geoLocation } );
var bannerListURL = wgScript + '?title=' + encodeURIComponent('Special:BannerListLoader') + '&cache=/cn.js&' + bannerListQuery;
var request = $.ajax( {
url: bannerListURL,
dataType: 'json',
success: $.centralNotice.fn.chooseBanner
} );
},
'chooseBanner': function( bannerList ) {
// Convert the json object to a true array
bannerList = Array.prototype.slice.call( bannerList );

// Make sure there are some banners to choose from
if ( bannerList.length == 0 ) return false;

var groomedBannerList = [];

for( var i = 0; i < bannerList.length; i++ ) {
// Only include this banner if it's intended for the current user
if( ( wgUserName && bannerList[i].display_account ) ||
( !wgUserName && bannerList[i].display_anon == 1 ) )
{
// add the banner to our list once per weight
for( var j=0; j < bannerList[i].weight; j++ ) {
groomedBannerList.push( bannerList[i] );
}
}
}

// Return if there's nothing left after the grooming
if( groomedBannerList.length == 0 ) return false;

// Choose a random key
var pointer = Math.floor( Math.random() * groomedBannerList.length );

// Load a random banner from our groomed list
$.centralNotice.fn.loadBanner(
groomedBannerList[pointer].name,
groomedBannerList[pointer].campaign,
( groomedBannerList[pointer].fundraising ? 'fundraising' : 'default' )
);
},
'getQueryStringVariables': function() {
document.location.search.replace( /\??(?:([^=]+)=([^&]*)&?)/g, function () {
function decode( s ) {
return decodeURIComponent( s.split( "+" ).join( " " ) );
}
$.centralNotice.data.getVars[decode( arguments[1] )] = decode( arguments[2] );
} );
}
}
}
jQuery( document ).ready( function ( $ ) {
// Initialize the query string vars
$.centralNotice.fn.getQueryStringVariables();
if( $.centralNotice.data.getVars['banner'] ) {
// if we're forcing one banner
$.centralNotice.fn.loadBanner( $.centralNotice.data.getVars['banner'] );
} else {
// Look for banners ready to go NOW
$.centralNotice.fn.loadBannerList( $.centralNotice.data.getVars['country'] );
}
} ); //document ready
} )( jQuery );


Basically, all Wikipedia is doing is having the first script call the second script's banner-insertion function, passing the special blackout banner as a parameter. I assume that Wikipedia wants this to be relatively easy to get around: if your browser allows you to type javascript into the address bar, a few lines of JavaScript will undo the banner's effects entirely.

UPDATE: As I had anticipated, there is a ton of javascript floating around that you can type into your address bar to remove the blackout banner. Here is one example that I found:
javascript:jQuery("#mw-sopaOverlay").css("display", "none !important");jQuery("#content, #mw-page-base, #mw-head-base, #mw-head, #mw-panel, #footer").css("display", "block !important");
All it does is tell the Wikipedia page to hide the blackout banner element and to show the rest of the HTML (the content that is normally displayed).
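Your mileage may vary with that particular snippet (jQuery's .css() method doesn't reliably handle "!important" values, so some browsers will simply ignore it). Based on what the banner code quoted above actually does (append a #mw-sopaOverlay div, hide the body's children, and inject a style element with id mw-sopa-blackout), here is a sketch that just reverses those three steps instead:

javascript:(function(){jQuery("#mw-sopaOverlay").remove();jQuery("#mw-sopa-blackout").remove();jQuery("body").children().show();})();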

Tuesday, January 10, 2012

Fedora Annoyance: Needs XX MB on the /boot filesystem

This post is quite a bit different from my other posts thus far on this blog (which have been about Android programming), but I decided to include it because I find this problem pretty irritating, and I think that yum should have some sort of option to deal with it automatically.

Fedora is known for keeping its packages very up to date. Though this is great for security and functionality, constant kernel updates can become irritating when one's /boot partition fills up. Sometimes, you might see yum spit out the following error during an update:

Transaction Check Error:
installing package kernel-3.1.0-7.fc16.x86_64 needs 9MB on the /boot filesystem

Error Summary
-------------
Disk Requirements:
At least 9MB more space needed on the /boot filesystem.


This means that your update has failed because you have run out of space to install the new kernel. (It is important to note that Fedora doesn't just replace old kernels; it keeps them so that you can boot into them should the new kernel give you any trouble. This is why GRUB, or whatever bootloader you use, will show a few kernels after you've updated a few times.)

Though / or /home might have a tremendous amount of free space, kernels are installed to the /boot filesystem. Since /boot is (almost) always on its own partition, the amount of space on / or /home is irrelevant: when /boot is full, you won't be able to update to a new kernel and your yum update will error out. The solution to this problem is simple: remove old, unused kernels from /boot to make room for new kernels.
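If you are curious how tight things actually are, df will show you how full that partition is:

df -h /boot

On a typical Fedora install, /boot is only a few hundred megabytes, so it doesn't take many kernel updates to fill it.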

First, check which kernel version you are currently running. At the command prompt, type:
uname -r

You should see something like this:
3.1.7-1.fc16.x86_64

The output from uname (a utility that prints system information; the -r flag tells it to print the kernel release) says that I am running the Fedora 16 build of Linux kernel 3.1.7-1, compiled for x86_64 processors.

Next, check which kernel versions you currently have installed. At the command prompt, type:
rpm -q kernel

You should see something like this:
kernel-3.1.4-1.fc16.x86_64
kernel-3.1.6-1.fc16.x86_64
kernel-3.1.7-1.fc16.x86_64


The output from rpm (RPM is the package manager; the -q flag queries it for the package name that follows) shows that three kernel packages are installed: 3.1.4, 3.1.6, and 3.1.7. The only thing left to do is remove one of them to make room for the new kernel.

It is very important to note that the previous command listed the kernel which uname told us was currently running. We do not want to remove that kernel. Generally, I would recommend removing the oldest kernel version that you have. To do so, simply type the following at the command prompt:

yum remove kernel-version.that.you.found.above
(kernel-version.that.you.found.above, of course, will be something like kernel-3.1.4-1.fc16.x86_64)

You can check to see that you removed the kernel version by querying rpm for "kernel" again.

Now, simply update (i.e. type "yum update" at the command prompt) and your new kernel will install, because you have freed up space for it on /boot.
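It also turns out there are ways to automate this; these come from the yum documentation rather than my own routine, so double-check them on your system. The package-cleanup tool (from the yum-utils package) can purge old kernels in one command:

yum install yum-utils
package-cleanup --oldkernels --count=2

The --count flag is the number of kernels to keep. Alternatively, setting installonly_limit=2 in /etc/yum.conf tells yum to keep at most two kernels installed going forward.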

I have personally used this solution on both Fedora 15 and Fedora 16, and I cannot think of a reason why it would not work on other versions of Fedora.

Friday, July 29, 2011

Vibe Vault V3.0

Since we launched at the end of last year, Vibe Vault has been (in our opinion) a great success. We're closing in on 15,000 downloads and have received overwhelmingly positive feedback from our users. It runs well on Android 2.1, 2.2, 2.3, 2.3.3, and 3.0 (we haven't done much testing on the pre-2.1 releases, because hardly any devices run them). At least 10 different countries have over 25 Vibe Vault users each (we have 82 in Saudi Arabia?). To say the least, it's been cool to see our app catch on.

The glorious and somewhat meteoric rise of Vibe Vault: [chart of downloads over time]


Most recently, we released Version 3.0, which includes voting (check it out if you haven't) and Facebook and Twitter integration. We've also included a large number of bug fixes, performance enhancements, and graphical improvements. Looking at screenshots of V1.1 and V3.0 side by side really shows how much the app has changed in the relatively short time it's been out:

[Screenshots: V1.1 (old) vs. V3.0 (new)]



As for the future, we plan to continue adding new features, speeding things up, and fixing bugs. In particular, we want to get things working for HTC Inspire users and find a way to get more people voting on shows. You can check out our source code (GPL v3) at http://code.google.com/p/vibevault/source/browse/. If you don't already have Vibe Vault (or you haven't given us a rating), check us out on the Android Market: https://market.android.com/details?id=com.code.android.vibevault&feature=search_result.

Anyway, thanks for your support, and feel free to leave some feedback.

Saturday, November 13, 2010

Vibe Vault

My friend Sanders and I just launched an awesome, free app on the Android Market. It's called Vibe Vault, and it lets you stream and download songs from archive.org. Archive.org has over 84,000 free recordings of concerts. The collection spans over 8,000 Grateful Dead shows and includes a diverse set of other acts, from Lotus to 311 to Elliott Smith.

I use this app every day. On my way to class, I stream shows. I download entire shows (and sometimes single songs) and listen to them on the subway when I have no cell connection. Who needs to buy music or use an iPod when you can listen to FREE music by great artists on your phone? Check it out and let me know what you think.

If you want to check out my buddy Sanders' page on Vibe Vault, you can find it here.

Monday, September 13, 2010

Melodotron

I officially now have an app on the Android Market. It is called the Melodotron.



The Melodotron turns an Android phone into a dynamic, interactive musical instrument. The sounds it produces are completely dynamic and synthetic (no sampling or anything). As of right now (Version 0.5.2 at the time of this posting), the Melodotron can play notes across 4 octaves and 3 different wave types (sine, sawtooth, square), all with an orientation-activated tremolo effect. I expect to add more wave types, effects, and other improvements as regularly and as often as possible. The source code is entirely free.
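For the curious, I won't reproduce the app's synthesis code here, but generating a waveform from scratch on Android looks roughly like the minimal sketch below (the 440 Hz tone and all the constants are just for illustration; this is not Melodotron's actual code):

// Minimal sketch: synthesize and play one second of a 440 Hz sine wave
// using android.media.AudioTrack (plus AudioManager and AudioFormat).
int sampleRate = 44100;
short[] samples = new short[sampleRate]; // one second of mono 16-bit audio
for (int i = 0; i < samples.length; i++) {
    samples[i] = (short) (Short.MAX_VALUE * Math.sin(2 * Math.PI * 440 * i / sampleRate));
}
AudioTrack track = new AudioTrack(AudioManager.STREAM_MUSIC, sampleRate,
        AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT,
        samples.length * 2, AudioTrack.MODE_STATIC); // buffer size is in bytes
track.write(samples, 0, samples.length);
track.play();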

Check out the Melodotron's main page. The program is completely open source. Better yet, check it out on the Android Market.

Friday, July 30, 2010

Android HTML Parsing

Another common task in smartphone app development is parsing a webpage. Maybe you want to display a subset of the page's data to the user of your app. Maybe you are parsing the webpage for information that your app will use internally. Either way, the Android API does not provide an easy way to do this; hence the necessity of this blog post.

There are many approaches that one can take to accomplish this task. Some (see: idiots) advocate parsing HTML pages like long strings, using regexes or some other "roll-your-own" approach. Others prescribe using a SAX parser (treating HTML like XML), which is bug-prone if the HTML isn't properly formed. I recommend using a free HTML parsing library. A good choice is the aptly, yet unoriginally, named HtmlCleaner. Though it doesn't fully support XPath features (more on this in a bit) like its competitor TagSoup, it is a bit smaller (this is important because you have to bundle the library with your app). If you want to use TagSoup instead of HtmlCleaner, I would bet that the steps in the rest of this tutorial are more or less the same, though I have not tested them.

Anyway, let's outline exactly what we want to do.
  • Open up some webpage.
  • Programmatically extract some information from it.
  • Do something with that information.

As an example of a webpage to parse, I will once again draw from the development of my archive.org app. The type of page that we will be parsing is an archive.org show page; the information that we are looking for is the URL and title for each song listed on the page.

We can see the information that we want in the table titled "Audio Files", which itself sits inside a table titled "Individual Files" a little way down the page. Viewing the HTML source for the page reveals a tangle of <tr> and <td> tags, all with various attributes. Though it might appear difficult to sort through this mess, we can clean things up with just a few lines of code. Below is the code that will parse this page for exactly what we want:

// Create an HtmlCleaner object to turn the page into
// XML that we can analyze to get the songs from the page.
// (Classes used here come from org.htmlcleaner: HtmlCleaner,
// CleanerProperties, TagNode, and XPatherException.)
HtmlCleaner pageParser = new HtmlCleaner();
CleanerProperties props = pageParser.getProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);

try {
    // url[0] is the show page's URL, passed in as a parameter
    // (this snippet lives inside an AsyncTask in my app).
    URLConnection conn = url[0].openConnection();
    TagNode node = pageParser.clean(new InputStreamReader(conn.getInputStream()));

    // XPath expression for locating download links: "select every 'tr'
    // element inside any 'table' element whose 'class' attribute
    // equals 'fileFormats'".
    String xPathExpression = "//table[@class='fileFormats']//tr";
    try {
        // The API returns Object[] rather than TagNode[], so we cast later.
        Object[] downloadNodes = node.evaluateXPath(xPathExpression);

        // The song titles and locations are listed between the "Audio Files"
        // subheading and the "Information"/"Other Files" subheadings.
        // Skip all other rows to save a little time and battery.
        boolean reachedSongs = false;
        for (Object linkNode : downloadNodes) {
            TagNode row = (TagNode) linkNode;
            if (row.getChildTags().length == 0) {
                continue; // ignore rows with no cells
            }
            // The first cell holds either a subheading or a song title.
            String s = pageParser.getInnerHtml(row.getChildTags()[0]);
            if (!reachedSongs) {
                if (s.equals("Audio Files")) {
                    reachedSongs = true;
                }
                continue;
            }
            if (s.equals("Information") || s.equals("Other Files")) {
                break; // we are past the audio files
            }

            // Recursively find all nodes in this row which have "href" (link)
            // attributes, and store the link values in an ArrayList. These
            // links and the title are what an ArchiveSongObj gets built from.
            TagNode[] links = row.getElementsHavingAttribute("href", true);
            ArrayList<String> stringLinks = new ArrayList<String>();
            for (TagNode t : links) {
                stringLinks.add(t.getAttributeByName("href"));
            }
            String title = s.trim();
            System.out.println(title);
            System.out.println(stringLinks);
        }
    } catch (XPatherException e) {
        Log.e("ERROR", e.getMessage());
    }
} catch (IOException e) {
    Log.e("ERROR", e.getMessage());
}

The first thing that we do is set up an HtmlCleaner object. We set a few properties on it, and then it is ready to use. We call its clean() method on the URL's input stream, which returns a TagNode for the root node of the document. A TagNode is a crucial part of the HtmlCleaner API: it represents a node in an XML document, and you can use the API to work with its elements, attributes, and child nodes.

The next step greatly reduces the amount of processing that we have to do on the webpage. Instead of having to worry about EVERY subnode of the document's root node, we can use an XPath string to ask for only a subset of those nodes. We define the String xPathExpression to be "//table[@class='fileFormats']//tr". Calling evaluateXPath() with this String says, in effect, "return every tr element (table row) contained in any table element whose class attribute equals 'fileFormats'". We receive an array of Objects (which are really TagNodes) back from this method.

Now we have the collection of TagNodes that makes up the table with the information we want. The problem is that the table also contains lots of extraneous information. In fact, we don't care about anything before the "Audio Files" subheading, and we don't care about anything after the files have been listed. Instead of wasting time (battery power) processing those TagNodes, I define a boolean called reachedSongs that I use to skip over rows until we get to the information we care about. The "Audio Files" subheading is the inner HTML of the first child of one of the rows returned by our XPath evaluation. After the files, there is a subheading called "Information" (or "Other Files"); we know to break out of our loop when we hit it.

In between the "Audio Files" and "Information" subheadings is where we have to actually analyze our nodes. Each node represents a tr (table row) element. Each row has several td elements: the inner HTML of the first td element is the song title, and any td with an href attribute is a link to a particular version of the song (64kb, VBR, FLAC, etc.). We grab this information for each song.
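As the comment in the code suggests, in my app each (title, links) pair gets wrapped up in an ArchiveSongObj. The real class lives in the app's source; a bare-bones sketch of the idea (with hypothetical field names) looks like this:

// Hypothetical sketch; see the app's source for the real ArchiveSongObj.
// (Uses java.util.ArrayList.)
public class ArchiveSongObj {
    private final String title;
    private final ArrayList<String> links; // one URL per format (64kb, VBR, FLAC, ...)

    public ArchiveSongObj(String title, ArrayList<String> links) {
        this.title = title;
        this.links = links;
    }
}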

Tuesday, July 20, 2010

Android: Why To Use JSON And How To Use It

It is a pretty common task in smartphone application development to query a website for some information, do a bit of processing, and present it to the user in a different form. There are two main ways of going about this:

1.) Download a whole webpage and use an HTML or XML parser to extract the information that you want.
2.) Use a website's API (when one is available) to make queries that return XML, JSON, or some other structured data format instead of webpages.

Clearly, option 2 (if available) makes a lot more sense. Instead of downloading a large webpage (wasting data), parsing the entire thing (wasting battery), and then trying to analyze it (wasting your time wading through often improperly written HTML), you can download a much smaller, easier-to-manage response in XML, JSON, or whatever format the API provides. (I will be posting another tutorial about option 1 soon.)

You might wonder why you would want to use JSON instead of XML for your smartphone application. After all, XML has been a much-ballyhooed technology buzzword for many years. There is, however, a good (and simple) reason: XML is (usually) bigger. JSON has no closing tags, so it saves a few bytes for every tag that XML would need, and it can generally express the same data in fewer characters. That means the phone transfers less data on every query, which makes JSON a natural choice for quick and efficient website querying (for websites that offer it).
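To make the size difference concrete, here is one record (shaped like the ones we'll see below, just for illustration) expressed both ways:

<doc><title>Lotus Live at Mr. Smalls Theatre</title><date>2006-02-17</date></doc>

{"title": "Lotus Live at Mr. Smalls Theatre", "date": "2006-02-17"}

Every XML element name is written twice (once in the opening tag and once in the closing tag), while JSON pays for each key only once; over thousands of records, the savings add up.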

One such website is www.archive.org. Among many other things, archive.org allows people to upload recordings from concerts for other people to download for free. It's pretty awesome. They also have an API which allows you to query their system which will return results in XML, JSON, or a variety of other formats.

I am currently writing an application for browsing archive.org from your phone to find shows and then either download or stream the tracks. I'll show you how I do the first part (finding shows) using JSON and just a few lines of code.

First, you need your JSON query. I am going to query archive.org for "Lotus," asking for a JSON result containing 10 items with their respective date, format, identifier, mediatype, and title. According to the archive.org search API, my query should look like this:

String archiveQuery = "http://www.archive.org/advancedsearch.php?q=Lotus&fl[]=date&fl[]=format&fl[]=identifier&fl[]=mediatype&fl[]=title&sort[]=createdate+desc&sort[]=&sort[]=&rows=10&page=1&output=json&callback=callback&save=yes";
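(One aside that isn't part of the query above: if the search term comes from user input rather than being hardcoded, you should URL-encode it instead of concatenating it in raw. A quick sketch, with the field list trimmed for readability, using java.net.URLEncoder:)

String searchTerm = "Lotus"; // imagine this came from an EditText
try {
    String query = "http://www.archive.org/advancedsearch.php?q="
            + URLEncoder.encode(searchTerm, "UTF-8")
            + "&fl[]=identifier&fl[]=title&rows=10&page=1&output=json&callback=callback";
} catch (UnsupportedEncodingException e) {
    // UTF-8 is always supported, so this cannot really happen.
}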

Now that we have our query, we simply open an HTTP connection using it, grab an input stream of bytes, and turn the result into a JSON object. On a side note, notice that I am using a BufferedInputStream because its read() call can grab many bytes at once and put them into an internal buffer; a plain InputStream reads one byte per read() call, so it pesters the OS more, runs slower, and wastes more processing power (which in turn wastes battery life).

// Uses java.net.URL/HttpURLConnection, java.io streams, and
// org.apache.http.util.ByteArrayBuffer (bundled with Android).
InputStream in = null;
String queryResult = "";
try {
    URL url = new URL(archiveQuery);
    HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
    httpConn.setAllowUserInteraction(false);
    httpConn.connect();
    in = httpConn.getInputStream();
    BufferedInputStream bis = new BufferedInputStream(in);

    // Read the response in 512-byte chunks until EOF.
    ByteArrayBuffer baf = new ByteArrayBuffer(50);
    int read = 0;
    byte[] buffer = new byte[512];
    while ((read = bis.read(buffer)) != -1) {
        baf.append(buffer, 0, read);
    }
    bis.close();
    queryResult = new String(baf.toByteArray());
} catch (MalformedURLException e) {
    // DEBUG
    Log.e("DEBUG: ", e.toString());
} catch (IOException e) {
    // DEBUG
    Log.e("DEBUG: ", e.toString());
}

At this point, our JSON response is stored in the String queryResult. It looks kind of like this:

callback({
"responseHeader": {
... *snip* ...
}, "response": {
"numFound": 1496,
"start": 0,
"docs": [{
"mediatype": "audio",
"title": "The Disco Biscuits At Starscape 2010",
"identifier": "TheDiscoBiscuitsAtStarscape2010",
"format": ["Metadata", "Ogg Vorbis", "VBR MP3"]
}, {
"title": "Lotus Live at Bonnaroo Music & Arts Festival on 2010-06-10",
"mediatype": "etree",
"date": "2010-06-10T00:00:00Z",
"identifier": "Lotus2010-06-10TheOtherStageBonnarooMusicArtsFestivalManchester",
"format": ["Checksums", "Flac", "Flac FingerPrint", "Metadata", "Ogg Vorbis", "Text", "VBR MP3"]
}, {
"title": "Lotus Live at Mr. Smalls Theatre on 2006-02-17",
"mediatype": "etree",
"date": "2006-02-17T00:00:00Z",
"identifier": "lotus2006-02-17.matrix",
"format": ["64Kbps M3U", "64Kbps MP3", "64Kbps MP3 ZIP", "Checksums", "Flac", "Flac FingerPrint", "Metadata", "Ogg Vorbis", "Text", "VBR M3U", "VBR MP3", "VBR ZIP"]
}, {
... *snip* ...

We see that the information we want is stored in an array whose key is "docs", inside an object called "response". We can grab this information VERY easily using the JSONObject class provided by Android, as shown below:

JSONObject jObject;
try {
    // Strip the "callback(...)" wrapper: archive.org returns it because we put
    // a callback parameter in the query, but it is not part of the JSON itself.
    String json = queryResult.trim();
    if (json.startsWith("callback(")) {
        json = json.substring("callback(".length(), json.lastIndexOf(')'));
    }
    jObject = new JSONObject(json).getJSONObject("response");
    JSONArray docsArray = jObject.getJSONArray("docs");
    for (int i = 0; i < 10; i++) {
        // Only keep concert recordings, whose mediatype is "etree".
        if (docsArray.getJSONObject(i).optString("mediatype").equals("etree")) {
            String title = docsArray.getJSONObject(i).optString("title");
            String identifier = docsArray.getJSONObject(i).optString("identifier");
            String date = docsArray.getJSONObject(i).optString("date");
            System.out.println(title + " " + identifier + " " + date);
        }
    }
} catch (JSONException e) {
    // DEBUG
    Log.e("DEBUG: ", queryResult);
    Log.e("DEBUG: ", e.toString());
}

The first thing that I do is create a JSONObject from queryResult, the JSON response from archive.org. Note that I strip off the "callback(...)" wrapper because, even though archive.org returns it (we asked for it with the callback parameter in the query), it is not actually part of the JSON itself (I realized this when I was catching JSONException errors).

After that, we are ready to do some JSON parsing. Since this is just a tutorial, I hardcode 10 into the for loop because I requested 10 items; in production code you would loop to docsArray.length() instead (if you don't know why, you are a huge noob and should not be writing production code). I only want items whose mediatype is "etree", and for each of these items I print the title, identifier, and date.

Voila, you now know how to use JSON in Android.