Askari and Java prototyping

The two most useful things I've found about my Rhino wrapper Askari are:

  • it gives me the easiest way I've found to quickly load, interpret and transform XML data
  • it allows me to use Java libraries without compiling Java code

Recently, I came across a nice example that let me exploit both benefits. I wanted to extract data from a series of web pages, but unfortunately they were served as HTML rather than XHTML, so they could not be loaded as XML directly.

So I tracked down the JTidy library and configured Askari to load it automatically by adding the following lines to lib-askari.conf:

; JTidy interface
jar,jtidy\jtidy.jar

Then in the Askari interpreter I could fetch each page, convert it to XHTML using JTidy, select the part of the page I wanted, and write the result to disk:

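// XHTML namespace, needed to qualify element names in the E4X query below
var xhtmlns = new Namespace("http://www.w3.org/1999/xhtml");

// Process ids 1 to 99
for (var i=1;i<100;i++) {
  var text = readUrl("http://myurl.com?id=" + i);
  // Clean the HTML up into XHTML so it can be parsed as XML
  var page = new XML(tidy(text));
  // Select every <div id="results"> anywhere in the page
  var email = page..xhtmlns::div.(@id == "results");
  writeFile("email"+i+".html", email.toXMLString());
}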

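// Convert an HTML string to an XHTML string using JTidy
function tidy(s) {
  var JTidy = JavaImporter();
  JTidy.importPackage(Packages.org.w3c.tidy);
  // Named "parser" so the local variable does not shadow this function's name
  var parser = new JTidy.Tidy();
  parser.setXHTML(true);          // emit XHTML
  parser.setNumEntities(true);    // numeric entities instead of named ones such as &nbsp;
  parser.setQuiet(true);          // no summary output
  parser.setDocType("omit");      // leave out the DOCTYPE declaration
  parser.setShowWarnings(false);
  // JTidy works on Java streams, so wrap the JavaScript string in a
  // java.lang.String to get its bytes (platform default encoding)
  var inStream = new java.io.ByteArrayInputStream(new java.lang.String(s).getBytes());
  var outStream = new java.io.ByteArrayOutputStream();
  parser.parse(inStream, outStream);
  return outStream.toString();
}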
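
As written, an exception from any step (a page that fails to download, or tidied output that still won't parse) ends the loop at that id. Below is a minimal sketch of a more forgiving variant, not part of Askari itself: scrapeRange and its steps argument are hypothetical names introduced here, and the four step functions are passed in so the loop can be tried with stubs before pointing it at real pages.

```javascript
// Hypothetical helper: run fetch -> tidy -> extract -> write for ids first..last.
// steps is an object with four functions: fetch(id), tidy(html),
// extract(xhtml) and write(filename, content).
// An id whose processing throws is recorded and skipped rather than
// stopping the whole run. Returns the array of ids that failed.
function scrapeRange(first, last, steps) {
  var failed = [];
  for (var i = first; i <= last; i++) {
    try {
      var xhtml = steps.tidy(steps.fetch(i));
      steps.write("email" + i + ".html", steps.extract(xhtml));
    } catch (e) {
      failed.push(i);
    }
  }
  return failed;
}

// Dry run with stub steps: nothing touches the network or the disk.
var dryRunFiles = [];
var dryRunFailures = scrapeRange(1, 3, {
  fetch: function (id) { return "<p>page " + id + "</p>"; },
  tidy: function (html) { return html; },
  extract: function (xhtml) { return xhtml; },
  write: function (name, content) { dryRunFiles.push(name); }
});
```

In the interpreter, the real steps would presumably be readUrl with the page URL for fetch, the tidy function above, the E4X query followed by toXMLString() for extract, and writeFile for write.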

Most of the tricky bits are the Java type wrangling, such as wrapping the JavaScript string in a java.lang.String to get at its bytes, which isn't normally a concern in JavaScript. Nonetheless, it's still an easy way to take a Java library for a quick test drive.