Yesterday I spent nine and a half hours in a room with a group of hard-core coders and not only lived to tell the tale, but thoroughly enjoyed the experience.

Firstly I should say thanks to Dr Bruce Scharlau of Aberdeen University for hosting the event, and for providing coffee and pizza. Given that fewer participants turned up than expected, there was certainly no shortage of food!

From the outset it was clear that this was a free-form day, with an encouragement for all to collaborate and to experiment on whatever we liked.

Initially we split into three groups. I paired with @mr_urf and we agreed to use some data from Aberdeen City Council – avoiding the already open data in favour of scraping some data which was locked in HTML tables. We picked the Food Hygiene Inspection data and chose ScraperWiki to do the scraping. While this is potentially a brittle solution, it does allow the user to be alerted if the scraped page changes.

The most significant challenge (other than the fact that neither of us had used ScraperWiki before) was pagination: while the tutorials and sample code demonstrated scraping multiple pages by following Next links, this site proved trickier because the Microsoft ASP.NET-generated pages used JavaScript postback links to paginate the records. Mr Urf had to be more creative in finding a way around this.
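For anyone facing the same problem: an ASP.NET postback pager doesn't link to a new URL, it calls `__doPostBack()`, which re-submits the form with the page's hidden fields (notably `__VIEWSTATE`) plus an `__EVENTTARGET`/`__EVENTARGUMENT` pair naming the pager control. A scraper can reproduce that by collecting the hidden fields and POSTing them back to the same URL. Here is a minimal stdlib Python sketch of that idea – the control name `grdResults` and the `Page$2` argument are hypothetical examples, not taken from the council's actual site:

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects the values of <input type="hidden"> fields on a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name", "")] = a.get("value", "")

def build_postback_data(html, event_target, event_argument=""):
    """Reproduce what __doPostBack() would submit: every hidden field
    (including __VIEWSTATE) plus the target/argument of the pager link."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    return data

# A pager link typically looks like:
#   <a href="javascript:__doPostBack('grdResults','Page$2')">2</a>
# so the next page is fetched by POSTing build_postback_data(page_html,
# "grdResults", "Page$2") back to the same URL (e.g. via urllib.request).
```

This loops naturally: fetch a page, scrape its rows, rebuild the postback data from its hidden fields, and POST for the next page until the pager runs out.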

All of this took quite a bit of time, understandably, but eventually we had a scrape working that sucked the data in and saved it, allowing it to be accessed as open data.

While all of this was going on, as the non-coder, I spent my time preparing a presentation on our project and making suggestions for naming it. After much amusement we settled on “Shiny Spoon” as the opposite of Greasy Spoon.

I quickly registered the domain name, linked it to Mr_Urf’s hosting, and completed and uploaded the presentation.

By the end of the afternoon Mr Urf was able to display the data on his own page.

Despite Mr Urf’s best efforts it didn’t feel like an enormous success: we’d scraped data and re-presented it on a web page with less functionality than the original source. But this was only part of our proposed project and we agreed that we’d like to get back together to work on it further. The added value will come in combining the data with other sources.

And we shouldn’t forget that we did manage to open up data that was previously locked in an HTML page – allowing anyone else to re-purpose it in their own projects. Also, given that the Health and Safety inspection reports that we proposed to scrape are in a very similar system to the original, it should be reasonably straightforward to scrape those in exactly the same way, allowing the data sets to be merged.

Mr_Urf even found time to upload the source to GitHub. If I get the address I’ll link to it here. I also wrote it up hurriedly for Rewired State.

At the end of the day we each gave a short presentation on what we’d done. The other guys’ presentations showed their work as being more conceptual, and while they appeared to have real potential, they’d need some work to put them into production. One was a system to allow school pupils to work out which exams they need to pass in order to get into a university for a particular degree, with an end job in mind. This required data from a large number of sources including LinkedIn, SQA and UCAS, as well as individual university sites.

The other was a concept to allow users to define their own linked data requirements as bundles and to then generate a structure and a means of hosting that as well structured linked data. What then became of the data was up to the user community. This seemed an intriguing proposition and it would benefit from working up further.

Tomorrow I go back to my day job. As I do so, I’ll be bearing in mind the first-hand experience of how difficult it can be to get data out of council websites.

To date we’ve agreed that there is a need to provide data openly, and we’ve dipped our toe in the water, picking off simple sets of data and making these available as standards-compliant open data.

My aim is to work up a strategy for how we provide open and linked data in the future.

  • Should we now try to follow the lead of Enschede and make all data open by default, with only valid exceptions held in closed formats?
  • Should we also attempt to retrofit the website’s existing data-driven systems to pull the data out of them for the ease of end users?
  • Or should we encourage the user community to assist us using ScraperWiki and other tools?

Expect another blog post on this topic in the near future!


NHTG11 – a personal reflection