Updated (*) 17/10/14: two new bullet points at the end.
At the first Code The City event, held in Aberdeen in June 2014, one of the projects which was worked on for the weekend was the setting up of a search facility for Scottish Local Authorities’ FOI disclosure logs.
Iain Learmonth, Johnny McKenzie and Tom Jones created a scraper for the disclosure logs of both Aberdeen and East Lothian councils. This scraped the titles and reference numbers of FOI responses and, if I recall correctly, processed the data as Linked Data at the back end and put it into a MySQL database. They then put a web search front end onto it. A demo site, FOIAWiki, was set up for a while but has since been taken down.
Iain has now made his code available on GitHub.
That wasn’t all we wanted to do – so I hope that at the second Code The City event, which takes place in Aberdeen this weekend (18-19 Oct), we can revisit this and go a bit further.
One way we could do this is to broaden out the number of councils that we scrape; the second would be to go deeper, scraping the content of the actual PDFs which contain the detailed responses themselves, and making that searchable, too.
There are some hurdles to be overcome – and I’ve been working on those over the last week.
The first is that only a small proportion of Scottish councils actually publish a disclosure log. According to some research I carried out last weekend, these account for only 8 councils (25% of the total). Amongst those, only two of Scotland’s seven cities actually had discoverable disclosure logs. I’ve posted a list of them here. And, of course, the quality of what they provide – and the formats in which they present their data – vary significantly.
The second issue is that about half of those with disclosure logs post their responses in PDFs. These files contain scanned images, so the text can’t be easily extracted. However, there are tools which, when used in combination, can yield good results.
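As a sketch of the kind of combination I have in mind – and this is an assumption about tooling, not a settled approach – poppler’s pdftoppm can rasterise each PDF page to an image, which tesseract can then OCR. The function below just builds the commands rather than running them, so the flags, filenames and page limit are illustrative:

```python
def ocr_pipeline(pdf_path, max_pages=3, dpi=300, workdir="ocrtmp"):
    """Build the commands to OCR the first few pages of a scanned PDF.

    Assumes poppler-utils (pdftoppm) and tesseract are installed via
    the package manager. Returns argument lists rather than running
    anything, so the caller decides how to execute and log them.
    """
    cmds = []
    for page in range(1, max_pages + 1):
        img_base = f"{workdir}/page-{page:03d}"
        # -f/-l restrict pdftoppm to a single page; -singlefile stops
        # it appending a page number to the output name; -r sets the
        # DPI, which matters a lot for OCR accuracy on scans.
        cmds.append(["pdftoppm", "-f", str(page), "-l", str(page),
                     "-r", str(dpi), "-png", "-singlefile",
                     pdf_path, img_base])
        # tesseract writes its output to <img_base>.txt
        cmds.append(["tesseract", img_base + ".png", img_base])
    return cmds
```

Each pair of commands handles one page, which also gives us the page-limiting behaviour discussed below for free.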
In preparation for the next Code The City session I spent two days and a couple of evenings experimenting with different options. Keeping with Iain and Tom’s original approach, and bearing in mind my own slender coding abilities, I looked for Python-based solutions.
In setting up my environment I had several difficulties getting Tesseract and its dependencies installed. I’d missed that, on Ubuntu-based systems, you can use the command:
sudo apt-get install tesseract-ocr
(I’d been trying “sudo apt-get install tesseract”, i.e. without the -ocr, and it had failed.)
Instead, I’d tried downloading the source and compiling it according to these instructions.
(HINT – avoid this option at all costs if there is any possibility of using a package manager to do the job.) The upshot was that running multipage-ocr.py failed every time because Tesseract and its many dependencies were not installed properly.
Trying to fix it by installing the pre-packaged option over the top, then removing it and reinstalling via the apt-get package manager, failed several times; it took hours to get rid of the manual install and start with a clean installation. As soon as I did, the code worked.
So, running it from the command line looks like this:
python multipage-ocr.py -i abcde.pdf -o output.txt
So, for the weekend ahead, here are some things I would like for us to look at:
- amending the scraper to work across all eight sites if we can – and looking at how this affects the original information request ontology from weekend one.
- grabbing the PDFs and storing them in a temporary location for conversion. The OCR process is quite slow, and processing a two- or three-page document can take some time.
- running the script not from the command line but via os.system() or similar, passing in the parameters
- limiting the OCR to work on only the first two or three pages (max). Note: I saw one Aberdeen City Council PDF which ran to 120 pages! Beyond the first couple of pages the content tends to be less relevant to the original query, serving as background material as far as I can see.
- converting the multipage-ocr script so that it doesn’t (just) output to a txt file, but grabs the text as a bag of words, removes stop words, and stores the output in a searchable DB field – associated with the original web pages / links.
- setting up some hosting, with the database, the Python scripts and their dependencies, and a web front end onto it.
- Further options would be to look for standard structure in the documents and use that to our advantage – and to set up an API onto the gathered data.
- (* update) – we need to track which FOI responses have already been indexed and only schedule indexing of new ones; and
- we could use WhatDoTheyKnow RSS feeds for missing councils, e.g. https://www.whatdotheyknow.com/body/glasgow_city_council (although this would not be a full set).
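On the os.system() bullet: a slightly safer sketch is subprocess.run, which lets us check the exit status and capture stderr rather than firing and forgetting. The flags mirror the command-line example above; the script location is an assumption.

```python
import subprocess
import sys

def run_ocr(pdf_path, out_path, script="multipage-ocr.py"):
    """Invoke the OCR script with the same -i/-o flags as the
    command-line example, raising on failure so a batch run over
    many PDFs doesn't silently skip documents.

    The script is assumed to sit in the working directory.
    """
    result = subprocess.run(
        [sys.executable, script, "-i", pdf_path, "-o", out_path],
        capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"OCR failed for {pdf_path}: {result.stderr}")
    return out_path
```

Using sys.executable rather than a bare "python" keeps the child process on the same interpreter as the calling script.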
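And for the bag-of-words bullet, here is a minimal sketch of what the conversion step might do with the OCR’d text before it goes into a searchable DB field. The stop word list is a tiny illustrative stand-in, not a real one:

```python
import re
from collections import Counter

# Illustrative stop words only; a real list (e.g. NLTK's) is far longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "for", "on", "that", "this", "be", "with"}

def bag_of_words(text):
    """Lower-case the OCR output, tokenise it, drop stop words,
    and count what remains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```

The resulting counts could then be serialised into the database field alongside the link to the original disclosure-log page.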
And probably a whole pile of stuff that I’ve not yet thought of.
I’m looking forward to seeing how this comes along!