Project Advice

2. Reusing old data

You may be lucky and find that there is already some electronic text that you can reuse. University departments often have data from old projects, but be careful here: it can be surprisingly hard to convert (or even read) old media. Another possibility is data that is freely available on the web and that can be reused, from sites such as the Internet Archive (Archive.org), the Oxford Text Archive, Project Gutenberg, and so on. You might find that these texts need correcting as well as tagging, but they can provide a useful base text to work with.

If the information you want to work with is modern data, you may be able to obtain it from the web by scraping it. Guides to web scraping without programming are available online. You can also use some of the functions built into the spreadsheet component of Google Docs (now Google Sheets) to scrape data, such as the =IMPORTHTML function, which pulls a table or list from a web page into a sheet. There are also many ways to get web data using software, such as the wget command on the Unix command line, or the HTTP and HTML-parsing functions built into most scripting languages.
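As an illustration of the scripting-language route, here is a minimal sketch in Python that uses only the standard library. It is not the only way to do this, and the names TableExtractor, extract_tables and fetch_html are made up for this example. It assumes reasonably well-formed HTML (every cell has a closing tag) and does not handle nested tables.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect the text of every cell in every <table> as rows of strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def fetch_html(url):
    """Download a page as text (assumes UTF-8 if no charset is declared)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example with an inline snippet, so it runs without a network connection.
sample = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Text A</td><td>1850</td></tr>
</table>
"""
print(extract_tables(sample))  # [[['Title', 'Year'], ['Text A', '1850']]]
```

To scrape a live page you would pass the result of fetch_html(url) to extract_tables. Before doing so, check that the site's terms of use allow automated access.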

For offline data we’ll run through a few common formats with advice about how they can be converted to XML.