FAQ for Research Material Preservation
- Should I document my research material? If so, how?
- What systems could I use to organise and find my material or data?
- Why shouldn't I just keep my data or material on my hard drive?
- I have all my data on an external hard drive - do I need to do anything else?
- I have heard that I should delete some material. How do I decide what to delete?
- How do I archive emails?
- What's the difference between 'backup' and 'preservation'?
- How do I make backups?
- Are there Cloud services that let me back up or preserve my data?
- Why should I preserve research material?
- What material and data should I preserve?
- What is the difference between short-term and long-term preservation?
- Why should I consider using a repository?
- What archives or repositories are there for preserving my data?
- Can I use my institutional repository for data preservation?
- Can or should I deposit in more than one repository or archive?
- I work on a funded project - do I need to deposit my research output?
- Will I be infringing other people's copyright if I deposit my research material in a public access repository?
- I want to publish my research and perhaps apply for a patent. Can I still use a repository?
- Will I lose control over the material if I preserve it?
- My research material contains confidential or sensitive information. Can I still use a repository?
- What are preprints and postprints?
- What file formats should I use?
- How do I create a PDF/A?
- Why shouldn't I use jpeg image format or mp3 audio format?
- What are .odt and .ods files?
- What should I do with non-digital material like paper and magnetic tapes?
- This all seems very time consuming. What can I do to make things easier?
Explaining your material
It is very unlikely that your research material and data will be self-explanatory, so:
- consider the audience for your research material. As a minimum, these will be other researchers and librarians/archivists;
- document your material and data to aid discovery and understanding;
- use structured descriptions (metadata) when appropriate.
Here is a checklist you could follow:
- Background information: title, author, date, and why the data exists
- Brief description of what the research material is about and what it is not about (scope and limits)
- If you have computer programs or scripts, say what operating system, program and version were used to create or compile the material
- If you have used file formats that are not open standards, say what the formats are and what operating system, program and version were used to create the material
- File list, with a description of what each file contains and any interdependencies (e.g. a spreadsheet that links to another spreadsheet for its data)
- Database schema. Include, either here or in a separate document, enough information to recreate the structure of any database in your research material
- Metadata schema. Explain, here or in a separate document, what data you have embedded and in which fields (e.g. data embedded in images)
- Ownership, copyright or other IPR in the material. Be explicit about your own copyright as well as any third-party material included
- Permissions and limitations of reproduction. For example, if some of the material has been copied from a library or archive, state what publication conditions apply
- The licence that covers your material. For example, it is a good idea to release your data under a Creative Commons licence
- Deletion schedule
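The file list in the checklist above can be started automatically and then completed by hand. The following is a minimal sketch, assuming Python 3 is available; the function name write_file_list and the column names are illustrative choices, not part of any standard.

```python
import csv
from pathlib import Path


def write_file_list(project_dir, output_csv):
    """Write a CSV listing every file under project_dir.

    The 'description' and 'depends_on' columns are left blank so that
    they can be filled in by hand afterwards.
    """
    project = Path(project_dir)
    output = Path(output_csv)
    rows = []
    for path in sorted(project.rglob("*")):
        # Skip folders, and skip the list itself if it lives in the project.
        if path.is_file() and path.resolve() != output.resolve():
            rows.append([path.relative_to(project).as_posix(),
                         path.stat().st_size, "", ""])
    with open(output, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "size_bytes", "description", "depends_on"])
        writer.writerows(rows)
    return len(rows)
```

Calling write_file_list("My Project", "file_list.csv") would produce a spreadsheet-readable list with one row per file. The descriptions and interdependencies still have to be written by someone who knows the material.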
The purpose of organising your files is so that
- you can find them again,
- you can selectively share files,
- when you give them to an archivist, the files are in some sort of logical order to be ingested into a preservation system.
All 3 major operating systems (Windows, Linux/UNIX and Mac OS) have hierarchical filing using user-created folder structures, so this can be a good place to start organising your files. This is fine if your research maps to a single hierarchy, but it may not be long before you find that you need multiple hierarchies for your files, or that a hierarchy is not the most appropriate approach. But operating systems don't support multiple hierarchies, which is when you start resorting to the built-in search tool. Search tools are okay up to a point but they are not a substitute for organising your files in a way that is most relevant to your research, so consider some of the alternatives…
The main problem with a single hierarchy is that a file can only exist in one place, but it often makes it easier to find if it appears in several places. It is not good practice to copy the file to other folders as this means updating multiple versions and it wastes storage space. Operating systems get round this by allowing you to create a link in one folder to a file in another folder - in Windows these are called Shortcuts. This isn't a true multiple hierarchy but it does allow you to get around a limitation of a strict single hierarchy.
However, if you were to keep on creating links to files in other folders the concept of a hierarchy turns into an anarchy of cross-linking. So it might be worth considering other techniques…
If you don't think in hierarchical ways or your research doesn't lend itself to this structure, there are several alternatives.
Some operating systems (e.g. Windows 7) allow files to be tagged, but not all file types support this: in Windows, Office documents can be tagged but PDFs, for example, cannot. You can read more about tagging in Windows here ( http://windows.microsoft.com/en-us/windows-vista/Add-tags-or-other-properties-to-a-file), while Mac users can use Spotlight. Unfortunately these restrictions make OS tagging of limited value at the moment. So let's look at another approach…
The easiest to describe and implement is filename tagging (tagging is pretty much the same as keywording except that keywording tends to be done by specialists).
You choose one or more words to categorise your file and include these as part of the file name. You can then use the computer's Search function to find all instances of that tag.
Simple tagging example
Let's suppose you have downloaded a PDF article and you give it a filename in the form author_date_title: DPC_2012_PreservingDigitalSoundAndVision.pdf which you save along with all other files in your My Project folder.
You could then add tags to this: DPC_2012_PreservingDigitalSoundAndVision_audio_video_preservation.pdf
Searching on your computer for "_video" will find only those files that carry this tag, not files that merely have the word inside the title. To find both, search for "video". So you can search just tags, or tags and titles.
If you use a lot of different tags you will probably want to maintain a list of tags in a separate text document so that you don't confuse things with near-duplicate words.
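If the operating system's search is not precise enough, filenames that follow the author_date_title_tags pattern can also be read by a short script. This is a minimal sketch, assuming Python 3 and assuming that the author, date and title never contain an underscore themselves; the function names are hypothetical.

```python
from pathlib import Path


def parse_tagged_filename(filename):
    """Split 'author_date_title_tag1_tag2.ext' into its parts."""
    parts = Path(filename).stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"expected author_date_title in {filename!r}")
    author, date, title = parts[:3]
    # Everything after the title is treated as a tag.
    return {"author": author, "date": date, "title": title,
            "tags": [tag.lower() for tag in parts[3:]]}


def find_by_tag(folder, tag):
    """Return the names of files in folder that carry the given tag."""
    matches = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            info = parse_tagged_filename(path.name)
        except ValueError:
            continue  # file does not follow the naming scheme
        if tag.lower() in info["tags"]:
            matches.append(path.name)
    return matches
```

Because the script separates the title from the tags, a file whose title happens to begin with the word "Video" is not returned when you look for the tag "video".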
Using tags outside the operating system
The simple example above is about as far as current operating systems can go unless you are working exclusively with file formats that can be tagged. Users of Web-based social media sites, blogs and Cloud services will probably be familiar with tagging and categorising. Google Mail and Drive, for example, use the concept of tagging: the idea is that you create a list of tags, then attach one or more tags to each message or document. If you want to see just the messages or documents associated with a tag, you just click that tag in the left-hand navigation window.
This may sound similar to putting files into folders but has the advantage that multiple tags can be associated with each file, and displaying them is as simple as point-and-click. (Using Google and other Internet services has other issues though, which are considered elsewhere).
Keywords and metadata
Taking the concept of tagging a stage further means using applications designed for specific file types. A common requirement is to catalogue and, usually, edit media files. Keywords and/or metadata can be embedded in the files or saved separately in sidecar files and indexed & searched by a database.
Metadata is information in a structured form about the file. For example, an image might contain details about the camera and lens, a title, a location and copyright holder. Or the metadata may be held in a database (a library catalogue system is an example of a large metadata store since it contains information about the books in the library).
Sidecar files are used if information cannot be embedded in the original files. These sit alongside the original file and have the same name but the extension .xmp.
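The naming convention described above (same name, extension .xmp) makes it easy to check which files still lack a sidecar. A minimal sketch, assuming Python 3; the function names and the default list of extensions are illustrative, and some programs use other naming conventions for sidecars.

```python
from pathlib import Path


def sidecar_path(media_file):
    """Return the sidecar path: same folder and name, extension .xmp."""
    return Path(media_file).with_suffix(".xmp")


def files_without_sidecar(folder, extensions=(".tif", ".wav")):
    """List media files in folder that have no matching .xmp sidecar."""
    missing = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in extensions and not sidecar_path(path).exists():
            missing.append(path.name)
    return missing
```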
There are several software packages for managing your research references. Which you choose depends on the functions, especially output, you require. From the data preservation point-of-view, it is important to be able to extract the information from the package's database in a standard open form, which usually means an XML format such as TEI or RDF. Both EndNote and Zotero support XML export of data.
Your university probably runs training courses in managing references. Look on your Intranet or ask your library about their training.
Organising media files
If you work extensively with media files (graphics, images, sound and video files) there are several programs that can help organise them, depending on which operating system you use. You need to check whether the range of file types you use are supported by the program, but some programs can have their capabilities extended using plug-ins: for example Lightroom can support any file type using the AnyFile plug-in and have its metadata handling enhanced with LR/Transporter. Some examples follow but Wikipedia has a more complete list.
- Windows users have a choice of Lightroom, Picasa, MediaPro and DigiKam
- Mac users have Aperture, Lightroom, Picasa, MediaPro and DigiKam
- Linux users have DigiKam
Some software, e.g. Lightroom, allows you to use both embedded and database metadata simultaneously, giving you the best of both worlds: embedded metadata for maximum future-proofing, and databased metadata for speed of searching.
Storing it safely
Keeping all your research data in one place is not a good idea in general. It is essential not to keep your research data only on your hard drive: hard drives inevitably fail, and when yours does you will lose your data. You should always back up your data on at least two more devices or systems (ideally a repository) external to your hard drive.
Ensure that your data is well documented and held on at least two external devices/systems, ideally including an institutional digital repository.
Selective disposal of files will help you to find relevant and up-to-date information and save on back-up time and cost.
Most of your research material - including data and emails - is classed as ‘records’ and will be covered by your University and/or funder’s records retention policy. This may define a retention period for different types of records. In addition, you should consider whether there are legal reasons to keep the material and whether you are responsible for keeping the master copy (as its creator or owner). Also consider whether the material is fundamental to your project, records one-off events that cannot be recreated, or would be useful in further research (by you or others). On an administrative level, a record (such as an email) may provide evidence that you did something and why.
If none of these apply, you should consider whether the material needs to be kept.
If you do choose to delete material, make sure you dispose of it securely.
A checklist on selection and retention is available at http://www.lib.cam.ac.uk/dataman/resources/PrePARe_selection_retention_factsheet_draft.pdf .
You have 4 choices:
- a facility embedded in your email. This has to be done by your email provider, and since few provide this facility, we'll ignore it as a possibility;
- download your messages in XML format, then archive them as documents;
- create PDF/A documents from your messages, then archive them as documents;
- download your messages in text format, then archive them as documents.
XML, PDF/A and plain text can be used as long-term preservation formats. XML is intended to be machine-readable and will retain formatting. PDF/A documents are essentially read-only but are searchable and they preserve formatting. Plain text is human-readable but loses the formatting. If in doubt, use XML as this can be viewed by most Web browsers or freely-available XML viewers, and can be imported into or indexed in other systems.
Few email systems are designed to make this process easy and you will usually have to process your messages one by one. To generate XML you can use XMLprinter from http://xmlprinter.com/
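If your email program can save a message in its raw form (often as an .eml file), the plain-text option can be partly automated. The following is a minimal sketch, assuming Python 3 and its standard email module; the function name email_to_text and the choice of headers to keep are illustrative, and attachments are ignored.

```python
from email import message_from_string, policy


def email_to_text(raw_message):
    """Return the main headers and plain-text body of one raw email."""
    msg = message_from_string(raw_message, policy=policy.default)
    # Keep the headers that say who sent what to whom, and when.
    lines = [f"{name}: {msg.get(name, '')}"
             for name in ("From", "To", "Date", "Subject")]
    # Prefer the plain-text part; HTML-only messages give an empty body.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return "\n".join(lines) + "\n\n" + text
```

The returned text can be written to a .txt file named after the message date and subject. As noted above, plain text is human-readable but loses the formatting of the original.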
Backup is the process of safeguarding your files while you are conducting your research.
Preservation is the process of storing your research material so that others can access it once your research has finished.
You need to guard against system failure, theft and accidental deletion of files. The best way to do this is to keep a second copy (or more) of your files on a different drive and in a different location. Your university probably provides you with storage space on its servers that is regularly backed up to tape, so if you can use this space for your working documents this is a good solution. If this doesn't suit your working style then at least copy your files to the university network reasonably frequently.
The frequency of backup depends on how quickly your files change and how much data you are prepared to lose; anything from 10 minutes to daily could be reasonable. Automate your backups with software if possible, but check every week or two to make sure the backups are happening.
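As an illustration of automating a backup and checking that it worked, here is a minimal sketch, assuming Python 3. The function names are hypothetical, and the backup location is assumed to be outside the folder being copied, ideally on a different drive. Dedicated backup software will usually be a better choice, but the sketch shows the two steps involved: copy, then verify.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_folder(source, backup_root):
    """Copy source into a new timestamped folder under backup_root.

    Every copied file is compared with the original by checksum, and
    the path of the new backup folder is returned.
    """
    source = Path(source)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, target)
    for original in source.rglob("*"):
        if original.is_file():
            copy = target / original.relative_to(source)
            if sha256_of(original) != sha256_of(copy):
                raise IOError(f"backup of {original} does not match")
    return target
```

Each run creates a separate dated copy, so older versions of a file can still be recovered after an accidental deletion or overwrite, at the cost of extra storage space.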
There are Cloud-based storage and backup services but they are only a few years old and so have not built up the trust necessary for long-term storage requirements. Although some offer a small amount of free storage, you can expect to pay for larger amounts.
Since the level of trust is low, the risk of storing sensitive data is high, so if you use personal data or commercially-sensitive information you should consider additional precautions such as encryption.
The British university system is working on a cloud storage system, which is in beta at the time of writing (mid-2012). So cloud-based storage may be available via your university soon.
There are no Cloud-based preservation solutions at the time of writing.
Researchers from all disciplines accumulate material in the course of their research. Considerable time, effort and money are spent in this endeavour. The preservation of research data is essential in order to further research through sharing of the data, to enable validation of results, and to demonstrate the process behind the conclusions and results of research.
First you must decide what the goal of preserving your data is. Since there is little point in preserving data unless someone can access it, you can decide your principal goal by answering the following questions:
Q1. Do you want to get your research material and data to the maximum number of people? If "yes", that is your goal. If "no", go to Q2
Q2. Do you want to restrict access to some or all of your material and data in some way? If "yes", decide in what way. If "no", go to Q3
Q3. Do you want to be in complete personal control of who has access to your material and data? If "yes", you need to be very sure as there are long-term implications for you in doing this. If "no", go to Q4
Q4. What, none of the above? So what is it you want to achieve? If you have alternative goals you should speak to a specialist librarian or archivist in your university who can help you decide.
Once you have decided your goal, only preserve the research material and supporting documentation that directly achieves your goal.
You should also decide when the material should be deleted.
The distinction between short-term and long-term is somewhat flexible but by short-term we generally mean one or two digital generations. For hardware a generation is typically 5 years (in 2012) since this is how frequently we expect to migrate data from one set of hard drives to the next; but a software generation is usually the gap between major releases that make changes to any proprietary formats they use, which may be anything from 2 to 7 years (though backward compatibility often extends the life of older formats). As a general guide, short-term is up to 10 years, long-term means beyond 10 years.
Digital repositories provide online archival storage, caring for and preserving digital materials to facilitate future use. Repositories and data centres may make material available to researchers, or to the wider public on an open access basis. They also help to ensure long-term accessibility by converting files to new formats when old formats become obsolete.
There are many other advantages to using established repositories and data centres:
- It raises the impact of your research. Your research material will be easier to find and more accessible to researchers worldwide. There is also evidence that making material open access can increase your citations.
- The material will be given a stable URL, making linking to material easier and more reliable.
- It can help you comply with funders’ requirements to make research outputs open access, and some funders specify the use of particular repositories.
- Making research data available means that your research can be verified, which increases the credibility of your analysis and conclusions. It makes it easier to link publications with the underlying data.
There is no single UK repository for research data. Instead many are being developed within universities. The OpenDOAR initiative provides a comprehensive list of open repositories worldwide and in the UK: http://www.opendoar.org/index.html
Here are some UK-wide repositories for specific types of data:
http://archaeologydataservice.ac.uk/ The Archaeology Data Service supports research, learning and teaching with freely available, high quality and dependable digital resources. It does this by preserving digital data in the long term, and by promoting and disseminating a broad range of data in archaeology. The ADS promotes good practice in the use of digital data in archaeology, provides technical advice to the research community, and supports the deployment of digital technologies.
http://ota.ox.ac.uk/ The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. It also gives advice on the creation and use of these resources, and is involved in the development of standards and infrastructure for electronic language resources.
http://hds.essex.ac.uk/ The History Data Service (HDS) collects, preserves, and promotes the use of digital resources, which result from or support historical research, learning and teaching. The History Data Service is a successor service to AHDS History which from 1996 to March 2008 was one of the five centres of the Arts and Humanities Data Service.
Yes, you should be able to do this, if your institution has an institutional repository which collects research material. You should enquire of your institution if this is the case.
No, it should be more than adequate to deposit in one repository but it depends on the service offered by the specific repository, e.g. does it guarantee that it will maintain access to the data over time?
Most funding bodies in the UK, and other funders internationally, request or require that their funding recipients preserve their research material in the long term and share some or all of it with the public.
While mandates cover both publications and data, they are most well-developed for dealing with research publications. Funders often specify when a research paper should be archived, what version of a paper should be archived and where. Examples of mandate requirements are:
- Deposit in an institutional repository (IR) (for example this is recommended by the EPSRC)
- Deposit in a discipline repository (e.g. the Wellcome Trust requires publications to be deposited in UK PubMed Central)
- Deposit in the funding body’s own repository (e.g. NERC Open Research Archive (NORA)).
How research data need to be preserved will depend on your funder, and a given funding body may have different requirements for different types of study. An overview of funding bodies’ open access policies for journal articles and research data can be found at the SHERPA/JULIET service ( http://www.sherpa.ac.uk/juliet/index.php). The Digital Curation Centre provides more detailed information about the data policies of the main UK research councils ( http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies).
You will probably own the copyright in research material you have created yourself (text, photos etc), but you should check your institution’s and funder’s policies and, in the case of staff, your terms and conditions of employment. Your university is likely to retain copyright in reports produced for administrative or managerial purposes and work commissioned from photographers, designers, consultants etc.
However, publishing an article or book often involves transfer of copyright from you (as the author) to the publisher; make sure that you read through your publishing agreements carefully and retain a copy for future reference. You may want to try to negotiate with your publisher to retain copyright.
In the case of joint authorship, you will need the consent of each co-author to deposit work on their behalf. If the contributions of each author/creator are distinct, the parts contributed by each have separate copyrights, unless agreed otherwise.
If you have used third-party material in your work you must take steps to ensure that you do not infringe the rights of the copyright holder(s) by depositing that work in a repository. This will apply if you have included extracts (such as a quotation) from documents, reports, papers, data, letters, tables, computer programs, databases, photographs, sound recordings, films and broadcasts (this list is not exhaustive). If you have included an extract from another’s work, you need to acquire written permission from the copyright owner (e.g. the creator or publisher) to reproduce it in your chosen repository. Your university may provide sample permission request letters which can be used to request permission.
Permission from the copyright holder may not be required if:
- a copyright notice that accompanies the work you have taken an extract from allows for this intended use; or
- the quotation or extract or excerpt is from a work that is ‘out-of-copyright’; or
- your use of the extract meets the requirements for ‘fair dealing for the purposes of criticism or review’.
For more information about permitted uses, see http://www.ipo.gov.uk/types/copy/c-other/c-exception.htm .
Remember that different types of material (for example print and audio materials) may be subject to different copyright periods, and that copyright law differs between countries.
More information on copyright ownership, duration and permission is available at http://www.copyrightservice.co.uk/copyright/p27_work_of_others .
If you are considering applying for a patent, you will need to keep your research findings confidential and unpublished. A common reason for the failure of patent applications is premature, non-confidential disclosure of the invention. You should talk at an early opportunity to your Research Office about Non-Disclosure Agreements (also known as confidentiality disclosure agreements). See http://www.ipo.gov.uk/types/patent/p-applying/p-apply/p-cda.htm .
However, this does not mean that you cannot use a repository. Some repositories will allow you to submit your data but embargo its release (restrict access) for a few years. In some situations, a repository may be able to make part of your material open access while keeping the more proprietary parts hidden and inaccessible. See also http://www.ipo.gov.uk/types/patent/p-applying/p-apply/p-paper.htm .
Some publishers will not publish material – such as a book or journal article – if that material has been previously published; this includes it being made available on the internet through a repository. In such cases, you may want to embargo access to the material in the repository until you have published. The publisher may also require an embargo on open access for a certain period after they have published your work. This may be a particular issue for graduate students who wish (or are required) to make their thesis available online through a repository.
Students who wish to put their thesis in a repository should discuss it beforehand with their supervisor to avoid jeopardising any future plans for publication or patent resulting from their work.
A significant number of research funders require that data produced in the course of the research they fund should be made available for other researchers to discover, examine and build upon to allow for new knowledge to be discovered through use, reuse, comparing data and so on. However, you are responsible for deciding what data is legally obliged to be open or closed according to various pieces of legislation such as FOI and data protection. This should be stated at time of deposit.
Research data containing personal information - even sensitive data such as that relating to an individual’s health and ethnicity - can often be shared if researchers obtain informed consent, anonymise data and control access where required ( http://www.data-archive.ac.uk/create-manage/consent-ethics).
You must inform interviewees and other participants how research data will be stored, preserved and used in the long-term and how confidentiality will be maintained if required (e.g. by anonymising data). If you collect personal data, you must obtain written consent for data sharing and re-use from interviewees.
You should consult your University’s Ethics policy and your funder’s policy on sharing and deposit. You should also seek advice on how to control access to personal and/or confidential information. It is good practice to store information about the data's sensitivity and participant consent or use agreements with the data itself to avoid accidental misuse in future.
- A preprint is a pre-refereed and unpublished paper which is usually submitted for publication.
- A postprint is the final peer-reviewed version of a paper. The postprint incorporates any changes or corrections necessary to ensure publication. It is the author’s copy of the paper, not the published PDF version.
- The publisher PDF is the formatted PDF file as created by the publisher.
- An eprint is the umbrella term for an electronic copy of a paper, which can include both preprints and postprints.
For advice about managing personal versions and revisions please see the LSE's Version Toolkit for authors, researchers and repository staff ( http://www2.lse.ac.uk/library/versions/VERSIONS_Toolkit_v1_final.pdf ).
Preprints and postprints are terms used by repositories and publishers to describe which version of a publication – generally a journal article – the author(s) can put into a repository.
Unfortunately the definitions of preprints and postprints are not universally agreed, and this can cause confusion.
The principle is to use a format with the maximum chance of being readable by computers in decades' time. This means using formats that are based on open standards, such as OpenDocument or PDF/A. Your university may provide detailed advice for each category of file, or you can get a basic list at the British National Archives: http://www.nationalarchives.gov.uk/documents/information-management/file-formats-for-transfer.pdf.
PDF/A-1a is an ISO standard created for the long-term preservation of documents, and converting text-based documents to PDF/A-1a will help preserve their appearance over time.
PDF/A-1a is not suitable for all types of document, as it does not allow:
- any audio and video content;
- any encryption
You should also note that all fonts are embedded within the document and this means your use of fonts is limited to those that are legally available for open, unrestricted use. Popular fonts (such as Times New Roman) can be embedded without problems but if you use other fonts you must check the converted PDF/A document. Open the document in Acrobat Reader, right click and select Document Properties, then click the Fonts tab. All the fonts listed should be labelled ‘(Embedded Subset)’.
How do I convert my document to PDF/A-1a? The following list is not exhaustive. You can create PDF/A-1a with other applications.
OpenOffice
- Open a document in Writer, Calc, Draw or Impress
- Click File > Export as PDF
- Ensure the 'PDF/A-1a’ checkbox is checked
- Click the Export button
- Add a suitable filename
- Open the document in Acrobat Reader and check that the formatting has not changed.
Microsoft Word (2010)
- Go to the File tab and choose the ‘Save as’ option
- From the ‘Save as type’ drop-down menu, choose ‘PDF (*.pdf)’
- Make sure that the ‘Optimize for:’ option has the ‘Standard (publishing online and printing)’ option checked.
- Click on the ‘Options…’ button
- Look at the bottom of the Options box and make sure the ‘ISO 19005-1 compliant (PDF/A)’ checkbox is ticked under ‘PDF options’. (Note that ISO 19005-1 is the standard that defines PDF/A-1, including the PDF/A-1a conformance level.)
- Click the ‘OK’ button to close the window.
- Click ‘Save’ to save the file (making sure that you have chosen a sensible file name in the process).
You have now created a PDF of your Word document with embedded fonts.
The process is similar for other Microsoft Office products, including Excel and PowerPoint. Although the exact appearance of the drop-down menus differs, the options that you need to select remain the same.
If you have Adobe Acrobat Professional installed then Word will have an ‘AdobePDF’ menu, which you can use to produce a PDF/A.
- Open this ‘AdobePDF’ menu and select ‘Convert to Adobe PDF’
- Click the Options button at the lower right and then check that the ‘Create PDF/A-1a’ option is ticked
- Click ‘OK’
- Add a suitable filename and check that it has the extension .pdf.
Formats that use compression should be avoided if they involve losing information (known as lossy compression). This includes MP3 for audio and JPEG for images. Once the data has been compressed there is no way to recover the information that was discarded. Also, each time the file is opened and re-saved, further loss will occur, so the sound or image quality will gradually be degraded. You may not notice this at first but it will eventually be obvious. So for long-term preservation, avoid formats that use lossy compression.
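The difference between lossless and lossy compression can be shown with a toy example. This is a minimal sketch, assuming Python 3; the "lossy" step here simply throws away the lower half of each byte and is not how MP3 or JPEG actually work. It only illustrates that discarded information cannot be brought back.

```python
import zlib

# Stand-in for 8-bit audio samples or image pixel values.
samples = bytes(range(0, 256, 3))

# Lossless: decompressing gives back exactly the bytes we started with.
restored = zlib.decompress(zlib.compress(samples))


def lossy_encode(data):
    """Toy lossy step: keep only the top 4 bits of each value."""
    return bytes(value >> 4 for value in data)


def lossy_decode(data):
    """Scale the stored 4-bit values back up to the 8-bit range."""
    return bytes(value << 4 for value in data)


# Lossy: the result has the same length but the fine detail is gone.
approximate = lossy_decode(lossy_encode(samples))
```

Here restored is identical to samples, whereas approximate is only close to it: the detail removed by lossy_encode is not stored anywhere, so no later processing can restore it.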
These are examples of OpenDocument format files, .odt indicating OpenDocument text and .ods indicating OpenDocument spreadsheet. OpenDocument formats are becoming widely supported by office applications such as Word and OpenOffice, so use the .odt format in preference to proprietary formats whenever possible.
File extensions starting .od indicate a losslessly compressed file whereas those starting .fod indicate an uncompressed file. Although compression is generally frowned on for long-term preservation, in this case it is acceptable as the compression algorithm is also an open standard.
Further details on Open Document formats can be found at https://www.oasis-open.org/standards/
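The compressed .odt and .ods files are ZIP packages, which means that their contents can be inspected with standard tools even without an office application. A minimal sketch, assuming Python 3; the function name describe_odf_package is hypothetical.

```python
import zipfile


def describe_odf_package(path):
    """Return the declared MIME type and member names of an ODF package."""
    with zipfile.ZipFile(path) as package:
        names = package.namelist()
        # ODF packages declare their type in a small file called 'mimetype'.
        if "mimetype" in names:
            mimetype = package.read("mimetype").decode("ascii")
        else:
            mimetype = None
    return mimetype, names
```

For a text document the first value would normally be application/vnd.oasis.opendocument.text, and the list of names would include content.xml, which holds the text itself as XML.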
Although most research material and data are already digital, there is often some material that needs preserving that is in some other non-digital form. Examples could be field notebooks, lab books, audio and video tapes, photographs, photocopies, and original objects. These materials will need to be digitised in order to preserve them in a digital repository.
An important distinction to make when considering digitisation is between the value of the object and the value of the content. For example, if it is important to record the condition of the binding of an old book this would be an object value, but if it is important to be able to read the text in the pages of a book, this would be content value.
- Digitising the content of an item can be done with consumer-level equipment and a bit of practice with the technology
- Capturing the object value of an item requires a higher level of equipment and skill, so it may be better to pay a specialist to do it
- Images of text-based documents can have extra value if they are processed using optical character recognition software
- High-quality audio and video digitisation requires skill and practice.
Planning (Start early)
At the start of the project, think about what you will do with your research outputs at the end. You might find it helps to write a data management plan, and this may have been a requirement of your funder. The Data Management Plan can guide you through many of the issues dealt with in these FAQs.
Don’t think that you need to deal with everything by yourself. There are lots of sources of help and information, for example the DCC ( http://www.dcc.ac.uk/), DPC ( http://www.dpconline.org/), DMP Online ( https://dmponline.dcc.ac.uk/), and repository, data centre and library staff. You may also want to get information relating to your specific circumstances from specialists in copyright, data protection or patents, for example. By getting support and specialist advice on all aspects of managing and preserving digital material, you can gain increased confidence in your research plans.
The process might still seem very time-intensive, but it might help to remember that by addressing issues early on in a research project you are making things easier in the long run. Leaving things till the end of the project may make it seem quicker at the time, but will make some issues much harder to deal with effectively and take much longer to sort out.