Research Data @Essex - Pushing research data management forward at the University of Essex

DMP scoring system: RELU DMPs, 2005-2010

The Relu Data Support Service (Relu-DSS) helped researchers of the Rural Economy and Land Use (Relu) Programme manage their data throughout the research lifecycle. For the duration of the Relu Programme, the Relu-DSS provided proactive data management guidance and support to the Programme's researchers and co-ordinated the archiving of interdisciplinary data collections.

Relu was the first cross-council multi-million pound research programme to fund a dedicated support service, co-ordinated between the UK Data Archive and the Centre for Ecology and Hydrology (CEH) at Lancaster, to realise its data sharing policy. It also promoted the first programme-level research data policy which expected all its researchers - with support from the Relu-DSS - to plan a data management strategy for their research and manage their data well throughout their research, with data archived at data centres funded by the Research Councils (ESRC and NERC) when the research finished.

All research projects funded by the Relu Programme developed a data management plan.  They were also given bespoke advice on preparing and implementing data management plans.

In their Relu data management plan researchers had to describe:

  • the need for access to existing data sources
  • data planned to be produced by the research project
  • planned quality assurance and back-up procedures for data
  • plans for managing and archiving research data
  • expected difficulties in making data available for secondary research (through data archiving) and measures to overcome such difficulties
  • who holds copyright and Intellectual Property Rights of the data
  • data management responsibility roles within the research team

View data management plan template (see Section 3 as DMP combined with Stakeholder and Communications Plan)

View example plans

Relu-DSS reviewed data management plans, made an assessment and gave feedback and support to the research teams when needed. During the review we considered whether:

  • information provided on the data planned to be produced is adequate and realistic according to the proposed research and methodology
  • all relevant data management aspects have been considered, with meaningful information provided in the plan
  • where difficulties are anticipated to make data available for archiving,  possible solutions have been suggested
  • all possible obstacles to sharing data have been considered, such as ethical limitations and copyright ownership
  • a team member with data management responsibility is in place at each participating institution

For each of the six DMP questions (3.2 to 3.7) we scored 1-3, with

1= Insufficient: severely lacking clarity or detail

2= Adequate, but more information needed

3= Excellent, no more information required

Each initial plan submitted could score a maximum of 18 and a minimum of 6.  We did not sign off any plans until we were happy that all had eventually scored 18!  In some cases this required iteration and, rarely, some coercion from the Programme Director for them to complete the information.
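For anyone wanting to reuse the scoring approach, here is a minimal sketch of the arithmetic in Python. It is only an illustration: the question labels follow the 3.2 to 3.7 numbering above, but the function names and the example scores are invented, and this is not the tool we actually used.

```python
# Illustrative sketch only: question labels follow the 3.2-3.7 numbering
# mentioned above; function names and example scores are hypothetical.
QUESTIONS = ("3.2", "3.3", "3.4", "3.5", "3.6", "3.7")
LABELS = {1: "Insufficient", 2: "Adequate", 3: "Excellent"}


def score_plan(scores):
    """Return (total, questions still needing work) for one plan review."""
    missing = [q for q in QUESTIONS if q not in scores]
    if missing:
        raise ValueError(f"no score recorded for: {', '.join(missing)}")
    for q in QUESTIONS:
        if scores[q] not in LABELS:
            raise ValueError(f"question {q}: score must be 1, 2 or 3")
    total = sum(scores[q] for q in QUESTIONS)
    needs_work = [q for q in QUESTIONS if scores[q] < 3]
    return total, needs_work


def ready_for_sign_off(scores):
    """A plan is signed off only when every question scores 3 (total 18)."""
    total, needs_work = score_plan(scores)
    return total == 3 * len(QUESTIONS) and not needs_work


# Invented scores for a first draft of a plan.
first_draft = {"3.2": 3, "3.3": 2, "3.4": 1, "3.5": 3, "3.6": 2, "3.7": 3}
total, needs_work = score_plan(first_draft)
print(total, needs_work)                # 14 ['3.3', '3.4', '3.6']
print(ready_for_sign_off(first_draft))  # False
```

Listing the questions that scored below 3 is what drives the feedback loop: those are the sections sent back to the research team for another iteration.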

We also assessed which existing data sources they stated were to be used (Q 3.1) to check whether researchers were seeking to purchase expensive third party data sources, and assess whether there may be a case for purchase of a programme-wide license.

Our view in administering the whole process is that this system works very well, and that the assessment should be done by someone who is familiar with the kinds of data being produced, so that any shortcomings in the plan are apparent.  For this particular Programme, mid-term checks on progress with DMPs were not done; we would advise carrying them out to prevent any problems with data sharing at the end of the award periods.

Research data asset inventory [download]

In a follow up to our earlier post on the "10 questions...", we thought we'd share our full data inventory form. Though geared toward self completion, it doubles as a comprehensive interview schedule.

This document builds on a similar form based around the Data Asset Framework and originally devised to inventory ESRC research centres (read more about the DMP-ESRC project here). We have added to and adapted this, while considering the need for a broader, whole-institution approach.

Attached to this post is an XLSX version of the inventory - feel free to adapt and re-use.

Click here to download:
UKDA_RDE_DataInventory_01-00.xlsx (44 KB)

Note: the above is not a permanent link - this file will be added to our project page in due course.

EPrints for data: webinar report

Last Wednesday (23rd March) we held an ‘EPrints for data’ webinar for the JISC MRD community, to discuss current approaches and challenges. Present in the ‘virtual room’ were representatives of various repository/JISC projects, along with members of the EPrints and SWORD development teams. Lots of interesting insights came out of the discussion - check out the summary below to find out what you missed.

Many institutional repository projects seem to be in a similar situation to us at Essex – they have an EPrints IR and now want to prepare for data. Parallel to this, the EPrints team are developing improvements to help those wishing to ‘do data’ with EPrints. The SWORD team meanwhile are working toward adapting the protocol to be more suited to data transport. Given the shared interests of these various endeavours, meetings such as this seem particularly crucial, ensuring an exchange of knowledge and ideas. In addition to finding a standard cross-institution approach, it was suggested that it will be worthwhile to consider how best to accommodate the gamut of disciplines represented in UK academia.

The Institutional Data Management Blueprint (IDMB) project was mentioned several times during the discussion, and the outputs are certainly a worthwhile read for anyone interested in the topics covered below.

Data collections

A large portion of the discussion focused on difficulties with presenting the hierarchical relationships between studies, data collections and the multiple associated data and documentation files at the lowest level. While there was no definite solution proposed, awareness was raised around the potential complexity of data produced by a single project. A pragmatic solution might be to follow the eCrystal model and use a simple type classification for datasets within an eprint. The Kultivate Containers plugin is similar to this technically, but works at a different level, tagging a selection of EPrints with a shared type. The implementation of more integrated grouping for related items is being considered by the EPrints team, a move which would also necessitate careful consideration of the knock-on effects on the discovery/searching architecture and how exactly to maintain a positive user experience.

Metadata

Louise Corti (Essex) described the need for multilevel metadata, and the process of ‘drilling down’ to different levels as necessary. The EPrints team’s current thinking, again based on IDMB, is of three kinds of data within an eprint of dataset-type: primary; additional metadata; and readme. This idea was well received by the group. Wendy White (Soton) noted that a consultation during the IDMB project found that archaeology researchers believe the best way to ‘do metadata’ is to add an xml file containing anything that doesn’t fit within the provided metadata schema.
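To make the three-way split a little more concrete, here is a rough Python sketch of how files in a dataset-type eprint might be tagged and grouped. The three role names come from the discussion; the class, function and file names are made up for illustration and bear no relation to actual EPrints code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three roles come from the discussion above; everything else
# (class names, file names) is a hypothetical illustration, not EPrints code.
ROLES = ("primary", "additional_metadata", "readme")


@dataclass(frozen=True)
class DataFile:
    name: str
    role: str

    def __post_init__(self):
        # Reject anything outside the three agreed kinds of data.
        if self.role not in ROLES:
            raise ValueError(f"unknown role {self.role!r} for {self.name}")


def group_by_role(files):
    """Group a dataset's files by role so each level can be drilled down to."""
    grouped = defaultdict(list)
    for f in files:
        grouped[f.role].append(f.name)
    # Always return all three roles, even when a role has no files.
    return {role: grouped.get(role, []) for role in ROLES}


dataset = [
    DataFile("interviews.csv", "primary"),
    DataFile("site_context.xml", "additional_metadata"),
    DataFile("README.txt", "readme"),
]
print(group_by_role(dataset))
```

The extra xml file suggested by the archaeology researchers would simply sit in the "additional_metadata" group in a scheme like this.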

Les Carr (Soton) ran through some of the key points of his ‘Mind The Gap!’ presentation, based around the IDMB project findings, particularly highlighting the 'tug of war' between metadata pragmatism and completeness. Les presented the viewpoint that a small amount of clear, realistically acquirable metadata is better than a lot of rubbish metadata entered under duress.

EPrints updates and plugins

We learnt that the EPrints team will be releasing data-specific enhancements to the software at some point in the next few months (no set date at this point!), which will be based around some of the ideas mentioned above.

David Tarrant (Soton) gave some useful information on the Bazaar system (an app store a-like delivering EPrints plugins), explaining that packages are self-publishable through EPrints – minimising boundaries between developers and the community. Documentation for this process is available on the Bazaar webpage.

SWORD and VREs

Richard Jones (Cottage Labs) explained that since the SWORD2 standard roll-out the team have started considering what adaptations might be required for data. A data version is unlikely to appear in the near future, however. SWORD experimentation continues in other areas, e.g. deposit using content references rather than objects, and content-specific tailoring (e.g. SWORD-ARM).

Richard reminded us that the focus of the SWORD project is firmly on transport. DataStage (part of the DataFlow project) was suggested as a good candidate to provide a virtual environment for user-side data management and deposit. This is standalone from the DataBank element of the DataFlow project and, as it features an implementation of the SWORD2 standard, could sit on top of an EPrints repository. Opinion seemed to be that integration into workflows is perhaps the most interesting use of SWORD. A suggested ‘vision’ is the implementation of tweaked versions of a very general model for different research groups’ data requirements. These instances could be left to run long term, interacting with the repository without need for direct interaction.

Disclaimer: I have attempted to report the meeting as accurately as possible, but please bear in mind that the above is a significantly condensed interpretation of what was said rather than a word-for-word transcription.

RD@Essex Progress Report (March 2012)

What have we been up to recently? Well, quite a lot really:

  • Completed data interviews with UoEssex researchers and research leaders from the four pilot departments
  • Began drafting a full report and recommendations based on these interviews, targeted at both Essex and the wider MRD community
  • Collated sample data from departments ready for trial ingest into our (now functional) EPrints test bed
  • Started some experimental metadata mapping with EPrints default, DC and DDI, while considering minimum metadata requirements for research data
  • Initial tests with SWORD2 powered deposit
  • Ingest of dummy files
  • Submitted a proposal for the OR2012 EPrints panel
  • Arranged an MRD2 programme webinar for EPrints users (we will report back here)

 

We'll be posting our full data inventory form soon, and will report some more details on the metadata front as we progress.

EDIT: If you're interested in hearing more about anything mentioned in this post, get in touch with us via Twitter - @RDEssex - we're more than happy to chat!

Advance warning: the Twitter juggernaut has subsumed Posterous, and as a result I'm doubtful of the service's long-term future. We may look at migrating this blog to Wordpress in the near future!

Testing the Swordv2 PHP library with Eprints 3.3.8 on Debian Squeeze

One of the tools we’d like to trial is the use of Sword2 for data-ingest.  

Sword is an acronym for Simple Web-service Offering Repository Deposit.  It is a JISC funded profile of the Atom Publishing Protocol that enables a remote CRUD (Create, Retrieve, Update, Delete) service.  Essentially Sword enables the remote manipulation of repository contents.  

Sword2 is intended to be used with a client and the client relies on a repository service document to tell it how to interact with the server. In Eprints  the service document is at: https://myeprintsrepo/sword-app/servicedocument
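To give a feel for what a client does with a service document, here is a small Python sketch that lists the collections it advertises. The XML is a hand-written sample built on the standard Atom Publishing Protocol structure, not real EPrints output, and the hostname, titles and collection path are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the Atom Publishing Protocol and Atom.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
}

# Hand-written sample, NOT real EPrints output: hostname, titles and
# collection path are placeholders chosen for this illustration.
SAMPLE_SERVICE_DOCUMENT = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Example repository</atom:title>
    <collection href="https://myeprintsrepo/example/inbox">
      <atom:title>Inbox</atom:title>
      <accept>*/*</accept>
    </collection>
  </workspace>
</service>
"""


def list_collections(service_xml):
    """Return (title, href) pairs for every collection in a service document."""
    root = ET.fromstring(service_xml)
    collections = []
    for coll in root.findall("app:workspace/app:collection", NS):
        title = coll.findtext("atom:title", default="", namespaces=NS)
        collections.append((title, coll.get("href")))
    return collections


print(list_collections(SAMPLE_SERVICE_DOCUMENT))
```

A real client would fetch the document over HTTP with authentication first; that step is left out here so the sketch runs without a repository.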

To help build clients a PHP library is being developed by Stuart Lewis.  The library comes bundled with a test website that allows simple testing between a Sword2 test client and EPrints.  When depositing, the PHP library packages up a zip archive of the files to be uploaded and writes an atom entry that describes the object.  EPrints then uses the atom entry to generate the necessary metadata.  
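The packaging step can be sketched in a few lines. The example below is in Python rather than PHP and does not reproduce the library's API; it just shows the general idea of a zip archive plus an Atom entry carrying descriptive metadata. Which elements EPrints actually maps is not covered here, so the choice of title, creator and abstract is an assumption, as are all the function names.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("atom", ATOM)
ET.register_namespace("dcterms", DCTERMS)


def build_atom_entry(title, creator, abstract):
    """Build a minimal Atom entry (element choice is an assumption)."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS}}}creator").text = creator
    ET.SubElement(entry, f"{{{DCTERMS}}}abstract").text = abstract
    return ET.tostring(entry, encoding="unicode")


def build_zip(files):
    """Pack a {filename: bytes} mapping into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Dummy content, in the spirit of our "ingest of dummy files" tests.
entry_xml = build_atom_entry("Dummy dataset", "A. Researcher", "Test deposit.")
package = build_zip({"data.csv": b"id,value\n1,42\n", "README.txt": b"Dummy."})
print(entry_xml)
print(len(package), "bytes in package")
```

The two pieces are what get sent to the repository; how they are posted (headers, multipart or separate requests) is up to the client and the Sword2 profile.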

In view of the shortage of EPrints documentation, this short how-to might help those struggling to set-up a Sword2 client for testing on an EPrints 3.3.8 test server.

Testing Swordv2

If it isn’t already installed, grab php5

apt-get install php5  

Make a directory for the Swordv2 PHP library to live in

mkdir /usr/share/swordv2

Download the library to this folder from https://github.com/stuartlewis/swordappv2-php-library and unzip.

We then need to tell EPrints that Apache will be serving a PHP directory.   First create the file urls.pl

touch /usr/share/eprints3/archives/<your repo archive id>/cfg/cfg.d/urls.pl

Then add a rewrite exception to the file using your favourite editor

vim urls.pl

$c->{rewrite_exceptions} = [ '/swordv2/'];

Next we need to set up an Apache alias so that we have a readable URL, e.g.

https://eprints3instance.ac.uk/swordv2/

Create the file

[eprint install dir]/archives/[archive]/cfg/apachevhost.conf

Then add the following directives

Alias /swordv2/ /path/to/swordv2/
 <Location "/swordv2">
    AddHandler php5-script php
    DirectoryIndex index.php
 </Location>
 <Directory /path/to/swordv2>
    Order allow,deny
    Allow from all
 </Directory>

Next open up /path/to/swordv2/test/website/config.php and replace the location with the file path to this website directory

$_SESSION['location'] = '/path/to/swordv2/test/website/';

Make sure the trailing slash is included, otherwise PHP will be unable to complete the filepath.

PHP will want to move files to be packaged to /files but the directory doesn’t exist and PHP won’t create it. So we create it and set permissions so that PHP can upload and package our content

mkdir /path/to/swordv2/test/website/files

cd /path/to/swordv2/test/website/

chmod 0777  files
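The two gotchas above (the trailing slash and the missing files directory) are easy to forget, so a quick pre-flight check can help. This is a hypothetical helper written in Python, not part of the Swordv2 library, and it checks writability for the user running the script, which may not be the user Apache runs PHP as.

```python
import os
import tempfile


def check_test_site(location):
    """Return a list of problems with the test website setup (empty if none)."""
    problems = []
    # config.php joins this value with file names, so it must end in a slash.
    if not location.endswith("/"):
        problems.append("location has no trailing slash")
    files_dir = os.path.join(location, "files")
    if not os.path.isdir(files_dir):
        problems.append(f"missing directory: {files_dir}")
    elif not os.access(files_dir, os.W_OK):
        # Checked for the current user only, not the web server user.
        problems.append(f"directory not writable: {files_dir}")
    return problems


# Demonstration against a throwaway directory rather than a real install.
with tempfile.TemporaryDirectory() as site:
    print(check_test_site(site))        # reports both problems
    os.mkdir(os.path.join(site, "files"))
    print(check_test_site(site + "/"))  # no problems left
```

Against a real install you would pass the same value you put in $_SESSION['location'].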

In a browser visit the url https://eprints3instance.ac.uk/swordv2/

You should see the following screen:

[Screenshot: Swordv2 test client]

Understanding a researcher’s data in 10 questions

Having almost completed the data inventory portion of our investigation, it seems a good time to reflect and explain a little about our approach.

The Archive has previously developed a data interview methodology for the JISC DMP-ESRC project, which formed an excellent basis on which to develop ours. I also spent quite a bit of time reading up on other institutional approaches, particularly those based on the DAF framework (some good examples in the methodology document). There were of course many recurring questions across these various examples. The version we settled on focused down on a critical series of questions while taking an unstructured approach to the interviews. This helped maintain an atmosphere of openness with interviewees, and allowed us to better squeeze into busy schedules.

I've put together 10 questions, based on a more expansive guide-sheet, which I think cover pretty much all the essential ground in a data inventory exercise:

1. What are the types of data handled, and how are they acquired?
2. What data formats are used and are they open or proprietary?
3. To what extent are datasets accompanied by documentation?
4. How are the data stored, and with what level of security?
5. How is versioning of stored data carried out?
6. Have issues of ethics and consent been sufficiently considered?
7. Have issues of copyright and intellectual property been sufficiently considered?
8. Is data shared with anyone inside and/or outside of the University?
9. How confident would you feel completing a data management plan?
10. What could the University do to make managing your research data easier?

It is worth noting that it might pay to have an example of all of the above in memory, in order to illustrate your question to someone not familiar with the terminology. I found a useful intro to be just sitting at a computer with the interviewee and letting them explain the process in their own terms. This can then be followed up with more focused discussion.

In light of this exercise, I will be thinking more about the role of the data interview. While it is tempting to launch straight into rigorous questioning, I now think this kind of work should be treated with a great deal more subtlety. The key aim, as far as a project such as RDE is concerned, is to tease out those data challenges. This will only happen with mutual respect and patience.

Choosing a data repository platform: initial thoughts on EPrints

http://rdm.c4dm.eecs.qmul.ac.uk/platform_choice

I've extended the neat table he put together to include EPrints. Although EPrints has until recently been fairly focused on publications, given the specs it's hard to see how it couldn't work very well as a data repository too. Early ingest attempts are awaited with interest.

Have a look at the comparison table here. I've left it open for edits, so please do add to it if the mood takes you.

 

For the second part of this blog post, I wanted to briefly highlight some great examples of EPrints in action that our developer Alexis Wolton showed us at a recent team meeting. They nicely show off what the platform is capable of.

1. eCrystals from Soton, which cleanly presents complex sets of grouped files on a single page.

2. UAL Research Online uses the Kultur plugin to heavily customize the default EPrints appearance, meeting the requirements of a visual arts collection.


3. Serpent uses EPrints’ flexible item classification system to create a highly structured species-level taxonomy.


Metadata session feedback: MRD 2011-13 Programme Launch Meeting

I have volunteered to blog this session. So here it is. BTW this is my first blog!

In the thematic parallel session on Friday afternoon on METADATA, we shared and discussed which metadata standards we expected to use in our projects and ideas about which ones we could all be using, and ones we might not know about...but should.

Some fruitful exchanges were had over quite specific metadata standards for data description - schemas and specifications - and on the use of controlled vocabularies. Without these, powerful resource discovery is impossible.  Some of the metadata standards mentioned were:

Dublin Core; DataCite; CERIF; Data Documentation Initiative (DDI; social science, health); INSPIRE (geo); MOLES (NERC); Core Scientific Data Model (CSDM) and I-CAT usage (STFC)

and Ontologies:

Climate Science project (name?); HASSET (social science thesaurus); SNOMED CT; does YouShare project have an ontology for software types? 

Our session agreed that we should try to agree, quite soon, on the use of some common top level generic descriptive elements/fields - e.g. for  'study/ research project/investigation' level metadata.  On the spot we agreed that time period and geo-location would be vital. As we delve deeper into richer data-level description, different schemas would be appropriate.  However, even across specific domains we should try to ensure the use of commonly agreed schema, or at least undertake some mappings.  Most importantly we should be working with any international standards and not seek to write our own, if they already exist.  We should also see which ones have been RDFised.
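As a thought experiment, a shared set of top-level fields could be checked mechanically. The Python sketch below borrows the field names "temporal" and "spatial" from DCMI Metadata Terms purely for illustration; the session did not settle on a schema, so the required list and the sample record are assumptions.

```python
# Field names borrow DCMI Metadata Terms ("temporal", "spatial") purely as an
# illustration; the session did not settle on a particular schema.
REQUIRED_STUDY_FIELDS = ("title", "creator", "temporal", "spatial")


def missing_fields(record):
    """List required study-level fields that are absent or blank."""
    missing = []
    for field in REQUIRED_STUDY_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Invented study-level record with the geo-location left blank.
study = {
    "title": "Example study",
    "creator": "A. Researcher",
    "temporal": "2005/2010",
    "spatial": "",
}
print(missing_fields(study))  # ['spatial']
```

Richer, domain-specific description would then hang off records that pass a minimal check like this one.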

The session agreed to pool its knowledge by completing a grid of metadata standards being used/planning to be used by all projects.  

ACTIONS:

  1. Louise would take a first pass at a grid template and send this round for comment
  2. Projects working in similar domains to consult each other about the use of metadata, early on
  3. Simon to organise a Programme meeting to be held in early Spring 12 to discuss metadata further and gain some agreement.

 

Well, that really wasn't so painful!

EDIT: Regarding action point 1, there is now an open spreadsheet on JISC MRD project metadata usage

Research data management benefits and metrics: MRD 2011-13 Programme Launch Meeting

Day 2 of the JISC MRD2 programme meeting, and discussions are focusing in on the benefits of research data management for researchers, support services and institutions, and metrics for measuring these. Team discussions have already yielded a 3 point benefit evidence gathering plan for RD@Essex:

1. Benefit for researchers and teams: increase DM skills and awareness of support services and tools. Metric: quantify increase in quality of data management plans using a scoring methodology (previously developed at UK Data Archive).

2. Benefit for research support services: better awareness of researchers' DM and sharing needs and streamlined/updated advice and guidance on DM. Metric: researchers' use of services and training provided.

3. Benefit for institutions and scholarly communication: visibility of research assets. Metric: number of datasets and data citations (the latter as a long term measure).

We also discussed benefits for early career researchers, which we think might be particularly significant, for example as exposure to research methods and approaches.

Meeting the departments (and their data)

As our data seeking tendrils begin to move through the University, we have begun work on the data inventories that will form a foundation for a University of Essex data management strategy. Our hapless victims (or rather, eager volunteers) have proven both willing and able, and we can already make a few initial observations. 

The Department of Language and Linguistics hold mainly qualitative data, and the video aspect of this will be an interesting challenge. Perhaps more significant, though, will be the complex access issues surrounding potentially disclosive data - something we are very familiar with at the UK Data Archive with regard to qualitative social science research. On the Biological Sciences side, it seems one of the most daunting collections will be the vast mass of proteomics and transcriptomics data (a single experiment can generate many GB per day). Over the coming months we will be going into a lot more detail on all the issues mentioned above as we commence in-depth interviews with data holders within each department.

As an aside, I (Tom) have been at a JISC conference this week marking the end of a geospatial project I have been involved with (the UGeo project). Lots of people there actively involved in digital repository projects, which has got me thinking about how geospatial could add value to an outward facing IR. One possibility would be something like ShareGeo's bounding box search, which utilises the spatial extent of the data. 
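For the curious, the core of a bounding box search is a simple overlap test on each dataset's spatial extent. The Python sketch below is my own illustration of that idea, not ShareGeo's implementation; the dataset names and coordinates are invented, and it ignores boxes that cross the antimeridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Spatial extent in decimal degrees."""
    west: float
    south: float
    east: float
    north: float


def intersects(a, b):
    """True when two boxes overlap or touch (no antimeridian handling)."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


def search(datasets, query):
    """Return the names of datasets whose extent intersects the query box."""
    return [name for name, extent in datasets.items() if intersects(extent, query)]


# Invented extents for two imaginary deposits.
datasets = {
    "Essex coast survey": BBox(0.5, 51.5, 1.3, 52.0),
    "Lake District soils": BBox(-3.4, 54.2, -2.7, 54.7),
}
query = BBox(0.0, 51.0, 1.0, 51.8)
print(search(datasets, query))  # ['Essex coast survey']
```

In an IR this would only work if a spatial extent were captured as metadata at deposit time, which ties in with the geo-location point from the metadata session.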

[Image: map of collaborator links]

Something I immediately envisage is the mapping of collaborators (see example in image above); a great way of demonstrating a University's active role in a global research community. Just ideas for now, but this is something I hope to return to at a later date.