Researching e-Science Analysis of Census Holdings
The ReACH e-Science workshop series has now finished its reporting
phase, and the resulting report has been submitted to the AHRC.
A copy of the full version of the final report delivered to the AHRC can be found here: ReACH Report.
The project aimed to:
- highlight issues regarding the application of e-Science technologies to Humanities datasets;
- develop a project proposal for full scale analysis of Ancestry.co.uk's historical datasets utilising Research Computing facilities at UCL;
- bring together a wide range of interdisciplinary expertise to ensure best practice;
- highlight any issues of concern which would preclude a large scale project from being useful or successful;
- ascertain the historian's view of the benefits and concerns in undertaking a larger scale project;
- predict the form and type of results which would emanate from a future project with the available datasets;
- ascertain the comprehensiveness and accuracy of the available transcribed datasets.
The results of the well-attended workshop series were a sketch for a potential project and recommendations regarding the implementation of e-Science (high performance computing) technologies in this area. However, it was not thought possible at this time to pursue the potential project in the following e-Science call from the AHRC, for a variety of reasons which are elucidated in this report. As a result, the ReACH workshop series proposes the following recommendations:
For the historian:
- Although there has been much financial, industrial and academic investment in the creation of digital records from historical census data, neither the quantity nor the quality of the information currently available is sufficient to allow useful and usable results to be generated, checked, and assessed by undertaking automatic record linkage across the full range of census years. This will change as more data is digitised (and becomes available to the general research and genealogical community through publicly available websites).
- The potential for high performance processing of large scale census data is considerable, and may result in useful techniques and datasets (for both historian and genealogist) when adequate census data becomes available. This should be revisited in the future. Neither access to computational techniques and expertise nor managerial issues are the limiting factors here.
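As an illustration of what automatic record linkage across census years involves, a minimal sketch follows. It is not the method proposed by the workshop series: the record fields (`name`, `age`), the function names, the thresholds, and the sample records are all hypothetical, and a real project would need far richer evidence (birthplace, household members, occupation) and careful evaluation of false matches.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_records(earlier, later, years_apart, name_threshold=0.85, age_tolerance=2):
    """Pair each record in `earlier` with its best candidate in `later`.

    A candidate qualifies only if its reported age is within `age_tolerance`
    years of the expected age and its name similarity reaches `name_threshold`.
    Records with no qualifying candidate are paired with None.
    """
    links = []
    for rec in earlier:
        expected_age = rec["age"] + years_apart
        best, best_score = None, 0.0
        for cand in later:
            # Reported ages in transcribed returns are often a year or two out.
            if abs(cand["age"] - expected_age) > age_tolerance:
                continue
            score = name_similarity(rec["name"], cand["name"])
            if score >= name_threshold and score > best_score:
                best, best_score = cand, score
        links.append((rec, best, best_score))
    return links


# Hypothetical transcriptions from two census years, ten years apart.
census_earlier = [
    {"name": "Mary Ann Hodgson", "age": 24},
    {"name": "John Smith", "age": 40},
]
census_later = [
    {"name": "Mary Anne Hodgson", "age": 35},  # spelling variant, age off by one
    {"name": "John Smith", "age": 70},         # same name, implausible age
    {"name": "Thomas Baker", "age": 30},
]

links = link_records(census_earlier, census_later, years_apart=10)
for rec, match, score in links:
    print(rec["name"], "->", match["name"] if match else None, round(score, 2))
```

Even in this toy form, every record in one year is compared with every record in the other, which is why linkage across complete national censuses becomes a candidate for high performance computing, and why transcription quality directly limits what can be linked.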
For researchers in e-Science and the Arts and Humanities:
- The high performance computing and e-Science communities are very welcoming to researchers in the Arts and Humanities who wish to utilise and engage with their technologies. There is also potential for research in the arts and humanities to inform research in the sciences, particularly in areas such as records management, information retrieval, and dealing with complex and fuzzy datasets.
- Often the problems facing e-Science research in the arts and humanities are not technical. Although there is still a fear of using high performance computing in the arts and humanities, processing (predominantly) textual data is not nearly as complex as the types of e-Science techniques (such as visualisation) used by scientific researchers.
- However, the nature of humanities data (fuzzy, small scale, heterogeneous, of varying quality, and transcribed by human researchers), as opposed to scientific datasets (large scale, homogeneous, numeric, and generated, collected, or sampled automatically), means that novel computational techniques need to be developed to analyse and process humanities data for large scale projects.
- Using the processing power of computational grids may be unnecessary if projects have access to standalone machines which are powerful enough to undertake the task themselves. Processing data via computational grids can be a security risk: the more dispersed the data, the more points there are at which the dataset can be intercepted. Researchers should choose the technologies they use to carry out processing according to their need, but at present queuing jobs on a standalone high performance machine often requires less management than using processing power dispersed over a local, national, or international grid.
- Finding arts and humanities data which is large enough to warrant grid (or high performance computing) processing whilst being of high enough quality can be a problem for a researcher wishing to use high performance computing in the arts and humanities. This may just have to be accepted, and the fuzzy and difficult data generated in the arts and humanities explored and understood to allow processing to happen. In this way, using e-Science to deal with difficult datasets could benefit computing science and internet technologies. Perhaps this is the main way in which e-Science applications in the arts and humanities may be of use to others, and may offer knowledge transfer opportunities.
- The high performance computing facilities at University College London are available for research in the humanities, and there is potential here for providing the computational facilities for projects such as the Victorian Panel Study (Crocket, Jones and Schürer, 2006).
- Where commercial and sensitive datasets are involved in a research project, Intellectual Property Rights (IPR) issues and licensing agreements should be specified before projects commence. Although it is not necessary to include these in a funding bid, it is important to ascertain that the project has access to infrastructures which will allow it to negotiate licenses and contracts. The importance of this issue cannot be stressed enough, especially when the project is wholly dependent on receiving access to datasets, or is dealing with commercially valuable and sensitive data.
- Commercial companies are often keen to be involved in research if there are benefits to themselves: nevertheless, the IPR of academic institutions should be safeguarded. This can best be achieved through setting up specific licenses for the use of algorithms in the commercial world: again, this should be ascertained before the project commences.
- Those in arts and humanities research may not be used to dealing with legal aspects of research. Most universities have legal frameworks in place to deal with such queries in the case of medical and biomedical research. These facilities are generally available free of charge to arts and humanities projects within their institutions, and so funding would not be compromised by having to include legal charges in funding bids. The time taken to negotiate licenses for data use should not be underestimated, however. Advice should also be taken from those involved in biomedical research: projects in that area and in the arts and humanities have much in common when it comes to data management, IPR, copyright, and licensing issues.
- Where sensitive data sets are used, the Arts and Humanities researcher should look towards Medical Sciences for their methodologies in data security and management, in particular utilising ISO 17799 to maintain data integrity and security.
For funding councils:
- Where arts and humanities e-Science projects involving large datasets are proposed, it is likely that the complexity of the project will require large scale funding. Yet many of these projects will be "blue-sky", and may require a variety of employed expertise over a number of years to undertake the work, as well as requiring technical expertise and infrastructure. These projects will then be expensive: funding calls in e-Science for the Arts and Humanities should take this into account.
- e-Science projects in the arts and humanities may also be high risk with less definable outcomes than similar projects in the sciences, due to the complexity and inherent qualities of arts and humanities based data. If funding councils wish to foster success in this area, the risks of funding such projects should be acknowledged. The very attempt to develop practical projects which wish to apply e-Science technologies in the arts and humanities may result in cross fertilisation with the scientific disciplines.
- Definitions of e-Science vary from council to council. High performance computing is as much a part of "e-Science" in the sciences as distributed computational methods. The two should not be distinguished from each other. If there are to be different definitions of e-Science between the arts and science councils, the reasons for this should be researched and expressed, both to elucidate the different funding councils' approaches to e-Science and to explore further where e-Science technologies can be of use to arts and humanities research.