Next Generation Tools
Next generation tools for linguistic research in grammatical treebanks
Ref: R 000 23 1286
Institution: University College London
Department: Department of English (Survey of English Usage)
Principal investigators: Bas Aarts and Sean Wallis
Period: 1 January 2006 to 31 December 2007
This project was rated Outstanding by the ESRC.
Comments from ESRC referees on the final report
Evaluation Grade: Outstanding
Activities and achievements
The research project has fully achieved its objectives. What has been produced is an impressive research tool. It has been achieved through a clear understanding of the issues, a vision of what might be achieved and how it might be achieved, and a lot of hard work. This research software will provide linguists, particularly corpus linguists, with relatively easy access to a sound experimental design and powerful statistical analyses by which they can explore and test the workings of language. The accompanying manual ("release notes") also represents a considerable undertaking and is a valuable support to the software.
The software does have some rough edges, but these are acknowledged by the investigators and I do not doubt that they will be ironed out in the coming months (not least of all with proper user feedback). I was slightly disappointed by the objective of developing "documentation, tutorial material and 'wizards'". The "release notes" seem to have fused the documentation and tutorial material (I could not find evidence of separate tutorial material). In principle, this fusion could have worked; indeed, pedagogically it is a good idea. However, in practice there sometimes seemed to be a tension in the "release notes" between the two. For example, I sometimes got lost when the text switched from discussion of the practice example to explaining general functions of the programme, and then back again, and so on. Also, the "wizards" did not seem very "wizard-y" to me: it was more like opening up a different tool within a suite of tools, rather than being led by the hand through a process. In working through the software, I got stuck on a number of occasions and frequently had to fall back on the pre-loaded examples.
The major highlight is simply being presented with the possibility of exploring interactions between grammatical units, particularly higher-level units, in a robust manner without the weeks of work it would usually involve (if one could do it at all). I am also quite excited about the fact that the platform and methodology it provides are extensible.
More specific highlights concern the more visual aspects of the software, for example, being able to sample and browse the trees in the corpus or seeing the display indicating the coverage of the hypotheses.
Clearly, the major aspect of dissemination - making the software available at cost over the Internet - is fully achieved. There is also the "getting started" guide and the on-line help manual, both of which are invaluable, though, as I hinted above, the guide could do with a little bit more expansion and polish. I also note a goodly number of papers delivered and publications.
The investigators are right to say that this is primarily a set of research tools for academic linguists. In fact, they are primarily for a specific set of academic linguists, particularly corpus linguists with an interest in grammar and lexis (this group is acknowledged as the immediate beneficiary in the proposal). Having said that, I was pleased to see that sociolinguistic variables can be incorporated into the research, so there should be some appeal to that group too. I can also see interest for academics working in the areas of language engineering and computation. The proposal suggests that language teachers might be a secondary beneficiary. Theoretically, this is true, but I do not think the software is sufficiently user-friendly to attract a broader group of language teachers. For the same reason, I think the program is only really suitable for academics and possibly PhD students. The end of award report mentions MA students as well, but I very much suspect that they will find the program too difficult.
I think the priority for the short to medium term future should be getting what is in hand to work even better and making sure that it is as accessible as possible to as many people as possible.
Further into the future, one could consider incorporating semantic annotation and even pragmatic annotation.
I have spent a day and a half working through the software and its accompanying documentation. A full assessment would require considerably more time than that, and ideally the testing of a number of specific research projects. From the user's point of view, a full assessment would require feedback from a number of different users. Furthermore, what I am working with is the beta version. What this means is that not all of the functions work, the program has bugs and, in fact, crashes periodically (for example, one of the times it crashed was when I clicked on the "undo" button). Consequently, I have not been able to glean concrete proof of all of the program's functions, nor can I speak for its usability in all respects. As a result, my assessment is partly based on the assumption that what is claimed of the program in the documentation is true.
In the proposal there is also the suggestion of future workshops. I would strongly recommend that the investigators pursue these. Even if only a few people can be trained, they can then in turn pass on their knowledge. As I was exploring the software, the need for training was very apparent.
Two more specific points are as follows:
- The guide for ICECUP IV assumes that one is familiar with the guide for ICECUP 3.1. I do not think users are particularly good at "knowing" the whole of a guide or a program, or want to get referred back to another document all the time. I would rather see one integrated guide.
- Some of the buttons on the program are not only small (a perhaps unavoidable problem) but do not contain readily distinguishable symbols.
Activities and achievements
The International Corpus of English Corpus Utility Program IV (ICECUP IV), developed in the project assessed, marks a major step forward in the systematic study of parsed corpora because it allows the statistical testing of specific predictions about syntactic data, in addition to extending the exploratory power of ICECUP III. Considering that this will allow more extensive research into language (I hope ICECUP IV or similar programs will soon be available or will be developed for other corpora, too) and considering that there are probably wider fields of application available beyond linguistics (automatic text processing and text mining, to name just the first that come to mind), developing ICECUP IV is a great achievement indeed. The outcome thus warrants the efforts and costs invested and is clearly in agreement with the targets outlined in the project proposal (if not going beyond these).
It would require more time to explore the tool more systematically to make a valid claim in this section. I can judge the overall quality, but with time being limited, I do not dare to offer a comparative assessment of the different aspects of the program, which would be required to identify some as more valuable than others.
The nature of the project itself has been geared towards dissemination. The very fact that the software is now available for free download (though we need to have ICE-GB to fully benefit from it) and that the major textual output of the project is a manual for the programme guarantees that other linguists can draw upon the product of the project immediately (in addition to allowing maximum transparency).
I am looking forward to the publication of the book based on the manual that the authors promise in their evaluation. I hope that in this book they will take a more pedagogical approach, as my only criticism of the project overall is the condensed and thus not always easily accessible style of presentation in the manual. This should even increase the audience for the achievements of the project.
At this point, the audience is still restricted to linguists with more than a working knowledge of corpus linguistics, statistics and syntax/morphology. But as mentioned above, a few didactical modifications of the manual will extend the target audience to language students and any kind of language professionals.
The authors make some suggestions for further research in their evaluation of the project. These are primarily concerned with more immediate extensions. Working in semantic applications of corpus linguistics, I would like to see research using a program such as ICECUP IV to develop meaning-oriented tools for use in discourse analysis. I am absolutely sure that such applications are possible even though they may still be some time (and further projects) off.
Activities and achievements
The project aimed at providing novel software to conduct experiments on English grammar, especially allowing for the study of interactions of variables in language. This objective has been fully met and the project holders have delivered a well-designed and easy-to-use software tool that is tightly integrated with an existing corpus and allows for easy formulation, refinement and testing of hypotheses. The research followed closely what was set out in the original proposal and I am positive that the resulting tool will be very valuable to corpus linguists. Only some minor details are not yet implemented (numerical variables, for example) and I don't think that this has a major impact on the overall tool.
The methodology used is overall sound and I especially liked the integration of statistical results and hypotheses formulation in an iterative loop. However, I have some concerns about some methodological choices, where I think that a closer involvement of a statistician or a computational linguist or a machine learning expert would have been essential. [These are also the main reasons why I do not give an outstanding grade.] First, the observations about case interactions are correct and it is good that the proposers want to tackle case interactions. However, their a priori model seems very ad hoc (it is unclear how the constants used were chosen) and not evaluated at all. For an idea of how the evaluation of case interactions might look, I recommend Ken Church's paper 'Why the occurrence of two Noriegas is more similar to p/2 than p^2'. Second, their exploration of grammar interaction that results in rules of the form if A=a then B=b, is essentially a version of supervised rule learning (restricted to the case where the left-hand side is a conjunction of literals), which has been explored in machine learning for decades. Typical examples are decision list algorithms and the rule-learner RIPPER. This as such is not a problem as the project did not set out to come up with a novel rule learning algorithm. However, the evaluation of learned rules has also long been discussed in the corresponding literature and the authors' utility function should have taken this research into account. Similarly, their exhaustive search algorithm for finding the rules will be very slow when faced with a bigger corpus or more feature values and could profit from experience in machine learning. Third, I am a bit disappointed that the authors do not compare their tool to other existing tools either for searching in parsed corpora (like GSEARCH or tgrep) or with machine learning tools.
With regard to value for money, I think this was essentially a cheap project for what it delivered and therefore very good value for money. The researcher employed might have been more expensive than usual but the result is of overall high quality and the money spent on consumables, equipment, travel and other support is very low for such a project.
For me the highlight of this research is the tight integration of the corpus with what is basically a machine learning tool - a rule learner. This allows for a steady exploration of hypotheses while exploring the corpus at the same time. This makes experiments possible that before were not possible for linguists without a strong computational or programming background. Even users with that background will profit from using the proposed tool for initial explorations of hypotheses.
Another highlight for me is the excellent documentation that will make wide usage in the corpus linguistic and linguistic research community possible.
The software has been presented at ICAME as well as via two book chapters. It is also available from a web site for free. These places for dissemination are entirely appropriate if not yet sufficient in quantity.
The researchers say themselves that dissemination is still in the early stages. I would especially have liked to see a draft for a more ambitious journal paper (for example, for Corpus Linguistics or for "Language Resources and Evaluation") as well as presentation at more conferences and meetings so that the research is more widely made known to potential end users (LREC, EACL, UK meetings). I hope that the proposers will do so in the future. The suggestion for a book collection of studies using the tool is very good but will obviously operate on a longer timescale.
I also think that the integration of the tool with other corpora than ICE-GB would make it more appealing and increase its usage (for example, with the Penn Treebank).
The most immediate audience are corpus linguistic researchers in the UK and abroad. The tool is sufficiently general that it should excite interest in all corpus linguistic researchers working on English, and especially on English grammar. For example, almost all universities in the UK and US have such groups.
There is potential for wider interest in the research community in computational linguistics, data mining and text mining but for this to be feasible the statistical tools will need to be more sophisticated than the current simple rule-learner. Therefore future developments should involve a machine learning expert.
I also think that the tool has great potential for use in teaching in linguistics at the undergraduate level and even at A-level standard, something the authors do not mention.
There are several lines of research to be pursued. Firstly, the use of the tool in case studies (as suggested by the authors) will lead to new testing of grammar interactions. Secondly, the tool itself can be expanded via incorporating different and improved machine learning methods when evaluating and generating hypotheses. For this, involvement of experts will be crucial. Thirdly, the authors are interested in models of case interaction. This is promising but again (see above) it is crucial that the probabilistic models involved have a sound grounding in probability theory. Fourthly, the tool should not rest on ICE alone: integration with other corpora as well as other platforms (non-Windows, for example) would heighten its appeal.
Activities and achievements
There is little doubt that this project has built upon the work previously undertaken by the UCL team to provide user friendly interfaces which allow linguists to explore treebanked data. This is most welcome and a substantial achievement. However, the work does appear to stand in something of a vacuum - some of the claims to novelty in the work, while true in the context of linguistics, become contestable when one considers the work done in computing on probabilistic grammars in particular and patterns of probabilistic dependency in language more generally. I do think this work has suffered a little by failing to engage with the plentiful - and quite sophisticated - work undertaken by computer scientists who have approached treebanked data. I am thinking in particular of the enormous amount of work that has been undertaken on the Penn Treebank. With that said, I abide by my original view that the interface provided here is the key novelty and deliverable of the project - little of what is provided by computational linguists is readily or easily used by non computational linguists. Even so, it would have been reassuring to see that work being acknowledged and used on this project.
The interface - this is clearly a usable and helpful piece of software for linguists.
Less satisfactory - there are good outlets in the form of journals which could have published work like this (Corpora, the ACH publications). This, I am sure, would have led to this work becoming more widely known than it is. A web presence is fine, but more work could have been done to get this work recognised. I can see little evidence that a sustained effort has been made to disseminate this work, which is a pity as the program, having been authored, clearly needs users.
The audience is clearly linguists, though, as noted, I am not quite sure that enough has been done to disseminate these results.
I think that further research lies in the use of this software rather than its further development. This project has opened the door to potential methodological and theoretical innovation. Further research should focus on realising the promise of such research rather than tweaking the interface/program further.
This is a solid project which could have been better though in fairness, the project did what it set out to achieve. I do feel, however, that there is untapped potential in this project which ideally would have been realised while the project was still on-going.
Activities and achievements
My overall assessment of the achievements of this project is quite positive. I have been an ICECUP/ICE-* user for quite some time, but I think that the new version of ICECUP provides a variety of substantial improvements and extensions, allowing users to perform an ever larger number of different operations. The number of ways in which different search patterns can be created and combined with both linguistic as well as sociolinguistic or register-related variables is staggering (which will lead to the only major negative comment below in Section 5) and will not only be useful to the next generation of corpus linguists, but also set the bar for future projects of similar sorts very high.
- the improved wildcard search facilities;
- the improved handling of CPU resources;
- the way in which the sample viewer allows multiple hits per line to be viewed separately;
- the ease with which additional context can be viewed in the output window;
- the diversity of search operations etc. that can be performed;
- the inclusion of association rules.
In general, I also appreciate the fact that the issue of independence of matches is addressed (but cf. below).
In terms of number of presentations, the project's research appears to have been disseminated very well - in fact, more than had been anticipated at the time the proposal was submitted. However, the dissemination of the program would merit, and benefit from, more prestigious outlets. The new version of ICECUP constitutes a significant improvement over the previous versions with which I am familiar and is therefore worthy of more and more high-caliber exposure. For instance, for broader (though not necessarily high-caliber) dissemination, why not submit a shorter notice to the ICAME Journal and/or have a commented demo (video) on the website which exemplifies some of the features? (Cf. <http://www.youtube.com/watch?v=HeKyTGZILhI> for an example.) For instance, for more high-caliber dissemination, why not present the software in some of the peer-reviewed journals of the field (e.g., International Journal of Corpus Linguistics, Corpora, Corpus Linguistics and Linguistic Theory, Literary and Linguistic Computing, etc.)? The software is certainly good enough for more and better exposure.
With the caveat regarding the documentation (cf. Section 5 below), this software has a huge target audience: students (at both an undergraduate and postgraduate level) as well as seasoned researchers, English linguists, language learners, lexicographers, and to a limited degree computational linguists. In this regard, this project is exemplary.
Given the nature of the project I include a comment here on future work, not on future research. The manual still needs some final error-checking because there are still a few minor errors in there. For example, the default display of the variable selector after having defined a new project is not "CASE INTERACTION" but "CASE DEFINITION", but things like this will be ironed out quickly.
More generally, and this is the only major/serious negative comment I have, I myself am rather unhappy with the manual. Although I would like to consider myself an experienced ICECUP 3.1 user, I find it confusing and lacking important overview sections. The bewilderingly huge number of functions that ICECUP now offers - and it is of course great that the program can do so much - is unfortunately completely inversely proportional to (my own subjective assessment of) the readability of the manual. The way the different sections of part 2 were written made it very difficult for me to discern any goal-oriented structure (to the point where I found it hard and painful to replicate examples from the manual) and made me miss an overview that makes clear how the different steps and notions - edit FTF, mapping rules, creating variables, creating mapping rules, etc. - relate to each other and contribute to the big picture rather than have many different steps be interspersed with lengthy explanations. I openly admit different people have different learning strategies, so everybody else may like the current format; my comments must be taken with a pile of salt. One other thing I would also recommend is a cheat sheet (maybe with button icons and shortcuts, such as <http://www.pixelbeat.org/cmdline.html> or <http://juerd.nl/site.plp/perlcheat>).
The following are a few small comments with regard to things that I feel could still be addressed to make the results of the project more widely available and more useful. However, I mean these comments to be understood constructively and they refer to comparatively minor issues which did not deter me from awarding the highest possible grade for the project.
It would be nice if the ICECUP program was available for other operating systems. MS Windows is a commercial operating system and it would be nice if ICECUP was available to users who use open source operating systems from the Unix/Linux family. In addition, since many researchers in the humanities still use Macintosh OS, it would be good to be able to offer an ICECUP version that runs on Mac OS X. My suggestion, therefore, would be to aim at making the functionality of ICECUP available in the form of a platform-independent (and ideally portable) Java program. Even within the Windows installation that is currently available, ICECUP has a bit of an old-fashioned Windows 3.1 look-and-feel. Users cannot install the program into C:/Program Files (unless they know the "C:/Progra~1" trick); they cannot save results into files with names longer than 8 characters; they cannot pick a folder into which results are saved (by default they get saved into the installation folder, which requires users to then copy that into the folder for the research projects and can be problematic if many people use the program from a server) but, illogically, they can pick a folder for the FTF that resulted in a concordance.
I personally (and I know quite a few colleagues that agree with me on that) find the way that the notion of experiment is used in the software and the accompanying texts rather inappropriate. Neither does an experiment consist of research questions (as is said on p.7 of the end-of-project report) nor does a corpus search constitute an experiment in the sense in which the term is usually understood in the social sciences. Rather, it is customary to distinguish between observational and experimental research, and within linguistics, corpus work is the paradigm case of the former, not the latter.
This page last modified 25 April, 2013 by Survey Web Administrator.