Q- What do you do as Head of the Oxford Text Archive (OTA)?
A- I'm responsible for both the day-to-day and strategic management, which obviously has to fit in with the rest of the Unit, the Computing Service, and the University's library and information services.
Q- How many people work for the OTA now?
A- There are four of us at the moment (Alan, John, Karen and me), plus Andy, who works on an external project although he's based with us. And there will be one more person soon, funded by the UK's Arts and Humanities Research Board (AHRB).
Q- What is the main task of the OTA?
A- We archive texts for future generations of scholars. We were originally founded by Lou Burnard in 1976 to stop people from duplicating the work of others. Someone can spend five years typing in a text in ancient Greek, and if this is not known, another person might start doing the same thing. What Lou wanted to do was to collect copies of this kind of material and share it with anyone in the world who needed it. This was pre-web, pre-FTP. In the early days, the texts were distributed on tape, disk, etc., then via FTP, and more recently via the web. Lou did this on his own at first, not officially for the OUCS or anything. Now, my post and half of Karen's are funded by Oxford, but a lot of the posts in the OTA rely on external money, as we are part of the national Arts and Humanities Data Service. The texts are useful to people wanting to do electronic editions, and also for other kinds of research, such as statistical linguistic analysis. After someone has scanned a whole text, like a play by Shakespeare, it makes sense that other people can benefit from it.
Q- How do you deal with copyright problems?
A- There are a lot of copyright problems. There is a deposit agreement that people are supposed to sign when they offer material to the archive. The early versions didn't deal much with copyright at all. Today's version has to fit in with the whole of the Arts and Humanities Data Service, and it essentially says that the person giving us the material is responsible for clearing the rights as best they can.
Q- So if there is any trouble, that person will face the courts, and not the OTA.
A- Yes. Legally it's still a bit wobbly, because obviously, if a publisher is looking for someone to sue they're going to try us first. But it has never gone that far. Quite often we get publishers coming to us and saying that we shouldn't have certain material, but in most cases they are wrong. For example, if a deposit involves a set of medieval texts that they published in the nineteen sixties, then they have rights in the typography, but not in the content itself. If the depositor has left out any notes or introduction written by the editor, the publishers have no right to the text that we have. Especially in the nineteen eighties, a lot of publishers didn't see any value in electronic texts, and quite a few still don't. If you think of very popular authors like Jane Austen or Molière, who have been published in good cheap editions, the publishers knew they couldn't make any money out of an electronic version, so they were happy just to give them to us. However, now they're starting to want such material back to put it on their own websites.
Q- What's your access policy? Who can get the texts?
A- We have levels of users, and it's up to the person who gives us the material to decide. We encourage them to make the texts as freely available as possible for private research and teaching. If an end-user wants to take the material and somehow make money out of it, we tell them to contact the original depositor. There are other categories: some academics want to approve every release of their text. There are instances where institutions and departments are competing, and if they've put a lot of effort into creating a resource, they don't want rivals to have it. When it comes down to it, very few people are that worried; quite often they just want to know who has requested their material, because they may want to get in touch with more people interested in the same subjects. But some academics are genuinely very nervous about putting material in the archive: they fear they will lose control, that someone else will publish it without their permission, and that they will not receive recognition for all their hard work.
Q- What kind of materials do you have? Is it only literature? Is it all in English?
A- At first the collection was shaped by Lou's personal interest. He was very keen on English Literature of the 17th, 18th and 19th centuries, and also on French Literature. So we've got reasonably extensive collections in those areas. Our focus is more on research now; we are interested in academic research data. Within the Arts and Humanities Data Service there's a debate going on concerning what to collect and how to catalogue it, because collecting by subject works very well as a model, but there isn't enough money to do that for all subjects within the arts and humanities. So another way to do it is to think in terms of the nature of the resource. In our case we want electronic texts, and we don't really care if it's English, French or German, as long as we know who did it, what it is about, and all the information needed to catalogue it and make it available to people, because we can't be experts in every discipline, obviously. Text is so small by today's storage standards (compared to video, for example) that there are few technical limitations on what one can access and store. Our problem is we can only provide what people give us. We have agreements with some international institutions to exchange texts, though not with any Spanish ones, although Lou has been talking to the people working on the Cervantes Library.
Q- Is all your content "canonical"? Does it only contain very famous writers?
A- Most of our holdings are fairly mainstream, although there are a few rare and unusual items. Nowadays I tend to feel that on the whole we shouldn't worry about the texts of the great authors, because publishing houses and academics are covering those already. We need to worry about all the other authors who are much harder to find in printed editions, or who have been less well studied over the years.
Q- Don't you have any restrictions? For example, would you take a Nazi manifesto?
A- Yes, we'd consider even that. But we wouldn't make it available to just anyone, maybe only for research purposes, such as to a historian who is working on Nazi propaganda. But we don't really have these kinds of texts. Our biggest problem now is that web authoring is so easy that many academics don't deposit things with us but just put them up on their departmental web pages. Yet these sites can die, or the academic can leave the institution, etc., and then maybe the work is lost.
Q- Does your material have advantages over HTML? Is it all searchable, for example?
A- We're trying to persuade people to follow the TEI (Text Encoding Initiative) Guidelines and use international standards like SGML and XML, but we can't force them to do that. At the moment we accept other formats as well, but we advise these depositors to migrate the data to TEI SGML/XML. We couldn't do the conversion ourselves, as we don't have enough money to do that. And usually, to do a conversion you need to know a lot about the source data; if it's in Turkish, it doesn't help that I know SGML. All we can guarantee is that the stream of data we store will be available in 10, 20 or 40 years' time.
Q- What happens when the format standards change?
A- We keep all the files we get now, including any original proprietary formats, which were often thrown away when the text was converted into plain ASCII characters during the 70s and 80s. Sometimes the versions of the programs can be so old that even if we can read the files we might not be able to do much with the data, so we try to convert any proprietary data to newer versions. For example, we have texts from the sixties that can be opened, but you might get a string of weird code numbers running down one side and we have no idea what those codes mean or what they are related to; but if at the bottom of the page you get a transliteration into ancient Greek, then you can guess that the weird code is probably the Greek text, and so you could theoretically convert it.
Q- Is there an international consensus about how text should be digitized?
A- In the big digitization projects like the ones at Virginia and Michigan, people are scanning images of early printed books, because some people want to look at the visual layout of the page, the pictures, etc. But the big image files are not easily downloadable. They're also transcribing the text and putting in a reasonable amount of encoding, to enable users to restrict searches to particular aspects or regions of a text (e.g. only within titles). They're not going down to the level of part-of-speech tagging. But the potential is there: if someone else wanted to take a TEI SGML/XML encoded text and add another level of encoding, it would be straightforward.
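To illustrate the kind of restricted search described above, here is a minimal Python sketch. It assumes a simplified, TEI-like XML fragment without namespaces; the sample text, the function name `search_within`, and the use of `head` elements for titles are illustrative assumptions, not the markup or software used by the projects mentioned.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like fragment (real TEI files carry a
# header, namespaces, and much richer markup).
SAMPLE = """<text>
  <body>
    <div>
      <head>The Tempest</head>
      <p>A storm at sea opens the play.</p>
    </div>
    <div>
      <head>A Winter Evening</head>
      <p>The tempest raged all night.</p>
    </div>
  </body>
</text>"""


def search_within(xml_source, tag, term):
    """Return the text of every <tag> element that contains `term`.

    The match is case-insensitive and limited to elements with the
    given tag name, so other regions of the text are ignored.
    """
    root = ET.fromstring(xml_source)
    needle = term.lower()
    hits = []
    for element in root.iter(tag):
        # Gather all text inside the element, including nested markup.
        content = "".join(element.itertext()).strip()
        if needle in content.lower():
            hits.append(content)
    return hits


# The same word, searched for in titles only and in paragraphs only.
title_hits = search_within(SAMPLE, "head", "tempest")
paragraph_hits = search_within(SAMPLE, "p", "tempest")
print(title_hits)
print(paragraph_hits)
```

Because the markup labels which region each stretch of text belongs to, the same word can be found in a title in one search and in running prose in another, something a plain-text search cannot distinguish.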
Q- Can you give me some impressive figures about the OTA?
A- It's difficult to state accurately how many texts we have, because we have quite a few corpora, each of which we count as one item. For example, the Old English Corpus contains over three thousand individual texts, which was quite big for the eighties (35 megabytes), but nowadays many people don't mind downloading files of that size from the web. The hard part is the cataloguing, because if you're only interested in a short text within this corpus, how do you find it? You don't want to download the whole thing and then search it, because it's a continuous string of lines that's fine for computers but not for human beings. We also have lots of anthologies and collected works of some writers, each of which we also count as only one text. Some publishers do the opposite and count an anthology of 200 poems as 200 texts. In an average month we're getting about 5500 items downloaded by users, and about 3000 using the old FTP system, so about 8500 altogether. But it's difficult to evaluate this, because some academics might only want a paragraph from one particular text, while another is going to use a whole corpus, and we don't monitor activity at all the sites around the world that mirror our holdings.
One problem is that people now expect to be able to get everything immediately from the web, but some of our texts have IPR problems so we require end-users to fill in a form when requesting such material. And even though it's free, in the sense that no-one has to pay for this service, people don't like to have to go through this, and perceive it as a limitation on their use of these digital resources.