The case for structured data
As the number of projects contributing to NINES grows, I increasingly find myself answering questions about structured data. Why XML? How does one choose between an XML format like that of the Text Encoding Initiative and a database?
These are good questions, and can be hard to answer without using even more acronyms and alienating technical terminology. Recently, however, text-encoding guru Julia Flanders addressed the topic of XML and databases on the TEI encoding seminar series discussion list. Her answer is one of the most informative and eloquent presentations of the topic I’ve seen, and with her permission I’ve posted an excerpt here.
“I should…say in advance (before anyone reads the more detailed screed below) that we’re teaching XML and TEI for a reason, which is that they help us work with text in a way that respects both its nuance and our own interest in that nuance. So my own personal recommendation for representing textual information is to use XML on principle, because (regardless of what tools are available right now) in the long run it’s the right kind of approach. However, it’s worth understanding the broader context, which I will try to sketch below.
The short theoretical answer to your question is that at a deep level, both XML and databases are doing the same thing: they are representing the structure of your information. Both of them can do essentially the same work of representing the individual comment entries, the names and addresses of the commenters, their gender, etc. In a sense, both the database format and the XML format are really just expressions of a deeper and more abstract data model that is conveyed when we say things like:
– “a comment is made by an individual who has a name, address and gender”
– “a comment contains text”
– “a comment may contain information about other exhibits attended by this individual”
etc. etc.
Database tools and XML tools differ in the kinds of things they’re good at (and this is where the readings may come in handy, to give concreteness to this point). In addition, database structures and XML structures differ somewhat in their emphasis: database structures emphasize what is regular and predictable about your data (e.g. the fact that every individual commenter has a name, address, and gender). XML structures emphasize what is less regular and predictable about your data (e.g. the fact that the comment might or might not include praise for the exhibit, references to other exhibits, references to specific artists of interest, statements about being inspired, etc., and also the fact that the comment might contain an unpredictable number of paragraphs). For your data, which has a fairly regular and predictable structure, the difference is comparatively minor. For other kinds of data, though, the difference might be great: it would be much more difficult and bizarre to express the structure of a novel using a database.”
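To make the comparison in the excerpt concrete, here is a minimal sketch in Python (mine, not Julia’s) that stores one visitor comment both ways. Everything in it is an assumption made for illustration: the sample comment is invented, the table and column names are hypothetical, and the XML element names (comment, commenter, p, exhibitRef) are made up rather than taken from the TEI. It uses only the sqlite3 and xml.etree.ElementTree modules from the standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical visitor comment, invented for illustration only.
comment = {
    "name": "A. Visitor",
    "address": "12 Example Street",
    "gender": "F",
    "paragraphs": [
        "The exhibit was wonderful.",
        "It reminded me of the sculpture show last spring.",
    ],
    "other_exhibits": ["Sculpture show"],
}


def to_database(c):
    """Database view: the regular fields become columns, and the
    free text is flattened into a single opaque string."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE comment (name TEXT, address TEXT, gender TEXT, body TEXT)"
    )
    db.execute(
        "INSERT INTO comment VALUES (?, ?, ?, ?)",
        (c["name"], c["address"], c["gender"], "\n\n".join(c["paragraphs"])),
    )
    return db


def to_xml(c):
    """XML view: the same fields, but the text keeps its own structure,
    with any number of paragraphs and optional exhibit references."""
    el = ET.Element("comment")
    commenter = ET.SubElement(el, "commenter", gender=c["gender"])
    ET.SubElement(commenter, "name").text = c["name"]
    ET.SubElement(commenter, "address").text = c["address"]
    for paragraph in c["paragraphs"]:
        ET.SubElement(el, "p").text = paragraph
    for exhibit in c["other_exhibits"]:
        ET.SubElement(el, "exhibitRef").text = exhibit
    return el


db = to_database(comment)
row = db.execute("SELECT name, gender FROM comment").fetchone()
xml_comment = to_xml(comment)

print(row)
print(ET.tostring(xml_comment, encoding="unicode"))
```

Both versions express the same underlying model. The table makes the predictable fields easy to query, while the XML leaves room for the parts that vary from one comment to the next, such as the number of paragraphs or whether other exhibits are mentioned at all.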
Julia goes on to describe the practical aspects of using databases and/or XML to store your content. In the end, each project is different, and the decision comes down to the kinds of data you’re working with and the questions you pose. But it is crucial that scholars familiarize themselves with the technical aspects of digital scholarship, not just for the success of their projects, but also for their sustainability.
Further reading: Ronald Bourret, “XML and Databases” and “Going Native: Use cases for native XML databases”
Thanks, Julia!
September 11th, 2008