A Catalog of History Dissertations, part 2

[Image: Part of the detail view for dissertations.]

Data on history dissertations from the old days is tricky to deal with. The existing analog catalogs are not always complete and often contain incorrect information. And while university libraries have incorporated these dissertations into their online catalogs, each library has done so with varying degrees of precision, depending on many variables. After considering some alternatives, I have decided to manually check names and titles whenever possible (I will write about this in a later post). That brings us to the problem of how to store all this data and make it available.

My first option was to use existing software for digital cataloging or content management. After browsing around, I found some applications that could do the job reasonably well, even if some customizing would always be necessary. I thought Omeka was a bit of overkill, since I was not planning on digitizing the full text of the dissertations – and even if I were, I would then have to deal with copyright issues (as far as I know, Brazil has no “fair use” doctrine, and even if it did, I don’t believe it would apply to my database). Other library catalog and institutional repository applications, such as DSpace, would be cumbersome to customize for the extra information I would like to include in the database.

Then I considered building a system myself, containing just what I needed for the project and no extra bells and whistles. As I had some knowledge of Python, I followed a friend’s suggestion and started learning Django. Django is a robust web framework that enables quick and relatively easy development and deployment of web applications. The basics were pretty straightforward: the learning curve is smooth and the community is very responsive, so most questions can be answered with a quick Google/Stack Overflow search. For the database itself, I went with MySQL, which integrates well with Django. PostgreSQL remains an option for deployment, as it has some extra functionality that might be useful in the future.
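To make the stack a little more concrete, here is a minimal sketch of what one catalog entry could look like as a Django model sitting on top of MySQL. The field names are illustrative placeholders, not the project’s actual schema.

```python
# models.py – an illustrative Django model for one catalog entry.
# Field names are placeholders, not the project's actual schema;
# the MySQL connection itself is configured through Django's usual
# DATABASES setting with ENGINE "django.db.backends.mysql".
from django.db import models


class Dissertation(models.Model):
    DEGREE_CHOICES = [("MA", "Master's"), ("PhD", "Doctorate")]

    title = models.CharField(max_length=500)
    author = models.CharField(max_length=200)
    advisor = models.CharField(max_length=200, blank=True)
    institution = models.CharField(max_length=200)
    degree = models.CharField(max_length=3, choices=DEGREE_CHOICES)
    defense_date = models.DateField(null=True, blank=True)  # often missing in older records

    def __str__(self):
        return f"{self.author} - {self.title}"
```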

There is a third option that I have not entirely discarded, but that will remain in the “might do if I have time” pile: messing with node.js – specifically, separating the back end and front end of the digital catalog. During the last few months I have learned the basics of building RESTful APIs with JavaScript, but I had no previous experience with JS before that, so this extra learning has to be taken into account.
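Whatever ends up running the back end – Django or node.js – the separation amounts to exposing the catalog through a read-only API that a separate front end consumes. Purely as a sketch, and staying with the Python stack already in place, Django REST Framework would look roughly like this (the Dissertation model is the illustrative one from the sketch above, not the real schema):

```python
# api.py – a read-only REST endpoint for the catalog, sketched with
# Django REST Framework. Names reuse the illustrative model above.
from rest_framework import routers, serializers, viewsets

from .models import Dissertation


class DissertationSerializer(serializers.ModelSerializer):
    class Meta:
        model = Dissertation
        fields = ["id", "title", "author", "institution", "degree", "defense_date"]


class DissertationViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Dissertation.objects.all()
    serializer_class = DissertationSerializer


# A front end (in JavaScript or anything else) would then fetch
# /api/dissertations/ once router.urls is included in urls.py.
router = routers.DefaultRouter()
router.register("dissertations", DissertationViewSet)
```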

All in all, programming is definitely not the hardest part of the project. That might be because I am a historian who had some programming experience in his teens. Nevertheless, decisions about what metadata to include, how to normalize names and titles, and how to structure the tagging system all carry far greater weight in determining whether the catalog succeeds.
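To give one example of the kind of decision involved: the tagging system could be structured as a separate table of tags linked many-to-many to the records, so that a tag can be renamed or merged in one place. This is only a sketch of one possible design, again reusing the illustrative model above.

```python
# models.py (continued) – one possible shape for the tagging system.
from django.db import models


class Tag(models.Model):
    name = models.CharField(max_length=100, unique=True)

    def __str__(self):
        return self.name


# On the Dissertation model sketched earlier, a single extra field
# would attach any number of tags to any number of records:
#
#     tags = models.ManyToManyField(Tag, related_name="dissertations", blank=True)
```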

A Catalog of History Dissertations

My current postdoc project consists of building a large database with (meta)data for all History PhD and MA dissertations defended in Brazil from 1942 to 2000. It has been a while since the last catalog of this kind was published, and the data from the Ministry of Education are inconsistent (to say the least) for entries older than the 2000s. As a historian of historiography, I have wasted much time with this inconsistency, cross-checking information about specific dissertations – and I know many colleagues who had to do the same for their own work. So I decided to bring everything together and build this tool. It should gather the essential library metadata for every item (following the Dublin Core specification, except for the abstract, due to time and budget constraints) but also implement some features specific to the target audience – other historians of historiography.
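For concreteness, the mapping from catalog fields to Dublin Core elements might look something like the dictionary below. The field names on the left are hypothetical, and the dcterms choices are only one plausible reading of the specification, not the project’s final scheme.

```python
# An illustrative field-to-Dublin-Core mapping (dcterms namespace).
# Keys are hypothetical catalog fields; the abstract is deliberately
# left out, as explained above.
DUBLIN_CORE_MAP = {
    "title": "dcterms:title",
    "author": "dcterms:creator",
    "advisor": "dcterms:contributor",
    "institution": "dcterms:publisher",
    "defense_date": "dcterms:date",
    "language": "dcterms:language",
    "subject_tags": "dcterms:subject",
    "degree": "dcterms:type",
}
```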

There are two country-wide catalogs in print. One was published in 1985 and spans 1973 to 1984; the other was published in 1995 and spans 1984 to 1994. The information they collected is inconsistent, though, and many entries are missing. Many Graduate Programs have published their own catalogs too, some in print, some on their websites. But as those are filled in by different people, and people have this thing of (1) not following input standards and (2) not caring about incomplete data, these catalogs vary greatly in quality. I have managed to acquire most of the printed ones and to save local copies of the online ones.

After gathering all the available catalogs, I started inputting the information into an Excel spreadsheet, just to check for inconsistencies (different spellings, dates, etc.) and missing data. Right now I have over 3000 entries. Since then I have been normalizing the names of individuals by building an authority file (some individuals are harder to track down than others), and I am now in the process of visiting the university libraries that hold the texts and checking all the information against the works themselves. A major difficulty, however, is that recording the thesis committee and the defense date was not mandatory until recently, so there is some detective work ahead for the older works.
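As a small illustration of the kind of consistency check this involves, the sketch below loads the spreadsheet and flags author names that differ only in accents, spacing, or case – likely variants of the same person that need a single authorized form. The file and column names ("catalogo.xlsx", "author") are hypothetical, not the actual spreadsheet.

```python
# Flag author-name variants in the working spreadsheet (illustrative only).
import unicodedata

import pandas as pd


def normalize(name: str) -> str:
    """Strip accents, extra whitespace, and case for comparison purposes only."""
    stripped = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return " ".join(stripped.lower().split())


df = pd.read_excel("catalogo.xlsx")
df["author_key"] = df["author"].astype(str).map(normalize)

# Keys that map to more than one distinct spelling are candidates
# for manual review and for the authority file.
variants = (
    df.groupby("author_key")["author"]
    .nunique()
    .loc[lambda counts: counts > 1]
)
print(df[df["author_key"].isin(variants.index)][["author", "author_key"]])
```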

My next post will be about the technological aspect of the project: the website, the database, and the analytics.