Transforming Mountains into Mole Hills: Creating a Bibliographic Database for the Smokies with Drupal
Mark P. Baggett, Systems Development Librarian
Harrison Pang, Research Associate, Office of Institutional Research
Jennie McInturff, Graduate Research Assistant, Research Services
University of Tennessee, Knoxville
Originally presented at the 2012 Tennessee Library Association conference.
In an effort to promote research on the Great Smoky Mountains, the University of Tennessee Libraries recently developed a bibliographic database for the Great Smoky Mountains Regional Project. The “Databases of the Smokies,” referred to throughout this article as DOTS, provides citations to written works about the Great Smoky Mountains National Park and bordering communities from 1935 to the present. Because interest in the Smokies is expansive, DOTS is intended for both scholarly and popular audiences, and care was taken to make sure it is accessible and usable for all. This paper describes the development of the database from the platform selection and review phase through installation and implementation. It also discusses the creation of a controlled vocabulary, the establishment of a workflow, and the challenges faced along the way.
For the past several years, Anne Bridges and Ken Wise, librarians at the University of Tennessee Libraries, have been working to compile a comprehensive annotated bibliography focusing on the region surrounding the Great Smoky Mountains. This print bibliography emphasizes scarce and difficult-to-locate materials prior to 1935 and the formation of the Great Smoky Mountains National Park. The text was recently completed and is currently under consideration for publication.
In an effort to capture materials outside the breadth of the print bibliography, focus shifted in early spring 2011 to establishing an online bibliography that could host records for materials created after 1934. Since the scope of this project was much broader, an online bibliography was more desirable than a traditional print monograph. By hosting it online, the bibliography could house many more records and increase discovery of the resources by a much larger audience. To build this platform, a team was formed that consisted of both librarians at the University of Tennessee Libraries and graduate students from the School of Information Science. Together, this team organized a list of technical requirements for the project, reviewed potential platforms, selected and implemented the final product, and developed the workflow for adding, describing, and managing records within the database.
The team’s first charge was to evaluate and select a platform to build the database. To do this, the team created a list of requirements to help assess each potential platform and make the selection process as systematic as possible. As items were added to the list, they were categorized as either a need or a want. Needs were defined as qualities that the selected platform must contain. In other words, if something was classified as a need and was not present, that particular platform could no longer be considered a viable option. On the other hand, wants were defined as qualities that would be beneficial but not strictly necessary. By defining wants, the team believed it would be easier to choose a platform if several met the initial list of requirements.
When compiling the list of requirements, the team thought about working with the platform from three separate perspectives: the end user, the site curator, and the system administrator. By considering each of these perspectives, it was easier for the team to identify the needs and wants associated with each. The end user perspective helped identify requirements necessary to make the database effective to search, retrieve, and browse for records. The site curator perspective identified requirements associated with adding and managing records in the platform’s database. The system administrator perspective helped identify requirements associated with implementing or supporting the platform.
When identifying requirements, the end user perspective was considered most important. Two needs and two wants associated with this perspective were identified. The needs were the ability to perform advanced searching and search by keyword (see Figure 1). Advanced searching would provide end users with the ability to perform precise searches by limiting search terms to specific fields. Keyword searching would provide end users with a Google-like experience allowing search terms entered into one text box to be searched across all fields. Providing both search capabilities would give the end user the flexibility to search broadly or to find specific records. The team also wanted a taxonomy of subject terms to give end users the ability to browse by subject. Finally, it was important (but not essential) that the platform display results in multiple views. Results displayed as citations would provide a recognizable interface for end users familiar with scholarly databases. On the other hand, displaying records in a table with labeled columns would facilitate a more straightforward way to view results (see Figure 2).
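The difference between the two search needs can be illustrated with a minimal sketch. It is not how Drupal or any of the reviewed platforms implements search; the record layout and the function names `keyword_search` and `advanced_search` are assumptions made for illustration, and the two sample records are taken from citations quoted later in this article.

```python
# Two sample records, simplified to three fields (hypothetical layout).
RECORDS = [
    {"title": "Autumn Coloration in the Great Smokies",
     "author": "Meyer, Samuel L.", "year": "1942"},
    {"title": "Doris Ulmann and Her Mountain Folk",
     "author": "Banes, Ruth A.", "year": "1985"},
]


def keyword_search(records, term):
    """Match the term against every field, like a single search box."""
    term = term.lower()
    return [r for r in records
            if any(term in str(value).lower() for value in r.values())]


def advanced_search(records, **field_terms):
    """Match each term only against the field it was entered for."""
    def matches(record):
        return all(term.lower() in record.get(field, "").lower()
                   for field, term in field_terms.items())
    return [r for r in records if matches(r)]


# "meyer" is found by keyword search, but not when limited to the title field.
print(len(keyword_search(RECORDS, "meyer")))
print(len(advanced_search(RECORDS, title="meyer")))
```

Keyword search trades precision for convenience, which is why the team treated the two capabilities as complementary rather than interchangeable.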
It was also important to examine the needs and wants from the perspective of the site curator. The team needed to ensure that the platform would be easy to update and that the content would be relatively easy to manage. Therefore, the traits of the administrative interface of each potential platform were considered. Relatedly, a platform that allowed for bulk records to be uploaded and indexed rapidly and easily was defined as a need. Since many of the records to be included in the database had already been identified and the librarians in charge hoped to release the online bibliography rather quickly, the team wanted to find a platform that could be implemented, installed, and developed within a reasonable time frame.
Finally, the platform needed to be easy to administer and support, so the perspective of the systems administrator was examined. Normally, web applications at the University of Tennessee Libraries are supported by the systems department staff. For this reason, the server technologies required to install and maintain the platform needed to be compatible with the department’s expertise and existing web server. On a related note, since the department could not afford to assign a staff member to specifically focus on the project long term, and there was a belief that some customization would be required, it was appealing to identify a platform with an active user community that is still diligently developing the system. Finally, while limited funds were available for a one-time purchase, annual renewal fees for a proprietary platform were very undesirable.
After completing the list of requirements, several platforms were reviewed. These included a custom XML solution, InMagic DB/TextWorks, RefBase, and Drupal. Each of the potential products was reviewed and compared against the items in this list of requirements. In order to be as systematic as possible in the selection phase, platforms were eliminated immediately when a quality defined as a need was missing. After a brief period of testing, the list of wants was used to select the best option from the remaining platforms.
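The two-stage selection logic described here (eliminate on missing needs, then compare on wants) can be expressed as a short sketch. The feature labels and the placeholder platforms A, B, and C below are hypothetical and do not reproduce the team's actual requirements list or its findings about any real product.

```python
def select_platform(platforms, needs, wants):
    """Drop platforms missing any need, then pick the one meeting the most wants.

    platforms maps a platform name to the set of features it offers.
    Returns None when no platform satisfies every need.
    """
    viable = {name: feats for name, feats in platforms.items() if needs <= feats}
    if not viable:
        return None
    return max(viable, key=lambda name: len(wants & viable[name]))


# Hypothetical feature labels, for illustration only.
NEEDS = {"advanced_search", "keyword_search", "bulk_import"}
WANTS = {"subject_browse", "multiple_views", "active_community"}

# Placeholder candidates; not the platforms reviewed in this article.
CANDIDATES = {
    "A": {"advanced_search", "keyword_search"},
    "B": NEEDS | {"subject_browse"},
    "C": NEEDS | {"subject_browse", "multiple_views", "active_community"},
}

print(select_platform(CANDIDATES, NEEDS, WANTS))
```

Treating needs as a hard filter and wants as a tiebreaker keeps a platform with many attractive extras from being chosen when it lacks something essential.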
A custom XML solution and InMagic DB/TextWorks were eliminated first. Initially, a custom XML solution seemed like a viable option. The original request for the bibliography seemed quite simple, and staff members in systems had expertise in retrieving data from XML documents. However, this solution was discarded after the number of records that would be included in the bibliography was more clearly defined and the amount of custom development it would take to make the database functional was realized. InMagic DB/TextWorks, a proprietary solution, seemed interesting and was a viable candidate early in the process, but was eliminated when a conference call with the vendor revealed a few problems. Most significantly, it required a Windows Server environment and other proprietary Microsoft server software that the systems department was trying to eliminate. Also, the product was somewhat expensive and considerably over-powered for this particular project.
RefBase, an open source platform, was seriously considered. It met all the needs defined by the team and many of the wants as well. In order to properly review the platform, the system was installed and a significant number of records were added for testing. Ultimately, RefBase was rejected. While a few academic institutions were still using it for online bibliographies, all development of the platform had stopped, and the remaining community surrounding the project was very small. This raised red flags about supporting the platform in the future and the amount of expected effort to customize the solution.
In the end, Drupal was selected as the platform to host the online bibliography. While the systems department was not yet using Drupal for any existing web applications, it was considered supportable due to its application stack. Also, this project was considered an interesting use case for exploring the possibilities of future applications for Drupal. When comparing Drupal against the list of requirements, the platform seemed to meet all the identified needs and most of the wants. Drupal was particularly attractive because of its active community of developers. This community provided detailed instructions for getting the bibliography up-and-running quickly and presented a setting to ask questions and find solutions for customizing the database without the use of significant staff time.
Right off the bat, Drupal comes with lots of bells and whistles. Features such as content management, access control, custom content types, and taxonomy allow site builders to quickly create a site and make it fully operational in very little time. However, DOTS has more needs than Drupal core can offer and therefore calls for additional modules to satisfy the list of requirements. One advantage of the Drupal platform is that there is virtually no limit with regard to functionality, as long as a module exists for the task. On top of that, the Drupal community is very large. As of 2012, there were more than 390,000 web sites constructed with Drupal (Buildwith, 2012) and more than 15,000 contributed modules available for all kinds of site building purposes.
Biblio is a Drupal contributed module that has been in active development since June 2006. Some key features that are particularly useful to this project are:
4) CiteProc enabled versions (6.x-2.x & 7.x) have an almost limitless selection of output styles.
5) Learn more about the CSL/CiteProc technology at citationstyles.org.
6) In-line citing of references.
7) Taxonomy integration. (rjerome, 2011)
The Biblio module works much like EndNote, except that it is fully integrated with Drupal. It introduces a new content type that is suited for managing bibliography records. All Biblio fields are customizable to accommodate specific needs. In addition, Biblio works with a variety of importing and exporting formats, with EndNote XML being the most useful one in this particular case. Custom mapping is also allowed to ensure accuracy, and taxonomy integration is utilized to create local subject headings that are unique to this project.
One cannot fully explore the potential of Drupal and the Biblio module without leveraging the power of the Views module. As its name implies, the Views module allows the site builder to create custom views of any site content. For this project, Views was needed to display bibliography records in tabular and Chicago Manual of Style (CMS) citation format. Views can also implement advanced search functionalities through exposed filters.
Other important modules needed were:
1) Pathauto, which creates a URL alias automatically based on content type.
2) Taxonomy manager, which is a superior alternative to Drupal’s default taxonomy management tool.
3) Tag Cloud, which creates a block of the most used terms displayed in the form of a tag cloud.
4) Taxonomy tree, which produces an improved view of all of the taxonomy terms in their natural hierarchical order.
Drupal’s module system not only offers solutions to meet the immediate project requirements, it also guarantees future expandability. The platform has the ability to grow organically, both in terms of content and functionality as end user demands increase over time. This allows the developers to easily expand the applications of the site without having to reinvent the wheel to meet future demands.
From the beginning of this project, the need for a controlled vocabulary was recognized to create access points, describe the publications, and create relationships among the citations. UT Libraries uses Library of Congress Subject Headings (LCSH) to describe the content of its collections. It was quickly realized that LCSH lack terms to describe many of the topics relevant to the Great Smoky Mountains region, such as names of places and people. Many LCSH are expressed using state abbreviations to clarify concepts, such as “Cades Cove (Tenn.).” This qualifier would be unnecessary in DOTS because all the content is related to the Great Smoky Mountains region. It appears that there is no established authority that includes a majority (or even half) of the descriptors ideal for indexing DOTS content, so the decision was made to create a local controlled vocabulary. The process began by surveying a selection of articles on a variety of topics and noting the subjects. The subjects formed into a list, and the list quickly became hierarchical with broader categories like Biology, Conservation, Mountain life, Park development, Tourism, and Places. Topics under these broad subjects are more specific and are often followed in the hierarchy with narrower terms. For example, Fauna is a narrower term under Biology, Mammals is a narrower term under Fauna, and yet even narrower terms exist under Mammals (see Figure 3).
Figure 3. Example of hierarchical layout of subjects.
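A hierarchy of this kind can be modeled as a mapping from each term to its broader term. The sketch below is only an illustration of the structure, not Drupal's taxonomy storage; the placement of "Wild boar" directly under Mammals is an assumption, and the names `BROADER`, `path_to_root`, and `narrower_terms` are hypothetical.

```python
# Each term maps to its broader (parent) term; top-level categories map to None.
# "Wild boar" under "Mammals" is an assumed placement for illustration.
BROADER = {
    "Biology": None,
    "Fauna": "Biology",
    "Mammals": "Fauna",
    "Wild boar": "Mammals",
}


def path_to_root(term, broader=BROADER):
    """Return the chain of terms from the top-level category down to term."""
    path = []
    while term is not None:
        path.append(term)
        term = broader[term]
    return list(reversed(path))


def narrower_terms(term, broader=BROADER):
    """Return the terms that sit directly beneath the given term."""
    return sorted(child for child, parent in broader.items() if parent == term)


print(" > ".join(path_to_root("Mammals")))
```

Storing only the broader-term link for each entry is enough to rebuild both the full path and the list of narrower terms.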
DOTS is designed for anyone who has an interest in Great Smoky Mountains research. Because the potential end user population is expansive, the terms for the controlled vocabulary have been created using common language rather than scientific. For example, the scientific name for wild boar is “Sus scrofa.” It is assumed that the majority of end users are more likely to recognize and use the descriptor “Wild boar” than “Sus scrofa.” A problem with using common language rather than scientific terminology is that the common names for flora and fauna species are different among regions. For example, the wild boar is also widely known as the wild hog, feral pig, razorback, Russian wild boar, and Eurasian wild boar. The scientific name is going to be the same all over the world, but since it is not expected that all DOTS end users will be scientists, the common terminology is used to increase accessibility. For concepts commonly referred to in more than one way, the terminology that appears most in publications about the Great Smoky Mountains region is used. A future project will be to experiment with implementing some sort of “see also” reference in the controlled vocabulary to direct end users to terms when a similar or related term is searched.
The “see also” reference would be an asset for all concepts as most terms seem to be known by more than one common name, and of course there are also gerunds, singulars, and plurals. For example, “moonshine” can also be called “white lightning,” and the concept can also be referred to as “moonshining” or “bootlegging.” When creating the controlled vocabulary, the proposed terms were all checked against LCSH. If a concept was expressed by LCSH with a term considered to be commonly used by the general public, that subject was adopted into the controlled vocabulary. For some concepts, the LCSH was not general enough. “Moonshine” is one illustration. LCSH uses “distilling,” which is a commonly-used term. However, “moonshine” was included in DOTS controlled vocabulary because it is the term most often used in publications and it seems to express a broader concept than the act of making moonshine, which is the only concept suggested with “distilling.”
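One simple way such a “see also” reference could work is a lookup table from variant names to the preferred descriptor. This is a sketch of the idea only, since the feature is described as future work; the table `SEE_ALSO` and the function `resolve_term` are hypothetical names, and the variants are the ones mentioned in this article.

```python
# Variant term (lowercase) -> preferred descriptor in the controlled vocabulary.
SEE_ALSO = {
    "moonshine": "Moonshine",
    "white lightning": "Moonshine",
    "moonshining": "Moonshine",
    "bootlegging": "Moonshine",
    "wild boar": "Wild boar",
    "wild hog": "Wild boar",
    "feral pig": "Wild boar",
    "razorback": "Wild boar",
    "russian wild boar": "Wild boar",
    "eurasian wild boar": "Wild boar",
}


def resolve_term(query):
    """Return the preferred descriptor for a query, or None if it is unknown."""
    return SEE_ALSO.get(query.strip().lower())


print(resolve_term("White Lightning"))
```

A flat table like this handles synonyms and spelling variants, but it would need to be maintained alongside the vocabulary as new terms are added.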
Geographic names are usually some of the easier terms to establish when forming a controlled vocabulary, but Great Smoky Mountains place names proved to be more convoluted to express than other terms because the names of mountains and trails are spelled differently across sources. For example, Mount Le Conte also appears in publications as “Mount LeConte,” “Mt. LeConte,” “Mt. Le Conte,” “Mount LaConte,” “Mt. LaConte”, “Mount La Conte,” and “Mt. La Conte.” Also, many places are remote and even the United States Board on Geographic Names does not include all the names of places in the Smokies. Luckily, the book Place Names of the Smokies by Allen R. Coggins lists almost all the names of mountains, trails, valleys, and coves in the park with a description of how the names originated. This book was adopted as the authority on place names.
As the controlled vocabulary grew, certain standards were established to create consistency among the terms. For instance, all flora and fauna are expressed in the plural (“bats” rather than “bat”) and all personal names are formatted as First name Last name (“Wiley Oakley” rather than “Oakley, Wiley”). At the moment, the controlled vocabulary displays in DOTS as one long list reflecting the hierarchy. To browse the list of terms, users must scroll up and down the page, a task that is tedious to most end users. The controlled vocabulary will continue to expand, which will only add to the annoyance of scrolling and most likely make end users feel overloaded with information. To alleviate these problems, better ways to display the terms are currently being reviewed.
Because of the involved process of creating DOTS, it is crucial to establish a workflow that is smooth, efficient, and less prone to human error. The essential process can be described using the following diagram:
Figure 4. DOTS workflow.
As with many digitization projects, the first step begins by opening up file cabinets and digging through hundreds of paper records. When a record is identified, it is entered into EndNote manually, thus generating an electronic copy. The EndNote record serves as the foundation of the online database. The site curator will always create records in EndNote first and later mirror the changes online. This approach ensures that records will only be generated in one location: the site curator’s computer, which is backed up periodically. The Drupal database is fully capable of handling data input, but EndNote is very mature in terms of generating semantically correct records that are compatible with many other programs and therefore is the obvious choice for creating records. It is important to note that this decision is not a redundancy measure. The DOTS database is backed up nightly and stored to tape.
Batches of records are periodically exported from EndNote as XML files, ready to be imported into the Drupal database. Before making the transfer, field mappings are checked to make sure they match up correctly. Every major field that comes out of EndNote has been mapped automatically in the Biblio module.
Once done, the mapping should not change unless more fields are added in the future. Uploading batch files into Drupal is a quick-and-easy process that requires little interference. New records are added as new nodes in Drupal. The system automatically detects duplicate records. Every EndNote record has a unique identifier attached to it, which is not visible to the end user but is stored as metadata in the XML file. The Drupal database will look for this information first before importing a new record. If a duplicate is detected, Drupal automatically skips that record and moves on to the next one. This feature proves to be extremely handy since record creation and record presentation are done in two separate systems.
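The skip-on-duplicate behavior can be sketched as follows. This is an illustration of the logic described above, not the Biblio module's code; the record layout, the `uid` key, and the function `import_batch` are assumptions.

```python
def import_batch(existing_ids, batch):
    """Import records whose identifier is new; skip the rest.

    existing_ids is the set of identifiers already in the database and is
    updated in place. Returns two lists: imported and skipped identifiers.
    """
    imported, skipped = [], []
    for record in batch:
        uid = record["uid"]  # hypothetical key holding the unique identifier
        if uid in existing_ids:
            skipped.append(uid)  # duplicate: move on to the next record
            continue
        existing_ids.add(uid)
        imported.append(uid)
    return imported, skipped


already_loaded = {"rec-1"}
batch = [{"uid": "rec-1"}, {"uid": "rec-2"}, {"uid": "rec-3"}]
imported, skipped = import_batch(already_loaded, batch)
print(imported, skipped)
```

Because the check relies on the identifier rather than on the title or author text, the same EndNote library can be re-exported in full without creating duplicate nodes.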
To leverage Drupal’s true power as a content management system, taxonomy terms need to be assigned to each individual record. Drupal’s taxonomy system allows the creation of the aforementioned controlled vocabulary, which is used to establish relationships among Biblio records. Subjects are assigned in EndNote during record creation. The Biblio module then enables automatic cloning of those subjects to a specified taxonomy vocabulary. When a new record is imported from EndNote, the keywords in that record are copied to an additional taxonomy field. These taxonomy terms are pre-organized in hierarchical relationships using the Taxonomy manager module. Thus, the newly imported records will become interconnected with other records instantaneously, again with minimal human intervention.
At this stage, the basic record creation process is complete. It is now the system administrator’s responsibility to ensure that all imported records are properly indexed and visible to search. Records that are not indexed are not available to search in a Drupal environment. Indexing is tied to cron, a job scheduler in UNIX-like operating systems (IEEE, 2008): the Drupal index is updated whenever cron is scheduled to execute. Cron runs on a fixed schedule mainly for performance reasons, because site performance is impacted significantly while it executes. Therefore, it makes sense to upload new records and index them during non-peak hours, such as weekend evenings when the server load is small and the impact of not having new records immediately available for search is minimal.
Last but not least, the entire Drupal site and the backend MySQL database need to be backed up regularly to ensure maximum reliability. In the event of a failed import attempt or a database error, system administrators should have the ability to quickly restore the database to its pre-error state. This can be achieved through Drush, a command line tool for Drupal. Drush allows a one-step dump of the entire site and database into a compressed file, and an additional step to restore everything to its original state. Drush backup can be set on a cron schedule as mentioned earlier, thus enabling automatic backup.
From creation to usage to management and maintenance, there are certainly many steps involved in this life cycle. But every step is carefully designed and tested before being introduced to the workflow. Every step is necessary to complete a fluid, error free, reliable, and sustainable process.
Ideally, citations should display in precise CMS format, as if the editors of the CMS formatted the content themselves. However, Drupal is open source software, and while the Biblio module works well, there were still many errors to correct, and finding solutions to those errors was challenging. The following examples express the most common issues related to displaying CMS formatted citations and provide an explanation and solution for each issue.
Example: Minetor, Randi and Minetor, Nic. Great Smoky Mountains National Park Pocket Guide. Guilford, CT: Falcon, 2008.
Issue: If there are two authors, both names would display in Last name, First name format. Explanation: Both author names are treated the same in Drupal in terms of format, regardless of order.
Solution: Modify the PHP script that controls the formatting for CMS format. When more than one author is detected, those beyond the first author name will be displayed in First name Last name format.
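The rule behind this fix can be sketched in a few lines. The sketch is written in Python for illustration and is not the modified Biblio PHP script; the function name `format_authors` and the (first, last) tuple input are assumptions, and the comma before "and" follows the CMS bibliography convention as the sketch's author understands it.

```python
def format_authors(authors):
    """Format a list of (first, last) name tuples for a bibliography entry.

    Only the first author is inverted to "Last, First"; every later
    author is written "First Last".
    """
    if not authors:
        return ""
    first, last = authors[0]
    names = [f"{last}, {first}"]
    names += [f"{f} {l}" for f, l in authors[1:]]
    if len(names) == 1:
        return names[0]
    # Join with commas and place "and" before the final name.
    return ", ".join(names[:-1]) + ", and " + names[-1]


print(format_authors([("Randi", "Minetor"), ("Nic", "Minetor")]))
```

Keeping the inversion tied to position rather than to the author record itself means the same author can appear inverted in one citation and in natural order in another.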
Example: Service, National Park. "Disposition of Elkmont Structures, Great Smoky Mountains National Park." (April 11, 2002).
Issue: When the author is a corporation or organization, the name would display as if the last word in the name were the last name. Explanation: In EndNote, author name can be entered in either First name Last name format or Last name, First name format and EndNote will format correctly according to citation style. To signify that the name is a corporation or organization and should not be reformatted, a comma must be entered at the end (National Park Service,).
Solution: Enter comma after name in EndNote and author name will format correctly in Drupal.
Example: Banes, Ruth A.. "Doris Ulmann and Her Mountain Folk." Journal of American Culture 8 (1985): 29-42.
Issue: When there is one author and their name includes a middle initial, two periods would display after the middle initial. Explanation: The Biblio module inserts a period after the author’s name but does not have a command written into the PHP script to ignore that punctuation if a period is part of the author name.
Solution: Omit periods after middle initials in EndNote.
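The team's fix was made on the data entry side. For comparison, a formatter could also guard against the doubled period itself. The following is a sketch of that alternative, written in Python for illustration and not taken from the Biblio module; the function name `end_with_period` is hypothetical.

```python
def end_with_period(text):
    """Append a closing period unless the text already ends with one."""
    text = text.rstrip()
    if not text:
        return ""
    return text if text.endswith(".") else text + "."


print(end_with_period("Banes, Ruth A."))
print(end_with_period("Meyer, Samuel L"))
```

A guard like this would let middle initials keep their periods in EndNote, at the cost of maintaining a local change to the formatting code.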
Example: Meyer, Samuel L. "Autumn Coloration in the Great Smokies".Journal of the Tennessee Academy of Science 17, no. 3 (1942): 269-272.
Issue: Periods appear outside quotation marks. Explanation: Punctuation appears outside quotation marks due to an order issue in the PHP script that controls the formatting.
Solution: Alter the PHP script so that periods are rendered before quotation marks.
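The ordering fix amounts to adding the period to the title before the closing quotation mark is written. A minimal Python sketch of that ordering follows; it is an illustration rather than the altered PHP script, and the function name `quote_title` is hypothetical.

```python
def quote_title(title):
    """Wrap an article title in quotation marks with the period inside."""
    title = title.strip()
    # Titles that already end in terminal punctuation get no extra period.
    if title.endswith((".", "?", "!")):
        return f'"{title}"'
    return f'"{title}."'


print(quote_title("Autumn Coloration in the Great Smokies"))
```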
Example: Alexander, Bob. "Our Home Was Built Near the Mountain So Wild." Tennessee Connections Summer, (1999): 14-15.
Issue: Commas appear after periods when certain fields are left empty. Explanation: Comma appears because the content type has fields for volume and issue. These fields have been left empty because volume is not applicable, yet the Biblio module displays a comma to follow this information as if volume number is included.
Solution: Modify PHP script to direct it not to render a comma when the volume or issue fields are empty.
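The conditional rendering described in this solution can be sketched as follows. This is a Python illustration rather than the modified Biblio PHP script; the function name `format_journal_part` and its parameters are assumptions, and the punctuation pattern follows the journal citations quoted in this article.

```python
def format_journal_part(journal, volume="", issue="", date="", pages=""):
    """Build the journal portion of a citation, skipping empty fields.

    Separators are emitted only together with the field they introduce,
    so an empty volume or issue leaves no stray comma behind.
    """
    text = journal
    if volume:
        text += f" {volume}"
    if issue:
        text += f", no. {issue}"
    if date:
        text += f" ({date})"
    if pages:
        text += f": {pages}"
    return text + "."


print(format_journal_part("Tennessee Connections", date="Summer 1999", pages="14-15"))
```

Attaching each separator to the field that follows it, instead of to the field that precedes it, is what prevents punctuation from being left behind when a field is empty.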
Example: Wells, B W. "Origin of the Southern Appalachian Grassy Balds." Science 83 (March 20, 1936) (1936): 283.
Issue: Year field can only contain four digits and always displays, so when the date field is used to express the full date, the year is repeated. Explanation: Year field is mandatory information in Biblio module. It is used to sort records, so it cannot be disabled.
Solution: Tackle problem from the date field and express dates without year (March, 20). This solution is a compromise because the date is broken up between two fields and, depending on the record type, will display with parentheses distinguishing the two elements.
Example: Cain, Stanley A, Smoky Mountains Hiking Club, and The University of Tennessee Department of Botany. "A Preliminary Guide to the Greenbrier-Brushy Mountain Nature Trail, the Great Smoky Mountains National Park." submitted.
Issue: Year field can only contain four digits, so there is no way to express an approximated date for material that is undated. In addition, when the year field was left empty, the word “submitted” appears in the citation. Explanation: Biblio module uses the year field to sort records. Biblio is not capable of sorting content that is a “circa” date. If no date is included, the system treats it as material that has not yet been published, hence “submitted” in place of Year.
Solution: Incorporate a best practice that states all records must have a Year. If the year is approximated, use the date field for “circa,” so the record displays as: (circa) (2002).
Example: Smith, Harvey T. “A Study of Trillium.” Master of Science. Knoxville, Tenn.: University of Tennessee, 1998.Illustrated, Maps.
Issue: Notes field would display at the end of every citation. Explanation: PHP script identified the notes field as part of the citation format.
Solution: Remap the notes field from EndNote to Biblio module.
Example: Albee, Thomas F. Preliminary Investigation of Light Scattering and Visibility in Two Eastern National Parks. Vol. Master of Science. Virginia Polytechnic Institute and State University, 1979.
Issue: “Vol.” would display in Theses citation when degree type was entered in the EndNote degree field. Explanation: The EndNote field labeled “Degree” is the same table as the volume field in other reference types.
Solution: Disable the degree field in EndNote and include degree type under the “Thesis Type” label in EndNote, which is equivalent to the type of work field.
Additional challenges came to light as the search and retrieve functions of the Drupal database were tested. For example, if the citation has no author, the entire record is unsearchable. This presented an issue for anonymous publications. The obvious solution was to enter “Anonymous” in the author field. However, this solution would make “Anonymous” recognized, related, and sorted as an author name. The working solution is to enter “Anonymous” in the secondary author field on the Drupal side because the secondary author field is not mapped correctly between Biblio and EndNote. This solution makes the record searchable; however, editing the record on the Drupal side interferes with the established workflow. This issue is still in the process of being fully resolved.
In the process of overcoming the various challenges, best practices for adding content to the database were established. Since a large part of developing DOTS has been making the process of populating the database as seamless as possible, with as little record cleanup as possible, it made sense to record best practices in a way that can be referenced in the future. Establishing best practices will hopefully create consistency among the citations and prevent future site curators and system administrators from rediscovering problems and solutions. A document was created to illustrate the guidelines for describing publications in EndNote, assigning and creating subject headings, exporting records from EndNote to XML format, and uploading the XML records to Drupal. The Wiki module was then implemented in Drupal, and a section for best practices was created within it. The information from the best practices document was uploaded to this section of the Wiki for access by those with administrative permissions.
Developing DOTS with Drupal has proven to be the right decision. While the requirements of the project steered us toward the platform from the beginning, there were a few initial fears. Most specifically, the systems staff was unsure of using Drupal since they had no experience using it and no immediate plans to use it for other projects. In the end, this project proved to be a great test case for understanding the platform’s framework, and several other projects using Drupal are now being considered.
Furthermore, the platform’s extensibility has been invaluable. While the process of selecting a framework based on a list of requirements helped the team be systematic, it did not ensure that unforeseen functions would be easily implemented. Drupal will allow the team to easily deploy future features for DOTS such as crowd-sourced records and see-also references for taxonomy subject headings. This extensibility ensures that DOTS will evolve over time not only in the number of records but also through meaningful, cutting-edge enhancements. These characteristics will help DOTS remain relevant for years to come.