Tri-Trophic Database Project: Data Cleaning and Data Dissemination Utilizing Discover Life
iDigBio Summit III, November 18-20, 2013
We held a demo at the iDigBio Summit III meeting about how the TTD-TCN utilizes Discover Life services. Below is the documentation of that demo.
Discover Life (http://www.discoverlife.org) is a data portal whose mission is to assemble and share knowledge about biodiversity. The project is located at the University of Georgia, under the direction of Dr. John Pickering. Fundamentally, Discover Life is a data aggregator, with information from more than 108 institutions and the Global Biodiversity Information Facility (GBIF). Discover Life uses this collective data to create 1,268,125 species pages with 624,476 maps. Additionally, DL exports maps to the Encyclopedia of Life, exports records to GBIF, has assimilated over 1.2 million valid taxon names, and holds geographic boundary limits for many of the world's divisions.
Who can participate?
Anyone can participate right now. To sign up for help with data cleaning, email John Pickering: email@example.com
Data Cleaning Services with Discover Life
Discover Life leverages the large amount of collective specimen-based information available in its databases for data cleaning efforts. Project datasets are compared with data in Discover Life and analyzed for differences and possible errors. Two Discover Life services used heavily by the TTD-TCN are: 1) Locality Data Checking and 2) Taxon Name List Checking.
The general paradigm for all Discover Life services is to “round-trip” your data: data you expose on the web are passed through a comprehensive DL data checker and returned to you on the web at a different address. Data providers (like the TTD-TCN) expose data through a text file on the web and let DL know its location. Discover Life picks up this text file from the URL every night and processes those records. The data you provide are compared with the collective information found in the DL databases. The results of the comparison are then returned to the provider in a separate text file, at a different web address, for review at any time.
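On the provider side, the returned file still has to be matched back to the provider's own records before anything can be reviewed. The following is a minimal sketch of that step, not Discover Life's actual format: it assumes both files are tab-delimited with a header row, and the column names `id` and `note` are hypothetical. The sample text stands in for files that would normally be fetched from the two web addresses.

```python
import csv
import io


def parse_records(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def records_needing_review(exported_text, returned_text, key="id"):
    """Pair each flagged record in the returned file with the provider's record.

    The `id` and `note` column names are assumptions for illustration.
    """
    exported = {row[key]: row for row in parse_records(exported_text)}
    review = []
    for flagged in parse_records(returned_text):
        original = exported.get(flagged[key])
        if original is not None:
            # Carry the checker's note alongside the provider's own fields.
            review.append({**original, "note": flagged.get("note", "")})
    return review


# Illustrative sample data only (not real TTD-TCN records).
exported_text = "id\tlatitude\tlongitude\n1\t35.78\t-78.64\n2\t95.00\t-78.64\n"
returned_text = "id\tnote\n2\tlatitude out of range\n"

review = records_needing_review(exported_text, returned_text)
for row in review:
    print(row["id"], row["note"])
```

Because the comparison runs nightly, a record that has been corrected at the source simply stops appearing in the returned file, so the review list shrinks without any extra bookkeeping.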
“round-trip” your data
The Tri-Trophic Database project, like many others, uses the “round-trip” service in two main ways. The first is for latitude/longitude locality checking. An example output from this Discover Life service can be found here: http://pick18.pick.uga.edu/DB/NCSU/BAD.txt. The output lists coordinates that do not fit within the pixel maps of the world, or the map worldview, that DL maintains. Additionally, the return service adds a note to the output text describing what might be the source of the error. These localities are then reviewed in the TTD-TCN database, as they are considered suspect until further appraisal. Once records are corrected in the TTD database, the file made available to DL also reflects the correction, and the record no longer appears in the BAD.txt file shown above.
A second service, performed in a similar way to the locality cleaning service, is a valid name checker. A list of names that TTD provides to Discover Life is compared with the entire set of 1.2 million valid names from DL. The TTD-TCN clean-up covers both host plant names and insect names. To augment the DL name lists, we periodically provide DL with highly vetted name lists from taxon experts (particularly from the Miridae Catalog). DL then assimilates any updates from the catalog into its services. An example output from a checklist is: http://www.discoverlife.org/nh/cl/US/GA/Clarke/moth.cl
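To give a feel for the kind of check involved, here is a much-simplified sketch. It is not Discover Life's pixel-map method: it only tests coordinate ranges and an optional bounding box, and it attaches a note guessing at the source of the error. The function name, the note wording, and the rough North Carolina bounding box are all assumptions made for illustration.

```python
def check_coordinates(lat, lon, bbox=None):
    """Return a note describing a likely problem, or None if the point passes.

    bbox is an optional (min_lat, max_lat, min_lon, max_lon) tuple.
    This is a simplified stand-in for a map-based check.
    """
    if not -90 <= lat <= 90:
        # A "latitude" that would be a legal longitude suggests a swap.
        if -90 <= lon <= 90 and -180 <= lat <= 180:
            return "latitude out of range; latitude and longitude may be swapped"
        return "latitude out of range"
    if not -180 <= lon <= 180:
        return "longitude out of range"
    if bbox is not None:
        min_lat, max_lat, min_lon, max_lon = bbox
        inside = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        if not inside:
            # Flipping the longitude sign is a common transcription error.
            if min_lat <= lat <= max_lat and min_lon <= -lon <= max_lon:
                return "outside expected region; longitude sign may be wrong"
            return "outside expected region"
    return None


# Approximate bounding box for North Carolina, for illustration only.
NC_BBOX = (33.8, 36.6, -84.4, -75.4)

print(check_coordinates(35.78, -78.64, NC_BBOX))
print(check_coordinates(35.78, 78.64, NC_BBOX))
print(check_coordinates(120.5, 35.0))
```

A bounding box is far coarser than a pixel map, so a check like this can only pre-screen records; borderline localities still need the full service and a human review.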
Name resources DL utilizes include, but are not limited to, TROPICOS, Plant Name Index, Catalog of Life, and ITIS. Discover Life also maintains a list of all known synonyms for valid names, which includes misspellings.
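The name check described above can be pictured as a lookup against a set of valid names plus a table that maps known synonyms and misspellings to their valid name. The sketch below assumes that structure; the function names and the tiny sample lists are hypothetical, and the misspelling entry is made up for the example rather than taken from any DL list.

```python
def normalize(name):
    """Collapse whitespace and standardize capitalization (Genus species)."""
    parts = name.split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def check_name(name, valid_names, synonyms):
    """Classify a name as valid, a known synonym/misspelling, or unknown.

    Returns a (status, resolved_name) tuple; resolved_name is None when
    the name cannot be matched.
    """
    cleaned = normalize(name)
    if cleaned in valid_names:
        return ("valid", cleaned)
    if cleaned in synonyms:
        return ("synonym", synonyms[cleaned])
    return ("unknown", None)


# Tiny illustrative lists; a real check runs against the full name lists.
valid_names = {"Lygus lineolaris", "Quercus alba"}
synonyms = {"Lygus lineolarus": "Lygus lineolaris"}  # made-up misspelling

for submitted in ["  lygus   LINEOLARIS ", "Lygus lineolarus", "Lygus fakeus"]:
    print(repr(submitted), check_name(submitted, valid_names, synonyms))
```

Names that come back as unknown are the ones worth a closer look: they may be genuine errors, or valid names missing from the reference list, which is where vetted expert lists help.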
Transcription of Labels
Discover Life provides a new label transcription service for any natural history collection label.
Discover Life Time Machine (public view)
The transcription service uses full-quality JPG images from providers. Providers can either upload images directly to Discover Life or provide URLs from which Discover Life can pick up the images. Functionality highlights include:
1. Different views for authoritative and nonauthoritative digitizers
2. Ability for providers to hide uploaded data and images from the public. This functionality is important when working with endangered specimens.
3. Providers may include OCR text for parsing by Discover Life into locality, collector, institution and other fields. Alternatively, Discover Life will perform the OCR for providers.
4. Results of transcribed labels are returned to providers as a text file through a web page such as:
TTD-TCN Integration Portal
The display of organism association data, or trophic level integration, is a fundamental product of the Tri-Trophic Database project. Discover Life creates view and discovery pages where the public can explore host - plant - parasitoid data on the web.
Modeling association data across institutions is part of the challenge. We have a proposed list of defined MISC fields for exposing those data and are soliciting input about the data structure. We think one important aspect of host data is well-defined relationships.
Feedback to Providers
Feedback must be returned to providers; records are not edited directly on Discover Life. Comments and corrections about specimens are emailed directly to providers using a simple feedback form. The email and contact information for providers is carefully curated in the DL database.