
Data Mining Workshop

ADBC TCN Tri-trophic Database: Hemiptera, their plant hosts and parasitoid Hymenoptera

Data mining and distribution modeling workshop (UCR)

Workshop Outline


Location: Department of Entomology, UC Riverside

Date: June 17-18, 2014 (Tuesday to Wednesday)

Objective: Bring together ADBC TTD TCN participants and external collaborators and experts to work on a series of research questions. The goal of the workshop is to develop a set of draft papers. Group participation is oriented toward working through problems and offering suggestions. The overall aim is to integrate collection data with higher-level questions in science, from biogeography to host associations, climate change and other major issues.

Remote Participation: Remote viewing of the talks will be available through Adobe Connect (http://idigbio.adobeconnect.com/ttd-tcn). All presentations will be recorded, and made available here. 

Research projects:

Video of short explanations of the research projects (http://idigbio.adobeconnect.com/p4gz8umroyd/)

1. Evolution of host range in scale insects (lead: Normark)

2. Assessment of host-network associations found in natural history collection data (lead: Seltmann)

3. Areas of endemism in Western North America (lead: Schuh)

4. Data mining – treehoppers, oaks and climate change (lead: Bartlett)

5. Adding the “tri” in tri-trophic data: mining parasitoid information (lead: Heraty)


Workshop participants:

In Riverside

1.     Ben Normark (1 scales)

2.     Daniel Peterson (1 scales; grad student)

3.     Nate Hardy (1 scales; external expert)

4.     Geoff Morse (1 scales)

5.     Katja Seltmann (2 host networks)

6.     Neil Cobb (2 host networks; external expert)

7.     Toby Schuh (3 areas of endemism)

8.     Mary Ann Feist (3 areas of endemism)

9.     Christiane Weirauch (3 areas of endemism)

10.  Jorge Soberon (3 areas of endemism; external expert)

11.  Jacob Christian Cooper (3 areas of endemism; PhD student working on joint plant and pollinator distributions)

12.  Chris Johnson (3 areas of endemism; 4 tree hoppers)

13.  John Heraty (5 tri-trophic)

14.  Jordie Ocenar (5 tri-trophic; Hawaiian PhD student interested in modeling)

15.  Pamela Soltis (TBA; external expert)


Virtual participation (Adobe Connect for overview sessions; AC or Skype for breakout sessions)

1.     Michael Schwartz (3 areas of endemism; external expert [web presence])

2.     Charles Bartlett (4 tree hoppers [web presence])

3.     Matt Yoder (5 tri-trophic; external expert [web presence])


Guests (Tuesday morning attendance and maybe more)

4.     Kim Barao (PhD student; UFRGS/UCR)

5.     Rochelle Hoey-Chamberlain (UCR)

6.     Michael Forthman (PhD student; UCR)


Summaries of research projects

1. Evolution of host range in scale insects (Normark/Peterson)

UMass's proposal, authored by PhD student Daniel Peterson, Nate Hardy and Geoff Morse:

We propose to use the TTD data to investigate the evolution of host range in scale insects. Some scale insects are extreme generalists (utilizing dozens or >100 host plant families), although most have more limited host ranges. Because scales are wind-dispersed and flightless, they are under strong selection pressure to be able to utilize all hosts present in their environment. Nevertheless, fitness tradeoffs between hosts or segregation by microclimate may prevent scales from utilizing all hosts in a given geographic location. By analyzing the overlap between sets of scale insect species found on different hosts in the same habitats, we can quantify whether scale-host associations conform to a null model of random associations or whether some hosts share scales less often than expected. A network-based framework will allow us to randomize the insect-host associations while maintaining the unique structure of the data, permitting a test of the statistical significance of any “missing overlap” between sets of scales on hosts. The TTD data are extremely well suited to this analysis because locality data for each insect specimen will allow us to restrict comparisons to those between species known to inhabit the same geographic area, a significant challenge for analyses of more general host association datasets.
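The randomization test described above could be sketched roughly as follows. This is only an illustration, not the proposal's actual method: it assumes the associations have already been filtered to a single geographic area, it uses a simple degree-preserving edge-swap null model (one of several possible choices), and the function names and the example_edges data are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical (scale species, host plant) associations from one area.
example_edges = [
    ("scale1", "oak"), ("scale2", "oak"), ("scale3", "oak"),
    ("scale3", "pine"), ("scale4", "pine"), ("scale5", "pine"),
    ("scale1", "maple"), ("scale4", "maple"),
]

def shared_scales(edges, host_a, host_b):
    """Count scale species recorded on both hosts."""
    on_host = defaultdict(set)
    for scale, host in edges:
        on_host[host].add(scale)
    return len(on_host[host_a] & on_host[host_b])

def randomize(edges, n_swaps, rng):
    """Edge-swap null model: exchange the hosts of two random associations.

    Every scale keeps its number of hosts and every host keeps its number
    of scales; swaps that would duplicate an association are skipped.
    """
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (s1, h1), (s2, h2) = edges[i], edges[j]
        new1, new2 = (s1, h2), (s2, h1)
        if new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def missing_overlap_p(edges, host_a, host_b, n_null=500, seed=0):
    """Return observed overlap and the one-sided p-value that a randomized
    network shows overlap as low as, or lower than, the observed one."""
    rng = random.Random(seed)
    observed = shared_scales(edges, host_a, host_b)
    n_swaps = 10 * len(edges)  # assumed mixing effort per null network
    hits = sum(
        shared_scales(randomize(edges, n_swaps, rng), host_a, host_b) <= observed
        for _ in range(n_null)
    )
    return observed, (hits + 1) / (n_null + 1)

observed, p_value = missing_overlap_p(example_edges, "oak", "pine")
print(observed, p_value)
```

A small p-value for a host pair would suggest that the two hosts share fewer scale species than the null model predicts. With real data the test would be repeated for every host pair, so some correction for multiple comparisons would be needed.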

Assuming we find significant evidence for missing overlap in the scale-host network, we can further examine what mechanisms may be involved. This analysis could take the form of a regression model in which the inferred divergence from expected overlap between each pair of hosts (from the above analysis) is the dependent variable, and the explanatory factors could include: phylogenetic distance between host plants, similarity of plant chemistry or physical defenses, geographical overlap between host distributions, and the hosts' preferred microclimates.

2. Assessment of arthropod data from natural history collections (Seltmann/Cobb)

We would assess arthropod data captured during natural history collection specimen digitization efforts for accuracy, usability, completeness, and applicability for answering biological questions. Data from the TTD-TCN project would be pooled with data aggregated from SCAN, iDigBio, and GBIF in order to evaluate where we are generally as a community.

Data assessment and evaluation would in part be based on the "decision tree" algorithm proposed by Seltmann, Schuh, and Johnson (summary: http://tinyurl.com/kylqgox) for assessment of reliability of host-specimen networks. This approach would then be formalized and extended to evaluate other aspects of collection record information besides host networks.


1. Summarize these areas of interest regarding arthropod data from natural history collection records:
a. "Completeness" of available arthropod records (locality, date, and identification information)
b. Utility and accuracy of records for niche modeling applications

  • What percentage of species in different taxonomic and functional categories can be modeled?
  • As you go up in trophic levels, fewer species will be model ready.
d. Summary statistics of data acquired from both SCAN and TTD-TCN projects
e. Provide some idea/insight for future digitization efforts
2. Address specific biological questions in order to evaluate our assessment of data records:
a. Host specificity as elucidated from arthropod collecting records (network analysis)
b. Evaluation of species richness in North America along rainfall and/or altitudinal gradients (niche analysis)


We are still uncertain if we are capturing enough data useful for climate change analysis, niche modeling, or to address “big data” biological questions. We hope to outline the general characteristics of those arthropod data captured from historic specimens thus far, suggest ideas for improving digitization of "research ready" datasets, and propose a list of idealistic guidelines for future specimen data capture.

3. Areas of endemism in Western North America (Schuh/Weirauch/Schwartz/Feist)

Heteroptera: We propose using our data as a way of better understanding areas of endemism, particularly in Western North America. Software is available to help identify such areas. We might also be able to acquire data from the SCAN project in an effort to compare phytophagous species with ground-dwelling taxa. As already suggested by Christiane, she and I would work collaboratively with Michael Schwartz on this issue, but I suggest adding a component that might help to better understand the factors influencing these areas.

Plants: Centers of diversity in North America's largest groups of vascular plants and their phytophagous Hemiptera (Naczi/Feist)


1. Identify centers of diversity for N. America's largest plant genera.
2. Identify centers of diversity for N. America's largest families of phytophagous Hemiptera.
3. Test the following hypotheses.
a. North America's most diverse plant genera host a correspondingly large portion of North America's phytophagous hemipteran diversity.
b. Centers of diversity for plants coincide with centers of diversity for phytophagous Hemiptera.


An investigation of centers of diversity of North America's largest vascular plant genera and their bugs would help us understand biogeography and diversification of both groups.
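
One possible way to approach hypothesis (b) is sketched below: count species richness per latitude/longitude grid cell for each group, then correlate the two richness values across cells. The grid-cell approach, the one-degree cell size, the function names, and the example records are all assumptions made for illustration. A real analysis would also need to account for uneven collecting effort and spatial autocorrelation.

```python
import math
from collections import defaultdict

def richness_by_cell(records, cell_deg=1.0):
    """Map each grid cell to the number of distinct species recorded in it.

    records: iterable of (species, latitude, longitude) tuples.
    """
    species_in_cell = defaultdict(set)
    for species, lat, lon in records:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        species_in_cell[cell].add(species)
    return {cell: len(spp) for cell, spp in species_in_cell.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def richness_correlation(plant_records, bug_records, cell_deg=1.0):
    """Correlate plant and hemipteran richness over all occupied cells.

    Cells occupied by only one group count as richness 0 for the other.
    """
    plants = richness_by_cell(plant_records, cell_deg)
    bugs = richness_by_cell(bug_records, cell_deg)
    cells = sorted(set(plants) | set(bugs))
    return pearson([plants.get(c, 0) for c in cells],
                   [bugs.get(c, 0) for c in cells])

# Hypothetical occurrence records: (species, latitude, longitude).
example_plants = [
    ("plant_a", 34.2, -117.4), ("plant_b", 34.7, -117.1), ("plant_c", 34.5, -117.9),
    ("plant_a", 35.3, -116.2),
    ("plant_b", 40.1, -109.5), ("plant_d", 40.8, -109.9),
]
example_bugs = [
    ("bug_a", 34.1, -117.2), ("bug_b", 34.3, -117.6), ("bug_c", 34.9, -117.8),
    ("bug_d", 35.6, -116.7),
    ("bug_a", 40.4, -109.3), ("bug_e", 40.2, -109.1),
]

print(richness_correlation(example_plants, example_bugs))
```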


4. Data mining – Treehoppers, oaks and climate change (Bartlett/Johnson)

Most treehopper species in the large subfamily Smiliinae are oak feeders, with species generally thought to be narrowly oligophagous (feeding on few oak species, or occasionally also on Carya or Juglans; a few smiliines feed on unrelated trees).

Generally, treehoppers are collected only as adults, but adults are mobile, and it is unclear whether associated host records are resting/dispersal records or ‘good’ host records (in the absence of observations on feeding activity or the presence of nymphs); so it is unclear whether any given species is actually monophagous, or truly oligophagous (none are polyphagous).  Most treehopper workers collect treehoppers by inspection and hand collection, so it is usually clear which plant the treehopper was associated with.

A. Our TCN data could be used to explore host usage by providing quantitative observations regarding host records (i.e., numbers of plant host records for each treehopper species) to explore host range (presumably there will be more records on the real host(s)).  Treehopper records could then be mapped against tree species distribution for correspondence.  (This follows the general spirit of some of Matt Wallace’s work – he used sticky traps in oak species to document treehopper abundance and seasonality). 
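
As a rough sketch of the tally described in A, the number of records per host can be counted for each treehopper species and expressed as a share of that species' records. The record format, function names, and placeholder labels below are hypothetical and do not reflect the TCN database schema.

```python
from collections import Counter, defaultdict

def host_record_counts(records):
    """Tally specimen records per host plant for each treehopper species.

    records: iterable of (treehopper species, host plant) pairs,
    one pair per specimen record.
    """
    counts = defaultdict(Counter)
    for treehopper, host in records:
        counts[treehopper][host] += 1
    return counts

def host_share(counts, treehopper):
    """Fraction of a species' records on each host, most frequent first."""
    total = sum(counts[treehopper].values())
    return {host: n / total for host, n in counts[treehopper].most_common()}

# Hypothetical records with placeholder names.
example_records = (
    [("treehopper_A", "oak_1")] * 8
    + [("treehopper_A", "oak_2")] * 2
    + [("treehopper_B", "oak_2")] * 3
)

counts = host_record_counts(example_records)
print(host_share(counts, "treehopper_A"))
```

A host that accounts for most of a species' records is a candidate for the real host. Hosts with only a few records may instead reflect resting or dispersal, or simply where collectors looked.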

It is possible that host ‘preference’ varies over space, but I am not sure how to investigate that.

Treehopper life history (at least for treehoppers associated with woody plants) is closely tied in with host phenology.  Generally smiliine treehoppers are univoltine (some *may* have a second generation), with eggs inserted into the current year's growth of the host species.  Eggs overwinter, and may require desiccation (and possibly cold stratification), followed by rehydration in order to hatch.  Egg hatch is therefore tied in with spring sap flow, and subsequent development of nymphs is associated with temperature.  The emergence of adults would be tied in with the conditions of that particular year; consistent changes in the seasonality of treehoppers might reasonably be attributed to changes in climate.

B. Use TCN data to examine changes, if any, in the seasonality of oak-feeding treehoppers.  The expectation is that if the climate is warming, treehoppers would be emerging earlier from eggs and becoming adults earlier in the year.  Some metrics that might be examined include the earliest-dated record (of a particular species) from each year, or the median (or mode) date from a given year, with some form of regression to examine trends in these data (particularly if we can figure out how to correlate emergence dates with temperature data).  I am not sure that it matters which treehopper species and which oak species are involved, but I think it would be best to restrict the dataset to oak-feeding taxa.
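
A minimal sketch of the earliest-record metric with a simple trend line is given below. It assumes ordinary least squares on day of year against year, which is only one of the possible "forms of regression" mentioned above. The function names and example dates are hypothetical.

```python
from datetime import date

def earliest_day_by_year(collection_dates):
    """Return {year: earliest day of year} for a list of collection dates."""
    earliest = {}
    for d in collection_dates:
        day_of_year = d.timetuple().tm_yday
        if d.year not in earliest or day_of_year < earliest[d.year]:
            earliest[d.year] = day_of_year
    return earliest

def ols_slope_intercept(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def emergence_trend(collection_dates):
    """Slope (days per year) and intercept of earliest record against year.

    A negative slope means the earliest records are getting earlier.
    """
    earliest = earliest_day_by_year(collection_dates)
    years = sorted(earliest)
    return ols_slope_intercept(years, [earliest[y] for y in years])

# Hypothetical collection dates of adults of one species.
example_dates = [
    date(1990, 6, 20), date(1990, 7, 1),
    date(2000, 6, 10), date(2000, 6, 25),
    date(2010, 5, 31), date(2010, 6, 15),
]

slope, intercept = emergence_trend(example_dates)
print(slope)
```

The earliest record in a year is sensitive to how much collecting was done that year, so the median date may be a more robust metric where records are plentiful. The same fit could be run against spring temperature instead of year if temperature data can be matched to localities.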

This could be attempted with data aggregated from many species, or we could select species with large numbers of records.

I had looked into this with somebody here at UD (whom I’ll talk to shortly; he is out today).  We found some promising but inconclusive results and had planned on using them as leverage for a grant, but never did.

This idea is along the lines of Katja’s suggestion with aphid males, but I am guessing that there will be many more treehopper records.

As for collaborators – I think we should invite Matt Wallace to help with the treehopper side of things.  I am out of my depth on the climate side of things.


5. Trials and tribulations of adding the “tri” in tritrophic data: mining parasitoid information (Heraty/Yoder/Seltmann)

Museum records for reared material of insect parasitoids are sparse, and the accompanying host records are often vague, using a multitude of terms that refer ambiguously to either the host insect or the host plant. We will explore superimposing these data on known host associations in an attempt to map host associations, track invasive hosts, and explore potential distributions based on host information. Notably, some of the most accurate host information is associated with invasive pest species, recorded as part of directed research projects.