Linking genomic data to spatial biodiversity data in the Atlas of Living Australia

ANU: Caroline Chong and Justin Borevitz, Division of Plant Sciences; Lindell Bromham, Division of Evolution, Ecology & Genetics
CSIRO: Rebecca Pirzl and John La Salle, Atlas of Living Australia

Australia currently lacks a standardised way to archive, share and access biodiversity data for its biota that links biogeographic and environmental records to biological sequence data.

This capacity is needed to facilitate researcher and public access to comprehensive biodiversity knowledge nationally, and will help biodiversity researchers to visualise and explore biological data in an efficient and integrated space.

Developing a molecular data resource to link genomic data to the Atlas of Living Australia will enable researchers to readily access population- and phylo-genomic data together with spatial and environmental distribution information across biological scales, promoting new biological discoveries through comparative data analysis.

We propose to develop a spatial genomic data resource that links new next-generation sequence data for Australia’s biota to the Atlas of Living Australia and makes these data openly accessible.

This resource will help users to readily access, visualise and explore the available biodiversity data for their taxon sets and spatial regions of interest. It also aims to enhance research planning by making it easier to compare genetic, environmental and spatial distributions and signals of biological diversity.

The envisaged utility of this spatial genomic annotation pipeline is not limited to investigation of a particular species or taxon group, or terrestrial or aquatic system, but can be applied to help jointly explore and understand the distributions of molecular, spatial and environmental occurrence and variation in Australia’s flora and fauna. For example, analysing spatial genomic data can potentially help us to understand cryptic patterns of species distributions in changing environments, as well as helping to resolve molecular taxonomies.

We have recently produced large genotyping-by-sequencing data sets for Australian Pelargonium, Eucalyptus and Brachypodium, among others. These raw data comprise short-read sequences from hundreds to thousands of samples, together with associated SNPs and population and molecular taxonomic data.

This new information on population and species diversity can now be analysed for unique distributions across geographic and environmental space. Additional case study data, including taxon sets of conservation and evolutionary research priority, can be added automatically as they are generated or released by the researchers.

Our specific objectives are to:

  • Develop a genomic data resource to link molecular biological diversity data to the Atlas of Living Australia. The proposed resource will comprise a standardised framework for researchers to access next-generation biological sequence raw and annotated data generated from novel research. This genomic resource will ultimately enable all users to readily access and explore these data in conjunction with spatial and environmental taxon records using the Atlas interface.
  • As the test case study, validate and contribute to the data pipeline by making recently generated genomic data (Pelargonium, Eucalyptus, Brachypodium) available to the biodiversity research community.
  • In collaboration with the ALA, identify the information management and user requirements for ALA user access to the genomic pipeline, including the ability to: view the data integration status for a given taxon record; subset all available biodiversity records for taxon groups of specific interest; and access associated publications. This will enable users to more rapidly detect research information gaps and priorities for future research planning (i.e. where more genetic, spatial, and/or environmental data are needed to inform analyses, inferences and management).

Methods
1) With ANU GDU bioinformatics support, and using recently generated NGS data on Australian plants as the exemplar, build the sequence data pipeline, including the capacity to:

  • Establish a standardised set of supported read file formats (FASTQ)
  • Readily access the sequence read archive and associated metadata from the ALA site
  • Contribute new genomic data sets to the research community
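
As an illustration of what accepting a standardised read format could involve, the following is a minimal sketch of a FASTQ check, assuming the common four-line record layout (header starting with "@", sequence, separator starting with "+", quality string of the same length as the sequence). The function name parse_fastq is hypothetical and is not part of any existing ALA or ANU pipeline.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream.

    Hypothetical helper for illustration only; it assumes unwrapped records
    (one sequence line and one quality line per read).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("record header must start with '@'")
        if not plus.startswith("+"):
            raise ValueError("separator line must start with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), seq, qual


# Usage with an in-memory example (placeholder reads, not real data)
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
records = list(parse_fastq(example))
```

A check of this kind could run at submission time, so that malformed files are rejected before they reach the archive.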

2) Work with ANU and CSIRO biodiversity researchers to identify priority genomic data records to integrate, and diverse contributor requirements. 

3) Engage with ALA staff to discuss development of the ALA user interface to:

  • Access the sequence read archive
  • Enable users to visualise and interactively explore spatial occurrence, associated distribution records and genetic distribution data jointly
  • Generate and enable user access to summary reports on data availability and data integration status (e.g. raw, annotated, population or phylogenomic data types; associated publications)
  • Make links to annotated population and phylogenomic data available to users on the ALA
  • Allow searches of samples in the ALA database that contain specific genetic data.
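
To make the search and summary-report ideas above concrete, here is a minimal sketch of one way a sample record could tie genetic data types to a spatial occurrence. Every name in it (SampleRecord, samples_with_data, integration_summary, the field names and the data-type labels) is an assumption made for illustration, not an existing ALA schema or API, and the sample identifiers and coordinates are placeholders.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical link between one sequenced sample and its occurrence record."""
    sample_id: str
    taxon: str            # e.g. "Eucalyptus sp."
    latitude: float
    longitude: float
    data_types: set = field(default_factory=set)      # e.g. {"raw", "annotated"}
    publications: list = field(default_factory=list)


def samples_with_data(records, genus, required_type):
    """Return the records of a genus that carry a given genetic data type."""
    return [r for r in records
            if r.taxon.split()[0] == genus and required_type in r.data_types]


def integration_summary(records):
    """Count, per genus, how many samples have each data type."""
    summary = defaultdict(Counter)
    for r in records:
        summary[r.taxon.split()[0]].update(r.data_types)
    return {genus: dict(counts) for genus, counts in summary.items()}


# Placeholder records (identifiers and coordinates are invented for the example)
records = [
    SampleRecord("S1", "Eucalyptus sp.", -35.0, 149.0, {"raw", "annotated"}),
    SampleRecord("S2", "Eucalyptus sp.", -35.5, 148.5, {"raw"}),
    SampleRecord("S3", "Pelargonium sp.", -33.0, 151.0, {"raw", "population"}),
]
annotated = samples_with_data(records, "Eucalyptus", "annotated")
summary = integration_summary(records)
```

Keeping the data types as a set on each record is one simple design choice: the same structure then serves both the per-sample search and the per-taxon integration status report.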

Outcomes

  • The overall benefit is enhanced public accessibility and archiving of Australia’s biodiversity knowledge. Ultimately we aim to create a genetic-associated data tab on the ALA site to allow user-friendly and flexible investigations of annotated population- and phylo-genomic data.
  • Linking this genomic data archive to Australia’s biodiversity records will facilitate future data sharing, curation and research practices. In particular, we aim to augment biodiversity data visualisation and usage by enabling joint investigation of genomic and biodiversity distributions of variation across space.
  • Tracking the spread of invasive species, or the potentially adaptive diversity within rare and endangered species, requires easy access to both genetic and spatial data. This work will link the two disciplines, accelerating both basic research and on-the-ground conservation efforts.
  • Users will be able to access, visualise and explore the integrated species biological data available, including information on genetic and environmental variation, in geographic space. This will better enable researchers and decision-makers to detect existing biological knowledge gaps, and to focus experimental design and research priorities.
  • Improved access to genomic data associated with environmental data may also assist researchers to interpret historical to future patterns of species movement and evolutionary responses to changing climates. 

Updated: 29 July 2017