thailand workshop logo2

Preface

Thank you for attending this genomics workshop series.

This booklet, with the lecture material, wet lab protocols, exercises and tips will guide you in becoming an expert in analysing the type of genomic datasets we are presenting here.

You can always get the most up-to-date versions of our courses from our from our website. This particular course is available from here as an HTML document or here as a PDF document.

Intended audience:

This workshop series is for those who care what genomic data is, how they originate, and what they may mean. It is for the next current generation of molecular biomedical scientists, molecular ecologists, and biocurators. It is for research students, their supervisors and scientists who want to expand their skills. Importantly, we do not seek to train the next bioinformatician, rather we provide essential skills to teach you how to learn in this rapidly evolving field.[1]

Why we wrote this course:

Bioinformatics is a big word, it can identify a variety of topics and mean different things to different people. It is not what we (bioinformaticians) call a controlled vocabulary: a term that is expertly defined so that when used everyone agrees on its meaning. This is a problem that we have to face every time we provide training to non-expert bioinformaticians; often there is a disconnect between what the audience expects and the trainer intends to deliver. Likewise, it has been challenging producing a body of work that can seek to both educate nascent scientists and manage their expectations.

The need for this work has come about from surveying students and academics. Genomics is rapidly evolving so we have endeavoured to write something that could be readily updated as new techniques reach us.

Slide2
Slide3
Acknowledgements

This work would not have been possible if certain people did not freely contribute their material, ideas and help:

  • Konrad Paszkiewicz (Exeter) has been with us from day one, when we did our first workshops back in 2010 and his lectures have been enthralling students for years.

  • Brian Haas (MIT) has been paving the open-access way in improving genome bioinformatics research and training from the early days of genomic research.

  • Dave Clements is the person who first brought us into this path and he has been leading the GMOD and Galaxy training for more than a decade now.

  • Frederico Lauro (Singapore) and Robert Edwards (San Diego State University) for contributing material for our metagenomics presentations.

  • Then there are countless of supporters throughout the years who have contributed ideas, critique and insights on how we can empower the next generation of molecular scientists. This workshop is unlikely to have existed without them and we would have been poorer scientists ourselves.

Design credits:
  • The RNA-Seq workshop was designed by Kay Anantanawat and Alexie Papanicolaou

  • The metagenomics workshop was designed by Kay Anantanawat, Thomas Jeffries, Robert Mueller, and Alexie Papanicolaou

  • Our logo was designed by Laura Castañeda Gómez of SciViz

Your feedback:

Please let Alexie know in person or email and - at the very least - we’ll add you to the acknowledgments or if you go the extra mile and wish to contribute, we can invite you as a co-author :-)

Conventions:

At times, entries like this will appear:

they offer tips
suggestions you should consider
important warnings you must not ignore
or just a friendly note as an FYI
a question for you that you need to think about before proceeding

Further, at times you may be asked to run something on a Linux computer (e.g. a server or your laptop). In that case, the instructions will appear as so:

$ some kind of command

or as

some kind of command

or as

1 | some kind of command
2 | more commands
3 | even more commands

If the numbers, vertical line/pipe sign (|), or dollar sign ($) are present, don’t type them. It is just away for us to show you that what you’re typing is in the Linux command line (a language called BASH).

License

The field is completely dependent on the good will of many people. The best way to ensure we continue the rapid advancement is if we share open and freely (free as in free beer and free speech). Hence the following blurb:

Copyleft CC BY-SA This work is 'copyleft' licensed to you by Dr Alexie Papanicolaou et al (2017-2018) under the Creative Commons 4.0 Share-Alike Attribution (Australia) license. You should not have paid money for receiving just this document. You should read the legal document but basically: First, there is no warranty given or implied for any purpose. Second, you can remix, tweak, and build upon this work even for commercial purposes, as long as we are given appropriate credit, you state which parts where changed, and this text is included unchanged. Your new work must be licensed under these identical terms with no additional restrictions. Third, this license is still limited by any publicity, privacy, or moral rights and restrictions that may be applicable in your region.

All product names, logos, and brands are property of their respective owners. All company, brand, product and service names used here are for identification purposes only and their use does not imply endorsement by the authors or their employers. All views and opinions are those of their respective authors and not of their employers.

About the authors

kay Kay Jutamat Anantanawat (PhD) is a molecular biologist with both a biomedical and an agricultural background with research interests in molecular biology, immunology, genetics, epigenetics, and bioinformatics. She is currently leading the microbial biodiversity sequencing of the Next Generation Sequencing Facility at Western Sydney University. Kay completed her Bachelor and Honours degree in biomedical science (majoring in microbiology and immunology) from the University of Adelaide. She then joined the group at Waite Research Institute (University of Adelaide), to complete a PhD in molecular entomology. Her PhD research was on the molecular mechanism of inducible immunity and tolerance against pesticides in Helicoverpa armigera. After her PhD, she joined Murdoch University to undertake research on the effects of lethal stresses on the cellular biology of Tephritid fruit flies. Kay has been extensively involved in creating and analysing Next Generation Sequencing data, especially RNA-Seq data. Her main expertise areas are in creating NGS libraries including whole genome shotgun sequencing, 16S/ITS amplicon NGS and RNA sequencing using commercial kits or custom methods she co-developed. She also has a broad experience working with sequencing instruments including 454 Roche sequencing, Illumina MiSeq and HiSeq2500.

Twitter: @kanantanawat


aaron Aaron Darling (PhD) is an Associate Professor in Computational Genomics and Bioinformatics in the University of Technology Sydney. He is core member of the Faculty of Science’s ithree institute, a research institute focused on infectious disease. Darling has nearly two decades of experience developing computational methods for comparative genomics and evolutionary modelling, having previously worked at the University of California-Davis and the University of Wisconsin-Madison, USA. Aaron’s research focuses on computational metagenomics, specifically on statistical bioinformatics methods to understand the structure, function, and evolution of natural microbial communities. Through a broad network of collaborators, he has worked to characterise the microbiomes of humans, animals, and plants. His contributions to computational genomics include Mauve, the software for multiple genome alignment and comparison, A5-miseq, an integrated pipeline for high quality draft genome assembly, and the BEAGLE phylogenetics library, which has been integrated to the widely used BEAST & MrBayes phylogenetic analysis packages. Aaron’s current interests lie in developing metagenomic approaches to analysing genomes from naturally occurring microbial communities. He has contributed to several methods for microbiome and metagenome analysis, including the PhyloSift software for phylogenetic analysis of metagenomes and the DESMAN software for reconstructing strain genomes from metagenomic time-series and spatial transect samples.

Twitter: @koadman


thomas Thomas Jeffries (PhD) is a Lecturer / Assistant Professor, a microbial ecologist with a wide expertise in microbiology, microbial biogeography, and ecological genomics. His research seeks to determine the environmental factors and ecological principles which drive patterns in microbial diversity in habitats spanning the ocean, dryland soils, contaminated sites and the human microbiome. He completed his PhD in 2011 at Flinders University where he studied the influence of salinity in structuring microbial communities using metagenomics. Then as a postdoctoral fellow at the University of Technology Sydney, he studied coastal marine microbiomes and the response of microbes to local oceanographic patterns and contamination inputs in Sydney Harbour. Since 2014, he has been a post-doctoral fellow and later Lecturer at the Western Sydney University where he studies the biodiversity of soils, agriculture, and aquatic environments. He is a member of the INDIGOV Expedition, a citizen science project using ocean yachts to sample the world’s ocean microbiomes (INDIGOV website). Additionally, Tom co-ordinates and teaches microbiology and genomics training workshops throughtout the world. He has published over 35 peer reviewed articles, is the chair of Joint Academic Microbiology Seminars (JAMS), a core member of the Global Centre for Land Based Innovation, and committee member of the Australian Society for Microbiology.

Twitter: @Thomas_Jeffries


robert Robert Mueller (Western Sydney University, Australia) is a PhD candidate at the Hawkesbury Institute for the Environment, Western Sydney University, a entomologist and a plant pathologist. Roberts interests are in microbial ecology, biodiversity and computational biology. His research uses next generation 16S rRNA gene and ITS2 amplicon sequencing to describe the bacterial and fungal microbiome of the only known eusocial beetle in world Austroplatypus incompertus across its entire distribution in Australia. The beetle is a fungal-farming wood-boring ambrosia beetle reliant on its fungal symbiont. This relationship is very unique in the sense that the beetles inhabit the heartwood of living Eucalyptus trees, which is a challenging habitat for their fungal partner as well as for the beetles. Robert completed his Master of Science in Plant Protection at the University of Natural Resources and Life Sciences, Vienna under the supervision of Prof. Dr. Christian Stauffer and Dr. Thomas Cech. He worked as a research assistant in forest pathology at the Austrian Research Centre for Forests (BFW) and was a visiting researcher at the Max-Planck Institute for Chemical Ecology, Jena in 2016 and the Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria in 2017.

Twitter: mueller89_r

alexie Alexie Papanicolaou (PhD) is the Editor and organiser of this workshop series and a Senior Lecturer / Assistant Professor in genome bioinformatics working on ecological and economically important species, such as the Heliconius butterflies, the megapest Heliothine moths, invasive Tephritid fruit flies, C4 grasses, eucalyptus trees, and marine species such as oysters and anemones. Before joining the Hawkesbury Institute for the Environment in 2015, Alexie was a post-doctoral fellow at the CSIRO (Australia), a research associate at the University of Exeter in Cornwall (UK), and a PhD graduate of the Max Planck Institute for Chemical Ecology, Jena (Germany). Alexie’s work aims to blend the frontiers of molecular biology and computer science to secure Australia’s agricultural and natural ecosystems in the face of a changing climate. His main research interest is developing bioinformatics capability to address evolutionary questions such as how organisms adapt to a changing or novel environment. His main teaching activities focus on a M.Sc. of Science in Data Science, empowering PhD and post-doctoral researchers to analyse their own data, and develop the soft skills needed for furthering their academic or industrial careers.

Twitter: @alpapan


Guest lecturers

Arinthip Associate Professor Arinthip Thamchaipenet is a member of Department of Genetics, Faculty of Science, Kasetsart University and the Associate Dean for Research and International Affairs. Her research work is on actinomycetes in drug discovery using genome scanning for biosynthetic gene clusters, and in plant-microbe interaction using actinomycetes to enhance plant growth, tolerate to environmental stresses and diseases. Moreover, she also interested in marine actinomycetes to combat biofilm formation of pathogenic bacteria. She uses metagenomics, genomics, transcriptomics and proteomics to explore those host-microbe and microbe-microbe relationships.

Sissades Sissades Tongsima (PhD) is currently a principal researcher and the head of biostatistics and informatics laboratory at Genome Technology Research Unit, National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand. He received his B.Eng in Industrial Instrumentation Engineering from King Mongkut’s Institute of Technology Ladkrabang (KMIT’L) and his MSCSE and PhD in Computer Science and Engineering from the University of Notre Dame, USA. His research interests include bioinformatics and computation modeling. Dr. Tongsima has published his computational biology research in many renowned journals; some of them were highlighted in the Asia-Pacific International Molecular Biology Network (AIMB-N). He has served as an executive committee of Asia Pacific Bioinformatics Network (APBioNet), and a steering member of several international consortia. He is also an associate editor of the Journal of Human Genetics (JHG), Nature Publishing Group.

Teaching assistants
  • Dr Passorn Wonnapinij (Kasetsart University, Thailand)

  • Preecha Pathumcharoenpol (Kasetsart University, Thailand)

  • Theerapol Jatuponwiphat (Kasetsart University, Thailand)

Schedule

Day 1

08:00-08:30

Registration

08:30-09:00

Welcome and Opening ceremony

09:00-09:15

Group Photo

09:15-10:00

Lecture 1: Bioinformatics and agriculture

10:00-10:15

Tea break

10:15-10:45

Lecture 2: Metagenomics in agricultural research

10:45-11:45

Lecture 3: Basic description of sequencing, assembly, genome comparison, and phylogenetics

11:45–12:45

Lunch and networking

12:45-14:15

Lecture 4: Metagenomics: Recent advances and ongoing challenges

14:45-15:00

Tea break

15:00-16:15

Lecture 5: Next generation microbial ecology

16:15-17:00

3 minute Lighting Talks from participants

Day 2

09:00-10:00

Lecture 6: Technological capabilities for amplicon metagenomics

10:00-10:30

Overview of Wet Laboratory exercise

10:30-10:45

Tea break

10:45-12:00

Lecture 7: Recap of Wet lab; Quality assessment of amplicon libraries and sequencing data

12:00-13:00

Lunch and networking

13:00-16:30

Wet Laboratory: 16S amplicon library construction

Day 3

09:00-10:00

Demonstration of Genomic Analysis of Microbial diversity

10:00-10:15

Tea break

10:15-12:00

Computer Lab 1: Building Bioinformatic skills

12:00-13:00

Lunch and networking

13:00-17:00

Computer Lab 2a: Genomic Analysis of Microbial Biodiversity

Day 4

09:00–09:45

Open Panel Q&A with Speakers

09:45-12:00

Computer Lab 2b: Genomic Analysis of Microbial Biodiversity

12:00-13:00

Lunch and Networking

13:00-13:30

Closing ceremony; Certificates; Awards for best Lighting Talk

13:30-16:00

Consultation and networking

Day One

  • 08:00-08:30: Registration

  • 08:30-09:00: Opening ceremony

  • 09:45-10:15: Lecture 2: Metagenomics in agricultural research

  • 09:00-09:45: Lecture 1: Bioinformatics and agriculture

  • 10:15-10:30: Tea break

  • 10:30-11:30: Lecture 3: Basic description of sequencing, assembly, genome comparison, and phylogenetics

  • 11:30-12:30: 3 minute Lighting Talks from participants

  • 12:30–13:30: Lunch and networking

  • 13:30-15:00: Lecture 4: Metagenomics: Recent advances and ongoing challenges

  • 15:00-15:15: Tea break

  • 15:15-16:30: Lecture 5: Next generation microbial ecology

1. Bioinformatics and agriculture

By Dr Sissades Tongsima (BIOTEC, the National Center for Genetic Engineering and Biotechnology in Thailand)

In this lecture we will introduce some of the cutting edge science being performed at the BIOTEC and explain how bioinformatics can drive 21st century agricultural practices.

Bioinformatics Agriculture Page 01
Bioinformatics Agriculture Page 02
Bioinformatics Agriculture Page 03
Bioinformatics Agriculture Page 04
Bioinformatics Agriculture Page 05
Bioinformatics Agriculture Page 06
Bioinformatics Agriculture Page 07
Bioinformatics Agriculture Page 08
Bioinformatics Agriculture Page 09
Bioinformatics Agriculture Page 10
Bioinformatics Agriculture Page 11
Bioinformatics Agriculture Page 12
Bioinformatics Agriculture Page 13
Bioinformatics Agriculture Page 14
Bioinformatics Agriculture Page 15
Bioinformatics Agriculture Page 16
Bioinformatics Agriculture Page 17
Bioinformatics Agriculture Page 18
Bioinformatics Agriculture Page 19
Bioinformatics Agriculture Page 20
Bioinformatics Agriculture Page 21
Bioinformatics Agriculture Page 22
Bioinformatics Agriculture Page 23
Bioinformatics Agriculture Page 24
Bioinformatics Agriculture Page 25
Bioinformatics Agriculture Page 26
Bioinformatics Agriculture Page 27
Bioinformatics Agriculture Page 28
Bioinformatics Agriculture Page 29
Bioinformatics Agriculture Page 30
Bioinformatics Agriculture Page 31
Bioinformatics Agriculture Page 32
Bioinformatics Agriculture Page 33
Bioinformatics Agriculture Page 34
Bioinformatics Agriculture Page 35
Bioinformatics Agriculture Page 36
Bioinformatics Agriculture Page 37
Bioinformatics Agriculture Page 38
Bioinformatics Agriculture Page 39
Bioinformatics Agriculture Page 40
Bioinformatics Agriculture Page 42
Bioinformatics Agriculture Page 43
Bioinformatics Agriculture Page 44
Bioinformatics Agriculture Page 45
Bioinformatics Agriculture Page 46
Bioinformatics Agriculture Page 47

2. Metagenomics in agricultural research

By Assoc. Prof. Arinthip Thamchaipenet, Kasetsart University

Beneficial plant microbiome for abiotic stress tolerance: a case study by endophytic actinomycetes
Plant microbiomes are untapped reservoir for a vast microbial diversity and function. Plant phenotypes are influenced by plant genotype and environmental factors (both biotic and abiotic). Plant phenotypic traits, therefore, partly influenced by plant microbiome. Beneficial microbiome including endophytes reside in plant by mutually symbiotic relationship. All of these microbes play functionally important roles in plant biology including enhancement of growth, nutrient uptake, abiotic and biotic stress tolerance, and disease resistance. In this study, endophytic actinomycetes, Gram positive filamentous bacteria, were investigated from halophytes naturally grown in Bang Krachao area, Samut Prakarn province, Thailand. Culture-dependent survey of actinomycetes from Indian Marsh Fleabane (Pluchea indica (L.) Less.) and culture-independent study by metagenomic analysis using 16S rRNA amplicons were compared. 1-amino cyclopropane-1-carboxylate (ACC) deaminase is one of plant growth promoting (PGP) traits that highly identified from actinomycetes associated with halophytes. ACC deaminase converts ACC, a precursor of ethylene in plant, to ketobutyrate and ammonia for bacteria to consume as N-source; and consequently reduces stress ethylene level. It was proved that an ACC-deaminase producing Streptomyces sp. GKU 6042 isolated from halophytes could improve salt tolerance of jasmine Thai rice KDML105.

3. Bioinformatics Primer

Basic description of sequencing, assembly, genome comparison, and phylogenetics
This introductory lecture introduced the modern era of genomics and bioinformatics: hypothesis generation, complex visualisation, big data analytics, and high throughput computations are driving exciting new science. We use examples from Dr Papanicolaou’s laboratory to showcase that any of the so-called non-model species (i.e. not Drosophila, human, mouse, Baker’s yeast, and Arabidopsis) can now utilise high end technologies and explore previously unknown phenomena.
Slide1
Slide4
Slide5
Slide6
Slide7
Slide8
Slide9
Slide10
Slide11
Slide12
Slide13
Slide14
Slide15
Slide16
Slide17
Slide18
Slide19
Slide20
Slide21
Slide22
Slide23
Slide24
Slide25
Slide26
Slide27
Slide28
Slide29
Slide30
Slide31
Slide32
Slide33
Slide34
Slide35
Slide36
Slide37
Slide38
Slide39
Slide40
Slide41
Slide42
Slide43
Slide44
Slide45
Slide46
Slide47
Slide48
Slide49
Slide50
Slide51
Slide52
Slide53
Slide54
Slide55
Slide56
Slide57
Slide58
Slide59
Slide60
Slide61
Slide62
Slide63

4. Metagenomics: Recent advances and ongoing challenges

By Assoc. Prof. Aaron Darling, University of Technology Sydney

Abstract
This lecture describes the state of the art in metagenome analysis methods for taxonomic assignment, community profiling, and metagenome assembly. Topics of discussion include the ongoing challenges in metagenome assembly and emerging solutions such as time series analysis and metagenomic-derived Hi-C.
darling kasetsart Page 01
darling kasetsart Page 02
darling kasetsart Page 03
darling kasetsart Page 04
darling kasetsart Page 05
darling kasetsart Page 06
darling kasetsart Page 07
darling kasetsart Page 08
darling kasetsart Page 09
darling kasetsart Page 10
darling kasetsart Page 11
darling kasetsart Page 12
darling kasetsart Page 13
darling kasetsart Page 14
darling kasetsart Page 15
darling kasetsart Page 16
darling kasetsart Page 17
darling kasetsart Page 18
darling kasetsart Page 19
darling kasetsart Page 20
darling kasetsart Page 21
darling kasetsart Page 22
darling kasetsart Page 23
darling kasetsart Page 24
darling kasetsart Page 25
darling kasetsart Page 26
darling kasetsart Page 27
darling kasetsart Page 28
darling kasetsart Page 29
darling kasetsart Page 30
darling kasetsart Page 31
darling kasetsart Page 32
darling kasetsart Page 33
darling kasetsart Page 34
darling kasetsart Page 35
darling kasetsart Page 36
darling kasetsart Page 37
darling kasetsart Page 38
darling kasetsart Page 39
darling kasetsart Page 40
darling kasetsart Page 41
darling kasetsart Page 42
darling kasetsart Page 43
darling kasetsart Page 44
darling kasetsart Page 45
darling kasetsart Page 46
darling kasetsart Page 47
darling kasetsart Page 48
darling kasetsart Page 49
darling kasetsart Page 50
darling kasetsart Page 51
darling kasetsart Page 52
darling kasetsart Page 53
darling kasetsart Page 54

5. Next generation Microbial Ecology

By Dr Thomas Jeffries, Western Sydney University

Abstract
In this lecture we explain how Microbial Ecology benefits from these new genomic / bioinformatic approaches. How can biodiversity be assessed and microbial interactions described? How can we better understand the function of ecosystems and microbiomes using metagenomics?
Slide2
Slide3
Slide4
Slide5
Slide6
Slide7
Slide8
Slide9
Slide10
Slide11
Slide12
Slide13
Slide14
Slide15
Slide16
Slide17
Slide18
Slide19
Slide20
Slide21
Slide22
Slide23
Slide24
Slide25
Slide26
Slide27
Slide28
Slide29
Slide30
Slide31
Slide32
Slide33
Slide34
Slide35
Slide36
Slide37
Slide38
Slide39
Slide40
Slide41
Slide42
Slide43
Slide44
Slide45
Slide46
Slide47
Slide48
Slide49
Slide50
Slide51
Slide52
Slide53
Slide54
Slide55
Slide56
Slide57
Slide58
Slide59
Slide60
Slide61
Slide72
Slide73
Slide74
Slide75

Day Two

  • 09:00-10:00: Lecture 6: Technological capabilities for amplicon metagenomics

  • 10:00-10:30: Overview of Wet Laboratory exercise

  • 10:30-10:45: Tea break

  • 10:45-12:00: Lecture 7: Recap of Wet lab; Quality assessment of amplicon libraries and sequencing data

  • 12:00-13:00: Lunch and networking

  • 13:00-16:30: Wet Laboratory: 16S amplicon library construction

6. Technological capability for amplicon NGS

Abstract
Microbial biodiversity studies the microbial species present in the community at a particular time, their possible roles, or functions of the species that can contribute to the community. The use of amplicons is one of the Next Generation Sequencing (NGS) techniques popularly used in to generate the data for these studies. Like every technique, there are important concerns that we must be aware of including the source of biases generated from this technology. This lecture introduces the bacterial and fungal amplicon NGS and the limitations of using it to capture biodiversity data. We then explore other types of sequencing that can be used to assess microbial biodiversity including PCR-free whole shotgun sequencing, RNA sequencing, and the use of single molecule sequencing such as Oxford Nanopore.
Slide1
Slide2
Slide3
Slide4
Slide5
Slide6
Slide7
Slide8
Slide9
Slide10
Slide11
Slide12
Slide13
Slide14
Slide15
Slide16
Slide17
Slide18
Slide19
Slide20
Slide21
Slide22
Slide23
Slide24
Slide25
Slide26
Slide27
Slide28
Slide29
Slide30
Slide31
Slide32
Slide33
Slide34
Slide35
Slide36
Slide37
Slide38
Slide39
Slide40
Slide41
Slide42

7. Overview of the wet laboratory

Abstract
16S amplicon sequencing is a commonly used technique to survey biodiversity. The preparation of this type of library is relatively simple compared to other NGS libraries. Based on Illumina 16S NGS Sequencing Library preparation, the libraries are generated by two PCR reactions: first, we amplify the target region, and then we attach the barcode to the amplicons. Each PCR is followed by the PCR clean-up step using SPRI beads. After barcoding, the quality of the library can be assessed using a Bioanalyzer or TapeStation. The libraries can then be pooled together with other libraries and sequenced. For this workshop, we will perform the first PCR and amplify the 16S V3-V4 region (using the 341F and 805R primers), and PCR clean-up using SPRI beads. The final product can be barcoded using Illumina Nextera XT index kits before sequencing.
Slide1
Slide2
Slide3
Slide4
Slide5
Slide6
Slide7
Slide8
Slide9

8. Quality assessment

By Dr Kay Anantanawat and Thomas Jeffries

Abstract

In this lecture we provide an overview of the important quality control steps when you are generating amplicon libraries and sequence data.

Slide1
Slide2
Slide3
Slide4
Slide5
Slide6
Slide7
Slide8
Slide9
Slide10
Slide11
Slide12
Slide13
Slide14
Slide15
Slide16
Slide17
Slide18
Slide19
Slide20
Slide21

9. Amplicon NGS libraries

Purpose
In this laboratory exercise, you will create your own amplicon NGS library.
Why produce your own libraries?

Producing your own libraries can seem daunting at first but once you’ve done a couple of times you will be able to take advantage of:

  • Reduced cost and saving money per library (and so can increase number of samples)

  • Targeting custom regions and designing your own primers

  • Processing samples that may not meet the requirements of the NGS facility

  • Taking ownership of the process and the study quality

9.1. General notes for creating libraries

Before we begin our protocol, it is important to familiarise yourself with some concepts that are critical in determining the quality of your output.

The quality and quantity of genomic DNA (gDNA)

The purity of gDNA extracted can be assessed using Nanodrop. What to look for is the ratio values of 260/230 and 260/280 which should be in the range of 1.8 ~ 2.0. If the ratio is a bit low, especially 260/230, there is a high chance that the sample is contaminated with phenol or other reagents leftover from the extraction. These reagents can act as PCR inhibitors. Therefore, if the ratio is low, perform DNA purification either using ethanol precipitation, spin column, or bead based purification.

Nanodrop can also give a rough estimation of the concentration of the sample. However, since a spectrophotometers do not differentiate between DNA and RNA sufficiently well, it usually overestimates the real value of DNA (expect something like 7-10 times more). Treatment with RNAse A during the DNA extraction (add it during the Proteinase K / lysis step) is necessary if you are going to rely on Nanodrop; it also reduces the (already low) likelihood of primer - mRNA interference. A DNA concentration lower than 30 ng/μl cannot be measured accurately using Nanodrop. If possible, use dye-based measurement, such as Qubit or PicoGreen assays, which can better estimate the amount of DNA.

When extracting DNA, try to lyse the tissues as much as possible using mechanical, temperature, or chemical means. When eluting the DNA, use the lowest volume of eluent to obtain the highest concentration of DNA possible. If the kit says 100 μl, try 50 μl. For soil DNA, you may not get satisfactory results if you try to use 100 μl for elution.

SPRI beads
  • The beads have to be left in room temperature for 30 mins.

  • SPRI beads are viscous in nature. When pipetting the beads, try to be slow so that you won’t flush the beads.

  • When adding the beads to the DNA sample, the ratio between the volume of the beads to the volume of DNA sample can affect the size of DNA in the final product. The fewer beads used (i.e. lower volume or concentration of solution), the bigger size of DNA will be in the final product. In this workshop, we use 0.9x beads to get rid of primers dimers which are below 200 bp in size.

We are working with very small volumes, if there is an error while pipetting, this ratio can change very easily.
If you are not confident about the pipetting, add water into the DNA mixture to increase the volume and increase the volume of the beads used. This will allow more room for pipetting error.

9.2. Library preparation and primer design

Basic workflow

The basic protocol is similar for the different PCR primers / markers. The following is based on the Illumina 16S NGS library preparation but you can optimise it for different primer pairs.

pic1
Figure 1. The workflow of Illumina 16S NGS Library Preparation through two PCRs.

The primers used for this workshop is 341F and 805R, targeting the V3-4 region of the 16S gene. The primer sequence contains the Illumina overhang adapter sequences added to the locus-specific sequence. The overhang adapter allows the amplicon produced in the samples to be indexed using the Nextera XT Index kits. The course will cover amplicon PCR and PCR clean up as the indexing is a straightforward protocol for Illumina.

16S Amplicon PCR Forward Primer 341F
5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG 3'
16S Amplicon PCR Reverse Primer 805R
5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC 3'
The overhang adapter (underlined in the sequences), can be added to other locus-specific primers to amplify other amplicons.
Table 1. 16S Markers
Primer pair 16S region

27F / 519R

V1 + V2 + V3

341F / 805R

V3 + V4

515F / 806R

V3 + V4

530F / 926R

V4 + V5

926F / 1394R

V6 + V7 + V8

pic2
Figure 2. Regions of 16S rRNA available for PCR primer design.
Table 2. ITS Markers
Primer pair ITS region

fITS7 / ITS4

ITS2

ITS1f / ITS2

ITS1

pic3
Figure 3. Regions of ITS region available for PCR primer / marker design (Tedersoo et al., 2015)

9.3. Methodology

Materials and equipment:

  • Pipette

  • Thermocycler

  • PCR tubes

  • Nuclease free water

The extracted DNA should meet these quality benchmarks:

  • The concentration measured with Nanodrop should be at least 15 ng/μl

  • 260/230 value should be in the range of 1.8 ~ 2.0

  • 260/280 value should be in the range of 1.8 ~ 2.0

  • Hi-Fidelity Taq polymerase

In this course, we use Phusion Hi-Fidelity polymerase from NEB (M0530).

9.3.1. Set up the PCR reaction.

Table 3. Generic PCR recipe

Item / Mix

1x (10 μl)

1x (20 μl)

5x Buffer HF

2

4

10 mM dNTPs

0.2

0.4

10 μM Forward primer

0.5

1

10 μM Reverse primer

0.5

1

NEB Phusion Hi-Fidelity Polymerase

0.2

0.4

Nuclease free water

5.6

11.2

Template (< 250 ng)

1

2

In every PCR always include a negative control: an additional tube with water instead of DNA. A positive control is good practice but not required here.

Set up the PCR in the Thermocycler using following PCR conditions:

  • Set lid at 105 °C

  • 95°C – 30 seconds

  • 35 cycles of:

    • 98°C – 5 seconds

    • 55°C – 10 seconds

    • 72°C – 30 seconds

  • 72°C – 5 minutes

  • 4°C - ∞

9.3.2. Amplicon PCR bead clean up

Beads must be left in room temperature for at least 30 mins before use.
  1. Add 10 μl of water into the PCR reaction to increase the volume to 20 μl.

  2. Add 18 μl of beads into the reaction (0.9x). Mix the mixture well using pipettes and vortex.

  3. Leave the beads in the room temperature for 8 minutes, gentle mixing if possible.

  4. Put the tube onto the magnetic stand, and wait for 5 minutes or until the beads fully separate from the mixture.

  5. Carefully remove the supernatant.

  6. Add 200 μl of 80% EtOH. Wait for 10 seconds. Then discard the ethanol. Repeat this step twice.

  7. Leave the beads to air-dry. This step should take about 5 to 10 minutes. Be careful not to over-dry the beads since this will lead to the loss of DNA. The best way to test whether the beads are dry is to rotate the tube around the magnetic stand. If the beads move, then it still has a bit of ethanol left. The best stage is when you see a little bit of a crack.

  8. Add 20 μl of nuclease-free water into the tube. Mix well.

  9. Let the tube stand in the room temperature for 5 minutes. Mix if possible.

  10. Put the tube back onto the magnetic stand, and wait for 5 minutes or until the beads are fully separated from the mixture.

  11. Collect the supernatant containing the purified amplicons. This product can be kept at -20°C.

9.3.3. Assess the quality

To assess whether amplicon has been successfully created, run 5 μl of samples in 1 % agarose gel for electrophoresis, or use a Tapestation or BioAnalyzer instrument. The benefit for using instruments is that only 1 μl of the sample is required, whereas in the gel electrophoresis, more volume of sample is needed. What you expect is the one strong band at around 550 bp, and no band in the water negative control. If there is a band below 200 bp, then you have primer dimer carrying over. This can be eliminated by an additional round of bead clean-up.

It is most efficient to removed any primer dimer before continuing with the indexing PCR, even though you can also do it after the index PCR.

9.3.4. Troubleshooting

No PCR product

DNA concentration might be too low, or there is just no organismal DNA that matches your primers in the extraction (which is unlikely). Include a positive control in the PCR to ensure that the PCR works. The samples might contain PCR inhibitors. Bead based purification or ethanol precipitation can be performed on the extracted DNA sample before PCR.

Multiple bands, but no bands in water

That is OK and it is correct!

Difficult samples

These samples include biofilms or endogenous bacteria which contain a high amount of host DNA. Biofilms can be processed with kits made for soil bacteria (the PowerSoil kit is commonly used). There are many kits available to help enriching microbial DNA including QIAamp DNA Microbiome Kit (QAIGEN); NEBNExt® Microbiome DNA Enrichment Kit (NEB).

Different Taq have different PCR efficiencies in different types of sample. If one doesn’t work, another might work better. It is worth investing in optimising your experiments for your organism (if it is not standard practice in the literature).
This is not the only way of making the NGS library!

9.4. References

  • Tedersoo L, Anslan S, Bahram M, Põlme S, Riit T, Liiv I, Kõljalg U, Kisand V, Nilsson RH, Hildebrand F, Bork P, Abarenkov K (2015) Shotgun metagenomes and multiple primer pair/barcode combinations of amplicons reveal biases in metabarcoding analyses of fungi. MycoKeys 10: 1-43. (link)

  • 16S NGS Sequencing Library Preparation (15044223 B): Preparing 16S Ribosomal RNA Gene Amplicons for the Illumina MiSeq System. (link)

Day Three

  • 09:00-10:00: Demonstration of Amplicon Analysis of Microbial diversity

  • 10:00-10:15: Tea break

  • 10:15-12:00: Computer Lab 1: Command Line Bioinformatic skills; High Throughput Computing using the Galaxy Graphical User Interface

  • 12:00-13:00: Lunch and networking

  • 13:00-17:00: Computer Lab 2: Phylogenetic Analysis of Microbial Biodiversity

10. Demonstration of Amplicon Analysis using QIIME2

By Dr Thomas Jeffries, Western Sydney University

Abstract
In this lecture we explain how the workflow of the tool QIIME 2. This tool can help us analyse bioinformatically microbial biodiversity datasets.
Slide2
Slide3
Slide4
Slide5
Slide6
Slide7
Slide8
Slide9
Slide10
Slide11
Slide12
Slide13
Slide14
Slide15
Slide16
Slide17
Slide18
Slide19
Slide20
Slide21
Slide22
Slide23
Slide24
Slide25
Slide26
Slide27
Slide28
Slide29
Slide30
Slide31
Slide32

11. Hands-on bioinformatic skills

Using the Command line and High Throughput Computing with the Galaxy Graphical User Interface
In this lecture we will introduce you to the skills needed to perform high throughput computations, as required for modern genomics. Some of the skills you will use in the dry laboratory and other you will find useful elsewhere.

12. Analysis of 16S rDNA amplicons

Purpose
This tutorial will guide you through basics of analysing amplicon NGS data using QIIME and Jupyter notebooks.

This tutorial is meant to be run from a Jupyter Hub installation at the Hawkesbury Institute for Environment. It was originally developed for the Masters in Data Science tutorial written by Robert Muller and Alexie Papanicolaou. It has been expanded for this workshop by Thomas Jeffries and Alexie Papanicolaou. Workflows are based on the QIIME2 resources.

QIIME™ (canonically pronounced chime)

Quantitative Insights Into Microbial Ecology

QIIME is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data published here and available from here. QIIME is designed to take users from raw sequencing data generated on the Illumina or other platforms through publication quality graphics and statistics. This includes demultiplexing and quality filtering, OTU picking, taxonomic assignment, and phylogenetic reconstruction, and diversity analyses and visualizations. QIIME has been applied to studies based on billions of sequences from tens of thousands of samples.

QIIME is not the only software to analyse such data (see our lectures). The second version of QIIME (QIIME 2) has, however, a number of improvements that make the analysis more robust when compared to the previous version. It also has a standardised file format (called an artefact which means the results of a study, but the QIIME program misspells in their documentation as artifact which means error) with the suffix .qza. It also has a second output format that you can use for human viewing and analysis with the suffix .qzv.

If you are running this through the Jupyter notebook, we advise you that you make a backup of this original notebook. Go to File → Make a Copy . You can then rename it via File → Rename. We recommend you click the Not Trusted button at the top right to make it Trusted. save_jupyter
In Jupyter you are executing actual commands on the server when you are writing in a Code block. And you are processing actual High Throughput Next Generation Sequencing data, this may take some time for some commands. If the Code block has a star (\*) on the left, then this means it is still running the command. running_jupyter

12.1. The input data

(Or "How to Deal with millions of sequence reads")

After you submitted your DNA for amplicon sequencing, you will get files back with the ending R1 or R1 and R2 like this:

RM1_S140_L001_R1_001.fastq.gz
RM1_S140_L001_R2_001.fastq.gz
These two files are paired. What do we mean by paired sequence files?

The first thing you should do is uncompress all files and maybe have a quick look how they look. Well let’s do that:

First, we have to navigate to the folder containing the data using cd (which stands for change directory)

cd Rawfiles

If the files are already uncompressed (.fastq instead of .fastq.gz), they you don’t need to uncompress. Otherwise just use

gunzip *gz

to unzip all files within a folder in a UNIX environment.

Now we want to display the head (the first 5 lines of the file).

Don’t open the whole file since it might be too big, because it could contain millions of lines
head RM1_S140_L001_R1_001.fastq
head RM1_S140_L001_R2_001.fastq

These sequences contain both the sequences themselves, and quality information. To know what all that means have look at Wikipedia. To navigate out of the folder, you can use:

cd ..

12.2. Initiating QIIME on our server

Now we can start using Qiime2 and start read in our Data, but first we have to activate the Environment

If you are interested in installing the software locally, see the website
module load anaconda/3.6_local
source activate qiime2-2018.2

This should return (qiime2-2018.2) if everything is working ok.

12.3. Reading data

Let’s start and look what Qiime2 gives us

qiime --help
    Usage: qiime [OPTIONS] COMMAND [ARGS]...
      QIIME 2 command-line interface (q2cli)
      -------
      To get help with QIIME 2, visit https://qiime2.org.
      To enable tab completion in Bash, run the following command or add it to
      your .bashrc/.bash_profile:
          source tab-qiime
      To enable tab completion in ZSH, run the following commands or add them to
      your .zshrc:
          autoload bashcompinit && bashcompinit && source tab-qiime
    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.
    Commands:
      info                Display information about current deployment.
      tools               Tools for working with QIIME 2 files.
      dev                 Utilities for developers and advanced users.
      alignment           Plugin for generating and manipulating alignments.
      composition         Plugin for compositional data analysis.
      cutadapt            Plugin for removing adapter sequences, primers, and
                          other unwanted sequence from sequence data.
      dada2               Plugin for sequence quality control with DADA2.
      deblur              Plugin for sequence quality control with Deblur.
      demux               Plugin for demultiplexing & viewing sequence quality.
      diversity           Plugin for exploring community diversity.
      emperor             Plugin for ordination plotting with Emperor.
      feature-classifier  Plugin for taxonomic classification.
      feature-table       Plugin for working with sample by feature tables.
      gneiss              Plugin for building compositional models.
      longitudinal        Plugin for paired sample and time series analyses.
      metadata            Plugin for working with Metadata.
      phylogeny           Plugin for generating and manipulating phylogenies.
      quality-control     Plugin for quality control of feature and sequence data.
      quality-filter      Plugin for PHRED-based filtering and trimming.
      sample-classifier   Plugin for machine learning prediction of sample
                          metadata.
      taxa                Plugin for working with feature taxonomy annotations.
      vsearch             Plugin for clustering and dereplicating with vsearch.

The above list are all the plugins which come with Qiime2. It lets you handle the different output files and manipulate them. If you want to know more just go on the Qiime2 website and have a look at the different plugins and their functionality.

12.4. Tutorial One

The Moving Picture Tutorial

For this tutorial we will use the Moving Picture Tutorial from the Qiime2 webpage and the following notebook is adapted to it. On that website there is some further interpretation of the results. The tutorial is reproducing some of the research shown in this article

Let’s see what is in this directory. For starting your analysis you need, of course, the Raw Sequencing data which is stored in the emp-single-end-sequences file. Furthermore, you need a Metadata file which contains the SampleID and all the measurements you did. In our case its saved as sample-metadata.tsv. You need a taxonomic classifier for assigning taxa to your sequences, here gg-13-8-99-515-806-nb-classifier.qza.

ls *qza

Let’s see what other kind of files we have:

ls
The folder precomputed contains the ready to go output files to double check and in case somebody could not get the script to work.

First we will create a new folder and name it output. In this folder we will store all our output files we will produce with Qiime2.

mkdir -p output

Let’s check if we created the directory!

ls

Let’s start reading in our data

Qiime2 has different tools to let us handle files. The following command shows us what tools we have:

qiime tools --help
  Usage: qiime tools [OPTIONS] COMMAND [ARGS]...
      Tools for working with QIIME 2 files.
    Options:
      --help  Show this message and exit.
    Commands:
      export            Export data from a QIIME 2 Artifact or Visualization.
      extract           Extract a QIIME 2 Artifact or Visualization archive.
      import            Import data into a new QIIME 2 Artifact.
      inspect-metadata  Inpect columns available in metadata.
      peek              Take a peek at a QIIME 2 Artifact or Visualization.
      validate          Validate data in a QIIME 2 Artifact.
      view              View a QIIME 2 Visualization.

We want to import our raw sequencing files. Let’s see what import gives us.

qiime tools import --help
    Usage: qiime tools import [OPTIONS]
      Import data to create a new QIIME 2 Artifact. See https://docs.qiime2.org/
      for usage examples and details on the file types and associated semantic
      types that can be imported.
    Options:
      --type TEXT                The semantic type of the artifact that will be
                                 created upon importing. Use --show-importable-
                                 types to see what importable semantic types are
                                 available in the current deployment. [required]
      --input-path PATH          Path to file or directory that should be
                                 imported. [required]
      --output-path PATH         Path where output artifact should be written.
                                 [required]
      --source-format TEXT       The format of the data to be imported. If not
                                 provided, data must be in the format expected by
                                 the semantic type provided via --type.
      --show-importable-types    Show the semantic types that can be supplied to
                                 --type to import data into an artifact.
      --show-importable-formats  Show formats that can be supplied to --source-
                                 format to import data into an artifact.
      --help                     Show this message and exit.

First we need the type of our data. Let’s see what types we can import.

qiime tools import --show-importable-types
    DeblurStats
    DistanceMatrix
    EMPPairedEndSequences
    EMPSingleEndSequences
    FeatureData[AlignedSequence]
    FeatureData[PairedEndSequence]
    FeatureData[Sequence]
    FeatureData[Taxonomy]
    FeatureTable[Balance]
    FeatureTable[Composition]
    FeatureTable[Frequency]
    FeatureTable[PresenceAbsence]
    FeatureTable[RelativeFrequency]
    Hierarchy
    MultiplexedPairedEndBarcodeInSequence
    MultiplexedSingleEndBarcodeInSequence
    PCoAResults
    Phylogeny[Rooted]
    Phylogeny[Unrooted]
    QualityFilterStats
    RawSequences
    SampleData[AlphaDiversity]
    SampleData[BooleanSeries]
    SampleData[FirstDifferences]
    SampleData[JoinedSequencesWithQuality]
    SampleData[PairedEndSequencesWithQuality]
    SampleData[SequencesWithQuality]
    SampleData[Sequences]
    TaxonomicClassifier
    UchimeStats

Our data are EMPSingleEndSequences, but normally you have other data depending on your sequencing centre. The common datatype we deal with is PairedEndSequencesWithQuality, but we will come to that later. Let’s import our data. We have to also give the command an input path and an output path.

qiime tools import --type EMPSingleEndSequences \
--input-path emp-single-end-sequences \
--output-path ./output/emp-single-end-sequences.qza
The backslash (\) character allows us to press return (Enter key) without executing the command. That allows us to write our commands in multiple lines which is more readable.

If no error message then it worked. Great!

Differences with Paired-End sequences

In the example above we created a qiime2-artifact using Single-End sequences. However, many researches want to have a longer fragment and therefore are using paired-end sequencing like for example the Illumina Miseq 2x300bp paired-end libraries. We will use an example of this later in the tutorial using the command qiime tools import --type EMPPairedEndSequences However, often your data will come already demultiplexed or not in the "EMP" format in which case we can use the command qiime tools import --type 'SampleData[SequencesWithQuality]'.

For using this option we need a so called manifest file which looks like the following:

head phred64_Bacteria.csv
    sample-id,absolute-filepath,direction
    RM48,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM48_S48_L001_R1_001.fastq,forward
    RM118,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM118_S118_L001_R1_001.fastq,forward
    RM3,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM3_S3_L001_R1_001.fastq,forward
    RM74,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM74_S74_L001_R1_001.fastq,forward
    RM156,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM156_S156_L001_R2_001.fastq,reverse
    RM55,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM55_S55_L001_R1_001.fastq,forward
    RM30,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM30_S30_L001_R2_001.fastq,reverse
    RM82,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM82_S82_L001_R2_001.fastq,reverse
    RM53,/mnt/nfs/home/90928911/my_data_store/Bacteria2017/RM53_S53_L001_R1_001.fastq,forward

The manifest file has three columns, containing the sample-id, the full path of your file and the direction (forward is R1 and reverse is R2).

You probably don’t want to use Excel for creating such a file if you have hundreds of files. Therefore, the best way is to write a little script in any language you want, to create such a manifest file, like this one we wrote in Python (don’t run this on Jupyter):

import os
import shutil
import csv

data_dir = "#Here comes the path were your files are"  # put the directory in there were you stored your fastq files
with open (r'phred64.csv',"w") as f:                   # the name of your file
    writer = csv.writer(f)
    writer.writerow(["sample-id","absolute-filepath","direction"])

    x = os.listdir(data_dir)
    for filename in x:
        if filename.startswith('RM'):                  # the beginning of your fastq files eg. "RM"
            myarray = filename.split('_')
            sample_nb = myarray[0]
            if myarray[3] == "R1":
                direction="forward"
            else:
                direction="reverse"
            filepath = (os.path.join(data_dir, filename))
            writer.writerow([sample_nb, filepath, direction])

The next step is to read in your paired-end data into qiime2 something like this: (NB: If you run this now, it will produce an error since we didn’t write the correct input)

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path phred64_Bacteria.csv \
--output-path paired-end-demux_bacteria.qza \
--source-format PairedEndFastqManifestPhred33 \
Make sure that you know the file type and what source-format the sequencing facility uses. For further information go to the qiime2 page.

12.4.1. Demultiplexing

Explanation

Illumina uses a Multiplex sequencing technique which allows large numbers of libraries to be pooled and sequenced simultaneously during a single run on a high-throughput instrument. Sample multiplexing is useful when targeting specific genomic regions or working with smaller genomes.

Individual "barcode" sequences are added to each DNA fragment during next-generation sequencing (NGS) library preparation so that each read can be identified and sorted before the final data analysis. Pooling samples exponentially increases the number of samples analyzed in a single run, without drastically increasing cost or time.

So now we need to demultiplex using the barcodes in our metadata file.

This might take a few minutes.

qiime demux emp-single \
  --i-seqs ./output/emp-single-end-sequences.qza \
  --m-barcodes-file sample-metadata.tsv \
  --m-barcodes-column BarcodeSequence \
  --o-per-sample-sequences ./output/demux.qza
  Saved SampleData[SequencesWithQuality] to: ./output/demux.qza
Some sequencing facilities give you demultiplexed files so you can ignore the previous step.

Now let us generate a summary of the demultiplexing results. With this we can access our quality and have some basic statistics on our reads:

qiime demux summarize \
  --i-data ./output/demux.qza \
  --o-visualization ./output/demux.qzv
    Saved Visualization to: ./output/demux.qzv

If you were using your own computer or a server that you can connect to graphically then you could do the following. If you are using Jupyter, however, you will not be able to.

qiime tools view ./output/demux.qzv
    Usage: qiime tools view [OPTIONS] VISUALIZATION_PATH

    Error: Visualization viewing is currently not supported in headless environments. You can view Visualizations (and Artifacts) at https://view.qiime2.org, or move the Visualization to an environment with a display and view it with `qiime tools view`.

To view the file we created, you can select the files from the main Jupyter window and download them to your computer. download_jupyter

Then we select the file and just drag/drop the file onto the QIIME VIEW webpage.

On the overview section you can see some basic statistics about your sequencing run. The minimum of sequences you got per sample were 1,853 and the maximum 18,787. You also got the mean and median, as well as the total number of sequences / reads.

Now go to the interactive quality plot in the top left menu: here you can see the overall quality of your sequences. On the X-axis are the number of base pairs, in this tutorial we used a Single-End sequencing technique sequencing 150bp in the forward (R1) direction. On the Y-axis we can see the quality per base with 40 being the highest and 0 being the lowest.

Do you notice something?

The quality of our bases drops dramatically at 110 - 120 bp. This can be normal and is usually due to the sequencing machine or the supplied kit for sequencing. However, we do not want low quality bases for our study: If the quality is lower it tells us that the machine cannot correctly distinguish between the 4 bases and might give us artifacts (i.e. wrong sequences).

Therefore, we want to trim these bases off to get an accurate sequence.

12.4.2. Sequence quality control and feature table construction

As discussed above we want to trim our sequences and filter them to discard artifacts and wrong sequences.

QIIME 2 plugins are available for several quality control methods, including DADA2, Deblur, and basic quality-score-based filtering. In this tutorial we present this step using DADA2. The result of both of these methods will be a FeatureTable[Frequency] QIIME 2 artefact, which contains counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence] QIIME 2 artefact, which maps feature identifiers in the FeatureTable to the sequences they represent.

A feature table is also called a OTU (Operational Taxonomic Unit) table. However, in the past year was a change and researchers move slowly away from OTUs,

Definitions of OTU (Operational Taxonomic Unit)
Until recently (2017)

We used to define our data as OTU (plural OTUs), specifically an "Operational Taxonomic Unit". This has changed over time from bands on an electrophoresis gel to two pieces of DNA with a particular level of hybridization to sequence data based OTUs, which are treated like "species".

Sequences at/above a given similarity threshold considered part of the same OTU (97% is the usual “species-level”).

Purpose is to minimize impact of sequencing errors

This year (2018)

We now define our OTU data as: "An operation definition of a sequence data point that explicitly models the errors of the sequencing technology.".

Other names that appeared were sub-OTU (sOTU), zOTU, ESV, ASV, or even oligotypes (we will just call them OTU here).

We now use error modelling to in-silico correct sequencing mistakes and, given the right error model, it is quite accurate. Importantly, the error model is specific to the sequencing type (e.g., 454, Illumina Hi/MiSeq).

The end result is the sequences more likely to have been provided to the sequencer.

A number of software can undertake this sequence correcting and a couple are supported by the QIIME2 package:

We will use DADA2 in this tutorial but feel free to use Deblur if you would like. Note that currently Deblur only corrects 16S rDNA reads and not fungal ITS2 reads.

Let’s see the options of DADA2:

qiime dada2 --help
    Usage: qiime dada2 [OPTIONS] COMMAND [ARGS]...

      Description: This QIIME 2 plugin wraps DADA2 and supports sequence quality
      control for single-end and paired-end reads using the DADA2 R library.

      Plugin website: http://benjjneb.github.io/dada2/

      Getting user support: Please post to the QIIME 2 forum for help with this
      plugin: https://forum.qiime2.org

      Citing this plugin: DADA2: High-resolution sample inference from Illumina
      amplicon data. Benjamin J Callahan, Paul J McMurdie, Michael J Rosen,
      Andrew W Han, Amy Jo A Johnson, Susan P Holmes. Nature Methods 13, 581–583
      (2016) doi:10.1038/nmeth.3869.

    Options:
      --help  Show this message and exit.

    Commands:
      denoise-paired  Denoise and dereplicate paired-end sequences
      denoise-single  Denoise and dereplicate single-end sequences
With DADA2 we can denoise single and paired-end reads.

Our tutorial uses Single End reads so we will use the denoise-single.

qiime dada2 denoise-single --help
    Usage: qiime dada2 denoise-single [OPTIONS]

      This method denoises single-end sequences, dereplicates them, and filters
      chimeras.

    Options:
      --i-demultiplexed-seqs ARTIFACT PATH SampleData[PairedEndSequencesWithQuality | SequencesWithQuality]
                                      The single-end demultiplexed sequences to be
                                      denoised. [required]
      --p-trunc-len INTEGER           Position at which sequences should be
                                      truncated due to decrease in quality. This
                                      truncates the 3' end of the of the input
                                      sequences, which will be the bases that were
                                      sequenced in the last cycles. Reads that are
                                      shorter than this value will be discarded.
                                      If 0 is provided, no truncation or length
                                      filtering will be performed  [required]
      --p-trim-left INTEGER           Position at which sequences should be
                                      trimmed due to low quality. This trims the
                                      5' end of the of the input sequences, which
                                      will be the bases that were sequenced in the
                                      first cycles. [default: 0]
      --p-max-ee FLOAT                Reads with number of expected errors higher
                                      than this value will be discarded.
                                      [default: 2.0]
      --p-trunc-q INTEGER             Reads are truncated at the first instance of
                                      a quality score less than or equal to this
                                      value. If the resulting read is then shorter
                                      than `trunc_len`, it is discarded.
                                      [default: 2]
      --p-chimera-method [pooled|consensus|none]
                                      The method used to remove chimeras. "none":
                                      No chimera removal is performed. "pooled":
                                      All reads are pooled prior to chimera
                                      detection. "consensus": Chimeras are
                                      detected in samples individually, and
                                      sequences found chimeric in a sufficient
                                      fraction of samples are removed. [default:
                                      consensus]
      --p-min-fold-parent-over-abundance FLOAT
                                      The minimum abundance of potential parents
                                      of a sequence being tested as chimeric,
                                      expressed as a fold-change versus the
                                      abundance of the sequence being tested.
                                      Values should be greater than or equal to 1
                                      (i.e. parents should be more abundant than
                                      the sequence being tested). This parameter
                                      has no effect if chimera_method is "none".
                                      [default: 1.0]
      --p-n-threads INTEGER           The number of threads to use for
                                      multithreaded processing. If 0 is provided,
                                      all available cores will be used. [default:
                                      1]
      --p-n-reads-learn INTEGER       The number of reads to use when training the
                                      error model. Smaller numbers will result in
                                      a shorter run time but a less reliable error
                                      model. [default: 1000000]
      --p-hashed-feature-ids / --p-no-hashed-feature-ids
                                      If true, the feature ids in the resulting
                                      table will be presented as hashes of the
                                      sequences defining each feature. The hash
                                      will always be the same for the same
                                      sequence so this allows feature tables to be
                                      merged across runs of this method. You
                                      should only merge tables if the exact same
                                      parameters are used for each run. [default:
                                      True]
      --o-table ARTIFACT PATH FeatureTable[Frequency]
                                      The resulting feature table. [required if
                                      not passing --output-dir]
      --o-representative-sequences ARTIFACT PATH FeatureData[Sequence]
                                      The resulting feature sequences. Each
                                      feature in the feature table will be
                                      represented by exactly one sequence.
                                      [required if not passing --output-dir]
      --output-dir DIRECTORY          Output unspecified results to a directory
      --cmd-config PATH               Use config file for command options
      --verbose                       Display verbose output to stdout and/or
                                      stderr during execution of this action.
                                      [default: False]
      --quiet                         Silence output if execution is successful
                                      (silence is golden). [default: False]
      --help                          Show this message and exit.
That’s a lot of options. However, we only need a few in this tutorial. Feel free to try the other options and see if you can improve the results.

We want to just trim the sequences were we think there might be the quality drop. I specified truncating 120 bp with the trunc-len flag, but you can use 110 or even lower. However, be careful to not trim to many base pairs, because in later stages we want to align the sequences to a reference database. Very short sequences might not give us the right taxonomic assignment to species or genus level. You can also trim the sequences in the beginning with the trim-left flag to remove, for example, the primers you used in the beginning, because they contain ambiguous bases like G or N and can influence the DADA2 error model. Also, trimming can influence how many of your reads overlap when merging paired end data so it’s important to consider this when choosing cut-offs.

This command might take 5-10 minutes.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs ./output/demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 120 \
  --o-representative-sequences ./output/rep-seqs-dada2.qza \
  --o-table ./output/table-dada2.qza \
  --verbose
    Running external command line application(s). This may print messages to stdout and/or stderr.

    R version 3.4.1 (2017-06-30)
    Loading required package: Rcpp
    DADA2 R package version: 1.6.0
    1) Filtering ..................................
    2) Learning Error Rates
    Initializing error rates to maximum possible estimate.
    Sample 1 - 8571 reads in 2110 unique sequences.
    [...]
    Convergence after  5  rounds.
    3) Denoise remaining samples
    4) Remove chimeras (method = consensus)
    5) Report read numbers through the pipeline
                                   input filtered denoised non-chimeric
    L1S105_9_L001_R1_001.fastq.gz  11340     8571     8571         7865
    L1S140_6_L001_R1_001.fastq.gz   9736     7676     7676         7245
    L1S208_10_L001_R1_001.fastq.gz 11335     9260     9260         8270
    L1S257_11_L001_R1_001.fastq.gz  8216     6705     6705         6486
    L1S281_5_L001_R1_001.fastq.gz   8904     7066     7066         6755
    L1S57_13_L001_R1_001.fastq.gz  11750     9298     9298         8756
    6) Write output
    Saved FeatureTable[Frequency] to: ./output/table-dada2.qza
    Saved FeatureData[Sequence] to: ./output/rep-seqs-dada2.qza

Here we can now see what the program did. We gave DADA2 a input (our sequences), it filtered them based on the error rate we provided (by default maxEE = 2) and it also performed a chimera check. Chimeras are non-biological sequences or artifacts produced by the sequencer.

The output are two files: one is containing your representative sequences and the other is containing the so called OTU-table or in this case an sOTU / ASV-Table since we didn’t cluster sequences together. Have a look at the above table in the end and you will see how may sequences get dropped by the error-correction and chimera-detection.

FeatureTable and FeatureData summaries

Let’s have a look and summarize the our feature-table and representative-sequence table.

qiime feature-table summarize \
  --i-table ./output/table-dada2.qza \
  --o-visualization ./output/table-dada2.qzv \
  --m-sample-metadata-file sample-metadata.tsv
qiime feature-table tabulate-seqs \
  --i-data ./output/rep-seqs-dada2.qza \
  --o-visualization ./output/rep-seqs-dada2.qzv
    Saved Visualization to: ./output/table-dada2.qzv
    Saved Visualization to: ./output/rep-seqs-dada2.qzv

Now, again we can use Jupyter’s download function and QIIME2 View to explore the results. We will have an overview of the basic statistics and the number of features present for our sequence variants. Use the Interactive sample detail page and you can see how many sequences remained after we implemented our quality controls for each sample.

12.4.3. Phylogenetic tree construction

In order to estimate the phylogenetic metrics of diversity we need a phylogenetic tree that determines the phylogenetic relationships between our sOTUS and across the samples. The number of sequences is these trees are too many for humans to analyse and interpret, rather we can use the phylogenetic information to get some enhanced diversity statistics that are embedded in a phylogenetic framework.

So in order to infer a phylogenetic placement we have to first align the representative sequences with each other. We will use the mafft aligner for this.

qiime alignment mafft \
  --i-sequences ./output/rep-seqs-dada2.qza \
  --o-alignment ./output/aligned-rep-seqs-dada2.qza
    Saved FeatureData[AlignedSequence] to: ./output/aligned-rep-seqs-dada2.qza

Now that we have aligned the sequences, not all the regions of the alignment can be robustly aligned and might skew our results. The solution is to remove parts of the hypervariable region. QIIME 2 has an automated process for this:

qiime alignment mask \
  --i-alignment ./output/aligned-rep-seqs-dada2.qza \
  --o-masked-alignment ./output/masked-aligned-rep-seqs-dada2.qza
    Saved FeatureData[AlignedSequence] to: ./output/masked-aligned-rep-seqs-dada2.qza

Now we will build the phylogenetic tree using the fasttree algorithm:

qiime phylogeny fasttree \
  --i-alignment ./output/masked-aligned-rep-seqs-dada2.qza \
  --o-tree ./output/unrooted-tree-dada2.qza
    Saved Phylogeny[Unrooted] to: ./output/unrooted-tree-dada2.qza

It is important to know that fasttree gives an unrooted tree, i.e. there is no outgroup. However, we want to place a root in the middle of the tree (midpoint-root) in order to provide a topology that is easier to view if needed (note that this root has no evolutionary meaning).

qiime phylogeny midpoint-root \
  --i-tree ./output/unrooted-tree-dada2.qza \
  --o-rooted-tree ./output/rooted-tree-dada2.qza
    Saved Phylogeny[Rooted] to: ./output/rooted-tree-dada2.qza

12.4.4. Taxonomic classification and analysis

If you have reached this point, you will now have your sequence variants (sOTUs) and a phylogenetic tree describing their evolutionary relationships. The next step is to know which taxa (for example genera) have been found in our samples. There are a number of ways to do this (see here) but we will use a tool called the taxonomy classifier. This works by using a machine learning classifier that has been trained on data where the link between sequence variant and taxonomy is known. This is then stored in the QIIME file format .qza. In this case we have trained the classifier using the full-length GreenGenes classifier.

The taxonomic assignment can be improved by training the classifier to your specific dataset (e.g. primer pair and region of the 16S rDNA being analysed). See here for details of how to do this.
qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads ./output/rep-seqs-dada2.qza \
  --o-classification ./output/taxonomy-dada2.qza
    Saved FeatureData[Taxonomy] to: ./output/taxonomy-dada2.qza
qiime metadata tabulate \
  --m-input-file ./output/taxonomy-dada2.qza \
  --o-visualization ./output/taxonomy-dada2.qzv
    Saved Visualization to: ./output/taxonomy-dada2.qzv

Now we want to have a look at the taxonomic assignments and see what bacteria are there. You can also export these tables to use in other statistics programs.

qiime taxa barplot \
  --i-table ./output/table-dada2.qza \
  --i-taxonomy ./output/taxonomy-dada2.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization ./output/taxa-bar-plots-dada2.qzv
  Saved Visualization to: ./output/taxa-bar-plots-dada2.qzv

12.4.5. Alpha and Beta diversity

At this point, we can use any statistical software we wish, for example R. In order to do this, we need to first convert our QIIME-specific artefact file format into something readable by other programs, such as a comma separated file (csv).

We have prepared a short python script that will do the trick:

python table.py

In this tutorial, we will, however, show you QIIME’s statistical tools:

Rarefaction plotting

Rarefaction is statistics-speak for exploring and normalising your data for sampling depth.

To evaluate if our sampling depth was sufficient we will compute the alpha-rarefraction curves. The sampling depth determines if we captured the whole diversity of sequences within a sample or if we missed most of the diversity.

qiime diversity alpha-rarefaction \
  --i-table ./output/table-dada2.qza \
  --i-phylogeny ./output/rooted-tree-dada2.qza \
  --p-max-depth 4000 \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization ./output/alpha-rarefaction-dada2.qzv
    Saved Visualization to: ./output/alpha-rarefaction-dada2.qz
Diversity Analysis

Qiime2 has a workflow which automates the calculation of many useful alpha- and beta- diversity metrics and produces visualization of some of them:

  • Alpha diversity

    • Shannon’s diversity index (a quantitative measure of community richness)

    • Observed OTUs (a qualitative measure of community richness)

    • Faith’s Phylogenetic Diversity (a qualitative measure of community richness that incorporates phylogenetic relationships between the features)

    • Evenness (or Pielou’s Evenness; a measure of community evenness)

  • Beta diversity

    • Jaccard distance (a qualitative measure of community dissimilarity)

    • Bray-Curtis distance (a quantitative measure of community dissimilarity)

    • unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

    • weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

For more details about which other metrics than can be calculated see here.

An important parameter that needs to be provided to this script is --p-sampling-depth. This is the number of sequences to which your samples will be rarefied - i.e. even out the sampling depth. This is an important consideration as you want to have enough sequences to describe the diversity of your samples but samples with less sequences than this will be removed. You can also manually rarefy your table using the qiime feature-table rarefy command.

Let’s run the diversity command and then look at some of the outputs:

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny ./output/rooted-tree-dada2.qza \
  --i-table ./output/table-dada2.qza \
  --p-sampling-depth 1109 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir ./output/core-metrics-results
    Saved FeatureTable[Frequency] to: ./output/core-metrics-results/rarefied_table.qza
    Saved SampleData[AlphaDiversity] % Properties(['phylogenetic']) to: ./output/core-metrics-results/faith_pd_vector.qza
    Saved SampleData[AlphaDiversity] to: ./output/core-metrics-results/observed_otus_vector.qza
    Saved SampleData[AlphaDiversity] to: ./output/core-metrics-results/shannon_vector.qza
    Saved SampleData[AlphaDiversity] to: ./output/core-metrics-results/evenness_vector.qza
    Saved DistanceMatrix % Properties(['phylogenetic']) to: ./output/core-metrics-results/unweighted_unifrac_distance_matrix.qza
    Saved DistanceMatrix % Properties(['phylogenetic']) to: ./output/core-metrics-results/weighted_unifrac_distance_matrix.qza
    Saved DistanceMatrix to: ./output/core-metrics-results/jaccard_distance_matrix.qza
    Saved DistanceMatrix to: ./output/core-metrics-results/bray_curtis_distance_matrix.qza
    Saved PCoAResults to: ./output/core-metrics-results/unweighted_unifrac_pcoa_results.qza
    Saved PCoAResults to: ./output/core-metrics-results/weighted_unifrac_pcoa_results.qza
    Saved PCoAResults to: ./output/core-metrics-results/jaccard_pcoa_results.qza
    Saved PCoAResults to: ./output/core-metrics-results/bray_curtis_pcoa_results.qza
    Saved Visualization to: ./output/core-metrics-results/unweighted_unifrac_emperor.qzv
    Saved Visualization to: ./output/core-metrics-results/weighted_unifrac_emperor.qzv
    Saved Visualization to: ./output/core-metrics-results/jaccard_emperor.qzv
    Saved Visualization to: ./output/core-metrics-results/bray_curtis_emperor.qzv

Let’s look at what files we have created:

ls ./output/core-metrics-results
    bray_curtis_distance_matrix.qza  observed_otus_vector.qza
    bray_curtis_emperor.qzv          rarefied_table.qza
    bray_curtis_pcoa_results.qza     shannon_vector.qza
    evenness_vector.qza              unweighted_unifrac_distance_matrix.qza
    faith-pd-group-significance.qzv  unweighted_unifrac_emperor.qzv
    faith_pd_vector.qza              unweighted_unifrac_pcoa_results.qza
    jaccard_distance_matrix.qza      weighted_unifrac_distance_matrix.qza
    jaccard_emperor.qzv              weighted_unifrac_emperor.qzv
    jaccard_pcoa_results.qza         weighted_unifrac_pcoa_results.qza

Now let’s look at some alpha diversity results, by running a statistical test to look at differences with Faith’s phylogenetic diversity metric. Note that we are using parameters from the mapping file.

qiime diversity alpha-group-significance \
  --i-alpha-diversity ./output/core-metrics-results/faith_pd_vector.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization ./output/core-metrics-results/faith-pd-group-significance.qzv
    Saved Visualization to: ./output/core-metrics-results/faith-pd-group-significance.qzv

Open the faith-pd-group-significance.qzv file using QIIME2 View and address the following questions.

  • Is there a difference in diversity between body sites? (BodySite)

  • Is there a difference between the people sampled? (Subject)

Now we can look at the beta diversity results.

We will look at the weighted_unifrac_emperor.qzv plot. This has used the Weighted Unifrac similarity matrix to conduct a Principle CoOrdinate Analysis and Vizualized it as a 3D plot. More details of this analysis, which is automated by the core metrics script, can be found here and here.

Look at the file and use the dropdown menu on the right (from your metadata mapping file!) to color the samples by BodySite and Subject…​Is this what you would expect based on the Alpha Diversity results?

Next, we want to see if the groupings that we observe visually are also statistically significant. We will use the "Beta-group significance" command to do this (see here for how this works).

qiime diversity beta-group-significance \
  --i-distance-matrix ./output/core-metrics-results/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column BodySite \
  --o-visualization ./output/core-metrics-results/unweighted-unifrac-body-site-significance.qzv \
  --p-pairwise

And for the other metadata column (Subject):

qiime diversity beta-group-significance \
  --i-distance-matrix ./output/core-metrics-results/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column Subject \
  --o-visualization ./output/core-metrics-results/unweighted-unifrac-subject-group-significance.qzv \
  --p-pairwise
  • Is there a difference between BodySite and Subject?

  • Does this match your expectations from the plot?

That is the core workflow. Not their are many more statistical tests available in QIIME2 and you can explore these in the plugins section of their website.

12.5. Self guided tutorial - Atacama soil microbiomes

Purpose
For this part, you will practice what you have learnt in the previous tutorial.

You will continue to use the Jupyter Hub installation. We will run though an analysis using a soil microbiome dataset from this tutorial. For this exercise, we provide some questions that can be used to guide your analysis, but you have to provide commands!

You can enter your own commands by using Jupyter’s Insert → Cell Below (it will insert a new command option beneath the cell you have selected).

We will start you off with some best practice for data management. Make a new directory to conduct your analysis in and change into this directory:

mkdir -p qiime2-atacama-tutorial
cd qiime2-atacama-tutorial

Then we are showing you how to download the data. First, the metadata sheet using the command wget and the option -O (that is a capital O) to select the output file:

wget -O "sample-metadata.tsv" "https://data.qiime2.org/2018.4/tutorials/atacama-soils/sample_metadata.tsv"
    Saving to: ‘sample-metadata.tsv’
    [...]
    2018-05-28 15:11:59 (24.6 MB/s) - ‘sample-metadata.tsv’ saved [11984]

You can open this sheet through Jupyter’s main window, or you can just look at the headings to get a sense of the metadata (what is the command to read the first part of a file?).

Next, you’ll download the multiplexed reads. You will download three fastq.gz files, corresponding to the forward, reverse, and barcode (i.e., index) reads. Note these are Paired-End reads, unlike in the previous tutorial. We will store the input data in a separate location because that is best practice and good housekeeping.
mkdir emp-paired-end-sequences
wget -O "emp-paired-end-sequences/forward.fastq.gz" "https://data.qiime2.org/2018.4/tutorials/atacama-soils/10p/forward.fastq.gz"
    Saving to: ‘emp-paired-end-sequences/forward.fastq.gz’
    2018-05-28 15:20:21 (5.89 MB/s) - ‘emp-paired-end-sequences/forward.fastq.gz’ saved [143193967/143193967]
wget -O "emp-paired-end-sequences/reverse.fastq.gz" "https://data.qiime2.org/2018.4/tutorials/atacama-soils/10p/reverse.fastq.gz"
    Saving to: ‘emp-paired-end-sequences/reverse.fastq.gz’
    2018-05-28 15:22:17 (5.40 MB/s) - ‘emp-paired-end-sequences/reverse.fastq.gz’ saved [161032441/161032441]
wget -O "emp-paired-end-sequences/barcodes.fastq.gz" "https://data.qiime2.org/2018.4/tutorials/atacama-soils/10p/barcodes.fastq.gz"
    Saving to: ‘emp-paired-end-sequences/barcodes.fastq.gz’
    2018-05-28 15:31:55 (4.07 MB/s) - ‘emp-paired-end-sequences/barcodes.fastq.gz’ saved [19976093/19976093]

Then, you need to import the data. What command will you use?

Remember we use the qiime tools import command. Also remember that the data is Paired End.

Next, you need to demultiplex the sequence reads. This requires the sample metadata file, and you must indicate which column in that file contains the per-sample barcodes. In this case, that column name is BarcodeSequence. It just so happens that in this data set, the barcode reads are the reverse complement of those included in the sample metadata file, so we additionally include the --p-rev-comp-mapping-barcodes parameter. After demultiplexing, we can generate and view a summary of how many sequences were obtained per sample.

You will need the demux commands, such as qiime demux emp-paired and qiime demux summarize.

Remember, to see the results, you will need to open the qzv vizualization file you just made and assess where to trim the reads.

Then you can run the DADA2 pipeline specifying where to trim each end of the forward and reverse reads. Remember: this is paired end data, so you will need the qiime dada2 denoise-paired command.

Remember you can type --help to see the options of a qiime command.

After you complete running DADA2, summarize the data using the commands qiime feature-table summarize and qiime feature-table tabulate-seqs.

Now you can use the qiime diversity plugin to analyze the beta-diversity patterns in your data. Also you can use the feature classifier and the taxa plugins to view which taxa are present.

Use the following questions to guide your analysis:

  1. What value would you choose to pass for --p-sampling-depth?

  2. How many samples will be excluded from your analysis based on this choice?

  3. Approximately how many total sequences will you be analysing in the core-metrics-phylogenetic command?

  4. What sample metadata or combinations of sample metadata are most strongly associated with the differences in microbial composition of the samples?

  5. Are these associations stronger with unweighted UniFrac or with Bray-Curtis?

  6. Based on what you know about these metrics, what does that difference suggest?

For exploring associations between continuous metadata and sample composition, the commands qiime metadata distance-matrix in combination with qiime diversity mantel and qiime diversity bioenv will be useful. These were not covered in the Moving Pictures tutorial, but try to run them (remember the --help parameter).

  1. What do you conclude about the associations between continuous sample metadata and the richness and evenness of these samples?

  2. For exploring associations between continuous metadata and richness or evenness, the command qiime diversity alpha-correlation will be useful. This was not covered in the Moving Pictures tutorial, but you can learn about it by running it with the --help parameter.

  3. Which categorical sample metadata columns are most strongly associated with the differences in microbial community richness or evenness?

  4. Are these differences statistically significant?

  5. In taxonomic composition bar plots, sort the samples by their average soil relative humidity, and visualize them at the phylum level.

  6. What are the dominant phyla in these samples?

  7. Which phyla increase and which decrease with increasing average soil relative humidity?

  8. What phyla differ in abundance across vegetated and non-vegetated sites?

Day Four

  • 09:00–09:45: Open Panel Q&A with Speakers

  • 09:45-12:00: Computer Lab 2: Genomic Analysis of Microbial Biodiversity

  • 12:00-13:00: Lunch and Networking

  • 13:00-13:30: Closing ceremony; Certificates; Awards for best Lighting Talk

  • 13:30-16:00: Consultation and networking

13. Genomic Analysis of Microbial Biodiversity

Abstract
This tutorial focuses on annotating whole genome shotgun data from multiple organisms (i.e. "metagenomics"). We introduce the tools available for inferring predicted function from ribosomal genes (PicRUST) and the FOCUS/SUPERFOCUS which uses k-mer frequencies to annotate microbial functional pathways. Students will then use the MG-RAST platform to visualize functional pathways and relationships in agricultural soil metagenomes.

Over the last decade, many tools have been developed to obtain profiles of functional (metabolic) gene content and abundance from environmental samples.

These packages can broadly be defined into three categories:

  1. Predictive tools: these rely on inferring the function of environmental genomes using taxonomic marker genes (e.g. 16S rDNA) and the genomes of closely related organisms that are in a database.

  2. Gene-centric tools: these rely on predicting gene coding regions from "shotgun" metagenomes i.e. random fragments of microbial genomes from mixed communities, and then comparing those sequences to a database.

  3. Genome-binning tools: these utilize assembled shotgun-metagenomes and abundance information to extract "population genomes" from mixed communities.

In this tutorial we will discuss an example of each, before exploring a popular online "gene-centric" shotgun metagenomics pipeline, MG-RAST.

13.1. PICRUSt

Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (Langille et al, 2013, Nature Biotechnology)

PICRUSt uses the Greengenes annotation of a representative set of 16SrDNA amplicons e.g. generated by QIIME (QIIME2 coming soon….). For some of the 16S rDNA genes in the Greengenes database, or closely related taxa, there is a precomputed database of the functional genes associated with the genome of each. PICRUSt uses this profile to predict the functions present in your 16S rDNA profile, and measures how phylogenetically distinct the ribosomal genes in your sample are from those used to predict the function.

screenA
Figure 4. (from Langille et al, 2013, Nature Biotechnology)
Do you think it is a good idea to predict function when we have not actually measured the functional genes?
This comes down to how closely related the phylogeny and function of microbes is – this is an active area of research however it appears it is lineage specific and depends on factors such as lateral gene transfer. Genome binning tools are helping to shed light on this question. One advantage of this approach is it adds value to your 16S rDNA sequences – we use it to determine which samples we should focus our shotgun sequencing on.

13.2. SUPER-FOCUS

It first computes the taxonomic composition of your metagenome using a related tool, FOCUS, then uses a pre-computed database based on the genomes found to be present in your sample for functional annotation.

This is called a REDUCED DATABASE
screenB
Figure 5. from Silva et al, 2016, Bioinformatics
  • It then uses one of three aligners (RAPsearch2, DIAMOND, BLAST) to compare your sequences to the reduced database and produce a metabolic "subsystem" profile of your metagenome which can be imported into the statistical software of your choice.

13.3. GroopM

GroopM relies on k-mer frequencies, and the co-occurrence of contigs from the same genome over a gradient where the same genomes vary in abundance, to extract "population genomes" from a mixed community. This is called "differential coverage binning" as it relies on different individual genomes to have differential assembly coverage.

Genomes can then be refined, and other tools, such as marker gene sets, used to asses each population genome for contamination and full coverage.

screenC
Figure 6. from Imelfort et al, 2014, PeerJ
screenD

This is not an exhaustive list. Other popular tools include MEGAN, MetaPhlAn / HUMAnN and the EBI Metagenomics pipeline to name just a few.

How do you choose? One option is to try different software and choose one that is suitable for your data, you find easy to use and is well maintained. Sczyrba et al (2017, Nature Methods) provides a good overview of the performance of various tools. Also see Aaron Darling’s talk on the CAMI competition. Keep in mind that new tools are being published all the time.

13.4. The MG-RAST Metagenomics Analysis Server

For the "hands on" tutorial, we are going to use the MG-RAST server:

The MG-RAST server is an "end to end" graphical cloud-based solution for analysing shotgun metagenomics data. This includes quality control, sequence annotation and statistics.

Following QC and data curation, MG-RAST predicts protein encoding genes in the dataset and annotates against several taxonomic and functional databases, collectively the M5nr database (Wilke et al, 2012, BMC Bioinformatics). The output can then be exported, or visualized with several statistical tools.

screenE
screenF

You can use MG-RAST without making an account – but If you create an account you can use private data, create sample collections and collaboratively share data. Today we are using publically available data so we don’t need to login.

screenH

Note however that with an account we can upload data and MG-RAST will allow us to modify parameters and conduct quality control and will then run the annotation pipeline for the samples and metadata. Today however we are going to explore data that has already gone through this pipeline. So, head to the MG RAST server.

First, let’s look at the search function of MG-RAST (indicated by the magnifying glass). This allows us to search all of the metagenomes in the database. It is an excellent resource for conducting meta-analysis and for downloading data. Other tools such as SearchSRA can also help us find free and public data to use in our work.

screenI

Note that there is both shotgun and amplicon data and that environmental metadata is included in the table to help us filter and search.

  • Pick a shotgun metagenome and click on the link – now we can see some information about this metagenome and what study it was a part of:

screenJ
  • By clicking on the actual sample then we can explore some information:

screenK
screenL

The QC summary tells us how many sequences were good quality, and how many had significant matches to the database. There is a lot of different pieces of information here, such as sequencing error models, K-mer distributions, Base-call frequencies (these should be roughly even) and E-value plots.

  • Look through these on your own and we will go through them together. Here we will focus on the biological information.

screenM
  • Here we can explore the metabolic profile of our metagenome, and see the relative abundance of high-level metabolic pathways. Additionally, we also see a taxonomic breakdown of our samples:

1

Next we are going to conduct an analysis of metagenomes from several soil habitats to explore some of the various tools in MG-RAST. These soil metagenomes come from a publication by Fierer et al (PNAS, 2012) which described global patterns in soil metagenome profiles. It is available here.

  • Navigate to the Analysis tab – the Icon for this is a bar chart. We can then search for the metagenomes of interest using their ID numbers and add them to the analysis using the arrow key. Note that we have selected "Subsystems" as the database and removed the default RefSeq database and that we have changed the search term to ID. (Metagenome IDs: mgm4477901.3, mgm4477902.3, mgm44779805.3, mgm4477872.3, mgm4477876.3, mgm4477877.3).

2
  • Click on the green arrow and this will begin to analyse the functional gene profiles of your samples and provide the frequency of matches to the database we specified. This step could take around ten minutes.

  • Using the analysis icons, explore some of the outputs which display the functional metabolic composition of the metagenomes, perhaps the most useful for an overview will be the donut plots, bubble plots and the stacked error bar plots.

Can you see any pathways that differ in abundance between the samples and habitats?
3
4

We can also conduct beta-diversity analysis plots, similar to yesterdays QIIME2 tutorial within MG-RAST.

  • Click on the PCoA analysis button_ to view how similar the metagenome functional profiles are between your samples. You can colour them by the biome type.

What samples group together? Is this what you would expect from metagenomes from different biomes?
5
  • Click on the table tab, this will produce a summary table of the relative frequency of different metabolic pathways (you can change the level of analyses using drop down menus) – using the export button you can download this table and use it in any software package you wish to conduct statistical analysis. One popular package is the STAMP software and described this in Parkes et al, 2014, Bioinformatics.

6
7
  • Click on the plugins button and open the KRONA plot. This is an interactive way to explore the contribution of different metabolic pathways in your metagenome.

9
8

Next we are going to look at a useful way of exploring the metabolic content of our metagenomes.

  • Navigate back to the analysis tab and search for the word "coorong" adding the samples M1 and M4 only.

Note, we have selected the KO

These metagenomes are from a hypersaline lake in Australia heavily influenced by agricultural irrigation – one is from a low salinity sample and one is from an extreme salinity sample. They are described in the following two publications (Jeffries et al, 2011, PLoS One; Jeffries et al, 2012, Biogeosciences).

10
  • Open the KEGG-Mapper plugin from the analysis menu. This tells us the relative number of matches to individual enzymes in metabolic pathways – click on Glycolysis. Most enzymes have matches in both metagenomes.

What does this tell us about the coverage of our metagenomes and importance of this pathway to cells in these samples?
11
  • Now click on both the Photosynthesis and Antennae pathways. This are both involved in cyanobacteria getting energy from sunlight.

What does the distribution of the matches of these enzymes tell us about the coverage of our metagenome, and of the ecology of this ecosystem?
12
13

That’s it! You can continue to use the search and analyse buttons to explore the data and functions in MG-RAST for a set of metagenomes you are interested in.

the minions well done

1. So rapidly evolving in fact that most bioinformatics manuscripts become obsolete by the time they are published.