kscbioinformatics

Please re-direct to my new website/blog

I have re-organized digitally and migrated most of my content to my new website: https://lorenlaunen.wordpress.com/

Thanks and sorry for any inconvenience!

Loren Launen

Sarah Sanders' great post on testing for bacterial motility!

via Bacterial Motility using Bacillus subtilis

ASM Microbe Day 3

Hard to even describe how exciting today’s Microbe was. Going backwards through the day – it ended with Kate Rubins, NASA astronaut (yeah – as in, spent months on the space station) and virologist, giving a keynote that was organized as a conversation with science writer extraordinaire Ed Yong. She was/is inspiring. Listening to Kate talk about her work on the space station, which was focused on figuring out how to do science in space, meant hearing phrases like “When we go to Mars we ….”. Not IF, when. I haven’t been so hopeful, and so inspired, for many years. We can solve big problems and do great things.

And there was a lot of other great stuff! I really enjoyed listening to the talk given by Harmit Malik, who was the recipient of the Eli Lilly and Company-Elanco Research Award and focuses on the molecular arms race between hosts and viral pathogens. From this I learned that focusing on conserved regions of proteins is not necessarily helpful in understanding evolutionary change in systems where competition is driving the bus. Better to focus on regions that change quickly because that’s where genetic innovation shows up. Time to read up on the Red Queen hypothesis! I also learned that a gene essential to making the placenta in placental mammals got into mammalian genomes from a retrovirus. Bizarre and cool (the env gene, in case you are interested).

There was a great session on invasive group A strep that helped me understand the global importance of these infections and also reminded me of the famous tale of Ignaz Semmelweis. I would like to write a book on this topic from an historical perspective for lay folk. Yup, this is maybe a commitment on that. Maybe. Andrew Steer spoke in this session and also kindly answered many of my questions.

And last but most important, I learned more about Vibrio vulnificus, the creature I now research. Specifically, we need to look (in our isolate genomes) for the rtxA gene, which encodes the MARTX protein, an important virulence factor and a means of biotyping. I had a great conversation with Douglas Teague, doctoral student at the U of S Alabama, who is working on this protein and was really helpful. I also learned a lot yesterday and today from posters out of Charles Lovell’s group (U of S Carolina) on Vibrio.

Tomorrow is the day I present my own poster (which makes me nervous), and there is lots of great science to see and learn! More on day four tomorrow!

ASM Microbe 2017 day 2 comments

Another great day at Microbe! This morning I attended a session on microbes and sex (lateral gene transfer, or LGT). Big takeaway – how do we make phylogenetic trees with organisms that have so much of their genome derived from lateral gene transfer? For example, Camilla Nesbo discussed the Thermotoga (specifically, members that are mesophilic from oil wells) – a group that has at least 10 percent of its genome derived from lateral gene transfer from archaea (Thermotoga is a bacterial taxon).

Tal Dagan spoke on transduction. This was fascinating. While I teach about transduction, transformation and conjugation as the three LGT mechanisms, there are more (e.g. cytoplasmic bridges, nanotubes and outer membrane vesicles). This was news to me. I also learned that ALL protein families have been transferred laterally at least once in evolution! And of course, this is especially relevant in the prokaryotes. I learned we could use the tool PHAST to map prophages in genomes (prokaryotic ones), and that transduction is an important way in which gene duplication events occur (coined autologs). And… genome similarity strongly predicts the likelihood that two donor cell types / species / strains / whatever can share the same phage. As a last point – LGT by transduction has allowed some bacteria to acquire whole new metabolic abilities, like photosystems! Wow!

In the same session, James McInerney spoke on many ideas including evolution of eukaryotic cells and how to analyze data without making phylogenetic trees (which oversimplify). This was all great and amongst other tidbits I learned that in yeast the archaeally derived genes (which make up less of the genome than the bacterially derived genes) are 100-fold more important to the cell, even when not of clearly important function – whaaaat??? So what is a yeast cell? Eukaryotic? Bacterial? Or archaeal?

At the poster session I got a lot of great V. vulnificus info. More later.

MEGA

MEGA stands for Molecular Evolutionary Genetics Analysis, a fairly easy program to use that creates phylogenetic trees from bacterial isolate sequence FASTA files. I say fairly easy because it took me a few tries to get back into my basic proficiency using the software. It’s free for both Mac and PC.

I think the biggest issue with MEGA that I’ve run across is not the alignments, pairwise distance analysis or phylogenetic tree software suites, nor all the other options one can do to spruce up the data or make it more visually appealing – it’s figuring out what format the sequence files need to be in from step to step. So, I’ve written out a flow chart of sorts that has helped me…

Read more on my lab e-notebook

Short read sequence typing (srst2) for Vv isolates

The next step in analyzing our Vibrio vulnificus isolates is DNA analysis, specifically by short read sequence typing (srst2). Srst2 allows the facile analysis of specific loci within whole genome data for molecular typing and evolutionary analysis by phylogenetic trees. It is sensitive enough to detect genes and alleles at >5x coverage (Molecular Typing by srst2 Analysis Poster, Sanders 2016).

Basically, how this works is that you have your sample sequence (e.g. Sample_7274) in fastq.gz format in a folder. These are large files containing the sample’s whole genome sequence. Then, I added a 37-gene custom database, a mix of housekeeping and virulence genes that the program will seek to find, match and extract from the whole genome sequence…

Read more on my lab e-notebook

Vibrio vulnificus under Fluorescence

Despite a rather bumpy beginning with finding viable broth cultures to plate (that was yesterday’s adventure), plates grew pretty well overnight. Interestingly enough, the plate with the ideal CFU count was 1:1,000,000 at 1 mL plated instead of the previous dilution (1:10,000 at 0.1 mL plated). I was able to count 83 distinct colony forming units, which falls within the ideal countable range of 30-300 CFU.

Amount Plated	Dilution Factor	CFU	CFU/mL

0.1 mL	1:1,000,000	13	1.30 x 10⁸

1.0 mL	1:1,000,000	83	8.3 x 10⁷

0.1 mL	1:10,000	4 large clumps together

1.0 mL	1:10,000	too many to count
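The CFU/mL column in the table follows from the colony count, the volume plated and the dilution. A minimal sketch of that arithmetic in Python is below; the function names `cfu_per_ml` and `is_countable` are made up for illustration and are not part of any lab software.

```python
def cfu_per_ml(colonies, volume_plated_ml, dilution_fold):
    """Estimate CFU/mL of the original culture from one plate count.

    dilution_fold is the denominator of the dilution,
    e.g. 1_000_000 for a 1:1,000,000 dilution.
    """
    if volume_plated_ml <= 0 or dilution_fold <= 0:
        raise ValueError("volume plated and dilution must be positive")
    return colonies / volume_plated_ml * dilution_fold


def is_countable(colonies, low=30, high=300):
    """True when a plate falls within the 30-300 colony countable range."""
    return low <= colonies <= high


# The two countable rows from the table above (1:1,000,000 dilution).
for colonies, volume_ml in [(13, 0.1), (83, 1.0)]:
    estimate = cfu_per_ml(colonies, volume_ml, 1_000_000)
    print(f"{colonies} colonies from {volume_ml} mL: {estimate:.2e} CFU/mL, "
          f"countable: {is_countable(colonies)}")
```

Only the 83-colony plate sits inside the 30-300 range, which is why its estimate is the one to trust.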

With that in hand, I smeared a tiny amount of cells onto a slide, put a drop of DAPI on top, then covered with a coverslip. DAPI binds strongly to A-T regions of DNA and can stain both live and heat fixed cells (Wikipedia). We don’t heat fix Vv because of potential aerosols…

Read more on my lab e-notebook

The Pursuit of Vv Single Colony Forming Units (CFU/mL) by Serial Dilutions

It’s been a goal of mine to capture a good, clear image of the bacteria I’m working with – Vibrio vulnificus. The tricky part is not the serial dilution, which I’ll expand upon in a moment, but getting a workable (and small) colony forming unit that can be stained with DAPI and visualized on the fluorescent microscope. The good news is that I think I’ve tackled the CFU part successfully & hope to start on fluorescence next week…

Read more on my lab e-notebook

Illumina Sequencing (for Dummies) - An overview of how our samples are sequenced

For the past year (or so), I have been really struggling to understand the rudiments of how Illumina sequencing works, especially with the concept of “paired ends”. I needed a simple, clear explanation of the “for Dummies” variety (I love those books!). I have struggled with the variety of sources out there that describe Illumina sequencing, none of which seem to be exactly how the samples in my lab are sequenced. So here’s my version of an explanation. I think I just may finally be cracking this one conceptually…(and I made a set of video lectures about this too, audio isn’t great so consider them beta but “up”).

First, our samples are sequenced (at the Hubbard Genome Center of UNH) on an Illumina HiSeq 2500 machine. We have 250 bp Paired End (PE) reads done, and our libraries are made by size selecting for fragments not smaller than 350 bp, and not larger than 550 bp. We have about 60 samples in a batch and usually have these sequenced on one full lane. More on that and how it affects coverage in another post.
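As a rough preview of the coverage point: expected coverage is usually estimated as the total number of sequenced bases divided by the genome size. The sketch below uses that rule of thumb. The read pairs per lane and the 5 Mb genome size are hypothetical placeholders, not numbers from our runs, and the function assumes reads are split evenly across samples (which real lanes rarely manage).

```python
def expected_coverage(read_pairs_per_lane, samples_per_lane,
                      read_length_bp, genome_size_bp):
    """Average fold coverage per sample on one lane.

    Assumes paired-end reads (two reads per pair) and an even
    split of read pairs across all samples on the lane.
    """
    pairs_per_sample = read_pairs_per_lane / samples_per_lane
    bases_per_sample = pairs_per_sample * 2 * read_length_bp
    return bases_per_sample / genome_size_bp


# Hypothetical lane yield and genome size; 60 samples and 250 bp reads
# are the values mentioned in the post.
coverage = expected_coverage(
    read_pairs_per_lane=120_000_000,  # assumption for illustration
    samples_per_lane=60,
    read_length_bp=250,
    genome_size_bp=5_000_000,         # assumption: roughly 5 Mb genome
)
print(f"Expected coverage: {coverage:.0f}x")
```

Doubling the number of samples on the lane halves the coverage each one gets, which is the trade-off behind choosing how many samples to pool.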

So what happens once UNH receives our high quality DNA (for us that means PCR amplifiable, looks good on a gel, lots of DNA, verified (we think) to be pure)?

The first thing that happens is library preparation.

This happens as follows (I think). The DNA is fragmented into 350 – 550 bp fragments (I am not sure if this is by sonication, or another method, that’s a question I’m going to ask Steve Simpson). Then those fragments are run through a series of PCR reactions that serve to both amplify the DNA (make more), and place adapters on them. Adapters are sequences that belong on the ends. I’ve drawn a picture below (it’s crude, but hey, it’s mine!) of what those sequences are:

Essentially a library is a sample that contains all of your DNA of interest (in multiple copies), with these adapters (all the coloured sequence) on the ends. Now what?

Next, you need to denature the library fragments (break this double-stranded fragment into single strands), and attach the fragments to the flow cell. Flow cells in Illumina look like glass microscope slides. They have 8 channels in them, which are called lanes. Each lane can hold (in our work) up to as many as 96 samples = 96 bacterial genomes. More on how many we use per lane above and in another post, as this affects coverage. The flow cell (also called chip) has pre-attached short oligos (short sequences of DNA) poking up. I’m going to call them grafts, and they are the P7 and the P5 grafts. In our images P7 is black and P5 is red. Note that in the figure above, there are ends that are labelled as P5/P7 graft binding sites. That’s because those areas are complementary to the P7 and P5 grafts. So… base pairing will occur!

When you flow the single-stranded library across the flow cell your fragments will base pair with the grafts as shown in the figure below, and you’ll now have your library fragments attached to the flow cell.

Like this:

Figure: Fragments bound to flow cell

Now the goal is to make more (a cluster). First, you synthesize the complement, as shown in this image, where dashed blue indicates new synthesis (using regular nucleotides by the way, and DNA polymerase, of course). Once you’ve done that (middle image below), you do something that seems a bit odd (but everything odd in this process is related to having molecules in the right orientation for DNA polymerase to work by the way) – you wash off all of the original template strands. You are left with two molecules attached, which contain each “side” of your original piece of DNA. One we’ll call the P5 (red bottom), and one the P7 (black bottom). Note they are NOT in antiparallel orientation, but both have their 3′ ends poking up.

Continue on with bridge amplification. Remember that all around these molecules are additional P5 and P7 grafts, protruding from the flow cell surface. And that the 3′ end of the P5 and P7 strands have regions that can bind those graft sequences. So they do, making little bridges (see image below). Once you have that bridge, you can re-synthesize the complementary strands. Then you can alter the chemical conditions to let the bridges release, linearizing the strands. Repeat. About 35 x.

This leaves you with clusters that contain both the P5 (labelled above), and the P7 (not labelled, but has black base) strands. In multiple copies.

Then you do something that isn’t super intuitive (note warning above). You wash off all of the P5 strands, leaving you with clusters that are truly clonal meaning identical. Like this image below:

At this point on the flow cell, you have individual clonal clusters scattered in little dots across the entire surface of the flow cell, each like a little island of identical trees (note figure below, each light spot shows the location of a clonal cluster on a flow cell). You are now ready to sequence.

Before we continue with this a little about synthesis in the Illumina system. Illumina has a patented method called “sequencing by synthesis”. It relies on the use of reversible dye terminators. See image below:

This is a chemically modified nucleotide. Instead of having an hydroxyl group at the 3′ carbon on the sugar, it has an “R” group that is the reversible blocking group. This blocks addition of another nucleotide in the growing strand until you chemically regenerate an hydroxyl group. Thus you can control the timing of adding a new nucleotide through phosphodiester bond formation. Also note that the base (which would be an A, C, T or G of course) is attached to a dye. The dye is called a fluorophore, meaning it is a molecule that emits a particular colour when struck with a laser light. There are four colours, one for each possible nucleotide (each has its own kind of fluorophore attached). The cleavable linker is there so that you can wash off the fluorophore after the light is recorded/detected.

In sequencing by synthesis you add a nucleotide to the growing strand, hit it with a laser, record what colour comes out as a nucleotide, then wash off the fluorophore and regenerate an hydroxyl group on the 3′ carbon. Repeat. Record the flashes which = a DNA sequence.
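The add, flash, record, cleave cycle can be mimicked with a toy simulation. This is only a sketch of the idea: the colour assigned to each base is a made-up mapping for illustration (real instruments use their own dye sets), and the function name is hypothetical.

```python
# Hypothetical colour assignment, for illustration only.
DYE_COLOUR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOUR_TO_BASE = {colour: base for base, colour in DYE_COLOUR.items()}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template_3_to_5, cycles):
    """Simulate reading a template one nucleotide per cycle.

    template_3_to_5 is the template strand written 3' to 5', so the
    growing strand is built 5' to 3' by pairing with each base in turn.
    Returns the recorded colours and the read they translate to.
    """
    flashes = []
    for template_base in template_3_to_5[:cycles]:
        # One blocked, dye-labelled nucleotide pairs with the template.
        incorporated = COMPLEMENT[template_base]
        # The laser excites the dye and the colour is recorded.
        flashes.append(DYE_COLOUR[incorporated])
        # In the real chemistry the dye is now cleaved off and the
        # 3' hydroxyl regenerated, ready for the next cycle.
    read = "".join(COLOUR_TO_BASE[colour] for colour in flashes)
    return flashes, read


flashes, read = sequence_by_synthesis("TACGGT", cycles=4)
print(flashes, read)
```

The number of cycles sets the read length, so a 250 bp read corresponds to 250 rounds of this loop.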

The order in which sequencing by synthesis works with paired end methods (which are used on our samples) is as follows, picking up from the earlier figure showing a clonal cluster ready for sequencing, but focusing in on only one molecule/strand at a time.

Then,

This step takes advantage of the ability to bridge, as was used for the initial bridge amplification steps that generated the clonal clusters. What’s really cool is that the I5 sequence primer is built into the P5 graft itself, so you can immediately sequence the I5 after bridging occurs. Once that is done there is an additional step, needed to get template molecules in the right orientation to do the last step of sequencing, which is obtaining the Read 2 sequence.

That’s shown below,

Once you are back to having a clonal cluster that is all P5 strands – sequence Read 2,

When you are done sequencing Read 1, I7, I5 and Read 2, you are ready to demultiplex your data and analyze!

Why do you need Read 1 and Read 2? What is their relationship to the idea of “Paired Ends”?

In our work each read is 250 bp long. Read 1 is called the “forward” read, or the R1, and Read 2 is the “reverse” or R2 read (R1 and R2 are used in the file names – see post on that). The Illumina system knows that a Read 1 and Read 2 belong to the same piece of DNA because they will be physically “read” off the same spot on the chip. For example, in the coloured figure above (taken from the EBI-EMBL website which has a nice course explaining next generation sequencing), there is a series of images taken moments apart of the same area on a chip. The little numbers show sequence collected over time. The Read 1 sequence would happen first, then the two indices (I7 and I5), and then the Read 2 last. But all the light would flash off the same spot, showing that all those sequences belonged to one sample. The index sequences would “tell” the system what sample those sequences belonged to. The knowledge that Read 1 and Read 2 come from the same piece of DNA helps in assembly (more on that elsewhere).
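To make the "paired ends" idea concrete, here is a small sketch that takes one library fragment and returns the two reads you would expect from it. It is a simplification: it treats Read 1 as the start of the fragment's top strand and Read 2 as the start of the opposite strand, ignores adapters and sequencing errors, and the function names are invented for illustration.

```python
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]


def paired_end_reads(fragment, read_length=250):
    """Return (read1, read2) for a fragment given as its top strand 5'->3'.

    Read 1 starts at one end of the fragment; Read 2 starts at the
    other end and is read off the opposite strand.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


def overlap_bp(fragment_length, read_length=250):
    """Bases in the middle of the fragment covered by both reads."""
    return max(0, 2 * read_length - fragment_length)


read1, read2 = paired_end_reads("AACCGGTTAC", read_length=4)
print(read1, read2)

# With 250 bp reads, the library size range from the post behaves like this:
for length in (350, 500, 550):
    print(length, "bp fragment:", overlap_bp(length), "bp read by both ends")
```

With 250 bp reads, fragments shorter than 500 bp are read in the middle by both ends, while longer fragments leave an unread gap between Read 1 and Read 2. Either way, knowing the two reads sit a fragment-length apart is what helps assembly.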

We will write about multiplexing and demultiplexing in another post, but essentially multiplexing is the idea that you can use index sequences to label each sample’s DNA, then you can mix all the samples you have together (within reason), pour them across a flow cell etc. and sequence. Once you collect the sequences the process of demultiplexing involves using the index sequences associated with all of the Reads, and sorting them into “bins” that belong to each original sample. In our case those original samples are individual bacterial isolates in pure culture.
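The sorting-into-bins step can be sketched in a few lines. This is a toy version under stated assumptions: the index sequences and sample names are invented, the `demultiplex` function is hypothetical, and it only accepts exact index matches, whereas real demultiplexing software typically tolerates a mismatch or two.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet):
    """Sort reads into per-sample bins using their (I7, I5) index pair.

    reads: iterable of (i7_index, i5_index, sequence) tuples.
    sample_sheet: dict mapping (i7_index, i5_index) -> sample name.
    Reads whose index pair is not in the sample sheet are binned
    as "undetermined".
    """
    bins = defaultdict(list)
    for i7, i5, sequence in reads:
        sample = sample_sheet.get((i7, i5), "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


# Invented index sequences and sample names, for illustration only.
sample_sheet = {
    ("ATCACG", "TTAGGC"): "isolate_A",
    ("CGATGT", "TGACCA"): "isolate_B",
}
reads = [
    ("ATCACG", "TTAGGC", "GATTACA"),
    ("CGATGT", "TGACCA", "CCGGAAT"),
    ("ATCACG", "TTAGGC", "TTGACCA"),
    ("GGGGGG", "TTAGGC", "ACGTACG"),  # index pair not on the sample sheet
]
for sample, sequences in demultiplex(reads, sample_sheet).items():
    print(sample, sequences)
```

Using the pair of indices, rather than a single one, is what lets a modest set of index sequences label many samples on the same lane.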

That’s all for now! I will tidy this post up a bit more, but am putting it live in hopes that it will be helpful to others.

Loren


Online bioinformatics resources

There is an almost dazzling amount of sequence data out there, with more arriving each day! Below are two graphs showing the amount of sequence data in GenBank, as number of nucleotides (bases) on the left, and total sequences on the right. Blue is simple numbers and red is those derived from whole genome sequences.

(image taken from the NCBI here).

The accumulation of data is staggering – and those numbers are only going to keep increasing. We are in a world of BIG data. As a person who loves learning how to do new things using my computer, I’m pretty excited about that. I’m also really excited about helping others discover that they can work with BIG data too – hence the introduction of bioinformatics (with a genomics focus), into my Genetics class, my research program, and the course Introduction to Genomic Bioinformatics at Keene State!

A big part of learning to work with sequence data is learning to navigate the database resources that exist to hold sequences, and to allow us to better analyze sequence data. Below are some of the major ones. Going forward, blog posts by me, my colleague Brian, and my students will be expanding on these resources. I’m putting this content on my blog today because one of my students asked me earlier “what does understanding KEGG have to do with this (Genomic Bioinformatics) class?”. Great question! And I’m not so sure that I was able to convince him that KEGG is a wonderful tool for allowing us to see how a particular gene sequence can be linked to the larger biochemical reality that is a phenotype (and really, an organism). Let’s hope I can develop better ways to conceptualize and explain this going forward as we work together, and that we can capture some of that in this and linked blog posts.

Resource List:

Data repositories (and SO much more)

The National Center for Biotechnology Information (NCBI): Based in the U.S., a data repository, host and developer of online and downloadable analysis tools, the host of GenBank (the U.S.’s National Institutes of Health annotated database) and more.

The European Molecular Biology Lab (EMBL)- European Bioinformatics Institute (EBI): Based in the E.U., a data repository as well as host and developer of online and downloadable analysis tools, and more.

The National Institute of Genetics (NIG): Based in Japan, a data repository and more (as above!).

Kyoto Encyclopedia of Genes and Genomes (KEGG): Based in Japan, this is an incredible database resource that essentially links genetic data (i.e. sequences) ultimately with phenotype, by placing genes in the context of metabolic pathways. One of the most interesting features of KEGG is its hyperlinked metabolic pathway maps.

UniProt: A resource formed by combining the resources of PIR (Protein Information Resource, US based), the Swiss Institute of Bioinformatics (SIB) and the EBI (above). Like KEGG, UniProt focuses on protein data. A component of UniProt is UniProtKB/Swiss-Prot, which includes highly curated data (meaning double-checked, and in this case manually, so it is very trustworthy).