Bacteriophages: Genes and Genomes
Transcript of Part 3: Mycobacteriophage genomics
00:00:01.00 Hello. My name is Graham Hatfull. 00:00:03.08 I'm a professor at the University of Pittsburgh 00:00:05.23 and a Howard Hughes Medical Institute professor. 00:00:08.21 Today we are talking about bacteriophages, their genes and their genomes, 00:00:12.23 and in part three we are going to focus in on a comparative analysis 00:00:17.00 of a particular type of bacteriophages. These are the mycobacteriophages, phages that infect mycobacterial hosts. 00:00:26.09 And so I should explain why we would want to choose phages of a particular host. 00:00:37.00 And indeed, why we would want to focus on this particular group. 00:00:40.27 So, perhaps one of the most important aspects is that phages 00:00:46.10 that infect very different bacteria tend to be very unrelated to each other. 00:00:51.04 And therefore there is not much to be learned about the detailed mechanisms 00:00:56.02 of phage evolution by comparing them. 00:00:59.13 They are so different there is little to be learned. 00:01:02.23 On the other hand if we were to focus on the phages that infect a common bacterial host, 00:01:08.01 then we would argue that they must all be in some way 00:01:12.17 in genetic, at least potentially, in genetic communication with each other. 00:01:18.22 And then comes the question as to well which bacteria host should we use 00:01:22.13 in order to isolate and characterize these viruses? 00:01:27.06 And there's many of course bacteria to choose from. 00:01:32.05 If we had to think of them as ones that would be the most useful, the most interesting, 00:01:36.20 we might want to think about focusing on some bacterial pathogens. 00:01:40.17 Or alternatively bacteria that are important for other criteria. 00:01:46.15 Environmentally important, or other key aspects of their biology. 00:01:53.17 So we focused on the mycobacteriophages. 
00:01:58.07 And in part because we think that the mycobacterial hosts are of sufficient importance 00:02:06.00 that they really warrant taking advantage of the viral systems that we could develop. 00:02:13.10 Not just for understanding the viruses, but for understanding the hosts that they infect. 00:02:18.17 And so I'll mention two bacterial species within this genus. 00:02:26.22 One is Mycobacterium tuberculosis, which is the causative agent of human TB. 00:02:34.19 And I'll mention a relative of Mycobacterium tuberculosis, which is called Mycobacterium smegmatis, 00:02:41.10 and this is important because it is a very helpful surrogate for us to use in the lab. 00:02:47.07 Mycobacterium tuberculosis we can grow in the lab, 00:02:51.18 but we have to be very cautious and careful with it for two reasons. 00:02:55.25 Primarily because it is a rather nasty bacterial pathogen, 00:03:01.22 and we certainly don't want any of us working in the lab to be infected with that organism. 00:03:08.19 But it has another feature that somewhat complicates its growth and manipulation in the lab. 00:03:13.22 And that is that it grows extremely slowly. It has a doubling time of about 24 hours. 00:03:18.19 So it takes a day to go from one cell to two cells with Mycobacterium tuberculosis. 00:03:24.28 That makes research pretty slow going on M. tb., 00:03:31.23 but you also have to be very careful about sterility and your aseptic technique 00:03:36.18 because almost everything out there grows faster than Mycobacterium tuberculosis, 00:03:41.24 and if you are not careful, you will end up growing that rather than M. tb. 00:03:46.09 Mycobacterium smegmatis, in contrast, is a non-pathogen. 00:03:51.20 It does not cause disease in healthy adult human beings, 00:03:56.20 and it grows relatively quickly. 
It has a doubling time of about three hours, 00:04:01.29 which means that we can grow a lawn, a smooth lawn, of Mycobacterium smegmatis 00:04:05.29 on Petri dishes in about 24 hours, and we can grow individual colonies in three to four days. 00:04:14.02 Mycobacterium tuberculosis is actually a very serious and important human pathogen. 00:04:21.13 About two million people a year die from Mycobacterium tuberculosis infections, from TB. 00:04:31.02 And it is estimated that Mycobacterium tuberculosis kills more people 00:04:35.27 in the world than any other single, infectious agent. 00:04:39.05 Many people that are infected with the organism actually don't get disease 00:04:45.09 because the bacterium establishes a latent infection and doesn't cause health problems. 00:04:53.20 Although, it can do either with old age, 00:04:58.15 or with a compromise of your immune system, such as for example with HIV infection. 00:05:04.29 These are not only a very prevalent... it's a very prevalent disease is tuberculosis, 00:05:10.24 but there is a growing and widespread concern 00:05:13.22 about drug-resistant strains of Mycobacterium tuberculosis, 00:05:17.27 that are either difficult to treat or effectively untreatable. 00:05:23.18 There's clearly a need for new strategies for diagnosis, prevention, and cure of tuberculosis. 00:05:32.02 And so we think these are good reasons to focus on the phages 00:05:36.20 that infect these organisms in the hope that they could contribute towards that specific cause. 00:05:43.25 And so this is an important point. The mycobacteriophages can really lead us in two directions. 00:05:51.10 They can tell us about viral diversity and the evolution of bacteriophages, 00:05:56.09 and at the same time they can provide tools for controlling TB 00:06:01.01 and in fact can provide elements that we need to manipulate TB to understand it and to work with it. 
00:06:09.05 I am not going to focus here too much on the specific applications of the mycobacteriophages. 00:06:16.27 I thought that I would just mention one in passing, 00:06:19.28 which is the use of mycobacteriophages as a novel type of diagnostic system 00:06:24.28 in order to test whether a person is infected with TB 00:06:30.21 and indeed whether it is a drug resistant or a drug sensitive strain. 00:06:35.19 This is a strategy which was first described by my colleagues Bill Jacobs and Barry Bloom. 00:06:41.07 The idea is to make so called reporter mycobacteriophages, 00:06:45.26 recombinant phages that carry a gene that can report 00:06:50.27 and tell us about the metabolism of the mycobacterial cell. 00:06:57.04 So you can construct reporter phages that carry a gene 00:07:01.11 such as firefly luciferase that will make the bacteria emit light. 00:07:06.13 Or we can make reporter phages that carry green fluorescent protein from jellyfish 00:07:13.06 that when that is introduced by infection of the host, it makes the cell fluoresce. 00:07:18.12 And we can use these properties, fluorescence or light emission, 00:07:23.10 in order to then monitor what type of bacteria a particular patient is infected with. 00:07:30.18 And so this is an idea that I think shows considerable promise 00:07:33.05 and is currently undergoing further research and development. 00:07:37.20 If we want to compare the genomes of mycobacteriophages 00:07:48.13 in order to understand how they are related to each other, 00:07:52.06 how they've evolved, what their diversity is, well, 00:07:55.23 we need to have the mycobacteriophages in order to characterize. 00:08:00.08 And so we have gone out over the past few years 00:08:03.21 to isolate new mycobacteriophages and to genomically characterize them. 
00:08:10.04 And whilst this has been a major focus in my laboratory, 00:08:14.17 this has also proven a very successful approach for both high school students 00:08:22.23 and undergraduate students to become involved in research endeavors 00:08:28.28 by going out and isolating new mycobacteriophages and sequencing them. 00:08:33.03 And now with the Howard Hughes Medical Institute science education alliance, 00:08:38.27 there are hundreds of students who are contributing to this cause, 00:08:42.16 and because of this we now have many new mycobacteriophages to characterize and to compare. 00:08:51.02 The process is relatively simple. 00:08:53.11 We start with a sample of soil or compost or wherever you might think to go 00:09:01.20 and look and to find out if there are some bacteriophages present. 00:09:04.18 The sample is mixed up with some liquid. The particulate matter is removed. 00:09:12.02 And we simply incubate some of that in the presence of our permissive bacterial host, 00:09:18.12 which is Mycobacterium smegmatis. 00:09:20.07 We lay those out on a Petri dish, as shown here, and we look for plaques, 00:09:26.02 for areas where a phage that was present in our original sample 00:09:32.09 has now infected these cells to form a plaque. 00:09:34.25 We can then pick an individual plaque, purify it, remove all of the other contaminants, 00:09:43.15 and we can propagate it in the laboratory until we have a high titer or a concentrated stock. 00:09:49.12 From that we can make DNA. 00:09:51.08 The DNA can be sequenced to give us tens of thousands of nucleotides of sequence information, 00:10:00.20 and then we use computational approaches and bioinformatics 00:10:04.20 to predict where all the genes are in these genomes, and then we can compare them. 00:10:11.01 So we are using Mycobacterium smegmatis as our host, 00:10:16.24 fast growing, non-pathogen, and our samples predominantly come from soil and compost. 
00:10:23.06 We have usually just simply plated out the sample with our permissive host, 00:10:28.24 but because the specific phages that we're after can be present at relatively low concentrations, 00:10:37.05 there is an approach that can be used with enrichment, where you simply take your soil or your compost sample, 00:10:43.00 you mix it and incubate it with some permissive host cells, 00:10:47.21 in this case Mycobacterium smegmatis, that allows even the small number of particles 00:10:53.23 that may be present to infect, to reproduce themselves, 00:10:57.27 and so that when it comes to the plating and the identification of 00:11:03.06 plaques they're present at higher concentrations. 00:11:05.20 There's a couple of different approaches, 00:11:08.03 but this is a relatively reproducible and simple process for discovering new phages. 00:11:16.02 So by this point thousands of mycobacteriophages have been isolated 00:11:19.06 using Mycobacterium smegmatis as a host. 00:11:22.16 I should state I think that some of these infect smegmatis, 00:11:27.14 but don't infect Mycobacterium tuberculosis, whereas others do. 00:11:32.19 And so we use a surrogate strain, Mycobacterium smegmatis, as a host, 00:11:37.23 but it is likely that the host range, the cell preferences of the phages that we isolate 00:11:43.28 are going to be all over the place and at this stage are not well defined. 00:11:47.13 We've... the most recent publication that describes the characterization of these 00:11:56.07 appeared earlier this year in 2010, and described a comparative analysis of 60 of these. 00:12:04.14 But because of the impact of the science education alliance program 00:12:10.21 as well as the ongoing studies of Pittsburgh, 00:12:12.17 the number of new phages and sequenced genomes, it is positively exploding. 00:12:18.27 And at this point in the middle of October in 2010, 00:12:23.28 154 completed genome sequences and much analysis awaiting to be done. 
00:12:33.11 All of these phages, it turns out, even though they don't have to be, 00:12:38.20 are double stranded DNA tailed phages. 00:12:42.14 We haven't isolated any RNA phages or any single stranded DNA phages. 00:12:48.02 They are all double stranded DNA, tailed phages. 00:12:51.12 Now in part one of this lecture we saw that perhaps the most common order of bacteriophages 00:12:59.06 are the Caudovirales, the double stranded DNA, tailed viruses. 00:13:02.21 Just like these that I showed you. I also told you that there's three common types. 00:13:09.06 The so-called Siphoviridae with the long flexible tails, the Myoviridae with the contractile tails, 00:13:14.24 and the Podoviridae that have short stubby tails. 00:13:17.21 If we just compare the morphotypes of these 60 genomes, 00:13:23.13 which have been analyzed and published, 00:13:27.09 53 of them are of this Siphovirus type, 7 of them are of the Myovirus type. 00:13:33.08 We have no Podoviruses at all. 00:13:37.21 And so these numbers appear to hold true for the larger collection 00:13:42.05 of mycobacteriophages, and therefore we have growing confidence in the idea 00:13:47.15 that there really are no Podoviruses amongst the mycobacteriophages. 00:13:53.00 We don't know whether this is because phages with the short stubby tails 00:13:58.20 are physically incapable of infecting bacteria like the mycobacteria 00:14:04.25 that have thick and chemically complex cell walls, 00:14:08.06 or whether it's just a reflection of a restriction 00:14:12.12 of evolutionary opportunities to generate those types of phages. 00:14:19.04 So that's a little bit of a mystery as we don't have any Podoviruses, 00:14:24.18 but we have lots of examples of these other two morphotypes. 00:14:28.28 When we look at the genomes there are some basic parameters 00:14:34.21 that we can see that are helpful in thinking about what these genomes are like. 00:14:38.11 First of all, the average length of all of them is 72,588 base pairs. 
00:14:47.01 We don't really understand why mycobacteriophages would have that particular length. 00:14:51.07 Phages of other bacterial hosts often have very different average lengths 00:14:58.02 including those that are only half as long as the average mycobacteriophage genome. 00:15:03.09 And so we don't really know what determines this parameter, 00:15:08.02 either for the mycobacteriophages or indeed for any other phages. 00:15:12.28 There's also a large range in size from a little under 42,000 base pairs 00:15:20.23 up to about 164 and a half thousand base pairs. 00:15:25.06 So there is a lot of diversity in terms of size range. 00:15:29.25 The GC content on average for all of these 60 genomes is about 63 and a half percent. 00:15:37.00 A number which closely mirrors the GC content of the bacterial host Mycobacterium smegmatis. 00:15:44.08 And that's not a surprise because it has been seen from the analysis of phages of other bacterial hosts 00:15:52.27 that the GC content of the phages often mirrors that of the hosts. 00:15:57.09 What's perhaps more surprising, however, is that the range of GC content 00:16:03.10 amongst these phages is actually really amazingly broad 00:16:07.09 spanning from 56.3% at the lower end up to 69% at the upper end. 00:16:14.28 And we've been trying to think for some time as to what this span of GC content reflects. 00:16:23.16 One attractive idea although it remains to be fully tested is that these particular mycobacteriophages 00:16:33.13 whilst they have a common host in Mycobacterium smegmatis 00:16:37.12 may not necessarily have been infecting Mycobacterium smegmatis 00:16:43.17 as their preferred bacterial host in the environment from which we recovered the phages 00:16:49.13 in their recent ecological and evolutionary times. 00:16:54.11 In other words, they may have preferences for infecting some other bacterial host 00:17:00.20 that we have yet to figure out what that is. 
00:17:03.24 But that might account for the range of GC content that we would see. 00:17:09.06 And so one of the things that we would like to do to test this idea 00:17:11.28 is to actually determine the specific host range 00:17:15.18 on a whole range of bacteria that are related to the Mycobacteria 00:17:21.24 to see if we can discern a pattern or a correlation between GC content and the host preferences. 00:17:27.17 And finally if we look at the number of genes that are present, 00:17:31.21 of these 60 genomes there is a total of 6858 open reading frames or putative protein coding genes, ORFs, 00:17:40.19 about a hundred and fourteen ORFs on average per genome. 00:17:45.27 And interestingly the average ORF size, the average size of an open reading frame, is only 616 base pairs. 00:17:55.14 That's about two thirds of the average size of a bacterial gene. 00:18:02.22 And this appears to be a parameter which is true not just for the mycobacteriophages, 00:18:08.08 but for other bacteriophages that people have looked at. 00:18:11.01 And we've been interested as to why this number 00:18:13.13 should be quite so different from that of the bacterial host. 00:18:17.14 It fits, however, I think, with the idea that illegitimate recombination 00:18:23.10 is playing a key role in how these genomes evolve. 00:18:29.06 And in fact we can see that many of the segments of DNA that appear to have come in 00:18:34.26 relatively recently from other genomes tend to be on the small side. 00:18:39.07 And therefore we can think of this process of evolution, as we talked about in part two, 00:18:46.29 may actually contribute to driving the average gene size down. 00:18:52.20 So we can take our 60 genomes, and we can ask the question: 00:18:58.25 "how are they related to each other at the nucleotide sequence level?" 00:19:03.27 And we can use an approach that we saw in part two, 00:19:10.17 which is where we can compare the nucleotide sequences in a dot plot analysis. 
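The summary figures quoted for the 60 published genomes can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not part of the talk: the numbers are the ones quoted in the transcript, while the helper `gc_content` and the "coding fraction" estimate are assumptions added here. The estimate multiplies two averages and ignores gene overlaps, so it is only a rough back-of-the-envelope figure.

```python
# Figures quoted in the talk for the 60 published genomes.
n_genomes = 60
total_orfs = 6858
mean_genome_bp = 72_588
mean_orf_bp = 616

# Average number of ORFs per genome (the talk quotes about 114).
orfs_per_genome = total_orfs / n_genomes

# Rough fraction of an average genome covered by ORFs.
# Assumption: product of averages, no correction for overlapping genes.
coding_fraction = orfs_per_genome * mean_orf_bp / mean_genome_bp


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (hypothetical helper)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


print(round(orfs_per_genome, 1))
print(round(coding_fraction, 2))
print(gc_content("GGCCAT"))  # toy sequence, not phage data
```

A coding fraction this close to one would be in line with the later remark that the genes are tightly packed with few non-coding regions.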
00:19:16.20 And one way of doing this is illustrated here. 00:19:21.29 Now what we've done is to take our 60 genomes, 00:19:24.29 and we've simply joined them together end to end to make a long concatamer, 00:19:30.26 and we've done that in random order. 00:19:33.01 We've just taken our sixty sequences joined them together 00:19:36.14 to get a long span and then simply compared them with each other. 00:19:40.02 Not surprisingly there is a diagonal line from the top left to the bottom right 00:19:45.22 because that simply tells us that every phage genome is identical to itself. 00:19:51.22 That is a good thing. 00:19:53.00 And then there's a number of diagonal lines you can see 00:19:57.05 where a particular phage in this part of the array 00:20:01.24 is similar to a second phage that is sitting in a different part of the array. 00:20:09.08 And because the genomes are in a random order in this concatamer, 00:20:13.10 these various types of relationships are scattered over this dot plot. 00:20:21.06 And we can see though, I think, that we have phage genomes that are similar to each other, 00:20:27.16 but there must be many that are completely dissimilar to each other at the nucleotide sequence level. 00:20:33.06 So having done this and identified, generally speaking, who is most closely related to who else, 00:20:40.13 what we can do is we can take each of the genomes 00:20:44.03 and we can change the order in which we've arrayed them in this concatamer, 00:20:50.10 and then repeat this computational comparison. 00:20:54.09 So when we do that, this is what the plot looks like. 00:20:57.14 And so all we've done is simply to group the genomes together that are similar to each other. 00:21:04.26 So for example if you look in the top right hand corner all of those genomes that are similar to each other are positioned 00:21:10.10 next to them in the top left hand part of the plot. 
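The concatenate-and-compare idea described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used for the figures in the talk: the function `dot_plot`, the word size `k`, and the three toy "genomes" are all made up for the example. A dot is recorded wherever a short word of length `k` occurs in both sequences.

```python
from collections import defaultdict


def dot_plot(seq_x: str, seq_y: str, k: int = 4):
    """Return (i, j) pairs where the k-mer starting at seq_x[i] equals the one at seq_y[j]."""
    index = defaultdict(list)
    for j in range(len(seq_y) - k + 1):
        index[seq_y[j:j + k]].append(j)
    return [
        (i, j)
        for i in range(len(seq_x) - k + 1)
        for j in index.get(seq_x[i:i + k], [])
    ]


# Toy sequences (hypothetical): "a" and "c" are identical, "b" is unrelated.
genomes = {"a": "ACGTTGCA", "b": "GGGGCCCC", "c": "ACGTTGCA"}

# Join the genomes end to end, then compare the concatemer with itself.
order = ["a", "b", "c"]
concat = "".join(genomes[name] for name in order)
dots = dot_plot(concat, concat)

# Off-diagonal dots reveal that the first and third genomes match.
off_diagonal = [(i, j) for i, j in dots if i != j]
print(off_diagonal)
```

Changing `order` so that similar genomes sit next to each other moves the off-diagonal runs close to the main diagonal, which is the reordering step described above.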
00:21:13.03 We can take this gross nucleotide sequence similarity 00:21:18.02 to put the genomes together into what we refer to as clusters. 00:21:24.04 Such as Cluster A, Cluster B, C, D, E, etc. 00:21:27.10 And so those clusters go up to cluster I, 00:21:30.13 and on the right hand side where it says Sin, 00:21:36.05 this corresponds to what we refer to as singleton genomes. 00:21:41.02 And out of these 60 genomes, there are 5 that are singletons, 00:21:45.11 which means that each of those has no close relatives 00:21:49.10 either here or anywhere through the biological world. 00:21:56.00 There is some important texture to this grouping and these clusterings, 00:22:01.25 and we can readily identify some clusters as being, having more than one closely related type. 00:22:11.14 And we therefore subdivide the cluster into sub-clusters. 00:22:16.09 You can see here for the cluster C that there are many of these genomes, 00:22:21.13 in fact almost all of them are very similar to each other, 00:22:25.08 and constitute sub-cluster C1, and then there is a single genome over here 00:22:31.10 which is related to the other C cluster genomes, but less so, so that constitutes sub-cluster C2. 00:22:41.10 So we have a large number of different types of genomes, 00:22:43.27 more than twenty substantially different types of genomes, 00:22:47.02 just within this group of 60 that we are looking at. 00:22:51.21 And so each of these genomes, and you can see them identified by name here, 00:22:57.05 as we zoom in on the different clusters and sub clusters. 00:23:00.18 Here we are looking at clusters A through to E. 00:23:03.15 Cluster C as I indicated can be divided into sub-cluster C1 00:23:10.25 with Bxz1, Cali, Catera, Rizal, ScottMcG, and Spud. 00:23:16.26 And then Myrna is the sole member of sub-cluster C2. 00:23:20.22 And these are the remaining clusters, F, G, H, and I. 
00:23:27.24 And then here are the singletons over on the right hand side here: 00:23:31.14 Corndog, Giles, TM4, Wildcat, and Omega. 00:23:37.00 And so we can take each of these genomes that we've assorted with each other 00:23:47.17 according to their nucleotide sequence similarity, 00:23:50.18 or if they are singletons, they're one of a type. 00:23:53.14 We can generate the genome maps, 00:23:55.26 and we can see what features they have and what they look like. 00:23:58.18 This is showing Giles, which I introduced previously in part 2 of the lecture, 00:24:03.29 and you can see its densely packed genes with the rightwards transcribed genes above the DNA, 00:24:12.22 and the leftwards transcribed genes below the DNA. 00:24:15.20 It is densely packed and we've color coordinated these genes according to their relatives. 00:24:22.03 And so we now have these genome maps for all of these phage genomes 00:24:28.11 and these maps then can be compared, 00:24:31.01 and in fact the genes and the predicted proteins can be compared as well. 00:24:35.23 So we look at these 60 mycobacteriophages, 00:24:39.20 and we see that the genes are tightly packed with few non-coding regions. 00:24:42.29 There's many, many genes, but there appears to be few operons. 00:24:49.11 Meaning that we think that there may be a hundred genes, but there may be only 2, 3, or 4 sites 00:24:56.09 for transcription initiation or promoters that are used to express these genes. 00:25:02.19 We actually know very little about the patterns of gene expression of any of these phages, 00:25:07.05 but the bioinformatic predictions are that there will be blocks of genes that are transcribed together. 
00:25:15.16 The virion genes, those are the genes that encode the structural components, the heads and the tails, 00:25:24.23 those genes typically tend to be grouped together in the genome, 00:25:28.19 and they have a common order or synteny 00:25:32.09 which is conserved even though the genomic sequences may be extremely different to each other. 00:25:40.23 Especially once we examine the parts of the genomes outside of these virion genes, 00:25:46.14 we find vast numbers of genes, many of them relatively small, 00:25:51.03 which have a completely unknown function. 00:25:54.25 And we have failed to predict what they can do simply from comparing them with other genomes. 00:26:01.22 And so what we've done is to create a computer program. 00:26:08.06 This was a program called Phamerator, and it was written by a colleague of mine, Dr. Steve Cresawn, 00:26:13.27 which can then begin to analyze all of the genes and how they are related to each other 00:26:19.05 by comparing them at the amino acid sequence level. 00:26:22.24 This is really important because so far I have shown you how 00:26:26.29 we can compare genomes at the nucleotide sequence level. 00:26:30.05 I also showed you that we have lots of examples because we have many different types of genomes 00:26:36.08 that appear to not share nucleotide sequence similarity 00:26:41.10 even though they are in genetic communication with each other, at least in principle, 00:26:46.10 because of the use of the common host. 00:26:49.05 Just because they don't have nucleotide sequence similarity 00:26:53.00 doesn't mean that they are completely unrelated. 00:26:56.00 And in fact, once we start to look at the gene relationships 00:27:01.16 by comparing the amino acid sequences 00:27:04.22 we can begin to see the patterns that reflect the common origins of the phages, 00:27:10.15 even though they no longer share nucleotide sequence similarity. 
00:27:14.06 And so this program that Steve Cresawn wrote 00:27:19.07 called Phamerator facilitates this in a very important process. 00:27:25.07 What it does is it takes each of these open reading frames out of 60 genomes, 00:27:30.06 we have these 6,858 genes. 00:27:33.14 It takes each of the predicted proteins 00:27:37.10 and compares them with everything else 00:27:39.23 using alignment programs such as BLASTp and Clustal. 00:27:45.23 Genes which are related to each other because 00:27:49.21 they meet a particular threshold of similarity we group together. 00:27:55.13 And we put them in groups, and those groups are called phamilies or phams. 00:27:59.20 And of these 6,858 genes we have a total of 1,523 distinct phamilies or sequences. 00:28:12.04 A large proportion of those are what we refer to as "orphams". 00:28:18.13 They are phamilies but they only contain a single member. 00:28:21.25 Not because we believe that other members don't exist 00:28:26.14 but because this population of phages appears to be very diverse 00:28:31.12 and presumably quite large, 00:28:33.10 and we simply haven't yet identified the relatives 00:28:36.29 of these orphams that constitute these phamilies. 00:28:40.03 And so this is about 45% of all of our phamilies only have a single member. 00:28:45.16 This Phamerator program is extremely helpful for generating the maps 00:28:51.25 and displaying the relationships that help us understand 00:28:54.19 the mosaic components by which these are put together. 00:28:59.10 And so here I am showing segments of four genomes, 00:29:01.22 that you can see, just parts that are aligned 00:29:07.03 showing the boxes here and the numbers above the boxes such as here at the top in the middle, 1406, 00:29:16.13 refers to a particular phamily. That's a phamily number for which that gene is a member. 00:29:22.12 And then in this display we can color coordinate the degree 00:29:27.14 of sequence similarity at the nucleotide level between the various genomes. 
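The grouping of genes into phams and orphams can be illustrated with a small sketch. To be clear about assumptions: the talk does not give Phamerator's thresholds or its exact grouping rule, so the code below simply uses single-linkage grouping (any chain of above-threshold hits puts genes in the same family), and the gene names, scores, and the function `build_phams` are hypothetical. Real scores would come from alignment programs such as BLASTp or Clustal.

```python
def build_phams(genes, hits, threshold):
    """Group genes into families: genes linked by any chain of hits >= threshold share a family."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the family representative, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), score in hits.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


# Hypothetical genes and pairwise similarity scores (0 to 1), for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = {("g1", "g2"): 0.9, ("g2", "g3"): 0.6, ("g3", "g4"): 0.2}

phams = build_phams(genes, hits, threshold=0.5)
orphams = [p for p in phams if len(p) == 1]

print(phams)
print(len(orphams) / len(phams))  # fraction of families with a single member
```

In this toy example g1, g2 and g3 fall into one family, while g4 and g5 each end up as single-member families, the analogue of the orphams described above.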
00:29:32.26 And this is actually reflecting a part that I showed you... a part of these genomes 00:29:36.22 that we talked about in part two. 00:29:39.18 Now we can do this type of representation with large numbers of these genomes. 00:29:47.05 When we look at particular clusters, any particular cluster 00:29:51.03 can have genomes that are very similar to each other, 00:29:55.28 or they can be actually quite diverse, depending on the particular cluster 00:30:00.25 that you look at and the degree of sequence similarity. 00:30:04.03 I am just illustrating this with the clustered G phages, 00:30:09.06 for which in our expanded set we actually have 4 members now, 00:30:13.29 and the color coordination, the purple between these four genomes illustrates how very closely related they are. 00:30:21.25 And when we compare the colors of the genes at the protein levels, 00:30:25.21 you can see that these are also very similar. 00:30:30.13 This method is very powerful in part because it is a very easy way 00:30:35.18 of seeing rather smaller differences 00:30:37.27 that nonetheless have played a key role in how these genomes have evolved. 00:30:42.20 For example, down in the right hand end you can see these convolutions 00:30:46.22 here of segments that have been lost from one genome or gained by another. 00:30:51.13 And in fact this illustrates the finding of a new mobile genetic element, 00:30:57.10 a new ultra small transposon that appears to play a role in these particular... 00:31:02.08 in the evolution of these particular genomes. 00:31:05.09 So in part two we saw a lot about how mosaicism is the key architectural feature 00:31:14.16 of bacteriophage genomes. Because now we are looking at this group of mycobacteriophages 00:31:22.12 infecting a common host, 00:31:24.15 we have lots of examples where even though there is no nucleotide sequence similarity, 00:31:29.23 we can see that the genes are shared through common amino acid sequence similarity. 
00:31:37.15 And therefore we can look at patterns that are contributing in generating the process of genome mosaicism 00:31:46.09 even in the absence of substantial sequence similarity. 00:31:51.23 And what we find is a massive amount of mosaicism 00:31:56.11 where the modules that contribute to the structure of the genome often correspond simply to single genes. 00:32:08.14 So modules correspond to single genes when we conduct this type of analysis. 00:32:15.03 And we've developed a particular tool for representing this, representing the phylogenies if you like, 00:32:24.08 where we can take individual Phams- here is one Pham 233 and here is another Pham, Pham 471- 00:32:31.27 and in these representations, we've simply drawn as points around the circle 00:32:38.24 all of the genomes that we have available to us, 00:32:41.16 and for that particular sequence family we've drawn an arc between those genomes 00:32:48.28 that have a member of that Pham. 00:32:52.29 And therefore it essentially represents or reflects the phylogeny 00:32:57.26 or the evolutionary history of this particular family of sequences. 00:33:02.09 In the top part of the figure I've just shown a small segment of phage Omega from genes 125 to 128. 00:33:11.27 Gene 126 in Omega has a relative that we can see through amino acid sequence similarity 00:33:19.23 to a gene in this genome called Cjw1. 00:33:24.23 Gene 127 in Omega has a relative in a genome called KBG. 00:33:32.11 In that case, gene 84. But, and this is important, the context, the flanking sequences in each case is different. 00:33:46.15 Ok, the sequences to the left of Omega 126, which corresponds to Omega gene 125, 00:33:53.06 are completely unrelated to Cjw1 gene 72, 00:33:58.17 which is at the left part of that gene in Cjw1. 00:34:03.10 And the same goes for the KBG comparison. 
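The arcs in the circle diagram amount to listing, for each pham, the pairs of genomes that both carry a member of it. The sketch below is a hedged illustration of that bookkeeping, not the tool used to draw the figure: the function `pham_arcs` and the pham labels are hypothetical placeholders, and only the genome names and the Omega 126 and 127 relationships come from the talk.

```python
from itertools import combinations


def pham_arcs(genome_phams):
    """Map each pham to the list of genome pairs that both contain a member of it."""
    members = {}
    for genome, phams in genome_phams.items():
        for pham in phams:
            members.setdefault(pham, set()).add(genome)
    return {pham: list(combinations(sorted(gs), 2)) for pham, gs in members.items()}


# Placeholder pham labels; the talk does not say which pham numbers these genes belong to.
genome_phams = {
    "Omega": {"pham_of_Omega_126", "pham_of_Omega_127"},
    "Cjw1": {"pham_of_Omega_126"},
    "KBG": {"pham_of_Omega_127"},
}

arcs = pham_arcs(genome_phams)
for pham in sorted(arcs):
    print(pham, arcs[pham])
```

The two adjacent Omega genes connect to different genomes, so their arcs differ, which is a compact way of stating that neighbouring genes can have distinct evolutionary histories.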
00:34:08.26 So in this case we can see that we don't have any nucleotide sequence similarity between these, 00:34:13.15 but we can dissect these evolutionary relationships that show 00:34:21.09 that these two adjacent genes in this example in Omega 00:34:24.16 have clear and distinct evolutionary histories. They have different phylogenies. 00:34:30.02 And this is one example, but we have clearly thousands of examples of Phams 00:34:36.00 which share and exhibit these types of relationships. 00:34:40.26 And this has considerable importance when you start to think 00:34:46.02 about questions of phylogeny of whole phage genomes. 00:34:50.23 Why not just take whole phage genomes and construct a phylogenetic tree 00:34:54.16 so you can see how they are all related to each other? 00:34:58.16 The problem is that all of the bits of the genomes because they are mosaic, 00:35:03.10 built from modules and pieces, and all those bits and those pieces have distinct evolutionary histories. 00:35:10.04 They have different phylogenies. There is arguably no single, clear, evident 00:35:16.17 phylogeny for a genome as a whole. 00:35:18.27 The genome represents an individual phage, 00:35:23.04 and its evolutionary history is reflected in a multiplicity of events 00:35:28.17 that have put those pieces together in that particular combination, in that order, in that particular virus. 00:35:36.02 I am just showing some other genome maps here of 00:35:43.19 some of these genomes illustrating again that for some of these genes 00:35:52.05 that are encoding the structural proteins, we know what they do. 00:35:55.20 But most of these other abundance of genes with large numbers of genes, 00:36:00.02 we really have absolutely no idea what they do. 00:36:03.01 And we would certainly like to know what their functions are and indeed what their structures are. 
00:36:08.00 And so now if we expand our analysis to include the unpublished information, 00:36:13.12 and this was done for 153 genomes that are completely sequenced, 00:36:19.09 we have over 17,000 open reading frames and almost 3000 Phamilies of distinct and different protein sequences. 00:36:26.12 The number of Orphams has come down slightly. 00:36:30.26 It is about 41% as we have started to find some of the relatives of genes that were previously Orphams. 00:36:38.22 And amazingly, if we take these almost 3000 Phamilies, 00:36:43.11 and we compare them against the sequence databases, 00:36:46.13 we find that about 80% of them are novel genes. They are novel sequences. 00:36:52.21 They have no relatives in either other phages or anything else that has been sequenced in the databases. 00:36:59.18 Even of the 20% of Phams that do match, so you know there is a related protein out there in the databases, 00:37:08.14 about half of those are for genes for which people don't know what they do anyway. 00:37:14.09 So database searching is an interesting exercise with these bacteriophage genes. 00:37:23.17 It provides rather little information as to what the functions of the genes are. 00:37:27.06 It is obviously very helpful when it does, 00:37:29.24 but the amazing thing is we just don't know what most of these genes do, and we would like to. 00:37:34.27 In this particular system we've made some headway 00:37:40.05 in developing a tool that can now help us address this question. 00:37:46.03 It is called BRED, or bacteriophage recombineering of electroporated DNA, 00:37:50.26 and it provides a simple, reproducible technique 00:37:56.19 for constructing mutants in mycobacteriophage genomes, whether deletions, insertions, or point mutations.
00:38:04.26 This method is published, and I won't go through its details here, 00:38:09.02 but it really just requires a simple electroporation step, 00:38:13.24 an ability to put phage DNA and a synthetic substrate together 00:38:18.04 inside a cell, and the techniques for doing so are well established, 00:38:20.23 followed by nothing more complicated than simply doing 00:38:27.27 a polymerase chain reaction, or PCR, screen 00:38:30.27 amongst a dozen or so of the progeny plaques that are recovered 00:38:37.04 in order to find those that have the mutation that you need. 00:38:40.11 And this is all accomplished through the establishment of a so-called mycobacterial recombineering system 00:38:47.01 that enables this to happen 00:38:49.12 at much higher frequency than you would normally see it. 00:38:52.20 And so this is very powerful because we can use that type of approach 00:38:57.05 to now go and ask what those genes do. 00:39:00.05 And indeed we can use it to try to develop applications for some of what we've found and what we are learning 00:39:07.23 that might be useful for the genetics of mycobacteria or specifically control of tuberculosis. 00:39:13.17 And I will give one brief example of that, which is a couple of genes called Lysin A and Lysin B. 00:39:21.04 In this case I am again showing Giles as an example. 00:39:27.06 In part one we talked about an important step that happens at the conclusion of lytic growth, 00:39:34.01 and that is that in order for the phage particles that have been generated by infection to get out, 00:39:41.06 the cell wall needs to be compromised. It needs to be broken open. The cell needs to be lysed. 00:39:49.03 And the phage encodes the enzymes that enable that to happen.
00:39:52.09 We know very little about the process in mycobacteriophages, 00:39:58.16 but we were surprised in looking at the genomes that there are two candidate genes that are involved, 00:40:04.14 Lysin A and Lysin B, and we were able to use this engineering technique 00:40:12.14 to construct mutants in which we've removed either one of these genes 00:40:17.19 and examine what the behaviors of the phages were. 00:40:20.26 That would enable us to figure out exactly what roles these genes are playing in lysis. 00:40:27.20 And I won't show you all the detailed experiments that gave us the conclusions as to what these do, 00:40:35.29 except to say that I think the results are very clear. 00:40:39.19 And here is a portrayal of what the mycobacterial cell wall looks like, 00:40:44.13 where you have an inner membrane, 00:40:45.20 you have the peptidoglycan of the cell wall, 00:40:49.14 there is a sugar layer called arabinogalactan, 00:40:53.20 and covalently attached to this is the so-called mycobacterial outer membrane, 00:40:59.14 which is composed of an interesting type of lipid called the mycolic acids. 00:41:04.23 And this is found in the mycobacteria and it is found in tuberculosis, 00:41:08.11 but most bacteria don't have this type of outer wall structure. 00:41:16.15 So not surprisingly we find that genomically the Lysin B enzyme 00:41:24.18 is predominantly encoded by mycobacteriophages. 00:41:28.28 And that was part of what clued us in to Lysin B perhaps playing a role in degrading this cell wall structure. 00:41:37.00 What we now know is that Lysin A is the enzyme that degrades the peptidoglycan. 00:41:43.11 And Lysin B is this novel enzyme that actually cleaves the mycobacterial outer membrane 00:41:50.27 from this arabinogalactan layer and therefore facilitates complete lysis 00:41:56.28 of the cell during the process of release of the progeny viruses at the conclusion of the lytic cycle.
00:42:06.16 And we are obviously very interested in these enzymes 00:42:09.05 because they are enzymes that degrade the cell walls of mycobacteria. 00:42:14.27 And therefore we like the idea that these enzymes could perhaps 00:42:19.19 play potentially useful roles, either in the lab to try to break open and to destroy mycobacteria, 00:42:26.29 or perhaps even in a clinical setting, either to help to inactivate mycobacteria 00:42:34.05 or perhaps to act synergistically with antibiotics 00:42:37.14 to make them work better and quicker in killing 00:42:40.11 Mycobacterium tuberculosis in an infected patient. 00:42:44.21 So I gave you just one example there of how we can begin to identify what these genes do 00:42:51.21 and how some of them may be useful. 00:42:53.07 We have seen that mycobacteriophages are highly diverse. 00:42:56.14 They have these architecturally mosaic genomes, 00:42:59.17 and we can dissect this mosaicism not just by looking at the nucleotide sequence similarities, 00:43:05.10 but by comparing amino acid sequence similarities, an approach that is really greatly enhanced 00:43:12.26 and aided by the fact that we have now this large number of phages 00:43:19.01 and phage genomes that infect a common host. 00:43:22.06 And I think that that raises the idea that there is probably a lot to be learned 00:43:27.12 from generating similar collections of bacteriophages that infect other bacterial hosts. 00:43:34.26 And the larger these collections grow, the greater the insights 00:43:39.04 and the resolution of the information that we can gain 00:43:42.12 about how similar they are, how related to each other, and the specific mechanisms by which they have evolved. 00:43:48.14 80% of the genes are of unknown function, and we and others have our work cut out 00:43:56.01 to try to find out what these are, what they do, 00:43:59.17 what they look like structurally, and why they are there.
00:44:04.09 We are beginning to learn about how they got to be there in these genomes. 00:44:08.12 Now we need to know what they do. 00:44:10.04 I've told you that the techniques have now been established 00:44:14.02 so that we can begin to readily manipulate these genomes. 00:44:17.00 These are tools that one could again imagine applying to other bacterial hosts 00:44:22.14 and other viruses in order to address these questions. 00:44:26.10 And I think that we have now a very powerful tool box 00:44:30.14 in this large set of phages, in this large number of genomes, 00:44:36.07 that can be used to understand what makes 00:44:41.06 Mycobacterium tuberculosis, a major human pathogen, tick, 00:44:46.26 and how we can exploit and use those genes and those genomes 00:44:51.21 for contributing towards the diagnosis, prevention, and cure of human TB. 00:45:00.10 I would like to finish by acknowledging those who have helped to support this research, 00:45:07.19 both the National Institutes of Health and the Howard Hughes Medical Institute. 00:45:11.23 All the work that I have talked about was performed by a truly stunning set of colleagues, 00:45:18.28 and I've listed many of their names there. 00:45:25.01 As I mentioned throughout, the genomic work has in part been done 00:45:30.26 by a large number of undergraduate students and high school students, 00:45:35.23 both in Pittsburgh and beyond. I don't have you all listed here, 00:45:39.13 but the contributions I think are really massive, 00:45:42.20 and I acknowledge that contribution, and thank you for that. 00:45:46.24 And so thank you for your attention to this iBioSeminars lecture.