U.S. Department of Health and Human Services

Animal & Veterinary


Comparative Genomic Fingerprinting (CGF) as a Typing Tool by Matthew Gilmour, Ph.D.

DR. GILMOUR: Thanks Heather. Sorry I have a slight bit of a cold today, but I think with the microphone it should be no problem.


So I am down from Winnipeg, Canada, at the National Microbiology Lab there, which is part of the Public Health Agency of Canada. Also, besides our association with CIPARS, we are the home to PulseNet Canada. Within PulseNet Canada, we have nationally implemented subtyping programs for, you know, Listeria, Salmonella, E. coli, Shigella, and Vibrio, but we don’t have one for Campylobacter.

I believe that is probably why I was invited down here today because I think you guys have a similar appetite as well for a subtyping method, maybe as an alternative to PFGE or MLST for Campylobacter.

I should probably preface the talk by stating I am actually a big fan of MLST for the most part. For Listeria and other organisms, it works very well. But for Campylobacter, it does have some caveats which I will go through. But the main point of the presentation is to present the method that was actually developed within the Public Health Agency of Canada, principally at the Laboratory for Foodborne Zoonoses, which Lucie is a part of as well.

It was developed first off by Dr. Chris Ronn, then taken over by Dr. Eduardo Taboada where they have gone through a pretty extensive collection of Canadian Campylobacter strains and done a comparative analysis between their new method which they call comparative genomic fingerprinting versus PFGE versus MLST versus flaA SVR, etcetera.

So as I said, maybe there is an appetite for a new Campylobacter subtyping method. Maybe some of you are starving for a new method, maybe some of you just have a curious appetite, but nonetheless, I am just going to present today a new tool for your consideration and you can take from it what you will.


So the word from Chris and Ed, it was actually born out of high-throughput genomics and what they started with was microarray-based comparative genomics. So Ed and Chris had developed actually a whole genome-based chip to represent the Campylobacter chromosome and had gone through actually quite a large collection of Canadian Campylobacter strains to characterize their genetic content, i.e., find regions of genetic variability.

So what they have done is take this very robust dataset with, you know, thousands, hundreds of thousands of data points, examining each gene in each strain, and parse it down to the informative traits that are representative of both lineages and phylogeny, and then done these comparisons to MLST that I will show here.


But maybe you are not interested in a new Campylobacter subtyping method, but what you may be interested in is the platform itself, because as we are entering this age of high-throughput complete genome sequencing using the massively parallel DNA approaches, again you do need a method to take these very robust datasets, i.e., genomes -- like for Salmonella  you are getting 5,000 genes, for Listeria you are getting 3,000 genes -- you know we are not at the point yet where you can routinely sequence every isolate that comes through your door, so you still need some kind of platform to screen for the informative traits.

So again, taking it down from maybe 3,000 genes to 100 traits that you want to screen for on a more routine basis in a larger population of strains.

The image here is of the new technology, that is 454 Titanium sequencing, where instead of doing one reaction at a time, you are actually doing about 200,000 reactions in parallel. So with 200,000 nanoreactions, each getting about 500 to 600 base pairs, within a single run on these machines you are getting about 100 million DNA base pairs of data. So again, a very robust dataset.
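As a quick check on the figures quoted here, the run yield is simply the number of parallel reads times the read length. This is a minimal sketch using only the numbers from the talk; the function name is made up for illustration.

```python
def run_yield_bp(reads: int, read_length_bp: int) -> int:
    """Total bases from one run: parallel reads times bases per read."""
    return reads * read_length_bp

READS_PER_RUN = 200_000  # figure quoted in the talk

low = run_yield_bp(READS_PER_RUN, 500)   # lower read length quoted
high = run_yield_bp(READS_PER_RUN, 600)  # upper read length quoted
print(f"{low / 1e6:.0f} to {high / 1e6:.0f} million base pairs per run")
```

The result brackets the "about 100 million" base pairs mentioned above.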

And as the costs come down, they may be suitable for routine use, although for now we still do need these platforms that we can implement in the lab on a routine basis to actually screen through a larger population of strains for subtyping, for phylogenetics, etcetera.


So this is a little bit older data for the incidence of Campylobacter in Canada. Campy here is the upper green line. Then the incidence of Salmonella is the blue line, so about half the incidence. The other classical enterics, Shigella, Yersinia, and E. coli O157, are down near the bottom, so a much lower incidence.

So this kind of gives us an indication that, you know, maybe Campylobacter is endemic within Canada, but it is kind of a false assumption because with these other organisms like Salmonella and E. coli we do have these routine subtyping methods so we can detect clusters. We can detect outbreaks. We can detect emerging clones.

But in the absence of a subtyping method that we can routinely implement for Campylobacter, we just get this straight green line. We just know we have a lot of Campylobacter, but we are not learning a lot about any individual event that is happening within our public health system.

So there we have it.  With this large number of Campylobacter, we need a high-throughput method to screen through and subtype these organisms. Then the problem becomes infinitely worse when you start -- and again this is only from the clinical perspective -- but taking from the veterinary health, food safety, etcetera perspectives, your sample size gets even larger. So again, your need for a high-throughput method is even greater.

So again, I don’t really mean to bash MLST here, but I do have a couple of slides to show again some of the caveats of it.


One would be its sampling density. The standard Campy MLST assay is seven loci. You can actually see that four of them are clustered all together in one region of the chromosome, two here and one down here. This actually leaves over half of the chromosome that is not even sampled at all.

So if subtyping is a method to capture genetic content as a surrogate for genetic variability, in a way MLST is failing this test here.


These are, again, variability plots done by Ed, again born from the microarray analysis. The colored blocks here show that there are clusters of variability within the Campylobacter chromosome. These, again, are regions that would not be sampled by MLST.


MLST is actually intended to more capture slow evolutionary events, so just standard mutation that happens within cells and then it is vertically transmitted. So some of the examples I have here are, you would have an ancestor, nothing happens to it. It divides and you get a descendant that is exactly the same. So there is stable vertical transmission.

Here is a case where there is a SNP in one gene of the ancestor, and the descendant is what MLST would detect as a single-locus variant. But as Collette even mentioned this morning, Campylobacter is highly recombinogenic and it also takes up exogenous DNA, so you are going to get events where it actually can take in a whole new locus or a whole new allele at an individual locus.

So the issue of MLST here is that you actually get an overestimation of genetic diversity. Because if you get one allele changing out of seven, you seemingly have like 13 percent of your genome changing. So again, it is an overestimation of change. That one change broadly affects the sequence type, and with it the clonal complex, that your isolate lands in.
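The arithmetic behind that point can be sketched as follows. The allele numbers are hypothetical and only illustrate a single-locus variant across a seven-locus profile. One changed locus out of seven works out to one seventh of the profile, about 14 percent, in the same range as the rough figure quoted above.

```python
def profile_distance(a: tuple, b: tuple) -> float:
    """Fraction of loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical allele numbers at seven loci (not real sequence types).
ancestor = (1, 4, 2, 7, 3, 5, 9)
descendant = (1, 4, 2, 8, 3, 5, 9)  # a new allele imported at one locus

d = profile_distance(ancestor, descendant)
print(f"{d:.1%} of the profile differs")
```

A single recombination event therefore registers as a change in one seventh of the typing scheme, even though the rest of the chromosome may be untouched.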


But definitely we like the idea of the multilocus approach.  We are not at the phase yet, again, where we can sample the entire chromosome for every strain, so we do -- you know, there are seven that you can do with MLST but I think you can definitely do more. 

But with that metric you cannot increase the workload in the lab. MLST is already kind of time-consuming to complete in the lab, so we do not want to propose a new method where, yes, you can sample more genes, but it is harder to do so.

So we can’t increase the workload, but we also cannot increase the cost. So the solution here is to target loci whose allelic status is just easier to define, so easier and cheaper.

So remember, to determine allelic status with an MLST, if it is a dedicated PCR, you amplify a region and then you do the sequencing of that region and then an analysis of the resulting data. That is relatively easy to do, but it does take time and it is not very cheap.

So another way to do this is to screen for just the presence or absence of genes, and that is where comparative genomic fingerprinting proposes itself to be a binary assay, so, again, just a present or absent status. So here all you have to do is the PCR; either the gene is there or it is not, which is actually pretty easy to do in the lab.
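A binary assay of this kind reduces each isolate to a string of ones and zeros, one position per target gene. The sketch below assumes a tiny five-gene panel with made-up gene names; the real panel and its loci are not given in the talk.

```python
def fingerprint(detected: set, panel: list) -> str:
    """Encode presence (1) or absence (0) of each panel gene, in panel order."""
    return "".join("1" if gene in detected else "0" for gene in panel)

# Hypothetical panel; names are placeholders, not real CGF targets.
PANEL = ["gene_A", "gene_B", "gene_C", "gene_D", "gene_E"]

isolate_1 = fingerprint({"gene_A", "gene_C", "gene_E"}, PANEL)
isolate_2 = fingerprint({"gene_A", "gene_B", "gene_C"}, PANEL)
print(isolate_1, isolate_2)
```

Because every position is just present or absent, two fingerprints can be compared position by position without any sequencing.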


Again, here is another demonstration of that variability within C. jejuni. There are definitely pockets along the chromosome that are highly variable and if you target genes within these, you actually come up with a pretty good subtyping method.


This is just one of Ed’s complicated diagrams showing the selection of loci. So again, ultimately coming from these comparative genomic hybridizations using the microarray, showing the different lineages of all the strains that they have analyzed, and a nice little heat diagram.

So I guess he parsed it down to 236 markers at the start that would be representative of those Campylobacter lineages and he has gone through and just either picked, you know, 10, 35, or again another 35 here, and found ones that replicate the topology of the tree. I think between 35 and 100, he actually found you could -- you know, the total genetic content of the Campylobacter could be replicated with this screening method.


So the flip side to that is, again, going back to the idea of the genetic sampling density. Instead of just targeting pockets of the chromosome, you can actually sample regions along the entire length of the chromosome, again, as a surrogate marker for total genetic content of the strain that you are currently investigating.


So to develop the actual assay, it is quite simple. You are designing primers in SNP-free regions. So these are regions that you would expect to amplify without any problem. If the gene is there, you will get an amplicon and you will know it. You do it in a staggered fashion and you do multiplex assays. 

Ed’s current version of it is a 40-plex assay so it is eight multiplexes of five. So you have eight individual assays, each with five primer sets, and then the five amplicons of a staggered size.
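Because the five amplicons in each multiplex have staggered sizes, each band on the gel can be assigned to a locus by its size alone. The following is a minimal sketch of that step; the expected sizes, the locus names, and the size tolerance are all assumptions for illustration and are not taken from the actual assay.

```python
# Hypothetical expected amplicon sizes (bp) for one five-plex reaction.
EXPECTED_SIZES_BP = {
    "locus_1": 100,
    "locus_2": 200,
    "locus_3": 300,
    "locus_4": 400,
    "locus_5": 500,
}

def call_multiplex(band_sizes_bp, expected, tolerance_bp=15):
    """Return 1 for each locus with a band near its expected size, else 0."""
    return {
        locus: int(any(abs(band - size) <= tolerance_bp for band in band_sizes_bp))
        for locus, size in expected.items()
    }

# Bands observed in one lane (made-up values).
calls = call_multiplex([102, 297, 505], EXPECTED_SIZES_BP)
print(calls)
```

Eight such lanes, one per multiplex, would give the 40 presence or absence calls for one isolate.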

It may sound kind of troublesome or complicated to do 40 PCRs per assay in the lab, but there are actually multiple vendors out there right now that have automated PCR gel systems, so instead of just running a standard agarose gel, with a 96-well plate you can run it on this little robot and actually get these types of outputs quite easily.

So I think that would definitely be one of the principal bottlenecks of the assay if you were running it in an agarose gel system because you don’t want -- you know, for every isolate you would be running eight lanes. But in the 96-well plate format which these robotics companies are obviously well suited for, it is actually no problem at all. I am going to get into the data in a second here.
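The plate arithmetic implied here is straightforward. The sketch below assumes one well per lane and ignores any wells reserved for controls or size markers, which the talk does not discuss.

```python
import math

LANES_PER_ISOLATE = 8   # eight multiplexes per isolate, as described in the talk
WELLS_PER_PLATE = 96

def plates_needed(isolates: int) -> int:
    """Number of 96-well plates needed to run every multiplex for every isolate."""
    wells = isolates * LANES_PER_ISOLATE
    return math.ceil(wells / WELLS_PER_PLATE)

isolates_per_plate = WELLS_PER_PLATE // LANES_PER_ISOLATE
print(isolates_per_plate)    # isolates that fit on one plate
print(plates_needed(500))    # plates for a collection of about 500 isolates
```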

But so Cliff Clark in our lab did MLST on about 500 Campylobacter isolates, so just the manual PCR sequencing and then analysis. That took him the course of years to do. Then he sent those same 500 isolates off to Ed’s lab in Lethbridge and it took two weeks to analyze those. 

We have the QIAxcel Automated Gel Platform there. It is actually not even that expensive. It is about 40,000 brand new, but Qiagen with all the discounts and stuff, it comes down to about 25,000. We have even found that the staff prefers running just even their normal agarose gels on this automated platform instead of going towards traditional, like pour gel, let it solidify, and load it yourself. 

From a laboratory perspective, it is actually pretty handy because you avoid all the issues of ethidium bromide. You know, disposal, etcetera.


So here is kind of what one of the outputs could look like. Again, this is entirely scalable. Ed has a 40-gene version now. But if you wanted to go up to, say, 120 genes, that is also possible. The key aspect here is that you can easily -- you know, the average technician can easily fire through 96 samples per day at a pretty low cost per sample.


So the actual sample size that was compared in this study came from the Canadian C-EnterNet Sentinel Site Surveillance Program which is actually based in the Waterloo region which is kind of, if you drew a line between Detroit and Toronto, it is kind of in the middle there. Their goal is, again, integrated surveillance. So they are looking at human components, animal components, retail food components, and also environmental. So again trying to sample from one site all the different possible sources of gastrointestinal illness.

So in total in the study we have 493 Campylobacter isolates. They are all jejuni except for 70 that are actually C. coli. More than half of them are human clinicals. Then another third of them are chicken retail meat, so these are chicken breast samples. There is only one beef and three pork, but that is just representative, you know, the sampling was done in beef and pork, but you are obviously finding more Campylobacter in chicken than in the other two commodities.

They also do the on-farm surveillance, so they have 30 from bovine feces and 27 from swine manure. Then also from local water supplies, from actually the river that goes through Waterloo, 16 water samples.

The sampling was completed between 2005 and 2008, and then just to aid the epidemiologic and investigative components, there was actually one outbreak during the course of this study. It was at a summer camp.


This is not meant to be read, but just to show kind of the scope of information that has been captured on each of these isolates. This is just my screen shot of about five percent of the data.


So what Cliff and Ed did go through and do was get PFGE data, FlaA-SVR, FlaA-Peptide, Oxford porA, Oxford mop*, MLST, clonal complex, all the existing favorites out there for Campylobacter typing. They threw them into the mix and then on the right here this is the comparative genomic fingerprinting so you see the eight multiplexes of five. Then you get, you know, absence of the gene is red, a red zero. Presence of the gene is a green one.

So just a comment on the unbiased nature of this study: certainly with Ed being kind of one of the creators of the assay, I think he certainly had a bias that at the end of it CGF, comparative genomic fingerprinting, would be a winner. But in our lab, we had no preference at all. We just wanted to implement a subtyping method that was quick, cheap, and actually did the job as intended. So that is where Cliff Clark came in, to be kind of that unbiased eye to make sure whatever kind of came out the winner truly was the winner. It was not just because it was our home brew assay.


Okay so Cliff and Ed have put in the paper for publication and it is literally about 32 tables. 20 to 25 of them are supplementary; seven are right in the manuscript itself. Unfortunately, they had no figures, so I tried to pull what I can from it. 

Largely what they did, because there was an absence of epidemiologic characteristics to investigate here to actually decide on that level which is the best typing, they have gone toward some more quantitative description of the methods just to find out which one can find diversity.

The easiest way to find diversity is usually the Simpson’s Index of Diversity which is just if you grab two isolates from a population, what is the probability that they have different types? So you can see here that for all the methods that they tested -- well for the most part it is over 90 percent, bordering on 100 percent. So that just speaks to the diversity of Campylobacter. Again, just randomly grab two isolates. The chances are they do not have the same subtype.
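The index described here can be computed directly from the list of types assigned to the isolates. This sketch uses the common form that samples two isolates without replacement, which is an assumption, since the talk does not say which variant was used; the type labels are made up.

```python
from collections import Counter

def simpsons_index_of_diversity(types: list) -> float:
    """Probability that two isolates drawn without replacement have different types."""
    n = len(types)
    if n < 2:
        raise ValueError("need at least two isolates")
    counts = Counter(types)
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1 - same_type_pairs / (n * (n - 1))

# Made-up type assignments for four isolates.
example = ["A", "A", "B", "C"]
sid = simpsons_index_of_diversity(example)
print(round(sid, 3))
```

A value near 1 means almost every pair of isolates is distinguished by the method; a value of 0 means every isolate has the same type.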

Just another point of interest on this slide is with the CGF. So CGF40 is their nomenclature for the 40-gene assay. Because it is a binary assay, you can easily change your thresholds of similarities so you can have 100-percent assay or 90-percent assay because obviously here CGF40 at 100 percent, it is closing in on 100 percent Simpson’s Index of Diversity. So you need to kind of actually scale down the thresholds a little bit.
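To illustrate how a similarity threshold changes the grouping, the sketch below compares binary fingerprints position by position and merges isolates whose similarity meets the threshold. The simple matching score and the single-linkage merging are assumptions chosen for illustration; the talk does not specify how CGF clusters are actually built. The fingerprints are made up and shortened to ten loci.

```python
def similarity(a: str, b: str) -> float:
    """Fraction of loci with the same presence or absence call."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(fingerprints: dict, threshold: float) -> list:
    """Group isolates linked by any pairwise similarity at or above the threshold."""
    names = list(fingerprints)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(fingerprints[a], fingerprints[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

# Hypothetical ten-locus fingerprints.
fps = {
    "iso1": "1111100000",
    "iso2": "1111100001",  # differs from iso1 at one locus
    "iso3": "0000011111",
}
strict = cluster(fps, 1.0)
relaxed = cluster(fps, 0.9)
print(len(strict), len(relaxed))
```

At 100 percent, every fingerprint here is its own type; relaxing to 90 percent merges the two isolates that differ at a single locus, which lowers the diversity index.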


Here is another way of almost looking at the same thing. It is just with all the different methods showing within the population of isolates. The number of types that you have, you know, the number of types with only one isolate. Again you are just seeing here the same measures of the Simpson’s Index of Diversity. Again, the chance of just randomly grabbing two isolates out of the pile, they are probably going to be different isolates -- sorry different types.

You can see this nice analysis of PFGE here. The number of types per -- sorry, the number of isolates per type is down to almost one, so almost a random distribution of PFGE versus the isolate you are grabbing.


This is another very colorful way that they have looked at it. It is congruence. It is called a Wallace Coefficient. Just a quick explanation of that is again if you are testing -- if you had method A and two isolates have the same type and then took those same two isolates and looked in method B, would they have the same type in method B? If the answer is yes, your congruence would be high. The two methods are congruent. So a congruence value or coefficient value of one would represent that. 
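The coefficient described here can be sketched as a count over pairs of isolates: of all pairs that share a type in method A, what fraction also share a type in method B? The type labels below are made up for illustration.

```python
from itertools import combinations

def wallace(method_a: list, method_b: list) -> float:
    """Fraction of isolate pairs sharing a type in A that also share a type in B."""
    same_a = 0
    same_both = 0
    for i, j in combinations(range(len(method_a)), 2):
        if method_a[i] == method_a[j]:
            same_a += 1
            if method_b[i] == method_b[j]:
                same_both += 1
    if same_a == 0:
        raise ValueError("no pair of isolates shares a type in method A")
    return same_both / same_a

# Hypothetical assignments for four isolates.
sequence_type = ["ST1", "ST1", "ST2", "ST3"]
clonal_complex = ["CC1", "CC1", "CC1", "CC2"]

st_to_cc = wallace(sequence_type, clonal_complex)
cc_to_st = wallace(clonal_complex, sequence_type)
print(st_to_cc, round(cc_to_st, 3))
```

Note that the coefficient is directional: a finer method predicts a coarser one well, but not the other way around.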

So easy examples of that are, you know, flaA-SVR versus flaA-peptide. Obviously they are congruent methods. Sequence type versus clonal complex, they have a congruence value of 1. But as a good sign here for comparing MLST versus CGF, CGF versus clonal complex is actually 0.773.


I think probably just a better way to actually show this is to look at some of the data here. 

So on the far left of the screen is clonal complex as determined by MLST. The next column in is the flaA-SVR type, and then of course, we have our comparative genomic fingerprinting here. You can actually just see visually the same clustering, that, you know, isolates that have the same clonal complex have generally the same CGF cluster type through the four different examples here. We have clonal complex 137, 222, etcetera down the line.

But the added value of the comparative genomic fingerprinting is that there is slightly more resolution so you can see. If you only did MLST, you would know these all to be clonal complex 137. But if you also do comparative genomic fingerprinting, you get the same clustering but you get a few other data points as well to help you actually identify individual events within your investigations.

So you can actually think back to that slide where I showed kind of those clusters of variability that were scattered across the Campylobacter chromosome. That is actually what is being captured by CGF at this point here.


Just out of interest, PFGE versus comparative genomic fingerprinting, there obviously were not a lot of PFGE clusters within the dataset. Here is just a few of them, but again you can see how CGF obviously clusters in a related fashion towards PFGE. But you get a little bit more resolution as well.


Probably the oversimplification here is that source attribution is difficult for any of the methods for Campylobacter. Unfortunately, most of the types map back to only one specific source. So if you add these up, you’re getting close to 80 or 90 percent of your isolates only mapped to one individual source.

So 56 percent or 193 of the 493 isolates only came to human for CGF. The same for sequence typing here, 46 percent of the isolates in the population were only present in human populations. So that problem actually gets worse as you go down the line.

You can see here, looking for isolates that are present in both human and chicken populations, only 4 percent by CGF match, only 9 percent by sequence typing match. So again, to do these studies of source attribution with any of these methods is actually extremely difficult.
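The kind of tally behind these percentages can be sketched as follows: group isolates by type, note which sources each type was found in, and count the isolates whose type appears in only one source. The records below are made up; they are not the study data.

```python
from collections import defaultdict

def single_source_share(isolates: list) -> float:
    """Fraction of isolates whose type was seen in exactly one source."""
    sources_by_type = defaultdict(set)
    for type_label, source in isolates:
        sources_by_type[type_label].add(source)
    single = sum(1 for type_label, _ in isolates if len(sources_by_type[type_label]) == 1)
    return single / len(isolates)

# Hypothetical (type, source) records.
records = [
    ("T1", "human"),
    ("T1", "human"),
    ("T2", "chicken"),
    ("T3", "human"),
    ("T3", "chicken"),
]
share = single_source_share(records)
print(f"{share:.0%} of isolates have a type found in only one source")
```

When most isolates fall into single-source types, as described above, there are few shared types left to link human cases back to a food or animal source.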


We did have the one outbreak during the sampling period. Again at that summer camp, it was an outbreak of 22 cases, so we had 22 isolates. There is good data here in that all the epidemiologically linked isolates all have the same characteristics. They all have CGF pattern 8_1, sequence type 45, flaA SVR 21, and down the line.

Kind of the one point of interest that favors CGF in the data point is that those same types were obviously observed elsewhere in the population for the other method, so you can see clonal complex 45 being present in numerous other samples throughout the population that are unrelated to the event. But CGF, at least at the threshold that is being applied here, is relatively specific to the event.

So if you were routinely running CGF in the lab, you would see this cluster of the 8_1 and it would be an event that would mark off a threshold for further investigation for you. But if you were just running MLST, you would get the very common clonal complex of 45 and it probably wouldn’t trigger off to you that there was an event happening here.


So just again to be unbiased, one of the biggest knocks against CGF is that there is actually a very limited stock of information out there. This is not -- well one, it is not a published method. This is a paper that has just been submitted for publication. 

There isn’t a database out there like there is for MLST where you can upload sequence types and you can download other people’s and you can compare information. You can put your current types in context of other historical types as is necessary for subtyping. 

This can’t really be done for CGF because we have a very limited dataset. It is at least 500 isolates, but it is locked up in our PHAC IT warehouse. It is not accessible to anyone else except through, you know, collaboration. So that is something for CGF we would need to work on before it could ever be implemented on a more widespread basis.


And then just to thank Dr. Eduardo Taboada who basically invented the method. But it was Cliff Clark who ran many of these analyses. Dr. Frank Pollari is the lead for the C-EnterNet Surveillance Program, so I will also thank him. And also some of the technicians in the lab. Thank you.


DR. HARBOTTLE: So I think we will have to ask our organizers here if we have time for questions. We are about nine minutes over schedule. It is about 11:09. We have a few minutes. Okay. So if anyone has questions for the speakers, please come to the microphone.