New position and new study field

It’s already been a month since I moved from King’s Buildings to the Roslin Institute in March 2017. It wasn’t a big move in terms of distance (Roslin is just outside the city of Edinburgh), but it meant quite a change in terms of research field, since I swapped HIV molecular epidemiology for bacterial genomics, particularly of Staphylococcus. Although this might not seem too different to some people, it means dealing with genomes about 300 times larger and with a much more complex genetic organisation.

I will be working in Prof. Ross Fitzgerald’s group (the “Laboratory for Bacterial Evolution and Pathogenesis”) on a Wellcome Trust-funded project that aims to investigate the molecular basis of S. aureus host-adaptation.

S. aureus is an important pathogen that affects humans, livestock and wildlife, and has undergone numerous host-switching events during its evolutionary history leading to the emergence of new pandemic clones.

In collaboration with researchers from around the world, we will collect and sequence the whole genomes of hundreds of S. aureus isolates, and apply genome-wide association and evolutionary genomic analyses to understand the genetic basis of this pathogen’s host-tropism and of epidemic clone emergence. This fascinating project will also involve colleagues at the University of Edinburgh and at the University of Glasgow, who will apply different approaches to the same topic.

Despite the difficulties of starting a new job and having to learn new concepts and techniques, everyone at the lab and in the Institute in general has been tremendously welcoming and I feel very well taken care of. Hopefully the future will bring many successes!


24th Conference on Retroviruses and Opportunistic Infections (CROI) in Seattle, WA – February 2017

Once again this year, I was lucky enough to attend the CROI conference, the biggest meeting on HIV research. I presented the work “Analysis of Nearly Full-Genome HIV-1 Sequences from Uganda: Results from PANGEA_HIV” as a poster.

As you can see from the title, in this communication I shared the preliminary results from the analysis of nearly full-genome HIV sequences generated from samples taken in Uganda. We found a remarkable proportion of A1/D recombinant sequences, but low rates of drug resistance mutations across different genes and little transmission between different populations.


Just arrived in Seattle: with my boss Andy Leigh Brown and my colleagues Manon Ragonnet and Emma Hodcroft

The samples came from individuals in several cohorts studied by my colleagues at the MRC-UVRI in Entebbe, and are part of the PANGEA_HIV project, which is dedicated to increasing the understanding of HIV transmission dynamics in Africa by producing and using HIV sequence data. The 685 samples used in this particular dataset corresponded to contemporary sequences (sampled between 2009 and 2014) from 3 cohorts: i) a rural population in Masaka district in the south-west of Uganda, ii) fishing communities (“fisherfolk”) who work at different sites around the shores of Lake Victoria, and iii) female sex workers from Kampala, the capital city. Additionally, we analysed historical samples taken in 1986 from patients with AIDS as part of a serological surveillance study in Kampala hospitals. The samples were sequenced with Illumina MiSeq next-generation sequencing at the Wellcome Trust Sanger Institute and processed using a pipeline created for this purpose by colleagues at UCL, which produced consensus sequences longer than 1 kb for 609 (89%) of the samples (565 contemporary and 44 historical).

Given the long-term co-circulation of subtypes A1 and D in Uganda, frequent recombination between them was to be expected. However, the levels found here (52% in contemporary sequences and 80% in historical ones) were much higher than those previously reported. This is most likely related to the fact that we analysed full genomes, as opposed to the traditional analysis of partial pol sequences, which limits our ability to detect recombination breakpoints. These results were obtained with the SCUEAL subtyping tool adapted to HIV full-genome sequence analysis.

To test the level of HIV transmission between different populations, we looked for transmission clusters, i.e. groups of closely related sequences in phylogenetic trees, among contemporary sequences. We found 54 of them (44 sequence pairs, 10 triplets), which involved 21% of the sample. Most clusters involved individuals from the same population only (mainly corresponding to fisherfolk), although not always sampled in the same population. This could reveal a compartmentalised epidemic in which different populations don’t frequently interact despite being mobile. Alternatively, it might simply mean that we need deeper sampling to reveal clusters that currently go undetected.
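To make the idea of a cluster more concrete, here is a minimal sketch. It is not the method used for the poster (which looked for clusters in phylogenetic trees); it assumes a simpler pairwise-distance rule, and the function names, the toy sequences and the 0.15 threshold are all hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (gaps and Ns skipped)."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 1.0


def find_clusters(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Single-linkage clusters: link any two sequences whose distance is within the threshold."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    # Singletons are not clusters.
    return [group for group in groups.values() if len(group) > 1]


# Toy alignment, purely illustrative.
toy = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "TTTTGGGGCC",
    "s4": "TTTTGGGGCA",
    "s5": "GGGGGGGGGG",
}
clusters = find_clusters(toy, threshold=0.15)
clustered_fraction = sum(len(c) for c in clusters) / len(toy)
```

In this toy example the two pairs (s1, s2) and (s3, s4) cluster and s5 stays alone, so the clustered fraction plays the same role as the 21% reported above.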

We tested for the presence of drug resistance mutations in different genes: protease, reverse transcriptase, integrase and gp120. In contemporary sequences the level of resistance was very low, as expected in low-income settings. However, the analysis of gp120 sequences revealed a high level of usage of the X4 co-receptor in historical sequences. Usage of X4 (as opposed to R5) confers resistance to entry inhibitors, but this is to be expected considering that these samples come from 1980s patients with chronic infection who were suffering from AIDS.

We believe these results help us understand how HIV is transmitted in different populations of Uganda. However, the use of full-genome sequences poses new methodological challenges that will have to be sorted out. Fortunately, PANGEA_HIV is generating more samples that will help us with this, and will shed new light on this topic. So more updates to come!

You can take a look at the poster here.

Article published in Scientific Reports

Finally! Over the past Christmas break, our new paper “Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic” was published in the journal Scientific Reports (open access!).

In this study we compared how employing different HIV genes (of different lengths) and different sampling coverage levels affects the reconstruction of the correct HIV phylogeny using simulated sequence data.

This is an important question, since more and more full-genome sequence data is becoming available but we don’t have enough experience in applying it to the reconstruction of HIV phylogenies (the vast majority of studies so far have used partial pol sequences). Are trees reconstructed using full genomes the most accurate ones? Do other gene(s) provide good approximations?

However, to answer this we need to know what the real phylogeny is, and the best way to do so on a large scale is to use simulated data: we used a simulated HIV epidemic (developed by Emma Hodcroft and Samantha Lycett) resembling an “African Village” scenario, in which all sexual contacts were recorded. Selecting the contacts that gave rise to transmissions produced the true transmission tree. Along this tree, associated HIV sequence data was simulated, applying different, realistic evolutionary rates to different genes.
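As a rough illustration of how a true transmission tree falls out of a fully recorded contact list, here is a minimal sketch. It is not the simulation code itself: the record layout (time, source, partner, transmitted) and the function name are hypothetical, and each person is assumed to be infected only once.

```python
from collections import defaultdict


def transmission_tree(contacts):
    """Keep only the contacts that caused an infection.

    contacts: iterable of (time, source, partner, transmitted) tuples.
    Returns {infector: [(infected person, time of transmission), ...]}.
    """
    children = defaultdict(list)
    infected_by = {}
    # Process contacts in chronological order.
    for time, source, partner, transmitted in sorted(contacts):
        # Assumption: a person can only be infected once (no superinfection).
        if transmitted and partner not in infected_by:
            infected_by[partner] = source
            children[source].append((partner, time))
    return dict(children)


# Toy contact list, purely illustrative.
recorded = [
    (1, "A", "B", True),
    (2, "A", "C", False),
    (3, "B", "C", True),
    (4, "A", "D", True),
]
true_tree = transmission_tree(recorded)
```

Here A infects B and D, and B infects C; the contact between A and C is dropped because it did not lead to a transmission.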

We created different combinations of gene datasets (full genome, gag-pol, gag, full pol, partial pol, and env) and sampling coverage (full coverage [100%], 60%, 20% and 5%). For each combination, 100 replicates were created, and for each of them we built a maximum likelihood tree which was compared to the true tree.
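The experimental grid can be sketched as follows. This is only an illustration of the design: the gene sets, coverage levels and replicate count come from the text above, but how replicates were actually generated is not described here, so the sketch assumes each replicate is an independent random draw of taxa, and the function names are hypothetical.

```python
import random
from itertools import product

GENE_SETS = ["full genome", "gag-pol", "gag", "full pol", "partial pol", "env"]
COVERAGES = [1.0, 0.6, 0.2, 0.05]
N_REPLICATES = 100


def subsample(taxa, coverage, rng):
    """Randomly keep the given fraction of taxa (at least one)."""
    k = max(1, round(len(taxa) * coverage))
    return sorted(rng.sample(taxa, k))


def design(taxa, seed=0):
    """Yield (gene set, coverage, replicate index, sampled taxa) for every combination."""
    rng = random.Random(seed)
    for gene, coverage, rep in product(GENE_SETS, COVERAGES, range(N_REPLICATES)):
        yield gene, coverage, rep, subsample(taxa, coverage, rng)


# Toy taxon list, purely illustrative.
taxa = [f"seq{i}" for i in range(200)]
runs = list(design(taxa))
```

With 6 gene sets, 4 coverage levels and 100 replicates this gives 2,400 datasets, each of which would then get its own maximum likelihood tree.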

We found that the accuracy of the trees increased significantly with the length of the sequences used, with the full genome datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets.

Thanks to the increasingly affordable generation of full HIV genomes, we will be able to analyse longer genetic regions that, according to our results, will improve the reliability of phylogenetic reconstruction. The short pol sequences generated for resistance testing that are traditionally used in most molecular epidemiology studies are substantially less reliable, especially with low sampling depths.

Contagion, an ‘infectious’ public engagement event at the Science Museum, London (26/10/16)

I had the incredible opportunity of joining my colleagues at the Farr Institute (with whom I collaborate as part of the ICONIC project) in “Contagion”, an evening of science outreach in the quirky and fun Science Museum, London.


One of our banners #datasaveslives

Contagion was a public engagement event held on 26 October that focused on showing the public different aspects of important infectious diseases (HIV, Ebola, Zika, Polio, Malaria…) and the approaches that researchers take to study them. Contagion, generously funded by the Bill & Melinda Gates Foundation, was part of the Lates programme, which according to the Science Museum consists of “adults-only, after-hours theme nights that take place in the Museum on the last Wednesday of every month. Each entry in this hugely popular ongoing series of events centres on a different theme: from sex to climate change, from big data to childhood.”


Very curious people (yes, there was booze!)

The slogan for the Farr Institute stand was “Tracking viruses in space and time”, and we did our best to describe what we can learn from the analysis of genetic material from viruses, especially using phylogenetics, and how that information can help us to improve global health. Viral epidemics are becoming more and more global, and information about how viruses transmit and spread around the world is key if we want to implement measures to tackle this expansion.

Lots of people with quite different backgrounds were very interested in our work, which provided an evening of fascinating debate and learning. It was great fun!

Visit to the MRC/UVRI in Entebbe, Uganda

I had the pleasure of staying at the MRC-funded Uganda Virus Research Institute (UVRI) for the first two weeks of October. The MRC/UVRI is an 80-year-old institution located in Entebbe, Uganda, that conducts public health-related research. The visit stems from the long-term collaboration between the Leigh Brown group at the University of Edinburgh and the Research Unit on AIDS at the UVRI.


Clinical Diagnostic Labs, MRC/UVRI (from

My visit was made possible by a MUIIplus travel grant for visiting scientists. The MUII (Makerere University/UVRI Infection and Immunity) programme works with regional research centres and leading international Universities to ensure collaborative training activities including short courses, research attachments and research fellowships.

During my visit, I assisted with the installation and implementation of bioinformatics resources at the UVRI, as well as with the training of UVRI students and staff in those methods. We were all most interested in making available sophisticated methods for phylogenetic and phylodynamic analysis of HIV sequences, particularly RAxML and BEAST. These analyses are currently being applied to the UVRI database of HIV pol sequences associated with epidemiological data, which includes samples from different Ugandan populations.

This technology will allow UVRI staff to gain independence and experience in analysing HIV phylodynamics, and will provide resources for future analyses. Further capacity will be built in the next few months through the collaboration with UMIC, which will increase the computing capability of the UVRI. I also had the opportunity to give a seminar going through my research career and explaining basic concepts in HIV molecular epidemiology.

Virus Genomics & Evolution 2016

1st Virus Genomics & Evolution at the Wellcome Genome Campus, Hinxton, Cambridge, UK – June 2016


One of the buildings at the Wellcome Genome Campus conference centre (phone pic)

I attended the first edition of this interesting conference, which covered multidisciplinary approaches to applying virus genome sequencing to the study of the epidemiology, pathogenesis and public health implications of viruses. It was a very successful mix of well-renowned experts in different fields and young students (ironically I’m neither).

I had the opportunity to present again the preliminary results of the ICONIC project as a poster, highlighting the tremendous HIV strain variability that we found in our London samples (somewhat more typical of sub-Saharan African settings than of a European capital), which included many complex recombinants.

You can take a look at the poster here.



HIV Dynamics & Evolution 2016

23rd HIV Dynamics & Evolution Workshop in Woods Hole, MA, USA – April 2016

I presented the work “Improvement in Phylogeny Reconstruction through Use of Simulated Genome Data” as a poster.


Woods Hole harbour (phone pic)

In this work, we aimed to evaluate the effect of utilising different HIV viral genes and different sampling depths when reconstructing simulated phylogenies.

HIV molecular epidemiology studies generally use partial pol sequences because of their availability and the fact that their analysis was demonstrated to be sufficient to reconstruct HIV transmissions. But is pol the best gene to reconstruct HIV phylogenies? The increasing accessibility of whole genome sequencing allows using other genes or even full genomes, which raises some questions: are full genome trees the most accurate ones? Do other gene(s) provide good approximations?

To answer these questions, we need to have a known, recombination-free phylogeny, and simulated data can provide it: the PANGEA_HIV phylodynamic methods comparison exercise simulated an HIV epidemic resembling an “African Village” scenario, in which all sexual contacts were recorded. Selecting those which gave rise to transmissions produced the transmission tree (‘true tree’).

Associated viral sequence data was also simulated along the true phylogenies. Different substitution rates were applied to different genes (with a rate twice as high for env as for gag and pol) and to different codon positions (1st + 2nd vs 3rd).

Starting from a sequence dataset that initially included 4,662 sequences, we built sub-datasets that combined different genes (gagpolenv, gagpol, gag, pol, env and partial pol (PR+RT)) and sampling depths (100%, 60%, 20% and 5%; with 10 replicates for the last two levels). We constructed maximum-likelihood trees for each combination using RAxML (GTR+Γ), and compared their topologies to that of the corresponding true tree using the CompareTree metric. This similarity metric gives the proportion (from 0 to 1) of splits that are identical in the two trees being compared.
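The general idea of a split-based similarity can be sketched in a few lines. This is not the CompareTree implementation, just a minimal illustration under some assumptions: trees are written as nested tuples of taxon names, both trees contain the same taxa, and the function names are hypothetical.

```python
def leaves(tree):
    """Set of taxon names in a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions (splits) of the tree, treated as unrooted."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # fixed reference taxon, used to normalise each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single taxon on either side).
        if 1 < len(clade) < len(all_taxa) - 1:
            # Always store the side that does not contain the anchor taxon.
            found.add(all_taxa - clade if anchor in clade else clade)
        return clade

    visit(tree)
    return found


def split_similarity(true_tree, inferred_tree):
    """Proportion (0 to 1) of the true tree's splits also present in the inferred tree."""
    true_splits = splits(true_tree)
    if not true_splits:
        return 1.0
    return len(true_splits & splits(inferred_tree)) / len(true_splits)


# Toy trees, purely illustrative.
true_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
score = split_similarity(true_tree, inferred_tree)  # 0.5: only the D+E split is shared
```

A score of 1 means the inferred tree recovers every split of the true tree, which is how the values reported below should be read.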

We found that the accuracy of tree reconstruction increased in almost direct proportion to the length of the sequences used. Thus, the genome datasets showed the best performance (average metric=0.96 [range=0.93-0.98]). They were followed by gagpol (0.95 [0.91-0.98]), env (0.93 [0.89-0.95]) and pol (0.93 [0.92-0.96]), in that order. Finally, gag (0.88 [0.84-0.90]) and partial pol (0.87 [0.87-0.88]) showed the worst performances.

In the subsampled datasets, the 60% sampling level showed very similar results to the fully sampled dataset. At the 20% sampling level, performance overlapped considerably among the larger fragments, but smaller regions had substantially poorer results. The 5% sampling level showed good results in some replicates, but we found high variability among them, which calls its reliability into question.

In conclusion, using longer sequences derived from whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.

The poster can be downloaded here.