Genomic variants file
HaploCoV uses a collection of genomic variants with high frequency (at a specific time or place) to augment a “target nomenclature” and identify candidate novel variants or lineages. By genomic variant, we mean a variant in the genome, or more precisely a difference with respect to the reference genome assembly of SARS-CoV-2. Although the adjective “genomic” might sound verbose, we use it throughout the manual to avoid confusion with “viral variants”, i.e. variants of the virus.
The utility computeAF.pl included in HaploCoV can be used to analyse a file in HaploCoV format and identify high frequency genomic variants. By default computeAF.pl flags all the genomic variants that displayed a frequency of at least 1% for more than 15 days, not necessarily consecutive, during the pandemic (as derived from the input data).
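The selection criterion described above can be sketched in Python. This is only an illustration of the rule, not the code of computeAF.pl: the function name, the input layout (a mapping from day to genome counts) and the reading of the threshold as "at least 1%" are assumptions made for this example.

```python
from collections import defaultdict

def high_frequency_variants(daily_counts, min_freq=0.01, min_days=15):
    """Return the variants that reach min_freq on more than min_days days.

    daily_counts maps a day to (total_genomes, {variant: genomes_with_variant}).
    This layout is hypothetical and chosen for the example only.
    """
    days_above = defaultdict(int)
    for day, (total, counts) in daily_counts.items():
        if total == 0:
            continue  # no genomes sampled on this day
        for variant, n in counts.items():
            if n / total >= min_freq:
                days_above[variant] += 1
    # Days need not be consecutive: only the number of qualifying days matters.
    return {v for v, d in days_above.items() if d > min_days}
```

Both thresholds are exposed as parameters, so the same sketch can be used to explore stricter or looser criteria.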
The output of computeAF.pl is a genomic variants file. These files have a very streamlined format, which is briefly illustrated below. Each genomic variant is reported in the format position_ref|alt (for example 8782_C|T), where position = genomic coordinate on the reference genome, ref = reference sequence on the genome and alt = alternative sequence on the genome.
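As a minimal sketch, a genomic variant written as position_ref|alt (as in 8782_C|T) can be split into its three fields in Python. The names GenomicVariant and parse_variant are hypothetical and are not part of HaploCoV.

```python
from typing import NamedTuple

class GenomicVariant(NamedTuple):
    position: int  # genomic coordinate on the reference genome
    ref: str       # reference sequence
    alt: str       # alternative sequence

def parse_variant(text):
    """Parse a string such as '8782_C|T' into its three fields."""
    position, alleles = text.strip().split("_", 1)
    ref, alt = alleles.split("|", 1)
    return GenomicVariant(int(position), ref, alt)

variant = parse_variant("8782_C|T")
print(variant.position, variant.ref, variant.alt)
```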
A genomic variants file consists of 2 columns separated by tabs. The first column reports a genomic variant, the second the list of places (countries, macro-areas, etc.) where the genomic variant shows a prevalence above the threshold. Genomic variants are reported in no specific order. An example of a genomic variants file looks like:
Several collections of pre-computed genomic variants files are already available from the main GitHub repository of HaploCoV. These files enable users to run their analyses with sets of genomic variants that are defined according to different criteria and are suitable for different use cases.
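A genomic variants file, whether computed locally or downloaded, can be loaded with a few lines of Python. The sketch below relies only on the two tab-separated columns described above. Because the delimiter used inside the list of places is not specified here, the second column is kept as raw text. The function name and the sample content are hypothetical.

```python
import io

def read_variants_file(handle):
    """Map each genomic variant to the raw text of its places column."""
    places_by_variant = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        variant, _, places = line.partition("\t")
        places_by_variant[variant] = places
    return places_by_variant

# Hypothetical content, for illustration only (not taken from a real file).
example = io.StringIO("8782_C|T\tplaceA\n28144_T|C\tplaceB\n")
table = read_variants_file(example)
print(sorted(table))
```

With a real file, the same function can be called on an open file handle, for instance `read_variants_file(open("global_list.txt"))`.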
Pre-computed genomic variants files
The files global_list.txt, area_list.txt and country_list.txt from the main repository provide lists of genomic variants that showed a high frequency:
at global level: global_list.txt;
in at least one macro geographic area: area_list.txt (see here);
in at least one country: country_list.txt.
Each file is regenerated to incorporate new data on a bi-weekly basis (every Wednesday). If you do not want to compute high frequency alleles yourself, you can download the files directly from GitHub. On a Unix system this can be done with the wget command.
For example:
1. global_list.txt: `wget https://raw.githubusercontent.com/matteo14c/HaploCoV/master/global_list.txt`
2. area_list.txt: `wget https://raw.githubusercontent.com/matteo14c/HaploCoV/master/area_list.txt`
3. country_list.txt: `wget https://raw.githubusercontent.com/matteo14c/HaploCoV/master/country_list.txt`
HaploCoV also features additional sets of genomic variants files, which might be suitable for different use cases. These files are found under the folder “alleleVariantSet”.
Designations files in HaploCoV
In HaploCoV, viral lineages/variants are defined by considering the complete collection of genomic variants that are observed in at least 50%+1 of the genomes assigned to a designation. The main repository on GitHub includes linDefMut, a file that provides the complete list of genomic variants that define lineages of SARS-CoV-2 according to the Pango nomenclature. In HaploCoV we refer to this type of file as a designations file. In a designations file each lineage/designation is reported on a single line, followed by the complete list of its defining genomic variants. Genomic variants are indicated according to the convention described above.
For example, the entry for lineage A indicates that the lineage is defined by 2 genomic variants: 8782_C|T and 28144_T|C.
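The "50%+1" rule can be illustrated with a short Python sketch. It is not the code used by HaploCoV: the function name and the input layout (one set of genomic variants per genome) are assumptions, and "50%+1" is read here as a strict majority of the genomes assigned to the designation.

```python
from collections import Counter

def defining_variants(genomes):
    """Return the variants found in a strict majority of the genomes.

    genomes is a list of sets, one per genome assigned to the designation
    (hypothetical layout, chosen for this example only).
    """
    threshold = len(genomes) // 2 + 1  # "50%+1", read as a strict majority
    counts = Counter(v for genome in genomes for v in set(genome))
    return sorted(v for v, c in counts.items() if c >= threshold)

# Hypothetical genomes, for illustration only.
genomes = [
    {"8782_C|T", "28144_T|C"},
    {"8782_C|T", "28144_T|C"},
    {"8782_C|T"},
    {"241_C|T"},
]
print(defining_variants(genomes))
```

With 4 genomes the threshold is 3, so a variant seen in exactly half of the genomes is not treated as defining.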
Novel/custom definitions of lineages and/or groups can be specified simply by adding a definition line in the linDefMut file, or equivalent.
For example, if HaploCoV identifies a novel variant/lineage for you, and you want to track/assign/analyse that variant/lineage, all you have to do is add its “definition” line to linDefMut.
HaploCoV.pl (see the --varfile option) can write designations files, which can be easily concatenated with linDefMut. For example, if you have your additional designations of interest in a file called “novel.txt”, you can add them by using the cat command in a Unix environment:
`cat novel.txt >> linDefMut`
Novel designations
In HaploCoV, novel designations of lineages/variants are indicated by a suffix that is appended to the name of the parental lineage. By default the suffix is composed of the letter N followed by a dot and a progressive number.
For example if HaploCoV identifies 2 novel candidate lineages within the Pango lineage B.1, the names will be:
The string/letter used to indicate novel variants is set by the --suffix option in augmentClusters.pl.