HaploCoV.pl
Once the data have been converted to HaploCoV format, the complete workflow can be executed by running HaploCoV.pl. HaploCoV.pl is the workhorse of HaploCoV and is the recommended way to execute our software.
Users can specify a list of geographic regions/areas or countries, and the intervals of time to be considered in their analyses, by providing a configuration file in text format (the locales file, described below). HaploCoV.pl will process the configuration file and apply the complete workflow to each entity listed in it. For every distinct country, area or region, results will be provided in the form of a report (.rep) file.
The report lists candidate SARS-CoV-2 variants/lineages that show a significant increase in their VOC-ness score and/or prevalence, and which are probably worth monitoring. More details on the interpretation of the report are provided in the sections How to interpret HaploCoV’s results and What to do next.
Options
HaploCoV.pl accepts the following options:
--file: name of the input file (metadata file in HaploCoV format);
--locales: configuration file with the list of regions and countries to analyse;
--param: configuration file with the set of parameters to be applied by HaploCoV in your analysis;
--path: path to your HaploCoV installation;
--varfile: write designations file.
Execution
An example of a valid command line is reported below:
perl HaploCoV.pl --file linearDataSorted.txt --locales italy.loc
The italy.loc configuration file specifies the geographic regions/countries and the time interval to be included in the analyses. The content of the file is briefly summarized below. A more comprehensive discussion of locales configuration files is reported in the next section.
location qualifier start-date end-date genomic-variants
Italy country 2022-07-01 2023-01-26 custom
According to this configuration, only sequences collected in Italy from 2022-07-01 to 2023-01-26 will be considered by HaploCoV. Since the type of analysis was set to “custom” and the target geographic region to Italy, the final output file will be named "Italy_custom.rep".
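For scripted runs, the command line above can be assembled programmatically. The following Python sketch is not part of HaploCoV: the function name build_haplocov_command is hypothetical, and only the options listed in this section are used. It assumes that perl is available on the PATH and that the script can be reached as HaploCoV.pl.

```python
import shlex


def build_haplocov_command(metadata_file, locales_file, param_file=None,
                           install_path=None, varfile=None,
                           script="HaploCoV.pl"):
    """Return the argument list for one HaploCoV.pl run (hypothetical helper)."""
    # --varfile accepts only the three values documented in this manual.
    if varfile is not None and varfile not in {"n", "b", "a"}:
        raise ValueError("--varfile must be one of: n, b, a")
    cmd = ["perl", script, "--file", metadata_file, "--locales", locales_file]
    if param_file is not None:
        cmd += ["--param", param_file]
    if install_path is not None:
        cmd += ["--path", install_path]
    if varfile is not None:
        cmd += ["--varfile", varfile]
    return cmd


cmd = build_haplocov_command("linearDataSorted.txt", "italy.loc")
print(shlex.join(cmd))
# To actually launch the analysis, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list, rather than a single string, avoids quoting problems when file names contain spaces.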
Configuration (Locales file)
Locales configuration files are used by HaploCoV.pl to set the main parameters for the execution of your analyses. These files configure the places and the intervals of time that HaploCoV will analyse. There is no limit to the number of geographic locations and time intervals that can be specified. As outlined in the example below, however, each needs to be indicated on a separate line of your locales file.
Locales files need to have a tabular format and contain 5 columns separated by tabulations. The file locales.txt included in the GitHub repository provides a valid example of a locales configuration file.
Locales file

| location | qualifier | start-date | end-date | genomic-variants |
|---|---|---|---|---|
| Italy | country | 2022-01-01 | 2022-11-11 | areas_list.txt |
| Thailand | country | 2022-01-01 | 2022-11-11 | custom |
| world | area | 2022-01-01 | 2022-01-01 | custom |

The file includes the following columns, in this fixed order: location, qualifier, start-date, end-date and genomic-variants.
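The layout described above (5 tab-separated columns, one location and time interval per line) can be checked before launching an analysis. The Python sketch below is only an illustration and is not shipped with HaploCoV: parse_locales is a hypothetical helper, and skipping a line that repeats the column names is an assumption, since this section does not state whether the file carries a header.

```python
from datetime import date

COLUMNS = ("location", "qualifier", "start-date", "end-date", "genomic-variants")


def parse_locales(text):
    """Parse the content of a locales file into a list of dictionaries."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {lineno}: expected 5 tab-separated columns, got {len(fields)}")
        if tuple(fields) == COLUMNS:
            continue  # assumption: an optional header line is tolerated
        location, qualifier, start, end, variants = fields
        start_date = date.fromisoformat(start)  # dates are written as YYYY-MM-DD
        end_date = date.fromisoformat(end)
        if start_date > end_date:
            raise ValueError(f"line {lineno}: start-date is after end-date")
        entries.append({
            "location": location,
            "qualifier": qualifier,
            "start": start_date,
            "end": end_date,
            # several genomic variants files may be given, comma separated
            "genomic_variants": variants.split(","),
        })
    return entries


example = "\n".join("\t".join(row) for row in [
    ("Italy", "country", "2022-01-01", "2022-11-11", "areas_list.txt"),
    ("Thailand", "country", "2022-01-01", "2022-11-11", "custom"),
    ("world", "area", "2022-01-01", "2022-01-01", "custom"),
])
entries = parse_locales(example)
for entry in entries:
    print(entry["location"], entry["genomic_variants"])
```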
Output (and intermediate files folder)
The name of the main output file of HaploCoV.pl is set automatically by the program, by combining the value provided in the location (first) column with the value/values reported in the genomic-variants (fifth) column of your locales configuration file. In the example above, 3 different output files will be generated:
Italy_areas_list.txt.rep;
Thailand_custom.rep;
world_custom.rep.
Each execution of HaploCoV produces several temporary/intermediate files. Normally you will not need to read, process or use these files; however, for your convenience, all the intermediate files are saved in a folder.
The same conventions applied for naming the main output files are also used to give names to the folders with intermediate files. In the example outlined above, intermediate files will be saved in 3 different folders, called:
Italy_areas_list.txt_results;
Thailand_custom_results;
world_custom_results.
Additional explanations concerning the intermediate files produced by HaploCoV and what to make of them are provided in the section: Intermediate files and what to make of them.
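The naming convention described above can be reproduced with a few lines of Python, for instance to locate the reports after a batch run. This is only a sketch: output_names is a hypothetical helper, and it covers the case of a single value in the genomic-variants column, since the exact name produced for a comma-separated list of files is not stated in this section.

```python
def output_names(location, genomic_variants):
    """Return (report file, intermediate folder) for one line of a locales file."""
    stem = f"{location}_{genomic_variants}"
    return f"{stem}.rep", f"{stem}_results"


for location, variants in [("Italy", "areas_list.txt"),
                           ("Thailand", "custom"),
                           ("world", "custom")]:
    report, folder = output_names(location, variants)
    print(report, folder)
```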
Genomic variants files (Configuration II)
HaploCoV uses collections of genomic variants with high frequency in a specific country/region/locale to define and identify novel candidate variants/lineages of SARS-CoV-2.
For your convenience, a collection of pre-computed genomic variants files is available in the main repository, in the folder alleleVariantSet. If you want to use one or more of these files, you simply have to enter their names (comma separated) in the fifth column of your locales configuration file. HaploCoV will detect the files and run all the analyses for you.
Precomputed sets of genomic variants/files can broadly be categorized into 4 main classes:
All these files can be used by HaploCoV alone, or in any combination, to derive novel designations. For example, if a user wants to use the “1020_1080_list.txt” file from the HighVar folder and the “Dec2022_list.txt” file from the HighFreq folder, the following locales configuration file can be used:
| location | qualifier | start-date | end-date | genomic-variants |
|---|---|---|---|---|
| Italy | country | 2022-01-01 | 2022-11-11 | 1020_1080_list.txt,Dec2022_list.txt |

Please see the section Genomic variants file above for additional information on the content of the files and the rationale used to create them.
If the pre-computed files do not suit their use case, users can also derive custom sets of genomic variants by analysing only the selected locale and time frame. In this case the keyword custom needs to be indicated in the fifth column of the locales file (see below). High frequency genomic variants will then be computed based on the current selection.
Locales: special/reserved keywords
When the reserved word world is used in the first column of your locales file, all the sequences in the metadata file will be analysed, irrespective of their geographic origin.
In the fifth (genomic-variants) column you can use the reserved word custom if you need to re-compute high frequency genomic variants based on your selection of genomic sequences, instead of using a pre-computed genomic variants file provided by HaploCoV. When custom is specified, high frequency genomic variants are determined on the fly, based on the user selection.
Parameters file (configuration III)
HaploCoV.pl executes all the tools and utilities in HaploCoV for you, in the right order. However, the workflow is relatively complex, and every tool uses a series of parameters that need to be configured. The parameters file is a special configuration file that can be used to set all the parameters used by every tool in the workflow. A default file with a standard configuration (called parameters) is included in the main repository. This file should suit most use cases; however, users are free to edit it with any text editor according to their needs. To facilitate this process, the file parametersDetailed in the main repository provides an explicit list of all the parameters that can be set, together with their default values.
The format is quite straightforward: each tool is indicated on a line, and the parameters to be set for that tool are listed in the following lines. Values are separated by tabulations. Comments need to be prepended with a “#” symbol. When no parameters are specified, the default values are used. For example:
computeAF.pl
augmentClusters.pl
--size 10
--dist 4
will set computeAF.pl to use its default parameters, while for augmentClusters.pl --dist will be set to 4 and --size to 10.
For a complete list of all the parameters accepted by every tool, please refer to the corresponding section in the manual or see the parametersDetailed file.
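To make the structure of the parameters file concrete, the Python sketch below reads such a file into a dictionary. It is not part of HaploCoV and rests on two assumptions that are not stated in this manual: that parameter lines can be recognised because they start with "--", and that anything following a "#" on a line is a comment. The function name parse_parameters is hypothetical.

```python
def parse_parameters(text):
    """Map every tool named in a parameters file to its explicitly set options."""
    config = {}
    current_tool = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        if line.startswith("--"):  # assumption: parameter lines start with "--"
            if current_tool is None:
                raise ValueError(f"parameter given before any tool: {line!r}")
            # The format uses tabs; any run of whitespace is accepted here.
            parts = line.split(None, 1)
            config[current_tool][parts[0]] = parts[1] if len(parts) > 1 else ""
        else:
            current_tool = line
            config.setdefault(current_tool, {})
    return config


example = "# example configuration\ncomputeAF.pl\naugmentClusters.pl\n--size\t10\n--dist\t4\n"
print(parse_parameters(example))
```

A tool that maps to an empty dictionary, like computeAF.pl in this example, is one for which the default values are used.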
Designations file
The --varfile option can be set to instruct HaploCoV to write a designations file with the list of novel candidate SARS-CoV-2 variants identified by the tool, and the collection of their defining genomic variants.
--varfile can be set to one of 3 possible values:
“n” the designations file is not produced (default);
“b” the designations file includes only variants that passed both thresholds (score and prevalence);
“a” the designations file includes variants that passed either of the thresholds (score or prevalence).
For a more extended explanation of the meaning, format and possible usage/application of this output file, users are kindly invited to read the section: Genomic variants file.
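The effect of the three values accepted by the designations file option can be summarised as a small decision function. The Python sketch below only restates the rules listed above; the function name and its boolean arguments are hypothetical and do not correspond to anything in the HaploCoV code.

```python
def in_designations_file(mode, passed_score, passed_prevalence):
    """Tell whether a candidate variant is written to the designations file."""
    if mode == "n":  # default: no designations file is produced
        return False
    if mode == "b":  # both thresholds must be passed
        return passed_score and passed_prevalence
    if mode == "a":  # passing either threshold is enough
        return passed_score or passed_prevalence
    raise ValueError("mode must be one of: n, b, a")


# A candidate that passed only the score threshold:
for mode in ("n", "b", "a"):
    print(mode, in_designations_file(mode, passed_score=True, passed_prevalence=False))
```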
Intermediate files and what to make of them
At every execution HaploCoV creates a temporary folder with 6 intermediate files (see above). Although normally you will not need to use these files, a brief explanation of their meaning and content is reported in the following section. All these files are produced by different tools in the HaploCoV workflow. More detailed explanations can also be found in the section of the manual that corresponds to each tool.
Intermediate files produced by HaploCoV.pl (the prefix of the name might change according to the names of the input files; suffixes are reported):
How to interpret HaploCoV’s results
The main output of HaploCoV consists of a file in .rep format. This is a simple text file that provides relevant information about novel (candidate) SARS-CoV-2 variants that demonstrated:
an increase in their “VOC-ness” score;
an increase in their prevalence (regionally or globally);
both.
The report contains 3 main sections, which are discussed below. The file India_custom.rep in the main HaploCoV repository provides an example of a .rep file. It contains an analysis of novel variants in India between 2021-01-01 and 2021-04-30, that is, when the Delta and Kappa variants of SARS-CoV-2 emerged and started to spread in the country.
Header and sections
Headers and sections of a .rep file are marked by “#” symbols. The first 4 lines summarize the results and report the number of novel candidate variants that:
1. passed both the prevalence and score thresholds;
2. passed only the score threshold;
3. passed only the prevalence threshold.
After the header, 3 sections follow in the same order indicated by the above numbered list.
Each section is introduced by a # symbol, and concluded by the sentence: “A detailed report follows”. In the report each candidate lineage/variant is introduced by a # followed by a progressive number and its name. Names are according to the convention explained in the section Novel designations, briefly:
name of the parental, dot, one-letter suffix (N by default), progressive number.
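The naming convention can be illustrated with a short Python sketch that builds a candidate name and splits it back into its components. Both helpers are hypothetical and simply follow the rule stated above; the example name B.1.N1 is constructed from that rule and is not taken from an actual report.

```python
import re

# parent, dot, one-letter suffix, progressive number
NAME_PATTERN = re.compile(r"^(?P<parent>.+)\.(?P<suffix>[A-Za-z])(?P<number>\d+)$")


def candidate_name(parent, number, suffix="N"):
    """Build the name of a candidate lineage from its components."""
    if len(suffix) != 1 or not suffix.isalpha():
        raise ValueError("the suffix must be a single letter")
    return f"{parent}.{suffix}{number}"


def split_candidate_name(name):
    """Return (parent, suffix, progressive number) for a candidate name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a candidate name: {name!r}")
    return match["parent"], match["suffix"], int(match["number"])


name = candidate_name("B.1", 1)
print(name, split_candidate_name(name))
```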
Main features of the newly identified lineages/variants are reported in two conceptually distinct sections: Scores and Prevalence.
Scores and novel genomic variants
Reports the following information:
1. The parental lineage of a candidate variant (Parent:). The parental is the lineage/variant from which the lineage/variant defined by HaploCoV descends. As an example:
Parent: B.1 indicates that the parental lineage is B.1
2. The VOC-ness score of the parental and of the candidate new lineage/variant (Score parent: and Score subV:, respectively). The larger the difference between the 2 scores, the more likely it is that the new lineage/variant has “enhanced” VOC-like features. A difference of 5 or above in particular should be considered a strong indication since, in our experience, score differences of 5 or higher have been recorded only when comparing (known) VOC variants, as defined by the WHO, with their parental lineage.
An example of an output line is reported below:
Score parent: 3.28 - Score subV: 15.10
3. A detailed comparison of the genomic variants gained or lost by the novel candidate lineage/designation with respect to its parental. This includes the genomic variants that define the candidate (defined by), those gained with respect to the parent (gained) and those lost with respect to the parent (lost).
Genomic variants are provided in the form of a list separated by spaces (" ") and in the same format indicated above:
<genomicposition>_<ref>|<alt>
An example of the output is reported below:
Genomic variants:
defined by: 210_G|T 241_C|T 3037_C|T 4181_G|T 21618_C|G 22995_C|A 19220_C|T
gained (wrt parent): 21618_C|G 22995_C|A 19220_C|T
lost (wrt parent):
In this case the novel candidate lineage/variant is defined by 3 additional genomic variants compared to its parental.
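The fields of this part of the report are easy to process automatically. The Python sketch below extracts the two scores from a score line, splits a genomic variant into position, reference and alternative allele, and recomputes the gained and lost variants as set differences. It is an illustration only: the helper names are hypothetical, and the list of parental variants used in the example is inferred from the output above (the defining variants minus the gained ones, with nothing lost) and is not printed by HaploCoV.

```python
import re

SCORE_LINE = re.compile(
    r"Score parent:\s*(-?\d+(?:\.\d+)?)\s*-\s*Score subV:\s*(-?\d+(?:\.\d+)?)")
STRONG_DIFFERENCE = 5.0  # differences of 5 or above are a strong indication


def score_gain(line):
    """Return the candidate score minus the parental score."""
    match = SCORE_LINE.search(line)
    if match is None:
        raise ValueError(f"not a score line: {line!r}")
    return float(match.group(2)) - float(match.group(1))


def parse_variant(token):
    """Split '<genomicposition>_<ref>|<alt>' into (position, ref, alt)."""
    position, alleles = token.split("_", 1)
    ref, alt = alleles.split("|", 1)
    return int(position), ref, alt


def gained_and_lost(parent_variants, candidate_variants):
    """Compare two lists of variants, preserving their original order."""
    parent, candidate = set(parent_variants), set(candidate_variants)
    gained = [v for v in candidate_variants if v not in parent]
    lost = [v for v in parent_variants if v not in candidate]
    return gained, lost


gain = score_gain("Score parent: 3.28 - Score subV: 15.10")
print(round(gain, 2), gain >= STRONG_DIFFERENCE)

parent = "210_G|T 241_C|T 3037_C|T 4181_G|T".split()  # inferred, see above
candidate = "210_G|T 241_C|T 3037_C|T 4181_G|T 21618_C|G 22995_C|A 19220_C|T".split()
gained, lost = gained_and_lost(parent, candidate)
print(gained, lost, parse_variant(gained[0]))
```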
Prevalence
This part of the report summarizes the observed prevalence of novel candidate variants/lineages at different locales, over a time span defined by the user (4 weeks by default). The aim is to flag variants that had a high prevalence (1% or more by default) and that demonstrated a significant increase in their spread (2 fold or more by default). Please refer to Prevalence report for more detailed instructions on how the prevalence of a variant is computed and reported by HaploCoV. The prevalence report comprises 3 sections.
Prevalence above the threshold (1% by default)
Here we report the number of distinct intervals and the complete list of locales where/when a prevalence above the minimum prevalence threshold was observed.
For example:
AsiaSO::India::Delhi:5 AsiaSO::India::WestBengal:1
This indicates that the novel candidate lineage/variant had a prevalence above the minimum cut-off value at 5 distinct intervals in Delhi and at only a single interval in West Bengal.
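A line of this kind can be turned into a dictionary with a few lines of Python. The sketch below assumes, based on the example alone, that entries are separated by spaces, that the levels of a locale are joined by "::" and that the number of intervals follows the last ":". The function name is hypothetical.

```python
def parse_interval_counts(line):
    """Map each locale, as a tuple of its levels, to its number of intervals."""
    counts = {}
    for token in line.split():
        locale, _, intervals = token.rpartition(":")  # split on the last ':'
        counts[tuple(locale.split("::"))] = int(intervals)
    return counts


counts = parse_interval_counts("AsiaSO::India::Delhi:5 AsiaSO::India::WestBengal:1")
for locale, intervals in counts.items():
    print(" / ".join(locale), intervals)
```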
Increase (2 fold by default)
For every interval/span of time (default 4 weeks) where the novel candidate lineage/variant had a prevalence above the user defined threshold, and an increase of X folds (X=2 by default) or higher, this section reports:
the place where the increase was observed;
the prevalence at the initial time point of the interval;
and the prevalence at the last time point of the interval.
For example:
Interval: 2021-04-01 to 2021-04-28, increase at 1 locale(s)
List of locale(s): AsiaSO::India::Delhi:0.03-(76),0.08-(117)
This indicates that in the interval of time between April 1st and April 28th, the candidate lineage/variant increased its prevalence in Delhi from 0.03 (3%) to 0.08 (8%). The numbers in brackets, 76 and 117 respectively, indicate the total number of genomic sequences used to estimate the prevalence.
The sentence “The candidate variant/lineage did not show an increase in prevalence greater than the threshold at any interval or locale” is used when no data are available and/or the novel variant did not show an increase in its prevalence.
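The entries in the list of locales can be parsed to recover the fold increase. The Python sketch below is an illustration under the assumption that every entry follows the pattern of the example (locale, initial prevalence and number of sequences, final prevalence and number of sequences); parse_increase is a hypothetical helper and the fold change is simply the ratio of the two prevalences, which may differ from the way HaploCoV computes it internally.

```python
import re

ENTRY = re.compile(
    r"^(?P<locale>.+):(?P<p0>\d+(?:\.\d+)?)-\((?P<n0>\d+)\),"
    r"(?P<p1>\d+(?:\.\d+)?)-\((?P<n1>\d+)\)$")


def parse_increase(entry):
    """Parse one 'locale:prev-(n),prev-(n)' entry of the increase section."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError(f"unexpected entry: {entry!r}")
    start, end = float(match["p0"]), float(match["p1"])
    return {
        "locale": tuple(match["locale"].split("::")),
        "start_prevalence": start,
        "start_sequences": int(match["n0"]),
        "end_prevalence": end,
        "end_sequences": int(match["n1"]),
        # ratio of final to initial prevalence; infinite if it started from zero
        "fold": end / start if start > 0 else float("inf"),
    }


record = parse_increase("AsiaSO::India::Delhi:0.03-(76),0.08-(117)")
print(record["locale"], round(record["fold"], 2))
```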
Prevalence in time
This section reports the latest prevalence of the candidate variant/lineage as estimated by HaploCoV. For example:
Latest prevalence:
AsiaSO 2021-04-30 0.0294-(136)
AsiaSO::India 2021-04-30 0.0294-(136)
This indicates that the latest prevalence of the candidate lineage/variant, on April 30th 2021, was 0.029 (~3%) in South Asia and India.
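The lines of this section can be read with the short Python sketch below. It assumes that the three fields (locale, date, prevalence) are separated by whitespace, as in the example, and that the number in brackets is the number of genomic sequences used for the estimate, as in the previous section. The function name is hypothetical.

```python
from datetime import date


def parse_latest_prevalence(line):
    """Parse one 'locale date prevalence-(n)' line of the latest prevalence section."""
    locale, day, value = line.split()
    prevalence, _, sequences = value.partition("-(")
    return {
        "locale": tuple(locale.split("::")),
        "date": date.fromisoformat(day),
        "prevalence": float(prevalence),
        "sequences": int(sequences.rstrip(")")),
    }


for line in ("AsiaSO 2021-04-30 0.0294-(136)",
             "AsiaSO::India 2021-04-30 0.0294-(136)"):
    record = parse_latest_prevalence(line)
    print(record["locale"], record["date"], f"{record['prevalence']:.1%}")
```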