Computational resources
The HaploCoV workflow can be executed in a reasonable time on a modern laptop. However, be aware that some of the input files can be extremely large. For example, the complete fasta file with all SARS-CoV-2 genome sequences available from the GISAID database exceeds 300 GB, while the complete metadata files from GISAID are over 8 GB. Moreover, some tasks can take up to a few days on a single processor (see, for example, 6 Assign genomes to new groups). In light of these considerations, we kindly invite users to make sure that they have access to the required computational resources before executing the HaploCoV workflow. The table below briefly summarizes the time, RAM (peak memory) and disk space required by each tool in HaploCoV, and by the complete workflow.
| Tool | Input files | RAM (peak memory) | Time | Output (size) |
|---|---|---|---|---|
| addToTable.pl | sequences.fasta > 300G; metadata.tsv ~10G | 6.0G - 8.0G | ~20 genomes per hour (on a single CPU) | 6.0G - 8.0G |
| NextStrainToHaploCoV.pl | metadata.tsv ~10G | < 1G | ~10 min | 3.0G - 4.0G |
| computeAF.pl | HaploCoV-formatted metadata. 4.0G - 9.0G | 4.0G - 6.0G | ~20 min | ~2.0G |
| augmentClusters.pl | HaploCoV-formatted metadata. 4.0G - 9.0G | 6.0G - 8.0G | ~20 min | ~10 K |
| assign.pl / p_assign.pl | HaploCoV-formatted metadata. 4.0G - 9.0G | < 1.0G | 4M genomes per hour | HaploCoV-formatted metadata. 4.0G - 9.0G |
| LinToFeats.pl | lineages/variants definition ~10 Kb | < 1.0G | < 5 min | ~20 K |
| report.pl | lineages/variants features ~20 Kb | < 1.0G | < 2 min | ~20 K |
| subset.pl | HaploCoV-formatted metadata. 4.0G - 9.0G | < 1.0G | 5-10 min | HaploCoV-formatted metadata. 4.0G - 9.0G |
| increase.pl | HaploCoV-formatted metadata. 4.0G - 9.0G | < 1.0G | 10-15 min | 10 M. Prevalence report |
| HaploCoV.pl | HaploCoV-formatted metadata. 4.0G - 9.0G | 6.0G - 8.0G | 3 hours on full data | 20 K. HaploCoV report |
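
The throughput figures in the table can be turned into a rough run-time estimate for your own dataset. The snippet below is a minimal sketch of that arithmetic, not part of HaploCoV: the function name `estimate_hours` and the dataset size of 10 million genomes are hypothetical, and the rate of 4M genomes per hour is simply the figure reported above for assign.pl / p_assign.pl.

```python
def estimate_hours(n_genomes: int, genomes_per_hour: float) -> float:
    """Rough run-time estimate: number of genomes divided by throughput."""
    if n_genomes < 0:
        raise ValueError("n_genomes must be non-negative")
    if genomes_per_hour <= 0:
        raise ValueError("genomes_per_hour must be positive")
    return n_genomes / genomes_per_hour


# Throughput reported in the table for assign.pl / p_assign.pl.
ASSIGN_GENOMES_PER_HOUR = 4_000_000

# Hypothetical dataset size, used only for illustration.
my_genomes = 10_000_000

hours = estimate_hours(my_genomes, ASSIGN_GENOMES_PER_HOUR)
print(f"Estimated time for assignment: {hours:.1f} hours")
```

Treat the result as an order-of-magnitude guide only, since actual throughput depends on your hardware and on how the tool is run.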
If you already have all your metadata in HaploCoV format, executing the full workflow should require less than 3 hours. If you use a locales file to restrict the analyses to a specific time interval or geographic region, execution times should be considerably reduced (see HaploCoV: workflow). In any case, execution times may also vary depending on your computational environment.
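
Before launching a long run, it may help to check that the machine has enough free disk space and RAM. The following is a minimal sketch using only the Python standard library and is not part of HaploCoV. The function names and the thresholds (8 GiB of RAM, 10 GiB of free disk) are illustrative assumptions loosely based on the table above, and reading "G" as GiB is also an assumption. The RAM check relies on `os.sysconf`, which is only available on Unix-like systems, so the function returns `None` where the value cannot be determined.

```python
import os
import shutil

GIB = 1024 ** 3


def free_disk_gib(path: str = ".") -> float:
    """Free disk space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def total_ram_gib():
    """Total physical RAM in GiB, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the name is unsupported.
        return None


def preflight(min_ram_gib: float, min_disk_gib: float, path: str = ".") -> list:
    """Return a list of human-readable warnings (empty if all checks pass)."""
    warnings = []
    disk = free_disk_gib(path)
    if disk < min_disk_gib:
        warnings.append(
            f"Only {disk:.1f} GiB free in {path!r}; {min_disk_gib} GiB suggested."
        )
    ram = total_ram_gib()
    if ram is None:
        warnings.append("Could not determine total RAM on this platform.")
    elif ram < min_ram_gib:
        warnings.append(f"Only {ram:.1f} GiB RAM; {min_ram_gib} GiB suggested.")
    return warnings


if __name__ == "__main__":
    # Illustrative thresholds (assumptions): adjust to the tools you plan to run.
    for message in preflight(min_ram_gib=8, min_disk_gib=10):
        print("WARNING:", message)
```

Run the check from the directory where the output files will be written, so that the disk check applies to the right filesystem.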