5000-het: Dataset of Nucleotide Sequences with a Form of Evolutionary Sequence Length Heterogeneity

Liu, Baqiao; Shen, Chengze; Warnow, Tandy

doi:10.13012/B2IDB-3974819_V1


Cite this dataset:

Liu, Baqiao; Shen, Chengze; Warnow, Tandy (2022): 5000-het: Dataset of Nucleotide Sequences with a Form of Evolutionary Sequence Length Heterogeneity. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-3974819_V1

Use this persistent URL to link to this dataset: https://doi.org/10.13012/B2IDB-3974819_V1

Metadata


Dataset Description	Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions is often questioned when compared with empirical datasets. In particular, simulated data often do not display heterogeneity in sequence lengths, a common feature of biological datasets. To imitate sequence length heterogeneity, we present a set of data evolved under a mixture model of indel lengths, in which indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., in GTR parameters) analogous to the 1000M condition presented in the SATe paper (doi: 10.1126/science.1171243), but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/). For more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.
Subject	Technology and Engineering
Keywords	simulated data; sequence length heterogeneity; multiple sequence alignment
License	CC BY
Corresponding Creator	Tandy Warnow
Downloaded	491 times
Related Materials (1)	Article: Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow. bioRxiv 2023.06.12.544642; doi: https://doi.org/10.1101/2023.06.12.544642
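
The mixture model of indel lengths mentioned in the Dataset Description can be illustrated with a minimal Python sketch. This is not the INDELible configuration used to generate the dataset; the function name `sample_indel_length`, the geometric length distributions, and the parameter values (`p_long`, `short_mean`, `long_mean`) are all assumptions chosen for illustration. The actual settings are in the INDELible control files linked above.

```python
import random


def sample_indel_length(rng, p_long=0.01, short_mean=2.0, long_mean=100.0):
    """Draw one indel length from a two-component mixture (illustrative only).

    All parameter values are hypothetical and are not taken from the dataset.
    """
    # With small probability p_long, the indel is "promoted" to a long event.
    mean = long_mean if rng.random() < p_long else short_mean
    # Geometric length on {1, 2, ...}: success probability 1/mean gives that mean.
    p = 1.0 / mean
    length = 1
    while rng.random() >= p:
        length += 1
    return length


rng = random.Random(0)
lengths = [sample_indel_length(rng) for _ in range(10_000)]
```

Under these assumed settings most sampled indels are short, while a small fraction are much longer; rare long events of this kind are what produce the sequence length heterogeneity the description refers to.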

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-3974819_V1		2022-08-05

Files

Change Log

Contact the Research Data Service for help interpreting this log.

Dataset	update: {"all_globus"=>[nil, true]}	2026-01-16T15:41:55Z
Dataset	update: {"all_medusa"=>[nil, true]}	2026-01-16T15:36:46Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1101/2023.06.12.544642", "uri"=>"10.1101/2023.06.12.544642", "uri_type"=>"DOI", "citation"=>"Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow\r\nbioRxiv 2023.06.12.544642; doi: https://doi.org/10.1101/2023.06.12.544642", "dataset_id"=>2371, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo", "note"=>"", "feature"=>false}	2023-08-23T16:33:56Z
Dataset	update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Technology and Engineering"]}	2022-09-28T17:45:35Z
Dataset	update: {"description"=>["Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.", "Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often does not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). 
This dataset is otherwise (e.g., in GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het."]}	2022-08-05T01:44:31Z
Dataset	update: {"description"=>["Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often come under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.", "Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). 
This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het."]}	2022-08-05T01:43:48Z
Dataset	update: {"description"=>["Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often come under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, seed README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.", "Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often come under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). 
This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het."]}	2022-08-05T01:38:26Z
