Illinois Data Bank - Dataset

Version: 1
DOI: 10.13012/B2IDB-3974819_V1
Comment: (none)
Publication Date: 2022-08-05

2.07 KB File
1.2 GB File

Contact the Research Data Service for help interpreting this log.

RelatedMaterial create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1101/2023.06.12.544642", "uri"=>"10.1101/2023.06.12.544642", "uri_type"=>"DOI", "citation"=>"Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow\r\nbioRxiv 2023.06.12.544642; doi: https://doi.org/10.1101/2023.06.12.544642", "dataset_id"=>2371, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo", "note"=>"", "feature"=>false} 2023-08-23T16:33:56Z
Dataset update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Technology and Engineering"]} 2022-09-28T17:45:35Z
Dataset update: {"description"=>["Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.", "Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often does not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., in GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het."]} 2022-08-05T01:44:31Z
Dataset update: {"description"=>["Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often come under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.", "Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het."]} 2022-08-05T01:43:48Z
Dataset update: {"description"=>["Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often come under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, seed README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.", "Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often come under question compared to empirical datasets. In particular, simulated data often do not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/).\r\n\r\nFor more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het."]} 2022-08-05T01:38:26Z