Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment

Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy

doi:10.13012/B2IDB-2567453_V1

Cite this dataset:

Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy (2022): Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2567453_V1

Use this persistent URL to link to this dataset: https://doi.org/10.13012/B2IDB-2567453_V1

Metadata


Dataset Description	This upload contains all datasets used in Experiment 2 of the EMMA paper (which appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".

The zip file has the following structure (presented as an example):

salma_paper_datasets/
|_README.md
|_10aa/
|_crw/
|_homfam/
  |_aat/
  | |_...
  |_...
|_het/
  |_5000M2-het/
  | |_...
  |_5000M3-het/
  ...
|_rec_res/

Generally, the structure can be viewed as:
[category]/[dataset]/[replicate]/[alignment files]

# Categories
1. 10aa: 10 small biological protein datasets within the `10aa` directory, each with just one replicate.
2. crw: 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned versions from Shen et al. 2022 (MAGUS+eHMM).
3. homfam: The 10 largest Homfam datasets, each with one replicate.
4. het: Three nucleotide datasets newly simulated for this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
5. rec_res: The Rec and Res datasets. Details of dataset generation can be found in the supplementary materials of the paper.

# Alignment files
There are at most 6 `.fasta` files in each sub-directory:
1. `all.unaln.fasta`: All unaligned sequences.
2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only those that do are included.
3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences whose lengths are not within 25% of the median length (i.e., not full-length sequences).
4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only those that do are included.
5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences whose lengths are within 25% of the median length (i.e., full-length sequences).
6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only those that do are included.

> If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.
> If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.
> If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.

# Additional file(s)
1. `350378genomes.txt`: Contains the names of all 350,378 bacterial and archaeal genomes that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.
Subject	Technology and Engineering
Keywords	SALMA;MAFFT;alignment;eHMM;sequence length heterogeneity
License	CC0
Funder	U.S. National Science Foundation (NSF)-Grant:2006069
Funder	U.S. Department of Energy (DOE)-Grant:DE-NA0003525
Corresponding Creator	Chengze Shen
Downloaded	459 times
Related Materials (2)
Article	Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow. bioRxiv 2023.06.12.544642; doi: https://doi.org/10.1101/2023.06.12.544642
Article	Shen, C., Liu, B., Williams, K.P. et al. EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment. Algorithms Mol Biol 18, 21 (2023). https://doi.org/10.1186/s13015-023-00247-x
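
The dataset description defines backbone sequences as those whose lengths are within 25% of the median length, and query sequences as all others. The following is a minimal Python sketch of that partition, not the authors' code. It assumes that "within 25% of the median" means |length - median| <= 0.25 * median with the boundary included, and that the median is taken over all unaligned sequences in a replicate. The names `read_fasta`, `split_by_length`, and `tolerance` are hypothetical and chosen only for illustration.

```python
from statistics import median


def read_fasta(lines):
    """Parse an iterable of FASTA text lines into a dict of name -> sequence."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record keyed by the text after '>'.
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            # Sequence line: append to the current record (handles wrapped FASTA).
            seqs[name] += line
    return seqs


def split_by_length(seqs, tolerance=0.25):
    """Split unaligned sequences into (backbone, queries) by length.

    Assumption (not confirmed by the source): a sequence is 'full-length'
    when abs(len - median) <= tolerance * median, boundary inclusive.
    """
    med = median(len(s) for s in seqs.values())
    backbone, queries = {}, {}
    for name, s in seqs.items():
        if abs(len(s) - med) <= tolerance * med:
            backbone[name] = s
        else:
            queries[name] = s
    return backbone, queries
```

Under these assumptions, passing the lines of a replicate's `all.unaln.fasta` to `read_fasta` and then calling `split_by_length` would be expected to yield sets comparable to `backbone.unaln.fasta` and `all-queries.unaln.fasta`. Any mismatch would suggest that the authors' exact threshold convention differs from the one assumed here.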

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-2567453_V1		2022-08-08

Files

Change Log

Contact the Research Data Service for help interpreting this log.

Dataset	update: {"all_globus"=>[nil, true]}	2026-01-16T15:42:46Z
Dataset	update: {"all_medusa"=>[nil, true]}	2026-01-16T15:36:42Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1186/s13015-023-00247-x", "uri"=>"10.1186/s13015-023-00247-x", "uri_type"=>"DOI", "citation"=>"Shen, C., Liu, B., Williams, K.P. et al. EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment. Algorithms Mol Biol 18, 21 (2023). https://doi.org/10.1186/s13015-023-00247-x", "dataset_id"=>2370, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo", "note"=>nil, "feature"=>nil}	2023-12-13T21:24:35Z
Dataset	update: description: "(to appear in WABI 2023)" changed to "(appeared in WABI 2023)"; the rest of the description is unchanged	2023-09-13T17:29:08Z
Creator	update: {"given_name"=>["Kelly", "Kelly P."]}	2023-09-13T17:23:29Z
Dataset	update: description: "Experiments 2 and 3 of the EMMA paper" changed to "Experiment 2 of the EMMA paper"; the rest of the description is unchanged	2023-09-13T17:23:29Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1101/2023.06.12.544642", "uri"=>"10.1101/2023.06.12.544642", "uri_type"=>"DOI", "citation"=>"Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow\r\nbioRxiv 2023.06.12.544642; doi: https://doi.org/10.1101/2023.06.12.544642", "dataset_id"=>2370, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo", "note"=>"", "feature"=>false}	2023-08-23T16:33:08Z
Dataset	update: description: stray "{" removed from before the paper title ("{EMMA: A New Method ..." changed to "EMMA: A New Method ..."); the rest of the description is unchanged	2023-07-26T19:05:34Z
Dataset	update: title: "Datasets for SALMA: Scalable ALignment using MAFFT-add" changed to "Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment"; description: "Experiments 2 and 3 of the SALMA paper (pending submission)" changed to "Experiments 2 and 3 of the EMMA paper (to appear in WABI 2023)", and the cited title "SALMA: Scalable ALignment using MAFFT-Add" changed to "{EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment"; the rest of the description is unchanged	2023-07-26T19:05:17Z
Dataset	update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Technology and Engineering"]}	2022-09-28T17:42:53Z
