Illinois Data Bank - Dataset

Version | DOI | Comment | Publication Date
1 | 10.13012/B2IDB-2567453_V1 | | 2022-08-08

2.72 KB File
573 MB File

Contact the Research Data Service for help interpreting this log.

RelatedMaterial create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1186/s13015-023-00247-x", "uri"=>"10.1186/s13015-023-00247-x", "uri_type"=>"DOI", "citation"=>"Shen, C., Liu, B., Williams, K.P. et al. EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment. Algorithms Mol Biol 18, 21 (2023). https://doi.org/10.1186/s13015-023-00247-x", "dataset_id"=>2370, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo", "note"=>nil, "feature"=>nil} 2023-12-13T21:24:35Z
Dataset update: {"description"=>["This upload contains all datasets used in Experiment 2 of the EMMA paper (to appear in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. 
If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences.", "This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. 
homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences."]} 2023-09-13T17:29:08Z
Creator update: {"given_name"=>["Kelly", "Kelly P."]} 2023-09-13T17:23:29Z
Dataset update: {"description"=>["This upload contains all datasets used in Experiments 2 and 3 of the EMMA paper (to appear in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. 
If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences.", "This upload contains all datasets used in Experiment 2 of the EMMA paper (to appear in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. 
homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences."]} 2023-09-13T17:23:29Z
RelatedMaterial create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1101/2023.06.12.544642", "uri"=>"10.1101/2023.06.12.544642", "uri_type"=>"DOI", "citation"=>"Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow\r\nbioRxiv 2023.06.12.544642; doi: https://doi.org/10.1101/2023.06.12.544642", "dataset_id"=>2370, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo", "note"=>"", "feature"=>false} 2023-08-23T16:33:08Z
Dataset update: {"description"=>["This upload contains all datasets used in Experiments 2 and 3 of the EMMA paper (to appear in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"{EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. 
If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences.", "This upload contains all datasets used in Experiments 2 and 3 of the EMMA paper (to appear in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. 
homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences."]} 2023-07-26T19:05:34Z
Dataset update: {"title"=>["Datasets for SALMA: Scalable ALignment using MAFFT-add", "Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment"], "description"=>["This upload contains all datasets used in Experiments 2 and 3 of the SALMA paper (pending submission): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"SALMA: Scalable ALignment using MAFFT-Add\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM).\r\n3. homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. 
`all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences.", "This upload contains all datasets used in Experiments 2 and 3 of the EMMA paper (to appear in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. \"{EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment\".\r\n\r\nThe zip file has the following structure (presented as an example):\r\nsalma_paper_datasets/\r\n|_README.md\r\n|_10aa/\r\n|_crw/\r\n|_homfam/\r\n |_aat/\r\n | |_...\r\n |_...\r\n|_het/\r\n |_5000M2-het/\r\n | |_...\r\n |_5000M3-het/\r\n ...\r\n|_rec_res/\r\n\r\n\r\nGenerally, the structure can be viewed as:\r\n[category]/[dataset]/[replicate]/[alignment files]\r\n\r\n# Categories:\r\n1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.\r\n2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 
2022 (MAGUS+eHMM).\r\n3. homfam: There are the 10 largest Homfam datasets, each with one replicate.\r\n4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.\r\n5. rec\\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.\r\n\r\n# Alignment files\r\nThere are at most 6 `.fasta` files in each sub-directory:\r\n1. `all.unaln.fasta`: All unaligned sequences.\r\n2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.\r\n3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).\r\n4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.\r\n5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).\r\n6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.\r\n\r\n>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.\r\n>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.\r\n>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.\r\n\r\n# Additional file(s)\r\n1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences."]} 2023-07-26T19:05:17Z
Dataset update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Technology and Engineering"]} 2022-09-28T17:42:53Z
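
The dataset description recorded in this log defines backbone sequences as those with lengths within 25% of the median length, and query sequences as all others. A minimal sketch of that split follows, for readers who want to check which group a sequence falls in. It is not the authors' code. The function name `split_by_length`, the `tolerance` parameter, the inclusive bounds, and the reading of "within 25%" as the interval [0.75 × median, 1.25 × median] over unaligned sequence lengths are all assumptions made for illustration.

```python
from statistics import median


def split_by_length(seqs, tolerance=0.25):
    """Split {name: unaligned_sequence} into (backbone, queries).

    Assumption: "within 25% of the median length" means the length lies in
    [median * (1 - tolerance), median * (1 + tolerance)], bounds inclusive.
    """
    if not seqs:
        return {}, {}
    med = median(len(s) for s in seqs.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    # Full-length sequences form the backbone; everything else is a query.
    backbone = {n: s for n, s in seqs.items() if low <= len(s) <= high}
    queries = {n: s for n, s in seqs.items() if n not in backbone}
    return backbone, queries


if __name__ == "__main__":
    # Toy sequences (hypothetical); real input would come from all.unaln.fasta.
    toy = {"a": "A" * 100, "b": "A" * 100, "c": "A" * 100,
           "short": "A" * 40, "long": "A" * 130}
    backbone, queries = split_by_length(toy)
    print(sorted(backbone), sorted(queries))
```

Under these assumptions, a sequence is a query whether it is too short or too long relative to the median, which matches the description's "do not have lengths within 25% of the median length".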