---
file_transformation:
  - 2019-07-26 finished CrossMap projections from gnm1 into gnm2. See file crossmap_arahy_v05.sh for details (S. Cannon)

changes:
  - 2018-07-26 Initial repository preparation
  - 2020-04-21 Corrected GFF and fasta files to deal with 26 gene models that were duplicated
    in gnm2.ann1 relative to gnm1.ann1 (24 on 05 and 12 and one on 03 and 13; the latter is not
    strictly speaking a duplication but a split of the old gene model into two pieces resulting
    in two smaller models). The previously duplicated models have been renamed on the following
    chromosomes. The ID modification differs from the original ID string by the last character
    (changing to Z or 0).
    - Arahy.13 from M66GIB to M66GIZ
    - Arahy.05 from D5QJZZ to D5QJZ0
    - Arahy.05 from Y72RXW to Y72RXZ
    - Arahy.05 from D1YXCG to D1YXCZ
    - Arahy.05 from PL6VNF to PL6VNZ
    - Arahy.05 from PFH739 to PFH73Z
    - Arahy.05 from E9WDF1 to E9WDFZ
    - Arahy.05 from V1AVP1 to V1AVPZ
    - Arahy.05 from C24P33 to C24P3Z
    - Arahy.05 from 2XVX2B to 2XVX2Z
    - Arahy.05 from ZIF87W to ZIF87Z
    - Arahy.05 from 942P0F to 942P0Z
    - Arahy.05 from NRP8RH to NRP8RZ
    - Arahy.05 from C3UJQ4 to C3UJQZ
    - Arahy.05 from 10QJ8K to 10QJ8Z
    - Arahy.05 from VRT43M to VRT43Z
    - Arahy.05 from 9RVJ5A to 9RVJ5Z
    - Arahy.05 from XEJ3YL to XEJ3YZ
    - Arahy.05 from FJ4QJG to FJ4QJZ
    - Arahy.05 from 1M918L to 1M918Z
    - Arahy.05 from T4T98J to T4T98Z
    - Arahy.05 from 60JIWJ to 60JIWZ
    - Arahy.05 from 3C43J1 to 3C43JZ
    - Arahy.05 from X4R883 to X4R88Z
    - Arahy.05 from 56DA4Y to 56DA4Z
    - Arahy.05 from 6M2V35 to 6M2V3Z
  - 2020-09-17 added explicit gene family mappings file arahy.Tifrunner.gnm2.ann1.4K0L.legfed_v1_0.M65K.gfa.tsv.gz
  - 2022-03-05 adf updated gene family mappings file arahy.Tifrunner.gnm2.ann1.4K0L.legfed_v1_0.M65K.gfa.tsv.gz
  - 2022-08-09 adf: fixed bgzip and tabix for gene models file
  - 2023-02-12 adf: fix non-unique IDs per https://github.com/legumeinfo/datastore-issues/issues/153 using datastore-specifications/scripts/add_IDs_to_gff_features.pl --clobber CDS --clobber five_prime_UTR --clobber three_prime_UTR
  - 2023-02-13 adf: address https://github.com/legumeinfo/datastore-issues/issues/154 by extracting gfa with an explicit transcript2gene lookup derived from the gff and ignoring proteins in the hmmsearch results with family matches but no gene:
    - zcat arahy.Tifrunner.gnm2.ann1.4K0L.gene_models_main.gff3.gz | awk '$3 == "mRNA" {print $9}' | sed 's/.*ID=\([^;]*\).*Parent=\([^;]*\).*/\1\t\2/' > transcript2gene
    - zcat hmmsearch_legfed_v1_0/proteins.hmmsearch.tbl.gz | hmmsearch_extract.pl --transcript2gene transcript2gene --ignore_missing_gene
  - 2023-05-07 adf: fix https://github.com/legumeinfo/datastore-issues/issues/167 using regex :g/arahy.Tifrunner.gnm2.Arahy./s//arahy.Tifrunner.gnm2.chr/ on files:
    - arahy.Tifrunner.gnm2.ann1.4K0L.cds.bed.gz
    - arahy.Tifrunner.gnm2.ann1.4K0L.gene_models_main.gff3.gz
    - arahy.Tifrunner.gnm2.ann1.4K0L.info_fwd_chain.txt.gz
    - arahy.Tifrunner.gnm2.ann1.4K0L.info_rev_chain.txt.gz
  - 2024-06-24 adf: remade the mrna/cds/protein fasta files from the gff using gffread, to placate
    the intermine annotation loader which will no longer (as of 5.1.0.4) turn a blind eye to
    entries in these fasta files that don't correspond to gff records. Basically, this has the
    effect of excluding the models from the earlier annotation listed in the
    arahy.Tifrunner.gnm2.ann1.4K0L.info_unmapped_models.txt.gz file, although it's possible that
    other subtle differences between what was in those files and what the gff described will
    also be reflected in the gffread-generated contents.