Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: PO ID (where dbname is always PO), DB:Reference, With, Taxon (where dbname is always taxon). For PO ID, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:0000000).
Annotation Association File Format
Collaborating databases and projects provide the POC project with a tab-delimited file, known informally as an "association file". This file carries links between database objects and PO terms. The database object may represent one of gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL, etc. Columns in the file are described below. A sample file containing associations from the Gramene database is provided for comparison.
File Name
po_aspect_objecttype_organism_organization.assoc
For example:
po_anatomy_gene_arabidopsis_tair.assoc
po_growth_gene_arabidopsis_tair.assoc
po_anatomy_gene_oryza_gramene.assoc
po_growth_gene_oryza_gramene.assoc
aspect: growth/anatomy/development.
objecttype: gene/mutant/germplasm etc.
organism: is always the genus, e.g. arabidopsis/oryza/zea.
organization: the institute/project which is contributing the association files.
The file name should be in lowercase, with white spaces replaced by underscores. Ideally the association files for growth and anatomy should be merged into a single file. However, for the moment we are keeping them separate to make sure everything is working correctly. If and when we merge the associations, the "aspect" will be removed from the file names.
For example: po_objecttype_organism_organization.assoc
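The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name build_assoc_filename and its parameters are hypothetical, and collapsing a run of white space into a single underscore is an assumption, since the text only says white spaces are replaced by underscores.

```python
import re
from typing import Optional


def build_assoc_filename(objecttype: str, organism: str, organization: str,
                         aspect: Optional[str] = None) -> str:
    """Build an association file name following the convention in the text.

    Hypothetical helper. If aspect is None, the merged-file form
    po_objecttype_organism_organization.assoc is produced.
    """
    parts = ["po"]
    if aspect is not None:
        parts.append(aspect)
    parts += [objecttype, organism, organization]
    # Lowercase every part; assumption: a run of white space becomes one underscore.
    cleaned = [re.sub(r"\s+", "_", part.strip().lower()) for part in parts]
    return "_".join(cleaned) + ".assoc"


print(build_assoc_filename("gene", "Arabidopsis", "TAIR", aspect="anatomy"))
print(build_assoc_filename("gene", "Oryza", "Gramene"))
```

Leaving aspect optional means the same helper would still apply if the growth and anatomy files are merged later.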
File Format
The GO Annotation File (GAF) 2.0 format comprises 17 tab-delimited fields, several of which are not mandatory. This includes two new columns (16 and 17) that were not part of the GAF 1.0 format.
Make sure the column order is strictly followed. Columns that are left blank must still be present as empty fields, so that every line contains all 17 columns.
Also see the Gene Ontology Annotation Format web page for more information.
* denotes required fields
Column   Content                         Example
1.  *    DB                              GR
2.  *    DB Object ID                    GR:0060905
3.  *    DB Object Symbol                lrd10
4.       Qualifier
5.  *    PO ID                           PO:0007014
6.  *    DB:Reference(|DB:Reference)     GR_ref:5655|PMID:2676709
7.  *    Evidence                        IMP
8.       With (or) From
9.  *    Aspect                          G
10.      DB Object Name                  lesion resembling disease-10
11.      DB Object Synonym(|Synonym)     spl4|bl5|spotted leaf-4
12. *    DB Object Type                  gene
13. *    taxon(|taxon)                   taxon:4527
14. *    Date                            20050303
15. *    Assigned by                     GR
16.      Annotation Extension            part_of(PO:0028002)
17.      Gene Product Form ID            UniProtKB:P12345-2
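As a rough illustration of the column layout, the following Python sketch splits one line of an association file into its 17 fields and checks that the required (*) columns are filled. The names GAF_COLUMNS, REQUIRED and parse_assoc_line are hypothetical, chosen for this example only; the sample line is assembled from the example values in the table.

```python
from typing import Dict, List

# Field names are illustrative labels for the 17 columns, in file order.
GAF_COLUMNS: List[str] = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "po_id",
    "db_reference", "evidence", "with_or_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]

# Column numbers marked with * (required) in the table.
REQUIRED = {1, 2, 3, 5, 6, 7, 9, 12, 13, 14, 15}


def parse_assoc_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited line into a dict keyed by column label."""
    # Strip only the line ending, so trailing empty columns are preserved.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    missing = [name for number, name in enumerate(GAF_COLUMNS, start=1)
               if number in REQUIRED and not fields[number - 1].strip()]
    if missing:
        raise ValueError(f"required columns are empty: {', '.join(missing)}")
    return dict(zip(GAF_COLUMNS, fields))


# Sample line built from the example column of the table; columns 4 and 8 are blank.
example = "\t".join([
    "GR", "GR:0060905", "lrd10", "", "PO:0007014",
    "GR_ref:5655|PMID:2676709", "IMP", "", "G",
    "lesion resembling disease-10", "spl4|bl5|spotted leaf-4", "gene",
    "taxon:4527", "20050303", "GR", "part_of(PO:0028002)",
    "UniProtKB:P12345-2",
])

record = parse_assoc_line(example)
print(record["po_id"], record["db_reference"].split("|"))
```

Blank optional columns come back as empty strings, which keeps the column positions intact.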
Description of the content:
Column 1. DB
The database contributing the association file.
One of the values in the table of database abbreviations.
This field is mandatory, cardinality 1. This column refers to the database from which the identifier in DB Object ID (column 2) is drawn. This is not necessarily the group submitting the file. For example, if a UniProtKB ID is the DB Object ID (column 2), DB (column 1) should be UniProtKB.
Column 2. DB Object ID
A unique identifier in DB for the item being annotated.
This field is mandatory, cardinality 1. In GAF 2.0 format, the identifier must reference a top-level primary gene or gene product identifier: either a gene, or a protein that has a 1:1 correspondence to a gene. Identifiers referring to particular protein isoforms or post-translationally cleaved or modified proteins are not legal values in this field. The DB Object ID (column 2) is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB Object ID field) or annotations to a protein object (protein ID in DB Object ID field).
Column 3. DB Object Symbol
A (unique and valid) symbol to which DB_Object_ID is matched.
Can use ORF name for otherwise unnamed gene or protein.
If gene products are annotated, use gene product symbol if available. Many gene product annotation entries can share a gene symbol.
This field is mandatory, cardinality 1. The DB Object Symbol field should be a symbol that means something to a biologist wherever possible (a gene symbol, for example). It is not an ID or an accession number (DB Object ID [column 2] provides the unique identifier), although IDs can be used as a DB Object Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).
Column 4. Qualifier
Flags that modify the interpretation of an annotation.
One (or more) of NOT, contributes_to, colocalizes_with.
This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to). See also the qualifiers documentation in the GO annotation guide.
Column 5. PO ID
The PO identifier for the term attributed to the DB Object ID.
This field is mandatory, cardinality 1.
Column 6. DB:Reference
The unique identifier appropriate to DB for the authority for the attribution of the PO ID to the DB Object ID. This may be a literature reference or a database record. The syntax is DB:accession_number. Note: If a reference has identifiers in more than one database, multiple identifiers can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. GR:8789|PMID:2676709).
Column 7. Evidence
One of the following evidence codes: IMP, IGI, IPI, IAGP, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.
This field is mandatory, cardinality 1.
Column 8. With (or) From
One of:
DB:gene_symbol
DB:gene_symbol[allele_symbol]
DB:gene_id
DB:protein_name
DB:sequence_id
PO:PO_id
This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1. Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with'. Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information).
For cardinality >1 use a pipe to separate entries (e.g. TAIR:Atg111111|TAIR:Atg222222). Note that a gene/locus ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products. 'PO:PO_id' is used only when the evidence code is 'IC', and refers to the PO term(s) used as the basis of a curator inference. In these cases the entry in DB:Reference (column 6) will be that used to assign the PO term(s) from which the inference is made. This field is mandatory for evidence code IC. The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, PO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations.
Column 9. Aspect
Indicates the branch of the PO to which the PO ID (column 5) belongs.
Either A (plant anatomical entity) or G (plant growth stage and development stage).
This field is mandatory; cardinality 1.
Column 10. DB Object Name
Name of the object, e.g., gene or gene product.
This field is not mandatory, cardinality 0, 1 [white space allowed].
Column 11. Synonym
Any aliases, e.g., Gene_symbol [or other text].
Note that we strongly recommend that gene synonyms are included in the association file, as this aids the searching of PO.
This field is not mandatory, cardinality 0, 1, >1 [white space allowed].
Column 12. DB Object Type
A description of the type of gene product being annotated.
If a Gene Product Form ID (column 17) is supplied, the DB Object Type will refer to that entity; if no Gene Product Form ID is present, it will refer to the entity that the DB Object ID (column 2) is believed to produce and which actively carries out the function or localization described.
One of the following: protein_complex; protein; protein_structure; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the Sequence Ontology; germplasm (stock/cultivar); mutant; QTL. If the precise product type is unknown, gene_product should be used.
The object type (gene_product, transcript, protein, protein_complex, etc.) listed in the DB Object Type field must match the database entry identified by the Gene Product Form ID, or, if this is absent, the expected product of the DB Object ID. Note that DB Object Type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the PO term or the evidence on which the annotation is based. For example, if your database entry represents a protein-encoding gene, then protein goes in the DB Object Type column. The text entered in the DB Object Name and DB Object Symbol should refer to the entity in DB Object ID. For example, several alternative transcripts from one gene may be annotated separately, each with the same gene ID in DB Object ID, and specific gene product identifiers in Gene Product Form ID, but list the same gene symbol in the DB Object Symbol column.
This field is mandatory, cardinality 1.
Column 13. Taxon
Taxonomic identifier(s).
For cardinality 1, the ID of the species encoding the gene product.
For cardinality 2, the first ID is that of the species encoding the gene or gene product; the second ID is that of the other organism in the interaction, such as the species using the gene product.
This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000).
Column 14. Date
Date on which the annotation was made; format is YYYYMMDD.
This field is mandatory, cardinality 1.
Column 15. Assigned by
The database which made the annotation.
One of the values in the table of database abbreviations.
Used for tracking the source of an individual annotation.
Default value is value entered in column 1 (DB).
Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
This field is mandatory, cardinality 1.
Column 16. Annotation Extension
Contains cross references to a PO term that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate PO relationship (for now, part_of or participates_in; use of other relations may be allowed in the future).
One or more of: relation(PO:id)
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries, e.g., part_of(PO:0009025)|participates_in(PO:0001054). Example 1: If a gene product is localized to the leaf tip of a vascular leaf, the PO ID (column 5) would be leaf tip (PO:0025142), and the annotation extension column would contain a cross-reference to part_of vascular leaf (PO:0009025). Example 2: If a gene product is localized in a leaf during senescence, the PO ID (column 5) would be leaf (PO:0009025), and the annotation extension column would contain a cross-reference to participates_in leaf senescence stage (PO:0001054). See additional information and discussion on the PO Annotation Extensions (column 16) wiki page.
Column 17. Gene Product Form ID
As the DB Object ID (column 2) entry must be a canonical entity - a gene OR an abstract protein that has a 1:1 correspondence to a gene - this field allows the annotation of specific variants of that gene or gene product. Contents will frequently include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column. The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2.
When the Gene Product Form ID (column 17) is filled with a protein identifier, the value in DB Object Type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers. When the Gene Product Form ID (column 17) is filled with a functional RNA identifier, the DB Object Type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA. This column may be left blank; if so, the value in DB Object Type (column 12) will provide a description of the expected gene product.
This field is not mandatory, cardinality 0 or 1.
More information and examples are available from the GO wiki page on column 17.
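Several of the per-column rules above lend themselves to simple automated checks. The sketch below is one possible way to test a handful of values in Python; it is not an official PO or GO validator. The function name check_values and the regular expressions are hypothetical, and the 7-digit PO identifier pattern is an assumption inferred from the examples in this document.

```python
import re
from typing import List

EVIDENCE_CODES = {"IMP", "IGI", "IPI", "IAGP", "ISS", "IDA", "IEP",
                  "IEA", "TAS", "NAS", "ND", "IC", "RCA"}
QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with"}
ASPECTS = {"A", "G"}
EXTENSION_RELATIONS = {"part_of", "participates_in"}

PO_ID_RE = re.compile(r"PO:\d{7}")               # assumption: 7 digits, as in PO:0007014
TAXON_RE = re.compile(r"taxon:\d+")
EXTENSION_RE = re.compile(r"(\w+)\((PO:\d{7})\)")  # relation(PO:id)


def split_multi(value: str) -> List[str]:
    """Split a pipe-separated field; an empty field yields no entries."""
    return value.split("|") if value else []


def check_values(po_id: str, qualifier: str, evidence: str, with_from: str,
                 aspect: str, taxon: str, date: str, extension: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not PO_ID_RE.fullmatch(po_id):
        problems.append(f"PO ID not of the form PO:0000000: {po_id!r}")
    for flag in split_multi(qualifier):
        if flag not in QUALIFIERS:
            problems.append(f"unknown qualifier: {flag!r}")
    if evidence not in EVIDENCE_CODES:
        problems.append(f"unknown evidence code: {evidence!r}")
    if evidence == "IC" and not with_from:
        problems.append("With (or) From is mandatory for evidence code IC")
    if aspect not in ASPECTS:
        problems.append(f"aspect must be A or G: {aspect!r}")
    taxa = split_multi(taxon)
    if not 1 <= len(taxa) <= 2 or not all(TAXON_RE.fullmatch(t) for t in taxa):
        problems.append(f"taxon must be one or two taxon:<id> entries: {taxon!r}")
    # YYYYMMDD: eight digits with a plausible month and day (a coarse check only).
    if not (len(date) == 8 and date.isdigit()
            and 1 <= int(date[4:6]) <= 12 and 1 <= int(date[6:8]) <= 31):
        problems.append(f"date is not in YYYYMMDD format: {date!r}")
    for entry in split_multi(extension):
        match = EXTENSION_RE.fullmatch(entry)
        if not match or match.group(1) not in EXTENSION_RELATIONS:
            problems.append(f"bad annotation extension: {entry!r}")
    return problems


print(check_values(po_id="PO:0007014", qualifier="", evidence="IMP", with_from="",
                   aspect="G", taxon="taxon:4527", date="20050303",
                   extension="part_of(PO:0028002)"))
```

Returning a list of problems rather than raising on the first one lets a submitter see every issue in a line at once. The checks cover only syntax; whether an identifier actually exists in PO or in the table of database abbreviations would need a separate lookup.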