Plant Ontology association file format

Annotation Association File Format
Collaborating databases and projects provide the POC project with a tab-delimited file, known informally as an "association file". This file carries links between database objects and PO terms. The database object may be a gene, transcript, protein, protein_structure, complex, germplasm (stock/cultivar), mutant, QTL, etc. Columns in the file are described below. A sample file containing associations from the Gramene database is provided for comparison.
File Name

po_aspect_objecttype_organism_organization.assoc

For example:
po_anatomy_gene_arabidopsis_tair.assoc
po_growth_gene_arabidopsis_tair.assoc
po_anatomy_gene_oryza_gramene.assoc
po_growth_gene_oryza_gramene.assoc

aspect: growth/anatomy/development.
objecttype: gene/mutant/germplasm etc.
organism: always the genus, e.g. arabidopsis/oryza/zea.
organization: the institute/project that is contributing the association files.
The file name should be all lowercase, with white space replaced by underscores.
Ideally the association files for growth and anatomy should be merged into a single file. However, for the moment we are keeping them separate to make sure things are working fine.
If and when we merge the associations, the "aspect" will be removed from the file names.
For example: po_objecttype_organism_organization.assoc
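
The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name assoc_file_name and its parameter names are hypothetical and are not part of any PO tooling.

```python
def assoc_file_name(aspect, object_type, organism, organization):
    """Build a name of the form po_aspect_objecttype_organism_organization.assoc."""
    parts = ["po", aspect, object_type, organism, organization]
    # Lowercase each part and replace runs of white space with one underscore.
    cleaned = ["_".join(part.lower().split()) for part in parts]
    return "_".join(cleaned) + ".assoc"


print(assoc_file_name("anatomy", "gene", "Oryza", "Gramene"))
# prints: po_anatomy_gene_oryza_gramene.assoc
```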

File Format
The GO Annotation File (GAF) 2.0 format comprises 17 tab-delimited fields, several of which are not mandatory. This includes two new columns (16 and 17) that were not part of the GAF 1.0 format.
Make sure the column order is strictly followed. Columns that are left blank must still be present as empty fields, so that every line contains all 17 tab-delimited columns.
Also see the Gene Ontology Annotation Format web page for more information.
* denotes required fields

Column	Content	Example
1. *	DB	GR
2. *	DB Object ID	GR:0060905
3. *	DB Object Symbol	lrd10
4.	Qualifier
5. *	PO ID	PO:0007014
6. *	DB:Reference(|DB:Reference)	GR_ref:5655|PMID:2676709
7. *	Evidence	IMP
8.	With (or) From
9. *	Aspect	G
10.	DB Object Name	lesion resembling disease-10
11.	DB Object Synonym(|Synonym)	spl4|bl5|spotted leaf-4
12. *	DB Object Type	gene
13. *	taxon(|taxon)	taxon:4527
14. *	Date	20050303
15. *	Assigned by	GR
16.	Annotation Extension	part_of(PO:0028002)
17.	Gene Product Form ID	UniProtKB:P12345-2
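
As a rough sketch of how one line of such a file might be read, the Python below splits a line on tabs, checks that all 17 columns are present, and checks that the required (*) columns are not empty. The dictionary keys and the function name parse_assoc_line are made up for this example. The sample line is assembled from the example values in the table above.

```python
# Column names are illustrative keys chosen for this sketch, in file order.
COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "po_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]
# Columns marked with * in the table.
REQUIRED = {
    "db", "db_object_id", "db_object_symbol", "po_id", "db_reference",
    "evidence", "aspect", "db_object_type", "taxon", "date", "assigned_by",
}


def parse_assoc_line(line):
    """Split one association line into a dict keyed by column name."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, found {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    missing = sorted(name for name in REQUIRED if not record[name].strip())
    if missing:
        raise ValueError("empty required column(s): " + ", ".join(missing))
    return record


# Sample line built from the example column of the table; blank columns
# (Qualifier, With) are kept as empty fields.
sample = "\t".join([
    "GR", "GR:0060905", "lrd10", "", "PO:0007014",
    "GR_ref:5655|PMID:2676709", "IMP", "", "G",
    "lesion resembling disease-10", "spl4|bl5|spotted leaf-4", "gene",
    "taxon:4527", "20050303", "GR", "part_of(PO:0028002)",
    "UniProtKB:P12345-2",
])
record = parse_assoc_line(sample)
print(record["db_object_symbol"], record["po_id"])
```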

Description of the content:
Column 1. DB
The database contributing the association file.
One of the values in the table of database abbreviations.
This field is mandatory, cardinality 1.
This column refers to the database from which the identifier in DB object ID (column 2) is drawn. This is not necessarily the group submitting the file. For example, if a UniProtKB ID is the DB object ID (column 2), DB (column 1) should be UniProtKB.
Column 2. DB Object ID
A unique identifier in DB for the item being annotated.
This field is mandatory, cardinality 1.
In GAF 2.0 format, the identifier must reference a top-level primary gene or gene product identifier: either a gene, or a protein that has a 1:1 correspondence to a gene. Identifiers referring to particular protein isoforms or post-translationally cleaved or modified proteins are not legal values in this field.
The DB object ID (column 2) is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB object ID field) or annotations to a protein object (protein ID in DB object ID field).
Column 3. DB Object Symbol
A (unique and valid) symbol to which DB_Object_ID is matched.
Can use ORF name for otherwise unnamed gene or protein.
If gene products are annotated, use gene product symbol if available. Many gene product annotation entries can share a gene symbol.
This field is mandatory, cardinality 1.
The DB Object Symbol field should be a symbol that means something to a biologist wherever possible (a gene symbol, for example). It is not an ID or an accession number (DB object ID [column 2] provides the unique identifier), although IDs can be used as a DB Object Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).
Column 4. Qualifier
Flags that modify the interpretation of an annotation.
One (or more) of NOT, contributes_to, colocalizes_with.
This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).
See also the documentation on qualifiers page in the GO annotation guide.
Column 5. PO ID
The PO identifier for the term attributed to the DB Object ID.
This field is mandatory, cardinality 1.
Column 6. DB:Reference
The unique identifier, appropriate to DB, for the authority behind the attribution of the PO ID to the DB Object ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
Note: If a reference has identifiers in more than one database, multiple identifiers can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. GR:8789|PMID:2676709).
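
A minimal sketch of reading this column, assuming only the DB:accession_number syntax and the pipe separator described above (the function name split_dbxrefs is hypothetical):

```python
def split_dbxrefs(value):
    """Split a pipe-separated DB:Reference value into (db, accession) pairs."""
    refs = []
    for entry in value.split("|"):
        # Split at the first colon only, since accessions may contain colons.
        db, sep, accession = entry.partition(":")
        if not (db and sep and accession):
            raise ValueError(f"not in DB:accession_number form: {entry!r}")
        refs.append((db, accession))
    return refs


print(split_dbxrefs("GR_ref:5655|PMID:2676709"))
# prints: [('GR_ref', '5655'), ('PMID', '2676709')]
```

An empty value raises an error here, which matches the mandatory status of the column.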
Column 7. Evidence
One of the following evidence codes: IMP, IGI, IPI, IAGP, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA.
This field is mandatory, cardinality 1.
Column 8. With (or) From
One of:
DB:gene_symbol
DB:gene_symbol[allele_symbol]
DB:gene_id
DB:protein_name
DB:sequence_id
PO:PO_id
This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1.
Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS). Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI cardinality >1 has a special meaning (see evidence documentation for more information).
For cardinality >1 use a pipe to separate entries (e.g. TAIR:Atg111111|TAIR:Atg222222).
Note that a gene/locus ID may be used in the 'with' column for an IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if the database does not have identifiers for individual gene products.
'PO:PO_id' is used only when the evidence code is 'IC', and refers to the PO term(s) used as the basis of a curator inference. In these cases the entry in DB:Reference (column 6) will be that used to assign the PO term(s) from which the inference is made. This field is mandatory for evidence code IC.
The ID is usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, PO ID, etc.). Identifiers from the Center for Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with' column for ISS annotations.
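
The rules for this column could be checked along the following lines. This is a sketch under stated assumptions: the function name check_with_from is hypothetical, a PO ID is assumed to be "PO:" followed by seven digits, and an empty 'with' for IGI, IPI or ISS is reported as a warning rather than an error because the text above permits it.

```python
import re

# Assumption: PO identifiers look like PO:0000000 (seven digits).
PO_ID = re.compile(r"PO:\d{7}")


def check_with_from(evidence, with_from):
    """Return a list of problems for a With (or) From value (column 8)."""
    entries = [entry for entry in with_from.split("|") if entry]
    problems = []
    if evidence == "IC":
        # IC is the one code for which this column is mandatory.
        if not entries:
            problems.append("error: IC requires a With (or) From entry")
        for entry in entries:
            if not PO_ID.fullmatch(entry):
                problems.append(f"error: IC expects PO:PO_id, found {entry}")
    elif evidence in {"IGI", "IPI", "ISS"} and not entries:
        problems.append(f"warning: {evidence} without a With (or) From entry")
    return problems


print(check_with_from("IC", ""))
print(check_with_from("IPI", "TAIR:Atg111111|TAIR:Atg222222"))
```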
Column 9. Aspect
Indicates the branch of the PO to which the PO ID (column 5) belongs
Either A (plant anatomical entity) or G (plant growth stage and development stage)
This field is mandatory; cardinality 1.
Column 10. DB Object Name
Name of the object, e.g., gene or gene product.
This field is not mandatory, cardinality 0, 1 [white space allowed].
Column 11. DB Object Synonym
Any aliases, e.g., Gene_symbol [or other text].
Note that we strongly recommend that gene synonyms are included in the association file, as this aids the searching of PO.
This field is not mandatory, cardinality 0, 1, >1 [white space allowed].
Column 12. DB Object Type
A description of the type of gene product being annotated.
If a Gene Product Form ID (column 17) is supplied, the DB Object Type will refer to that entity; if no Gene Product Form ID is present, it will refer to the entity that the DB Object ID (column 2) is believed to produce and which actively carries out the function or localization described.
One of the following: protein_complex; protein; protein_structure; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the Sequence Ontology; germplasm (stock/cultivar); mutant; QTL. If the precise product type is unknown, gene_product should be used.
The object type (gene_product, transcript, protein, protein_complex, etc.) listed in the DB Object Type field must match the database entry identified by the Gene Product Form ID, or, if this is absent, the expected product of the DB Object ID. Note that DB Object Type refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the PO term or the evidence on which the annotation is based. For example, if your database entry represents a protein-encoding gene, then protein goes in the DB Object Type column. The text entered in the DB Object Name and DB Object Symbol should refer to the entity in DB Object ID. For example, several alternative transcripts from one gene may be annotated separately, each with the same gene ID in DB Object ID, and specific gene product identifiers in Gene Product Form ID, but list the same gene symbol in the DB Object Symbol column.
This field is mandatory, cardinality 1.
Column 13. Taxon
Taxonomic identifier(s)
For cardinality 1, the ID of the species encoding the gene product.
For cardinality 2, the first ID is that of the species encoding the gene or gene product; the second ID is that of the other organism in the interaction, such as the species using the gene product.
This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000).
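
A small sketch of checking the taxon column, assuming the part after "taxon:" is always numeric as in the examples (the function name parse_taxon is hypothetical):

```python
def parse_taxon(value):
    """Return the one or two taxon IDs in column 13, in order."""
    ids = value.split("|")
    if len(ids) > 2:
        raise ValueError("taxon column allows at most two entries")
    for taxon_id in ids:
        prefix, _, number = taxon_id.partition(":")
        if prefix != "taxon" or not number.isdigit():
            raise ValueError(f"not in taxon:<number> form: {taxon_id!r}")
    return ids


print(parse_taxon("taxon:4527"))
# prints: ['taxon:4527']
```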
Column 14. Date
Date on which the annotation was made; format is YYYYMMDD.
This field is mandatory, cardinality 1.
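
One possible way to check the YYYYMMDD format with the Python standard library is shown below. The function name parse_annotation_date is hypothetical.

```python
from datetime import datetime


def parse_annotation_date(value):
    """Parse a YYYYMMDD string into a date, rejecting anything else."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"date must be eight digits (YYYYMMDD): {value!r}")
    # strptime also rejects impossible dates such as 20050230.
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_annotation_date("20050303"))
# prints: 2005-03-03
```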
Column 15. Assigned by
The database which made the annotation.
One of the values in the table of database abbreviations.
Used for tracking the source of an individual annotation.
Default value is value entered in column 1 (DB).
Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
This field is mandatory, cardinality 1.
Column 16. Annotation Extension
Contains cross references to a PO term that can be used to qualify or enhance the annotation. The cross-reference is prefaced by an appropriate PO relationship (for now, part_of or participates_in; use of other relations may be allowed in the future).
One or more of: relation(PO:id)
This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries, e.g., part_of(PO:0009025)|participates_in(PO:0001054).
Example 1: If a gene product is localized to the leaf tip of a vascular leaf, the PO ID (column 5) would be leaf tip (PO:0025142), and the annotation extension column would contain a cross-reference to part_of vascular leaf (PO:0009025).
Example 2: If a gene product is localized in a leaf during senescence, the PO ID (column 5) would be leaf (PO:0009025), and the annotation extension column would contain a cross-reference to participates_in leaf senescence stage (PO:0001054).
See additional information and discussion on the PO Annotation Extensions (column 16) wiki page.
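
The relation(PO:id) syntax can be parsed with a short regular expression, sketched below. Two assumptions are built in: only the two relations currently allowed (part_of and participates_in) are accepted, and a PO ID is taken to be "PO:" followed by seven digits. The function name parse_annotation_extension is hypothetical.

```python
import re

# Assumptions: only part_of / participates_in, and seven-digit PO IDs.
EXTENSION = re.compile(r"(part_of|participates_in)\((PO:\d{7})\)")


def parse_annotation_extension(value):
    """Return (relation, PO ID) pairs from column 16; empty input gives []."""
    if not value:
        return []  # the column is optional
    pairs = []
    for entry in value.split("|"):
        match = EXTENSION.fullmatch(entry)
        if match is None:
            raise ValueError(f"not in relation(PO:id) form: {entry!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


print(parse_annotation_extension("part_of(PO:0009025)|participates_in(PO:0001054)"))
# prints: [('part_of', 'PO:0009025'), ('participates_in', 'PO:0001054')]
```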
Column 17. Gene Product Form ID
As the DB Object ID (column 2) entry must be a canonical entity - a gene OR an abstract protein that has a 1:1 correspondence to a gene - this field allows the annotation of specific variants of that gene or gene product. Contents will frequently include protein sequence identifiers: for example, identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. Identifiers for functional RNAs can also be included in this column.
The identifier used must be a standard 2-part global identifier, e.g. UniProtKB:OK0206-2.
When the Gene Product Form ID (column 17) is filled with a protein identifier, the value in DB Object Type (column 12) must be protein. Protein identifiers can include UniProtKB accession numbers, NCBI NP identifiers or Protein Ontology (PRO) identifiers.
When the Gene Product Form ID (column 17) is filled with a functional RNA identifier, the DB Object Type (column 12) must be either ncRNA, rRNA, tRNA, snRNA, or snoRNA.
This column may be left blank; if so, the value in DB Object Type (column 12) will provide a description of the expected gene product.
This field is not mandatory, cardinality 0 or 1.
More information and examples are available from the GO wiki page on column 17.
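
The consistency rules between column 17 and column 12 might be checked as in the sketch below. Whether an identifier denotes a protein or a functional RNA cannot be read from the identifier alone, so this example assumes the caller supplies that as form_kind ("protein" or "rna"). The function and parameter names are hypothetical.

```python
RNA_TYPES = {"ncRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}


def check_gene_product_form(form_id, form_kind, object_type):
    """Return a list of problems for columns 17 (form_id) and 12 (object_type)."""
    if not form_id:
        return []  # column 17 may be left blank
    problems = []
    db, sep, accession = form_id.partition(":")
    if not (db and sep and accession):
        problems.append(f"not a 2-part identifier: {form_id}")
    if form_kind == "protein" and object_type != "protein":
        problems.append("protein form ID requires DB Object Type 'protein'")
    if form_kind == "rna" and object_type not in RNA_TYPES:
        problems.append("RNA form ID requires an RNA DB Object Type")
    return problems


print(check_gene_product_form("UniProtKB:P12345-2", "protein", "protein"))
# prints: []
```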

Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: PO ID (where dbname is always PO), DB:Reference, With (or) From, and Taxon (where dbname is always taxon). For the PO ID, do not repeat the 'PO:' prefix (i.e. always use PO:0000000, not PO:PO:0000000).
