Frequently Asked Questions

Revision as of 16:06, 29 April 2020

Frequently asked questions related to front-end and data.

Protein FAQs

FAQs related to protein-centric data and views.

What does protein-centric data include?

Protein-centric data include data types and information about a particular protein-coding gene, or that can be mapped to the canonical protein sequence representing that gene. Examples include pathways, Gene Ontology, localization, etc.

What is a UniProtKB canonical accession?

A UniProtKB canonical accession is the accession assigned to the protein isoform chosen as the canonical sequence, to which all positional annotation in the UniProtKB entry page refers.
UniProtKB represents the different isoforms of a protein by appending the isoform number to the protein accession.
- For example, the isoforms of protein accession P38398 are represented as P38398-1, P38398-2, P38398-3, P38398-4, etc., where P38398-1 is the chosen canonical accession.
UniProtKB uses specific criteria to choose the canonical sequence for an entry. To learn more about canonical isoforms and the selection criteria, refer to the UniProtKB help page.
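The accession-plus-isoform-number convention described above can be handled programmatically. The following is a minimal sketch in Python, not part of GlyGen or UniProtKB; the names `Accession` and `split_accession` are hypothetical. It only splits the string and does not decide which isoform is canonical, since that choice is made by UniProtKB per entry.

```python
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    base: str               # entry accession, e.g. "P38398"
    isoform: Optional[int]  # isoform number, or None when no suffix is present


def split_accession(accession: str) -> Accession:
    """Split 'P38398-2' into ('P38398', 2) and 'P38398' into ('P38398', None)."""
    base, sep, suffix = accession.strip().partition("-")
    if not base:
        raise ValueError(f"empty accession: {accession!r}")
    if not sep:
        # No hyphen: a plain entry accession without an isoform number.
        return Accession(base, None)
    if not suffix.isdigit():
        raise ValueError(f"isoform suffix is not a number: {accession!r}")
    return Accession(base, int(suffix))


print(split_accession("P38398-1"))
```

Grouping a list of isoform accessions by `base` is one way to collect all isoforms that belong to the same UniProtKB entry.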

How do I find a UniProtKB accession for my protein?

A UniProtKB accession is the identifier that the UniProtKB database uses to represent a protein. Visit UniProtKB for additional information.
In GlyGen, you can find the UniProtKB accession for your protein by providing a protein name, protein sequence, gene name, or a cross-reference (e.g. RefSeq accession, KEGG, etc.):
- Go to the EXPLORE tab on the GlyGen home page and select Protein Search.
- Select Advanced Search and add the desired input by selecting the appropriate box.

How can I search for all the glycosylation sites present on my protein?

Go to the EXPLORE tab on the GlyGen home page and select Protein Search.
Under Simple Search, select Protein from the Any category drop-down and add the UniProtKB accession.
Click on Search.
Click on the UniProtKB accession from the list page and navigate to the Glycosylation section.
You can also view glycosylation sites through our highlighting feature in the Sequence section.

Proteoform FAQs

FAQs related to proteoform-centric and site-centric data and views.

What is a proteoform?

The term "proteoform" designates, "all of the different molecular forms in which the protein product of a single gene can be found, including changes due to genetic variations, alternatively spliced RNA transcripts and post-translational modifications," (Smith et al. Nat Methods. 2013 Mar;10(3):186-7. PMID:23443629).

A protein can have many different glycosylated proteoforms that differ from each other in a number of ways: the site(s) that are glycosylated, the glycan structure at each modified site, and the presence of any other modifications. For example, the figure below shows several different glycosylated proteoforms of the sex hormone binding globulin (SHBG) that are represented in the Protein Ontology (PRO). These forms have an OGalNAc modification at Thr36 and one of several different N4GlcNAc modifications at Asn380 or Asn396.

What does proteoform-centric data include?

Proteoform-centric data include data types and information about a particular proteoform or about a specific amino acid site. A proteoform is the specific protein product of a gene; multiple proteoforms might arise due to differences in genetic variation, alternative splicing or translation start site selection, or post-translational modifications.

Why is the representation of proteoforms useful?

Many aspects of protein function (activity, sub-cellular localization, and interaction partners, for example) are influenced by the precise combination of modifications on the protein. When proteins are represented and annotated in knowledge resources without regard to modification state, these distinctions can be lost. In contrast, proteoform-level representation makes it possible to associate annotation unambiguously with the most relevant protein form. For example, the figure below shows the relative abundances of 12 different glycosylated proteoforms of SHBG (six different glycan structures at two different sites, Asn380 and Asn396). This abundance information is associated with the corresponding proteoform terms in the Protein Ontology (PRO).

How can we represent proteoforms through PRO IDs when proteins have many potential glycosylation sites and glycan forms?

The policy of the Protein Ontology (PRO) is to only represent proteoforms that have been experimentally observed. For example, if glycosylation is observed on two different sites, each of the singly modified forms can be represented in PRO; however, the proteoform with glycosylation on both sites will not be represented unless there is solid experimental evidence for its existence. In this way, the number of proteoforms per protein does not grow too large, and the forms represented are those that are of highest interest to biologists.
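A short calculation illustrates why restricting PRO to observed forms keeps the count manageable. The sketch below rests on a simplifying assumption that does not come from PRO: each site is either unmodified or carries exactly one of a fixed number of glycan structures, and sites vary independently. The function name `potential_glycoforms` and the numbers used are illustrative only.

```python
def potential_glycoforms(sites: int, glycans_per_site: int) -> int:
    """Count theoretically possible glycosylated forms.

    Assumes each site has (glycans_per_site + 1) independent states
    (unmodified, or one of the glycans); the fully unmodified form is
    excluded because it is not a glycosylated proteoform.
    """
    if sites < 0 or glycans_per_site < 0:
        raise ValueError("sites and glycans_per_site must be non-negative")
    return (glycans_per_site + 1) ** sites - 1


# Illustrative numbers: six glycan structures per site.
for sites in (2, 5, 10):
    print(sites, potential_glycoforms(sites, 6))
```

Under these assumptions the theoretical count grows exponentially with the number of sites, whereas the number of experimentally observed forms is typically far smaller.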

Glycan FAQs

FAQs related to glycan-centric data and views.

What does glycan-centric data include?

Glycan-centric data include data types and information such as motifs, type/sub-type, mass, cross-references to different databases (UniCarbKB, ChEBI etc.) and sequences (e.g. IUPAC, WURCS, GlycoCT, etc.).

How do I find a GlyTouCan accession for my glycan composition?

If the glycan composition is known, use the GlyGen composition search to find a glycan within the GlyGen set, which captures ~30% of the entire GlyTouCan collection. To search the entire GlyTouCan collection for an accession, use the compo2wurcs tool developed by GlyCosmos.

How do I find a GlyTouCan accession for my glycan structure?

You can use one of the following ways to get a GlyTouCan accession:
- Visiting the GlyTouCan database:
  1. Through GlyTouCan Graphic Input
    1. Draw the glycan structure and provide linkage information (if known).
    2. Click SEARCH
  2. Through GlyTouCan Text Input
    1. Add either the WURCS or GlycoCT sequence and click SEARCH
- Downloading the Glycan Builder 2 Application:
  1. Download the Glycan Builder 2 application from here.
    - In the application, click the VIEW tab and enable SNFG Notation (Symbol Nomenclature for Glycans).
    - Draw the structure by choosing the appropriate monosaccharides and add linkage information (if known).
    - Once the structure is complete, click the STRUCTURE tab, then select get string from structure, and then from the drop-down menu select either the glycoct_condensed or WURCS format.
    - Copy and paste the glycoct_condensed or WURCS value into the GlyTouCan Text Input search and click on the SEARCH button.

How can I convert my glycan sequence to different formats (e.g. IUPAC, WURCS, GlycoCT, LinearCode, etc.)?

GlycanFormatConverter (PMID:30535258) allows users to convert glycan sequences from different import formats (such as IUPAC-Condensed, IUPAC-Extended, GlycoCT, KCF, LinearCode, WURCS) to the desired export formats (such as IUPAC-Short, IUPAC-Condensed, IUPAC-Extended, GlycoCT, WURCS, GlycanWeb).
The example below (IUPAC-Extended to WURCS) gives step-by-step instructions for using the tool; you can also click here to visit the GlycanFormatConverter GitHub repository.
- Go to GlycanFormatConverter Swagger and select iupacextended2wurcs.
- In the import field, add your IUPAC string and specify "txt" or "json" in the format field.
- Select Try it out; the WURCS string appears under Response Body.

How can I search for all the glycans present on my protein?

Go to the EXPLORE tab on the GlyGen home page and select Glycan Search.
Under Simple Search, select Protein from the Any category drop-down and add the UniProtKB accession in the search box.
Click on Search.

General FAQs

Other FAQs

Does GlyGen have a tutorial on how to use the site?

Tutorials for specific searches can be accessed through the EXPLORE tab on the GlyGen home page.

What are the different resources integrated into GlyGen?

Click here to understand which resources are integrated into GlyGen.

What licence does GlyGen use?

GlyGen’s data is licensed under CC BY 4.0. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially, as long as you give appropriate credit and attribution. To learn about CC BY 4.0, click here.

GlyGen’s source code and data processing scripts are licensed under the GNU General Public License v3.0, which gives users the freedom to use the software for any purpose, to change the software to suit their needs, to share the software with friends and neighbors, and to share the changes they make. To learn about GNU GPL v3.0, visit the GNU licenses page.

What is GlyGen's policy regarding copyright and database distribution?

You must give appropriate credit to GlyGen (by referring to GlyGen for the whole resource or Object ID for individual datasets).
You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Datasets made available through GlyGen include archived versions. We recommend that authors use the Creative Commons Attribution licence (CC BY) for all versions when they share their datasets with GlyGen.
This means that users of GlyGen are entitled to use, reproduce, disseminate or display these datasets provided the original authors and GlyGen are attributed.

How can I access the GlyGen Data?

GlyGen data can be accessed through the DATA tab on the GlyGen home page or directly via the GlyGen data page. Data can also be accessed by querying the SPARQL endpoint (coming in future) or through the GlyGen Web Service API.

How can I download GlyGen data?

To download a list of results (with one or more filters) or to download data specific to an individual record:
- Go to the EXPLORE tab on the GlyGen home page and select a category for your search (e.g. protein, glycan, glycoprotein).
- Use the Simple Search option to search data based on a single filter.
  - For example, select Protein from Any category to search data only by UniProtKB accession.
- Use the Advanced Search option to apply additional filters to your search.
- Click on the Search button to navigate to the list page and click the download button in the top right corner to access the results.
- To download an individual record, select a record from the list page and click the download button.
To download the entire set of data:
- Go to the DATA tab on the GlyGen homepage.
- Select view more on the desired data object and select the download button.
- To learn more about GlyGen data objects, go to General FAQ 7.

What is a GlyGen data object?

GlyGen data is organized into individual data objects which are assigned a unique GlyGen ID (e.g. GLYDS000001).
Each GlyGen data object can be accessed through the DATA tab on the GlyGen home page or directly via the GlyGen data page.
Each data object has a README describing the source, contributors, integration process, quality control workflow, etc. Go to General FAQ 8 for more information.
The data objects can be filtered by category (Protein, Proteoform, Glycan), by species (Homo sapiens, Mus musculus, Rattus norvegicus, etc.), and by file type (.csv, .fasta, .png, etc.).

What is the format of the readme files?

All readme files follow the BioCompute specifications. Technical specifications can be found here.

How is data integrated into GlyGen?

See the graph on the right.

How can I submit my data?

Please use our Contact page to submit your data to GlyGen.

How often is the GlyGen data updated?

GlyGen data is released every six months during March-April and August-September of each year.
Minor updates to the data are made when a bug or a data issue has been found and addressed.

How can I access the previous versions of the data?

Go to the DATA tab on the GlyGen home page.
Click view details on the desired data object.
Select the desired version and date from the Version drop-down in the top left corner and click download.

Why am I not able to retrieve any information for my UniProtKB accession or GlyTouCan accession?

Go to General FAQ 14 for more information.

Which UniProtKB and GlyTouCan accessions are included in GlyGen?

GlyGen maintains a strict protein (UniProtKB) and glycan (GlyTouCan) accession list compiled based on different criteria.
The UniProtKB accession list serves as a reference for generating protein-centric and proteoform-centric datasets, whereas the GlyTouCan accession list is used for generating glycan-centric datasets. The criteria and the statistics may be subject to change with every major and minor release.
- Protein:
  1. Currently, GlyGen stores protein information (UniProtKB accessions) only for Human [Homo sapiens], Mouse [Mus musculus] and Rat [Rattus norvegicus] species.
  2. The list consists of a total of 20,997 UniProtKB canonical accessions compiled via gene name grouping [UniProtKB version-Nov 2017].
- Glycan:
  1. An accession is added to the GlyTouCan accession list when it falls under one or more selected criteria. Visit the GlyGen dataset objects (Human: GLYSD144, Mouse: GLYSD145, Rat: GLYSD146) for more information.

Why does the reference or source database show different data than the GlyGen pages for the same annotation?

The data in GlyGen is downloaded from the source on a specific date (“freeze date”), which may also correspond to a specific release of that database (e.g. UniProtKB Nov. 2018 release).
If the source data changes after the GlyGen freeze date, or if there is a synchronization delay, there could be a discrepancy between the two databases.

Does GlyGen filter out publication information on the Protein detail page?

Yes, GlyGen filters out some publications from large-scale studies (based on genome sequencing, protein sequencing, cDNA, or chromosomes) that do not provide any functional annotation related to the protein.

What is ECO or eco_id (eco_identifier)?

The Evidence & Conclusion Ontology (ECO) is a controlled vocabulary that describes scientific evidence, which results from a variety of research methods, as well as interpretations by authors and scientific curators. ECO is used to document specific evidence to support conclusions that result from a scientific investigation.
Examples:
- Inferred from Experiment (EXP) ECO:0000269 experimental evidence used in manual assertion
- Traceable Author Statement (TAS) ECO:0000304
- Inferred from Sequence Orthology (ISO) ECO:0000266
- Inferred from Sequence or Structural Similarity (ISS) ECO:0000250
Click here to browse the ECO.
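When working with downloaded data that carries an eco_id column, it can help to translate between evidence codes and ECO identifiers. The following is a minimal sketch that uses only the four examples listed above; the names `ECO_IDS` and `eco_id_for` are hypothetical and not part of any GlyGen or ECO tool, and the table would need to be extended from the ECO itself for real use.

```python
# Evidence code -> ECO identifier, limited to the examples listed above.
ECO_IDS = {
    "EXP": "ECO:0000269",
    "TAS": "ECO:0000304",
    "ISO": "ECO:0000266",
    "ISS": "ECO:0000250",
}


def eco_id_for(code: str) -> str:
    """Return the ECO identifier for an evidence code (case-insensitive)."""
    key = code.strip().upper()
    try:
        return ECO_IDS[key]
    except KeyError:
        raise KeyError(f"no ECO identifier listed for evidence code {code!r}") from None


print(eco_id_for("tas"))
```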

References

Kahsay, R.; Vora, J.; Navelkar, R.; Mousavi, R.; Fochtman, B.; Holmes, X.; Pattabiraman, N.; Ranzinger, R.; Mahadik, R.; Williamson, T.; Kulkarni, S.; Agarwal, G.; Martin, M.; Vasudev, P.; García Castro, L.; Edwards, N.; Zhang, W.; Natale, D.; Ross, K.; Mazumder, R. (2020). "GlyGen data model and processing workflow". Bioinformatics. btaa238. https://doi.org/10.1093/bioinformatics/btaa238
York, W. S., Mazumder, R., Ranzinger, R., Edwards, N., Kahsay, R., Aoki-Kinoshita, K. F., Campbell, M. P., Cummings, R. D., Feizi, T., Martin, M., Natale, D. A., Packer, N. H., Woods, R. J., Agarwal, G., Arpinar, S., Bhat, S., Blake, J., Castro, L., Fochtman, B., Gildersleeve, J., … Zhang, W. (2020). GlyGen: Computational and Informatics Resources for Glycoscience. Glycobiology, 30(2), 72–73. PMID: 31616925 https://doi.org/10.1093/glycob/cwz080

External links

GlyGen.org: Informatics Resources for Glycoscience

https://www.glygen.org/
