Frequently Asked Questions

The frequently asked questions are a collection of user questions related to the GlyGen frontend, backend, and data. The answers contain definitions and explanations of terms, such as proteoform or UniProtKB accession, and short how-to's for using the GlyGen portal and related tools, such as "How can I search all the glycosylation sites present on my protein?". The questions are grouped into protein, proteoform, glycan, and general questions. You can also use GlyGen's contact page to reach out to us with any additional questions or queries.


Protein FAQs

FAQs related to protein centric data and views.

How to Use Simple Protein Search?

How to Use Simple Glycoprotein Search?

What does protein-centric data include?

  • Protein-centric data include data types and information about a particular protein-coding gene, or that can be mapped to the canonical protein sequence representing that gene. Examples include pathways, Gene Ontology, localization, etc.

What is a UniProtKB canonical accession?

  • A UniProtKB canonical accession is the accession assigned to the protein isoform chosen as the canonical sequence, to which all positional annotation in the UniProtKB entry page refers.
  • UniProtKB represents different isoforms of the same protein by appending the isoform number to the protein accession. For example, for protein accession P38398, the isoforms are represented as P38398-1, P38398-2, P38398-3, P38398-4, etc., where P38398-1 is the chosen canonical accession.
  • UniProtKB uses specific criteria for choosing the canonical sequence for an entry. To learn more about canonical isoforms and the selection criteria, refer to the UniProtKB help page.
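
The isoform naming described above can be handled programmatically. The following is a minimal sketch, assuming only the pattern visible in the P38398 example (an accession optionally followed by a hyphen and an isoform number); the function name split_accession is illustrative and not part of any GlyGen or UniProtKB tool.

```python
import re
from typing import Optional, Tuple

# Pattern assumed from the example above: an accession, optionally followed
# by "-" and an isoform number (e.g. P38398-2).
ISOFORM_PATTERN = re.compile(r"^(?P<base>[A-Z0-9]+)(?:-(?P<isoform>\d+))?$")


def split_accession(accession: str) -> Tuple[str, Optional[int]]:
    """Split an accession into (base accession, isoform number or None)."""
    match = ISOFORM_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"Unrecognized accession: {accession!r}")
    isoform = match.group("isoform")
    isoform_number = int(isoform) if isoform is not None else None
    return match.group("base"), isoform_number


print(split_accession("P38398-2"))
print(split_accession("P38398"))
```

An accession without a suffix returns None for the isoform number, so the caller can decide how to treat it.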

How do I find a UniProtKB accession for my protein?

  • A UniProtKB accession is the identifier that the UniProtKB database uses to represent a protein. Visit UniProtKB for additional information.
  • In GlyGen, you can find the UniProtKB accession for your protein by providing a protein name, protein sequence, or gene name, or by adding cross-references (e.g. RefSeq accession, KEGG, etc.):
  • Go to the EXPLORE tab on the GlyGen home page and select Protein Search.
  • Select Advanced Search and enter the desired input in the appropriate box.

How can I search all the glycosylation sites present on my protein?

  • Go to the EXPLORE tab on the GlyGen home page and select Protein Search.
  • Under Simple Search, select Protein from the Any category drop-down and enter the UniProtKB accession.
  • Click on Search.
  • Click on the UniProtKB accession from the list page and navigate to the Glycosylation section.
  • You can also view glycosylation sites through our highlighting feature in the Sequence section.

Proteoform FAQs

  • FAQs related to proteoform and site centric data and views.

What is a proteoform?

  • The term "proteoform" designates, "all of the different molecular forms in which the protein product of a single gene can be found, including changes due to genetic variations, alternatively spliced RNA transcripts and post-translational modifications," (Smith et al. Nat Methods. 2013 Mar;10(3):186-7. PMID:23443629).
  • A protein can have many different glycosylated proteoforms that differ from each other in a number of ways: the site(s) that are glycosylated, the glycan structure at each modified site, and the presence of any other modifications. For example, the figure below shows several different glycosylated proteoforms of the sex hormone binding globulin (SHBG) that are represented in the Protein Ontology (PRO).
  • These forms have an OGalNAc modification at Thr36 and one of several different N4GlcNAc modifications at Asn380 or Asn396.

What does proteoform-centric data include?

  • Proteoform-centric data include data types and information about a particular proteoform or pertaining to a specific amino acid site. A proteoform is the specific protein product of a gene; multiple proteoforms might arise due to genetic variation, alternative splicing or translation start site selection, or post-translational modifications.

Why is the representation of proteoforms useful?

Figure: (A) Relative abundance of several glycosylated forms of SHBG (Sumer-Bayraktar et al. PMID:23001782). (B) PRO definition of one of the forms showing the associated abundance information.
  • Many aspects of protein function (activity, sub-cellular localization, and interaction partners, for example) are influenced by the precise combination of modifications on the protein. When proteins are represented and annotated in knowledge resources without regard to modification state, these distinctions can be lost. In contrast, proteoform-level representation makes it possible to associate annotation unambiguously with the most relevant protein form. For example, the accompanying figure shows the relative abundances of 12 different glycosylated proteoforms of SHBG (six different glycan structures at two different sites, Asn380 and Asn396). This abundance information is associated with the corresponding proteoform terms in the Protein Ontology (PRO).
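
The count of 12 proteoforms follows from pairing each of the six glycan structures with each of the two sites. As a small illustration, the sketch below enumerates those pairs; the glycan labels are placeholders, since the six structures are not named in this FAQ.

```python
from itertools import product

# The two N-glycosylation sites named in the text.
sites = ["Asn380", "Asn396"]

# Placeholder labels: the six glycan structures are not named in this FAQ.
glycans = [f"glycan_{i}" for i in range(1, 7)]

# One singly modified proteoform per (site, glycan) combination.
proteoforms = list(product(sites, glycans))

print(len(proteoforms))
for site, glycan in proteoforms:
    print(f"SHBG with {glycan} at {site}")
```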

How can we represent proteoforms through PRO IDs when proteins have many potential glycosylation sites and glycan forms?

  • The policy of the Protein Ontology (PRO) is to only represent proteoforms that have been experimentally observed. For example, if glycosylation is observed on two different sites, each of the singly modified forms can be represented in PRO; however, the proteoform with glycosylation on both sites will not be represented unless there is solid experimental evidence for its existence. In this way, the number of proteoforms per protein does not grow too large, and the forms represented are those that are of highest interest to biologists.

How can I search for all proteins that bear my glycan/GlyTouCan accession?

  • Go to the Explore tab and click on Glycoprotein.
  • In the Advanced Search, add your GlyTouCan accession in the Interacting Glycan field and click on Search.
  • To find a GlyTouCan accession for your desired glycan, please refer to the FAQ "How do I find a GlyTouCan accession for my glycan composition or structure?".

Glycan FAQs

FAQs related to glycan centric data and views.

How to use simple glycan search?

  • This tutorial illustrates how to search for a glycan or collection of glycans based on their general properties, structural features, attachment to a glycoprotein(s), mechanisms of biosynthesis, etc.

What does glycan-centric data include?

  • Glycan-centric data include data types and information such as motifs, type/sub-type, mass, cross-references to different databases (UniCarbKB, ChEBI etc.) and sequences/string representation (e.g. IUPAC, WURCS, GlycoCT, etc.).

How do I find a GlyTouCan accession for my glycan composition or structure?

How can I register my glycan structure into GlyTouCan?

  • To register a glycan composition:
  • To register a glycan structure or sequence/string-representation:
    • Visit the GlyTouCan website.
    • Click on Registration and sign in with your desired Google account.
      • To draw your structure, click on Graphical Input (PubMed).
        • Once the structure is complete, a sequence is generated from the input and sent to the database to check whether it was previously registered. If the structure is found in the database, a link and accession number indicating the structure ID will be displayed; otherwise, a final confirmation screen will be shown. If you click the submit button to confirm, the new “submission ref” and graphical representation will be displayed. As all sequences will be stored in GlycoCT (condensed) format, the sequence initially input will be displayed under original structure, while the sequence converted into WURCS will be displayed under Structure.
      • To register a sequence/text-representation (WURCS/GlycoCT), use the Text Input.
        • To convert your sequences into a desired format, please refer to the FAQ "How can I convert my glycan sequence/string-format to different desired formats?".

Why do identical glycan structure images have different GlyTouCan IDs?

  • Identical images (e.g. G70994MS/G29068FM or G86089ZC/G64581RP) might have different GlyTouCan accessions because their WURCS strings differ. For example:
    • G70994MS: WURCS=2.0/1,1,0/[axxxxh-1x_1-5_2*NCC/3=O]/1/
    • G29068FM: WURCS=2.0/1,1,0/[uxxxxh_2*NCC/3=O]/1/
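
To see where the two strings differ, the bracketed residue descriptors can be pulled out and compared. This is only a sketch using a regular expression on the two strings quoted above; it is not a WURCS parser, and the helper name unique_residues is illustrative.

```python
import re
from typing import List


def unique_residues(wurcs: str) -> List[str]:
    """Return the bracketed residue descriptors found in a WURCS string."""
    return re.findall(r"\[([^\]]*)\]", wurcs)


g70994ms = "WURCS=2.0/1,1,0/[axxxxh-1x_1-5_2*NCC/3=O]/1/"
g29068fm = "WURCS=2.0/1,1,0/[uxxxxh_2*NCC/3=O]/1/"

# The strings are not equal, so they are registered under separate accessions.
print(g70994ms == g29068fm)
print(unique_residues(g70994ms))
print(unique_residues(g29068fm))
```

The first descriptor carries an additional "-1x_1-5" segment and starts with "a" rather than "u", even though both entries can be rendered as the same image.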

How can I convert my glycan sequence/string-format to different desired formats (e.g. IUPAC, WURCS, GlycoCT, LinearCode, etc.)?

  • GlycanFormatConverter (PMID:30535258) allows users to convert glycan sequences from different import formats (such as IUPAC-Condensed, IUPAC-Extended, GlycoCT, KCF, LinearCode, WURCS) to the desired export formats (such as IUPAC-Short, IUPAC-Condensed, IUPAC-Extended, GlycoCT, WURCS, GlycanWeb).
  • Visit the GlycanFormatConverter GitHub repository, or follow the example below (IUPAC-Extended to WURCS) for step-by-step instructions on using the tool:
    • Go to GlycanFormatConverter Swagger and select iupacextended2wurcs.
    • In the import field, add your IUPAC string, and specify "txt" or "json" in the format field.
    • Once you select Try it out, the WURCS string appears under Response Body.

How can I search all the glycans present on my protein?

  • Go to the EXPLORE tab on the GlyGen home page and select Glycan Search.
  • Under Simple Search, select Protein from Any category and add your UniProtKB accession in the search box.
  • Click on Search.

How do I input the structure of the required glycan?

You can use the GNOme (beta testing is currently underway) widget to search for your desired glycan structure or composition.

Please see: https://gnome.glyomics.org/restrictions/GNOme_GlyGen.browser.html?GalNAc=1&Gal=1

By clicking on structures, you can move further along the glycan tree to retrieve a more specific structure (i.e. with defined linkage information).

For example: https://gnome.glyomics.org/restrictions/GNOme_GlyGen.browser.html?saccharide=G16612QG
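
The two GNOme links above differ only in their query parameters, so they can be assembled from a composition or an accession. A minimal sketch follows, assuming the browser accepts exactly the parameter names seen in those two links (monosaccharide counts such as GalNAc and Gal, or saccharide for an accession); the helper gnome_url is illustrative and not part of GNOme.

```python
from urllib.parse import urlencode

# Base address taken from the example links in this FAQ.
GNOME_BROWSER = "https://gnome.glyomics.org/restrictions/GNOme_GlyGen.browser.html"


def gnome_url(**params) -> str:
    """Build a GNOme browser link from query parameters."""
    return f"{GNOME_BROWSER}?{urlencode(params)}"


# Composition query: one GalNAc and one Gal.
print(gnome_url(GalNAc=1, Gal=1))

# Jump directly to a specific structure by GlyTouCan accession.
print(gnome_url(saccharide="G16612QG"))
```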

You can right-click on your desired structure to see the glycan entry in the GlyGen database, where you can retrieve additional information (go to the "Found Glycoprotein" section on the individual glycan entry page to find the associated proteins).

Alternatively, you can use the GlyGen composition search to search for your glycan by its monosaccharide composition.

e.g.: https://www.glygen.org/glycan_list.html?id=8691cfa82cb8eea7bf6b68995ae37bc2

You can click on the desired GlyTouCan accession to retrieve more information about the glycan.

Another option is to use GlyTouCan's graphical input tool to draw your desired glycan and retrieve the GlyTouCan accession.

Figure: Glycan data flow

Once you have the GlyTouCan accession you can use GlyGen's glycan or glycoprotein searches to find your desired output.

How is glycan data collected, integrated, and processed in GlyGen?

GlyGen processes glycoprotein-centric and glycan-centric data from different sources such as GlyConnect, UniCarbKB, GlyTouCan, MatrixDB, etc., in coordination between GlyGen team members at George Washington University and Georgetown University. GlyGen is also part of the GlySpace alliance and coordinates biocuration activities with its team members. An overview of this process is shown in the glycan data flow figure.




General FAQs

Does GlyGen have a tutorial on how to use the site?

  • Tutorials for specific searches can be accessed through the EXPLORE tab on the GlyGen home page.

What are the different resources integrated into GlyGen?

  • Click here to understand which resources are integrated into GlyGen.

What licence does GlyGen use?

  • GlyGen’s data is licensed under CC BY 4.0. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially, as long as you give appropriate credit and attribution. To learn about CC BY 4.0, click here.
  • GlyGen’s source code and data processing scripts are licensed under the GNU General Public License v3.0, which gives users the freedom to use the software for any purpose, to change it to suit their needs, to share it with others, and to share the changes they make. To learn about GNU GPL v3.0, visit the GNU licenses page.

What is GlyGen's policy regarding copyright and database distribution?

  • You must give appropriate credit to GlyGen (by referring to GlyGen for the whole resource or Object ID for individual datasets) and may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • Datasets made available through GlyGen include archived versions. We recommend that authors use the Creative Commons Attribution licence (CC BY) for all versions when they share their datasets with GlyGen.
  • This means that users of GlyGen are entitled to use, reproduce, disseminate or display these datasets provided the original authors and GlyGen are attributed.

How can I access the GlyGen Data?

  • GlyGen data can be accessed through the DATA tab on the GlyGen home page or directly via the GlyGen data page. Data can also be accessed through the GlyGen Web Service API; access by querying a SPARQL endpoint is planned for the future.

How can I download GlyGen data?

  • To download a list of results (with one or more filters) or to download data specific to an individual record:
  • Go to the EXPLORE tab on the GlyGen home page and select a category for your search (e.g. protein, glycan, glycoprotein).
  • Use the Simple Search option to search data based on a single filter.
    • For example, select Protein from Any category to search data only by UniProtKB accession.
  • Use the Advanced Search option to apply additional filters to your search.
  • Click on the Search button to navigate to the list page and click the download button in the top right corner to access the results.
  • To download an individual record, select a record from the list page and click the download button.
  • To download the entire set of data:
    • Go to the DATA tab on the GlyGen homepage.
    • Select view more on the desired data object and select the download button.
    • To learn more about GlyGen data objects, see the General FAQ "What is a GlyGen data object?".

What is a GlyGen data object?

  • GlyGen data is organized into individual data objects which are assigned a unique GlyGen ID (e.g. GLYDS000001).
  • Each GlyGen data object can be accessed through the DATA tab on the GlyGen home page or directly via the GlyGen data page.
  • Each data object has a README describing the source, contributors, integration process, quality control workflow, etc. See the General FAQ "What is the format of the README files?" for more information.
  • The data objects can be filtered by category (Protein, Proteoform, Glycan), by species (Homo sapiens, Mus musculus, Rattus norvegicus, SARS-CoV-1, SARS-CoV-2, etc.), and by file type (.csv, .fasta, .png, etc.).
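
When handling many data objects in a script, it can help to check that a string looks like a GlyGen ID before using it. The sketch below assumes the format seen in the examples on this page (the prefix GLYDS followed by six digits); that format is inferred from those examples and is not an official specification.

```python
import re

# Assumed format, inferred from IDs such as GLYDS000001 on this page.
GLYGEN_ID_PATTERN = re.compile(r"GLYDS\d{6}")


def looks_like_glygen_id(value: str) -> bool:
    """Return True if the whole string matches the assumed GLYDS format."""
    return GLYGEN_ID_PATTERN.fullmatch(value) is not None


print(looks_like_glygen_id("GLYDS000001"))
print(looks_like_glygen_id("GLYDS1"))
```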

What is the format of the README files?

  • All README files follow the BioCompute specifications. Technical specifications can be found here.

Figure: Data integration process in GlyGen.

How is data integrated into GlyGen?

How can I submit my data?

  • Please use our Contact page to submit your data to GlyGen.

How often is the GlyGen data updated?

  • GlyGen data is released every six months during March-April and August-September of each year.
  • Minor updates to the data are made when a bug or a data issue has been found and addressed.

How can I access the previous versions of the data?

  • Go to the DATA tab on the GlyGen home page.
  • Click view details on the desired data object.
  • Select the desired version and date from the Version drop-down in the top left corner and click on download.

Which UniProtKB and GlyTouCan accessions are included in GlyGen?

  • GlyGen maintains a strict protein (UniProtKB) and glycan (GlyTouCan) accession list compiled based on different criteria.
  • The UniProtKB accession list serves as a reference for generating protein-centric and proteoform-centric datasets, whereas the GlyTouCan accession list is used for generating glycan-centric datasets. The criteria and the statistics may be subject to change with every major and minor release.
  • Protein:
    • Please see the following species-specific GlyGen datasets for the current GlyGen protein masterlists:
      • Homo sapiens [TaxID:9606]: GLYDS000001
      • Mus musculus [TaxID:10090]: GLYDS000007
      • Rattus norvegicus [TaxID:10116]: GLYDS000244
      • Hepatitis C virus (genotype 1a, isolate H) [TaxID:11108]: GLYDS000344
      • Hepatitis C virus (genotype 1b, isolate Japanese) [TaxID:11116]: GLYDS000345
      • SARS coronavirus (SARS-CoV-1) [TaxID:694009]: GLYDS000467
      • SARS coronavirus (SARS-CoV-2 or 2019-nCoV) [TaxID:2697049]: GLYDS000434
  • Glycan:
    • Please see the GlyGen dataset GLYDS000281 to access the current GlyGen glycan masterlist.
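
For scripted work, the masterlist dataset IDs listed above can be kept in a small lookup table keyed by NCBI TaxID. This is a sketch that simply transcribes the list on this page; the function name protein_masterlist is illustrative, and the IDs may change with future releases.

```python
# Protein masterlist dataset IDs by NCBI TaxID, transcribed from this FAQ.
PROTEIN_MASTERLISTS = {
    9606: "GLYDS000001",     # Homo sapiens
    10090: "GLYDS000007",    # Mus musculus
    10116: "GLYDS000244",    # Rattus norvegicus
    11108: "GLYDS000344",    # Hepatitis C virus (genotype 1a, isolate H)
    11116: "GLYDS000345",    # Hepatitis C virus (genotype 1b, isolate Japanese)
    694009: "GLYDS000467",   # SARS-CoV-1
    2697049: "GLYDS000434",  # SARS-CoV-2
}

# Glycan masterlist dataset ID, also from this FAQ.
GLYCAN_MASTERLIST = "GLYDS000281"


def protein_masterlist(tax_id: int) -> str:
    """Return the protein masterlist dataset ID for a TaxID listed above."""
    if tax_id not in PROTEIN_MASTERLISTS:
        raise KeyError(f"No protein masterlist listed for TaxID {tax_id}")
    return PROTEIN_MASTERLISTS[tax_id]


print(protein_masterlist(9606))
```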

Why does the reference or source database show different data from what is represented in the GlyGen pages for the same annotation?

  • The data in GlyGen is downloaded from the source on a specific date (“freeze date”), which may also correspond to a specific release of that database (e.g. UniProtKB Nov. 2018 release).
  • If the source data changes after the GlyGen freeze date, or if there is a synchronization delay, there could be a discrepancy between the two databases.

Does GlyGen filter out publication information on the Protein detail page?

  • Yes, GlyGen filters out some publications from large-scale studies (based on genome sequencing, protein sequencing, cDNA, or chromosomes) that do not provide any functional annotation related to the protein.

What is ECO or eco_id (eco_identifier)?

  • The Evidence & Conclusion Ontology (ECO) is a controlled vocabulary that describes scientific evidence, which results from a variety of research methods, as well as interpretations by authors and scientific curators. ECO is used to document specific evidence to support conclusions that result from a scientific investigation.
  • Examples:
    • Inferred from Experiment (EXP): ECO:0000269 (experimental evidence used in manual assertion)
    • Traceable Author Statement (TAS): ECO:0000304
    • Inferred from Sequence Orthology (ISO): ECO:0000266
    • Inferred from Sequence or Structural Similarity (ISS): ECO:0000250
    • Click here to browse the ECO.
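
If you work with files that use the short evidence codes, the examples above can be turned into a simple lookup. The sketch below covers only the four codes listed in this FAQ and is not a complete mapping to ECO; the function name eco_for is illustrative.

```python
# Evidence codes and the ECO IDs given in the examples above.
CODE_TO_ECO = {
    "EXP": "ECO:0000269",
    "TAS": "ECO:0000304",
    "ISO": "ECO:0000266",
    "ISS": "ECO:0000250",
}


def eco_for(code: str) -> str:
    """Return the ECO ID for an evidence code, ignoring case."""
    key = code.strip().upper()
    if key not in CODE_TO_ECO:
        raise KeyError(f"No ECO ID listed in this FAQ for code {code!r}")
    return CODE_TO_ECO[key]


print(eco_for("EXP"))
print(eco_for("iss"))
```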

How can I access or download all the GlyGen dataset BCOs?

The individual BCOs can be accessed from data.glygen.org. Use the filter options to locate the required BCO and click it for details. Select the required version; the download option is on the right.

A JSON download script is available for all the GlyGen and OncoMX BCOs that are stored in MongoDB.

References

  1. Kahsay, R.; Vora, J.; Navelkar, R.; Mousavi, R.; Fochtman, B.; Holmes, X.; Pattabiraman, N.; Ranzinger, R.; Mahadik, R.; Williamson, T.; Kulkarni, S.; Agarwal, G.; Martin, M.; Vasudev, P.; García Castro, L.; Edwards, N.; Zhang, W.; Natale, D.; Ross, K.; Mazumder, R. (2020). "GlyGen data model and processing workflow". Bioinformatics. btaa238. https://doi.org/10.1093/bioinformatics/btaa238
  2. York, W. S., Mazumder, R., Ranzinger, R., Edwards, N., Kahsay, R., Aoki-Kinoshita, K. F., Campbell, M. P., Cummings, R. D., Feizi, T., Martin, M., Natale, D. A., Packer, N. H., Woods, R. J., Agarwal, G., Arpinar, S., Bhat, S., Blake, J., Castro, L., Fochtman, B., Gildersleeve, J., … Zhang, W. (2020). GlyGen: Computational and Informatics Resources for Glycoscience. Glycobiology, 30(2), 72–73. PMID: 31616925 https://doi.org/10.1093/glycob/cwz080
  3. Sumer-Bayraktar, Z., Nguyen-Khuong, T., Jayo, R., Chen, D. D., Ali, S., Packer, N. H., & Thaysen-Andersen, M. (2012). Micro- and macroheterogeneity of N-glycosylation yields size and charge isoforms of human sex hormone binding globulin circulating in serum. Proteomics, 12(22), 3315–3327. https://doi.org/10.1002/pmic.201200354 PMID: 23001782 https://pubmed.ncbi.nlm.nih.gov/23001782/
