ML-Ready Datasets: Difference between revisions

From GlyGen Wiki
Previous revision (Added citations):

Machine Learning (ML)-Ready datasets are structured datasets that have been pre-processed and organized, which makes them suitable for training ML models. These datasets typically do not require any further modifications and allow users with little to no scripting experience or domain knowledge to streamline the model development process. Bioinformatics has greatly benefited from ML-ready datasets in areas such as protein structure prediction, biomarker discovery, clinical data analysis, and systems biology. Various -omics datasets, including glycomics, proteomics, genomics, and transcriptomics, have been leveraged to drive these advancements. ML-ready datasets also provide novice bioinformaticians with the ability to explore ML techniques without requiring extensive expertise in the subject matter. This, in turn, promotes the growth of bioinformaticians by helping them understand the nuances of -omic data curation and how these factors impact modeling approaches.

== Applications and Examples ==

=== Sequence Analysis ===
ML-ready datasets have enabled researchers to identify sequences, mutations, and variants by training models that can recognize patterns in DNA, RNA, and protein sequences.

<u>Examples:</u>

* Identifying the protein molecular phenotypes that are associated with human missense variants<ref>Rehfeldt, T. G., Gabriels, R., Bouwmeester, R., Gessulat, S., Neely, B. A., Palmblad, M., Perez-Riverol, Y., Schmidt, T., Vizcaíno, J. A., & Deutsch, E. W. (2023). ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. ''Journal of proteome research'', ''22''(2), 632–636. <nowiki>https://doi.org/10.1021/acs.jproteome.2c00629</nowiki></ref>.
* Investigating RNA-seq expression data to determine genes associated with disease<ref>Venkat, V., Abdelhalim, H., DeGroat, W., Zeeshan, S., & Ahmed, Z. (2023). Investigating genes associated with heart failure, atrial fibrillation, and other cardiovascular diseases, and predicting disease using machine learning techniques for translational research and precision medicine. ''Genomics'', ''115''(2), 110584. <nowiki>https://doi.org/10.1016/j.ygeno.2023.110584</nowiki></ref>.

=== Biomarker Discovery ===
ML-ready datasets can help identify biomarkers for diseases to help facilitate early diagnosis, prognosis, and the development of targeted therapies.

<u>Examples:</u>

* Leveraging genomics and healthcare data to identify genes associated with targeted diseases to predict the burden of risk<ref>DeGroat, W., Venkat, V., Pierre-Louis, W., Abdelhalim, H., & Ahmed, Z. (2023). ''Hygieia'': AI/ML pipeline integrating healthcare and genomics data to investigate genes associated with targeted disorders and predict disease. ''Software Impacts'', ''16'', 100493. <nowiki>https://doi.org/10.1016/j.simpa.2023.100493</nowiki></ref>.
* Identifying biomarkers associated with cardiovascular disease<ref>DeGroat, W., Abdelhalim, H., Patel, K., Mendhe, D., Zeeshan, S., & Ahmed, Z. (2024). Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine. ''Scientific reports'', ''14''(1), 1. <nowiki>https://doi.org/10.1038/s41598-023-50600-8</nowiki></ref>.

=== Protein Structure Prediction ===
ML-ready datasets can aid in predicting the efficacy and toxicity of new compounds to optimize drug design.

<u>Examples:</u>

* Identifying relationships between protein sequence, structure, and function<ref>Draizen, E. J., Readey, J., Mura, C., & Bourne, P. E. (2024). Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. ''BMC bioinformatics'', ''25''(1), 11. <nowiki>https://doi.org/10.1186/s12859-023-05586-5</nowiki></ref>.

=== Drug Discovery ===
ML-ready datasets can aid in predicting the efficacy and toxicity of chemical compounds to optimize drug discovery.

<u>Examples:</u>

* Predicting allosteric sites for drug development<ref>Tian, H., Xiao, S., Jiang, X., & Tao, P. (2023). PASSerRank: Prediction of allosteric sites with learning to rank. ''Journal of computational chemistry'', ''44''(28), 2223–2229. <nowiki>https://doi.org/10.1002/jcc.27193</nowiki></ref>.

== Citations ==

This revision (Provided more GlyGen specific information.):

Machine Learning (ML)-ready datasets are structured datasets that have been pre-processed and organized so that they are suitable for training ML models. These datasets typically require minimal modification and allow users with little to no scripting experience, but some domain knowledge, to streamline the model development process. GlyGen provides glycomics and glycoproteomics ML-ready datasets that glycobiology and bioinformatics scientists can use for disease risk assessment for conditions such as Type II Diabetes and Clear Cell Renal Carcinoma. In collaboration with the University of Delaware, GlyGen is also developing an ML-ready dataset that maps various features (such as disease, cell line, tissue, and species) to their respective ontological IDs; it will be publicly available on [[data.glygen.org]] after publication.

Bioinformatics has greatly benefited from ML-ready datasets in areas such as protein structure prediction, biomarker discovery, clinical data analysis, and systems biology. GlyGen aims to extend these benefits by providing user-friendly tools and resources that make advanced data analysis more accessible to researchers, regardless of their technical background. Our goal is to enable more researchers to leverage machine learning in their work to facilitate discoveries and advancements in the field.

== Available ML-Ready Datasets ==
All ML-ready datasets are available at [[data.glygen.org]].

{| class="wikitable"
|+ML-Ready Datasets
!Dataset
!Data
!Condition
!n
|-
|Human Diabetes Glycomics (ML Ready)
|N-glycome Abundance
|Diabetes
|74
|-
|Human ccRCC Glycoproteomics (ML Ready)
|Glycopeptide Abundance
|Clear Cell Renal Carcinoma
|
|}

Revision as of 15:00, 28 June 2024

Machine Learning (ML)-ready datasets are structured datasets that have been pre-processed and organized so that they are suitable for training ML models. These datasets typically require minimal modification and allow users with little to no scripting experience, but some domain knowledge, to streamline the model development process. GlyGen provides glycomics and glycoproteomics ML-ready datasets that glycobiology and bioinformatics scientists can use for disease risk assessment for conditions such as Type II Diabetes and Clear Cell Renal Carcinoma. In collaboration with the University of Delaware, GlyGen is also developing an ML-ready dataset that maps various features (such as disease, cell line, tissue, and species) to their respective ontological IDs; it will be publicly available on data.glygen.org after publication.

Bioinformatics has greatly benefited from ML-ready datasets in areas such as protein structure prediction, biomarker discovery, clinical data analysis, and systems biology. GlyGen aims to extend these benefits by providing user-friendly tools and resources that make advanced data analysis more accessible to researchers, regardless of their technical background. Our goal is to enable more researchers to leverage machine learning in their work to facilitate discoveries and advancements in the field.

Available ML-Ready Datasets

All ML-ready datasets are available at data.glygen.org.

ML-Ready Datasets
Dataset | Data | Condition | n
Human Diabetes Glycomics (ML Ready) | N-glycome Abundance | Diabetes | 74
Human ccRCC Glycoproteomics (ML Ready) | Glycopeptide Abundance | Clear Cell Renal Carcinoma |
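
As a rough illustration of what "ML-ready" means in practice, the sketch below assumes a table with one row per sample, numeric abundance columns, and a single condition column, and splits it into a feature matrix and a label list. The column names (sample_id, glycan_a, condition) and all values are invented placeholders, not the layout or contents of any GlyGen dataset; check the actual files on data.glygen.org before adapting it.

```python
import csv
import io

# Hypothetical ML-ready table: one row per sample, numeric feature columns,
# and one label column. Names and values are invented for illustration only
# and are NOT taken from any GlyGen dataset.
SAMPLE_CSV = """sample_id,glycan_a,glycan_b,glycan_c,condition
S1,0.12,0.40,0.48,case
S2,0.30,0.35,0.35,control
S3,0.10,0.45,0.45,case
S4,0.33,0.30,0.37,control
"""


def load_ml_ready(text, id_col="sample_id", label_col="condition"):
    """Split an ML-ready CSV into feature names, a feature matrix, and labels."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Every column that is neither the sample ID nor the label is a feature.
    feature_names = [c for c in rows[0] if c not in (id_col, label_col)]
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [row[label_col] for row in rows]
    return feature_names, X, y


feature_names, X, y = load_ml_ready(SAMPLE_CSV)
print(feature_names)
print(len(X), "samples,", len(feature_names), "features")
```

Because the features are already numeric and the label sits in its own column, X and y can be passed to most ML libraries without further reshaping, which is the sense in which such a dataset needs only minimal modification.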