ML-Ready Datasets

From GlyGen Wiki
Revision as of 16:40, 18 June 2024

Machine learning (ML)-ready datasets are structured datasets that have been pre-processed and organized so that they are suitable for training ML models. They typically require no further modification, which lets users with little or no scripting experience or domain knowledge streamline model development. Bioinformatics has benefited greatly from ML-ready datasets in areas such as protein structure prediction, biomarker discovery, clinical data analysis, and systems biology. Various -omics datasets, including glycomics, proteomics, genomics, and transcriptomics, have been leveraged to drive these advances. ML-ready datasets also let novice bioinformaticians explore ML techniques without extensive expertise in the subject matter. This, in turn, supports their professional growth by helping them understand the nuances of -omics data curation and how curation choices affect modeling approaches.
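
As a rough illustration of what "requires no further modification" can mean in practice, the sketch below checks a small table for three properties commonly expected of training data: numeric features, no missing values, and a label on every row. The function name check_ml_ready, the column names, and the toy records are all hypothetical and are not taken from any GlyGen dataset.

```python
import math


def check_ml_ready(rows, feature_names, label_name):
    """Return a list of problems that would block model training (empty if none)."""
    problems = []
    for i, row in enumerate(rows):
        for name in feature_names:
            value = row.get(name)
            # bool is rejected because it usually marks an un-encoded flag
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"row {i}: feature '{name}' is not numeric")
            elif math.isnan(value):
                problems.append(f"row {i}: feature '{name}' is missing (NaN)")
        if row.get(label_name) is None:
            problems.append(f"row {i}: label '{label_name}' is missing")
    return problems


# Toy, made-up records; the column names are illustrative only.
ready = [
    {"abundance": 0.82, "mass": 1234.5, "label": 1},
    {"abundance": 0.10, "mass": 987.1, "label": 0},
]
not_ready = [
    {"abundance": "high", "mass": float("nan"), "label": None},
]

print(check_ml_ready(ready, ["abundance", "mass"], "label"))
print(check_ml_ready(not_ready, ["abundance", "mass"], "label"))
```

A table that passes checks like these can usually be handed to a model-fitting routine directly, whereas each reported problem corresponds to a curation step that an ML-ready dataset has already taken care of.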

Applications and Examples

Sequence Analysis

ML-ready datasets have enabled researchers to identify sequences, mutations, and variants by training models that can recognize patterns in DNA, RNA, and protein sequences.

Examples:

  • Identifying the protein molecular phenotypes that are associated with human missense variants[1].
  • Investigating RNA-seq expression data to determine genes associated with disease[2].
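
Before a model can recognize patterns in a sequence, the sequence has to be expressed as numbers. One widely used encoding is one-hot encoding, sketched below for DNA. This is a generic illustration and not the method of the cited studies; the function name one_hot_encode and the choice to map unknown bases to an all-zero row are assumptions made for the example.

```python
DNA_ALPHABET = "ACGT"


def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Encode a sequence as a list of 0/1 rows, one row per position."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        row = [0] * len(alphabet)
        # Symbols outside the alphabet (e.g. 'N') are left as an all-zero row
        if symbol in index:
            row[index[symbol]] = 1
        encoded.append(row)
    return encoded


for row in one_hot_encode("ACgN"):
    print(row)
```

An ML-ready sequence dataset would typically ship with an encoding of this kind already applied, or with sequences cleaned so that applying one is a single step.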

Biomarker Discovery

ML-ready datasets can help identify disease biomarkers, facilitating early diagnosis, prognosis, and the development of targeted therapies.

Examples:

  • Leveraging genomics and healthcare data to identify genes associated with targeted diseases and to predict the burden of risk[3].
  • Identifying biomarkers associated with cardiovascular disease[4].

Protein Structure Prediction

ML-ready datasets can aid in predicting the three-dimensional structure of proteins and in relating that structure to sequence and function.

Examples:

  • Identifying relationships between protein sequence, structure, and function[5].

Drug Discovery

ML-ready datasets can aid in predicting the efficacy and toxicity of chemical compounds to optimize drug discovery.

Examples:

  • Predicting allosteric sites for drug development[6].

Citations

  1. Rehfeldt, T. G., Gabriels, R., Bouwmeester, R., Gessulat, S., Neely, B. A., Palmblad, M., Perez-Riverol, Y., Schmidt, T., Vizcaíno, J. A., & Deutsch, E. W. (2023). ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. Journal of proteome research, 22(2), 632–636. https://doi.org/10.1021/acs.jproteome.2c00629
  2. Venkat, V., Abdelhalim, H., DeGroat, W., Zeeshan, S., & Ahmed, Z. (2023). Investigating genes associated with heart failure, atrial fibrillation, and other cardiovascular diseases, and predicting disease using machine learning techniques for translational research and precision medicine. Genomics, 115(2), 110584. https://doi.org/10.1016/j.ygeno.2023.110584
  3. DeGroat, W., Venkat, V., Pierre-Louis, W., Abdelhalim, H., & Ahmed, Z. (2023). Hygieia: AI/ML pipeline integrating healthcare and genomics data to investigate genes associated with targeted disorders and predict disease. Software Impacts, 16, 100493. https://doi.org/10.1016/j.simpa.2023.100493
  4. DeGroat, W., Abdelhalim, H., Patel, K., Mendhe, D., Zeeshan, S., & Ahmed, Z. (2024). Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine. Scientific reports, 14(1), 1. https://doi.org/10.1038/s41598-023-50600-8
  5. Draizen, E. J., Readey, J., Mura, C., & Bourne, P. E. (2024). Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC bioinformatics, 25(1), 11. https://doi.org/10.1186/s12859-023-05586-5
  6. Tian, H., Xiao, S., Jiang, X., & Tao, P. (2023). PASSerRank: Prediction of allosteric sites with learning to rank. Journal of computational chemistry, 44(28), 2223–2229. https://doi.org/10.1002/jcc.27193