PubMed is a vast repository of medical research papers that provides API access to an extensive wealth of data. This information can be harnessed to construct knowledge graphs for applications such as healthbots and recommender systems. An inherent challenge, however, lies in sorting and analyzing these papers effectively: some journals have high impact factors while others do not. Addressing this, the project serves as a proof of concept, demonstrating a pipeline that combines methods for scoring, extracting, and analyzing high-quality papers.
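For orientation, PubMed's programmatic access goes through the NCBI E-utilities API. The following minimal sketch (the query term and result count are arbitrary examples, not values from this project) searches PubMed and fetches abstract records:

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(term, retmax=5):
    """Return PubMed IDs (PMIDs) matching a query term."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"},
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def fetch_abstracts(pmids):
    """Fetch abstract records (XML) for a list of PMIDs."""
    resp = requests.get(
        f"{EUTILS}/efetch.fcgi",
        params={"db": "pubmed", "id": ",".join(pmids),
                "rettype": "abstract", "retmode": "xml"},
    )
    resp.raise_for_status()
    return resp.text  # XML, to be parsed downstream

pmids = search_pubmed("sepsis treatment")  # illustrative query
print(fetch_abstracts(pmids)[:500])
```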
The project begins by scraping PubMed data and extracting journal impact factors and university research scores through different API services. Once the data is collated in tabular form, exploratory analysis is performed to understand the underlying distributions and identify correlations between the parameters. An unsupervised learning-based scoring system then ranks papers on features such as journal impact factor and university research score. Finally, after scoring, top-tier articles are extracted for constructing knowledge graphs: Named Entity Recognition (NER) with BERN2 identifies critical biomedical entities, and Llama 3.1 is used to identify the relationships between them. This extracted information forms the foundation for constructing sophisticated knowledge graphs, enabling enhanced insights and explainability.
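The NER step can be illustrated against BERN2's public web API. The sketch below follows the plain-text endpoint documented in the BERN2 repository; the example abstract is made up, and the shared demo endpoint may be rate-limited or unavailable, so a self-hosted BERN2 instance is preferable in practice.

```python
import requests

def bern2_ner(text, url="http://bern2.korea.ac.kr/plain"):
    """Run biomedical NER on raw text via the public BERN2 demo endpoint."""
    resp = requests.post(url, json={"text": text})
    resp.raise_for_status()
    return resp.json().get("annotations", [])

abstract = "Metformin reduces hepatic glucose production in patients with type 2 diabetes."
for ann in bern2_ner(abstract):
    # Each annotation carries the mention span, an entity type such as
    # drug or disease, and normalized identifiers.
    print(ann.get("mention"), "->", ann.get("obj"))
```

The entity pairs found this way would then be handed to Llama 3.1 (for example, via a prompt listing both entities and the sentence they co-occur in) to label the relationship between them; the exact prompting scheme is not shown here.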
While this project lays the groundwork for these capabilities, it is merely a starting point. With the rapid advancements in Large Language Models (LLMs) and other AI technologies, numerous improvements can be envisioned. These include refining entity extraction, enhancing relationship modeling, and integrating more dynamic scoring methods, ultimately expanding the potential applications and impact of this system in real-world scenarios.
The sorting of papers is based on various features such as the following (a scoring sketch appears after the list):
- Citation count
- Impact factor of the journal
- Research score of the university
- Study type
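The exact unsupervised scoring algorithm is not spelled out here, so the following is only one plausible sketch: min-max scale the two features that were actually collected (journal impact factor and university research score, per the note below) and take the first principal component as a composite score.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data; column names are illustrative, not the project's schema.
df = pd.DataFrame({
    "impact_factor": [3.2, 12.5, 1.1, 7.8],
    "research_score": [55.0, 91.3, 40.2, 78.6],
})

# Scale both features to [0, 1] so neither dominates, then project onto
# the first principal component to obtain one unsupervised composite score.
scaled = MinMaxScaler().fit_transform(df[["impact_factor", "research_score"]])
df["score"] = PCA(n_components=1).fit_transform(scaled).ravel()

# PCA's component sign is arbitrary; flip it so that higher raw features
# map to higher scores, then rank papers by the composite score.
if df["score"].corr(df["impact_factor"]) < 0:
    df["score"] = -df["score"]
print(df.sort_values("score", ascending=False))
```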
Due to limitations in web scraping, only the journal impact factor and the university research score could be collected. In this project:
- Data scraping is implemented for impact factors and research scores from different sources.
- The extracted data is merged into a single dataframe.
- The GLiNER transformer is used to extract entities such as universities and study types (a sketch follows this list).
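A minimal sketch of the GLiNER step (the checkpoint name and label set are assumptions for illustration, not necessarily what this project uses):

```python
from gliner import GLiNER  # pip install gliner

# Checkpoint name is an assumption; any GLiNER checkpoint is used the same way.
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = ("This randomized controlled trial, conducted at Stanford University, "
        "evaluated statin therapy in 400 patients.")
labels = ["university", "study type"]  # illustrative zero-shot label set

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```

GLiNER matches arbitrary label strings zero-shot, which makes it convenient for pulling universities and study types out of free-text affiliation and abstract fields.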
The repository is organized as follows:
- `pubmed/` - Main directory containing PubMed data processing resources.
- `notebooks/` - Jupyter notebooks with detailed examples and use cases.
- `csv/` - Sample CSV files for testing and demonstration purposes.
- `README.md` - Project documentation (you are here!)
- `.gitignore` - Specifies files and directories to ignore in the repository.

You can view the exploratory data analysis on nbviewer: Pubmed_EDA_part2.ipynb