Towards a Scientific Language Processing Model

Publication Year: 2023

Artifact Description

Natural Language Processing is an effective tool for analyzing large volumes of text. However, most scientific articles contain sophisticated language that can be difficult to understand quickly. To expedite this, I tuned a model that can quickly classify datasets of abstracts on scientific topics into specific subcategories. Using the arXiv corpus of over 2.2 million abstracts, I created a dataset of climate change articles, on which I ran pretrained HuggingFace models. Using observational and quantitative data (ROUGE, cosine similarity, etc.), I tuned the parameters of various keyword extraction models and analyzed the keyword frequency of the dataset. Then, using the BERTopic model with various embedding techniques (SentenceTransformers, spaCy, etc.), I classified the dataset into clusters that could be analyzed individually. I used abstractive and extractive summarization models on each cluster to concisely describe the general progress of particular climate change topics. Using dynamic topic modeling, I then plotted the prevalence of different topics over time, which provided insight into the interest in climate change topics over the past decade. This weakly supervised algorithm allows analysts and researchers to quickly derive general conclusions about specific scientific topics and visualize their relevance in the scientific community over time.
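The two quantitative measures named in the abstract can be illustrated with a minimal, standard-library-only sketch. The abstract does not say which ROUGE variant was used or which vectors the cosine similarity was computed over, so this example assumes ROUGE-1 F1 (unigram overlap) and raw term-frequency vectors. The function names `tokenize`, `rouge1_f1`, and `cosine_similarity` are hypothetical and are not taken from the original work.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 (assumed variant): unigram overlap between two texts."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Clipped overlap: a word counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors (an assumption)."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up strings for illustration only; not data from the study.
    reference = "the sea level is rising"
    candidate = "sea level rising"
    print(rouge1_f1(reference, candidate))
    print(cosine_similarity(reference, candidate))
```

A score like this could be used to compare extracted keywords or generated summaries against a reference text when tuning model parameters. In practice, embedding-based similarity (for example from a sentence-embedding model) would likely replace the term-frequency vectors used here.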

Bibliographic Details

REPOSITORY URL: https://digitalcommons.imsa.edu/sir_presentations/2023/session1/66

URL ID: https://digitalcommons.imsa.edu/sir_presentations/2023/session1/66; https://digitalcommons.imsa.edu/cgi/viewcontent.cgi?article=2097&context=sir_presentations

AUTHOR(S)

Ishan Buyyanapragada
