PlumX Metrics
Embed PlumX Metrics

Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Scientometrics, ISSN: 1588-2861, Vol: 128, Issue: 2, Page: 1187-1204
2023
  • 0
    Citations
  • 0
    Usage
  • 12
    Captures
  • 0
    Mentions
  • 0
    Social Media
Metric Options:   Counts1 Year3 Year

Metrics Details

Article Description

Constructing a bibliographic dataset is fundamental for domain analysis in bibliometric research. However, irrelevant documents(so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, requiring considerable human efforts to eliminate. To solve this problem, we propose a weak-supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive sample sets” already available in the dataset to obtain a collection of “reliable negative sample sets,” based on which a training set can be constructed for the downstream supervised classification. This paper conducted a comparative experiment using the Artificial Intelligence (AI) domain of the US National Technical Reports Library (NTIS) report as an example. We compared schemes with different variables to explore the influence of various technical aspects on the final noise reduction performance. Our approach achieved significant improvements compared with the similarity-based unsupervised baseline; the recall rose from 0.3742 to 0.8103, and the precision rose from 0.6621 to 0.7383. We found that the impact of document representation algorithms is crucial while classification strategies and s_ratio in PU-Learning are not. Our approach needs no manual annotation data and thus can provide powerful help for bibliometric researchers to construct high-quality bibliographic datasets.

Provide Feedback

Have ideas for a new metric? Would you like to see something else here?Let us know