Investigating the Influence of Feature Sources for Malicious Website Detection

Citation Data: Applied Sciences (Switzerland), ISSN: 2076-3417, Vol: 12, Issue: 6

Publication Year: 2022

Citations: 11
Usage: 0
Captures: 33
Mentions: 0
Social Media: 0


Article Description

Malicious websites in general, and phishing websites in particular, attempt to mimic legitimate websites in order to trick users into trusting them. These websites, often a primary method for credential collection, pose a severe threat to large enterprises. Credential collection enables malicious actors to infiltrate enterprise systems without triggering the usual alarms. Therefore, there is a vital need to gain deep insights into the statistical features of these websites that enable Machine Learning (ML) models to distinguish them from their benign counterparts. Our objective in this paper is to provide this investigation. More specifically, our first contribution is to observe and evaluate combinations of feature sources that have not been studied in the existing literature, primarily those involving embeddings extracted with Transformer-type neural networks. The second contribution is a new dataset for this problem, GAWAIN, constructed in a way that offers other researchers access not only to the data but also to our whole data acquisition and processing pipeline. The experiments on our new GAWAIN dataset show that the classification problem is much harder than reported in other studies: we are able to obtain a test accuracy of around 84%. In terms of individual feature contributions, the most relevant features come from URL embeddings, indicating that this additional step in the processing pipeline is needed in order to improve predictions. A surprising outcome of the investigation is the absence of content-related features (HTML, JavaScript) from the top-10 list. When comparing the prediction outcomes of models trained on features commonly used in the literature with those of models trained on embedding-related features, the gain with embeddings is slightly above 1% in test accuracy. However, we argue that even this relatively small increase can play a significant role in detecting malicious websites, and thus these feature categories are worth investigating further.
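
To make the notion of a hand-crafted URL feature source concrete, the following minimal Python sketch computes a few lexical features from a URL string. The function name `lexical_url_features` and the particular features chosen are assumptions made for illustration only; they are not taken from the paper, are not the GAWAIN feature set, and do not reproduce the Transformer-based URL embeddings the abstract refers to.

```python
from urllib.parse import urlparse


def lexical_url_features(url: str) -> dict:
    """Compute a few simple lexical URL features.

    Hypothetical, illustrative feature set; not the one used in the paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),                         # total characters in the URL
        "host_length": len(host),                       # characters in the hostname
        "num_dots_in_host": host.count("."),            # rough proxy for subdomain depth
        "num_digits": sum(ch.isdigit() for ch in url),  # digits anywhere in the URL
        "num_hyphens": url.count("-"),                  # hyphens anywhere in the URL
        "has_at_symbol": "@" in url,                    # '@' can hide the real host
        "uses_https": parsed.scheme == "https",         # scheme check only
        "path_depth": len([p for p in parsed.path.split("/") if p]),
    }


if __name__ == "__main__":
    # Made-up URL on a reserved test domain, used only to show the output shape.
    example = "http://login.example-bank.com.verify-account.test/secure/update?id=123"
    for name, value in lexical_url_features(example).items():
        print(f"{name}: {value}")
```

Features of this kind would form one feature source; an embedding-based source would instead pass the URL through a pretrained model and use the resulting vector, which is the additional pipeline step the abstract reports as most relevant.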

Bibliographic Details

DOI: 10.3390/app12062806

URL ID: http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=85126278210&origin=inward; http://dx.doi.org/10.3390/app12062806; https://www.mdpi.com/2076-3417/12/6/2806; https://dx.doi.org/10.3390/app12062806

AUTHOR(S)

Ahmad Chaiban; Xiaodong Lin; Dušan Sovilj; Hazem Soliman; Geoff Salmon

PUBLISHER(S)

MDPI AG

TAG(S)

Materials Science; Physics and Astronomy; Engineering; Chemical Engineering; Computer Science
