Predictive Analytics and Machine Learning-Based Models for E-Commerce Fraud Prevention

 

Sachin Bagoria1*, Dr. Kavita2

1 Research Scholar, SKD University, Hanumangarh, Rajasthan, India

radheykrishnalalita@gmail.com

2 Professor, SKD University, Hanumangarh, Rajasthan, India

Abstract: The e-commerce market has grown but so have online fraud and criminality. Online marketplaces can be very complex and customers are becoming more adept at new methods of fraud, which make traditional fraud detection methods ineffective. The aim of this study is to design a machine learning and predictive analysis system for e-commerce fraud prevention. It takes into account the structure of the URL, the content of the HTML, the technology profiles, SSL certificates, HTTP headers and external reputation indicators. 2,031 ecommerce sites were used for model creation and evaluation, with 739 of them being fraudulent and 1,292 authentic. XGBoost, Odd Forest, Support Vector Machine, Logistic Regression, k-Nearest Neighbour, AdaBoost, & Naοve Bayes were used to extract and evaluate 50 features. Experimentally, XGBoost outperformed the baseline with all characteristics at 0.9688 F1-Score & 97.78 accuracy rate while the baseline is 0.9653 & 97.49%. The comparative investigation revealed the superiority of the proposed system over existing fraud detection systems. The results suggest that machine learning-based predictive analytics could be a scalable and powerful tool to protect online transactions and detect fraudulent ecommerce sites.

Keywords: E-commerce Fraud Detection, Predictive Analytics, Machine Learning, XGBoost, Cybersecurity, Fraud Prevention, Classification Models, Data Mining, E-commerce Security, Artificial Intelligence.

1. INTRODUCTION

Internet technology has influenced consumers in their purchasing and accelerated global e-commerce. Consumers increasingly utilised Amazon, eBay, Alibaba, and Facebook Marketplace to buy products and services from home during the COVID-19 epidemic [1]. All the possibilities that growth has provided to companies and consumers have come with a price, namely higher cybercrime and online fraud. Fraudulent online shopfronts, phishing attacks, identity theft, account takeovers, bogus product listings, money laundering, and unauthorised financial transactions include e-commerce fraud [2,4]. Fraudulent activities can be expensive to companies and create a loss of trust on platforms. The losses from online payment fraud are increasing, thus the importance of fraud detection and prevention [5]. Cybercriminals' continually developing techniques make rule-based systems, human verification, and authentication processes unsuitable for fraud prevention [6]. Therefore, organisations are using modern technologies like machine learning (ML), artificial intelligence (AI), predictive analytics, and data mining to identify fraud in huge and complicated datasets. These techniques show hidden patterns and abnormalities that can be used to distinguish between fraudulent and legitimate transactions [6–11].

Fraud detection methods based on network analysis to detect suspicious transactions and interactions between users, transactions, and digital entities have also been investigated [12]. Due to its capacity to quickly analyse massive amounts of structured and unstructured data, machine learning-based predictive analytics remains one of the most successful and scalable ways for contemporary e-commerce. In this study, an e-commerce fraud avoidance technique based on predictive analytics and machine learning is used. This method relies on 50 predictive factors included in the URL structure, HTML content, technology profile, SSL certificate, HTTP headers, and external reputation signals. The project aims to develop a robust and scalable solution to detect fraudulent e-commerce sites and enhance commercial security online, evaluating several machine learning algorithms for their effectiveness.

2. OBJECTIVES

·                     To create and assess machine learning-based prediction models that use website-related information to reliably identify fraudulent e-commerce websites.

·                     To determine the best model for preventing e-commerce fraud by comparing the performance of several machine learning algorithms.

3. RESEARCH METHODOLOGY

The quantitative and predictive analytics-based research approach is used to develop and evaluate machine learning models to detect and prevent fraudulent online stores. The technique involves collecting data, feature extraction, feature engineering, model training, hyperparameter optimisation, and performance evaluation. The aim is to develop a system that can identify fake online stores and pinpoint patterns that distinguish them from legitimate ones so that they can be detected..

3.1 Dataset Description

In the sample of 1,292 real e-commerce websites, there were 739 fraudulent websites. The websites were categorized into 13 different industries: apparel, sports, technology, health, housing, food, education, entertainment, pets, toys, & office and industrial supplies.

The distribution of real websites, on the other hand, was more even, while the distribution of fraudulent websites was the greatest in the categories of fashion (41.95%), marketplace (31.27%), & sport (14.21%). Scammers often aim for high-demand consumer sectors, as shown by this distribution, which is reflective of actual fraud tendencies.

3.2 Feature Extraction

The third version of Python was used to create a thorough feature extraction framework. To create data from multiple online sources, six separate modules were created:

1. URL Features.

2. HTML Features.

3. Technology Features.

4. SSL Certificate Features.

5. HTTP Header Features.

6. External Reputation and Social Media Features.

For each module extracted, features were extracted and saved in a JSON structure, where the keys are feature names and the values are features. 50 qualities were grouped into 6 categories. The following criteria were used to identify e-commerce websites: structure, behavior, technology and reputation.

3.3 Feature Engineering.

Using the six feature groups, we created unique feature vectors for every website.

URL Feature Vector:

HTML Feature Vector:

Technology Feature Vector:

SSL Feature Vector:

HTTP Header Feature Vector:

External Feature Vector:

These vectors were concatenated to create a complete feature vector:

After processing all websites, feature vectors were ordered into a M Χ N matrix X, where M represents extracted features & N represents the total website samples.

3.4 Link Structure Analysis

HTML link architecture was studied to extract the features. Six categories were used to group the links:

·                     External Links.

·                     Internal Connections.

·                     Links for Internal References.

·                     No Links to Content.

·                     Links that contain no text. Links with no text attached.

·                     Links that don't have a reference (href).

Their distribution and frequency were used to indicate the structural quality and legitimacy of the website, for these link types.

3.5 Data Pre-processing

The feature values were standardised with the help of StandardScaler function of scikit learn prior to training the models. Feature scaling was performed on both training and testing sets to have all the models agree on the scaling.

3.6 Machine Learning Models

Eight popular machine learning algorithms in fraud detection research were used:

·                     x2 extreme increasing boosting (XGBoost).

·                     Random Forest(RF) Classifier.

·                     Random Forest (RF).

·                     Support Vector Machine (SVM).

·                     Logistic Regression (LR).

·                     k-Nearest Neighbour (kNN).

·                     AdaBoost.

·                     Naοve Bayes (NB).

The best hyperparameters were searched for each of the classifiers.

3.7 Model Validation

These parameters were used in a 5-fold Cross-Validation technique to ensure the robustness of the model and reduce model bias:

·                     Number of folds (k) = 5.

·                     Shuffle = True.

·                     Random State = 42.

The five parts of the data set were divided. The data set was divided into five equal parts. Four training subsets and one testing subset were used in each cycle. All folds were used to calculate performance measures and averaged results were used.

3.8 Performance Evaluation Metrics

Four common classification measures were used to assess the performance of each of the classifiers:

·                     Precision: Proportionate rate of correct identification of fraudulent sites out of total number of fraudulent sites.

·                     Recall: Calculates the percentage of correct identification of fraudulent sites by the model.

·                     Accuracy: Calculates the overall accuracy of the websites being classified.

·                     F1-Score: It is the harmonic average of Precision and Recall and gives a fair balance for the performance of the classifiers.

Confusion matrix was made up of:

·                     True Positives (TP): Correctly classified fraudulent websites.

·                     False Positives (FP): Not spam websites that end up being classified as spam.

·                     True Negatives (TN): Valid websites correctly identified.

·                     False Negatives (FN): Fraudulent websites misidentified as legitimate.

3.9 Experimental Setup

A preliminary evaluation of two experimental approaches was made:

·                     Complete Feature Set Model: All retrieved features were used in this model, such as social media cues and outside reputation.

·                     Independent Model: This method only used certain website elements that were available locally; elements from external reputation were not used. The aim was to develop an independent fraud detection programme which does not rely on other services.

3.10 Comparative Analysis

The framework recommended was discussed with Wu and Wadleigh's work done regarding fraud detection. All benchmarks used the same experimental techniques and machine learning optimisation to maintain the impartiality of the comparisons. Elements that were not available for the comparison models due to privacy, GDPR and third-party API restrictions were not included. To assess the effectiveness of the predictive analytics system in preventing online purchasing fraud, we used the metrics of accuracy, precision, recall, or F1-Score.

Table 1: Comparison of the proportion and number of categories on the authentic and counterfeit websites

Category

Fraud (n)

Fraud (%)

Legit (n)

Legit (%)

Automotive

5

0.68

45

3.48

Education

0

0.00

75

5.80

Entertainment

3

0.41

62

4.80

Fashion

310

41.95

179

13.86

Food

2

0.27

100

7.74

Health

11

1.49

142

10.99

Home

19

2.57

208

16.10

Marketplace

231

31.27

142

10.99

Office and Industrial Material

6

0.81

56

4.33

Pets

1

0.14

19

1.47

Sport

105

14.21

65

5.03

Technology

17

2.30

176

13.62

Toys

29

3.92

23

1.78

Total

739

100.00

1292

100.00

Total Transactions = 2,031 (Fraud = 739; Legit = 1,292).

Table 2: Link types based on the href tag's origin or destination

Type of Link

Example

External

<a href="otherdomain.com"></a>

Internal

<a href="thisdomain.com"></a>

Internal Reference

<a href="#internal_ref"></a>

No Content

<a href="#"></a>

Empty

<a href=""></a>

No Reference

<a></a>

 

4. RESULT

We used a 3.6 GHz Intel Core i3 9100F & 16 GB DDR4 RAM. For several experiments and machine learning model creation, we utilised scikit-learn13 & Python 3. Nine state-of-the-art classification techniques were employed to assess and compare design aspects [13, 14]. Algorithms such as XGBoost, GBC, RF, kNN, SVM, LR, NB, and Adaboost are used. Classifiers were trained using the best hyper-parameters from a 5-fold cross-validated grid search. Table 3 lists the finished projected model hyperparameters. We scaled the features vector across all features as well as instruction data using scikit-learn's StandardScaler, then applied it to test samples.

Table 3: An overview of the features that have been implemented and the group that corresponds to them

Feature ID

Group

Feature Name

Value Type

Description

U1

URL

domain_digit_count

D

Number of digits in the domain name

U2.1

URL

domain_length

D

Number of characters in the domain name

U2.2

URL

subdomain_length

D

Number of characters in the subdomain

U3.1

URL

raw_word_count

D

Number of words in the URL

U3.2

URL

average_word_length

C

Average length of words in the URL

U3.3

URL

longest_word_length

D

Length of the longest word in the URL

U3.4

URL

shortest_word_length

D

Length of the shortest word in the URL

U3.5

URL

std_word_length

C

Standard deviation of word lengths

H1

HTML

text_length

D

Number of characters in the HTML text

NH2

HTML

domain_title

B

Whether the domain appears in the title

NH3

HTML

domain_in_html

D

Number of times the domain appears in HTML text

NH4

HTML

base64

B

Website loads resources encoded in Base64

H5.1

HTML

link_int

D

Number of internal links

H5.2

HTML

link_ext

D

Number of external links

H5.3

HTML

link_#

D

Number of empty (“#”) links

H5.4

HTML

link_emp

D

Number of empty links

H5.5

HTML

link_null

D

Number of links without href attribute

H6

HTML

currencies

D

Number of currencies detected on the website

NH7.1

HTML

prices

D

Total number of prices detected

H7.2

HTML

most_times

D

Repetitions of the most frequent price

NH7.3

HTML

avg_times

C

Average repetitions of prices

NH7.4

HTML

avg_discount

C

Average discount percentage

NH8.1

HTML

num_social_html

D

Number of social media links

NH8.2

HTML

fake_fb

B

Facebook share link detected

NH8.3

HTML

fake_tw

B

Twitter share link detected

NT1

Tech

n_tech

D

Number of technologies detected

NT2.1

Tech

e-commerce

D

Number of e-commerce technologies used

NT2.2

Tech

live-chat

D

Number of live-chat technologies used

NT2.3

Tech

cookie-compliance

D

Number of cookie-compliance technologies used

NT2.4

Tech

analytics

D

Number of analytics technologies detected

NT2.5

Tech

payment-processors

D

Number of payment-processing platforms detected

NT3.1

Tech

google-analytics

B

Website uses Google Analytics

NT3.2

Tech

google-analytics-enh

B

Website uses enhanced Google Analytics for e-commerce

NT3.3

Tech

recaptcha

B

Website uses reCAPTCHA

S1

SSL

has_cert

B

Domain uses a valid SSL certificate

S2

SSL

n_name

D

Number of domain names registered in SSL certificate

NP1

HTTP

content-security-policy

B

Website defines a Content Security Policy (CSP) header

NP2

HTTP

strict-transport-security

B

HSTS is implemented

NP3

HTTP

x-content-type-options

B

Nosniff directive is enabled

NP4

HTTP

x-frame-options

B

Uses DENY or SAMEORIGIN directives

P5

HTTP

cache-control

B

Website avoids outdated post-check directive

P6

HTTP

expect-ct

B

Expect-CT header is configured

NM1

External

total_followers

D

Total social media followers

NM2

External

total_following

D

Total accounts followed on Instagram and Twitter

NM3

External

total_posts

D

Total posts on Instagram and Twitter

NM4

External

fb_likes

D

Facebook page likes

NM5

External

fb_visits

D

Facebook page visits in the last 24 hours

NM6

External

tw_age

D

Months since Twitter account registration

NM7

External

trustpilot_score

C

Trustpilot review score

NM8

External

trustpilot_reviews

D

Number of Trustpilot reviews

 

Table 4: Machine learning classifier evaluation for the suggested techniques

Classifier

Hyper-parameter

Value

XGBoost

eval_metric

error

n_estimators

120

objective

binary

scale_pos_weight

2

Gradient Boosting Classifier (GBC)

learning_rate

0.1

max_depth

3

max_features

sqrt

n_estimators

242

Random Forest (RF)

max_features

auto

n_estimators

127

k-Nearest Neighbors (kNN)

metric

manhattan

n_neighbors

2

weights

uniform

Support Vector Machine (SVM)

C

100

gamma

0.001

kernel

rbf

Logistic Regression (LR)

C

0.1

penalty

l2

AdaBoost

learning_rate

0.1

n_estimators

43

Naοve Bayes

kind

BernoulliNB

 

Finally, we used k-fold cross-validation with k = 5, shuffle = T rue, & random_state = 42 to evaluate classifier performance. We used averaged 5-fold cross-validation data to provide accuracy, precision, recall, and F1-Score [15, 16]. The number of fraudulent websites recognised properly is called true positives (TP). The FP is the number of legitimate samples misclassified as fraudulent. A properly classified sample is a true negative (TN). Finally, false negatives (FN) are fake websites misclassified as genuine.

                                                                                                          (1)

                                                                                                                (2)

                                                                                      (3)

                                                                                     (4)

This work develops a machine-learning system to notify clients about fraudulent e-commerce websites. This study investigated two strategies. The initial optimised performance with all intended features, including external services. The second technique produces a local-only model. Table 5 reveals that XGBoost had the highest F1-Score (0.9688) of all features, then GBC (0.9684) & Random Forest (0.9661). At 97.78% accuracy, the XGBoost algorithm can classify data, making it suitable for bogus website identification. Random Forest raised accuracy by 0.0069 & lowered recall by 0.0118, while the top two stars did not change. This is suggested for systems the requirement to detect particular attacks.

Table 5: Machine learning classifier evaluation for the suggested techniques

Algorithm

Full Set Precision

Full Set Recall

Full Set F1-Score

Full Set Accuracy (%)

Standalone Precision

Standalone Recall

Standalone F1-Score

Standalone Accuracy (%)

XGBoost

0.9778

0.9601

0.9688

97.78

0.9647

0.9660

0.9653

97.49

GBC

0.9751

0.9619

0.9684

97.73

0.9590

0.9686

0.9637

97.39

Random Forest

0.9847

0.9483

0.9661

97.59

0.9765

0.9342

0.9546

96.80

SVM

0.9622

0.9602

0.9611

97.19

0.9619

0.9618

0.9618

97.24

Logistic Regression (LR)

0.9566

0.9633

0.9599

97.09

0.9535

0.9576

0.9555

96.80

kNN

0.9564

0.9366

0.9463

96.21

0.9337

0.9481

0.9407

95.72

AdaBoost

0.9373

0.9534

0.9452

96.01

0.9444

0.9454

0.9448

96.01

Naοve Bayes (NB)

0.9231

0.9412

0.9320

94.84

0.9286

0.9346

0.9316

94.84

Figure 1: Full set and solo model feature important. In light colours, heritage traits from prior works; in deeper colours, unique features presented in this work.

Results of the experiment showed that XGBoost algorithm achieved the highest classification performance with the F1-Score value of 0.9688 and the accuracy of 97.78% when all the features were used. Random Forest had a slight improvement in the precision score, but a lower recall score, which makes XGBoost a better option for fraud detection. The standalone version also had a very good performance with an F1-Score of 0.9653, meaning that the performance of external reputation features did not make a large impact on the general performance. The feature importance analysis showed that the social media indicators, technology related features, HTML characteristics, and pricing information are the most important features in predicting fraudulent websites, while URL based features play a relatively small role. Additional experiments revealed that, if the HTML and technology aspects were removed, the performance of the models would be drastically diminished, indicating that they do play a significant role in fraud detection. The proposed framework was shown to be efficient and independent from language-specific, brand-specific, and expensive third-party resources, achieving a better performance than the current approaches, and being practical and scalable for e-commerce fraud prevention.

Figure 2: Findings for the top-performing algorithms when one of the resources is removed

The second experiment shows how well suggested features work alone. The 17 HTML features had the highest XGBoost and Random Forest F1-Scores, 0.9541 and 0.9564. URLs are needed for 7 of 17 HTML functions. Thus, low-resource systems should use it. The eight external feature model placed second with 0.8667 and 0.8662 GBC and Random Forest F1-Scores. The findings support this group's eight aims. Major issue: it relies on others and cannot run without specific services. The nine Wappalyzer technology report features for XGBoost and GBC have 0.8329 and 0.8333 F1-Scores, suitable for general-purpose systems. This experiment's URL, SSL, and HTTP Headers setting failed for several reasons. Due to domain name similarities, the URL set cannot distinguish fake and legitimate websites. Features were designed without keywords or brand lists since they may develop language-dependent models. Low SSL set results were due to a lack of model input as it had two features. The latest three HTTP Header sets were above-average (0.7740 GBC F1-Score). Its biggest drawbacks are its high sensitivity and false positive rate (0.9094 recall and 0.6746 accuracy on GBC) Figure 3. Compare the proposed methods to Wu et al. (2018) and Wadleigh (2015) [17,18]. Compare to other publications is impossible since none published data or restricted their methodology. Current works lack feature and method implementation information, thus we generated our own extraction, which may be different but accurate. These works cannot use third-party features in Europe under GDPR. Table 6 lists unimplemented features. For fairness, we will compare these methods to our solo version without third-party data. We employed the same experimental methodology to discover the optimum hyper-parameters as these investigations were not disclosed.

Figure 3: Findings for the top-performing algorithms when using a separate set of characteristics that match the resources used in this study

Table 6: Comparison Between Our Method and Existing Works

Work

Classifier

Precision

Recall

F1-Score

Accuracy (%)

Our Standalone Method

XGBoost

0.9647

0.9660

0.9653

95.49

Wu et al. [7]

Random Forest (RF)

0.9224

0.8795

0.9003

92.91

Wadleigh et al. [20]

XGBoost

0.6599

0.7715

0.7111

77.25

Note: Values representing the highest performance for each metric are shown in bold.

Table 7: Features Not Implemented in the Comparison Works

Work

Feature

Reason for Exclusion

Wadleigh et al. (2015)

Private or China WHOIS

No WHOIS data is publicly available for most EU websites

Wadleigh et al. (2015)

WHOIS Registration < 1 Year

No WHOIS data is publicly available for most EU websites

Wadleigh et al. (2015)

Website on Takedown Page

Our dataset contains no seized websites, only working ones

Wadleigh et al. (2015)

Website in Alexa Top 100K

Costly API requirement

Wu et al. (2018)

in_top_one_million

Costly API requirement

Wu et al. (2018)

china_registered

No WHOIS data is publicly available for most EU websites

Wu et al. (2018)

under_a_year

No WHOIS data is publicly available for most EU websites

 

Table 7 shows that our technique is better than the state-of-the-art currently available. Not only that, the suggested features don't rely on any outside data, thus they should work reliably across all countries and languages.

5. CONCLUSION

With online shopping, there has been a rise in online fraud, and there is a need to have sophisticated and effective fraud detection systems. In this study, 50 URL structure, HTML, technological profile, SSL certificate, HTTP header, and external reputation database features were used to identify fake e-commerce websites. The system was powered by machine learning and predictive analytics. A number of machine learning techniques were compared such as XGBoost, Gradient Boosting Classifier, Random Forest, SVM, Logistic Regression, k-Nearest Neighbour, AdaBoost and Naοve Bayes. XGBoost was the most successful with 96.78% feature set accuracy and 0.9688 F1-Score. The solo model achieved a good result in terms of accuracy of 97.49% and an F1-Score of 0.9653. The feature importance analysis revealed it was important HTML, technology and external reputation aspects. The framework offers a dependable, scalable, and accurate means of fighting e-commerce fraud, which helps to make online transactions safer.

References

1.                  Monteith, S., Bauer, M., Alda, M., Geddes, J., Whybrow, P. C., & Glenn, T. (2021). Increasing cybercrime since the pandemic: Concerns for psychiatry. Current Psychiatry Reports, 23(4), 18.

2.                  Kodate, S., Chiba, R., Kimura, S., & Masuda, N. (2020). Detecting problematic transactions in a consumer-to-consumer e-commerce network. Applied Network Science, 5(1), 90.

3.                  Samani, R., & Davis, G. (2019). McAfee mobile threat report. McAfee. https://www.mcafee.com/enterprise/en-us/assets/reports/rp-mobile-threat-report-2019.pdf

4.                  Smith, S., & Juniper Research. (2024). Online payment fraud: Market forecasts, emerging threats & segment analysis 2022–2027. Juniper Research. https://www.juniperresearch.com/press/losses-online-payment-fraud-exceed-362-billion/

5.                  Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3), 559–569.

6.                  Abdallah, A., Maarof, M. A., & Zainal, A. (2016). Fraud detection system: A survey. Journal of Network and Computer Applications, 68, 90–113.

7.                  Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 17(3), 235–255.

8.                  Phua, C., Lee, V., Smith, K., & Gayler, R. (2010). A comprehensive survey of data mining-based fraud detection research (arXiv:1009.6119) . arXiv. https://arxiv.org/abs/1009.6119

9.                  Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: A survey. Data Mining and Knowledge Discovery, 29(3), 626–688.

10.              Irani, D., Webb, S., & Pu, C. (2010). Study of static classification of social spam profiles in MySpace. Proceedings of the International AAAI Conference on Web and Social Media, 4(1), 82–89.

11.              Bhowmick, S., & Hazarika, S. M. (2016). Machine learning for E-mail spam filtering: Review, techniques and trends (arXiv:1606.01042) . arXiv. https://arxiv.org/abs/1606.01042

12.              Savage, D., Zhang, X., Yu, X., Chou, P., & Wang, Q. (2014). Anomaly detection in online social networks. Social Networks, 39, 62–70.

13.              Mostard, W., Zijlema, B., & Wiering, M. (2019). Combining visual and contextual information for fraudulent online store classification. In Proceedings of the International Conference (pp. 84–90). https://doi.org/10.1145/3350546.3352504

14.              Beltzung, L., Lindley, A., Dinica, O., Hermann, N., & Lindner, R. (2020). Real-time detection of fake-shops through machine learning. In 2020 IEEE International Conference on Big Data (pp. 2254–2263). https://doi.org/10.1109/BigData50022.2020.9378204

15.              Maktabar, M., Zainal, A., Maarof, M. A., & Kassim, M. N. (2018). Content based fraudulent website detection using supervised machine learning techniques. Advances in Intelligent Systems and Computing, 734, 294–304. https://doi.org/10.1007/978-3-319-76351-4_30

16.              Khoo, E., Zainal, A., Ariffin, N., Kassim, M. N., Maarof, M. A., & Bakhtiari, M. (2021). Fraudulent e-commerce website detection model using HTML, text and image features. Advances in Intelligent Systems and Computing, 1182, 177–186. https://doi.org/10.1007/978-3-030-49345-5_19

17.              Wu, K., Chou, S., Chen, S., Tsai, C., & Yuan, S. (2018). Application of machine learning to identify counterfeit websites. In Proceedings of the International Conference (pp. 321–324). https://doi.org/10.1145/3282373.3282407

18.              Wadleigh, J., Drew, J., & Moore, T. (2015). The e-commerce market for “lemons”: Identification and analysis of websites selling counterfeit goods. In Proceedings of the 24th International Conference on World Wide Web (pp. 1188–1197). https://doi.org/10.1145/2736277.2741677