Bot filtering in Query Service with machine learning
Bot activity can distort analytics metrics and damage data integrity. Adobe Experience Platform Query Service can be used to maintain your data quality through bot filtering.
Bot filtering allows you to maintain your data quality by broadly removing data contamination that results from non-human interaction with your website. This process is achieved through the combination of SQL queries and machine learning.
Bot activity can be identified in a number of ways. The approach taken in this document focuses on activity spikes, in this case the number of actions taken by a user in a given period of time. Initially, an SQL query sets an arbitrary threshold for the number of actions within a period of time that qualifies as bot activity. This threshold is then refined dynamically using a machine learning model to improve its accuracy.
This document provides an overview and detailed examples of the SQL bot filtering queries and the machine learning models necessary for you to set up the process in your environment.
Getting started
Because part of this process requires you to train a machine learning model, this document assumes a working knowledge of one or more machine learning environments.
This example uses Jupyter Notebook as a development environment. Although there are many options available, Jupyter Notebook is recommended because it is an open-source web application that has low computational requirements. It can be downloaded from the official Jupyter site.
Use Query Service to define a threshold for bot activity
The two attributes used to extract data for bot detection are:
- Experience Cloud Visitor ID (ECID, also known as MCID): This provides a universal, persistent ID that identifies your visitors across all Adobe solutions.
- Timestamp: This provides the time and date in UTC format when an activity occurred on the website.
Note: mcid is still found in namespace references to the Experience Cloud Visitor ID, as seen in the example below.
The following SQL statement provides an initial example to identify bot activity. The statement assumes that if a visitor performs more than 50 clicks within one minute, then the user is a bot.
SELECT *
FROM <YOUR_TABLE_NAME>
WHERE enduserids._experience.mcid NOT IN (SELECT enduserids._experience.mcid
                                          FROM <YOUR_TABLE_NAME>
                                          GROUP BY Unix_timestamp(timestamp) / 60,
                                                   enduserids._experience.mcid
                                          HAVING Count(*) > 50);
The expression filters the ECIDs (mcid) of all visitors that meet the threshold but does not address spikes in traffic from other intervals.
Improve bot detection with machine learning
The initial SQL statement can be refined to become a feature extraction query for machine learning. The improved query should produce more features over a variety of intervals, so that machine learning models can be trained with high accuracy.
The example statement collects the maximum number of clicks for each ECID (mcid) over several durations. The one-minute (60-second) window of the initial statement is kept, and five-minute (300-second) and 30-minute (1800-second) windows are added.
SELECT table_count_1_min.mcid AS id,
       count_1_min,
       count_5_mins,
       count_30_mins
FROM ( ( (SELECT mcid,
                 Max(count_1_min) AS count_1_min
          FROM (SELECT enduserids._experience.mcid.id AS mcid,
                       Count(*) AS count_1_min
                FROM <YOUR_TABLE_NAME>
                GROUP BY Unix_timestamp(timestamp) / 60,
                         enduserids._experience.mcid.id)
          GROUP BY mcid) AS table_count_1_min
         LEFT JOIN (SELECT mcid,
                           Max(count_5_mins) AS count_5_mins
                    FROM (SELECT enduserids._experience.mcid.id AS mcid,
                                 Count(*) AS count_5_mins
                          FROM <YOUR_TABLE_NAME>
                          GROUP BY Unix_timestamp(timestamp) / 300,
                                   enduserids._experience.mcid.id)
                    GROUP BY mcid) AS table_count_5_mins
                ON table_count_1_min.mcid = table_count_5_mins.mcid )
       LEFT JOIN (SELECT mcid,
                         Max(count_30_mins) AS count_30_mins
                  FROM (SELECT enduserids._experience.mcid.id AS mcid,
                               Count(*) AS count_30_mins
                        FROM <YOUR_TABLE_NAME>
                        GROUP BY Unix_timestamp(timestamp) / 1800,
                                 enduserids._experience.mcid.id)
                  GROUP BY mcid) AS table_count_30_mins
              ON table_count_1_min.mcid = table_count_30_mins.mcid )
The result of this expression is a table with one row per ECID and the following columns:
id | count_1_min | count_5_mins | count_30_mins
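The shape of the feature extraction can also be sketched in plain Python. The example below is an illustration under assumptions, not the Query Service implementation: the `extract_features` function, the `WINDOWS` mapping, and the sample events are hypothetical. For each window size it counts events per visitor per bucket and keeps the maximum, which mirrors the nested Count and Max aggregations in the query above.

```python
from collections import Counter, defaultdict

# Feature name -> window size in seconds (same windows as the SQL statement).
WINDOWS = {"count_1_min": 60, "count_5_mins": 300, "count_30_mins": 1800}


def extract_features(events, windows=WINDOWS):
    """Return {visitor_id: {feature_name: max events in any one bucket}}.

    `events` is a list of (visitor_id, unix_timestamp) pairs.
    """
    features = defaultdict(dict)
    for name, seconds in windows.items():
        # Inner aggregation: events per (visitor, bucket).
        per_bucket = Counter((vid, ts // seconds) for vid, ts in events)
        # Outer aggregation: the busiest bucket per visitor.
        for (vid, _bucket), n in per_bucket.items():
            features[vid][name] = max(features[vid].get(name, 0), n)
    return dict(features)


# Hypothetical sample: 10 events one second apart, then 5 more a minute later.
start = 1800 * 1000  # a multiple of 60, 300, and 1800 seconds
events = [("v1", start + i) for i in range(10)]
events += [("v1", start + 60 + i) for i in range(5)]

rows = extract_features(events)
print(rows)
```

The one-minute feature only sees the busiest single minute, while the longer windows accumulate both bursts. That difference is what gives the model several candidate features to choose from.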
Identify new spike thresholds using machine learning
Next, export the resulting query dataset into CSV format and then import it into Jupyter Notebook. From that environment, you can train a machine learning model using current machine learning libraries. See the troubleshooting guide for more details on how to export data from Query Service in CSV format.
The ad hoc spike thresholds initially established are not data-driven and therefore lack accuracy. Machine learning models can instead learn the thresholds from the data. As a result, you can also increase query efficiency by removing unneeded features, which reduces the number of GROUP BY clauses.
This example uses the scikit-learn machine learning library, which must be available in your Jupyter environment (it is bundled with distributions such as Anaconda). The "pandas" Python library is also imported for use here. The following commands are input into Jupyter Notebook.
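import pandas as pd

# Load the CSV file exported from Query Service.
df = pd.read_csv('data/bot.csv')
# Keep only visitors with more than one action in their busiest minute.
df = df[df['count_1_min'] > 1]
# Label visitors using the initial ad hoc threshold (more than 50 actions per minute).
df['is_bot'] = 0
df.loc[df['count_1_min'] > 50, 'is_bot'] = 1
df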
Next, you must train a decision tree classifier on the dataset and observe the logic resulting from the model.
The "Matplotlib" library is used to visualize the decision tree classifier in the example below.
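from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from matplotlib import pyplot as plt

# Select the feature columns by position and use the last column (is_bot) as the label.
X = df.iloc[:, [1, 3]]
y = df.iloc[:, -1]

# A tree limited to two leaf nodes learns a single feature and threshold.
clf = DecisionTreeClassifier(max_leaf_nodes=2, random_state=0)
clf.fit(X, y)

print("Model Accuracy: {:.5f}".format(clf.score(X, y)))

# Plot the fitted tree to read off the learned split.
tree.plot_tree(clf, feature_names=list(X.columns))
plt.show()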
The values returned from Jupyter Notebook for this example are as follows.
Model Accuracy: 0.99935
The results for the model shown in the example above are over 99% accurate.
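A tree limited to two leaf nodes amounts to a single split: one feature and one threshold. The following pure-Python sketch illustrates how such a split can be chosen by minimizing weighted Gini impurity. It is a simplified illustration and not scikit-learn's implementation; the `gini` and `best_split` functions and the sample rows are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels, feature_names):
    """Return (feature_name, threshold) for the single split with the lowest
    weighted Gini impurity, or None if no split is possible.
    Samples with a value <= threshold go to the left leaf."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, t)
    return None if best is None else (best[1], best[2])


# Hypothetical sample rows: (count_1_min, count_5_mins) per visitor.
names = ["count_1_min", "count_5_mins"]
rows = [(3, 12), (30, 20), (8, 40), (20, 150), (55, 260), (60, 300)]
labels = [0, 0, 0, 1, 1, 1]

split = best_split(rows, labels, names)
print(split)
```

In this sample the one-minute counts of bots and non-bots overlap, so the search settles on the five-minute feature. This is the sense in which the model, rather than an ad hoc guess, selects both the window and the threshold.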
Because the decision tree classifier can be retrained on a regular cadence using data from scheduled queries in Query Service, you can maintain data integrity by filtering bot activity with a high degree of accuracy. The parameters derived from the machine learning model can then be used to update the original queries.
The example model determined with a high degree of accuracy that any visitors with more than 130 interactions in five minutes are bots. This information can now be used to refine your bot filtering SQL queries.
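One way to carry a learned threshold back into the filter is to generate the SQL text from the window size and threshold. The sketch below assumes the same query structure as the initial statement in this document. The `build_bot_filter_sql` helper is hypothetical, and the string formatting is for illustration only (it does not validate or escape the table name).

```python
def build_bot_filter_sql(table, window_seconds, threshold):
    """Return the bot filtering statement for one window and threshold.

    Mirrors the structure of the initial query; only the bucket size
    and the HAVING threshold change.
    """
    return (
        "SELECT *\n"
        f"FROM {table}\n"
        "WHERE enduserids._experience.mcid NOT IN (\n"
        "    SELECT enduserids._experience.mcid\n"
        f"    FROM {table}\n"
        f"    GROUP BY Unix_timestamp(timestamp) / {int(window_seconds)},\n"
        "             enduserids._experience.mcid\n"
        f"    HAVING Count(*) > {int(threshold)});"
    )


# The example model's finding: more than 130 interactions in five minutes.
sql = build_bot_filter_sql("<YOUR_TABLE_NAME>", window_seconds=300, threshold=130)
print(sql)
```

Regenerating the statement this way each time the model is retrained keeps the filter aligned with the most recent threshold.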
Next steps
By reading this document, you have a better understanding of how to use Query Service and machine learning to determine and filter bot activity.
For another example of how Query Service can support your organization's strategic business insights, see the abandoned browse use case example.