Introduction: Targeted Advertising using Machine Learning
Advertising is essential for building brand value. The current trend is online advertising, which targets customers based on their social media activity. Targeted Advertising using Machine Learning works like this: when someone searches for a smartwatch, smartwatch companies show that person their ads, which means they target the customer according to search history. It is based on the traits and behavioral patterns of different people, so which ads are shown depends on customer behavior. Nowadays people, knowingly or unknowingly, are churning out personal data at an unprecedented scale through their electronic devices. Put simply, if a device has an internet connection, it is likely leaking personal information to advertisers.
Today almost everyone uses a mobile phone or a laptop and depends heavily on the internet to solve problems. The younger generation in particular is moving towards online shopping, so a large number of companies hold customer data.
If data is generated about such internet activities, then the advertisers can virtually know their customers at a personal level and thus advertise to them according to their needs.
This is precisely what organizations are doing today. Targeted Advertising using Machine Learning has become so profitable that software giants like Google and Facebook earn a major part of their revenue by micro-targeting their users and advertising their clients’ products.
Targeted advertising is a type of online advertising where ads are shown to the user based on their history. Many online companies today use this approach because it saves money and time: relevant ads are shown only to potential customers.
Targeted Advertising using Machine Learning works on keyword matching. Each ad is associated with a keyword or phrase, and the ad is shown to users who search for a keyword similar to the one associated with the advertisement. But one factor is not enough for machine learning. Other factors, such as the websites a user visits and the products they show interest in, are all taken into account to provide users with relevant advertisements for products they might be interested in.
However, targeting the right audience is still a challenge in online marketing. Spending millions to display the advertisement to the audience that is not likely to buy a product can be costly.
In this article, we are going to do Targeted Advertising using Machine Learning. We will work with the advertising data of a marketing agency to develop a machine learning algorithm that predicts whether a particular user will click on an advertisement. The data consists of 10 features: ‘Daily Time Spent on Site’, ‘Age’, ‘Area Income’, ‘Daily Internet Usage’, ‘Ad Topic Line’, ‘City’, ‘Male’, ‘Country’, ‘Timestamp’ and ‘Clicked on Ad’.
The main variable we are interested in is ‘Clicked on Ad’. This variable is binary: it can take the values 0 and 1, where 0 refers to the case where a user didn’t click the advertisement, while 1 refers to the case where a user clicked the advertisement.
We will see if we can use the other 9 variables to accurately predict the value of the ‘Clicked on Ad’ variable. We will also perform some exploratory data analysis to see how ‘Daily Time Spent on Site’ in combination with ‘Ad Topic Line’ affects the user’s decision to click on the ads.
Tutorial:
Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read the data
data = pd.read_csv('datasets_advertising.csv')
Let's check the first rows of the data
data.head()
We want to check how much data we have within each variable.
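The output of this check is not reproduced here. As a minimal sketch of one way to run it, the snippet below counts non-null and missing entries per column. The `sample` rows are made up for illustration and are not records from the advertising file.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the advertising data (not real records).
sample = pd.DataFrame({
    'Daily Time Spent on Site': [68.95, 80.23, None],
    'Age': [35, 31, 26],
    'Clicked on Ad': [0, 0, 1],
})

# count() gives the number of non-null entries per column;
# isnull().sum() gives the number of missing entries per column.
non_null_counts = sample.count()
missing_counts = sample.isnull().sum()

print(non_null_counts)
print(missing_counts)
```

Applied to the tutorial's `data` frame, `data.info()` prints the non-null counts for every column in a single call.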
Good. All variables are complete and there are no missing values within them. Each of them contains 1000 elements, so there is no need for additional preprocessing of the raw data.
We will use the ‘describe’ function to gain insight into the ranges over which the variables vary:
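The call itself might look like the sketch below. It runs on a small made-up frame (the values are invented, not taken from the dataset), so only the shape of the result matters, not the numbers.

```python
import pandas as pd

# Invented values standing in for two numeric columns of the dataset.
sample = pd.DataFrame({
    'Age': [35, 31, 26, 29],
    'Area Income': [61833.90, 68441.85, 59785.94, 54806.18],
})

# describe() returns count, mean, std, min, quartiles and max for each numeric column.
summary = sample.describe()
print(summary.loc[['min', 'mean', 'max']])
```

On the full data the equivalent call would be `data.describe()`.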
An interesting fact from the table is that the smallest area income is $13,996.50 and the highest is $79,484.80. This suggests that site visitors belong to different social classes. It can also be concluded that we are analyzing a popular website, since users spend between 32 and 91 minutes per day on it. These are really big numbers!
Furthermore, the average age of a visitor is 36 years. The youngest user is 19 and the oldest is 61 years old. We can conclude that the site is targeting adult users. Finally, if we are wondering whether the site is visited more by men or women, we can see that the split is almost equal (52% in favor of women).
The next important step is to analyze our data. Let’s first plot the kernel density estimate for the ‘Age’ variable, together with a rug plot and a fitted normal curve.
from scipy.stats import norm
# Note: distplot is deprecated in recent seaborn releases (kdeplot and rugplot are the replacements)
sns.distplot(data['Age'], hist=False, color='r', rug=True, fit=norm);
It can be concluded that the variable ‘Age’ is approximately normally distributed. In the following articles we will see why this is good for effective data processing.
Let’s plot a two-dimensional density plot to determine the interdependence of two variables. Let’s see how the user’s age and the time spent on the site are linked.
f, ax = plt.subplots(figsize=(10, 10))
# Recent seaborn versions require x= and y= keywords instead of positional arguments
sns.kdeplot(x=data['Age'], y=data['Daily Time Spent on Site'], color="b", ax=ax)
sns.rugplot(x=data['Age'], color="r", ax=ax)
sns.rugplot(y=data['Daily Time Spent on Site'], ax=ax)
From the plot, we can conclude that younger users spend more time on the website. This implies that users between 20 and 40 years of age can be the main target group for the marketing campaign. Hypothetically, if we have a product intended for middle-aged people, this is the right site for advertising. Conversely, if we have a product intended for people over the age of 60, it would be a mistake to advertise on this site.
We will present another density graphic and determine the interdependency of ‘Daily Time Spent on Site’ and ‘Daily Internet Usage’.
f, ax = plt.subplots(figsize=(8, 8))
cmap = sns.cubehelix_palette(as_cmap=True, start=0, dark=0, light=3, reverse=True)
# levels= and fill= replace the deprecated n_levels= and shade= arguments
sns.kdeplot(x=data['Daily Time Spent on Site'], y=data['Daily Internet Usage'], cmap=cmap, levels=100, fill=True);
From the above plot, it is clear that users who spend more time on the internet also spend more time on the site.
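A visual impression like this can be backed up with a correlation coefficient. The sketch below computes Pearson's r on a few invented value pairs (they are not taken from the dataset), so the resulting number only illustrates the call.

```python
import pandas as pd

# Invented value pairs, chosen to rise together for the sake of the example.
sample = pd.DataFrame({
    'Daily Time Spent on Site': [35.0, 48.5, 62.0, 75.5, 88.0],
    'Daily Internet Usage': [120.0, 150.0, 190.0, 220.0, 250.0],
})

# Series.corr() computes the Pearson correlation coefficient by default.
r = sample['Daily Time Spent on Site'].corr(sample['Daily Internet Usage'])
print(round(r, 3))
```

On the real data the same expression with `data` in place of `sample` gives the strength of the relationship seen in the plot.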
Now we plot a scatter matrix using the scatter_matrix function. We will include only the numerical variables in this analysis.
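The plotting code is not shown in the article. A minimal sketch, assuming the function meant is `pandas.plotting.scatter_matrix`, could look as follows. The `sample` frame is randomly generated stand-in data, not the advertising dataset.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated stand-in data with a few of the dataset's column names.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'Daily Time Spent on Site': rng.uniform(32, 91, 50),
    'Age': rng.integers(19, 62, 50),
    'Clicked on Ad': rng.integers(0, 2, 50),
})

# Keep only numeric columns, then draw every pairwise scatter plot
# with histograms on the diagonal.
numeric = sample.select_dtypes(include='number')
axes = scatter_matrix(numeric, alpha=0.5, figsize=(8, 8), diagonal='hist')
```

The function returns a grid of axes with one row and one column per numeric variable.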
The plot gives a good insight into the properties of the users who click on the advertisements. On this basis, a large number of further analyses can be made.
As we can see from the summary table of the categorical columns, all the values in the “Ad Topic Line” column are unique, while the “City” column contains 969 unique values out of 1000. There are too many unique elements within these two categorical columns, and it is generally difficult to make predictions without a pattern in the data. Because of that, they will be omitted from further analysis. For the third categorical variable, “Country”, the most frequent value (France) appears only 9 times. Additionally, we can determine the countries with the highest number of visitors:
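The original output tables are not reproduced here. One way to obtain such counts is sketched below on a handful of made-up rows (the city and country values are illustrative only).

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
sample = pd.DataFrame({
    'City': ['Wrightburgh', 'West Jodi', 'Davidton', 'West Jodi'],
    'Country': ['Tunisia', 'Nauru', 'France', 'France'],
})

# Number of distinct values in each categorical column.
unique_counts = sample[['City', 'Country']].nunique()

# Visitors per country, most frequent first.
country_counts = sample['Country'].value_counts()
top_country = country_counts.idxmax()

print(unique_counts)
print(country_counts.head(10))
```

On the full data, `data['Country'].value_counts().head(10)` would list the ten countries with the most visitors.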
As we have already seen, there are 237 unique countries in our dataset and no single country is too dominant. Such a large number of unique values makes it hard for a machine learning model to establish useful relationships. For that reason, this variable will be excluded too.
Next, we will analyze the ‘Timestamp’ category. It represents the exact time when a user clicked on the advertisement. We will expand this category to 4 new categories: month, day of the month, day of the week, and hour. In this way, we will get new variables that an ML model will be able to process and find possible dependencies and correlations. Since we have created new variables, we will exclude the original variable “Timestamp” from the table. The “Day of the week” variable contains values from 0 to 6, where each number represents a specific day of the week (from Monday to Sunday).
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
data['Month'] = data['Timestamp'].dt.month
data['Day of the month'] = data['Timestamp'].dt.day
data["Day of the week"] = data['Timestamp'].dt.dayofweek
data['Hour'] = data['Timestamp'].dt.hour
data = data.drop(['Timestamp'], axis=1)
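To check how the extracted values come out, here is a small sketch on two illustrative timestamps (chosen for the example, not quoted from the dataset). It confirms that `dt.dayofweek` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Two illustrative timestamps: 2016-03-27 was a Sunday, 2016-04-04 a Monday.
stamps = pd.to_datetime(pd.Series(['2016-03-27 00:53:11', '2016-04-04 01:39:02']))

parts = pd.DataFrame({
    'Month': stamps.dt.month,
    'Day of the month': stamps.dt.day,
    'Day of the week': stamps.dt.dayofweek,  # Monday = 0, Sunday = 6
    'Hour': stamps.dt.hour,
})
print(parts)
```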
Split the dataset
Once preprocessing is complete, we need to divide the data into two parts: a training set and a test set. We will import and use the train_test_split function for that. The remaining variables except ‘Clicked on Ad’ (and, in the code below, ‘Hour’) will be the input values X for the ML models. The variable ‘Clicked on Ad’ will be stored in y and is the variable to be predicted. We arbitrarily chose to allocate 33% of the total data to the test set, which leaves 67% for the training set. In simple words, X holds the features and y is the target variable.
from sklearn.model_selection import train_test_split
X = data[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Male', 'Month', 'Day of the month', 'Day of the week']]
y = data['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
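As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to a plain list of 1000 row indices, used here as a stand-in for the 1000 rows of the dataset.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 1000 rows of the dataset: just their row numbers.
rows = list(range(1000))

train_rows, test_rows = train_test_split(rows, test_size=0.33, random_state=42)

# With 1000 rows and test_size=0.33, 330 rows go to the test set
# and the remaining 670 to the training set.
print(len(train_rows), len(test_rows))
```

The test-set size of 330 matches the totals of the confusion matrices reported later in the article.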
Create a Model
In this article, we will use two classification models: a Logistic Regression model and a Decision Tree model.
The Logistic Regression model is an algorithm that uses a logistic function to model binary dependent variables. It is a tool for predictive analysis and it is used to explain the relationships between multiple variables.
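The logistic function mentioned here can be written in a few lines. This is only a sketch of the idea, not scikit-learn's implementation, and the helper names `logistic` and `predict_click` as well as the 0.5 threshold are assumptions made for the example.

```python
import math

def logistic(z):
    """Map a real-valued score z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_click(score, threshold=0.5):
    """Return 1 (click) if the modeled probability reaches the threshold, else 0."""
    return int(logistic(score) >= threshold)

print(logistic(0))          # a score of 0 corresponds to a probability of 0.5
print(predict_click(2.0))   # positive score, probability above 0.5
print(predict_click(-2.0))  # negative score, probability below 0.5
```

In a fitted model the score is a weighted sum of the input features, and the weights are what training estimates.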
The Decision Tree is one of the most commonly used data mining techniques for analysis and modeling. It is used for classification, prediction, estimation, clustering, data description, and visualization. The advantages of Decision Trees compared to other data mining techniques are simplicity and computational efficiency.
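To give a feel for how a tree chooses its splits, the sketch below computes Gini impurity, the default split criterion of scikit-learn's DecisionTreeClassifier. The helper names `gini` and `split_impurity` are made up for this example, and the labels are toy values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a group of class labels: 0 means the group is pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two groups produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A split that separates clickers (1) from non-clickers (0) perfectly
# has lower impurity than one that leaves both groups mixed.
print(split_impurity([0, 0], [1, 1]))
print(split_impurity([0, 1], [0, 1]))
```

A tree grows by repeatedly picking the feature threshold whose split gives the lowest weighted impurity.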
The first model we will import is the Logistic Regression model. First, it is necessary to load the LogisticRegression class from the sklearn.linear_model module. We will also load accuracy_score and confusion_matrix to evaluate the classification performance of the model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
The next steps are the initialization of the model, its training, and finally, making predictions.
model_1 = LogisticRegression(solver='lbfgs')
model_1.fit(X_train, y_train)
predictions_LR = model_1.predict(X_test)
print('Logistic regression accuracy:', accuracy_score(y_test, predictions_LR))
print('')
print('Confusion matrix:')
print(confusion_matrix(y_test, predictions_LR))
Output:
Logistic regression accuracy: 0.906060606060606
Confusion matrix:
[[158   4]
 [ 27 141]]
As can be observed, the confusion matrix also describes the performance of the model. It can only be computed on a data set for which the true labels are known, such as our test set.
Our confusion matrix tells us that the total number of accurate predictions is 158 + 141 = 299. On the other hand, the number of incorrect predictions is 27 + 4 = 31. We can be satisfied with the prediction accuracy of our model.
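The arithmetic can be reproduced directly from the matrix printed above. The sketch below also derives precision and recall for the "clicked" class, which the article does not report. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported above for Logistic Regression.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
matrix = [[158, 4],
          [27, 141]]

tn, fp = matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = matrix[1]  # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted clicks that were real clicks
recall = tp / (tp + fn)        # share of real clicks that were predicted

print(total, round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The lower recall reflects the 27 users who clicked but were predicted not to.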
from sklearn.tree import DecisionTreeClassifier
model_2 = DecisionTreeClassifier()
model_2.fit(X_train, y_train)
predictions_DT = model_2.predict(X_test)
print('Decision Tree Accuracy:', accuracy_score(y_test, predictions_DT))
print('')
print('Confusion matrix:')
print(confusion_matrix(y_test, predictions_DT))
Output:
Decision Tree Accuracy: 0.9393939393939394
Confusion matrix:
[[153   9]
 [ 11 157]]
Our confusion matrix tells us that the total number of accurate predictions is 153 + 157 = 310. On the other hand, the number of incorrect predictions is 11 + 9 = 20.
After applying both models, we conclude that the Decision Tree (DT) performed better than Logistic Regression.
The confusion matrix shows that 310 predictions were correct and only 20 were incorrect. Decision Tree accuracy is also higher by about 3 percentage points than that of the Logistic Regression model (93.9% versus 90.6%).
Even so, it is worth working with both models.
The prediction results can change with a different approach, such as a different random seed or a different data analysis.
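The effect of the random seed can be seen in a small experiment. The sketch below uses synthetic data from `make_classification` as a stand-in for the advertising features (an assumption made so that the example runs on its own), so the accuracies it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 1000 rows, 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

scores = []
for seed in range(5):
    # Change the seed for both the split and the tree on every run.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, tree.predict(X_te)))

print('lowest accuracy:', min(scores))
print('highest accuracy:', max(scores))
```

The spread between the lowest and highest score gives a rough idea of how much a single reported accuracy depends on the particular split.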
We encourage you to do your own analysis from the beginning, to find new dependencies between variables and display them graphically. After that, create a new training set and a new test set. Let the training set contain a larger amount of data than in the article. Fit and evaluate your model. In the end, praise yourself in a comment if you get improved performance.