Introduction

At the Code The City 13 – Alcohol Hack weekend, I worked as part of team #cantchadontchawontcha. You will find more information on this CTC page.

The challenge we faced was that brewers of craft beers in particular, in a bid to make their products stand out, are making their designs more colourful, using more cartoon characters and playful fonts, and reducing the clarity of labelling. It was suggested that the Portman Group, the UK's regulatory body for the advertising and marketing of alcohol, is not effective in controlling producers.

Approach

To test this, we sourced 238 images of drinks cans and bottles: 189 of alcoholic drinks and 49 of non-alcoholic ones.

We set up a Google Forms web interface and had a panel of users rate the labels (on scales from 1 to 10) for colourfulness, use of cartoons, the heaviness of fonts, and the clarity and size of the labelling of the product. This was done over a few hours as a demonstration of what was possible: it was not rigorous or controlled. See the future works section at the foot of this article for more on this.

At the end of the first phase we had a CSV file with the following fields:

[1] “Timestamp”
[2] “Photo of container”
[3] “Name of drink (optional)”
[4] “Colour”
[5] “Cartoon / Characters”
[6] “Fonts”
[7] “How informative is the label in describing the contents”
[8] “First impression”
[9] “What is it really”

The two fields of interest were [8], which measured how the viewer perceived the drink, and [9], whether the drink actually was alcoholic or not.

The code and approach below are largely as I created them at the weekend, but with some refinements which I made today before writing this blog post. You can find all of the code, the CSV of data, and all of the image files which we harvested here on Github.

What is the question?

Before we leap into coding it is always worthwhile reminding ourselves of the question we are attempting to address. Originally we wondered if we could predict, from attributes such as colour and font, whether a drink was alcoholic or not. It appears from a variation of the code below that you can do that with some accuracy.

But that was not our primary question, which was: "given the data for colour, font etc., with what accuracy could we predict whether a person would perceive it as alcoholic or not?" Again, see the future works section below for more on that.

The code

First we begin by loading some libraries which we will need, then importing the data into a data frame, and doing some exploratory data analysis. We also rename the columns to more friendly names.

library(readr)
library(dplyr)
library(tidyr)
library(caret)

# read the survey responses and take a first look at the data
df <- read_csv("~/development/cantcha/data/Cantcha Data Gather (Responses) - Form Responses 1.csv")
str(df)
names(df)

# rename the columns to friendlier names
colnames(df) <- c("Timestamp", "Photo", "Name", "Colour", "Cartoon", "Fonts", "Informative", "perception", "type")

If we look at the perception column, we can see what the values are:

num_class <- length(unique(df$perception))
uniqueClasses <- as.character(unique(df$perception)) 
cat("There are ", num_class, "unique classes, which are", uniqueClasses)

> There are 5 unique classes, which are Clearly Alcohol Probably Alcohol Really can’t tell Probably non-alcoholic Clearly non-alcoholic

Preprocessing

We aren’t really interested in how certain a person was that the drink was alcoholic or not, just whether they thought it was or wasn't. Nor do we care that they were unsure. Let’s convert the 5 values into a binary yes/no and remove the uncertain "Really can’t tell" responses.

To do that, we create a new column "add_col" and work out its values based on those in the perception column. Then we make a copy of the data frame, keeping only those rows which don't have an NA value in the new column. Now we are working with a binary yes/no.

 
# map the five perception values onto Yes / No, with NA for the uncertain responses
index <- c("Clearly Alcohol", "Probably Alcohol", "Really can't tell", "Probably non-alcoholic", "Clearly non-alcoholic")
values <- c("Yes", "Yes", NA, "No", "No")
df$add_col <- values[match(df$perception, index)]
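
The step the prose describes next, copying the data frame and dropping the rows with an NA in the new column, isn't shown in the original listing. A minimal sketch of what it might look like, with the result named df2 to match the check below:

# keep only the rows where add_col has a definite Yes / No value
df2 <- df[!is.na(df$add_col), ]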

Running a quick check on the new data frame (df2) we can see that we have dropped from 238 rows to 205, and we have a new column.

str(df2)

We are getting closer – but we can do a little more preprocessing. We can make sure that our target class is labelled appropriately, then drop redundant attributes, and finally check that our classes are just yes/no. We might as well put all of that in a new data frame (which preserves the earlier work, should we need to go back).

colnames(df2) <- c("Timestamp", "Photo", "Name", "Colour", "Cartoon", "Fonts", "Informative", "not_class", "ex_class", "class")

# drop redundant columns
drop.cols <- c("Timestamp", "Photo", "Name", "not_class", "ex_class")
reduced_df <- df2 %>% select(-one_of(drop.cols))

#how many classes - and what are they?
num_class <- length(unique(df2$class))
uniqueClasses <- as.character(unique(df2$class))
cat("There are ", num_class, "unique classes, which are", uniqueClasses)

Now we can set the class attribute to be a factor, and check their distribution.

# set class to be a factor
reduced_df[["class"]] = factor(reduced_df[["class"]])
#class distribution
classDis <- as.data.frame(table(reduced_df$class))
names(classDis) <- c("label","count")
classDis

> output
> label  count
> No     86
> Yes    119

Lastly we will split our data into train and test sets, so that we can build a model on the training data and then test its accuracy on the unseen test data.


# divide the dataset into train (70%) and test (30%)

set.seed(135) #to get repeatable data
dt = sort(sample(nrow(reduced_df), nrow(reduced_df)*.7))
train <- reduced_df[dt,]
test <- reduced_df[-dt,]

#class distribution of TEST
classDis <- as.data.frame(table(test$class))
names(classDis) <- c("label","count")
classDis

#output

# label count
# No 26
# Yes 36

#class distribution of TRAIN
classDis <- as.data.frame(table(train$class))
names(classDis) <- c("label","count")
classDis

#output

# label count
# No 60
# Yes 83

So we can see that the distribution of classes in test and train is approximately the same. Now we can proceed to build a model.
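
Before we do, one aside: the split above is a simple random sample, so the balance we have just observed is not guaranteed. If we wanted to enforce it, caret's createDataPartition() performs a stratified split on the class column. A minimal sketch, not part of the weekend code (train_strat and test_strat are just illustrative names):

# alternative: stratified 70/30 split that preserves the Yes / No proportions
set.seed(135)
in_train <- createDataPartition(reduced_df$class, p = 0.7, list = FALSE)[, 1]
train_strat <- reduced_df[in_train, ]
test_strat  <- reduced_df[-in_train, ]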

Training a predictive model

I’ve amended the code which we used over the weekend, but you can still find the original here. In that we tested three models – Support Vector Machine (SVM), k-Nearest Neighbour (k-NN) and C5.0 decision tree. In the following we jump straight in to using SVM (which proved to be the best performing when we tried the three approaches earlier).

Some of the following code is inspired by / borrowed from the following two articles on DataAspirant and RevolutionaryAnalytics, both of which give excellent guidance on building and tuning Support Vector Machine classifiers.

# set controls for training the model: 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(135)

# train the model
svm_Radial <- train(class ~ ., data = train, method = "svmRadial",
    trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)

#show the results
svm_Radial

This gives the following output (abbreviated): "Tuning parameter ‘sigma’ was held constant at a value of 0.3517445. Accuracy was used to select the optimal model using the largest value. The final values used for the model were sigma = 0.3517445 and C = 4." The overall accuracy on the training data with those values was 82.03%.
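
If you prefer to pull those values from the fitted object rather than read them off the printed summary, caret keeps the winning parameters in bestTune and the full resampling results in results:

# best sigma / C chosen during training, and the corresponding training accuracy
svm_Radial$bestTune
max(svm_Radial$results$Accuracy)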

How will it perform on unseen test data?

Testing the model on unseen data

# apply the model to the unseen test data and examine the results
test_pred <- predict(svm_Radial, newdata = test)
test_pred
confusionMatrix(test_pred, test$class)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  22   7
       Yes  4  29

Accuracy : 0.8226
95% CI : (0.7047, 0.908)
No Information Rate : 0.5806
P-Value [Acc > NIR] : 4.543e-05

Kappa : 0.6414

We can see that when applying our model to the unseen test data we achieve 82.26% accuracy. We can say with 95% confidence that the accuracy lies in the range 70.47% to 90.8%. The confusion matrix shows where it got yeses and noes correct and where it confused the two.

We can also say that, since this accuracy is consistent with the earlier training accuracy, the model has not overfitted to the training data.
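
As a side note, the figures quoted above can be extracted programmatically: the object returned by confusionMatrix() stores them in its overall element.

# keep the confusion matrix object and pull out accuracy, its 95% CI and Kappa
cm <- confusionMatrix(test_pred, test$class)
cm$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper", "Kappa")]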

Our last piece of work for now is to try to tune the model further by adjusting the two parameters which svmRadial supports: sigma and C, the cost parameter. The articles above give some background to this.


# tune the model
grid <- expand.grid(
    sigma = c(.1, 0.2, 0.25, 0.3), 
    C = c(0, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,4, 5))

set.seed(135)
svm_Radial_Grid <- train(class ~., data = train, method = "svmRadial",
    trControl=trctrl,
    preProcess = c("center", "scale"),
    tuneGrid = grid,
    tuneLength = 10)

#show the results
svm_Radial_Grid
# Output (excerpt): columns are sigma, C, Accuracy, Kappa
# 0.25 1.25 0.8182540 0.6204469
# 0.25 1.50 0.8274603 0.6419664
# 0.25 1.75 0.8298413 0.6473446
# 0.25 2.00 0.8250794 0.6367860
plot(svm_Radial_Grid)

This outputs “Accuracy was used to select the optimal model using the largest value. The final values used for the model were sigma = 0.25 and C = 1.75.”

Results of tuning svmRadial parameters

One last test

Finally, we test our optimised model on unseen data.

#Now test the tuned model on test data 
test_svm_Radial_Grid <- predict(svm_Radial_Grid, newdata = test) 
confusionMatrix(test_svm_Radial_Grid, test$class )

This gives us:

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  21   7
       Yes  5  29

Accuracy : 0.8065
95% CI : (0.6863, 0.8958)
No Information Rate : 0.5806
P-Value [Acc > NIR] : 0.0001457

Kappa : 0.6068

It appears that our optimised model performs marginally less well on unseen data than the non-optimised one, so there is clearly work to do.
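
One place to start would be to compare the two fitted models across their cross-validation resamples; caret's resamples() supports this when the models share the same resampling indices (here both were trained with the same seed and trainControl). A minimal sketch, not something we ran at the weekend:

# compare the resampled training performance of the two models
model_comparison <- resamples(list(base = svm_Radial, tuned = svm_Radial_Grid))
summary(model_comparison)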

Summary

We have shown how we can build a model to predict, with a certain level of accuracy, whether a person will perceive a can or bottle label as alcoholic or not, based on the colourfulness, use of cartoon imagery, lightness of fonts and the informativeness of the label.

Future Works

There is much that we could do to make this more accurate.

  1. Ensure that we have a more robust mechanism for capturing the data about the factors such as colourfulness of the label. This could be based on computer vision, allowing us to remove human bias – or to have cross-validation of scoring by a panel of independent users.
  2. If we are trying to test whether a teenager would perceive a label as alcoholic or not, we should recruit some teenagers to flag the labels as one or the other.
  3. We could develop a better model given more time – and based on better training data. We could also better model the guidance of the Portman Group.
  4. We could create a bot to scour the internet for new can designs and flag up where these might be perceived as non-alcoholic when they are in fact the opposite.
  5. Make it easier for human users to make reports of potentially transgressing designs to the appropriate authorities.