Click Through Rate prediction using neural network in Vowpal Wabbit: A to Z

Click-through rate (CTR) is an accepted metric for judging the success of an online advertising campaign. As huge sums of money are spent on online advertisements, advertisers want to learn which advertisements are likely to be successful and which are not. A number of machine learning techniques are used in the process.

Vowpal Wabbit is a very fast machine learning system. It bundles a number of machine learning algorithms with very high predictive accuracy. In many Kaggle competitions where the data is complex or sizeable, or where the number of features is large, Vowpal Wabbit is used by some of the participants. However, unlike the R or python machine learning environments, Vowpal Wabbit does not, as yet, have any data exploration capability, though a utility wrapper does exist to give one some idea of the data. One must already know one's data before beginning to apply a machine learning algorithm in Vowpal Wabbit.

Kaggle recently hosted a competition for predicting the click-through rate of online advertisements. The competition was run on behalf of Avazu, which placed 11 days of its click-through data on the site. Ten days of this data constitute 'train.csv' and one day's data is in 'test.csv'. My score after model building was 0.3976030 (the competition metric is logarithmic loss, so lower is better). Plenty of scope for model improvement exists.

The file 'train.csv' is around 5.9 GB and 'test.csv' is around 674 MB. The fields in 'train.csv' are as below:


All variables except 'hour' are categorical variables

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH
C1: anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21: anonymized categorical variables

test.csv has all the fields except 'click', which is not disclosed to us. The job is to train a machine learning model on 'train.csv' so as to predict 'click' for each online advertisement listed in 'test.csv'. We can count the total number of lines in both files and inspect the first few lines of each:


# count lines in train.csv and test.csv
$ wc -l train.csv
40428968 train.csv
$ wc -l test.csv
4577465 test.csv
# show first five lines in train.csv and in test.csv
$ head -5 train.csv
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
$ head -5 test.csv
id,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
10000174058809263569,14103100,1005,0,235ba823,f6ebf28e,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,69f45779,0eb711ec,1,0,8330,320,50,761,3,175,100075,23
10000182526920855428,14103100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8d44657,ecb851b2,1,0,22676,320,50,2616,0,35,100083,51
10000554139829213984,14103100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,10fb085b,1f0bc64f,1,0,22676,320,50,2616,0,35,100083,51
10001094637809798845,14103100,1005,0,85f751fd,c4e18dd6,50e219e0,51cedd4e,aefc06bd,0f2161f8,a99f214a,422d257a,542422a7,1,0,18648,320,50,1092,3,809,100156,61

To observe the structure of the training file, we read the data in R. Reading data in R means loading the whole 5.9 GB file into RAM. The operating system, however, reports (check with the command: $ cat /proc/meminfo) that the RAM actually occupied is much more than 5.9 GB. With 8 GB of total RAM in my machine it is not possible to do any modelling in R (in fact, even 16 GB of RAM becomes insufficient). Hence, Vowpal Wabbit. Installation instructions on CentOS are here and general instructions here. After reading the data in R, observe its structure. It is as below:


# We read 'id' as character, 'hour' as numeric, and the rest as 'factor'
> data <- read.csv("train.csv", header=TRUE, colClasses=c('character','factor','numeric', rep('factor',21)))
>str(data)
'data.frame':    40428967 obs. of  24 variables:
$ id              : chr  "1000009418151094273" "10000169349117863715" "10000371904215119486"  "10000640724480838376" ...
$ click           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
$ hour            : num  14102100 14102100 14102100 14102100 14102100 ...
$ C1              : Factor w/ 7 levels "1001","1002",..: 3 3 3 3 3 3 3 3 3 2 ...
$ banner_pos      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
$ site_id         : Factor w/ 4737 levels "000aa1a4","00255fb4",..: 583 583 583 583 4696 3949 2679 4159 583 2475 ...
$ site_domain     : Factor w/ 7745 levels "000129ff","0035f25a",..: 7340 7340 7340 7340 4457 5718 1194 3894 7340 6001 ...
$ site_category   : Factor w/ 26 levels "0569f928","110ab22d",..: 3 3 3 3 1 25 25 25 3 7 ...
$ app_id          : Factor w/ 8552 levels "000d6291","000f21f1",..: 7885 7885 7885 7885 7885 7885 7885 7885 7885 7885 ...
$ app_domain      : Factor w/ 559 levels "001b87ae","002e4064",..: 255 255 255 255 255 255 255 255 255 255 ...
$ app_category    : Factor w/ 36 levels "07d7df22","09481d60",..: 1 1 1 1 1 1 1 1 1 1 ...
$ device_id       : Factor w/ 2686408 levels "00000414","00000715",..: 1780273 1780273 1780273 1780273 1780273 1780273 1780273 1780273 1780273 2049924 ...
$ device_ip       : Factor w/ 6729486 levels "0000016d","00000262",..: 5831891 3958536 4728526 6103614 3952384 134645 4690993 6072203 1470150 6353522 ...
$ device_model    : Factor w/ 8251 levels "00097428","0009f4d7",..: 2188 3613 4461 3164 3836 4461 6079 6080 2997 1751 ...
$ device_type     : Factor w/ 5 levels "0","1","2","4",..: 2 2 2 2 2 2 2 2 2 1 ...
$ device_conn_type: Factor w/ 4 levels "0","2","3","5": 2 1 1 1 1 1 1 1 2 1 ...
$ C14             : Factor w/ 2626 levels "10289","1037",..: 188 186 186 188 493 239 642 680 189 920 ...
$ C15             : Factor w/ 8 levels "1024","120","216",..: 5 5 5 5 5 5 5 5 5 5 ...
$ C16             : Factor w/ 9 levels "1024","20","250",..: 7 7 7 7 7 7 7 7 7 7 ...
$ C17             : Factor w/ 435 levels "1008","1042",..: 31 31 31 31 84 51 126 136 31 187 ...
$ C18             : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 4 1 4 ...
$ C19             : Factor w/ 68 levels "1059","1063",..: 33 33 33 33 33 43 35 35 33 15 ...
$ C20             : Factor w/ 172 levels "-1","100000",..: 1 66 66 66 1 60 1 1 1 146 ...
$ C21             : Factor w/ 60 levels "1","100","101",..: 53 53 53 53 16 11 16 33 53 33 ...
>
>table(data$click)

0        1
33563901  6865066
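
Incidentally, the same class counts can be obtained without R's memory cost by streaming the csv once in Python. A minimal sketch (my own addition, not part of the original workflow):

import csv
from collections import Counter

# Tally the 'click' column (second field) in a single streaming pass,
# so the 5.9 GB file never needs to fit in RAM.
clicks = Counter()
with open("train.csv") as f:
    reader = csv.reader(f)
    next(reader)              # skip the header line
    for row in reader:
        clicks[row[1]] += 1
print(clicks)                 # expect roughly Counter({'0': 33563901, '1': 6865066})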

Note from the output of the 'table' command that clicks number about 20% of non-clicks (6865066 against 33563901, i.e. roughly 17% of all records are clicks). We have also taken a summary of the data attributes in R. It is as below:

> summary(data)
       id                click             hour              C1
 Length:40428967    0:33563901   Min.   :14102100   1001:    9463
 Class :character   1: 6865066   1st Qu.:14102304   1002: 2220812
 Mode  :character                Median :14102602   1005:37140632
                                 Mean   :14102558   1007:   35304
                                 3rd Qu.:14102814   1008:    5787
                                 Max.   :14103023   1010:  903457
                                                    1012:  113512

banner_pos       site_id           site_domain        site_category
0:29109590   85f751fd:14596137   c4e18dd6:15131739   50e219e0:16537234
1:11247282   1fbe01fe: 6486150   f3845767: 6486150   f028772b:12657073
2:   13001   e151e245: 2637747   7e091613: 3325008   28905ebd: 7377208
3:    2035   d9750ee7:  963745   7687a86e: 1290165   3e814130: 3050306
4:    7704   5b08c53b:  913325   98572c79:  996816   f66779e6:  252451
5:    5778   5b4d2eda:  771360   16a36ef3:  855686   75fa27f6:  160985
7:   43577   (Other) :14060503   (Other) :12343403   (Other) :  393710

app_id               app_domain         app_category
ecad2386:25832830   7801e8d9:27237087   07d7df22:26165592
92f5800b: 1555283   2347f47a: 5240885   0f2161f8: 9561058
e2fcccd2: 1129016   ae637522: 1881838   cef3e649: 1731545
febd1138:  759098   5c5a694b: 1129228   8ded1f7a: 1467257
9c13b419:  757812   82e27996:  759125   f95efa07: 1141673
7358e05e:  615635   d9b5648e:  713924   d1327cf5:  123233
(Other) : 9779293   (Other) : 3466880   (Other) :  238609

device_id           device_ip          device_model      device_type
a99f214a:33358308   6b9769f2:  208701   8a4875bd: 2455470   0: 2220812
0f7c61dc:   21356   431b3174:  135322   1f0bc64f: 1424546   1:37304667
c357dbff:   19667   2f323f36:   88499   d787e91b: 1405169   2:      31
936e92fb:   13712   af9205f9:   87844   76dc4769:  767961   4:  774272
afeffc18:    9654   930ec31d:   86996   be6db1d7:  742913   5:  129185
987552d1:    4187   af62faf4:   85802   a0f5f879:  652751
(Other) : 7002083   (Other) :39735803   (Other) :32980157

 device_conn_type      C14                C15                C16
 0:34886838       4687   :  948215   320    :37708959   50     :38136554
 2: 3317443       21611  :  907004   300    : 2337294   250    : 1806334
 3: 2181796       21189  :  765968   216    :  298794   36     :  298794
 5:   42890       21191  :  765092   728    :   74533   480    :  103365
                  19771  :  730238   120    :    3069   90     :   74533
                  19772  :  729305   1024   :    2560   20     :    3069
                  (Other):35583145   (Other):    3758   (Other):    6318

   C17               C18               C19                C20
 1722   : 4513492   0:16939044   35     :12170630   -1     :18937918
 2424   : 1531071   1: 2719623   39     : 8829426   100084 : 2438478
 2227   : 1473105   2: 7116058   167    : 3145695   100148 : 1794890
 1800   : 1190161   3:13654242   161    : 1587765   100111 : 1716733
 423    :  948215                47     : 1451708   100077 : 1575495
 2480   :  918663                1327   : 1092601   100075 : 1546414
 (Other):29854260                (Other):12151142   (Other):12419039

C21
23     : 8896205
221    : 5051245
79     : 4614799
48     : 2160794
71     : 2108496
61     : 2053636
(Other):15543792

From the data structure and its summary, it can be seen that certain categorical variables, such as device_id and device_ip, have far too many levels to be really meaningful as factors. Some others, such as device_model and site_id, also have a relatively large number of levels. In this blog I have not taken this into account, but in a closer analysis it may be worthwhile to examine whether some levels can be clubbed together, or whether such a feature should be ignored altogether as being merely another id.
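
The same streaming idea can be used to measure cardinality before deciding what to do with such fields. A minimal Python sketch (my own addition) that counts the distinct levels of every field in one pass:

import csv

# One pass over train.csv, keeping a set of distinct values per column.
# 'id' is skipped: it is unique per row and would dominate memory; the
# other high-cardinality fields (device_ip, device_id) still cost a
# gigabyte or so of RAM for their sets.
with open("train.csv") as f:
    reader = csv.reader(f)
    header = next(reader)
    levels = {name: set() for name in header if name != "id"}
    for row in reader:
        for name, value in zip(header, row):
            if name != "id":
                levels[name].add(value)

for name in header:
    if name != "id":
        print(name, len(levels[name]))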

Vowpal Wabbit requires input data to be in a certain format; it is not the csv format. Its formatting instructions are here. You may further benefit from the detailed explanations regarding the input format at this Stack Overflow link, and from the clarifications regarding the difference between 'namespace' and 'feature' at this link.
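
In brief, each example occupies a single line of the general shape below (my condensed summary of the format documentation; the importance weight and the per-feature ':value' are optional):

label [importance] |namespace1 feature1 feature2 ... |namespace2 feature3[:value] ...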

We will format the input as below. While formatting, the 'id' field is ignored as being of no importance. Note also that the value of click in train.csv is either 0 or 1.


==> train.csv <==
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157
10000720757801103869,0,14102100,1005,0,d6137915,bb1ef334,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,05241af0,8a4875bd,1,0,16920,320,50,1899,0,431,100077,117
10000724729988544911,0,14102100,1005,0,8fda644b,25d4cfcd,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,b264c159,be6db1d7,1,0,20362,320,50,2333,0,39,-1,157
10000918755742328737,0,14102100,1005,1,e151e245,7e091613,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,e6f67278,be74e6fe,1,0,20632,320,50,2374,3,39,-1,23
10000949271186029916,1,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,37e8da74,5db079b5,1,2,15707,320,50,1722,0,35,-1,79

==> train.vw <==
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_ddd2926e device_model_44956a24 device_type_1 device_conn_type_2 |others c14_15706 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_-1 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_96809ac8 device_model_711ee120 device_type_1 device_conn_type_0 |others c14_15704 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_b3cf8def device_model_8a4875bd device_type_1 device_conn_type_0 |others c14_15704 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_e8275b8f device_model_6332421a device_type_1 device_conn_type_0 |others c14_15706 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_1 |site site_id_fe8cc448 site_domain_9166c161 site_category_0569f928 |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_9644d0bf device_model_779d90c2 device_type_1 device_conn_type_0 |others c14_18993 c15_320 c16_50 c17_2161 c18_0 c19_35 c20_-1 c21_157
-1 |fe c1_1005 banner_pos_0 |site site_id_d6137915 site_domain_bb1ef334 site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_05241af0 device_model_8a4875bd device_type_1 device_conn_type_0 |others c14_16920 c15_320 c16_50 c17_1899 c18_0 c19_431 c20_100077 c21_117
-1 |fe c1_1005 banner_pos_0 |site site_id_8fda644b site_domain_25d4cfcd site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_b264c159 device_model_be6db1d7 device_type_1 device_conn_type_0 |others c14_20362 c15_320 c16_50 c17_2333 c18_0 c19_39 c20_-1 c21_157
-1 |fe c1_1005 banner_pos_1 |site site_id_e151e245 site_domain_7e091613 site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_e6f67278 device_model_be74e6fe device_type_1 device_conn_type_0 |others c14_20632 c15_320 c16_50 c17_2374 c18_3 c19_39 c20_-1 c21_23
1 2 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_37e8da74 device_model_5db079b5 device_type_1 device_conn_type_2 |others c14_15707 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_-1 c21_79
-1 |fe c1_1002 banner_pos_0 |site site_id_84c7ba46 site_domain_c4e18dd6 site_category_50e219e0 |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_c357dbff device_ip_f1ac7184 device_model_373ecbe6 device_type_0 device_conn_type_0 |others c14_21689 c15_320 c16_50 c17_2496 c18_3 c19_167 c20_100191 c21_23

What we have done above is this: we created five namespaces: fe, site, app, device and others. The first two predictor fields (C1 and banner_pos) are bracketed within the 'fe' namespace ('fe' is an arbitrary name), the site-related fields within the 'site' namespace, and so on. Fields whose meaning is unclear (their names being anonymized) are under the 'others' namespace.


Namespace       Fields (prefix to value) in namespace

|fe             c1_             banner_pos_
|site           site_id_        site_domain_    site_category_
|app            app_id_         app_domain_     app_category_
|device         device_id_      device_ip_      device_model_   device_type_   device_conn_type_
|others         c14_            c15_            c16_            c17_           c18_   c19_   c20_   c21_

A namespace declaration starts with '|'. In options such as -q, a namespace is referred to by its first letter rather than its complete name; thus the identifier for the 'site' namespace is 's', not 'site'. Before the first namespace ('|fe') comes the value of the class label (i.e. 'click'): it is -1 if the click is 0, and it stays 1 if the click is 1. As clicks (1s) are few, we have attached an 'Importance' weight of 2 to each click (see the ninth line of the train.vw extract above, which begins '1 2'; the two label forms are summarized below). Later in our analysis we will vary the Importance and observe the effect.
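
The two label forms used in this post, side by side:

-1 |fe ...      (click = 0: label -1, default importance of 1)
1 2 |fe ...     (click = 1: label 1 with an importance weight of 2)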

Conversion from csv to the Vowpal Wabbit format is easy and can be carried out using either 'awk' or python. The awk code is below. The header line is skipped (NR > 1), as is the first field (id, i.e. $1):

#! /bin/awk -f
# Call it as: ./train.awk train.csv > train.vw
# Check a few sample lines in the VW validator: http://hunch.net/~vw/validate.html
# Fields: $1 id (ignored), $2 click, $3 hour (ignored), $4-$24 predictors.

BEGIN { FS = "," }
NR > 1 {
    # click of 0 becomes label -1; click of 1 becomes label 1 with importance 2
    label = ($2 == 0) ? "-1" : "1 2"
    printf "%s |fe c1_%s banner_pos_%s", label, $4, $5
    printf " |site site_id_%s site_domain_%s site_category_%s", $6, $7, $8
    printf " |app app_id_%s app_domain_%s app_category_%s", $9, $10, $11
    printf " |device device_id_%s device_ip_%s device_model_%s device_type_%s device_conn_type_%s", $12, $13, $14, $15, $16
    printf " |others c14_%s c15_%s c16_%s c17_%s c18_%s c19_%s c20_%s c21_%s\n", $17, $18, $19, $20, $21, $22, $23, $24
}

You can print the first few lines of the 'train.vw' file with: $ head --lines 5 train.vw. The python conversion code is equally simple:

## Use it as: python convert.py > train.vw
import csv

with open("train.csv", "r") as trainfile:
    csv_reader = csv.reader(trainfile)
    next(csv_reader)            # skip the header line
    for row in csv_reader:
        # click of 0 becomes label -1; click of 1 becomes label 1 with importance 2
        if row[1] == "0":
            vw_line = "-1 |fe"
        else:
            vw_line = "1 2 |fe"
        # split the YYMMDDHH field into separate time features
        dtime_numb = row[2]
        year  = dtime_numb[0:2]
        month = dtime_numb[2:4]
        day   = dtime_numb[4:6]
        hour  = dtime_numb[6:8]
        vw_line += " year:" + year + " month:" + month + " day:" + day + " hour:" + hour
        vw_line += " |pos"
        vw_line += " c1_" + row[3]
        vw_line += " banner_pos_" + row[4]
        vw_line += " |site"
        vw_line += " site_id_" + row[5]
        vw_line += " site_domain_" + row[6]
        vw_line += " site_category_" + row[7]
        vw_line += " |app"
        vw_line += " app_id_" + row[8]
        vw_line += " app_domain_" + row[9]
        vw_line += " app_category_" + row[10]
        vw_line += " |device"
        vw_line += " device_id_" + row[11]
        vw_line += " device_ip_" + row[12]
        vw_line += " device_model_" + row[13]
        vw_line += " device_type_" + row[14]
        vw_line += " device_conn_type_" + row[15]
        vw_line += " |others"
        vw_line += " c14_" + row[16]
        vw_line += " c15_" + row[17]
        vw_line += " c16_" + row[18]
        vw_line += " c17_" + row[19]
        vw_line += " c18_" + row[20]
        vw_line += " c19_" + row[21]
        vw_line += " c20_" + row[22]
        vw_line += " c21_" + row[23]
        print(vw_line)

You may have noted that in the python code I have also included the 'hour' field, broken into four pieces (and, unlike the awk script, the python version keeps the time features in the 'fe' namespace and moves c1 and banner_pos into a separate 'pos' namespace). However, we will not use the 'hour' field in our learning machine. It is also a good idea to check beforehand that the input format meets vw's requirements: you can paste a few lines into the vw validator here, and a rough local spot-check is sketched below. The size of the train.vw file is around 12.5 GB, i.e. more than double that of 'train.csv'.
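
A small Python sketch for that local spot-check (my own addition; it assumes the awk-produced train.vw with its five namespaces):

import re

# Each line should start with '-1' or '1 2' and contain the five
# expected namespace markers.
label_re = re.compile(r"^(-1|1 2) \|")
namespaces = ("|fe", "|site", "|app", "|device", "|others")

with open("train.vw") as f:
    for lineno, line in enumerate(f, start=1):
        if not (label_re.match(line) and all(ns in line for ns in namespaces)):
            print("suspect line", lineno, ":", line.rstrip())
        if lineno >= 1000:        # spot-check the first thousand lines only
            break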

Now that our vw file is ready, we can feed it to VW. The command (the first line below) and its output are as follows:

$ vw -d train.vw --cache_file  neural --inpass --passes 5 -q sd  -q ad  -q do -q fd --binary -f neural_model  --loss_function=logistic  --nn 3

creating quadratic features for pairs: sd ad do fd
final_regressor = neural_model
using input passthrough for neural network training
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = neural
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000000   0.000000          1      1.0    -1.0000  -1.0000      102
0.000000   0.000000          2      2.0    -1.0000  -1.0000      102
0.000000   0.000000          4      4.0    -1.0000  -1.0000      102
0.000000   0.000000          8      8.0    -1.0000  -1.0000      102
0.062500   0.125000         16     16.0    -1.0000  -1.0000      102
0.125000   0.187500         32     32.0    -1.0000  -1.0000      102
0.234375   0.343750         64     64.0    -1.0000  -1.0000      102
0.234375   0.234375        128    128.0    -1.0000  -1.0000      102
0.179688   0.125000        256    256.0    -1.0000  -1.0000      102
0.179688   0.179688        512    512.0    -1.0000  -1.0000      102
0.177734   0.175781       1024   1024.0    -1.0000  -1.0000      102
0.176270   0.174805       2048   2048.0    -1.0000  -1.0000      102
0.183594   0.190918       4096   4096.0    -1.0000  -1.0000      102
0.180298   0.177002       8192   8192.0    -1.0000  -1.0000      102
0.174622   0.168945      16384  16384.0    -1.0000  -1.0000      102
0.177582   0.180542      32768  32768.0    -1.0000  -1.0000      102
0.175446   0.173309      65536  65536.0    -1.0000  -1.0000      102
0.173775   0.172104     131072 131072.0    -1.0000  -1.0000      102
0.170753   0.167732     262144 262144.0    -1.0000  -1.0000      102
0.163738   0.156723     524288 524288.0    -1.0000  -1.0000      102
0.157854   0.151970    1048576 1048576.0    -1.0000  -1.0000      102
0.161858   0.165862    2097152 2097152.0    -1.0000  -1.0000      102
0.172266   0.182673    4194304 4194304.0     1.0000  -1.0000      102
0.161030   0.149794    8388608 8388608.0    -1.0000  -1.0000      102
0.168042   0.175053   16777216 16777216.0    -1.0000  -1.0000      102
0.164841   0.161640   33554432 33554432.0    -1.0000  -1.0000      102
0.165551   0.165551   67108864 67108864.0    -1.0000  -1.0000      102 h
0.165849   0.166147   134217728 134217728.0    -1.0000  -1.0000      102 h

finished run
number of examples per pass = 36386071
passes used = 5
weighted example sum = 1.8193e+08
weighted label sum = -1.20141e+08
average loss = 0.164397 h
best constant = -1.58694
best constant's loss = 0.455592
total feature number = 18556896210
[ashokharnal@master clickthroughrate]$

An explanation of the arguments to the 'vw' command follows.

While processing the text file train.vw, vw first converts it into a special binary format (a cache file); here that file is named 'neural' (--cache_file neural). The next time you run the command, vw will read this cache rather than the text file. The contents of the cache file are argument dependent; if you run vw with different arguments, a new cache file may be created.

The number of '--passes' is 5. '--binary' is for binary classification, and the trained model is stored in the file given by '-f neural_model'. The loss function for convergence is '--loss_function=logistic'. Why did I select the 'logistic' loss function? The default loss function is 'squared', and in many instances the squared loss leads to slower learning. An excellent and simple explanation of loss functions appropriate for neural networks is given by Michael Nielsen in his html book here.

'--nn 3' specifies a neural network with one hidden layer of 3 neurons, and '--inpass' adds a further direct connection between the input and output layers. (The 'h' printed against the average loss in the output above indicates that, with multiple passes, the loss is being measured on a held-out subset of the examples.)

The arguments '-q sd -q ad -q do -q fd' create interaction variables. An interaction variable is created from two variables 'A' and 'B' by multiplying their values; it is a general technique and you may read more about it in Wikipedia. '-q sd' means that all possible interaction variables are created between the namespace beginning with 's' (i.e. site) and the namespace beginning with 'd' (i.e. device); a similar explanation holds for the other three interactions, '-q ad', '-q do' and '-q fd'. A sketch of what this crossing means follows.
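
To make the effect of '-q' concrete, here is a minimal Python sketch (my own illustration; VW actually hashes the crossed pairs rather than materializing feature strings) of the cross features that '-q sd' would generate for the first example of train.vw:

from itertools import product

# Features of one example, as they appear in the 's' (site) and
# 'd' (device) namespaces of train.vw.
site_features   = ["site_id_1fbe01fe", "site_domain_f3845767", "site_category_28905ebd"]
device_features = ["device_id_a99f214a", "device_ip_ddd2926e", "device_model_44956a24",
                   "device_type_1", "device_conn_type_2"]

# '-q sd' pairs every site feature with every device feature:
# 3 x 5 = 15 new interaction features for this example.
for s, d in product(site_features, device_features):
    print(s + "^" + d)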

In our model building we have not used regularization. Regularization helps to avoid overfitting. For example, a non-linear model may become so non-linear as to thread through every point of a class (including noise); such a curve has so many twists and turns that when the model is asked for the class of a new, unseen point, the result may be misleading. Model building, therefore, attempts to penalize excessive twists and turns, hence 'regularization'. For further exploration it is worthwhile trying L1 (--l1) and L2 (--l2) regularization; say, to start with, --l1 0.0005 and --l2 0.00004. About regularization in neural networks you may like to read this work here.
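
As a concrete starting point, the earlier training command with both regularizers added would look like this (my own untested variant; the values are just the starting guesses suggested above):

$ vw -d train.vw --cache_file neural --inpass --passes 5 -q sd -q ad -q do -q fd --binary -f neural_model --loss_function=logistic --nn 3 --l1 0.0005 --l2 0.00004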

The average loss of 0.164397 is an indicator of model fit (see the 'average loss' line near the end of the output above). A direct, relative comparison of the performance of various models can be made with this measure rather than uploading predictions to Kaggle each time.

The above run takes around 2 hours on an 8 GB machine and occupies at most about 1.5 GB of RAM. It is thus both fast and economical with resources.

It is now time to predict clicks for test.csv. This file first needs to be converted to vw format. We add a click field to it, but with a uniform value of 1 in all records; this field is ignored while making predictions. The awk conversion code is below. The first field of 'test.csv' is the 'id' field and is again ignored:

#! /bin/awk -f
# Call it as: ./test.awk test.csv > test.vw
# Fields: $1 id (ignored), $2 hour (ignored), $3-$23 predictors.

BEGIN { FS = "," }
NR > 1 {
    # test.csv has no click field; emit a dummy label of 1 (ignored with -t)
    printf "1 |fe c1_%s banner_pos_%s", $3, $4
    printf " |site site_id_%s site_domain_%s site_category_%s", $5, $6, $7
    printf " |app app_id_%s app_domain_%s app_category_%s", $8, $9, $10
    printf " |device device_id_%s device_ip_%s device_model_%s device_type_%s device_conn_type_%s", $11, $12, $13, $14, $15
    printf " |others c14_%s c15_%s c16_%s c17_%s c18_%s c19_%s c20_%s c21_%s\n", $16, $17, $18, $19, $20, $21, $22, $23
}

Next, we use the model prepared earlier to make predictions on test.vw. The vw command is:

$vw -d test.vw -t -i neural_model --link=logistic -p probabilities.txt

The argument '-t' indicates that we are feeding a test file whose class field is to be ignored. '-i' specifies the model file to load. We use '--link=logistic' to get probabilities; use '--link=glf1' instead to get output in the range [-1, 1]. The output file is 'probabilities.txt'.
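
For reference, the two link functions are just fixed transforms of VW's raw score; a small sketch of the arithmetic (my own illustration):

import math

def logistic_link(raw):
    # '--link=logistic' maps the raw score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-raw))

def glf1_link(raw):
    # '--link=glf1' maps the same raw score into (-1, 1)
    return 2.0 / (1.0 + math.exp(-raw)) - 1.0

print(logistic_link(0.0))   # 0.5: a raw score of 0 is maximally uncertain
print(glf1_link(0.0))       # 0.0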

Kaggle requires that we submit results in the format 'id,probability'. A sample submission file on the site has all the ids (the same as in test.csv). We read this file in R and overwrite its second column with our predicted probabilities. The R code for this is below:

# read the predicted probabilities (one per line, in test.vw order)
p <- read.table("probabilities.txt", header=FALSE)
# the sample submission has all the ids, in the same order as test.csv
samplesubmission <- read.csv("sampleSubmission.csv", header=TRUE, colClasses=c('character','numeric'))
# overwrite the second ('click') column with our predictions
samplesubmission[,2] <- p[,1]
head(samplesubmission)
write.csv(samplesubmission, "neural_result.csv", quote=FALSE, row.names=FALSE)

The file neural_result.csv (either as is or zipped) can be submitted to Kaggle. Even though the competition is over, one still gets a score. For this model the score was 0.3976030.

[Screenshot: post-deadline Kaggle score of 0.3976030]

Let us now once again review our model options. We selected the neural network option with just three hidden neurons; this choice, it is believed, gives optimum results. But, it appears, we could almost have done without the neural network, using just the following learning model:

$vw -d train.vw --cache_file  neural  --passes 5 -q sd  -q ad  -q do -q fd --binary -f logistic_model  --loss_function=logistic

With the above we get a Kaggle score of 0.3994788. Thus the '--nn 3' option did make an improvement, though not a dramatic one. We did not test the '--nn' option with larger numbers of neurons. Increasing the number of '--passes' from 5 to 10 did not affect the results; only 8 passes were actually used. I may mention that VW's default learning algorithm is online gradient descent.

We initially used an 'Importance' of 2 for clicks (label '1') while converting 'train.csv' to 'train.vw'; see the line of the train.vw extract above that begins with '1 2'. We did this so that the learner would treat clicks as important events and not merely as noise, such events being few.
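
Varying the Importance does not require re-running the converter. A small Python helper (my own addition; it assumes the awk-produced train.vw, where click lines begin with '1 2 ') rewrites the weight in one pass:

# Rewrite the importance weight on click lines, here from 2 to 3.
with open("train.vw") as src, open("train_imp3.vw", "w") as dst:
    for line in src:
        if line.startswith("1 2 "):
            line = "1 3 " + line[4:]
        dst.write(line)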

We raised the 'Importance' to 3; the score degraded. We then changed the Importance to just 1, i.e. treated click events on a par with non-click events. The Kaggle score was still 0.3976030. This suggests that clicks at about 20% of non-clicks were already sufficient for VW to make good enough predictions without up-weighting. This finishes our experiments with Vowpal Wabbit on CTR prediction.


4 Responses to “Click Through Rate prediction using neural network in Vowpal Wabbit: A to Z”

  1. Ron Says:

    Your awk script is outputting an importance of 1 for click training lines, while your python script is outputting an importance of 2 for click training lines.

    I read in your article that you settled on importance of 2 for clicks, for best results. It may be a good idea to adjust your awk script to reflect that.

    Great work, BTW!

  2. Aaron Says:

    Great post. Just curious, what resources did you use to familiarize yourself with VW's nnet training algorithm? For example, I had no idea that there was an --inpass argument. Tips and tricks seem kinda scattered all over the internet.
