Click-through rate (CTR) is a widely accepted metric for judging the success of an online advertising campaign. As huge sums of money are spent on online advertisements, advertisers want to learn which advertisements are likely to succeed and which are not. A number of machine learning techniques are used in the process.
Vowpal Wabbit is a very fast machine learning system. It bundles a number of machine learning algorithms with very high predictive accuracy. In many Kaggle competitions where the data is complex or sizeable, or the number of features is large, Vowpal Wabbit is used by some of the participants. However, unlike R or Python machine learning environments, Vowpal Wabbit does not, as yet, have any data exploration capability, though a utility wrapper does exist to give one some idea of the data. One must already know one's data before applying a machine learning algorithm with Vowpal Wabbit. Kaggle recently hosted a competition for predicting the click-through rate of online advertisements. The competition was on behalf of Avazu, which placed 11 days of its click-through data on the site. Ten days of this data constitute 'train.csv' and one day's data is in the file 'test.csv'. My score after going through model building was 0.3976030. Plenty of scope for model improvement remains.
The file 'train.csv' is around 5.9 GB and 'test.csv' is around 674 MB. The fields in 'train.csv' are as below:
All variables except 'hour' are categorical variables:

- id: ad identifier
- click: 0/1 for non-click/click
- hour: format is YYMMDDHH
- C1: anonymized categorical variable
- banner_pos
- site_id, site_domain, site_category
- app_id, app_domain, app_category
- device_id, device_ip, device_model, device_type, device_conn_type
- C14-C21: anonymized categorical variables
'test.csv' has all fields except the 'click' field, which is not disclosed to us. The job is to train a machine learning model on 'train.csv' so as to predict 'click' for each online advertisement listed in 'test.csv'. We can find the total number of lines in both files and observe the first few lines as:
```
# count lines in train.csv and test.csv
$ wc -l train.csv
40428968 train.csv
$ wc -l test.csv
4577465 test.csv

# show first five lines of train.csv and of test.csv
$ head -5 train.csv
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
$ head -5 test.csv
id,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
10000174058809263569,14103100,1005,0,235ba823,f6ebf28e,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,69f45779,0eb711ec,1,0,8330,320,50,761,3,175,100075,23
10000182526920855428,14103100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8d44657,ecb851b2,1,0,22676,320,50,2616,0,35,100083,51
10000554139829213984,14103100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,10fb085b,1f0bc64f,1,0,22676,320,50,2616,0,35,100083,51
10001094637809798845,14103100,1005,0,85f751fd,c4e18dd6,50e219e0,51cedd4e,aefc06bd,0f2161f8,a99f214a,422d257a,542422a7,1,0,18648,320,50,1092,3,809,100156,61
```
To observe the training file structure we read the data in R. Reading data in R implies loading the whole 5.9 GB file into RAM. The operating system, however, reports (check with the command 'cat /proc/meminfo') that the RAM actually occupied is much more than 5.9 GB. With 8 GB of total RAM in my machine it is not possible to do any modelling in R (in fact even 16 GB of RAM becomes insufficient). Hence, Vowpal Wabbit. Installation instructions for CentOS are here and general instructions here. After reading the data in R, observe its structure. It is as below:
```r
# We read the hour data as numeric and most of the rest as 'factor'
> data <- read.csv("train.csv", header=TRUE,
+                  colClasses=c('character','factor','numeric',rep('factor',21)))
> str(data)
'data.frame':   40428967 obs. of  24 variables:
 $ id              : chr  "1000009418151094273" "10000169349117863715" "10000371904215119486" "10000640724480838376" ...
 $ click           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ hour            : num  14102100 14102100 14102100 14102100 14102100 ...
 $ C1              : Factor w/ 7 levels "1001","1002",..: 3 3 3 3 3 3 3 3 3 2 ...
 $ banner_pos      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
 $ site_id         : Factor w/ 4737 levels "000aa1a4","00255fb4",..: 583 583 583 583 4696 3949 2679 4159 583 2475 ...
 $ site_domain     : Factor w/ 7745 levels "000129ff","0035f25a",..: 7340 7340 7340 7340 4457 5718 1194 3894 7340 6001 ...
 $ site_category   : Factor w/ 26 levels "0569f928","110ab22d",..: 3 3 3 3 1 25 25 25 3 7 ...
 $ app_id          : Factor w/ 8552 levels "000d6291","000f21f1",..: 7885 7885 7885 7885 7885 7885 7885 7885 7885 7885 ...
 $ app_domain      : Factor w/ 559 levels "001b87ae","002e4064",..: 255 255 255 255 255 255 255 255 255 255 ...
 $ app_category    : Factor w/ 36 levels "07d7df22","09481d60",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ device_id       : Factor w/ 2686408 levels "00000414","00000715",..: 1780273 1780273 1780273 1780273 1780273 1780273 1780273 1780273 1780273 2049924 ...
 $ device_ip       : Factor w/ 6729486 levels "0000016d","00000262",..: 5831891 3958536 4728526 6103614 3952384 134645 4690993 6072203 1470150 6353522 ...
 $ device_model    : Factor w/ 8251 levels "00097428","0009f4d7",..: 2188 3613 4461 3164 3836 4461 6079 6080 2997 1751 ...
 $ device_type     : Factor w/ 5 levels "0","1","2","4",..: 2 2 2 2 2 2 2 2 2 1 ...
 $ device_conn_type: Factor w/ 4 levels "0","2","3","5": 2 1 1 1 1 1 1 1 2 1 ...
 $ C14             : Factor w/ 2626 levels "10289","1037",..: 188 186 186 188 493 239 642 680 189 920 ...
 $ C15             : Factor w/ 8 levels "1024","120","216",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ C16             : Factor w/ 9 levels "1024","20","250",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ C17             : Factor w/ 435 levels "1008","1042",..: 31 31 31 31 84 51 126 136 31 187 ...
 $ C18             : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 4 1 4 ...
 $ C19             : Factor w/ 68 levels "1059","1063",..: 33 33 33 33 33 43 35 35 33 15 ...
 $ C20             : Factor w/ 172 levels "-1","100000",..: 1 66 66 66 1 60 1 1 1 146 ...
 $ C21             : Factor w/ 60 levels "1","100","101",..: 53 53 53 53 16 11 16 33 53 33 ...
>
> table(data$click)

       0        1 
33563901  6865066 
```
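The class balance implied by this table can be confirmed with quick arithmetic (the counts are copied from the table(data$click) output above):

```python
# Counts copied from the table(data$click) output above
clicks, non_clicks = 6865066, 33563901

print(round(clicks / non_clicks, 3))             # clicks vs non-clicks: 0.205
print(round(clicks / (clicks + non_clicks), 3))  # overall click rate: 0.17
```

So roughly one impression in six is a click, a fairly imbalanced but not extreme class distribution.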
Note from the output of the 'table' command above that clicks are about 20% of non-clicks. We have also taken a summary of the data attributes in R. It is as below:
```r
> summary(data)
      id                click            hour              C1          
 Length:40428967    0:33563901    Min.   :14102100   1001:    9463  
 Class :character   1: 6865066    1st Qu.:14102304   1002: 2220812  
 Mode  :character                 Median :14102602   1005:37140632  
                                  Mean   :14102558   1007:   35304  
                                  3rd Qu.:14102814   1008:    5787  
                                  Max.   :14103023   1010:  903457  
                                                     1012:  113512  
 banner_pos       site_id             site_domain         site_category     
 0:29109590   85f751fd:14596137   c4e18dd6:15131739   50e219e0:16537234  
 1:11247282   1fbe01fe: 6486150   f3845767: 6486150   f028772b:12657073  
 2:   13001   e151e245: 2637747   7e091613: 3325008   28905ebd: 7377208  
 3:    2035   d9750ee7:  963745   7687a86e: 1290165   3e814130: 3050306  
 4:    7704   5b08c53b:  913325   98572c79:  996816   f66779e6:  252451  
 5:    5778   5b4d2eda:  771360   16a36ef3:  855686   75fa27f6:  160985  
 7:   43577   (Other) :14060503   (Other) :12343403   (Other) :  393710  
      app_id             app_domain          app_category      
 ecad2386:25832830   7801e8d9:27237087   07d7df22:26165592  
 92f5800b: 1555283   2347f47a: 5240885   0f2161f8: 9561058  
 e2fcccd2: 1129016   ae637522: 1881838   cef3e649: 1731545  
 febd1138:  759098   5c5a694b: 1129228   8ded1f7a: 1467257  
 9c13b419:  757812   82e27996:  759125   f95efa07: 1141673  
 7358e05e:  615635   d9b5648e:  713924   d1327cf5:  123233  
 (Other) : 9779293   (Other) : 3466880   (Other) :  238609  
     device_id           device_ip          device_model      device_type  
 a99f214a:33358308   6b9769f2:  208701   8a4875bd: 2455470   0: 2220812  
 0f7c61dc:   21356   431b3174:  135322   1f0bc64f: 1424546   1:37304667  
 c357dbff:   19667   2f323f36:   88499   d787e91b: 1405169   2:      31  
 936e92fb:   13712   af9205f9:   87844   76dc4769:  767961   4:  774272  
 afeffc18:    9654   930ec31d:   86996   be6db1d7:  742913   5:  129185  
 987552d1:    4187   af62faf4:   85802   a0f5f879:  652751   
 (Other) : 7002083   (Other) :39735803   (Other) :32980157   
 device_conn_type      C14                C15                C16          
 0:34886838        4687   :  948215   320    :37708959   50     :38136554  
 2: 3317443        21611  :  907004   300    : 2337294   250    : 1806334  
 3: 2181796        21189  :  765968   216    :  298794   36     :  298794  
 5:   42890        21191  :  765092   728    :   74533   480    :  103365  
                   19771  :  730238   120    :    3069   90     :   74533  
                   19772  :  729305   1024   :    2560   20     :    3069  
                   (Other):35583145   (Other):    3758   (Other):    6318  
      C17           C18               C19                C20          
 1722   : 4513492   0:16939044   35     :12170630   -1     :18937918  
 2424   : 1531071   1: 2719623   39     : 8829426   100084 : 2438478  
 2227   : 1473105   2: 7116058   167    : 3145695   100148 : 1794890  
 1800   : 1190161   3:13654242   161    : 1587765   100111 : 1716733  
 423    :  948215                47     : 1451708   100077 : 1575495  
 2480   :  918663                1327   : 1092601   100075 : 1546414  
 (Other):29854260                (Other):12151142   (Other):12419039  
      C21          
 23     : 8896205  
 221    : 5051245  
 79     : 4614799  
 48     : 2160794  
 71     : 2108496  
 61     : 2053636  
 (Other):15543792  
```
From the data structure and its summary it can be seen that certain categorical variables, such as device_id and device_ip, have far too many levels to be really meaningful as factors. Others, such as device_model and site_id, also have a relatively large number of levels. In this blog I have not taken this into account, but in a closer analysis it may be worthwhile to examine whether some levels can be clubbed together, or whether a particular feature should be ignored altogether as being merely another id.
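Such cardinalities can also be measured without loading the file into R, by streaming over the csv once. A sketch (the column names are taken from the file header):

```python
import csv
from collections import defaultdict

def level_counts(path, columns):
    """Count distinct values of the given csv columns in one streaming pass."""
    seen = defaultdict(set)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for col in columns:
                seen[col].add(row[col])
    return {col: len(vals) for col, vals in seen.items()}

# e.g. level_counts("train.csv", ["device_id", "device_ip", "site_id"])
# would report the cardinalities seen in the str() output above.
```

Memory here is bounded by the number of distinct values, not the number of rows, which is why this works where read.csv does not.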
Vowpal Wabbit requires input data to be in a certain format; it is not csv. Its formatting instructions are here. You may further benefit from the detailed explanations of the input format at this Stack Overflow link. You may also like to go through the clarifications regarding the difference between 'namespace' and 'feature' at this link.
We will format the input as below. While formatting, the 'id' field is ignored as being of no importance. Note also that the value of click in train.csv is either 0 or 1.
```
==> train.csv <==
id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157
10000720757801103869,0,14102100,1005,0,d6137915,bb1ef334,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,05241af0,8a4875bd,1,0,16920,320,50,1899,0,431,100077,117
10000724729988544911,0,14102100,1005,0,8fda644b,25d4cfcd,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,b264c159,be6db1d7,1,0,20362,320,50,2333,0,39,-1,157
10000918755742328737,0,14102100,1005,1,e151e245,7e091613,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,e6f67278,be74e6fe,1,0,20632,320,50,2374,3,39,-1,23
10000949271186029916,1,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,37e8da74,5db079b5,1,2,15707,320,50,1722,0,35,-1,79

==> train.vw <==
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_ddd2926e device_model_44956a24 device_type_1 device_conn_type_2 |others c14_15706 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_-1 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_96809ac8 device_model_711ee120 device_type_1 device_conn_type_0 |others c14_15704 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_b3cf8def device_model_8a4875bd device_type_1 device_conn_type_0 |others c14_15704 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_e8275b8f device_model_6332421a device_type_1 device_conn_type_0 |others c14_15706 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_100084 c21_79
-1 |fe c1_1005 banner_pos_1 |site site_id_fe8cc448 site_domain_9166c161 site_category_0569f928 |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_9644d0bf device_model_779d90c2 device_type_1 device_conn_type_0 |others c14_18993 c15_320 c16_50 c17_2161 c18_0 c19_35 c20_-1 c21_157
-1 |fe c1_1005 banner_pos_0 |site site_id_d6137915 site_domain_bb1ef334 site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_05241af0 device_model_8a4875bd device_type_1 device_conn_type_0 |others c14_16920 c15_320 c16_50 c17_1899 c18_0 c19_431 c20_100077 c21_117
-1 |fe c1_1005 banner_pos_0 |site site_id_8fda644b site_domain_25d4cfcd site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_b264c159 device_model_be6db1d7 device_type_1 device_conn_type_0 |others c14_20362 c15_320 c16_50 c17_2333 c18_0 c19_39 c20_-1 c21_157
-1 |fe c1_1005 banner_pos_1 |site site_id_e151e245 site_domain_7e091613 site_category_f028772b |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_e6f67278 device_model_be74e6fe device_type_1 device_conn_type_0 |others c14_20632 c15_320 c16_50 c17_2374 c18_3 c19_39 c20_-1 c21_23
1 2 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe site_domain_f3845767 site_category_28905ebd |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_a99f214a device_ip_37e8da74 device_model_5db079b5 device_type_1 device_conn_type_2 |others c14_15707 c15_320 c16_50 c17_1722 c18_0 c19_35 c20_-1 c21_79
-1 |fe c1_1002 banner_pos_0 |site site_id_84c7ba46 site_domain_c4e18dd6 site_category_50e219e0 |app app_id_ecad2386 app_domain_7801e8d9 app_category_07d7df22 |device device_id_c357dbff device_ip_f1ac7184 device_model_373ecbe6 device_type_0 device_conn_type_0 |others c14_21689 c15_320 c16_50 c17_2496 c18_3 c19_167 c20_100191 c21_23
```
What we have done above is this: we created five namespaces: fe, site, app, device and others. The initial two fields have been bracketed under the 'fe' namespace ('fe' is an arbitrary name), the site-related fields under the 'site' namespace, and so on. Fields about which we are not clear (the names being anonymized) are under the 'others' namespace.
```
Namespace   Fields (prefix to value) in namespace
|fe         c1_  banner_pos_
|site       site_id_  site_domain_  site_category_
|app        app_id_  app_domain_  app_category_
|device     device_id_  device_ip_  device_model_  device_type_  device_conn_type_
|others     c14_  c15_  c16_  c17_  c18_  c19_  c20_  c21_
```
A namespace name starts with '|'. A namespace is known by its initial letter rather than by its complete name; thus the identifier for the 'site' namespace is 's' rather than 'site'. Before the first namespace ('|fe') we have the value of the class (i.e. 'click') label: it is -1 if the click is 0, and remains 1 if the click is 1. As the clicks (1s) are few, we have attached an 'Importance' weight of 2 to each click (see the ninth example in the train.vw excerpt above, which begins '1 2'). Later on in our analyses, we will vary this Importance and see the effect.
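To make the anatomy of such a line explicit, here is a toy parser (for illustration only; it is not VW's own parser and ignores niceties such as numeric feature values and namespace weights):

```python
def parse_vw_line(line):
    """Split a VW example into (label, importance, {namespace: [features]})."""
    head, *ns_parts = line.split("|")
    head = head.split()                     # e.g. ['1', '2'] or ['-1']
    label = float(head[0])
    importance = float(head[1]) if len(head) > 1 else 1.0
    namespaces = {}
    for part in ns_parts:
        name, *feats = part.split()         # first token is the namespace name
        namespaces[name] = feats
    return label, importance, namespaces

label, imp, ns = parse_vw_line("1 2 |fe c1_1005 banner_pos_0 |site site_id_1fbe01fe")
print(label, imp, sorted(ns))  # 1.0 2.0 ['fe', 'site']
```

A line without an explicit weight, such as the '-1 |fe ...' examples, gets the default importance of 1.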
Conversion from csv to Vowpal Wabbit format is easy and can be carried out using either 'awk' or Python. The awk code is below. The header line is ignored (NR > 1), as is the first field (i.e. id, or $1):
```awk
#! /bin/awk -f
#
# Call it as: ./train.awk train.csv > train.vw
#
# Check few sample lines in VW Validator: http://hunch.net/~vw/validate.html
BEGIN {FS = "," ; ORS = ""}
{
  if (NR > 1) {
    if ($2 == 0) $2 = "-1" ; else $2 = "1 2"
    print($2)
    print(" |fe")
    print(" c1_") ; print $4
    print(" banner_pos_") ; print $5
    print(" |site")
    print(" site_id_") ; print $6
    print(" site_domain_") ; print $7
    print(" site_category_") ; print $8
    print(" |app")
    print(" app_id_") ; print $9
    print(" app_domain_") ; print $10
    print(" app_category_") ; print($11)
    print(" |device")
    print(" device_id_") ; print($12)
    print(" device_ip_") ; print($13)
    print(" device_model_") ; print($14)
    print(" device_type_") ; print($15)
    print(" device_conn_type_") ; print($16)
    print(" |others")
    print(" c14_") ; print($17)
    print(" c15_") ; print($18)
    print(" c16_") ; print($19)
    print(" c17_") ; print($20)
    print(" c18_") ; print($21)
    print(" c19_") ; print($22)
    print(" c20_") ; print($23)
    print(" c21_") ; print($24)
    print("\n")
  }
}
```
You can print the first few lines of the 'train.vw' file using the command 'head --lines 5 train.vw'. The Python conversion code is equally simple and I give it below:
```python
## Use it as: python convert.py > train.vw
import csv

trainfile = open("train.csv", "r")
csv_reader = csv.reader(trainfile)
linenum = 0
for row in csv_reader:
    linenum += 1
    # If not header
    if linenum > 1:
        vw_line = ""
        # Check value in column 2. If 0, the label is -1;
        # else the label is 1 with an importance weight of 2
        if str(row[1]) == "0":
            vw_line += "-1 |fe"
        else:
            vw_line += "1 2 |fe"
        dtime_numb = row[2]        # 'hour' field, format YYMMDDHH
        year = dtime_numb[0:2]
        month = dtime_numb[2:4]
        day = dtime_numb[4:6]
        hour = dtime_numb[6:8]
        vw_line += " year:" + year + " month:" + month + " day:" + day + " hour:" + hour
        vw_line += " |pos"
        vw_line += " c1_" + str(row[3])
        vw_line += " banner_pos_" + str(row[4])
        vw_line += " |site"
        vw_line += " site_id_" + str(row[5])
        vw_line += " site_domain_" + str(row[6])
        vw_line += " site_category_" + str(row[7])
        vw_line += " |app"
        vw_line += " app_id_" + str(row[8])
        vw_line += " app_domain_" + str(row[9])
        vw_line += " app_category_" + str(row[10])
        vw_line += " |device"
        vw_line += " device_id_" + str(row[11])
        vw_line += " device_ip_" + str(row[12])
        vw_line += " device_model_" + str(row[13])
        vw_line += " device_type_" + str(row[14])
        vw_line += " device_conn_type_" + str(row[15])
        vw_line += " |others"
        vw_line += " c14_" + str(row[16])
        vw_line += " c15_" + str(row[17])
        vw_line += " c16_" + str(row[18])
        vw_line += " c17_" + str(row[19])
        vw_line += " c18_" + str(row[20])
        vw_line += " c19_" + str(row[21])
        vw_line += " c20_" + str(row[22])
        vw_line += " c21_" + str(row[23])
        print(vw_line)
trainfile.close()
```
You may have noted that in the Python code I have also included the 'hour' field, breaking it into four pieces. However, we will not use the 'hour' field in our learning machine. It is also a good idea to check beforehand that the input format meets vw's requirements; you can do this by pasting a few lines into the vw validator here. The size of the train.vw file is around 12.5 GB, i.e. more than double that of 'train.csv'.
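Beyond the online validator, a quick local spot-check of the generated file's shape can be done in a few lines of Python. This is only a sketch: it checks the label and the number of namespaces per line, not the full VW grammar:

```python
def check_vw_file(path, n_namespaces=5, max_lines=1000):
    """Spot-check the first lines of a .vw file; return suspect line numbers."""
    bad = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            head = line.split("|")[0].split()
            if not head or head[0] not in ("-1", "1"):
                bad.append(i + 1)            # bad or missing label
            elif line.count("|") != n_namespaces:
                bad.append(i + 1)            # wrong number of namespaces
    return bad

# e.g. check_vw_file("train.vw") should return [] for the file built above.
```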
Now that our vw file is ready, we can feed it to the VW machine. The command (just the first line) and its output are as below:
```
$ vw -d train.vw --cache_file neural --inpass --passes 5 -q sd -q ad -q do -q fd --binary -f neural_model --loss_function=logistic --nn 3
creating quadratic features for pairs: sd ad do fd
final_regressor = neural_model
using input passthrough for neural network training
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = neural
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000000   0.000000            1         1.0  -1.0000  -1.0000      102
0.000000   0.000000            2         2.0  -1.0000  -1.0000      102
0.000000   0.000000            4         4.0  -1.0000  -1.0000      102
0.000000   0.000000            8         8.0  -1.0000  -1.0000      102
0.062500   0.125000           16        16.0  -1.0000  -1.0000      102
0.125000   0.187500           32        32.0  -1.0000  -1.0000      102
0.234375   0.343750           64        64.0  -1.0000  -1.0000      102
0.234375   0.234375          128       128.0  -1.0000  -1.0000      102
0.179688   0.125000          256       256.0  -1.0000  -1.0000      102
0.179688   0.179688          512       512.0  -1.0000  -1.0000      102
0.177734   0.175781         1024      1024.0  -1.0000  -1.0000      102
0.176270   0.174805         2048      2048.0  -1.0000  -1.0000      102
0.183594   0.190918         4096      4096.0  -1.0000  -1.0000      102
0.180298   0.177002         8192      8192.0  -1.0000  -1.0000      102
0.174622   0.168945        16384     16384.0  -1.0000  -1.0000      102
0.177582   0.180542        32768     32768.0  -1.0000  -1.0000      102
0.175446   0.173309        65536     65536.0  -1.0000  -1.0000      102
0.173775   0.172104       131072    131072.0  -1.0000  -1.0000      102
0.170753   0.167732       262144    262144.0  -1.0000  -1.0000      102
0.163738   0.156723       524288    524288.0  -1.0000  -1.0000      102
0.157854   0.151970      1048576   1048576.0  -1.0000  -1.0000      102
0.161858   0.165862      2097152   2097152.0  -1.0000  -1.0000      102
0.172266   0.182673      4194304   4194304.0   1.0000  -1.0000      102
0.161030   0.149794      8388608   8388608.0  -1.0000  -1.0000      102
0.168042   0.175053     16777216  16777216.0  -1.0000  -1.0000      102
0.164841   0.161640     33554432  33554432.0  -1.0000  -1.0000      102
0.165551   0.165551     67108864  67108864.0  -1.0000  -1.0000      102 h
0.165849   0.166147    134217728 134217728.0  -1.0000  -1.0000      102 h

finished run
number of examples per pass = 36386071
passes used = 5
weighted example sum = 1.8193e+08
weighted label sum = -1.20141e+08
average loss = 0.164397 h
best constant = -1.58694
best constant's loss = 0.455592
total feature number = 18556896210
```
An explanation of the arguments to the 'vw' command follows. While processing the text file train.vw, vw first converts it into a special binary format (a cache file); this file is 'neural'. The next time you run the command, vw will use this cache rather than the vw file. The contents of the cache file are argument dependent; if you run vw with different arguments, a new cache file may be created. The number of '--passes' is 5. '--binary' is for binary classification. The model will be stored in the file named by '-f neural_model'. The loss function for convergence is '--loss_function=logistic'. Why did I select the 'logistic' loss function? The default loss function is 'squared', and in many instances the squared loss function leads to slower learning. An excellent and simple explanation of loss functions appropriate for neural networks is given by Michael Nielsen in his html book here. '--nn 3' specifies a neural network with one hidden layer of 3 neurons. '--inpass' adds a further direct connection between the input and output layers. The arguments '-q sd -q ad -q do -q fd' create interaction variables. An interaction variable is created from two variables 'A' and 'B' by multiplying the values of 'A' and 'B'; it is a general technique and you may read more about it on Wikipedia. The argument '-q sd' means that all possible interaction variables are created between features in the namespace 's' (i.e. site) and the namespace 'd' (i.e. device). A similar explanation holds for the three other interactions '-q ad -q do -q fd'.
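A back-of-envelope count shows what these -q pairs do per example, and also explains the 'current features' value of 102 seen in the VW progress output:

```python
# Per-example feature counts in each namespace, from the train.vw format above
n = {"fe": 2, "site": 3, "app": 3, "device": 5, "others": 8}

base = sum(n.values())                  # 21 input features per example
# Each -q pair multiplies the per-example counts of its two namespaces
quad = (n["site"] * n["device"]         # -q sd : 15
        + n["app"] * n["device"]        # -q ad : 15
        + n["device"] * n["others"]     # -q do : 40
        + n["fe"] * n["device"])        # -q fd : 10
print(base + quad + 1)  # 102 (the +1 is VW's constant feature)
```

So each example carries 80 quadratic features on top of its 21 input features, which is why the total feature number at the end of the run grows into the tens of billions.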
In our model building we have not used regularization. Regularization helps to avoid overfitting. For example, a non-linear model may become so non-linear as to connect all points of a class (including noise); such a curve may have so many twists and turns that when the model is asked for the class of an unseen point, the result may be misleading. Model building, therefore, attempts to penalize excessive twists and turns, hence 'regularization'. For further exploration it is worthwhile trying L1 ('--l1') and L2 ('--l2') regularization; say, to start with, '--l1 0.0005' and '--l2 0.00004'. About regularization in neural networks you may like to read this work here.
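How the two penalties enter a plain stochastic gradient step can be sketched as follows. This is a simplification for illustration only, not VW's actual update (VW handles '--l1' with truncated-gradient-style updates):

```python
def sgd_step(w, grad, eta=0.5, l1=0.0005, l2=0.00004):
    """One SGD step on loss + l1*|w| + (l2/2)*w^2 for a single weight.

    'grad' is the gradient of the unregularized loss at w; the L1 term
    adds l1*sign(w) and the L2 term adds l2*w to it.
    """
    sign = (w > 0) - (w < 0)
    return w - eta * (grad + l1 * sign + l2 * w)

# Even with a zero loss gradient, the penalties pull the weight toward
# zero: sgd_step(1.0, 0.0) is slightly below 1.0.
```

The penalties thus shrink weights continuously, which is exactly what discourages the over-fitted twists and turns described above.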
The average loss of 0.164397 is an indicator of model fit (see the 'finished run' summary in the vw output above). A direct, relative comparison of the performance of various models can be made with this measure rather than uploading the predictions to Kaggle.
The above run takes around 2 hours on an 8 GB machine and occupies at most 1.5 GB of RAM. It is thus both fast and economical in resources.
It is now time to predict clicks for test.csv. This file first needs to be converted to vw format. We add a click field to it, with a uniform click value of 1 in all records; this field is ignored while making predictions. The awk conversion code is below. The first field of 'test.csv' is the 'id' field and is ignored:
```awk
#! /bin/awk -f
#
# Call it as: ./test.awk test.csv > test.vw
BEGIN {FS = "," ; ORS = ""}
{
  if (NR > 1) {
    $2 = "1" ; print($2)
    print(" |fe")
    print(" c1_") ; print $3
    print(" banner_pos_") ; print $4
    print(" |site")
    print(" site_id_") ; print $5
    print(" site_domain_") ; print $6
    print(" site_category_") ; print $7
    print(" |app")
    print(" app_id_") ; print $8
    print(" app_domain_") ; print $9
    print(" app_category_") ; print($10)
    print(" |device")
    print(" device_id_") ; print($11)
    print(" device_ip_") ; print($12)
    print(" device_model_") ; print($13)
    print(" device_type_") ; print($14)
    print(" device_conn_type_") ; print($15)
    print(" |others")
    print(" c14_") ; print($16)
    print(" c15_") ; print($17)
    print(" c16_") ; print($18)
    print(" c17_") ; print($19)
    print(" c18_") ; print($20)
    print(" c19_") ; print($21)
    print(" c20_") ; print($22)
    print(" c21_") ; print($23)
    print("\n")
  }
}
```
Next, we use the model prepared earlier to make predictions for test.vw. The vw command is:
```
$ vw -d test.vw -t -i neural_model --link=logistic -p probabilities.txt
```
The argument '-t' indicates that we are feeding a test file and the class field is to be ignored. '-i' specifies the model file. We use '--link=logistic' to get probabilities; use '--link=glf1' instead to get output in [-1, 1]. The output file is 'probabilities.txt'.
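Assuming the usual definitions of these link functions, the two options map VW's raw score as follows:

```python
import math

def logistic(score):
    """--link=logistic: map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def glf1(score):
    """--link=glf1: generalized logistic, mapped into (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-score)) - 1.0

print(round(logistic(0.0), 2), round(glf1(0.0), 2))  # 0.5 0.0
```

A raw score of zero thus corresponds to a 50% click probability under the logistic link, and to a neutral 0 under glf1.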
Kaggle requires that we submit results in the format 'id,probability'. A sample submission file on the site has all the ids (the same as in test.csv). We read this file in R and overwrite its second column with our predicted probabilities. The R code is as below:
```r
p <- read.table("probabilities.txt", header=FALSE)
samplesubmission <- read.csv("sampleSubmission.csv", header=TRUE,
                             colClasses=c('character','numeric'))
samplesubmission[, 2] <- p[1]
head(samplesubmission)
write.csv(samplesubmission, "neural_result.csv", quote=FALSE, row.names=FALSE)
```
The file neural_result.csv (either as-is or zipped) can be submitted to Kaggle. Even though the competition is over, one still gets a score. For this model the score was 0.3976030.
Let us now review our model options once again. We selected the neural network option with just three neurons; this choice (it is believed) gives near-optimum results. But, it appears, we could have done almost as well without the neural network, using just the following learning model:
```
$ vw -d train.vw --cache_file neural --passes 5 -q sd -q ad -q do -q fd --binary -f logistic_model --loss_function=logistic
```
With the above we get a Kaggle score of 0.3994788. Thus the '--nn 3' option did make an improvement, though not a dramatic one. We did not test the '--nn' option with larger numbers of neurons. Increasing the number of '--passes' from 5 to 10 did not affect the results; the actual number of passes used was 8. I may mention that VW's default learning algorithm is online gradient descent.
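The idea of online gradient descent with logistic loss can be sketched in a few lines, including how an importance weight scales each update. This is a toy version; VW itself adds feature hashing, adaptive and normalized learning rates, and much more:

```python
import math
from collections import defaultdict

def train_online(examples, eta=0.5):
    """One pass of online SGD on logistic loss.

    'examples' yields (features, label, importance) triples, label in {+1, -1}.
    """
    w = defaultdict(float)
    for feats, label, importance in examples:
        score = sum(w[f] for f in feats)
        # d/dscore of log(1 + exp(-label*score)), scaled by the importance weight
        grad = -label * importance / (1.0 + math.exp(label * score))
        for f in feats:
            w[f] -= eta * grad          # update only the features present
    return w

w = train_online([(["c1_1005", "banner_pos_0"], -1, 1.0),
                  (["c1_1005", "banner_pos_1"], 1, 2.0)])
```

Each example updates only the weights of the features it contains, which is what lets VW stream through a 12.5 GB file in bounded memory.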
We initially used an 'Importance' weight of 2 for clicks (of '1') while converting 'train.csv' to 'train.vw' (see the ninth example in the train.vw excerpt above). We did this so that the learner would treat clicks as important events and not merely as noise, such events being few.
We raised the 'Importance' to 3; the score degraded. We then changed the Importance to just 1, i.e. treated click events on a par with non-click events. The Kaggle score was still 0.3976030. This meant that 20% clicks were sufficient for VW to make good enough predictions. This finishes our experiments with Vowpal Wabbit on CTR prediction.
Tags: Advertisement click prediction, CTR prediction, Vowpal Wabbit, vowpal wabbit neural network
March 4, 2015 at 11:41 pm
Your awk script is outputting an importance of 1 for click training lines, while your python script is outputting an importance of 2 for click training lines.
I read in your article that you settled on importance of 2 for clicks, for best results. It may be a good idea to adjust your awk script to reflect that.
Great work, BTW!
March 6, 2015 at 2:20 am
Thanks. I have corrected.
September 24, 2015 at 2:17 am
Great post. Just curious, what resources did you use to familiarize yourself with VW’s nnet training algorithm? For example, I had no idea that there was an –inpass argument. Tips and tricks seem kinda scattered all over the internet.
September 24, 2015 at 5:56 am
Thanks. I do not remember exactly what I referred to.