Analyzing Apache Weblogs In R
Although many web hosts and services such as Google Analytics provide statistics for website usage, it can sometimes be enlightening to drill deeper into a site's traffic. In this article, I will use the free statistical programming language R to dig into the Apache web logs of a website.
Looking in depth at web logs requires three steps. The first is to gather the appropriate log files, typically found through a web host's cPanel or similar interface. The next is to parse the logs, which can be done inside or outside of R. Lastly, one can analyze and visualize the data within R using its many readily available functions.
The first step is to collect the logs for your website. Hosts usually provide raw logs on a per-unit basis (e.g., per day or per month). Often one must concatenate several log files - for instance, if one wishes to analyze the logs on a monthly basis but only has daily logs, a simple concatenation can aggregate the files. On Mac/Linux, navigate on the command line to the appropriate directory and concatenate:
cat *.log > logs.txt
At this point, one may wish to use the text processing capabilities of other Linux command line utilities (awk, perl, etc.) to trim down the data a bit. On Windows, the type command can be used in a similar fashion.
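Alternatively, for those who prefer to stay within R, the daily files can be read and combined there directly. A minimal sketch, assuming the daily logs live in a single directory and share the same space-delimited format (the directory path and file pattern below are placeholders):

filenames = list.files("path/to/daily/logs", pattern="\\.log$", full.names=TRUE) #hypothetical directory holding the daily logs
daily = lapply(filenames, read.csv, sep=" ", header=FALSE) #read each daily file with the same settings used below
logs = do.call(rbind, daily) #stack the daily data frames into one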
The typical Apache log file is tabular - each request is a single row, and each column contains data about that request. The files in this article have their columns delimited by a space, with each column representing the following values:
IP, RFC 1413 Id, User Id, Time, Request Page, Status Code, Size, Referring Page, User Agent
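For reference, a single request in this combined-style format looks roughly like the following (the exact layout depends on the server's LogFormat configuration, and every value here is made up for illustration):

127.0.0.1 - frank [10/Apr/2013:13:55:36] "GET /index.php HTTP/1.1" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0 (compatible; ExampleBrowser)"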
Once a log file is available it can be loaded into R:
logs = read.csv("path/to/logs.txt", sep=" ", header=FALSE) #replace with the path to your concatenated log file
head(logs) #View the first few lines of the data
Here we have read in the file as a data frame, using the head function to show the first few lines of the data. Depending upon the format of the log file, the resulting data in R may require some filtering, pasting, or other manipulation to make it more usable mathematically. In my case, I will discard the unnecessary columns.
logs = logs[,-c(2,3)] #drop the RFC 1413 Id and User Id columns
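As an example of the filtering mentioned above, one might also drop rows that are not real page views. This is an optional sketch and is not applied to the analysis below; the column indices assume the layout described earlier with columns 2 and 3 already removed (so column 3 is the request and column 4 the status code):

pageViews = logs[logs[,4] == 200, ] #keep only requests that returned HTTP 200
pageViews = pageViews[!grepl("\\.(png|jpg|gif|css|js)", pageViews[,3]), ] #drop images, stylesheets, and scripts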
The date column has unique formatting, and I want to use this to tabulate visits. To do so, I will pull out the day using a regular expression and the str_match function in the stringr package (if you do not have this package, it can be installed using the install.packages() function), then convert this to a Date, and finally append this column to the data frame.
library(stringr)
logs = cbind(logs, as.Date(str_match(logs[,2], "[0-9]{1,2}\\/[A-Za-z]{3}\\/[0-9]{4}"), format="%d/%b/%Y"))
This leaves us with a column representing the day of the visit. We could also do this for the hour, the day of the week, etc., depending upon how one wishes to slice and dice the data.
logs = cbind(logs, str_match(logs[,2], "[0-9]{4}:([0-9]{2}):")) #extract the hour from the time column
logs = logs[,-(ncol(logs)-1)] #remove the full-match column of the regular expression above
logs = cbind(logs, weekdays(logs[,(ncol(logs) - 1)])) #derive the day of the week from the Date column
logs = logs[,-2] #remove the original date column
colnames(logs) = c("IP", "Webpage", "Status", "Size", "Referrer", "User Agent", "Date", "Hour", "Weekday") #Set the column names
Note that a lot of the manipulations above can readily be done outside of R, for instance using text processing command line tools such as awk, perl, sed, etc. Now we're ready for the fun. Let's see how many visits we get per day:
vpd = tapply(logs[,1], logs[,7], length) #tabulate the visits per day
marginDefaults = par()$mar
par(mar=c(9,4,4,4)) #widen the bottom margin so the date labels fit
barplot(vpd, main="Visits Per Day", col="lightgreen", las=2, border="gray")
par(mar = marginDefaults) #restore the default margins
And the result:
What about those peaks and valleys? There seems to be a pattern of one or two slow days followed by five or more busier ones... perhaps a weekend effect? Let's look at a few of those lower-hit days:
> weekdays(as.Date(c("07/Apr/2013", "13/Apr/2013", "20/Apr/2013"), format="%d/%b/%Y"))
[1] "Sunday"   "Saturday" "Saturday"

#More succinctly:
head(weekdays(as.Date(names(vpd[order(vpd)]))))
[1] "Sunday"   "Saturday" "Saturday" "Sunday"   "Saturday" "Monday"
Looks to be a large portion of weekend days in the lower end - not too surprising, but as an aside, how about some statistics? Is there a difference in hits on a per-day-of-the-week basis?
##Create a data frame containing the counts per day and the corresponding day of the week
tally = tapply(logs[,1], logs[,7], length)
tally = data.frame(matrix(c(tally, names(tally)), ncol=2))
tally = cbind(tally, weekdays(as.Date(tally[,2])))
colnames(tally) = c("Count", "Date", "Day")
tally[,1] = as.numeric(as.character(tally[,1])) #force this column to be numeric
fit = aov(Count ~ Day, tally) #Use analysis of variance to test if there is a difference between classes
summary(fit)

            Df  Sum Sq Mean Sq F value  Pr(>F)
Day          6 6798147 1133024   5.785 0.00013 ***
Residuals   49 9597254  195862
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The above demonstrates a more rigorous statistical approach - in this case, there does appear to be a difference in visits across the days of the week (p = 0.00013). You could also view this as a pie chart:
pie(tapply(logs[,"IP"], logs[,"Weekday"], length), main="Visits")
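As a follow-up to the ANOVA above, one could also ask which days differ from which; base R's TukeyHSD function works directly on the aov fit (a quick sketch, output omitted here):

TukeyHSD(fit) #pairwise comparisons of mean daily visits between days of the week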
Using similar techniques as shown above, one can do a whole lot more: tally the visits per hour, the most popular pages, the most popular referring pages, bandwidth usage, page 'stickiness', typical user navigation paths, and so on.
###Visits on an hourly basis tallied across each day
vph = tapply(logs[,"IP"], logs[,"Hour"], length)
barplot(vph, main="Visits Per Hour", col="lightgreen", border="gray")

##Top visited pages
logs = cbind(logs, str_match(logs[,"Webpage"], "\\.com\\/([^\\.]+\\.(php|html))")[,2]) #Extract the actual webpage (the first capture group)
pop = tapply(logs[,1], logs[,ncol(logs)], length) #tally by the newly added webpage column
head(pop[rev(order(pop))]) #the most visited pages first
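Two more of the ideas listed above can be sketched with the same tapply pattern. The column names follow the layout built earlier, and the Size column is assumed to hold the response size in bytes (with '-' entries becoming NA when coerced to numeric):

##Top referring pages
refs = tapply(logs[,"IP"], logs[,"Referrer"], length)
head(refs[rev(order(refs))]) #the most common referrers first

##Approximate bandwidth per day, in megabytes
bytes = as.numeric(as.character(logs[,"Size"])) #the Size column may be read in as a factor
bw = tapply(bytes, logs[,"Date"], sum, na.rm=TRUE) / 2^20
barplot(bw, main="Bandwidth Per Day (MB)", col="lightgreen", las=2, border="gray")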
Given the comprehensive toolbox of R, one can do quite a lot to explore and inspect website traffic beyond what the typical free analytics services provide.
Of course, websites with much higher traffic than the site analyzed above make the job much harder, and may require a distributed system to handle the data load and analytics.