Web History Visualization in R

When poking around my file system, I came across the history folder of my Google Chrome web browser. I wondered if the graphs could capture my recent odd work schedule. Technial notes follow, but here are the figures. Click for bigger PDF.

Apparently Fridays are low productivity days, mainly because a class takes up my afternoon. Apparently I tend to work from noon until midnight or a little after:

This is more evident in a stacked bar chart of all days:

More detailed info can be pulled out, like how (relatively infrequently, given my generation) I check Facebook:

The R code to make them is below. It should run on any Mac, and with a few minor modifications on another OS.

Technical Notes

First make a copy of the history file, which is locked if Chrome is open:

orig_hist_file = paste("~/Library/Application Support",
                       "/Google/Chrome/Default/History",
                       sep="")
copy_hist_file = paste("~/Library/Application Support",
                       "/Google/Chrome/Default/temp_hist.sql",
                       sep="")
file.copy(orig_hist_file, 
          copy_hist_file,
          overwrite=TRUE)

The file is a SQLite database, which we can connect to with R using the RSQLite library. I pass a query to grab the URL, the webpage title, the visit count and the time of the visit. The time of the visit is in a funky format, hence the call to the SQL datetime() function. It is encoded as the time passed since January 1st, 1601, so this converts it to something a bit more meaningful.

library (RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, copy_hist_file)

query = paste("SELECT u.url, u.title, u.visit_count, ",
        "datetime(v.visit_time/1000000-11644473600, ",
                 "'unixepoch','localtime') ",
        "AS visit_time ",
        "FROM urls AS u, visits AS v ",
        "WHERE u.id = v.url;", sep="")

df = dbGetQuery(con, query)

I can grab the date and time from the full date-time using R's as.POSIXct() function. I also tally up the number of visits per day and make a column that contains the day of the week of the visit.

# Grab date and time
df$day = trunc(as.POSIXct(df$visit_time), "day")
df$time = difftime(as.POSIXct(df$visit_time), df$day, units="hours")

# Total visits per day
urls_per_day = aggregate(as.numeric(df$hour), 
                 by=list(as.character(df$day)), sum)
wkdays = format(as.POSIXct(urls_per_day$Group.1), "%u")
wkdays.s = weekdays(as.POSIXct(urls_per_day$Group.1))
wkdays.s = factor(wkdays.s, levels = c("Sunday", 
       "Monday", "Tuesday", "Wednesday", 
       "Thursday", "Friday", "Saturday"))

Here's the first plot of URLs visited per day, which uses ggplot2's geom_bar():

ggplot(urls_per_day, aes(Group.1, x, fill=wkdays.s)) + 
scale_fill_continuous(low="#333333", 
                      high="lightblue", 
                      limits=c(1,7)) +
opts(legend.position = "none") + 
geom_bar(stat="identity") + 
geom_text(data=urls_per_day, 
          aes(x=Group.1, y=x, label=wkdays.s, size=12), 
          vjust=-1) +
ylab("Total URLs Visited") +
xlab("Date") +
opts(axis.text.x  = theme_text(angle=90, hjust=1.2, size=10)) + 
opts(title = expression("Total URLs Visited per Day"))

The next plot is made up of histograms of the visits' times, arranged vertically using ggplot2's facet_grid function. It requires a modified data.frame, with the time converted to numeric type and the day converted to character type.

df.2 = df
df.2$time = as.numeric(df.2$time)
df.2$day = as.character(df.2$day)

# Plot daily histograms, stacked vertically
ggplot(df.2, aes(time, ..density.., fill = ..count..)) + 
geom_histogram(binwidth = 0.25) +
facet_grid(day ~ .) +
opts(strip.text.y = theme_text(angle=0, size=8)) + 
scale_x_continuous(breaks=c(0,6,12,18,24), 
     labels=c('12AM','6AM','12PM','6PM', '12AM')) +
opts(axis.text.y = theme_blank(), 
     axis.ticks = theme_blank()) +
opts(axis.title.x = theme_blank()) +
ylab("Total URLs Visited") +
opts(title = expression("Visits in 15 Min. Intervals"))

Finally, the stacked histogram of all days.

# Stacked Histogram
ggplot(df.2, aes(time, ..count.., fill = day)) + 
geom_bar(binwidth = 0.25, color="black") +
scale_fill_continuous(low="#333333", 
           high="lightblue", 
           limits=c(1,length(unique(as.POSIXct(df$day))))) +
scale_x_continuous(breaks=c(0,6,12,18,24), 
           labels=c('12AM','6AM','12PM','6PM', '12AM')) +
opts(axis.title.x = theme_blank()) +
ylab("Total URLs Visited") +
opts(legend.position = "none")