This is the R script I wrote to extract 10 years of Sandettie buoy data from the publicly accessible web pages of a well-known commercial weather service. Assuming you have R installed on your machine, you can run the script yourself. But first you will need to:

  1. Install the XML package.
  2. Construct a vector of dates and assign it to an object named dates.
  3. Assign URLbase and URLtail appropriately.

That should be specific enough for someone who knows what they’re doing to make it work. No, I’m not going to give you the name of the website, but it shouldn’t be difficult to deduce.
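
For what it's worth, the setup steps might look roughly like the sketch below. Everything specific in it is an assumption on my part for illustration: the date range, the date format, and especially the two URL pieces, which are placeholders on example.com and not the real site.

```r
# One-time install of the XML package (uncomment to run)
# install.packages("XML")

# ASSUMPTION: a 10-year daily range in YYYY-MM-DD form; adjust the range
# and the format string to whatever the site's URLs actually expect
dates <- format(seq(as.Date("2003-01-01"), as.Date("2012-12-31"), by = "day"),
                "%Y-%m-%d")

# ASSUMPTION: hypothetical placeholder URL pieces, not the real website
URLbase <- "http://www.example.com/buoy/history/"
URLtail <- "/daily.html"

# The scraper glues the three pieces together like this for each date
first_url <- paste(URLbase, dates[1], URLtail, sep = "")
```

Keeping dates as character strings rather than Date objects means the progress message that cat() prints in the loop shows a readable date instead of the underlying day count.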

library(XML)  # provides htmlTreeParse() and xpathSApply()

scraper <- function(dates) {
    # Start from an empty data frame in the global environment
    assign("alldat", data.frame(), envir=.GlobalEnv)
    for(i in seq_along(dates)) {
        url <- paste(URLbase, dates[i], URLtail, sep="")
        # Read in raw HTML, parse, & extract table
        d1 <- readLines(url)
        d2 <- htmlTreeParse(d1, useInternalNodes=TRUE)
        d3 <- xpathSApply(d2, "//div[@class='column']//tbody", xmlValue)
        if(length(d3) == 0) next  # if missing, skip
        # Split data on delimiter pattern, distribute cells into matrix
        d4 <- unlist(strsplit(d3, "\n\t\t"))
        d5 <- d4[ -c(1:12) ]  # drop the 12 leading header cells
        d6 <- as.data.frame(matrix(d5, ncol=14, byrow=TRUE))
        d7 <- d6[, -c(4,6)]  # remove empty columns
        d7$date <- dates[i]  # Append date to all rows
        names(d7) <- c("time", "atmp", "wtmp", "hum", "wdir", "wkts", "wmph",
                       "wper", "wvht", "wrng", "pres", "tend", "date")
        # Append, cat, & pause
        alldat <- rbind(alldat, d7)
        cat(dates[i], fill=TRUE)
        Sys.sleep(runif(1, 1, 7))  # random 1-7 second pause between requests
    }
    # Copy the accumulated result back to the global environment
    assign("alldat", alldat, envir=.GlobalEnv)
}
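
If you want to see what the split-and-reshape steps inside the loop do without touching the website, here is a small offline sketch. The input string is entirely synthetic: I've made up 12 header cells and two 14-cell rows joined by the same delimiter, so the values are invented and only the shape mirrors what the script expects.

```r
# ASSUMPTION: synthetic stand-in for the table text that xpathSApply() returns
header <- paste0("h", 1:12)  # 12 leading cells that the script discards
row1 <- c("00:00", "10.1", "11.2", "", "85", "", "SW",
          "12", "14", "5", "1.2", "2.0", "1015", "rising")
row2 <- c("01:00", "10.0", "11.2", "", "86", "", "W",
          "10", "12", "6", "1.1", "1.9", "1016", "falling")
d3 <- paste(c(header, row1, row2), collapse = "\n\t\t")

# Same steps as in the scraper loop
d4 <- unlist(strsplit(d3, "\n\t\t"))                  # one element per cell
d5 <- d4[ -c(1:12) ]                                  # drop the header cells
d6 <- as.data.frame(matrix(d5, ncol = 14, byrow = TRUE))  # 14 cells per row
d7 <- d6[, -c(4, 6)]                                  # remove empty columns
names(d7) <- c("time", "atmp", "wtmp", "hum", "wdir", "wkts", "wmph",
               "wper", "wvht", "wrng", "pres", "tend")
print(d7)
```

Note that every column comes out as text (or factors in older versions of R), so the numeric fields still need converting before any analysis.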

