Archive for July, 2010

Bootstrap, strap-on, anal-yzing… statistics is getting weirder by the moment

I have spent the better part of the day trying to get a bootstrap working. I had adapted a pre-written bootstrap function, but I wanted to use a generic function, mostly for reaping fame and glory. My hypothesis was that a hand-written, unoptimized function would consume more computer time than the generic, presumably optimized, boot() function. Was I wrong!

In this example, I randomly sample (with replacement) a vector of n values and store the results in a matrix. The values written to the matrix are the proportions of the sampled values that fall into each bin (see the code). Once we’ve got all the values, we can calculate the median by columns for further tinkering. There’s a lot of material covering bootstrapping on the Internet, but for a quick reference, see this Wikipedia entry. At this StackOverflow thread, Aniko provided me with valuable information on what I was doing wrong.

This is the code I used – the first part is the hand-written function, and the second is my attempt at cracking the matter with a generic function.

require(Hmisc)

#generate fake data
data <- rchisq(500, df = 12)

#create bins
binx <- cut(data, breaks = 10)
binx <- levels(binx)
binx <- sub("^.*\\,", "", binx) #keep only the upper boundary of each interval, e.g. "(3.1,5.3]" -> "5.3]"
binx <- as.numeric(substr(binx, 1, nchar(binx) - 1)) #strip the trailing "]" and convert to numeric

#pre-allocate a matrix to be filled with samples
num.boots <- 10
output <- matrix(NA, nrow = num.boots, ncol = length(binx))

#do random sampling from the vector and calculate percent
#of values equal or smaller to the bin number (i)

for (i in 1:num.boots) {
 data.sample <- sample(data, size = length(data), replace = TRUE)
 data.cut <- cut(x = data.sample, breaks = 10)
 data.cut <- table(data.cut)/sum(table(data.cut))
 output[i, ] <- data.cut
}

#do some plotting
plot(1:10, seq(0, max(output), length.out = ncol(output)), type = "n", xlab = "", ylab = "") #empty canvas; ncol(output) matches the x vector 1:10

for (i in 1:nrow(output)) {
 lines(1:10, output[i, ], lwd = 0.5)
}
output.median <- apply(output, 2, median)
output.sd <- apply(output, 2, sd)
sd.up <- output.median + output.sd * 1.96
sd.down <- output.median - output.sd * 1.96

lines(output.median, col="red", lwd = 3)
lines(sd.up, lty="dashed", col="green")
lines(sd.down, lty="dashed", col="green")
legend(x = 8, y = 0.25, legend = c("median", "95% CI"), col = c("red", "green"), lty = c("solid", "dashed"), lwd = c(3, 1))

All this produces the graph below.


We can then try the power of the generic boot::boot() function that ships with R (as the recommended package boot). The fun part of doing a bootstrap with the boot() function is figuring out how to write the statistic function correctly. If you look at the boot() help page (?boot), you will notice that the function you pass to the statistic argument must take at least two parameters: the data and an index parameter. The help page says the index parameter is a vector, which was a bit confusing from where I sit. In plain terms, it is a vector of indices (let’s call it i for now) that tells boot() which elements of your data go into each bootstrap replicate. If your data is a vector, you use i as you would any subsetting index – object[i]. If it’s a data.frame and you want to re-sample rows, you would call object[i, ]… Let’s see a working example; things may be clearer there.

pairboot <- function(data, i, binx) {
 #binx gets passed through by boot() but is not actually used inside the function
 data.cut <- cut(x = data[i], breaks = 10) #data[i] is this replicate's bootstrap sample
 data.cut <- table(data.cut)/sum(table(data.cut))
 return(data.cut)
}

Notice the data[i] in the function. This tells boot() to extract the i-indexed elements of the data. If we had a data.frame, and rows were what we wanted to sample randomly, we would have written data[i, ].
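To make the data.frame case concrete, here is a small sketch (entirely hypothetical – the data.frame and its columns x and y are made up) of a statistic that re-samples rows to bootstrap a correlation:

cor.boot <- function(data, i) {
 d <- data[i, ] #i selects the rows that form this bootstrap sample
 cor(d$x, d$y)
}
#would be called as: boot(data = my.df, statistic = cor.boot, R = 1000)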
And now we call the boot function to perform 10 “boots” on our data.

library(boot)
booty.call <- boot(data = data, statistic = pairboot, R = 10, binx = binx)

And here is the code to plot the output of the boot object (an object of class "boot"). If you don’t believe me, try it on your own and see if you get a picture similar to the one above.

plot(1:10, seq(0, 1.5 * max(booty.call$t0), length.out = ncol(booty.call$t)), type = "n", xlab = "", ylab = "")
apply(booty.call$t, 1, function(x) lines(x, lwd = 0.5))
booty.call.median <- apply(booty.call$t, 2, median)
booty.call.sd <- apply(booty.call$t, 2, sd)
booty.call.sd.up <- booty.call.median + booty.call.sd * 1.96
booty.call.sd.down <- booty.call.median - booty.call.sd * 1.96
lines(booty.call.median, col="red", lwd = 3)
lines(booty.call.sd.up, lty="dashed", col="green")
lines(booty.call.sd.down, lty="dashed", col="green")
legend(x = 8, y = 0.25, legend = c("median", "95% CI"), col = c("red", "green"), lty = c("solid", "dashed"), lwd = c(3, 1))

The plot thickens!

Have I, by doing the bootstrap with a generic function, profited time-wise at all? Here are the elapsed times for 10000 iterations of our hand-written bootstrap function and the generic one, respectively.

# user  system elapsed
#20.045   0.664  22.821
# user  system elapsed
#20.382   0.612  22.097
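For the record, output in this user/system/elapsed format comes from system.time(). A sketch of how the two timings can be reproduced, assuming the objects defined above:

#hand-written version, 10000 replicates
num.boots <- 10000
output <- matrix(NA, nrow = num.boots, ncol = length(binx))
system.time(
 for (i in 1:num.boots) {
  data.sample <- sample(data, size = length(data), replace = TRUE)
  data.cut <- cut(x = data.sample, breaks = 10)
  output[i, ] <- table(data.cut)/sum(table(data.cut))
 }
)

#generic boot() version
system.time(
 booty.call <- boot(data = data, statistic = pairboot, R = 10000, binx = binx)
)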

So you can see, I profit by 0.724 seconds. Was it worth it? 🙂 I can, for the moment, only wonder whether I could improve the generic function’s call to beat the hand-written one. Can anyone skilled in optimization chip in with any tips?
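One avenue that comes to mind (a sketch only – I have not benchmarked it against the numbers above): precompute the break points once and swap cut() + table() for findInterval() + tabulate(), so that no factor labels are built in each iteration. Note that the break points below only approximate what cut(..., breaks = 10) computes internally:

br <- seq(min(data), max(data), length.out = 11) #10 bins -> 11 break points
one.boot <- function() {
 s <- sample(data, size = length(data), replace = TRUE)
 tabulate(findInterval(s, br, rightmost.closed = TRUE), nbins = 10) / length(s)
}
output.fast <- t(replicate(10000, one.boot())) #10000 x 10 matrix, as before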

And here’s what a graph with 10000 iterations looks like. The green dashed lines represent an approximate 95% confidence interval (median ± 1.96 standard deviations), meaning that roughly 95% of the iterations fall between those two lines.

Bootstrap with 10000 iterations. CI is confidence interval.

Categories: Uncategorized

How to calculate with dates and hours in R

A while ago I was asked whether calculating with dates and hours is possible in R. Specifically, if you added an hour to 23:45 (say Jan 1, 2010), would R know to jump to the next day – 00:45 (Jan 2, 2010)? The answer is yes. Our only job is to tell R in what format our date and time are. After we have done so, we can use these dates and times for different calculations.

Let’s say we have a vector (or any other object type) with a date and time:

1.1.2010 11:35

We need to tell R that this is in fact a POSIX-type object, and to specify which number corresponds to what. We will tell it that the first number is the day, followed by a dot (.), followed by the month, followed by a dot… You get the picture. This is done through the format argument.

dtm <- strptime(c("1.1.2010 11:35"), format = "%d.%m.%Y %H:%M", tz = "CET")

While the tz argument (time zone) makes no real difference here, be careful if you’re calculating with dates and times from different time zones. For the following examples, I’ve added another element to the vector to spice things up.
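(For completeness – reconstructed from the output below – the two-element vector would be created like so:)

dtm <- strptime(c("1.1.2010 11:35", "1.2.2010 08:45"), format = "%d.%m.%Y %H:%M", tz = "CET")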

> dtm
[1] "2010-01-01 11:35:00 CET" "2010-02-01 08:45:00 CET"
> str(dtm)
 POSIXlt[1:9], format: "2010-01-01 11:35:00" "2010-02-01 08:45:00"

For pedagogic purposes, I will make things a wee bit more complicated (at least for novices). Let’s write two simple (and I mean SIMPLE) yet not mandatory functions that convert hours and minutes to seconds. Because arithmetic on POSIX objects is done in seconds (see below), these functions free us from typing 3 * 3600 every time we want to add 3 hours, or 3 * 60 for 3 minutes. At least in my view, this makes the calculation code a bit more readable.

Here are the two functions:

hrs <- function(u) {
 x <- u * 3600
 return(x)
}

mns <- function(m) {
 x <- m * 60
 return(x)
}

This will add three hours to the date/time values in dtm:

> dtm + hrs(3)
[1] "2010-01-01 14:35:00 CET" "2010-02-01 11:45:00 CET"

This will add 3 hours and 10 minutes.

> dtm + hrs(3) + mns(10)
[1] "2010-01-01 14:45:00 CET" "2010-02-01 11:55:00 CET"

If we hadn’t used our functions, our input would look like this

> dtm + 3*3600 + 10*60
 [1] "2010-01-01 14:45:00 CET" "2010-02-01 11:55:00 CET"

See, I told you hrs() and mns() are more intuitive. 😛

We can also subtract. Let’s cut off 111 seconds

> dtm - 111
[1] "2010-01-01 11:33:09 CET" "2010-02-01 08:43:09 CET"

We are interested in whether the date will change if we add enough hours (or minutes). Let’s add 15 hours.

> dtm + hrs(15)
[1] "2010-01-02 02:35:00 CET" "2010-02-01 23:45:00 CET"

No surprises there – the first element changed to the next day, and the second element fell 15 minutes short of midnight.

If we’re curious how far apart two dates are, we use the following function:

> difftime(dtm[1], dtm[2])
Time difference of -30.88194 days
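difftime() also takes a units argument if you would rather have the answer in, say, hours (the output below is just the days figure above converted):

> difftime(dtm[1], dtm[2], units = "hours")
Time difference of -741.1667 hours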

All this and more can also be found in The R Book by Michael J. Crawley. This particular chapter can be accessed through the web. Data Manipulation with R by Phil Spector can in part be previewed here (kudos to aL3xa for pointing it out). Help pages on date and time classes can be found in

?DateTimeClasses
Categories: Uncategorized

Building an R package (under Windows) without C, C++ or FORTRAN code

Why build an R package? It basically boils down to being able to brag at your local pub that a new version of YOUR package is on CRAN as of 7 p.m. CET. But seriously, if you’ve produced functions that other people – your boss, co-workers or students – might benefit from using (or have ordered you to provide), consider building a package. The reduced chance of broken dependencies and the ease of installing everything outweigh the effort of learning how to build one. If you feel your functions (that may be new in some respect) could benefit an even wider audience, consider submitting them to CRAN (I will not discuss how to do that here, but do read the Ripley reference I mention later).

I have set out to build a test package to prepare myself for when the time comes and I really need to build one of my own. This is an attempt to document the steps I took when building a dummy package (called texasranger (yes, THE Texas Ranger!)) with one single function. I have attempted to build the documentation and all the other ladeeda things that are mandatory for the package to check out O.K. when building it.

Before you dig into the actual preparation and building, you will need a bunch of tools. These come in the bag with a Linux distribution, but you will have to add them yourself if you’re on Windows. This is basically the only thing that differs when building a package on Windows versus Linux. I will not go into details regarding these tools (Perl, the MS HTML Help compiler, a TeX distro and, if you have C/C++/FORTRAN code, the GNU compiler set) – I will, however, advise you to check out Making R package under Windows (P. Rossi). There you will find a detailed description (see pages 2-6) of how to get all the correct tools and how to set them up. When you have done so, you are invited to come back here. Feel free to follow the just-mentioned tutorial, as it goes a bit more in-depth in explaining various aspects. The author warns that MiKTeX will not work (see the date of the document), but things might have changed since then – it now works, at least for me.

I have followed the aforementioned Making R package under Windows (by P. Rossi), the slides Making an R package by R.M. Ripley and, of course, the now famous Writing R Extensions (WRE) by the R Development Core Team (you are referred to this document everywhere). I would advise everyone to read them in this order – or at least read WRE last. The first two can be read from cover to cover in a few minutes; the last one is a good reference document for those pesky “details”. In my experience, I started to appreciate WRE only after I had read the first two documents.

Enough chit-chat, let’s get cracking!

1. These are the paths I entered (see the document by Rossi for what this is all about) to enable all the tools so that I can access them from the Command Prompt (the Command Prompt can be found under Accessories; on other OSs it may be called Terminal or Console):

c:/rtools/bin;c:/program files/miktex 2.8/miktex/bin;c:/program files/ghostgum/gsview;C:/strawberry/perl/bin;c:/program files/r/r-2.11.0/bin;c:/program files/help workshop;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%\System32\WindowsPowerShell\v1.0\;C:\strawberry\c\bin;C:\strawberry\perl\site\bin

2. Use R function

package.skeleton()

to create the directory structure and some files (DESCRIPTION and NAMESPACE). I used the following arguments:

package.skeleton(name = "texasranger", list = c("bancaMancaHuman"), namespace = TRUE) #I only have one function, but you can list more

See the argument code_files for an alternative way of telling the function where to read your functions from. I suspect this may be very handy if you have each function in a separate file.
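For illustration, had I kept the function in its own file, the call might have looked something like this (the file name is hypothetical):

package.skeleton(name = "texasranger", code_files = "bancaMancaHuman.R", namespace = TRUE)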

3. Fill out DESCRIPTION and NAMESPACE (if you decide to have a namespace, read more in the WRE document). Pay special attention to export, import, useDynLib… All of the above-mentioned documentation will guide you through the process with minimal effort.
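For orientation, a minimal DESCRIPTION for the dummy package could look something like this (all field values invented for the example):

Package: texasranger
Type: Package
Title: One-line description of what the package does
Version: 1.0-1
Date: 2010-07-25
Author: Your Name
Maintainer: Your Name <you@example.com>
Description: A longer (paragraph-style) description of the package.
License: GPL-2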
A side (but important) note. You should write your functions without them calling

require()

or

source()

to dig up other functions and packages. Read more about NAMESPACE and how to specify which functions and packages to “export” (or “import”) and how. A minimal example follows below.
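To make that concrete, a minimal NAMESPACE for our one-function package might read as follows (the importFrom() line is only an illustration, assuming we had called a function from another package):

export(bancaMancaHuman)
importFrom(boot, boot)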

4. Create documentation files. This is said to be the trickiest part. I still don’t have much experience with it, so I can’t judge how tricky it can be – but I can tell you that it may be time consuming. Make sure you take the time to document your functions well. If you were smart, you wrote all this down while you were writing (or preparing to write) the function, and this should be more or less a session of copy-paste. Use

prompt(bancaMancaHuman, filename = "bancaMancaHuman.Rd") #one Rd skeleton per function

to create template Rd files ready for editing. They are more or less self-explanatory (with plenty of instructions). It helps if you know LaTeX, but it’s not necessary. Also, I suspect the function may dump the files into the correct man/ directory automatically – if it doesn’t, do give it a hand and move the files there yourself. Perhaps worth mentioning: if you want to reference functions outside your package, use (notice the optional square brackets [])

\code{\link[package:function]{function}}

, e.g.

\code{\link[raster:polygonsToRaster]{polygonsToRaster}}

or

\code{\link[utils:remove.packages]{remove.packages}}

– To refer to a function from your own package (one visible to the user), use

\code{\link{function_name}}

4a. If you have datasets you wish to include in your package (assuming those in library(help = "datasets") are not sufficient), you will need to do two things. First, prepare your object (list, data.frame, matrix…), save it, and prepare its documentation. The saved .rda file goes into the data/ directory; the documentation file goes into the same directory (man/) as the other .Rd files. If your dataset is not bigger than 1 MB you shouldn’t worry; otherwise consult the manual on how to compress it.

save(my.dataset, file = "my.dataset.rda") #move to data/ folder
promptData(my.dataset, filename = "my.dataset.Rd") #move to man/ folder and edit

4b. You should also build a vignette, where you can explain at greater length what your package is about and maybe give a detailed (or more detailed) workflow with the accompanying functions. You can use Sweave or knitr; the folder for your .Rnw file is vignettes/.
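To give you an idea, a bare-bones Sweave skeleton for vignettes/texasranger-intro.Rnw might look like this (the index entry text is whatever you want shown in the vignette listing):

%\VignetteIndexEntry{An introduction to texasranger}
\documentclass{article}
\begin{document}
\title{An introduction to texasranger}
\maketitle
R code chunks go between the markers below:
<<example>>=
1 + 1
@
\end{document}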

5. To check the documentation for errors, use

R CMD Rd2txt filename.Rd

and/or

R CMD Rdconv -t=html -o=filename.html filename.Rd

6. Next, you should run a check on your package before you build it. Run it from the directory where the package directory is located. I’ve dumped my package contents into d:/workspace/texasranger/ and executed the commands from d:/workspace/

R CMD check texasranger

If you get any errors, you will be referred to the output file. READ and UNDERSTAND it.

7. Build the package with the command

R CMD build package_name

This will create a file with the version appended (as specified in the DESCRIPTION file), e.g. package_name_1.0-1.tar.gz; see WRE for specifics on package version directives.

package_name is actually the name of the directory (which should be the name of your package as well).

If you use Windows, you can build a binary .zip file AND install the package into your library at the same time. Use the command

R CMD INSTALL --build package_name_1.0-1.tar.gz

8. Rejoice.

Join the Statistical Analysis Q&A beta

Statistical Analysis Q&A is a web page dedicated to exchanging questions, answers and ideas, primarily on statistics. See http://stackoverflow.com/ to get the general idea of what I’m talking about. You still have a chance to sign up for the closed beta. If you miss it, no worries mate – after about a week, it will be opened for the whole world to see.

See also Tal’s writing on the wall of R-statistics blog.

Just recently, a MetaOptimize page was launched to serve the statistical needs of people everywhere. The page has been building up steam, but I’m still not sure which way it’s gonna go. I hope MO and SA will complement each other and not get caught up in a stats feud.

Categories: R buzz