As a recent convert from Stata to R, one of my main problem is that the quality of the documentation in R is nowhere near what I was used to from Stata. That makes sense. Stata is a relatively expensive piece of software, and they sweat the details to ensure that the user is happy. R is more hit and miss.

A problem I ran into this week was trying to do a double-censored normal regression in R. This is where the outcome of interest can take one of three forms: left-censored, right-censored, or no censoring (where you observe the actual value). If you are an economist, think a Tobit model with both upper and lower censoring points. To make the problem even more fun, the censoring points varied across observations. Both cnreg and intreg in Stata do this, but I was having a hard time figuring out how to do this is in R.

As it turns out, the answer is interval regression from the survival package, which can also fit Tobit models. It took me a while to get past the name, and all the examples were of actual intervals, while my data are all points. To make it worse, coding it was not very intuitive and the documentation of limited help. Hence, this post is mainly to remind my future self how to do it.

To make it easy to see what is going on, I am going to use simulated data. The outcome is censored from below at zero and from above at 50 for the first 500 observations and 40 for the last 500 observations. You do not need the two censoring variables (censoring_low and censoring_high), but I find it easier to read.

library(tidyverse)
library(survival)
set.seed(2)

# Get 1,000 observation
b0 <- 17
b1 <- 0.5
id <- 1:1000
x1 <- runif(1000, min = -100, 100)
sigma <- 4
eps <- rnorm(1000, mean = 0, sigma)
df <- bind_cols(data_frame(id), data_frame(x1), data_frame(eps))

# Set up the data
df <- df %>%
  mutate(
    y = b0 + b1 * x1 + eps
  ) %>%
  mutate(
    # Convert all negative to zero
    y_cen = if_else(y < 0, 0, y)
  ) %>%
  mutate(
    # Convert first 500 obs > 50 to 50
    censoring_high = id < 500 & y > 50,
    y_cen = replace(y_cen, censoring_high, 50),
    # Convert last 500 obs > 40 to 40
    censoring_low = id >= 500 & y > 40,
    y_cen = replace(y_cen, censoring_low, 40)
  )

The trick is that you need to create a survival object with two variables indicating the left and right side of the "interval." Ironically, it was the documentation for Stata's intreg that helped me see how to do this. For those observations that are points and uncensored, you have the same value in both variables. For left-censored observations, you need minus infinity in the left variable and the censoring point in the right variable. For right-censored observations, you need the censoring point in the left variable and infinity in the right. In principle, you should be able to use NA instead of infinity, but I could not get it to work.

# Define left and rigth variables
df <- df %>%
  mutate(
    left = case_when(
      y_cen <= 0 ~ -Inf,
      y_cen > 0 ~ y_cen
    ),
    right = case_when(
      !censoring_low & !censoring_high ~ y_cen,
      censoring_low | censoring_high ~ Inf
    )
  )

Once you have the data set up, you use the survival object as the outcome and "interval2" as type together with a Gaussian distribution. Confusingly, you do not need a censoring indicator at all, despite the documentation's claim that this is unusual and is equivalent to no censoring. The model generates the censoring status automatically.

# OLS model
ols <- lm(y_cen ~ x1, data = df)
summary(ols)

# Survival model
res <- survreg(Surv(left, right, type = "interval2") ~ x1,
  data = df, dist = "gaussian"
)

# Show results
summary(res)

The summary does not provide much information on the amount of censoring, and I need that for my result table, so here is one way of getting that information. It is a little rough, but it works.

# Show descriptive stats on censoring (easier with tidyverse)
# Updated on 2018-11-28 to address updated packages
# Thank you to Kirk Wythers for alerting me to this
y_out <- as.tibble(res$y[ , 1:3])

# 'time1' in include only y values for observations used
obs_used <- length(y_out$time1)

# 2: left censored, 1: point, 0: right censored
censoring <- y_out %>%
  group_by(status) %>%
  summarise(
    count = n()
  )

left <- censoring %>% 
  filter(status == 2) %>% 
  select(count)

right <- censoring %>% 
  filter(status == 0) %>% 
  select(count)

# Example for including in LaTeX file
cat(
  "Of the", obs_used, "observations,", as.numeric(left[1,1]), "are left censored and", as.numeric(right[1,1]), "are right censored."
)

Double-censored regression in R

Stata screencast on model validation using graphs

Third Stata screencast - this one on do-files

Second Stata screencast

First screencast on Stata