Double-censored regression in R

As a recent convert from Stata to R, one of my main problems is that the quality of the documentation in R is nowhere near what I was used to from Stata. That makes sense. Stata is a relatively expensive piece of software, and they sweat the details to ensure that the user is happy. R is more hit and miss.

A problem I ran into this week was trying to do a double-censored normal regression in R. This is where the outcome of interest can take one of three forms: left-censored, right-censored, or no censoring (where you observe the actual value). If you are an economist, think of a Tobit model with both upper and lower censoring points. To make the problem even more fun, the censoring points varied across observations. Both cnreg and intreg in Stata do this, but I was having a hard time figuring out how to do this in R.

As it turns out, the answer is interval regression from the survival package, which can also fit Tobit models. It took me a while to get past the name, and all the examples were of actual intervals, while my data are all points. To make it worse, coding it was not very intuitive and the documentation was of limited help. Hence, this post is mainly to remind my future self how to do it.

To make it easy to see what is going on, I am going to use simulated data. The outcome is censored from below at zero and from above at 50 for the first 500 observations and at 40 for the last 500 observations. You do not strictly need the two indicator variables, censoring_high and censoring_low, which flag observations right-censored at 50 and at 40, respectively, but I find the code easier to read with them.

library(tidyverse)
library(survival)
set.seed(2)

# Get 1,000 observations
b0 <- 17
b1 <- 0.5
id <- 1:1000
x1 <- runif(1000, min = -100, max = 100)
sigma <- 4
eps <- rnorm(1000, mean = 0, sd = sigma)
df <- tibble(id, x1, eps)

# Set up the data
df <- df %>%
  mutate(
    y = b0 + b1 * x1 + eps
  ) %>%
  mutate(
    # Convert all negative to zero
    y_cen = if_else(y < 0, 0, y)
  ) %>%
  mutate(
    # Convert first 500 obs > 50 to 50 (right-censored at the higher point)
    censoring_high = id <= 500 & y > 50,
    y_cen = replace(y_cen, censoring_high, 50),
    # Convert last 500 obs > 40 to 40 (right-censored at the lower point)
    censoring_low = id > 500 & y > 40,
    y_cen = replace(y_cen, censoring_low, 40)
  )

The trick is that you need to create a survival object with two variables indicating the left and right side of the "interval." Ironically, it was the documentation for Stata's intreg that helped me see how to do this. For those observations that are points and uncensored, you have the same value in both variables. For left-censored observations, you need minus infinity in the left variable and the censoring point in the right variable. For right-censored observations, you need the censoring point in the left variable and infinity in the right. In principle, you should be able to use NA instead of infinity, but I could not get it to work.

# Define left and right variables
df <- df %>%
  mutate(
    left = case_when(
      y_cen <= 0 ~ -Inf,
      y_cen > 0 ~ y_cen
    ),
    right = case_when(
      !censoring_low & !censoring_high ~ y_cen,
      censoring_low | censoring_high ~ Inf
    )
  )
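As a quick sanity check of this coding, here is a minimal sketch with three made-up observations (the values are hypothetical, not taken from the simulated data): one left-censored at 0, one observed exactly, and one right-censored at 50. It assumes a version of the survival package that accepts infinite endpoints, as the code above does.

```r
library(survival)

# Three hypothetical observations: left-censored at 0, exact at 12.5,
# right-censored at 50
toy_left  <- c(-Inf, 12.5, 50)
toy_right <- c(0, 12.5, Inf)

toy <- Surv(toy_left, toy_right, type = "interval2")

# The printed form marks censoring: trailing "-" = left, trailing "+" = right
print(toy)

# Status codes that Surv() inferred: 2 = left, 1 = exact, 0 = right
status_codes <- unclass(toy)[, "status"]
status_codes
```

If the status codes come out as expected on a handful of rows like this, the same left and right construction should be safe to apply to the full data.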

Once you have the data set up, you use the survival object as the outcome and "interval2" as type together with a Gaussian distribution. Confusingly, you do not need a censoring indicator at all, despite the documentation's claim that this is unusual and is equivalent to no censoring. With "interval2", the censoring status of each observation is inferred automatically from the left and right variables.

# OLS model
ols <- lm(y_cen ~ x1, data = df)
summary(ols)

# Survival model
res <- survreg(Surv(left, right, type = "interval2") ~ x1,
  data = df, dist = "gaussian"
)

# Show results
summary(res)
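To see what the interval regression buys you relative to OLS, here is a self-contained sketch along the same lines. It is a simplification of the setup above, not the same code: it uses base R plus survival only, and a single upper censoring point of 50 for all observations (my assumption, for brevity). The true values are an intercept of 17, a slope of 0.5, and a sigma of 4.

```r
library(survival)
set.seed(2)

# Simulate from y = 17 + 0.5 * x1 + eps, with eps ~ N(0, 4^2)
n  <- 1000
x1 <- runif(n, min = -100, max = 100)
y  <- 17 + 0.5 * x1 + rnorm(n, mean = 0, sd = 4)

# Censor from below at 0 and from above at 50 (one upper point for brevity)
lower <- 0
upper <- 50
y_cen <- pmin(pmax(y, lower), upper)

# Left-censored: (-Inf, 0); right-censored: (50, Inf); otherwise (y, y)
left  <- ifelse(y <= lower, -Inf, pmin(y, upper))
right <- ifelse(y >= upper, Inf, pmax(y, lower))

fit_ols <- lm(y_cen ~ x1)
fit_int <- survreg(Surv(left, right, type = "interval2") ~ x1,
                   dist = "gaussian")

# Coefficients side by side, plus the estimated sigma from survreg
round(rbind(ols = coef(fit_ols), interval = coef(fit_int)), 3)
fit_int$scale
```

The OLS slope on the censored outcome should be pulled toward zero, while the interval regression estimates should land close to the values used in the simulation.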

The summary does not provide much information on the amount of censoring, and I need that for my result table, so here is one way of getting that information. It is a little rough, but it works.

# Show descriptive stats on censoring (easier with tidyverse)
y_out <- as_tibble(res$y)

# 'time1' includes only the y values for observations used
obs_used <- length(y_out$time1)

# 2: left censored, 1: point, 0: right censored
censoring <- y_out %>%
  group_by(status) %>%
  summarise(
    count = n()
  )

left <- censoring %>% 
  filter(status == 2) %>% 
  select(count)

right <- censoring %>% 
  filter(status == 0) %>% 
  select(count)

# Example for including in LaTeX file
cat(
  "Of the", obs_used, "observations,", as.numeric(left[1,1]), "are left censored and", as.numeric(right[1,1]), "are right censored."
)

Birth Spacing in the Presence of Son Preference and Sex-Selective Abortions: India’s Experience over Four Decades

My latest paper, on how birth spacing changed in India with the introduction of sex selection, is now available. I am presenting a poster on this paper this coming Friday at the Population Association of America's annual meeting in Denver.

Title:

Birth Spacing in the Presence of Son Preference and Sex-Selective Abortions: India’s Experience over Four Decades

Abstract:

Strong son preference is typically associated with shorter birth spacing in the absence of sons, but access to sex selection has the potential to reverse this pattern because each abortion extends spacing by six to twelve months. I introduce a statistical method that simultaneously accounts for how sex selection increases the spacing between births and the likelihood of a son. Using four rounds of India’s National Family and Health Surveys, I show that, except for first births, the spacing between births increased substantially over the last four decades, with the most substantial increases among women most likely to use sex selection. Specifically, well-educated women with no boys now exhibit significantly longer spacing and more male-biased sex ratios than similar women with boys. Women with no education still follow the standard pattern of short spacing when they have girls and little evidence of sex selection, with medium-educated women showing mixed results. Finally, sex ratios are more likely to decline within spells at lower parities, where there is less pressure to ensure a son, and more likely to increase or remain consistently high for higher-order spells, where the pressure to provide a son is high.

Online versions of forthcoming papers

My paper with Shamma Alam on shocks and timing of fertility in Tanzania, which is forthcoming in the Journal of Development Economics, is now available here. A free version is available until early February through this link.

My paper with Yu-hsuan Su on child health across urban, slum, and rural areas, which is forthcoming in Demography, is available here.

Open research assistant position for Fall quarter

I am looking for a research assistant (RA) to work on a project that examines how determinants of urban fertility vary across countries. The RA will help clean and merge data from many different countries, research education systems in these countries, code variables, and, if time permits, run analyses. The minimum requirements are a working knowledge of RStudio (for example, currently taking or having taken an upper-level undergraduate course like Econ 4770 or similar), great attention to detail, and the ability to commit to 5-10 hours a week of work for the Fall quarter. A knowledge of other programming languages, GitHub, and basic Unix commands would be a plus since all work will be done on a University of Washington server.

The pay will be $15 an hour and hours are flexible. To apply, please email me a short statement of interest, a resume, an unofficial transcript, and an example of your R code by close of business 6 October. Shortlisted candidates may receive a short "test" assignment based on the project.

Please contact me if you have any questions about the project or the position.

PS To apply for this position, you have to be a student at Seattle University (graduate or undergraduate).

Paper on child health in India forthcoming in Demography

My paper with Yu-hsuan Su, "Differences in Child Health across Rural, Urban, and Slum Areas: Evidence from India," has been accepted for publication in Demography. The final version is here and the abstract for the paper is below.

The developing world is rapidly urbanizing, but our understanding of how child health differs across urban and rural areas is lacking. We examine the association between area of residence and child health in India, focusing on composition and selection effects. Simple height-for-age averages show that rural Indian children have the poorest health and urban children the best, with slum children in between. Controlling for wealth or observed health environment, the urban height-for-age advantage disappears, and slum children fare significantly worse than their rural counterparts. Hence, differences in composition across areas mask a substantial negative association between living in slums and height-for-age. This association is more negative for girls than boys. Furthermore, a large number of girls are "missing" in slums. We argue that this implies that the negative association between living in slums and health is even stronger than our estimate. The "missing" girls also help explain why slum girls appear to have a substantially lower mortality than rural girls do, whereas slum boys have a higher mortality risk than rural boys do. We estimate that slum conditions–which the survey does not adequately capture, such as overcrowding and open sewers–are associated with 20-37% of slum children's stunting risk.

Open Research Assistant Position

I am looking for a research assistant (RA) to work on a project that examines how determinants of urban fertility vary across countries. The RA will help clean and merge data from many different countries, research education systems in these countries, code variables, and, if time permits, run analyses. The minimum requirements are a working knowledge of RStudio (for example, having taken an upper-level undergraduate course like Econ 4770 or similar), great attention to detail, and the ability to commit to 5-10 hours a week of work for the Spring quarter. A knowledge of other programming languages, GitHub, and basic Unix commands would be a plus since all work will be done on a University of Washington server.

The pay will be $15 an hour and hours are flexible. To apply, please email me a short statement of interest, a resume, an unofficial transcript, and an example of your R code by close of business 24 March. Shortlisted candidates may receive a short "test" assignment based on the project.

Please contact me if you have any questions about the project or the position.

PS To apply for this position, you have to be a student at Seattle University (graduate or undergraduate).

New version of paper on sex-selective abortions in India

A new version of my paper on sex-selective abortion, fertility, and birth spacing in India is now available. The major change from the prior version is a new theoretical model that better ties the theoretical and empirical parts of the paper together, plus many edits throughout the paper. The new version is available here and the new online appendix here.

Of white space, line ending, and GREP in BBEdit

This is completely "inside baseball" and likely of little interest, unless you happen to use BBEdit and LaTeX and want to search and replace using GREP. As a side benefit, it gives me a break from actually working on my tenure file!

The basic problem is finding white space and then replacing it with LaTeX column separators (&), without having BBEdit include the line endings. An example text is:

2015    Spring  4770    1   3.9 4.5 4.4
2015    Spring  2110    3   4.4 4.5 4.4
2015    Fall    4760    1   4.3 4.3 4.3
2015    Fall    2110    2   4.4 4.3 4.3
2016    Winter  3100    3   4.4 4.5 4.4
2016    Winter  3100    2   4.0 4.5 4.4
2016    Spring  2110    2   4.5 4.4 4.4
2016    Spring  2110    3   4.8 4.4 4.4

I could, of course, just use the column copy and paste in BBEdit, but where is the fun in that!

My first inclination was to use

\s*

but that captures the line endings as well, so I turned to my trusted app Patterns and after a bit I settled on

([[:blank:]]*)

which looked like it did what I wanted in Patterns' search and replace window. The problem was that transferring this to BBEdit gave me

& 2 & 0 & 1 & 5     &  & S & p & r & i & n & g   &  & 4 & 7 & 7 & 0     &  & 1    &  & 3 & . & 9  &  & 4 & . & 5  &  & 4 & . & 4 & 
 & 2 & 0 & 1 & 5     &  & S & p & r & i & n & g   &  & 2 & 1 & 1 & 0      &  & 3    &  & 4 & . & 4  &  & 4 & . & 5  &  & 4 & . & 4 & 
 & 2 & 0 & 1 & 5     &  & F & a & l & l     &  & 4 & 7 & 6 & 0     &  & 1    &  & 4 & . & 3  &  & 4 & . & 3  &  & 4 & . & 3 & 
 & 2 & 0 & 1 & 5     &  & F & a & l & l     &  & 2 & 1 & 1 & 0     &  & 2    &  & 4 & . & 4  &  & 4 & . & 3  &  & 4 & . & 3 & 
 & 2 & 0 & 1 & 6     &  & W & i & n & t & e & r   &  & 3 & 1 & 0 & 0     &  & 3    &  & 4 & . & 4  &  & 4 & . & 5  &  & 4 & . & 4 & 
 & 2 & 0 & 1 & 6     &  & W & i & n & t & e & r   &  & 3 & 1 & 0 & 0     &  & 2    &  & 4 & . & 0  &  & 4 & . & 5  &  & 4 & . & 4 & 
 & 2 & 0 & 1 & 6     &  & S & p & r & i & n & g   &  & 2 & 1 & 1 & 0     &  & 2    &  & 4 & . & 5  &  & 4 & . & 4  &  & 4 & . & 4 & 
 & 2 & 0 & 1 & 6     &  & S & p & r & i & n & g   &  & 2 & 1 & 1 & 0     &  & 3    &  & 4 & . & 8  &  & 4 & . & 4  &  & 4 & . & 4 &

which is not exactly what I had in mind! After quite a bit of digging around, thinking that the problem was with POSIX character classes in BBEdit, I finally figured out that the quantifiers are treated differently in BBEdit than in Patterns. The correct version in BBEdit is

([[:blank:]]+)

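The "*" quantifier means zero or more, so the pattern also matches the empty string between every pair of characters. BBEdit replaces each of those empty matches, whereas Patterns for some reason does not. Using the "+" version and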

\1 \&

gave me this beautifully formatted text instead

2015     & Spring   & 4770     & 1    & 3.9  & 4.5  & 4.4
2015     & Spring   & 2110     & 3    & 4.4  & 4.5  & 4.4
2015     & Fall     & 4760     & 1    & 4.3  & 4.3  & 4.3
2015     & Fall     & 2110     & 2    & 4.4  & 4.3  & 4.3
2016     & Winter   & 3100     & 3    & 4.4  & 4.5  & 4.4
2016     & Winter   & 3100     & 2    & 4.0  & 4.5  & 4.4
2016     & Spring   & 2110     & 2    & 4.5  & 4.4  & 4.4
2016     & Spring   & 2110     & 3    & 4.8  & 4.4  & 4.4

Now I just need the end of line symbols (and to make the next 14 tables) and I am done!

By the way, instead of the POSIX version you could use

([^\S\r\n]+)

The same thing applies to the quantifier.
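The same behavior can be checked outside BBEdit. Here is a minimal sketch in R using Perl-compatible regular expressions; the two-row sample text and the assumption that the columns are separated by tabs are mine, and for simplicity the replacement drops the original white space instead of keeping it with a backreference.

```r
# Two rows of tab-separated text (tabs are an assumption about the input)
txt <- "2015\tSpring\t4770\n2015\tFall\t4760"

# "*" also matches the empty string; "+" requires at least one character
star_matches_empty <- grepl("^[[:blank:]]*$", "", perl = TRUE)
plus_matches_empty <- grepl("^[[:blank:]]+$", "", perl = TRUE)

# [[:blank:]] covers spaces and tabs only, so the line ending survives
blank_version <- gsub("[[:blank:]]+", " & ", txt, perl = TRUE)

# The negated-class alternative behaves the same way
negated_version <- gsub("[^\\S\\r\\n]+", " & ", txt, perl = TRUE)

# \s also matches the newline, so the two rows get merged into one
space_version <- gsub("\\s+", " & ", txt, perl = TRUE)

cat(blank_version, "\n")
cat(space_version, "\n")
```

The first cat() call should print two rows with " & " between the columns, while the second prints everything on a single line, which is the line-ending problem described above.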

Control of fertility using only traditional contraceptives?

Shamma Alam and I just finished a paper on the effects of income shocks on the timing of fertility in Tanzania using the Kagera data set. There are significant reductions in the likelihoods of being pregnant and giving birth following shocks, consistent with prior results in the literature. What is new is that we can show that this is predominantly the result of an increased use of contraceptives. This is interesting for two reasons. First, it shows that the postponement of fertility following a shock is the result of a conscious decision, rather than being an unintended consequence of the shocks' effect on, for example, health or migration. Second, the postponement is achieved almost entirely through the use of traditional contraceptives. This shows that, once the incentives are strong enough, people are able to control their fertility even in the absence of modern contraceptives. The full abstract is:

This paper examines the relationship between household income shocks and fertility decisions. Using panel data from Tanzania, we estimate the impact of agricultural shocks on pregnancy, births, and contraception use. We estimate individual level fixed effect models to account for potential correlation between unobservable household characteristics and both shocks and decisions on fertility and contraceptive use. The likelihood of pregnancies and childbirth are significantly lower for households that experience a crop shock. Furthermore, women significantly increase their contraception use in response to crop losses. We find little evidence that the response to crop loss depends on education or wealth levels. The increase in contraceptive use comes almost entirely from traditional contraceptive methods, such as abstinence, withdrawal, and the rhythm method. We argue that these changes in behavior are the result of deliberate decisions of the households rather than the shocks' effects on other factors that influence fertility, such as women’s health status, the absence or migration of a spouse, the dissolution of partnerships, or the number of hours worked. We also show that, although traditional contraceptives have low overall efficacy, households with a strong incentive to postpone fertility are very effective at using them.

New version of my paper on sex-selective abortions in India

After working through many excellent comments from three referees, I now have a revised version of my paper on sex-selective abortions in India. There is also now a substantial online appendix (89 pages). The new abstract is:

This paper addresses two main questions: what is the relationship between fertility and sex selection and how does birth spacing interact with the use of sex-selective abortions? I introduce a statistical method that incorporates how sex-selective abortions affect both the likelihood of a son and spacing between births. Using India's National Family and Health Surveys, I show that falling fertility intensifies use of sex selection, leading to use at lower parities, and longer spacing after a daughter is born. Women with 8 or more years of education, both in urban and rural areas, are the main users of sex-selective abortions and have the lowest fertility. Women with less education have substantially higher fertility and do not appear to use sex selection. Predicted lifetime fertility for high-education women declined more than 10% between 1985–1994, when sex selection was legal, and 1995–2006, when sex selection was illegal. Fertility is now around replacement level. Abortions per woman increased almost 20% for urban women and 50% for rural women between the two periods, suggesting that making sex selection illegal has not reversed its use. Finally, sex selection appears to be used to ensure one son rather than multiple sons.

Parental absence paper published in Review of Economics of the Household

My paper, "Effects of Parental Absence on Child Labor and School Attendance in the Philippines," was published in the Review of Economics of the Household, vol. 14(1), pp 103-130, 2016.

This paper uses longitudinal data from the Philippines to analyze determinants of children’s time allocation. The estimation method takes into account both the simultaneity of time use decisions, by allowing for correlation of residuals across time uses, and unobservable family heterogeneity, through the inclusion of household fixed effects. Importantly, this improved estimation method leads to different results than when applying the methods previously used in the literature. Girls suffer significantly from the absence of their mother with a reduction in time spent in school that is equivalent to dropping out completely. This effect is substantially larger when controlling for household unobservables than when not. Boys increase time spent working on market related activities in response to an absent father, although this time appears to come out of leisure rather than school or doing household chores. Land ownership substantially increases the time boys spend on school activities, whereas renting land reduces the time girls spend on school. Finally, there does not appear to be a substantial trade-off between time spent on school and work, either in the market or at home.

Compensating wage differentials in an online labor market

We have recently finished the first in a series of planned papers based on experiments that we ran. The first paper is on the compensating wage differentials theory, with the title "Only if You Pay Me More: Field Experiments Support Compensating Wage Differentials Theory". Abstract below:

Compensating wage differentials is Adam Smith’s idea that wage differences equalize differences in job and worker characteristics. Other than risk of death, however, no job characteristics have consistently been found to affect wages, likely because of problems with self-selection and unobservable job characteristics. We run experiments in an online labor market, randomizing offered pay and job characteristics, thereby overcoming both problems. We find, as predicted by our model, that increasing job disamenities significantly reduces both likelihood of working and amount of work supplied. Correspondingly, the wage increases necessary to compensate workers for worse job disamenities are substantial, supporting the theory.

Congratulations to Chasya Hoagland

A big congratulations to Chasya Hoagland, who successfully defended her PhD thesis today.

Chasya's main paper is on black women's hair styles and how people react to them using a very neat experiment. From the results it looks like hair styles affect perceptions of worker quality if there is imperfect information, but not if there is strong information about quality.

She also has a paper on the different effects of test scores, grades, and self-perception of ability in math and English on schooling outcomes and subsequent earnings. Even controlling for test scores and grades, there is a strong effect of perceived ability on earnings, although not much effect on schooling outcomes.

Sex-selective abortion paper available as World Bank Policy Research Working Paper

My paper on sex-selective abortions in India is now available as a World Bank Policy Research Working Paper. You can find it here. The abstract is:

Previous research on sex-selective abortions has ignored the interactions between fertility, birth spacing, and sex selection, despite both fertility and birth spacing being important considerations for parents when deciding on the use of sex selection. This paper presents a novel approach that jointly estimates the determinants of sex-selective abortions, fertility, and birth spacing, using data on Hindu women from India's National Family and Health Surveys. Women with eight or more years of education in urban and rural areas are the main users of sex-selective abortions and they also have the lowest fertility. Predicted lifetime fertility for these women declined 11 percent between the 1985-1994 and 1995-2006 periods, which correspond to the periods of time before and after sex selection became illegal. Fertility is now around replacement level. This decrease in fertility has been accompanied by a 6 percent increase in the predicted number of abortions during the childbearing years between the two periods, and sex selection is increasingly used for earlier parities. Hence, the legal steps taken to combat sex selection have been unable to reverse its use. Women with fewer than eight years of education have substantially higher fertility and do not appear to use sex selection.