1 Introduction
Understanding the causal effects of an intervention is a key question in many applications, from personalised medicine to marketing (e.g. Sun et al. (2015); Wager and Athey (2017); Alaa and van der Schaar (2017)). Predicting the causal outcome in these situations typically involves dealing with high-dimensional observational data that is frequently subject to confounding, where the actions in the data depend on variables that may also indirectly influence the outcome.
In general, we distinguish between measured and hidden confounding. When confounders are directly measured, they may be accounted for using techniques that correct for their effects, such as propensity reweighting (IPS) or covariate shift (Hernán and Robins, 2006; Rosenbaum and Rubin, 1984). In contrast, to account for hidden confounding, proxy variables may be used as noisy representatives of latent confounders (Greenland and Lash, 2008; Pearl, 2012; Kuroki and Pearl, 2014; Louizos et al., 2017). While each of these approaches has its merits, both can only be applied when covariate data is completely measured. This assumption does not hold in a large number of settings such as medicine. For example, doctors are interested in identifying treatments that improve patient outcomes, and have to base decisions on hundreds of potentially confounding variables such as age and genetic factors. Here, a doctor may readily have access to many routine measurements such as blood count data for all patients, but may only have genetic information for some patients. Inferring the causal effects of a treatment in such a setting requires learning a joint distribution over covariates and confounders of patients whose data is completely observable, while simultaneously transferring this knowledge to patients whose data is incomplete. In low-dimensional discrete settings, performing such a knowledge transfer is relatively straightforward, as one can average over covariates and directly use this information to make inferences about cases where covariates are missing. However, in high-dimensional settings this is not achievable in practice since we have to integrate over all missing covariates.
In this paper, we propose addressing the problem of performing causal inference with partial covariate information from a decision-theoretic point of view. We emphasise that our approach can be generalised to both hidden and measured confounders. Specifically, we assume that a fixed set of measurements is unavailable for a subset of the data (or patients) at test time. The key idea is to use the Information Bottleneck (IB) criterion (Tishby et al., 2000; Alemi et al., 2016) to perform a sufficient reduction of the covariates and recover a distribution of the confounding information. In particular, the IB enables us to build a discrete reference class over patients whose covariate data is complete, to which we can map patients with incomplete data and estimate treatment effects on the basis of such a mapping.
Our contributions may thus be summarised as follows: We learn a discrete, low-dimensional, interpretable latent space representation of confounding information. We use the discrete, low-dimensional representation to estimate the causal effect for data with missing covariates at test time. This representation allows us to learn equivalence classes among patients such that the specific treatment effect of a patient can be approximated by the specific treatment effect of the subgroups. Finally, we demonstrate that our method outperforms existing approaches across established causal inference benchmarks and a real-world application for treating sepsis.
2 Preliminaries and Related Work
Potential Outcomes and Counterfactual Reasoning
Counterfactual reasoning (CR) has drawn considerable attention, particularly in the medical community. Counterfactual models are rooted in causal inference and may be used to determine the causal effects of an intervention. These models are formalised in terms of potential outcomes (Splawa-Neyman, 1923, 1990; Rubin, 1978). Assume we have two choices: taking a treatment $t$, and not taking a treatment (control) $c$. Let $Y_t$ denote the outcomes under $t$ and $Y_c$ denote the outcomes under the control $c$. The counterfactual approach assumes that there is a pre-existing joint distribution $P(Y_t, Y_c)$ over outcomes. This joint distribution is hidden since $t$ and $c$ cannot be applied simultaneously. Applying an action $t$ thus only reveals $Y_t$, but not $Y_c$. In this setting, computing the effect of an intervention involves computing the difference between the outcome when an intervention is made and the outcome when no treatment is applied (Pearl, 2009; Morgan and Winship, 2015). We would subsequently choose to treat with $t$ if,
$$\mathbb{E}[L(Y_t)] \leq \mathbb{E}[L(Y_c)] \tag{1}$$
for loss $L$ over $Y_t$ and $Y_c$ respectively. Potential outcomes are typically applied to cross-sectional data (Schulam and Saria, 2017b, a) and sequential time settings. Notable examples of models for counterfactual reasoning include Johansson et al. (2016) and Bottou et al. (2013). Specifically, Johansson et al. (2016) propose a neural network architecture called TARnet to estimate the effects of interventions. Similarly, Gaussian Process CR (GPCR) models are proposed in Schulam and Saria (2017b, a) and further extended to the multi-task setting in Alaa and van der Schaar (2017). Off-policy evaluation methods in reinforcement learning (RL) offer another perspective for reasoning about counterfactuals, and have been extensively explored to estimate the outcomes of a particular policy or series of treatments based on retrospective observational data (see for example Dudík et al. (2011), Thomas and Brunskill (2016), Jiang and Li (2016)). Importantly, none of these approaches enables estimating treatment effects in the presence of confounding and partial covariate information. We emphasise that using the IB principle to perform causal inference in these settings is (to our knowledge) novel.
Decision-Theoretic View of Causal Inference
The decision-theoretic approach to causal inference focuses on studying the effects of causes rather than the causes of effects (Dawid, 2007). Here, the key question is: what is the effect of the causal action on the outcome? The outcome may be modelled as a random variable $Y$ for which we can set up a decision problem. That is, at each point, the value of $Y$ depends on whether the treatment $t$ or the control $c$ is selected. The decision-theoretic view of causal inference considers the distributions of outcomes given the treatment or control, $P_t$ and $P_c$, and explicitly computes an expected loss of $Y$ with respect to each action choice. Finally, for a loss function $L$, the choice to treat with $t$ is made using Bayesian decision theory if,
$$\mathbb{E}_{Y \sim P_t}[L(Y)] \leq \mathbb{E}_{Y \sim P_c}[L(Y)] \tag{2}$$
Thus in this setting, causal inference involves comparing the expected losses over the hypothetical distributions $P_t$ and $P_c$ for outcome $Y$.
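As a minimal illustration of the decision rule in Equation 2, the following sketch compares Monte Carlo estimates of the expected loss under the treatment and control distributions. The sample values, the squared-error loss and the function names are assumptions made purely for illustration; they are not taken from the experiments in this paper.

```python
from statistics import mean


def expected_loss(outcomes, loss):
    """Monte Carlo estimate of the expected loss E[L(Y)] from sampled outcomes."""
    return mean(loss(y) for y in outcomes)


def choose_action(outcomes_t, outcomes_c, loss):
    """Return 't' if treating has the lower (or equal) expected loss, else 'c'."""
    if expected_loss(outcomes_t, loss) <= expected_loss(outcomes_c, loss):
        return "t"
    return "c"


# Hypothetical outcome samples under treatment and control (made-up values).
samples_t = [0.9, 1.1, 1.0, 0.8]
samples_c = [0.4, 0.6, 0.5, 0.7]


def squared_loss(y, target=1.0):
    """Hypothetical loss: squared distance from a desired outcome of 1.0."""
    return (y - target) ** 2


decision = choose_action(samples_t, samples_c, squared_loss)
```

With these made-up samples the treatment has the smaller expected loss, so the rule selects the treatment. In practice the two outcome distributions are not observed jointly, which is why the remainder of the paper is concerned with estimating them from observational data.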
Information Bottleneck
The IB principle (Tishby et al., 2000) describes an information-theoretic approach which is used to compress a random variable $X$ with respect to a second random variable $Y$. The compression of $X$ is described by another random variable $Z$. Achieving an optimal compression requires solving the following problem,
$$\min_{p(z \mid x)} I(X; Z) - \lambda I(Z; Y) \tag{3}$$
under the assumption that $Y$ and $Z$ are conditionally independent given $X$. Here, $I$ represents the mutual information between two random variables and $\lambda$ controls the degree of compression. In its classical form, the IB principle is defined only for discrete random variables. However, in recent years multiple IB relaxations for Gaussian (Chechik et al., 2005) and meta-Gaussian variables (Rey and Roth, 2012) have been proposed.
Deep Latent Variable Models
Deep latent variable models have recently received remarkable attention and have been applied to a variety of problems. Among these, variational autoencoders (VAEs) employ the reparameterisation trick introduced in Kingma and Welling (2013); Rezende et al. (2014) to infer a variational approximation over the posterior distribution of the latent space. Important work in this direction includes Kingma et al. (2014) and Jang et al. (2017). Most closely related to the work we present here is the application of VAEs in a healthcare setting by Louizos et al. (2017). Here, the authors introduce a Cause-Effect VAE (CEVAE) to estimate the causal effect of an intervention in the presence of noisy proxies.
Despite their differences, it has been shown that there are several close connections between the VAE framework and the IB principle. Alemi et al. (2016) introduce the Deep Information Bottleneck (DIB). This is essentially a VAE where the input $X$ is replaced by the target $Y$ in the decoder. In contrast, the approach in this paper uses the IB principle to perform causal inference in scenarios where only partial covariate data is available at test time.
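To make the classical discrete IB objective in Equation 3 concrete, the sketch below evaluates it for two hand-written encoders on a toy joint distribution. Writing X for the input, Y for the target, Z for the compression and `lam` for the trade-off parameter is a naming assumption of this sketch, as are the toy distribution and both encoders; it is an illustration of the objective, not the deep variational model used later in the paper.

```python
import math


def mutual_information(joint):
    """Mutual information I(A;B) in nats for a joint table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (pa[a] * pb[b]))
    return mi


def ib_objective(p_xy, enc, lam):
    """Evaluate I(X;Z) - lam * I(Z;Y) for a discrete encoder enc[x][z] = p(z|x).

    Relies on the IB assumption that Z depends on Y only through X,
    so that p(z, y) = sum_x p(x, y) * p(z|x).
    """
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    p_xz = [[p_x[x] * enc[x][z] for z in range(n_z)] for x in range(n_x)]
    p_zy = [
        [sum(p_xy[x][y] * enc[x][z] for x in range(n_x)) for y in range(n_y)]
        for z in range(n_z)
    ]
    return mutual_information(p_xz) - lam * mutual_information(p_zy)


# Toy joint p(x, y): Y only depends on whether x lies in {0, 1} or in {2, 3}.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]

# Encoder that merges X into two clusters versus one that keeps all four values.
enc_merge = [[1, 0], [1, 0], [0, 1], [0, 1]]
enc_identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

obj_merge = ib_objective(p_xy, enc_merge, lam=1.0)
obj_identity = ib_objective(p_xy, enc_identity, lam=1.0)
```

Both encoders retain the same information about Y in this toy example, but the merging encoder compresses X more strongly and therefore attains the smaller objective value. This is the behaviour the IB criterion rewards: discard whatever in X is irrelevant for predicting Y.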
3 Method
In this section, we present an approach based on the IB principle for estimating the causal effects of an intervention with incomplete covariate information. We refer to this model as CEIB. In recent years, there has been a growing interest in the connections between the IB principle and deep neural networks (Tishby and Zaslavsky, 2015; Alemi et al., 2016; Wieczorek et al., 2018). Here, we combine the non-linear expressiveness of neural networks with the IB principle to recover a low-dimensional interpretable representation for approximating the causal effects of an intervention more effectively. In Figure 1, we illustrate an overview of the possible configurations for performing causal inference and present our model in the context of existing work. The corresponding causal graphs for Cases I and II are shown in Figure 2. The major difference between I and II is the reversal of the arrow between the covariates $X$ and the latent variable $Z$, and the fact that in Case II confounders are not measured, but indirectly observed via noisy proxies.
In this paper, we interpret our model from the decision-theoretic view of causal inference presented in Section 2. Like other approaches in the decision-theoretic setting, our goal is to estimate the Average Causal Effect (ACE) of the treatment $T$ on the outcome $Y$. Let $F_T$ denote the regime indicator, where $F_T \in \{0, 1\}$ defines the interventional regimes and $F_T = \emptyset$ the observational regime. The ACE is then given by,
$$\text{ACE} := \mathbb{E}[Y \mid F_T = 1] - \mathbb{E}[Y \mid F_T = 0] \tag{4}$$
Evidently, the ACE in Equation 4 is defined in terms of the interventional regimes; however, in practice we can only collect data under the observational regime, $F_T = \emptyset$. The observational counterpart of the ACE may formally be defined as:
$$\text{ACE}' := \mathbb{E}[Y \mid T = 1, F_T = \emptyset] - \mathbb{E}[Y \mid T = 0, F_T = \emptyset] \tag{5}$$
In general, the ACE and observational ACE are not equal unless we assume ignorable treatment assignments. Dawid (2007) shows that the ACE and observational ACE are equivalent under the conditional independence assumption $Y \perp\!\!\!\perp F_T \mid T$. This assumption expresses that the distribution of $Y$ given $T$ is the same in the interventional and observational regimes. It can also be extended to account for the notion of confounding. Here, the treatment assignment may be ignored when estimating $Y$, provided a sufficient covariate $Z$ and $T$. Formally, $Z$ is a sufficient covariate for the effect of $T$ on outcome $Y$ if $Z \perp\!\!\!\perp F_T$ and $Y \perp\!\!\!\perp F_T \mid (Z, T)$. It can also be shown via Pearl's backdoor criterion (Pearl, 2009) that the ACE may be defined in terms of the Specific Causal Effect (SCE),
$$\text{ACE} = \mathbb{E}[\text{SCE}_Z \mid F_T = \emptyset] \tag{6}$$
where
$$\text{SCE}_Z := \mathbb{E}[Y \mid Z, T = 1, F_T = \emptyset] - \mathbb{E}[Y \mid Z, T = 0, F_T = \emptyset] \tag{7}$$
In our paper, we consider the decision-theoretic approach of Dawid (2007) to estimate the causal effect where we have both hidden and measured confounding with incomplete covariates. This involves computing the ACE. Importantly, estimating the ACE only requires computing a distribution in Figure 2. In what follows, we use the IB to learn a sufficient covariate that allows us to approximate this distribution.
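The relationship between the ACE and the SCE in Equations 6 and 7 can be illustrated with a small worked example in which the sufficient covariate is discrete. The records below are made-up (z, t, y) triples, and the function names are assumptions of this sketch; it averages group-wise effects over the observed distribution of the covariate and contrasts the result with the naive difference of means.

```python
from collections import defaultdict


def sce_by_group(records):
    """Specific causal effect per covariate value z: E[y | z, t=1] - E[y | z, t=0].

    Assumes every group contains at least one treated and one control record.
    """
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])  # [sum_y0, n0, sum_y1, n1]
    for z, t, y in records:
        s = sums[z]
        s[2 * t] += y
        s[2 * t + 1] += 1
    return {z: s[2] / s[3] - s[0] / s[1] for z, s in sums.items()}


def ace(records):
    """ACE as the average of the SCE over the observed distribution of z."""
    sce = sce_by_group(records)
    counts = defaultdict(int)
    for z, _, _ in records:
        counts[z] += 1
    n = len(records)
    return sum(counts[z] / n * sce[z] for z in sce)


def naive_difference(records):
    """Difference of mean outcomes between treated and control, ignoring z."""
    y1 = [y for _, t, y in records if t == 1]
    y0 = [y for _, t, y in records if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


# Made-up records (z, t, y): group z=1 is treated more often and has higher outcomes.
records = [
    (0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0), (0, 1, 2.0),
    (1, 0, 3.0), (1, 1, 5.0), (1, 1, 5.0), (1, 1, 5.0),
]

adjusted = ace(records)
unadjusted = naive_difference(records)
```

In this toy data the group-wise effects are 1 and 2, giving an adjusted ACE of 1.5, whereas the naive difference of means is 2.75 because treatment assignment depends on the covariate. The gap between the two numbers is the confounding bias that conditioning on a sufficient covariate removes.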
Case I: Measured Confounding
This case occurs when we have a high-dimensional observational data set where all the relevant confounding variables are measured, but where a fixed set of covariates is only available for some subset of the data at test time. We propose modelling the task using a cause-effect IB with the architecture proposed in Figure 3. The IB approach allows us to learn a low-dimensional interpretable compression of the relevant information during training, which we can use to make causal inferences where data is incomplete at test time. Let and be our covariate sets (both available at training). We adapt the IB for learning the outcome of a therapy when partial covariate information is available for at test time. To do so, we consider the following parametric form,
(8) 
where and are low-dimensional discrete representations of the covariate data, is a concatenation of and and represents the mutual information parameterised by networks , , and respectively. We assume a parametric form of the conditionals , , , , as well as Markov chain . The three terms in Equation 8 have the following forms:
as a result of the Markov assumption in the IB model. Here is the entropy of . For the decoder model, we use an architecture similar to the TARnet (Johansson et al., 2016), where we replace conditioning on high-dimensional covariates with conditioning on latent . We can thus express the conditionals as,
(9) 
with logistic function , and outcome given by a Gaussian distribution parameterised with a TARnet with . Note that the terms correspond to neural networks. While distribution is included to ensure the joint distribution over treatments, outcomes and covariates is identifiable, in practice our goal is to approximate the effects of a given on . Hence, we train our model in a teacher-forcing fashion by using the true s from data and fix the s at test time. Since and are discrete latent representations of the covariate information, we make use of the Gumbel-softmax reparameterisation trick (Jang et al., 2017) to draw samples from a categorical distribution with probabilities .
Case II: Hidden Confounding
This case is analogous to the work of Louizos et al. (2017), where the authors proposed a variational autoencoder approach to deal with hidden confounding in the context of proxy variables. Louizos et al. (2017) assumed a VAE architecture where they explicitly include a network for estimating in the decoder model. This approach, however, requires the use of two auxiliary networks for predicting and for out-of-sample cases. We instead treat proxies as measured confounders and propose using Case I to estimate the causal effect here. Using Case I is permissible since both DAGs in Figure 2 are Markov equivalent, and the causal direction between and can only be determined by additional assumptions on the causal graph. However, assuming the causal structure in Figure 1(b) as in Louizos et al. (2017) requires the definition of a complex prior over . In the case of a high-dimensional with a complex dependency structure, it is extremely difficult to define such a prior in practice. Hence, it may be more natural to treat all covariates, including proxies, as measured confounders, as we propose in this paper. In doing so, we compress the relevant information to a sufficient covariate as described in Case I.
Once we can estimate in both cases using the proposed model, we can compute the ACE from Equation 5. The discrete representation enables us to learn equivalence classes among patients such that we can use the SCE of the subgroups from Equations 6 and 7 to approximate the individualised treatment effect. In particular, when given a test patient with partial covariates, we can assign them to the closest equivalence class of patients with similar characteristics, and approximate the effect of treatments for them on the basis of the equivalence class.
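The idea of mapping a test patient with partial covariates to an equivalence class can be sketched as follows. Note that this sketch swaps in a simple nearest-centroid rule over the observed covariates as a stand-in for the learned encoder; the centroids, the per-class effects and the function names are all hypothetical and serve only to illustrate the lookup step.

```python
def assign_class(x_partial, centroids):
    """Index of the nearest class centroid, using only observed covariates.

    Missing covariates are marked with None and skipped in the distance.
    """
    def sq_dist(centroid):
        return sum(
            (xi - ci) ** 2
            for xi, ci in zip(x_partial, centroid)
            if xi is not None
        )

    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def approximate_effect(x_partial, centroids, class_sce):
    """Approximate a patient's treatment effect by the SCE of their class."""
    return class_sce[assign_class(x_partial, centroids)]


# Hypothetical class centroids (learned from complete data) and per-class SCEs.
centroids = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
class_sce = [0.5, 2.0]

# A test patient whose second covariate is missing.
patient = [0.9, None, 0.8]
effect = approximate_effect(patient, centroids, class_sce)
```

Here the patient is closest to the second class on the two observed covariates, so the second class's effect is returned. The design point is that the effect estimate depends on the patient only through class membership, which is what makes estimation with missing covariates feasible.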
4 Experiments
The lack of ground truth in real-world data makes evaluating causal inference algorithms a difficult problem. To overcome this issue, existing approaches typically either a) use synthetic or semi-synthetic data sets where the outcomes and treatment assignment are fully known, or b) use randomised controlled trials. We use a semi-synthetic benchmark data set from McCormick et al. (2013) that is frequently used in causal inference studies. We also demonstrate the performance of our approach on a high-dimensional real-world task for managing and treating sepsis. Our implementation uses TensorFlow (Abadi et al., 2015), and the neural architectures considered in experiments (unless otherwise stated) have 3 hidden layers. Our model is trained with the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 0.001. An additional experiment using a binary twins benchmark (Almond et al., 2005) is provided in the supplement.
4.1 Infant Health and Development Program
The Infant Health and Development Program (IHDP) (McCormick et al., 2013; Hill, 2011) is a randomised controlled experiment assessing the impact of educational intervention on outcomes of premature, low birth weight infants born in 1984-1985. Measurements from children and their mothers were collected for studying the effects of childcare and home visits from a trained specialist on test scores. Briefly, the study contains information about the children and their mothers/caregivers. Data on the children include treatment group, sex, birth weight and health indices. Information about the mothers includes maternal age, mother's race and educational achievement. Hill (2011) extracts features and treatment assignments from the real-world clinical trial, and introduces selection bias to the data artificially by removing a non-random portion of the treatment group, in particular children with non-white mothers. In total, the data set consists of 747 subjects (139 treated, 608 control), each represented by 25 covariates measuring properties of the child and their mother. The data set is divided 60/20/20% into training/validation/testing sets.
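A 60/20/20% split of the 747 subjects can be sketched as below. The shuffling seed, the rounding convention (flooring the training and validation sizes and giving the remainder to the test set) and the function name are assumptions of this sketch, since the exact splitting procedure is not specified here.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test parts.

    Train and validation sizes are floored; the test set takes the remainder.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 747 subjects, as in the IHDP data set described above.
train_idx, val_idx, test_idx = split_indices(747)
```

Under this rounding convention the three parts contain 448, 149 and 150 subjects respectively. Fixing the seed keeps the partition reproducible across runs.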
Table 1: Within-sample and out-of-sample mean and standard errors for the metrics across models on the IHDP data set. A smaller value indicates better performance. Bold values indicate the method with the best performance.
Methods compared: OLS-1, OLS-2, KNN, BLR, TARnet, BNN, RF, CEVAE, CFRW, CEIB.
For our experiment, we compare the performance of the proposed approach for predicting the ACE against several existing baselines. Descriptions of these baselines can be found in the supplement. We train our model with , dimensional Gaussian mixture components, although our method can be applied without loss of generality to any number of dimensions. To assess the ability to estimate treatment effects on the basis of partial information, we artificially exclude three covariates at test time. These are covariates that exhibit a moderate correlation with the hidden confounder ethnicity. The results are shown in Table 1. Overall, our approach exhibits good performance for both in-sample and out-of-sample predictions, while simultaneously accounting for partial covariate information.
To assess the interpretability of the proposed approach and the ability to account for hidden confounding, we perform an analysis on the latent space of our model. First, we plot two information curves illustrating the number of latent dimensions required to reconstruct the output for the terms and respectively. These results are shown in Figure 3(a) and Figure 3(b). In particular, we perform this analysis when the data set of subjects is both derandomised and randomised (i.e. when we do not introduce selection bias into the data set). Comparing the information curves in Figure 3(a) confirms that when we do not derandomise the data, the information content in the treatment tends to be closer to 0, whereas the opposite is true when the data is derandomised. The information curves in Figure 3(b) additionally demonstrate our model’s ability to account for indirect effects of confounding when predicting the overall outcomes: when data is derandomised, we are able to reconstruct treatment outcomes more accurately. Overall, the results from Figures 3(a) and 3(b) highlight that there is indeed a hidden confounding effect that we can account for using the proposed approach.
Next, we perform an analysis of the discretised latent space by comparing the proportions of ethnic groups of test subjects in each cluster from the Gaussian mixture to see if we can recover the hidden confounding effect. These results are shown in Figure 5 where we plot a hard assignment of test subjects to clusters on the basis of their ethnicity. Evidently, the clusters exhibit a clear structure with respect to the ethnic groups. In particular, Cluster 2 in Figure 4(b) has a significantly higher proportion of non-white members in the de-randomised setting, confirming that we are able to correctly identify the true confounding effect and account for this when making predictions. Finally, we perform similar analyses and assess the error in estimating the ACE when varying the number of mixture components in Figure 7. When the number of clusters is larger, the clusters get smaller and it becomes more difficult to reliably estimate the ACE, since we average over the cluster members to account for partial covariate information at test time. Here, model selection is made by observing where the error in estimating the ACE stabilises (anywhere between 4 and 7 mixture components).
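One way to operationalise "where the error stabilises" is to pick the smallest number of mixture components after which consecutive error values change by less than a tolerance. The sketch below does this on made-up error values; the tolerance, the numbers and the function name are assumptions for illustration and do not reproduce the results in Figure 7.

```python
def first_stable(ks, errors, tol=0.05):
    """Smallest k from which all consecutive error changes stay below tol."""
    for i in range(len(ks) - 1):
        tail_changes = (
            abs(errors[j + 1] - errors[j]) for j in range(i, len(ks) - 1)
        )
        if all(change < tol for change in tail_changes):
            return ks[i]
    return ks[-1]


# Hypothetical numbers of mixture components and made-up ACE errors.
ks = [2, 3, 4, 5, 6, 7]
errors = [0.90, 0.60, 0.42, 0.40, 0.41, 0.43]

chosen_k = first_stable(ks, errors)
```

With these illustrative values the error flattens out from four components onwards, so four is selected. A criterion of this kind favours the smallest adequate number of clusters, which keeps each cluster large enough to average over reliably.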
4.2 Sepsis Management
Proportion of initial SOFA scores in each cluster. The variation in initial SOFA scores across clusters suggests that it is a potential confounder of odds of mortality when managing and treating sepsis.
We illustrate the performance of CEIB on the real-world task of managing and treating sepsis. Sepsis is one of the leading causes of mortality within hospitals and treating septic patients is highly challenging, since outcomes vary with interventions and there are no universal treatment guidelines. For this experiment, we make use of data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III) database (Johnson et al., 2016). We focus specifically on patients satisfying Sepsis-3 criteria (16 804 patients in total). For each patient, we have a 48-dimensional set of physiological parameters including demographics, lab values, vital signs and input/output events, where covariates are partially incomplete. Our outcomes correspond to the odds of mortality, while we binarise medical interventions according to whether or not a vasopressor is administered. The data set is divided 60/20/20% into training/validation/testing sets. We train our model with 6, 4-dimensional Gaussian mixture components and analyse the information curves and cluster compositions respectively.
The information curves for and are shown in Figures 5(a) and 5(b) respectively. We observe that we can perform a sufficient reduction of the high-dimensional covariate information to between 4 and 6 dimensions while achieving high predictive accuracy of outcomes . Since there is no ground truth available for the sepsis task, we do not have access to the true confounding variables. However, we can perform an analysis on the basis of the clusters obtained over the latent space. Here, we see that we can characterise the patients in each cluster according to their initial SOFA (Sequential Organ Failure Assessment) scores. SOFA scores range between 1 and 4 and are used to track a patient's stay in hospital. In Figure 8, we observe clear differences in cluster composition relative to the SOFA scores. Clusters 2, 5 and 6 tend to have higher proportions of patients with lower SOFA scores, while Clusters 3 and 4 have larger proportions of patients with higher SOFA scores. This result suggests that a patient's initial SOFA score is potentially a confounder when determining how to administer subsequent treatments and predicting their odds of in-hospital mortality. This is consistent with medical studies such as Medam et al. (2017); Studnek et al. (2012), where the authors indicate that high initial SOFA scores were likely to affect patients' overall chances of survival and the treatments administered in hospital.
Finally, while we cannot quantify an error in estimating the ACE since we do not have access to the counterfactual outcomes, we can still compute the ACE for the sepsis management task. Here, we specifically observe a negative ACE value. This means that, in general, treating patients with vasopressors reduces the chances of mortality in comparison to not treating patients with vasopressors. Overall, performing such analyses for tasks like sepsis management may help correct for confounding and assist in establishing potential guidelines.
5 Discussion
CEIB learns a low-dimensional, interpretable representation of latent confounding
Since CEIB extracts only the information that is relevant for making predictions, it is able to learn a low-dimensional representation of the confounding effect and uses this to make predictions. In particular, the introduction of a discrete cluster structure in the latent space allows an easier interpretation of the confounding effect. For the IHDP experiment, we are able to learn a low-dimensional representation that is consistent with the known ethnicity confounder and account for its effects when making predictions of treatment outcomes. Similar methods such as Louizos et al. (2017) typically use a higher dimensional representation (in the order of 20 dimensions) to account for these effects and make less accurate predictions nonetheless. This is potentially a consequence of misrepresenting the true confounding effect. Modelling the task as an IB alleviates this problem. Analogously, for the sepsis task we identify a latent space of 6 dimensions when predicting odds of mortality, where clusters exhibit a distinct structure with respect to a patient's initial SOFA score. In both tasks, the low-dimensional representation enables us to accurately identify confounders without sacrificing interpretability.
CEIB enables estimating the causal effect with incomplete covariates
Unlike previous approaches, CEIB can deal with incomplete covariate data during test time by introducing a discrete latent space. Specifically, we learn equivalence classes among patients such that the approximate effects of treatments can be computed where data is incomplete.
CEIB makes state-of-the-art predictions of the ACE that are robust against confounding
On the IHDP data set, we see that predictions of the ACE are more accurate than those of existing approaches. In the IHDP case, we see reductions in the error in estimating the ACE of up to 0.58 for in-sample predictions. This performance is sustained when making out-of-sample predictions, where we see error reductions of between 0.04 and 0.73 in comparison with existing methods. Overall, we attribute this increase in performance directly to the fact that CEIB extracts only the information that is causally relevant for making predictions. Proxy-based approaches such as Louizos et al. (2017) do not explicitly trade off learning meaningful representations of latent confounders and achieving accurate predictions. In contrast, we can explicitly inspect the information curves in Figure 3(b) and adjust the compression parameter to uncover the true latent confounder. If we set this parameter in accordance with Figure 3(b), we require only a 4-dimensional representation to adequately account for and uncover the true confounding effect (as shown in Figure 4(b)). This produces more accurate predictions as a result.
6 Conclusion
We have presented a novel approach to estimate causal relationships with respect to incomplete covariates from a decision-theoretic point of view. For this purpose, we analysed the role of a sufficient covariate in the context of the IB framework to estimate the causal effect. By introducing a discrete latent space, we can estimate the causal effect if parts of the covariates are missing during test time, while accounting for both measured and hidden confounders. In contrast to previous methods, the compression parameter in the IB framework allows for a task-dependent adjustment of the latent dimensionality. Directions for future work include modelling structured hidden confounders as well as adapting CEIB to implicit generative models.
References

 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 Alaa and van der Schaar (2017) Ahmed M. Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. CoRR, abs/1704.02801, 2017.
 Alemi et al. (2016) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep Variational Information Bottleneck. ArXiv e-prints, December 2016.
 Almond et al. (2005) Douglas Almond, Kenneth Y Chay, and David S Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.
 Bottou et al. (2013) Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
 Chechik et al. (2005) Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, pages 165–188, 2005.
 Dawid (2007) Philip Dawid. Fundamentals of statistical causality. Technical report, Department of Statistical Science, University College London, 2007.
 Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
 Greenland and Lash (2008) Sander Greenland and Timothy Lash. Bias analysis. Modern Epidemiology, pages 345 – 380, 2008.
 Hernán and Robins (2006) Miguel A Hernán and James M Robins. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7):578–586, 2006.
 Hill (2011) Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
 Jang et al. (2017) E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. International Conference on Learning Representations (ICLR), 2017.
 Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661, 2016.
 Johansson et al. (2016) Fredrik D. Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 3020–3029. JMLR.org, 2016.
 Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3581–3589, 2014.
 Kuroki and Pearl (2014) Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
 Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6446–6456. Curran Associates, Inc., 2017.
 McCormick et al. (2013) Marie C. McCormick, Jeanne Brooks-Gunn, and Stephen L. Buka. Infant Health and Development Program, phase IV, 2001–2004, United States. 2013. doi: 10.3886/ICPSR23580.v2.
 Medam et al. (2017) Sophie Medam, Laurent Zieleskiewicz, Gary Duclos, Karine Baumstarck, Anderson Loundou, Julie Alingrin, Emmanuelle Hammad, Coralie Vigne, François Antonini, and Marc Leone. Medicine, 96(50), 12 2017. doi: 10.1097/MD.0000000000009241.
 Morgan and Winship (2015) Stephen L Morgan and Christopher Winship. Counterfactuals and causal inference. Cambridge University Press, 2015.
 Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
 Pearl (2012) Judea Pearl. On measurement bias in causal inference. arXiv preprint arXiv:1203.3504, 2012.
 Rey and Roth (2012) Mélanie Rey and Volker Roth. Meta-Gaussian information bottleneck. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 1925–1933, 2012.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China, 22–24 Jun 2014. PMLR.
 Rosenbaum and Rubin (1984) Paul R Rosenbaum and Donald B Rubin. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387):516–524, 1984.
 Rubin (1978) Donald B Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of statistics, pages 34–58, 1978.
 Schulam and Saria (2017a) Peter Schulam and Suchi Saria. What-if reasoning using counterfactual Gaussian processes. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 1696–1706, 2017a.
 Schulam and Saria (2017b) Peter Schulam and Suchi Saria. Reliable decision support using counterfactual models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1697–1708. Curran Associates, Inc., 2017b.
 Shalit et al. (2017) Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3076–3085, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 Splawa-Neyman (1923) Jerzy Splawa-Neyman. Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes. Roczniki Nauk Rolniczych, 10:1–51, 1923.
 Splawa-Neyman (1990) Jerzy Splawa-Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–472, 1990.
 Studnek et al. (2012) Jonathan R Studnek, Melanie R Artho, Craymon L Garner Jr, and Alan E Jones. The impact of emergency medical services on the ED care of severe sepsis. The American Journal of Emergency Medicine, 30(1):51–56, 2012.
 Sun et al. (2015) Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal inference via sparse additive models with application to online advertising. In AAAI, pages 297–303, 2015.
 Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
 Tishby and Zaslavsky (2015) Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. CoRR, abs/1503.02406, 2015.
 Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 Wager and Athey (2017) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.
 Wieczorek et al. (2018) Aleksander Wieczorek, Mario Wieser, Damian Murezzan, and Volker Roth. Learning Sparse Latent Representations with the Deep Copula Information Bottleneck. International Conference on Learning Representations (ICLR), 2018.
Appendix A Infant Health and Development Program: Baselines
For our experiments, we compare the performance of CEIB for predicting the ACE against several existing baselines, as in Louizos et al. (2017): OLS1 is a least squares regression; OLS2 uses two separate least squares regressions to fit the treatment and control groups respectively; TARnet is a feedforward neural network from Shalit et al. (2017); KNN is a nearest neighbours regression; RF is a random forest; BNN is a balancing neural network (Johansson et al., 2016); BLR is a balancing linear regression (Johansson et al., 2016); and CFRW is a counterfactual regression that uses the Wasserstein distance (Shalit et al., 2017).
Appendix B Additional Experiments
B.1 Binary Treatment Outcome on Twins
Like Louizos et al. (2017), we apply CEIB to a benchmark task using the birth data of twins in the USA between 1989 and 1991 (Almond et al., 2005). Here, treatment is a binary indicator of being the heavier twin at birth, while outcome corresponds to mortality within a year after birth. Since mortality is rare, we consider only same-sex twins weighing less than 2 kg, which results in 11 984 pairs of twins. Each twin has a set of 46 covariates, including information about the parents (such as their level of education, race, incidence of renal disease, diabetes and smoking), whether the birth took place in hospital or at home, and the number of gestation weeks prior to birth.
To simulate an observational study, we selectively hide one of the twins. To illustrate that CEIB can be applied to Case II, where we treat proxy variables as measured confounders, we base the treatment assignment on a single variable which is highly correlated with the outcome: GESTAT10, the number of gestation weeks prior to birth. This variable takes values from 0 to 9 that correspond to binned weeks of gestation before birth, i.e. birth before 20 weeks of gestation, 20–27 weeks of gestation, etc. Analogous to Louizos et al. (2017), we define the treatment assignment in terms of GESTAT10 and the 45 remaining covariates. Since CEIB can account for incomplete covariates, we artificially exclude 3 of the remaining covariates at test time.
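As a rough illustration of how hiding one twin turns paired data into an observational study, the following Python sketch samples a treatment for each pair and keeps only the outcome of the selected twin. The exact assignment rule is not given in this section, so `treatment_probability` (a sigmoid of a rescaled GESTAT10 value), the field names and the toy records are all hypothetical stand-ins rather than the setup used in the experiments.

```python
import math
import random


def treatment_probability(gestat10, weight=5.0):
    """Hypothetical assignment rule (an assumption, not the paper's formula):
    a sigmoid of the rescaled GESTAT10 category, which takes values 0-9."""
    return 1.0 / (1.0 + math.exp(-weight * (gestat10 / 10.0 - 0.45)))


def hide_one_twin(pairs, rng):
    """Keep one twin per pair: t = 1 reveals the heavier twin, t = 0 the lighter."""
    observed = []
    for pair in pairs:
        t = 1 if rng.random() < treatment_probability(pair["gestat10"]) else 0
        y = pair["y_heavy"] if t == 1 else pair["y_light"]
        observed.append({"gestat10": pair["gestat10"], "t": t, "y": y})
    return observed


def true_ace(pairs):
    """Ground-truth ACE, available only because both twins are known."""
    return sum(p["y_heavy"] - p["y_light"] for p in pairs) / len(pairs)


# Toy records (made up for illustration); y is mortality within a year (0/1).
pairs = [
    {"gestat10": 2, "y_light": 1, "y_heavy": 0},
    {"gestat10": 5, "y_light": 1, "y_heavy": 1},
    {"gestat10": 8, "y_light": 0, "y_heavy": 0},
    {"gestat10": 3, "y_light": 1, "y_heavy": 0},
]

rng = random.Random(0)
observed = hide_one_twin(pairs, rng)
ace = true_ace(pairs)
```

Because the treatment depends on GESTAT10, which is also correlated with the outcome, a naive difference of means on `observed` would generally differ from `ace`; this is the confounding that the estimators are asked to correct.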
Like Louizos et al. (2017), proxies are created with a one-hot encoding of GESTAT10, replicated 3 times, with the resulting 30 bits randomly flipped, where the flipping probability varies from 0.05 to 0.15. There may also be additional proxy variables for GESTAT10 in the data among the remaining covariates. Our task is to predict the ACE. Specifically, we compare the performance of CEIB to CEVAE (with a varying number of hidden layers), TARnet (with a varying number of hidden layers) and logistic regression (LR). These results are shown in Figure 9. Here too, CEIB achieves close to state-of-the-art performance on the Twins task.
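The proxy construction described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each of the 30 bits is flipped independently with the same probability; the function names `one_hot` and `make_noisy_proxy` are hypothetical and not taken from any released code.

```python
import random


def one_hot(category, num_categories=10):
    """One-hot encode an integer category in [0, num_categories)."""
    if not 0 <= category < num_categories:
        raise ValueError("category out of range")
    return [1 if i == category else 0 for i in range(num_categories)]


def make_noisy_proxy(category, flip_prob, rng, num_categories=10, replicas=3):
    """Replicate the one-hot code, then flip each bit independently
    with probability flip_prob (independence is an assumption here)."""
    bits = one_hot(category, num_categories) * replicas
    return [1 - b if rng.random() < flip_prob else b for b in bits]


rng = random.Random(0)
# A GESTAT10 category of 4 with a flipping probability in the stated range.
proxy = make_noisy_proxy(category=4, flip_prob=0.1, rng=rng)
# With flip_prob=0.0 the proxy is just the replicated one-hot code.
clean = make_noisy_proxy(category=4, flip_prob=0.0, rng=rng)
```

Raising `flip_prob` from 0.05 towards 0.15 makes the 30-bit proxy a progressively noisier view of the underlying category, which is what the experiment varies.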