9:00am–6:00pm, Monday, June 12, 2017
Steven L. Scott, Director of Statistics Research, Google
This one-day short course will focus on two types of big data problems. The first is regression and time series problems with many predictor variables, the so-called “p > n problem”. The course will focus on stochastic search variable selection using spike-and-slab priors, with Google Trends data serving as an example where many potential predictors are available. Attendees will learn how to use the bsts and BoomSpikeSlab R packages. Examples that we will consider include model-assisted survey sampling, monitoring official statistics, and measuring the impact of market interventions.
The second part of the course focuses on big data problems where the data must be split across multiple machines and communication between machines is costly. In this case, Consensus Monte Carlo can be used to minimize between-machine communication. Consensus Monte Carlo partitions the data into “shards” assigned to different workers. Each worker runs an independent posterior sampler conditional only on its own data shard. The workers then combine their results to form a system-wide “consensus” posterior distribution that approximates the result that would have been obtained had the problem been handled on a hypothetical single machine. The focus is on the logic of the Consensus Monte Carlo algorithm, for which an R package will be provided. We will not, however, discuss the engineering aspects of running jobs on multiple machines.
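The shard-and-combine logic described above can be illustrated with a minimal sketch. It is written in Python with NumPy rather than with the R package mentioned above, and it is not the course's implementation. It assumes a toy model (a normal mean with known standard deviation), a prior that is split evenly across shards, and a combination rule that averages draws weighted by the inverse of each shard's sample variance. All function names and default values are hypothetical.

```python
import numpy as np


def shard_posterior_draws(y_shard, n_shards, n_draws, rng,
                          sigma=1.0, prior_mean=0.0, prior_var=100.0):
    """Draw from one worker's posterior for a normal mean (sigma known).

    Assumption: each worker uses the prior raised to the power 1/n_shards,
    which for a normal prior is N(prior_mean, n_shards * prior_var).
    """
    frac_prior_prec = 1.0 / (n_shards * prior_var)
    data_prec = len(y_shard) / sigma**2
    post_prec = frac_prior_prec + data_prec
    post_mean = (frac_prior_prec * prior_mean
                 + np.sum(y_shard) / sigma**2) / post_prec
    return rng.normal(post_mean, np.sqrt(1.0 / post_prec), size=n_draws)


def consensus_draws(draws_by_shard):
    """Combine per-shard draws by a precision-weighted average.

    Draw g from every shard is averaged with weights equal to the inverse
    sample variance of that shard's draws.
    """
    draws = np.asarray(draws_by_shard)           # shape (n_shards, n_draws)
    weights = 1.0 / draws.var(axis=1, ddof=1)    # one weight per shard
    return weights @ draws / weights.sum()       # shape (n_draws,)


def full_posterior(y, sigma=1.0, prior_mean=0.0, prior_var=100.0):
    """Exact single-machine posterior mean and variance, for comparison."""
    prec = 1.0 / prior_var + len(y) / sigma**2
    mean = (prior_mean / prior_var + np.sum(y) / sigma**2) / prec
    return mean, 1.0 / prec


def run_demo(seed=0, n=10_000, n_shards=10, n_draws=5_000):
    """Simulate data, sample on each shard, and form the consensus draws."""
    rng = np.random.default_rng(seed)
    y = rng.normal(2.0, 1.0, size=n)             # simulated data
    shards = np.array_split(y, n_shards)         # one shard per worker
    draws = [shard_posterior_draws(s, n_shards, n_draws, rng) for s in shards]
    combined = consensus_draws(draws)
    exact_mean, exact_var = full_posterior(y)
    return combined, exact_mean, exact_var


if __name__ == "__main__":
    combined, exact_mean, exact_var = run_demo()
    print("consensus mean:", combined.mean(), "exact mean:", exact_mean)
    print("consensus var: ", combined.var(), "exact var: ", exact_var)
```

In this Gaussian toy case the weighted average reproduces the single-machine posterior up to Monte Carlo error, which makes it a convenient sanity check. For non-Gaussian posteriors the combined draws are only an approximation.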
Prerequisites: This course is intended for graduate students and advanced undergraduates who have had some prior exposure to Bayesian statistics. Students should be familiar with Bayes’ rule in conjugate normal models and with posterior sampling using Gibbs sampling and other MCMC methods. Students should arrive with the bsts and BoomSpikeSlab packages installed (both are available on CRAN).
Steven Scott is a Director of Statistics Research at Google, where he has worked since 2008. He received his PhD from the Harvard statistics department in 1998. He spent 9 years on the faculty of the Marshall School of Business at the University of Southern California. Between USC and Google he also had a brief tenure at Capital One Financial Corp, where he was a Director of Statistical Analysis.
Dr. Scott is a Bayesian statistician specializing in Monte Carlo computation. In his academic life he has written papers on Bayesian methods for hidden Markov models, multinomial logistic regression, item response models, and support vector machines. These methods have been applied to network intrusion detection, web traffic modeling, educational testing, health state monitoring, and brand choice, among others.
Since joining Google he has focused on models for time series with many contemporaneous predictors, on scalable Monte Carlo computation, and on Bayesian methods for the multi-armed bandit problem.
Short course registration can be found here.