Big Data (BUS 41201)
Course Description
BUS 41201 is a course about data mining: the analysis, exploration, and simplification of large high-dimensional datasets. Students will learn how to model and interpret complicated "Big Data" and become adept at building powerful models for prediction and classification.
Techniques covered include an advanced overview of linear and logistic regression, model choice and false discovery rates, multinomial and binary regression, classification, decision trees, factor models, clustering, the bootstrap and cross-validation. We learn both basic underlying concepts and practical computational skills, including techniques for analysis of distributed data.
Heavy emphasis is placed on analysis of actual datasets, and on development of application-specific methodology. Among other examples, we will consider consumer database mining, internet and social media tracking, network analysis, and text mining.
Syllabus
Teaching Assistants:
Ken McAlinn (kenmcalinn@gmail.com)
Wenxi Li (wenxi.li@chicagobooth.edu)
Jianfei Cao (jcao0@chicagobooth.edu)
Office Hours:
By appointment
Review Sessions:
Saturday at Gleacher.
Instructor:
Kenichiro (Ken) McAlinn (Senior Research Professional in Econometrics and Statistics)
R Resources:
Download R, R Project Site, R Studio
Tutorials: Google developer, Princeton, TryR code school, Quick R
Books: R in a nutshell, Art of R programming, Library E-Books, Introductory Statistics with R
Piazza link
piazza.com/uchicago/spring2017/busn412010185bigdata/home
First Class Assignment:
Make yourself familiar with R! The course is a fast-paced introduction to a wide variety of statistical learning methods. Knowing the basics of R before you start will make your life much easier and allow you to concentrate your effort on learning data science tools and concepts.
For people who are new to R, I recommend starting with an R tutorial, such as the TryR tutorial at http://tryr.codeschool.com.
Week 1 : Inference at scale
Slides
Datasets:
Trucks: pickup.R, pickup.csv
Diabetes: dm2_pvals.R, dm2_fdr.R, diabetes.csv
Cholesterol: lipids.R, jointGwasMc_LDL.txt
Extra Code: fdr.R
Week 2 : Regression
Slides
Datasets:
Orange juice: oj.R, oj.csv
Spam: spam.R, spam.csv
Extra Code: deviance.R
Week 3 : Model Selection
Slides
Datasets:
Comscore: comscore.R, CS2006demographics.csv, CS2006domains.csv.csv, CS2006sites.txt, CS2006totalspend.csv
Semiconductor: semiconductor.R, semiconductor.csv
Extra Code: naref.R
Week 4 : Treatment Effects
Slides
Datasets:
Abortion: abortion.dat, abortion.R, us_cellphone.csv
Paidsearch: paidsearch.csv, paidsearch.R
Extra Code: mab.R
Week 5 : Classification
Slides
Datasets:
Credit: credit.csv, credit.R, data_description
Glass: glass.R
Extra Code: roc.R
Week 6 : Networks
Slides
Datasets:
Marriage: firenze.R, firenze.txt
Karate: karate.R
Lastfm: lastfm.R, lastfm.csv
Websearch: CaliforniaEdges.csv, CaliforniaNodes.txt, websearch.R
Week 7 : Clustering
Slides
Datasets:
Protein: protein.R, protein.csv
Wine: wine.R, wine.csv
We8there: we8there.R
Extra Code: kIC.R
Week 8 : Factor Models
Slides
Datasets:
Protein: protein.R, protein.csv
Rollcall: rollcall_votes.R, rollcall.csv, rollcall-members.csv
NBC: nbc_demographics.csv, nbc_pilotsurvey.csv, nbc_showdetails.csv, nbc.R
Gas: gas.R, gasoline.csv
Week 9 : Trees
Slides
Datasets:
Prostate: prostate_cancer.R, prostate.csv
Mcycle: mcycle.R
Calhomes: CAhousing.csv, calhomes.R